Keyword Extractor and Text Analyzer - Help

In this Post

A brief tutorial and help instructions for the interactive online Keyword Extractor and Text Analyzer

Table of Contents


Intro

Try free online Keyword Extractor and Text Analyzer

A new, sophisticated tool to automatically extract keywords and other content words from documents. Paste texts online to detect, extract and analyze keywords with this advanced probabilistic text mining program. Instantly get various statistics for every keyword. Extract classical measures such as entropy and TF-IDF (term frequency-inverse document frequency), and more: the Keyword Extractor and Text Analyzer is not limited to the traditional statistics common in information retrieval and corpus linguistics. Get highly original measures that give deeper insights into the meaning of any text or corpus, in any language.

The tool is used by researchers and content creators, in SEO keyword optimization tasks, and in text mining. Text is analyzed by simply pasting it into the online prompt. The Keyword Extractor and Text Analyzer automatically detects keywords and builds a tag cloud, or word cloud, from your input text (or texts). In this way you get instant text visualization. A keyword is a relative concept, so you can mine for different sets of keywords with different mining methods and parameters: each time you receive a new tag cloud, updated according to the mining method.

For example, you can paste your text and rewrite it according to the tag cloud results in order to highlight certain keywords or content words. Rewrite or rephrase the text until you see the required keywords in the tag cloud, to make sure they are detectable by search engines and information retrieval systems.

The interactive Keyword Extractor and Text Analyzer can be used in exegesis. It is also made for those studying literary and sacred texts, helping you find not-so-obvious relations between words that would otherwise be overlooked. Prose, poetry, novels, from fiction to Holy Scriptures: investigate the meaning of words with the help of this online keyword detector and extractor.

Read Also How to automatically tag posts in Hugo Static Site Generator with Python

User interface

The tool is built for anyone interested in corpus linguistics, mathematical and computational linguistics, stylometry, and literary studies - wherever there is a place for quantitative research of text.

Analyzing Text for Word Frequency

Analyze text for lexical richness and word frequency. Perform advanced probabilistic calculations for word frequency distributions.

Analyzing Text

Paste the text you want to mine into the Paste text here window. You will instantly see basic text statistics: Number of Words (the number of word tokens), Number of Unique Words (word types, i.e. the size of the vocabulary), and Type-to-Token Ratio (TTR). In this way the app instantly calculates the size of the text and its lexical richness, or diversity.
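If you want to reproduce these three statistics outside the app, a minimal sketch in Python could look like the following (the app's own tokenization rules may differ):

```python
import re

def basic_text_stats(text):
    # Naive word tokenization; the app's own tokenizer may differ.
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    return {
        "number_of_words": len(tokens),          # word tokens
        "number_of_unique_words": len(types),    # word types (vocabulary size)
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
    }

print(basic_text_stats("the cat sat on the mat"))
# {'number_of_words': 6, 'number_of_unique_words': 5, 'type_token_ratio': 0.8333...}
```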

The app will alert you if the text is too short. In that case, you can proceed with your analysis anyway, or you can simply paste the text twice to improve the sample size. Otherwise, you can paste several different texts together - as many as you want. There is no limit on the number of texts you can paste together. This scenario is likely when you want to retrieve words from a collection of posts or reviews, an entire website, a library of documents, an archive, etc. The Keyword Extractor and Text Analyzer works best with books and collections of documents: just paste the entire set of texts into the online prompt to get results for your corpus.

Once you press Run, the app starts the analysis by fetching your input, including the text and user-defined parameters such as chunk size and minimum frequency. The default minimum frequency is 5. If you analyze a book or a larger set of documents, try setting the minimum frequency to values like 10, 20, or 50, then press Run again to fetch meaningful content words.

You can manually define several parameters of analysis.

Method: choose the method for content word extraction and text mining, or leave it at the default.

Min Word Frequency: cut off less frequent words by setting this number. The default is 5, meaning that only words occurring at least five times are analyzed. For larger texts, like a book or, say, over 50 documents, choose 10 or more.

Text Chunk Size: leave as is for automatic calculation of the optimal chunk size, or set it to values like 100, 600, 1000, or 3000 (larger values for larger input). Your input text is divided into chunks of the specified size, and the frequency of each word within each chunk is calculated. The text chunk size is similar to a reader’s attention span and is generally proportional to the overall size of your input. Each chunk is a kind of «chapter», or automatically derived part, of your input text(s); see the sketch below this list.
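Here is a minimal Python sketch of how chunking and the minimum-frequency cutoff can work together. The chunk_size and min_freq arguments mirror the app's Text Chunk Size and Min Word Frequency parameters, but the function itself is illustrative, not the app's actual code:

```python
import re
from collections import Counter

def chunk_frequencies(text, chunk_size=600, min_freq=5):
    """Split the text into fixed-size chunks and count word frequencies per chunk.

    chunk_size and min_freq correspond to the Text Chunk Size and
    Min Word Frequency parameters; the values here are only illustrative.
    """
    tokens = re.findall(r"\w+", text.lower())
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    per_chunk_counts = [Counter(chunk) for chunk in chunks]

    # Keep only words whose total frequency reaches the cutoff.
    totals = Counter(tokens)
    kept_words = {w for w, n in totals.items() if n >= min_freq}
    return kept_words, per_chunk_counts
```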

Clearing Input

Press the Clear button to reset all input fields and output elements, clearing previous data so that you can start keyword detection and text mining afresh. You can also reload the page to clear the settings.

Setting Defaults

The Default button initializes default values for parameters such as chunk size and minimum frequency, ensuring a consistent starting point for word extraction and mining of the same text input. After pressing Default, press Run again to restart the analysis with the default values.

Save Results

The Save button allows you to save your results as a csv file. This is useful if you want to calculate correlations or do any other kind of analysis with the keyword mining results on your PC. Note that you can upload the saved data, without modifying it, to our semascope viewer to see an interactive 3D plot of how words relate to each other.
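For example, assuming you have pandas installed, correlations between the saved statistics can be computed from the csv file in a couple of lines ("results.csv" stands in for whatever file name the Save button produces):

```python
import pandas as pd

# "results.csv" stands in for the file produced by the Save button.
df = pd.read_csv("results.csv")

# Correlation matrix across all numeric columns (TF, entropy, TF-IDF, ...).
print(df.corr(numeric_only=True))
```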

You can extract keywords online interactively with the Keyword Extractor and Text Analyzer. Simply copy and paste the text that you want to analyze, save the results as a csv data file by pressing Save on the Keyword Extractor page, then go to the 3d plot viewer page and upload the file to semascope.

View tag cloud

After running the analysis, the program prepares a tag cloud, or word cloud, automatically. Tag clouds, based on the extracted keywords, are updated when you change the extraction and detection method. Inside the tag cloud area you can see a button to view keywords in context (KWIC).
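If you would like to build a similar word cloud yourself from extracted keyword weights, a sketch using the third-party wordcloud and matplotlib packages might look like this (the weights dictionary is hypothetical, and the app's own renderer is not based on this code):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Hypothetical keyword weights, e.g. TF or TF-IDF scores from the extractor.
weights = {"whale": 120, "ship": 80, "sea": 65, "captain": 40}

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(weights)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```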

View results in context

After running the analysis, you can view keywords in context (KWIC) for the top words. To do that, press View KWIC Context.
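A keyword-in-context view is easy to approximate outside the app as well. The following Python sketch prints each occurrence of a keyword with a few words of context on either side (the window size and tokenization are illustrative):

```python
import re

def kwic(text, keyword, window=5):
    """Print each occurrence of `keyword` with `window` words of context per side."""
    tokens = re.findall(r"\w+", text.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left} [{tok}] {right}")

kwic("Call me Ishmael. Some years ago, never mind how long precisely ...", "ishmael", window=2)
# call me [ishmael] some years
```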

View results in 3d plot

After running the analysis, make sure to save the data as a csv file by pressing Save. The file is downloaded to your device. Do not modify the comma-separated data file. Now you can upload the results to the Semascope viewer and explore a 3D plot of how words relate to each other. The 3d plot viewer is available here: View 3d plot.

Methods

The app calculates various statistics for mining words from the text data, utilizing parameters of the Negative Binomial Distribution (NBD) to estimate the shape of the distribution of word frequencies, along with traditional probabilistic methods from information retrieval and corpus linguistics. Take your time and experiment with literary texts to get a better intuition for what the keyword extraction methods do. Here are brief explanations of the text mining / corpus analysis methods used by the app.

  1. Term Frequency (TF): The number of times a word appears in a document - the most basic statistic of frequency or term count, the same as Word Frequency.

  2. Mean: It signifies the average number of times a word appears in each text chunk.

  3. Variance: A measure of how much the TF values vary across different text chunks.

  4. k: This parameter from the Negative Binomial Distribution is used to estimate the «number of successes» of the TF distribution.

  5. p: Another parameter from the Negative Binomial Distribution, p measures «the probability of success». It influences the tail behavior of the distribution.

  6. sqrt_kp: A «burstiness» index indicating the degree of word aggregation. This combined measure provides insight into both k and p of the TF distribution, and it is our recommended mining method for extracting named entities. For example, when certain words (such as proper and place names, the so-called named entities) have small k and p values, the variance is large and the mean is small: these words are over-dispersed and appear in only a few chunks of the input. But when they do appear, they appear a lot («rare frequent events»). Such words are highly aggregated and usually convey some special meaning: the names of the main heroes, a concept that is important to the writer, etc. See our research for details.

  7. Ratio of q to p: This ratio offers a perspective on the relative importance of skewness compared to its complement. A more «esoteric» statistic, it is called «ratio of probability of failures to probability of success».

  8. Fisher Information: Derived from k and p, this measure provides further insights into the distribution’s characteristics. Smaller values indicate interesting words. A noteworthy statistic that reveals background content words - words «you cannot do without».

  9. Document Frequency (DF): The number of text chunks containing the word, similar to the number of «chapters» of a book where the word is attested at least once.

  10. DF-TF Ratio: This ratio of total TF to DF indicates the importance of a word in the corpus, a traditional measure from corpus linguistics.

  11. Term Frequency-Inverse Document Frequency (TF-IDF): A widely used measure of the importance of a word in the corpus.

  12. Entropy: A measure of the uncertainty or randomness of the word distribution.

  13. Composite Meaning Index (CMI): This experimental measure assesses word meaning dynamically based on several parameters weighed against each other.
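As a rough illustration of how several of these statistics can be computed for a single word from its per-chunk frequencies, here is a Python sketch using standard method-of-moments estimators for the Negative Binomial parameters. The exact definitions used by the app, and in particular the sqrt_kp combination (taken here as the square root of k times p), are assumptions on my part:

```python
import math

def word_statistics(counts_per_chunk):
    """Per-word statistics from a list of the word's frequencies in each chunk."""
    n_chunks = len(counts_per_chunk)
    tf = sum(counts_per_chunk)                       # 1. Term Frequency
    mean = tf / n_chunks                             # 2. Mean per chunk
    variance = sum((c - mean) ** 2 for c in counts_per_chunk) / n_chunks  # 3. Variance
    df = sum(1 for c in counts_per_chunk if c > 0)   # 9. Document Frequency

    stats = {"tf": tf, "mean": mean, "variance": variance, "df": df}

    # 4-6. Negative Binomial parameters by the method of moments
    # (only defined for over-dispersed words, i.e. variance > mean).
    if variance > mean > 0:
        p = mean / variance                          # probability of success
        k = mean ** 2 / (variance - mean)            # number of successes
        stats.update({"k": k, "p": p, "sqrt_kp": math.sqrt(k * p)})

    # 11. TF-IDF, with chunks playing the role of documents.
    if df:
        stats["tf_idf"] = tf * math.log(n_chunks / df)

    # 12. Entropy of the word's distribution over chunks.
    if tf:
        probs = [c / tf for c in counts_per_chunk if c]
        stats["entropy"] = -sum(q * math.log2(q) for q in probs)

    return stats

# A word that is absent from most chunks but bursts in a few («rare frequent events»).
print(word_statistics([0, 7, 0, 1, 0, 0, 4, 0]))
```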

Try free online Keyword Extractor and Text Analyzer

At the end of the output table (which you can download as a csv document) the app calculates averages for statistics across the entire input, so you can compare different texts based on their averages.

Alexander Sotov

Text: Alexandre Sotov
Comments or Questions? Contact me on LinkedIn


