TextVisualization.app is a resource for text mining, analysis and visualization. Our aim is to facilitate fast human understanding of vast collections of documents (such as chapters of a novel, news articles, academic papers and so on). Here, we demonstrate probabilistic keyword extraction methods that can be used for automatically tagging vast collections of texts.
There are two repos that you can use:
Python and R scripts to mine collections of texts for keywords, based on a linguistically motivated statistical method. The scripts are called corpus_utils.
A Plotly JavaScript viewer app to interactively view word properties that were mined with the help of corpus_utils. This JS viewer is called semascope.
This website is a demo of how these components, keyword extraction and visualization, work together. We also develop interactive online tools that use statistical methods of text mining:
Try our free online Keyword Extractor and Text Analyzer
Try our ChatGPT detector
On the home page of this website, press View text visualization to enter the 3d plot page for a pre-processed book or collection of texts.
Generate text mining datasets interactively from your own texts with an online tool: Keyword Extractor and Text Analyzer. Save your results and upload them to the 3d plot viewer: Semascope Viewer.
Each dot is a word. Point to a dot to view the word. To rotate the plot, use the small control icons that appear in the top right corner.
On top of the plot you can see large control buttons to filter and explore data.
Press Heroes and then Background, and then Background again.
Pressing Heroes will show you words that deal with characters, actors and main concepts of the book or collection. Typically, these are named entities. The larger the marker, the more frequent the word. The higher the word sits on the plot, the more documents (chapters) contain it. Pressing Heroes+ (with the plus sign) will expand the Hero area to include more potentially interesting words that are important to the collection.
Pressing Background will give you words that are «background» to the collection. These words represent concepts and features of the narrative that are necessary building blocks, but are sometimes overlooked or taken for granted and not noticed.
Pressing Reduce will always load less common words by reducing Document Frequency.
Pressing All will load all data. You can also reload the page to hard reset the filter.
Under the plot you can find Search and Filter Control fields.
Use Search to look for a whole word or part of a word of interest. Point to the marker to see the coordinates of the word on the plot.
Use Filter Control to filter words out. You can manually set the filters to any desired value. Parameters k and p are decimals. DF is Document Frequency. FR is Word Frequency (the simple sum). Use the Cluster field to enter a cluster number.
The color of each marker corresponds to its word cluster. The cluster legend is the color scale on the left side of the plot. Marker size is larger for more frequent words.
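To give a concrete sense of what these filters do, here is a hypothetical offline equivalent: filtering a mined dataset with pandas on the same fields (k, p, DF, FR, Cluster). The file name and column names are assumptions for illustration and may differ from the actual export format.

```python
# Hypothetical sketch: applying the same kind of filters offline with pandas.
# The column names (word, k, p, DF, FR, cluster) are assumed; adjust them to
# match the dataset you actually export.
import pandas as pd

df = pd.read_csv("dataset.csv")                # a dataset mined from your texts

# Keep overdispersed ("bursty") words that occur in at least 3 documents
subset = df[(df["k"] < 0.5) & (df["DF"] >= 3)]

# Restrict to a single cluster, as the Cluster field does in the viewer
subset = subset[subset["cluster"] == 2]

print(subset.sort_values("FR", ascending=False).head(10))
```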
We have talked about the filters and controls. Now let us go into further detail on the measurements used in this application.
It is a simple sum of word usage: a straightforward count of how many times a specific word appears in a document, without normalization. In other words, it is the total count of occurrences of the word-form.
The formula for the word frequency (W_f) of a word (t) in a document (d) is:
\[ W_f(t, d) = \text{number of times word } t \text{ appears in document } d \]
This measure doesn’t consider the total number of words in the document, so it doesn’t account for document length normalization.
The Simple Sum serves as the observed data to which a model is fitted. Once the distribution is fitted, it can be compared to the actual word frequency distribution to evaluate how well the model represents the data. This comparison helps assess the goodness-of-fit and the validity of the chosen distributional assumption.
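As a minimal illustration of the simple sum, here is how raw, unnormalized counts can be computed for one document in Python; this is only a sketch of the idea, not the corpus_utils implementation.

```python
# Minimal sketch: the "simple sum" word frequency is a raw, unnormalized count
# of each word-form in a single document.
from collections import Counter

document = "the cat sat on the mat and the cat slept"
word_frequency = Counter(document.lower().split())

print(word_frequency["cat"])   # 2
print(word_frequency["the"])   # 3
```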
Why would anyone want to fit word frequency counts to a distribution model? - Because fitting a statistical distribution, such as the Negative Binomial Distribution (NBD), to word frequency data can reveal the underlying patterns of discourse and hidden regularity in word usage.
DF measures how many documents in a collection contain a specific term. A high DF indicates that a term is widespread across the document collection. DF is calculated by counting the number of documents that contain a specific term.
It is a measure that provides insight into the significance of a term within a document collection. This ratio is derived by comparing the number of documents that contain a term (DF) to its simple sum of word usage (Word Frequency).
The DF to Word Frequency ratio is calculated by dividing the Document Frequency (DF) by the Word Frequency (simple sum) for a specific term within a document.
Interpretation:
A higher DF to Word Frequency ratio indicates that a term is widespread across many documents in the collection relative to its frequency within a specific document.
A lower ratio suggests that while the term might be frequent in the specific document, it is not as widely distributed across the entire collection.
Terms with a high DF to Word Frequency ratio are often common, generic words or stopwords that appear in many documents but may not carry specific topical significance within individual documents. These terms may not be helpful in distinguishing the content of a particular document.
In information retrieval and document ranking, the DF to Word Frequency ratio can be used to assess the importance of terms. Terms with a balanced ratio may have more discriminative power in distinguishing documents, as they are both frequent within a document and relatively specific to that document compared to the entire collection.
The Document Frequency (DF) to Word Frequency (simple sum) ratio is a measure that helps assess the prevalence of a word within a document collection. This ratio indicates how widely a word is distributed across multiple documents (Document Frequency) relative to its frequency within a specific document (Word Frequency).
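The following sketch illustrates how DF, Word Frequency and their ratio can be computed for a toy collection. Here Word Frequency is taken as the collection-wide sum (the FR field of the viewer); this is an illustration, not the exact corpus_utils code.

```python
# Illustrative sketch: Word Frequency (simple sum), Document Frequency and the
# DF-to-FR ratio for every word in a small collection of documents.
from collections import Counter

documents = [
    "the hero entered the city",
    "the city slept",
    "a hero never sleeps",
]

tokenized = [doc.lower().split() for doc in documents]

fr = Counter()                      # total occurrences across the collection
df = Counter()                      # number of documents containing the word
for tokens in tokenized:
    fr.update(tokens)
    df.update(set(tokens))

for word in fr:
    ratio = df[word] / fr[word]     # DF to Word Frequency ratio
    print(f"{word:>7}  FR={fr[word]}  DF={df[word]}  DF/FR={ratio:.2f}")
```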
In the context of the Negative Binomial Distribution (NBD), the parameters (k) and (p) can be used for characterizing word usage in a corpus, and their values can be interpreted to understand the distribution of word frequencies. Let’s delve into how these parameters are used. The Negative Binomial Distribution is characterized by two parameters: (k) and (p).
(k) (the shape or dispersion parameter): Represents the degree of overdispersion. Higher values of (k) indicate less dispersion, making the distribution more Poisson-like. Lower values of (k) imply greater overdispersion.
(p) (the probability of success parameter): Represents the probability of a success (in this context, the occurrence of a word) in a series of independent Bernoulli trials. It’s related to the mean and variance of the distribution.
Interpretation of NBD parameters for NLP:
(k) Parameter:
Smaller values of (k) indicate greater overdispersion, suggesting that a few words are used frequently in only a few texts. In the context of «hero words» (named entities or important actors in a text), smaller (k) values could imply that a few entities dominate the discourse, making them stand out: these are «bursty», or «contagious», words. Similar to ecology, where (k) is used as an aggregation parameter, a decrease of (k) towards zero corresponds to an increase in word aggregation, or burstiness.
Larger (k) values, on the other hand, suggest less overdispersion, indicating a more uniform distribution of the word across the corpus, i.e. increase of (k) towards infinity corresponds to the absence of burstiness and suggests a Poisson distribution of word frequency counts.
(p) Parameter:
(p) values that approach 1 lead to distributions that resemble a Poisson distribution, and imply a more regular distribution where words occur with more consistent frequencies.
Smaller (p) values might suggest that the «hero words» occur with varying frequencies, creating a distribution with a more pronounced tail.
By estimating (k) and (p) for the word frequency distribution, you can identify words that stand out based on their values. What we call «Hero words» with smaller (k) values may be those that significantly contribute to the overdispersion. Words with (p) values approaching 1 may exhibit different patterns of usage, with more regular occurrence.
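To make the meaning of (k) and (p) more tangible, here is a hedged sketch that estimates them by the method of moments from a word's per-document counts. The estimator used in corpus_utils may differ (for example, maximum likelihood), and the function name estimate_nbd exists only for this example.

```python
# Hedged sketch: method-of-moments estimates of the NBD parameters (k, p)
# from a word's per-document counts. Not necessarily the estimator used in
# corpus_utils.
import numpy as np

def estimate_nbd(counts):
    """Return (k, p) for a vector of per-document counts of one word."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    var = counts.var(ddof=1)
    if var <= mean:                  # no overdispersion: treat as Poisson-like
        return float("inf"), 1.0     # k -> infinity, p -> 1
    k = mean**2 / (var - mean)       # shape / dispersion parameter
    p = mean / var                   # "success probability" parameter
    return k, p

# A «bursty» hero word: concentrated in a few chapters
hero_counts = [0, 0, 12, 0, 1, 0, 9, 0, 0, 0]
# A background word: spread evenly across chapters
background_counts = [3, 2, 4, 3, 2, 3, 4, 3, 2, 3]

print(estimate_nbd(hero_counts))        # small k, small p: overdispersed, bursty
print(estimate_nbd(background_counts))  # reported as Poisson-like (inf, 1.0)
```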
Plotting and Visualization:
Plotting the word frequency distribution with the fitted NBD parameters can visually reveal the characteristics of word usage.
By estimating and interpreting these parameters, you can gain insights into the characteristics of word usage in a corpus; for example, you can highlight what we call «hero words», often composed of named entities. Accordingly, these values can be used for Named Entity Recognition (NER), especially in combination with word n-gram analysis.
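As one possible way to do this, the sketch below overlays the fitted NBD probability mass function on a histogram of the observed per-chapter counts. It reuses the hypothetical estimate_nbd() from the previous sketch and relies on matplotlib and SciPy; the app itself visualizes with Plotly JS.

```python
# Sketch: comparing observed per-chapter counts of one word with the fitted NBD.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import nbinom

counts = np.array([0, 0, 12, 0, 1, 0, 9, 0, 0, 0])
k, p = estimate_nbd(counts)        # method-of-moments fit from the sketch above

x = np.arange(0, counts.max() + 1)
plt.hist(counts, bins=x, density=True, alpha=0.5, label="observed")
plt.plot(x, nbinom.pmf(x, k, p), "o-", label=f"NBD fit (k={k:.2f}, p={p:.2f})")
plt.xlabel("occurrences per chapter")
plt.ylabel("probability")
plt.legend()
plt.show()
```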
We cluster words based on a combination of NBD parameters and Document Frequency (DF) to Word Frequency ratio. Let’s break down how each component contributes to the overall goal of distinguishing different words in the collection:
Negative Binomial Distribution (NBD) Parameters (k) and (p):
(k) and (p) parameters of the NBD are useful for capturing the distributional characteristics of word frequencies. Smaller (k) values suggest overdispersion, indicating that a few words are used very frequently, potentially capturing distinctive terms. Meanwhile, (p) influences the shape of the distribution, allowing you to identify regularities or variabilities in word usage patterns.
Document Frequency (DF) to Word Frequency Ratio:
The DF to FR ratio helps in distinguishing between words that are widespread across many documents and those that are more specific to individual texts. Higher ratios might indicate words that are common across documents but less important within specific texts, while lower ratios might highlight words that are more distinctive to each text.
Clustering Approach:
By combining these metrics, we can cluster words based on their behavior across the corpus. For example:
Words with smaller (k) values and distinctive (p) values might be grouped together, indicating terms that are used uniquely and variably across documents.
Words with high DF to Word Frequency ratios might be clustered separately, identifying common terms with lower specificity to individual documents.
Identification of Unique and Prevalent Words:
Clustering allows us to identify words that are unique or prevalent within different texts. Separate clusters might include proper names, terms, or concepts that distinguish one document from another. On the other hand, clusters with prevalent words might include common grammatical or service words that are found consistently across the collection.
By distinguishing words like this, we can enhance text understanding by providing insights into the «frequent rare words» and, on the other hand, common elements, such as grammatical words. It helps in identifying both content-specific terms and general language elements, regardless of the language.
Visualization and Interpretation:
Word clusters are visualized with different marker colours with the scale next to the plot. Visualizing the clusters can aid in interpreting the results of the tests. Words within the same cluster share similar characteristics, allowing for a more nuanced understanding of their role in different documents. It’s often beneficial to iteratively refine clustering based on the insights gained. With the help of manual filters you can adjust parameters or further enhance the precision of your analysis.
Which clustering is used?
You can use the datasets prepared by us to try out any clustering method that you like. At present, we use KMeans clustering; please consult the source code in our repo.
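For readers who want to reproduce the idea, here is a minimal sketch of KMeans clustering over per-word features such as (k), (p) and the DF-to-FR ratio, using scikit-learn. The feature values below are invented and the scaling choice is an assumption; consult corpus_utils for the actual implementation.

```python
# Hedged sketch: KMeans over per-word features [k, p, DF/FR].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per word; the values are made up for the example
features = np.array([
    [0.25, 0.10, 0.15],   # bursty, concentrated word («hero»-like)
    [0.30, 0.12, 0.20],
    [5.00, 0.90, 0.95],   # evenly spread word («background»-like)
    [4.50, 0.85, 0.90],
])

scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)             # e.g. [0 0 1 1]: two groups of words
```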
Incorporating word n-grams, such as bi-grams (two-word sequences) and tri-grams (three-word sequences), into our app can indeed yield noteworthy results, particularly in tasks like named entity recognition (NER). Here are a few comments on the benefits of using n-grams:
Contextual Information: N-grams capture contextual information by considering sequences of words rather than individual tokens. This is particularly valuable for NER, as named entities often consist of multiple words (e.g., «New York City» or a name and a surname). Bi-grams and tri-grams are effective in recognizing multi-word named entities, which are common in various domains. For example, a bi-gram might capture «United States,» and a tri-gram might capture «Machine Learning Algorithm.» By feeding n-gram data into the model, you can automatically extract and view «heroes» that are multi-word entities (see examples).
Reduced Ambiguity: N-grams can help reduce ambiguity by providing more context. Certain words might have different meanings in different contexts, and examining neighboring words can aid in disambiguation.
Improved Accuracy: The use of n-grams can lead to improved accuracy in tasks such as entity recognition. The additional context helps in better understanding the semantic relationships between words; this is especially true for rare or uncommon languages.
Syntax and Structure: Bi-grams and tri-grams contribute to capturing the syntactic and structural aspects of language. This is valuable for tasks where the arrangement of words is essential, such as in identifying phrases or expressions.
Leveraging Word Dependencies: N-grams allow you to leverage dependencies between words. For instance, in the context of named entities, the words within an n-gram are often closely related, providing a stronger signal for entity recognition.
Flexibility in Tokenization: Depending on the specific requirements, you can adjust the granularity of tokenization. For example, we might tokenize your text into individual words or use bi-grams and tri-grams, offering flexibility based on the nature of the task.
Machine Learning Feature Representation: When using machine learning models for named entity recognition, n-grams can serve as valuable features. The models can learn patterns and associations within these sequences to improve entity recognition performance.
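As an illustration of how multi-word units can be counted, the sketch below extracts bi-grams and tri-grams with scikit-learn's CountVectorizer; this is only one convenient way to produce n-gram counts and is not necessarily how our pipeline does it.

```python
# Illustrative sketch: counting word bi-grams and tri-grams so that multi-word
# names like «New York City» can be treated as single units.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The mayor of New York City spoke at New York City Hall.",
    "New York City is often shortened to New York.",
]

vectorizer = CountVectorizer(ngram_range=(2, 3), lowercase=True)
matrix = vectorizer.fit_transform(documents)       # documents x n-gram counts

for ngram, column in vectorizer.vocabulary_.items():
    total = matrix[:, column].sum()
    if total > 1:                                   # keep repeated n-grams only
        print(ngram, total)
```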
Normalization methods often involve scaling or adjusting frequencies based on document length or other factors. In some cases, this may distort the inherent patterns and structures in the raw frequency data. For certain analyses, maintaining the context provided by the raw frequencies is essential. Raw frequencies (Word Frequency and Document Frequency) preserve the original, unaltered information about how often a word occurs in a document and how many documents contain that word. This preservation of raw data can be crucial for certain tasks where the original counts are of primary interest. This is especially true when dealing with «organic corpora», such as those analyzed here.
Moreover, in distribution fitting tasks, especially when fitting statistical distributions like the Negative Binomial Distribution or other models, using raw frequencies allows for a straightforward and direct application of the data to the chosen distribution. These statistical models are designed to work with raw counts.
You can download datasets and raw text data at the end of each 3d viewing page. There was minimal text parsing involved, and the only requirement for a text collection is that each text (chapter) is separated by a new line. After parsing, texts are broken down into raw frequency counts for individual word tokens to obtain a word matrix. You can use a Python script for this task; see our repo. Distribution fitting is done over these word matrices. Please check out our corpus_utils on Github.
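For orientation, here is a minimal sketch of that preprocessing step: one chapter per line, raw counts per chapter, and the resulting word matrix together with Word Frequency and Document Frequency. The file name and tokenization are assumptions; the real pipeline lives in corpus_utils on Github.

```python
# Minimal sketch of the preprocessing described above: each line of the input
# file is one text (chapter); counts are raw and unnormalized.
from sklearn.feature_extraction.text import CountVectorizer

with open("book.txt", encoding="utf-8") as f:
    chapters = [line.strip() for line in f if line.strip()]   # one text per line

vectorizer = CountVectorizer(lowercase=True)
word_matrix = vectorizer.fit_transform(chapters)   # chapters x words, raw counts

words = vectorizer.get_feature_names_out()
fr = word_matrix.sum(axis=0).A1                    # Word Frequency (simple sum)
df = (word_matrix > 0).sum(axis=0).A1              # Document Frequency
print(list(zip(words[:10], fr[:10], df[:10])))
```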
Yes. Mine text data interactively in Keyword Extractor and Text Analyzer. Save it. Upload it to the 3d viewer on the Semascope Viewer page.