In this post I am going to show you how to build a word frequency matrix. This can be done using AWK, which exists on any UNIX system, but we will also do the same with a modern scripting language, Python.
Make sure to check out the post on compressing your collection of texts into a single file, which can then be processed by these scripts to create a text matrix.
A few words on what we are talking about. A word frequency matrix, often referred to as a document-term matrix, is a mathematical representation that captures the frequency of terms occurring in a collection of documents. This matrix is a fundamental tool in natural language processing (NLP), corpus linguistics, and computational text analysis, providing a structured way to analyze and understand unstructured textual data.
The matrix represents the counts of each word in the individual documents. Each row of the matrix represents a single document, while each column represents a single word. There is a column for every word in the collection that is not a stop word, so the matrix can be quite sparse even when you blacklist some words: most documents use only a small fraction of the full vocabulary. Put differently, in a document-term matrix each row corresponds to a document within the collection, and each column corresponds to a unique term present in the documents.
It’s worth noting that the term «document-feature matrix» is a more general concept, where «features» can encompass various properties of a document beyond just terms. However, the document-term matrix is a specific instance of this broader concept, focusing specifically on term frequencies. Additionally, one may encounter the transpose of the document-term matrix, known as the term-document matrix. In this alternate representation, documents become columns, and terms become rows. Both representations are valuable in different contexts, offering flexibility in analysis approaches.
So, the word frequency matrix, often synonymous with the document-term matrix, plays a pivotal role in extracting meaningful information from text data, providing a foundation for tasks ranging from sentiment analysis to topic modeling in the realm of natural language processing and computational text analysis. Moreover, one can analyze word frequency distributions and model them.
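To make the structure concrete, here is a minimal, dependency-free Python sketch (my own illustration, not one of the repository scripts we will use below) that builds a tiny document-term matrix for two made-up documents:

```python
from collections import Counter

# Two toy "documents"; in the scripts below, each line of the input file
# plays the role of one document.
docs = [
    "the emperor liked his new clothes",
    "the nightingale sang for the emperor",
]

# Tokenize and build the shared vocabulary (the matrix columns)
tokens_per_doc = [doc.lower().split() for doc in docs]
vocabulary = sorted(set(word for tokens in tokens_per_doc for word in tokens))

# One row of counts per document, one column per vocabulary word
matrix = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokens_per_doc]

print(' '.join(vocabulary))
for row in matrix:
    print(' '.join(str(count) for count in row))
```

The AWK and Python scripts below do essentially the same thing, just with text cleanup and frequency filtering added.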
Here is an example of a word frequency matrix based on Andersen’s Fairy Tales (first n words):
| Document | the | emperor | new | clothes | many | years | ago | there | was |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 164 | 28 | 11 | 12 | 1 | 1 | 1 | 6 | 21 |
| 2 | 130 | 12 | 0 | 1 | 1 | 1 | 0 | 5 | 23 |
| 3 | 29 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 14 |
| 4 | 930 | 0 | 14 | 2 | 10 | 5 | 2 | 43 | 195 |
| 5 | 288 | 0 | 0 | 0 | 5 | 1 | 0 | 15 | 57 |
| 6 | 867 | 0 | 5 | 3 | 12 | 0 | 0 | 78 | 217 |
| 7 | 46 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 17 |
| 8 | 245 | 0 | 7 | 0 | 6 | 2 | 1 | 15 | 50 |
| 9 | 193 | 0 | 0 | 1 | 1 | 0 | 0 | 13 | 42 |
| 10 | 236 | 0 | 2 | 2 | 4 | 6 | 0 | 36 | 59 |
| 11 | 72 | 0 | 0 | 0 | 1 | 0 | 0 | 19 | 23 |
| 12 | 108 | 0 | 0 | 2 | 3 | 0 | 0 | 22 | 26 |
| 13 | 61 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 23 |
| 14 | 331 | 0 | 4 | 3 | 4 | 5 | 0 | 46 | 96 |
| 15 | 75 | 0 | 2 | 0 | 0 | 0 | 0 | 5 | 20 |
| 16 | 107 | 0 | 0 | 0 | 9 | 0 | 0 | 14 | 44 |
| 17 | 55 | 0 | 0 | 0 | 2 | 0 | 2 | 4 | 6 |
| 18 | 178 | 0 | 2 | 1 | 2 | 0 | 0 | 14 | 51 |
Each row represents a different document, and each column represents a unique word. The numbers in the cells indicate the frequency of each word in the corresponding document. This matrix provides a quantitative overview of word occurrences in Andersen’s Fairy Tales, facilitating further analysis and insights into the document collection.
A side note: even looking at this actual example, one can see that word frequency matrices have a lot of zeros. The abundance of zeros in word frequency matrices goes hand in hand with overdispersion, a statistical phenomenon where the variance in word occurrences exceeds the mean. This sparsity arises from the nature of textual data, where documents typically utilize only a fraction of the entire vocabulary. The overdispersed distribution of word frequencies across documents poses challenges to traditional statistical models, necessitating specialized techniques like zero-inflated models or negative binomial regression to appropriately account for the variability in word occurrences.
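If you want to see this on your own data, a quick diagnostic is to compare the variance of each word's counts across documents with its mean: under a Poisson assumption the two would be roughly equal. Here is a small numpy sketch (my own illustration, not part of the repository), using the first four rows and columns of the table above:

```python
import numpy as np

# Counts for the words "the", "emperor", "new", "clothes" in documents 1-4
# (taken from the Andersen table above)
counts = np.array([
    [164, 28, 11, 12],
    [130, 12,  0,  1],
    [ 29,  0,  0,  2],
    [930,  0, 14,  2],
])

means = counts.mean(axis=0)      # mean frequency of each word
variances = counts.var(axis=0)   # variance of each word across documents

# For Poisson-like counts the ratio is about 1; values far above 1
# indicate overdispersion.
print(variances / means)
```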
Understanding and addressing overdispersion is essential for accurate interpretation and inference from word frequency matrices, ensuring that statistical analyses align with the unique characteristics of the data and provide reliable insights into the underlying text corpus. But let’s get back to our task at hand.
Clone the repository containing the scripts, called corpus_utils.
Open a terminal or command prompt on your local machine.
Navigate to the directory where you want to clone the repository.
Use the following command to clone the repository:
git clone https://github.com/roverbird/corpus_utils.git
Once the cloning process is complete, navigate into the cloned directory:
cd corpus_utils
Now you have access to the contents of the repository, including the scripts `wordstats.awk` and `wordstats.py`. You will also find sample texts, such as Andersen’s Fairy Tales, in `/corpus_utils/examples/corpora`. Let’s take a look inside the scripts.
Let’s consider an AWK script that efficiently constructs a word frequency matrix. The script takes a text file as input, where each line represents a single document. Here’s a copy-paste ready version of the script, called `wordstats.awk` in our repo:
```awk
{
    $0 = tolower($0);           # Convert the entire line to lowercase
    gsub("[:;.,()!?-]", " ");   # Replace certain punctuation marks with spaces
    t++;
    for (w = 1; w <= NF; w++) {
        l[t, $w]++;             # Count occurrences of each word for each line
        g[$w]++;                # Count total occurrences of each word across all lines
    }
}

END {
    for (w in g)
        if (g[w] < 10 || g[w] > 100000)
            delete g[w];        # Delete words occurring less than 10 times or more than 100,000 times
        else
            printf w " ";       # Print words that meet the frequency criteria
    print "";
    for (i = 1; i <= t; i++) {
        for (w in g)
            printf +l[i, w] " ";  # Print the frequency of each word for each line
        print "";
    }
}
```
In summary, the program:

- counts the occurrences of each word in each line (`l[t, $w]++`);
- counts the total occurrences of each word across all lines (`g[$w]++`);
- in the `END` block, filters out words occurring less than 10 times or more than 100,000 times (you can adjust these values), prints the remaining words as a header row, and then prints one row of frequencies per input line.

To run this AWK script from the Linux command line with the provided input and output files, follow the steps below:
Open a terminal on your Linux system.
Navigate to the directory where the AWK script is located using the `cd` command. Assuming the script is at `~/corpus_utils/wordstats.awk`, use the following command:
cd ~/corpus_utils
Execute the AWK script by providing the input and output files:
awk -f ~/corpus_utils/wordstats.awk ~/corpus_utils/examples/corpora/andersen.txt > word_matrix.txt
This command specifies the AWK script (`-f wordstats.awk`), the input file (`~/corpus_utils/examples/corpora/andersen.txt`), and directs the output to a file named `word_matrix.txt`. Adjust the file paths as needed based on your actual directory structure.

Once the command is executed, the word frequency matrix will be written to the specified output file (`word_matrix.txt`).
Same functionality, but with Python, so that you can integrate it easily with your workflow.
```python
import sys
import re

# Check that both input and output filenames, as well as min and max frequencies, were provided
if len(sys.argv) != 5:
    print('Usage: python program_name.py input_file_name output_file_name min_frequency max_frequency')
    sys.exit()

# Get the filenames and frequency values from the command-line arguments
input_file_name = sys.argv[1]
output_file_name = sys.argv[2]
min_frequency = int(sys.argv[3])
max_frequency = int(sys.argv[4])

# Initialize dictionaries for storing word frequencies
l = {}  # per-line counts, keyed by (line number, word)
g = {}  # global counts, keyed by word

# Initialize counter for the number of lines processed
t = 0

# Open the input file
with open(input_file_name, 'r') as input_file:
    # Loop through each line of input
    for line in input_file:
        # Convert line to lowercase
        line = line.lower()
        # Remove all punctuation marks and replace them with spaces
        line = re.sub(r'[^\w\s]+', ' ', line)
        # Remove all numeric chars and replace them with spaces
        line = re.sub(r'[0-9]+', ' ', line)
        # Remove words starting with 'X' (remove trash)
        line = ' '.join(word for word in line.split() if not word.startswith('X'))
        # Increment line counter
        t += 1
        # Loop through each word in the line
        for word in line.split():
            # Increment frequency of word in line
            l[(t, word)] = l.get((t, word), 0) + 1
            # Increment frequency of word in entire text
            g[word] = g.get(word, 0) + 1

# Open the output file
with open(output_file_name, 'w') as output_file:
    # Loop through each word in the g dictionary
    for word in list(g.keys()):
        # Delete words with frequency outside the specified range
        if g[word] < min_frequency or g[word] > max_frequency:
            del g[word]
        else:
            # Write remaining words separated by spaces to the output file
            output_file.write(word + ' ')
    output_file.write('\n')
    # Loop through each line processed
    for i in range(1, t + 1):
        # Loop through each word in the g dictionary
        for word in list(g.keys()):
            # Write frequency of word in line to the output file
            output_file.write(str(l.get((i, word), 0)) + ' ')
        output_file.write('\n')

# Now, reopen the file and remove the trailing space from each line using rstrip()
with open(output_file_name, "r+") as file:
    lines = file.readlines()
    file.seek(0)
    for line in lines:
        # Strip the trailing whitespace left by the space-separated writes
        file.write(line.rstrip() + '\n')
    file.truncate()

print("Word frequency calculation complete. Output saved to", output_file_name)
```
The provided Python script, named `wordstats.py`, is a tool for analyzing and extracting word frequency information from a collection of documents. Here’s a brief explanation of the input data and the script’s functionality:
Input Data Format: The input for the script is a single UTF-8 text file where each line represents an individual document in the collection. This format facilitates the handling of the entire document collection as a single file, streamlining the processing of text data.
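This blog has a separate post on preparing texts in this format (linked above and at the end of this article), but if you just want something quick to experiment with, here is one possible sketch that collapses a folder of plain-text files into the one-document-per-line format. The documents/ folder and the corpus.txt filename are hypothetical placeholders; adjust them to your own setup:

```python
import glob

# Collapse each .txt file in a (hypothetical) documents/ folder into one line
# of a single corpus file, so that every line = one document.
with open('corpus.txt', 'w', encoding='utf-8') as corpus:
    for path in sorted(glob.glob('documents/*.txt')):
        with open(path, 'r', encoding='utf-8') as doc:
            # Replace newlines and repeated whitespace inside a document with single spaces
            corpus.write(' '.join(doc.read().split()) + '\n')
```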
The script performs the following tasks:
Lowercasing and Punctuation Handling: It converts all text to lowercase, ensuring uniformity. Punctuation characters are replaced with spaces, contributing to the script’s ability to accurately count word frequencies.
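To see what this normalization does, here is a tiny illustration (not part of the repository) that applies the same two regular expressions from the script to a sample sentence:

```python
import re

sample = "The Emperor's New Clothes, in 1837!"
cleaned = sample.lower()
cleaned = re.sub(r'[^\w\s]+', ' ', cleaned)  # punctuation -> spaces
cleaned = re.sub(r'[0-9]+', ' ', cleaned)    # digits -> spaces
print(cleaned.split())
# ['the', 'emperor', 's', 'new', 'clothes', 'in']
```

Note that the apostrophe is treated like any other punctuation mark, so "emperor's" splits into two tokens.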
Word Frequency Counting: The script counts the frequency of each word in each line and maintains an overall count of word frequencies across the entire text. It uses two dictionaries (`l` and `g`) to store these counts.
Filtering Based on Frequency: Words are filtered based on user-defined minimum and maximum frequency thresholds (`min_frequency` and `max_frequency`). Words falling outside this range are excluded from the final output.
Output Format: The resulting word frequency information is saved to the output file you specify (e.g. `result.txt`). Its first line is a space-separated list of the words that meet the frequency criteria; the following lines form the matrix of word frequencies, one row per line (document) of the input text.
Usage Example: The script is invoked from the command line with the following usage pattern:
python wordstats.py input.txt result.txt 3 100000
Here, `input.txt` is the input file containing the document collection, `result.txt` is the output file, and `3` and `100000` are the minimum and maximum word frequency thresholds, respectively.
This script, combined with the provided AWK script in the previous example, demonstrates the flexibility and adaptability of different programming languages in processing and analyzing text data.
If you cloned the repo, you can run this script and instantly get a working example with Andersen’s Fairy Tales.
python3 ~/corpus_utils/wordstats.py ~/corpus_utils/examples/corpora/andersen.txt word_matrix.txt 5 100000
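Once you have `word_matrix.txt`, you can load it into whatever tool you prefer for further analysis. Here is a minimal sketch using numpy and pandas (my own choice here, the repository does not require them), based on the output format described above:

```python
import numpy as np
import pandas as pd

with open('word_matrix.txt', encoding='utf-8') as f:
    words = f.readline().split()                # first line: the kept vocabulary
    counts = np.loadtxt(f, dtype=int, ndmin=2)  # remaining lines: one row per document

matrix = pd.DataFrame(counts, columns=words)
print(matrix.shape)               # (number of documents, vocabulary size)
print(matrix.sum().nlargest(10))  # ten most frequent words overall
```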
After we have built a word matrix for a collection of text files, we can analyze word frequencies and extract keywords. We will cover the topic of automated keyword extraction in the next post.
Also Read How to prepare your texts for creating a word frequency matrix