How to build a word frequency matrix using AWK or Python

In this Post

Here are two scripts, one in AWK and one in Python, that you can use to build a word frequency matrix, often referred to as a document-term matrix.

Intro

In this post I am going to show you how to build a word frequency matrix. This can be done with AWK, which exists on any UNIX system, but we will also do the same with a modern scripting language, Python.

Try free online Keyword Extractor and Text Analyzer

Try ChatGPT detector

Make sure to check out the post on compressing your collection of texts into a single file, which can then be processed by these scripts to create a text matrix.

A few words on what we are talking about. A word frequency matrix, often referred to as a document-term matrix, is a mathematical representation that captures the frequency of terms occurring in a collection of documents. This matrix is a fundamental tool in natural language processing (NLP), corpus linguistics, and computational text analysis, providing a structured way to analyze and understand unstructured textual data.

The matrix records the counts of each word in the individual documents. Each row of the matrix represents a single document, while each column represents a single word. There is a column for every word in the collection that is not filtered out (for example, as a stop word), and because each document uses only a fraction of the vocabulary, the matrix tends to be sparse. Put differently, in a document-term matrix each row corresponds to a document within the collection, and each column corresponds to a unique term present in the documents.

It’s worth noting that the term «document-feature matrix» is a more general concept, where «features» can encompass various properties of a document beyond just terms. However, the document-term matrix is a specific instance of this broader concept, focusing specifically on term frequencies. Additionally, one may encounter the transpose of the document-term matrix, known as the term-document matrix. In this alternate representation, documents become columns, and terms become rows. Both representations are valuable in different contexts, offering flexibility in analysis approaches.

So, the word frequency matrix, often synonymous with the document-term matrix, plays a pivotal role in extracting meaningful information from text data, providing a foundation for tasks ranging from sentiment analysis to topic modeling in the realm of natural language processing and computational text analysis. Moreover, one can analyze word frequency distributions and model them.
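
As a minimal illustration of the idea (the two toy documents below are made up for the example and are not part of the scripts discussed later), here is how such a matrix can be assembled in a few lines of Python:

from collections import Counter

# Toy corpus: each string is one "document"
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Per-document counts and the shared vocabulary (the columns of the matrix)
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for c in counts for word in c))

# Print the document-term matrix: one row per document, one column per word
print(" ".join(vocab))
for c in counts:
    print(" ".join(str(c[word]) for word in vocab))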

Example of a word frequency matrix

Here is an example of a word frequency matrix based on Andersen’s Fairy Tales (first n words):

Document | the | emperor | new | clothes | many | years | ago | there | was
[18 rows of counts, one per document: each row gives the document number followed by the frequency of each of these words in that document]

Each row represents a different document, and each column represents a unique word. The numbers in the cells indicate the frequency of each word in the corresponding document. This matrix provides a quantitative overview of word occurrences in Andersen’s Fairy Tales, facilitating further analysis and insights into the document collection.

A side note: even looking at this example, one can see that word frequency matrices contain a lot of zeros. The abundance of zeros in word frequency matrices signifies overdispersion, a statistical phenomenon where the variance of word occurrences exceeds their mean. This sparsity arises from the nature of textual data, where documents typically use only a fraction of the entire vocabulary. The overdispersed distribution of word frequencies across documents poses challenges to traditional statistical models, necessitating specialized techniques like zero-inflated models or negative binomial regression to appropriately account for the variability in word occurrences.
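
If you want to see this for yourself, compare the mean and the variance of each column of a finished matrix. Here is a minimal sketch, assuming you have already produced a matrix file in the format described below (first line: words; remaining lines: one row of counts per document), saved as word_matrix.txt, and that it contains at least two documents:

import statistics

# Read the matrix: first line = words, remaining lines = per-document counts
with open("word_matrix.txt") as f:
    words = f.readline().split()
    rows = [[int(x) for x in line.split()] for line in f if line.strip()]

# For each word (column), compare the mean of its counts with their variance
for j, word in enumerate(words):
    column = [row[j] for row in rows]
    mean = statistics.mean(column)
    var = statistics.variance(column)
    if var > mean:  # overdispersion: variance exceeds the mean
        print(f"{word}: mean={mean:.1f}, variance={var:.1f}")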

Understanding and addressing overdispersion is essential for accurate interpretation and inference from word frequency matrices: it ensures that statistical analyses align with the characteristics of the data and provide reliable insights into the underlying text corpus. But let’s get back to the task at hand.

Grab the Scripts

Clone the repository containing the scripts, called corpus_utils.

Open a terminal or command prompt on your local machine.

Navigate to the directory where you want to clone the repository.

Use the following command to clone the repository:

   git clone https://github.com/roverbird/corpus_utils.git

Once the cloning process is complete, navigate into the cloned directory:

   cd corpus_utils

Now you have access to the contents of the repository, including the scripts wordstats.awk and wordstats.py. You will also find sample texts, such as Andersen’s Fairy Tales, in /corpus_utils/examples/corpora. Let’s take a look inside the scripts.

Word Matrix with AWK

Let’s consider an AWK script that efficiently constructs a word frequency matrix. The script takes a text file as input, where each line represents a single document. Here’s a copy-paste ready version of the script called wordstats.awk in our repo:

{
    $0 = tolower($0);            # Convert the entire line to lowercase
    gsub("[:;.,()!?-]", " ");    # Replace certain punctuation marks with spaces
    t++;
    for (w = 1; w <= NF; w++) {
        l[t, $w]++;              # Count occurrences of each word for each line
        g[$w]++;                 # Count total occurrences of each word across all lines
    }
}

END {
    for (w in g)
        if (g[w] < 10 || g[w] > 100000)
            delete g[w];          # Delete words occurring less than 10 times or more than 100,000 times
        else
            printf w " ";         # Print words that meet the frequency criteria
    print "";

    for (i = 1; i <= t; i++) {
        for (w in g)
            printf +l[i, w] " ";  # Print the frequency of each word for each line
        print "";
    }
}

In summary, the program:

  1. Converts the entire input to lowercase.
  2. Replaces certain punctuation marks with spaces.
  3. Counts the occurrences of each word for each line (l[t, $w]++).
  4. Counts the total occurrences of each word across all lines (g[$w]++).
  5. In the END block, it filters out words occurring less than 10 times or more than 100,000 times (you can adjust these values).
  6. Prints the filtered words.
  7. Prints the frequency of each word for each line.

To run this AWK script from the Linux command line with the provided input and output files, follow the steps below:

Open a terminal on your Linux system.

Navigate to the directory where the AWK script is located using the cd command. Assuming the script is at ~/corpus_utils/wordstats.awk, use the following command:

   cd ~/corpus_utils

Execute the AWK script by providing the input and output files:

   awk -f ~/corpus_utils/wordstats.awk ~/corpus_utils/examples/corpora/andersen.txt > word_matrix.txt

This command specifies the AWK script (-f wordstats.awk), the input file (~/corpus_utils/examples/corpora/andersen.txt), and directs the output to a file named word_matrix.txt. Adjust the file paths as needed based on your actual directory structure.

Once the command is executed, the word frequency matrix will be written to the specified output file (word_matrix.txt).
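
If you prefer to work with the result in Python rather than reading the raw text file, here is a minimal sketch for loading word_matrix.txt back in (plain Python, no extra libraries; it relies only on the output format described above, a header line of words followed by one line of counts per document):

# Read the matrix: first line = words, remaining lines = per-document counts
with open("word_matrix.txt") as f:
    words = f.readline().split()
    matrix = [dict(zip(words, map(int, line.split()))) for line in f if line.strip()]

print(len(matrix), "documents,", len(words), "words")
print(matrix[0])  # word counts for the first document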

Word Matrix with Python

The same functionality, but in Python, so that you can integrate it easily into your workflow.


import sys
import re

# Check that both input and output filenames, as well as min and max frequencies, were provided
if len(sys.argv) != 5:
    print('Usage: python program_name.py input_file_name output_file_name min_frequency max_frequency')
    sys.exit()

# Get the filenames and frequency values from the command-line arguments
input_file_name = sys.argv[1]
output_file_name = sys.argv[2]
min_frequency = int(sys.argv[3])
max_frequency = int(sys.argv[4])

# Initialize dictionaries for storing word frequencies
l = {}
g = {}

# Initialize counter for the number of lines processed
t = 0

# Open the input file
with open(input_file_name, 'r') as input_file:
    # Loop through each line of input
    for line in input_file:
        # Convert line to lowercase
        line = line.lower()
        # Remove all punctuation marks and replace them with spaces
        line = re.sub(r'[^\w\s]+', ' ', line)
        # Remove all numeric chars and replace them with spaces
        line = re.sub(r'[0-9]+', ' ', line)
        # Remove words starting with 'x' (trash/placeholder tokens; the line has already been lowercased)
        line = ' '.join(word for word in line.split() if not word.startswith('x'))

        # Increment line counter
        t += 1

        # Loop through each word in the line
        for word in line.split():
            # Increment frequency of word in line
            l[(t, word)] = l.get((t, word), 0) + 1
            # Increment frequency of word in entire text
            g[word] = g.get(word, 0) + 1

# Open the output file
with open(output_file_name, 'w') as output_file:
    # Loop through each word in the g dictionary
    for word in list(g.keys()):
        # Delete words with frequency outside the specified range
        if g[word] < min_frequency or g[word] > max_frequency:
            del g[word]
        else:
            # Write remaining words separated by spaces to the output file
            output_file.write(word + ' ')
    output_file.write('\n')

    # Loop through each line processed
    for i in range(1, t + 1):
        # Loop through each word in the g dictionary
        for word in list(g.keys()):
            # Write frequency of word in line to the output file
            output_file.write(str(l.get((i, word), 0)) + ' ')
        output_file.write('\n')

# Now, reopen the file and strip the trailing space from the end of each line
with open(output_file.name, "r+") as file:
    lines = file.readlines()
    file.seek(0)
    for line in lines:
        # Remove the last column by stripping the trailing whitespace
        file.write(line.rstrip() + '\n')
    file.truncate()

    print("Word frequency calculation complete. Output saved to", output_file)

The provided Python script, named wordstats.py, is a tool for analyzing and extracting word frequency information from a collection of documents. Here’s a brief explanation of the input data and the script’s functionality:

Input Data Format: The input for the script is a single UTF-8 text file where each line represents an individual document in the collection. This format facilitates the handling of the entire document collection as a single file, streamlining the processing of text data.

The script performs the following tasks:

  1. Lowercasing and Punctuation Handling: It converts all text to lowercase, ensuring uniformity. Punctuation characters are replaced with spaces, contributing to the script’s ability to accurately count word frequencies.

  2. Word Frequency Counting: The script counts the frequency of each word in each line and maintains an overall count of word frequencies across the entire text. It uses two dictionaries (l and g) to store these counts.

  3. Filtering Based on Frequency: Words are filtered based on user-defined minimum and maximum frequency thresholds (min_frequency and max_frequency). Words falling outside this range are excluded from the final output.

  4. Output Format: The resulting word frequency information is saved to the output file you specify (result.txt in the usage example below). The file begins with a space-separated list of the words that meet the frequency criteria, followed by a matrix of word frequencies for each line of the input text.

  5. Usage Example: The script is invoked from the command line with the following usage pattern:

    python wordstats.py input.txt result.txt 3 100000
    

    Here, input.txt is the input file containing the document collection, result.txt is the output file, and 3 and 100000 are the minimum and maximum word frequency thresholds, respectively.

This script, combined with the AWK script in the previous example, demonstrates the flexibility and adaptability of different programming languages in processing and analyzing text data.

If you cloned the repo, you can run this script and instantly get a working example with Andersen’s Fairy Tales:

    python3 ~/corpus_utils/wordstats.py ~/corpus_utils/examples/corpora/andersen.txt word_matrix.txt 5 100000
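
As a quick sanity check (a minimal sketch, assuming the output file is word_matrix.txt as in the command above), you can verify that there is one count row per input document and that every row has exactly one value per header word:

# Verify the shape of the matrix written by wordstats.py (or wordstats.awk)
with open("word_matrix.txt") as f:
    n_words = len(f.readline().split())
    rows = [line.split() for line in f if line.strip()]

print(len(rows), "documents x", n_words, "words")
assert all(len(row) == n_words for row in rows), "row length does not match the header"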
    

After we have built a word matrix for a collection of text files, we can analyze word frequencies and extract keywords. We will cover the topic of automated keyword extraction in the next post.

Also Read: How to prepare your texts for creating a word frequency matrix

Text: Alexandre Sotov
Comments or Questions? Contact me on LinkedIn
