How to prepare your texts for creating a word frequency matrix

In this Post

In this post we discuss a Python script that compresses your collection of texts (corpus) into a single file, which you can later use to build a word frequency matrix.

Intro

In a previous post we talked about how to build a word frequency matrix with Python. In this post you will find instructions on how to parse a collection of texts and put them together into a single file, which can later be used to produce a word frequency matrix.

In natural language processing (NLP) and text analysis, preparing textual data is a crucial step. Raw data often comes in various formats, with HTML tags, line breaks, and other elements that may interfere with subsequent analyses. To streamline this process, this article explores a Python script named compress.py. The script parses your documents into a single file, which can later be used to represent your corpus or collection as a word frequency matrix.

So, you have a directory with tons of text files: website posts, news articles, emails, reviews… Your task is to prepare them for in-depth statistical analysis. In many scenarios this means creating a word frequency matrix from your source texts (i.e. from your text corpus). To do that, you first need to compress your collection of texts into a single text file, where each line is a separate post, article, or ‘chapter’.

Once that is done, it will be easy to build the word (or term) matrix, and we will walk you through the process.
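
As a rough sketch of where this is heading, assuming the compressed file already exists with one document per line, the following shows how such lines could be turned into a small word frequency matrix using only the standard library. The function name build_matrix and the lowercasing step are illustrative choices of this sketch, not part of compress.py.

```python
from collections import Counter

def build_matrix(lines):
    """Return (vocabulary, rows): one row of word counts per non-empty line."""
    # Count words per document; lowercasing is an assumption of this sketch
    counts = [Counter(line.lower().split()) for line in lines if line.strip()]
    # The vocabulary is the sorted set of all words seen in any document
    vocabulary = sorted(set().union(*counts)) if counts else []
    # Each row holds the count of every vocabulary word in one document
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

if __name__ == "__main__":
    # Two tiny 'documents', each on its own line, as in the compressed file
    sample = ["the cat sat on the mat", "the dog sat"]
    vocabulary, rows = build_matrix(sample)
    print(vocabulary)
    for row in rows:
        print(row)
```

With a real corpus you would pass the open output file instead of the sample list, since iterating over a file yields one line (one document) at a time.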

Understanding the Script

The compress.py script reads the files in a specified directory (without descending into subdirectories), strips HTML tags and punctuation from the text files it finds (files ending in .txt), and merges them into a single file. The result is a clean, concatenated text file ready for statistical analysis. In that file, each line is one text (‘chapter’) and the whole file is the collection (corpus, or ‘book’).

Here is the code:

import os
import re
import string
import sys

def clean_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def process_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        content = clean_html_tags(content)

        # Replace punctuation with spaces
        punctuation_chars = string.punctuation
        translator = str.maketrans(punctuation_chars, ' ' * len(punctuation_chars))
        content = content.translate(translator)

        content = content.replace('\n', ' ').replace('\r', ' ')  # Remove new lines and line returns
        return content

def process_directory(input_directory, output_file):
    with open(output_file, 'w', encoding='utf-8') as output:
        # List all files in the directory
        filenames = os.listdir(input_directory)

        # Sort files numerically and then alphabetically
        filenames = sorted(filenames, key=lambda x: (int(re.search(r'\d+', x).group()) if re.search(r'\d+', x) else float('inf'), x))

        for filename in filenames:
            if filename.endswith('.txt'):  # Process only text files
                file_path = os.path.join(input_directory, filename)
                processed_content = process_file(file_path)
                output.write(processed_content + '\n')

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python compress.py <input_directory> <output_file>")
        sys.exit(1)

    input_directory = sys.argv[1]
    output_file = sys.argv[2]

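    # os.listdir needs a directory, so reject plain files as well as missing paths
    if not os.path.isdir(input_directory):
        print(f"The specified input directory '{input_directory}' does not exist or is not a directory.")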
        sys.exit(1)

    process_directory(input_directory, output_file)

    print("Texts processing complete. Output saved to", output_file)
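
To see what the cleaning steps do to a small input, here is a minimal sketch that applies the same three operations to a string instead of a file. The helper name clean_text and the sample string are made up for illustration.

```python
import re
import string

def clean_text(text):
    """Apply the same three cleaning steps as compress.py to a string."""
    # 1. Drop anything that looks like an HTML tag
    text = re.sub(r'<.*?>', '', text)
    # 2. Replace every ASCII punctuation character with a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    text = text.translate(table)
    # 3. Flatten the text to a single line
    return text.replace('\n', ' ').replace('\r', ' ')

if __name__ == "__main__":
    sample = "<p>Hello, world!</p>\nSecond line."
    cleaned = clean_text(sample)
    print(repr(cleaned))
    print(cleaned.split())  # splitting on whitespace absorbs the extra spaces
```

Because punctuation and line breaks are replaced with spaces rather than deleted, the cleaned text can contain runs of several spaces. That is harmless for word counting, since splitting on whitespace ignores them.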

Features of the Script

HTML Tag Removal: The script uses a regular expression to remove HTML tags from the text. This is crucial for web-scraped content or documents with HTML formatting. Because the pattern does not match across line breaks, a tag split over several lines is not removed as a tag, and its name and attributes end up in the output as ordinary words.
Punctuation Removal: Punctuation is often noise in textual data. The script uses Python’s str.translate method to replace every character in string.punctuation (ASCII punctuation) with a space, so typographic characters such as curly quotes are left in place.
New Line and Carriage Return Removal: To facilitate the creation of a document-term matrix, the script replaces newlines and carriage returns with spaces, so that each text occupies a single line.
Sorting and Processing Text Files: The script sorts files by the first number found in each file name and then alphabetically; files without a number come last. This is particularly useful when the order of files matters, for example with numbered chapters.
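
The sorting rule is the least obvious of these features, so here is a small sketch of it in isolation. It uses the same key as the script, written as a named function (sort_key is a name chosen for this example), and the file names are invented.

```python
import re

def sort_key(name):
    """Order by the first number in the name, then by the name itself."""
    match = re.search(r'\d+', name)
    # Names without any digits get infinity, which places them last
    number = int(match.group()) if match else float('inf')
    return (number, name)

if __name__ == "__main__":
    names = ["chapter10.txt", "chapter2.txt", "intro.txt", "chapter1.txt"]
    print(sorted(names, key=sort_key))
    # → ['chapter1.txt', 'chapter2.txt', 'chapter10.txt', 'intro.txt']
```

A plain alphabetical sort would put chapter10.txt before chapter2.txt. Note that only the first number in a name is used, so a name like 2024-05-report.txt is ordered by 2024 alone.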

Using the Script

The script is user-friendly and can be executed from the command line. The usage is as follows:

python compress.py /path/to/your/directory output.txt

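Note that the script processes only files whose names end in .txt; other files in the directory are skipped, so save HTML pages with a .txt extension if you want them included. Write the output file outside the input directory: the script creates it before listing the directory, so an output file with a .txt name placed inside would itself be picked up as an input. Examples of suitable data include collections of news articles, chapters of novels, poems by an author, or reviews for the same film, each in a separate file.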
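
After a run, one quick sanity check is to compare the number of .txt files in the input directory with the number of lines in the output file; the two should match. The sketch below is an illustration only: count_documents is a hypothetical helper, and the demonstration uses temporary, made-up files rather than a real corpus.

```python
import os
import tempfile

def count_documents(input_directory, output_file):
    """Return (number of .txt files in the directory, number of lines in the output)."""
    txt_files = [f for f in os.listdir(input_directory) if f.endswith('.txt')]
    with open(output_file, 'r', encoding='utf-8') as handle:
        line_count = sum(1 for _ in handle)
    return len(txt_files), line_count

if __name__ == "__main__":
    # Self-contained demonstration with temporary, made-up files
    with tempfile.TemporaryDirectory() as corpus_dir, tempfile.TemporaryDirectory() as out_dir:
        for name in ("1.txt", "2.txt", "notes.html"):
            with open(os.path.join(corpus_dir, name), 'w', encoding='utf-8') as f:
                f.write("some text")
        output_path = os.path.join(out_dir, "output.txt")
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write("some text\nsome text\n")  # stands in for the script's output
        files, lines = count_documents(corpus_dir, output_path)
        print(f"{files} .txt files, {lines} lines in output")
```

If the two numbers differ, a likely cause is an output file that was written into the input directory.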

Getting Started

To begin, clone the repository containing the script from https://github.com/roverbird/corpus_utils. Once cloned, navigate to the directory and execute the script using the provided usage instructions.

git clone https://github.com/roverbird/corpus_utils
cd corpus_utils
python compress.py /path/to/your/directory output.txt

Conclusion

In the ever-expanding field of text analysis, preparing data for statistical analysis is a critical step. The compress.py script simplifies the process, making it easier to convert a collection of documents into a single, clean text file. Once that is done, move on to the next step: preparing the word frequency matrix, which we will use for more sophisticated text mining, such as keyword extraction. Interested in keyword extraction techniques? Try our free online Keyword Extractor.

Thanks for reading!