How to automatically tag posts in Hugo Static Site Generator with Python

In this Post

This Python script provides an automated way to update the tag fields in the YAML frontmatter of Hugo static site markdown files based on a predefined list of keywords. It streamlines the process of managing metadata, making it easier for Hugo users to organize and categorize their content.

Intro

Static site generators like Hugo provide a powerful way to create and maintain websites. One essential aspect of managing content in Hugo is the use of YAML or TOML frontmatter to store metadata about each post. This includes information like the post’s title, date, and tags. In this blog post, we’ll explore a Python script that automates the process of updating tag fields in the YAML frontmatter of Hugo static site markdown files.

Try free online Keyword Extractor and Text Analyzer

A good thing about Hugo is that it simplifies the process of content organization through the use of taxonomies. Taxonomies like tags enable the categorization and classification of content, providing a structured way to navigate and present information on your website. What makes Hugo particularly user-friendly is its automatic creation of taxonomies, including tags, without the need for manual configuration in the site configuration file. More about it at Hugo’s official website.

Here is an example of a Hugo YAML frontmatter for a post.md with the title ‘Hugo: A fast and flexible static site generator’:

---
categories:
- Development
project_url: https://github.com/gohugoio/hugo
series:
- Go Web Dev
slug: hugo
tags:
- Development
- Go
- fast
- Blogging
title: 'Hugo: A fast and flexible static site generator'
---
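Before automating anything, note that the frontmatter above is just YAML sitting between two '---' marker lines, so it can be separated from the body text with a regular expression. Here is a minimal sketch (the sample strings are hypothetical); a similar pattern is used by the tagging script below:

```python
import re

# Split a markdown document into its YAML frontmatter and body text.
# The frontmatter sits between two '---' marker lines at the top of
# the file; re.DOTALL lets '.' match across newlines.
def split_frontmatter(md_content):
    m = re.match(r'---\n(.*?\n)---\n+(.*)', md_content, re.DOTALL)
    if not m:
        # No frontmatter found: return the whole document as body
        return None, md_content
    return m.group(1), m.group(2)
```
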

Imagine you have many texts that you want to convert to Hugo posts, or perhaps you have post md files without any tags assigned in the frontmatter. Tagging these posts with keywords by hand would be a lot of work.

First, extract keywords: copy the texts that you need to process into the Keyword Extractor and Text Analyzer and press Run. Save your results as CSV and paste the required keywords into a list as described below.
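The exported CSV can be turned into a plain keywords file with a few lines of Python. This is a sketch, not part of corpus_utils: it assumes the keyword sits in the first column and that the export has no header row (adjust the column index, or drop the first line, to match your actual export).

```python
import csv

# Convert an exported CSV of keywords into a plain text file with
# one keyword per line, the format the tagging script expects.
# Assumes the keyword is in the first column and there is no header row.
def csv_to_keywords(csv_path, txt_path):
    with open(csv_path, newline='', encoding='utf-8') as f:
        keywords = [row[0].strip() for row in csv.reader(f)
                    if row and row[0].strip()]
    with open(txt_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(keywords) + '\n')
```
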

Now you have a list of keywords. Luckily, the task of tagging posts with keywords can be automated once you have obtained that list with the help of the keyword extractor. So let us go straight to tagging; for that purpose you can use the script below.

The Code

Clone the repository containing the script, called corpus_utils.

Open a terminal or command prompt on your local machine.

Navigate to the directory where you want to clone the repository.

Use the following command to clone the repository:

   git clone https://github.com/roverbird/corpus_utils.git

Once the cloning process is complete, navigate into the cloned directory:

   cd corpus_utils

Now you have access to the contents of the repository, including the script keyword.py. Let’s break down the Python code that achieves tagging automation:

import re
import os
import yaml

def tag_keywords_in_md_files(directory_path, keywords_file_path):
    # Load keywords from the file, skipping blank lines
    with open(keywords_file_path, 'r', encoding='utf-8') as keywords_file:
        keywords = [line.strip() for line in keywords_file if line.strip()]

    # Iterate through MD files in the directory
    for filename in os.listdir(directory_path):
        if filename.endswith(".md"):
            file_path = os.path.join(directory_path, filename)

            with open(file_path, 'r', encoding='utf-8') as md_file:
                md_content = md_file.read()

            # Split YAML frontmatter and content
            yaml_match = re.match(r'---\n(.*?\n)---\n+(.*)', md_content, re.DOTALL)
            if yaml_match:
                yaml_frontmatter = yaml_match.group(1)
                md_text = yaml_match.group(2)

                # Load existing YAML data
                existing_data = yaml.safe_load(yaml_frontmatter) or {}

                # Check for keywords in the main text
                # (whole-word, case-insensitive match)
                tags = []
                for keyword in keywords:
                    if re.search(rf'\b{re.escape(keyword)}\b', md_text, flags=re.IGNORECASE):
                        tags.append(keyword)

                # Add the tags field to the existing data
                existing_data['tags'] = tags

                # Write the updated content back to the file; the opening
                # '---' must be followed directly by the YAML, with no
                # blank line, or Hugo will not parse the frontmatter
                with open(file_path, 'w', encoding='utf-8') as updated_md_file:
                    updated_md_file.write("---\n")
                    updated_md_file.write(yaml.dump(existing_data, default_style="'", allow_unicode=True))
                    updated_md_file.write("---\n\n")
                    updated_md_file.write(md_text)

if __name__ == "__main__":
    out_directory = "OUT"
    keywords_file_path = "keywords.txt"  # Specify the path to your keywords file
    tag_keywords_in_md_files(out_directory, keywords_file_path)

Comments

  1. Keyword Loading: The script starts by loading keywords from the specified file (keywords.txt). Each line of keywords.txt contains one keyword, which may be a single word or a multi-word phrase, and each keyword is stripped of leading and trailing whitespace.

  2. File Iteration: It then iterates through all the markdown files (*.md) in the specified directory (OUT in this case).

  3. YAML Frontmatter Parsing: For each markdown file, the script uses a regular expression to match and extract the YAML frontmatter and the main text content.

  4. Existing YAML Data Loading: The existing YAML data is loaded using PyYAML. If there’s no existing data, an empty dictionary is used.

  5. Keyword Matching: The script then checks for each keyword in the main text using regular expressions. If a keyword is found, it is added to the list of tags.

  6. Updating YAML Data: The tags field is added to the existing YAML data, containing the list of matched keywords.

  7. Writing Back to File: The script then writes the updated YAML frontmatter and the original text content back to the markdown file. Updated files overwrite the originals in the same directory as the input markdown files, so make a backup of your md files before you start.
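Because the script rewrites files in place, it is worth scripting the backup too. A minimal sketch using the standard library (the directory naming scheme here is just an example, not part of corpus_utils):

```python
import shutil
from datetime import datetime

# Copy the whole posts directory before running the tagger, since
# the updated files overwrite the originals in place.
def backup_posts(directory_path):
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    backup_dir = f"{directory_path}_backup_{stamp}"
    shutil.copytree(directory_path, backup_dir)
    return backup_dir
```
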

That was the easy part of tagging posts with keywords, applicable if you already have a list of them. You might be wondering: is it possible to automatically extract the actual keywords from a collection of texts? The answer is that you can, with a variety of methods, but it is not a trivial task. For meaningful keyword extraction we recommend a linguistically motivated statistical technique, the negative binomial distribution keyword extraction method. In the next post we will demonstrate how to implement it with scripts from corpus_utils.

Not only Hugo

Before we conclude, a few words on markdown for websites. Content Management Systems (CMS) that store content as markdown files with YAML, TOML, or JSON frontmatter are not uncommon, as they offer a flexible and portable approach to managing website content. Unlike traditional database-driven CMS, these systems embrace a "headless" or "flat-file" architecture, allowing content to be stored in plain text files with metadata specified in frontmatter. It is a very good thing.

Advantages of CMS with Frontmatter

  1. Version Control: Markdown files, along with frontmatter, can be easily tracked and managed using version control systems like Git. This facilitates collaboration among content creators and provides a transparent history of changes.

  2. Portability: Content stored in plain text files is highly portable. Developers can easily migrate or transfer content between different CMS platforms or hosting environments without dealing with complex database structures.

  3. Ease of Editing: Markdown is a simple and human-readable markup language, making it easy for non-technical users to create and edit content. Frontmatter allows for the inclusion of metadata without cluttering the main content.

  4. Customization: YAML, TOML, or JSON frontmatter provides a structured way to define metadata, enabling customization of content fields based on specific requirements. This flexibility allows developers to extend and tailor the CMS to meet the unique needs of a project.

Applicability Beyond Hugo

The implementation discussed earlier, which automatically updates tag fields in YAML frontmatter, is not limited to Hugo alone. Any CMS or static site generator that adopts a similar file-based content storage approach can benefit from such a script. Popular CMS platforms like Jekyll, Gatsby, and Eleventy follow a similar pattern of using markdown files with frontmatter.

  1. Jekyll: Jekyll, a widely used static site generator, also utilizes markdown files with YAML frontmatter. This script could be adapted to work seamlessly with Jekyll-based projects.

  2. Gatsby: Gatsby, a React-based static site generator, supports markdown files with frontmatter. The script’s logic can be extended or modified to suit the specific structure of Gatsby projects.

  3. Eleventy: Eleventy is a flexible static site generator that supports multiple template languages. Its reliance on markdown files with frontmatter makes it compatible with the script’s approach.

  4. PicoCMS: Pico is a flat-file CMS where you simply create markdown files and they become pages on your website, making it another natural fit for this approach.
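One practical wrinkle when moving beyond Hugo's YAML default: generators differ in which frontmatter style they use. Hugo itself accepts all three, with '---' delimiting YAML (also used by Jekyll), '+++' delimiting TOML, and a leading brace indicating JSON. A small illustrative helper, not part of corpus_utils, that guesses the style so the tagging approach can branch accordingly:

```python
# Guess which frontmatter style a markdown file uses, so the tagging
# approach can be adapted to generators beyond Hugo's YAML default.
# '---' delimits YAML (Hugo, Jekyll), '+++' delimits TOML (Hugo),
# and a leading '{' suggests JSON frontmatter.
def frontmatter_style(md_content):
    if md_content.startswith('---\n'):
        return 'yaml'
    if md_content.startswith('+++\n'):
        return 'toml'
    if md_content.lstrip().startswith('{'):
        return 'json'
    return None
```
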

The script’s utility extends to a variety of CMS and static site generators that adopt the practice of storing content in markdown files with YAML, TOML, or JSON frontmatter. This file-based approach not only simplifies content management but also enhances the interoperability and adaptability of websites across different platforms. Thanks for reading!

Text: Alexandre Sotov
Comments or Questions? Contact me on LinkedIn
