The Intricate Tapestry of ChatGPT Texts: Why Do LLMs Overuse Some Words at the Expense of Others?

In this Post

Ever wondered why ChatGPT texts are heavy on certain words? This article talks about word frequencies in fake posts. We'll review some examples and introduce a vocabulary-based ChatGPT detector.

Intro

Try free online ChatGPT detector

Try free online Keyword Extractor and Text Analyzer

If you’ve been wondering why ChatGPT texts are heavy on certain words, you are surely not alone. There is a recent thread on Reddit dedicated to OpenAI’s «intricate tapestry» phenomenon, as these are among the words you often see across replies. An anecdotal vocabulary of ChatGPT’s favourites also includes the words «intricacy», «vibrant», «breathtaking», «innovative», and so on. ChatGPT will write about «catering» to the needs of clients, making something «seamless» and suggesting «a no hassle solution»… As we shall see below, there is also a remarkable place for the «t-word» in ChatGPT’s replies to prompts. «Tapestry», yes, we are getting there.

In the vibrant, dynamic, multifaceted, kaleidoscopic and multidimensional world of AI, one linguistic generator program stood out from the rest: ChatGPT. A testament to its algorithm, the program sought to weave intricate threads of information to create a rich tapestry of knowledge. (Reddit user)

You can download the data used in this post by cloning chatgpt_corpus.

In the world of AI-generated texts there is a big issue: spam that originates from ChatGPT and other LLMs. Generally speaking, this is the problem of automatically detecting potentially useless texts generated by LLMs. In this post, I would like to share some of my findings from a collection of about 2K texts created by ChatGPT. We’ll consider certain lexicographical features that become evident when you compare AI texts with human ones.

You can view the ChatGPT collection in the semascope viewer, which shows graphically how words relate to each other: here.

Collecting data

Discovering a website with a large amount of generated content turned out to be an easy task. It was as simple as typing a search query, «collection of AI-generated texts». And that is exactly how I found a web page titled «Smarhon: A Journey into Belarus’ Untouched Cultural Heritage». It screams, ‘ChatGPT wrote me’, and you will see why if I show you a sample:

As you wander through the charming streets of Smarhon, make sure to marvel at its architectural treasures. The Church of St. Michael the Archangel, built in the 18th century, stands as a symbol of the town’s religious heritage. Its stunning frescoes and intricate wood carvings are a sight to behold. Another architectural gem is the Smarhon Castle, which dates back to the 17th century. Once a powerful fortress, it now houses a museum that offers a glimpse into the town’s past. Explore its exhibition halls, which display artifacts highlighting Smarhon’s historical significance. Immersing in Nature’s Beauty…

There are many thousands of posts like this; they literally go on without end. I found myself looking at the sitemap.xml of the site. Reading the URLs clearly suggests that the posts are iterations of machine-written and rewritten text, over and over again. A kind of fake promo info about a travel destination.

Once the sitemap.xml file is fetched, we can extract the list of URLs and pass it over to wget:

curl -s https://THEWEBSITE/wp-sitemap-posts.xml | grep -oP '<loc>\K[^<]*' > urls.txt
wget -i urls.txt -P ./local-directory --reject 'jpg,jpeg,png,gif,css,js'

The --reject option excludes certain file types (images, CSS, and JS files), since we only need the text. In the grep pattern, \K drops the <loc> prefix from each match, leaving just the bare URL.

Preparing text files

In a few hours, not without impatience, I got a collection of about 2000 posts from the «Smarhon» website. I went on to parse the HTML into plain text, as the layout conveniently kept the actual posts (without the navigation elements) inside <p> tags.

Here is the Python script that was used for parsing:

import os
from bs4 import BeautifulSoup

def extract_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    return '\n'.join(paragraph.get_text(separator='\n') for paragraph in paragraphs)

def process_html_files(directory):
    file_counter = 111  # Starting number for output files
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            continue  # Skip .txt outputs left over from a previous run
        input_path = os.path.join(directory, filename)
        output_path = os.path.join(directory, f'{file_counter}.txt')

        with open(input_path, 'r', encoding='utf-8') as file:
            html_content = file.read()

        extracted_text = extract_text_from_html(html_content)

        with open(output_path, 'w', encoding='utf-8') as output_file:
            output_file.write(extracted_text)

        print(f"Processed: {input_path} -> {output_path}")

        file_counter += 1

if __name__ == "__main__":
    # Set HTML files directory here:
    html_directory = '/local-directory'
    process_html_files(html_directory)

Now we have the posts as clean txt files, although some additional processing is required to get rid of noise. An inspection of the parsed data showed that some pages were in German, while most were in English. What should we do? At first I was stuck on this problem, because there was no way I was going to check every file manually.

The good news is that we can use Python to check the language of each text, and you can do it automatically! First, install the needed module, called langdetect:

pip install langdetect

Well done. And here is the script that does the language detection; the comments should give you a clue of what’s going on.

import os
from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises an exception on empty or undetectable input
        return "Unknown"

def process_txt_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            input_path = os.path.join(directory, filename)

            # Read the content of the file
            with open(input_path, 'r', encoding='utf-8') as file:
                content = file.read()

            # Detect the language of the content
            language = detect_language(content)

            if language == "en":
                # If the detected language is English, rename the file
                new_filename = f'EN-{filename}'
                output_path = os.path.join(directory, new_filename)
                os.rename(input_path, output_path)

                print(f"Renamed: {input_path} -> {output_path}")
            else:
                print(f"Ignored: {input_path} (Language: {language})")

if __name__ == "__main__":
    txt_directory = '/local-directory'
    process_txt_files(txt_directory)

Please do not call the script langdetect.py, as such a name will shadow the module and break the import. I called the script language.py and ran it by doing python3 language.py from the Linux console. It all worked surprisingly fast: the script properly inspected the data, and after a brief checkup I moved the files starting with EN- to a separate folder:

mkdir -p ./local-directory/EN
cp ./local-directory/EN-* ./local-directory/EN/

The preparation part is almost over, so read on.

Building frequency list for AI-generated corpus

Now we have a small corpus of AI-generated texts, taken from what seems to be an SEO spam website with many thousands of ChatGPT junk posts. You can download it here.

And now, finally, we can start our data mining, namely a corpus-linguistic analysis. Our research questions are as follows:

  1. Which words are underrepresented in the ChatGPT corpus?
  2. What lexis is overrepresented in AI-generated texts?
  3. Can we distinguish between actual human texts and ChatGPT output? What is the difference?

There are many such questions, and they can only be answered with sufficient data, so let’s try playing around with our LLM spam corpus of 1.3 mln words.

A few notes on the text file we are going to analyze. Each line contains a separate web article, so there are as many posts as there are lines in the file: exactly N=1922 shitposts. No two texts repeat each other verbatim.

So, the big text file with all the GPT posts is ready. Let’s see how many tokens it contains:

$ wc -w ./chatgpt-corpus/chatgpt-replies-small-corpus.txt 
1325123 ./chatgpt-corpus/chatgpt-replies-small-corpus.txt

Python script to create word frequency lists

A wordlist is a frequency list: words are listed with the most frequent first, descending to the least frequent. But such data alone is often useless. Instead, we need to calculate the probability of encountering a word in our little corpus, a collection of SEO spam documents. For this purpose we take the number of tokens, i.e. individual words in the text, and calculate the relative frequency of each word: the ratio of the word’s count to the total number of words.
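
To make the arithmetic concrete, here is a minimal sketch with numbers we will meet again below (‘breathtaking’ occurs 910 times in our corpus, and the regex tokenizer used in the script further down counts roughly 1.38 mln tokens, slightly more than wc -w because the two tools split words differently):

count = 910                # occurrences of 'breathtaking' in the ChatGPT corpus
total_tokens = 1_380_000   # approximate corpus size under the regex tokenizer below
relative_frequency = count / total_tokens  # probability of meeting the word, ~0.00066
per_billion = relative_frequency * 1e9     # ~660,000 occurrences per billion tokens
print(relative_frequency, per_billion)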

There are a number of word rank lists, or word frequency vocabularies, for different languages, and you can get this data from the web; Wikipedia is a good starting point. To compare how probable we are to find a word in AI-generated texts versus natural texts, I chose the list built from Project Gutenberg books. It dates back to 2006, way before AI-generated texts flooded the web, so we can safely assume that the texts are «pure human». This is what Wikipedia writes about the list: «These lists are the most frequent words, when performing a simple, straight (obvious) frequency count of all the books found on Project Gutenberg».

The PG list that we are going to use contains the top 40,000 English words as seen in all Project Gutenberg books in April 2006, good enough for our purpose. The PG list gives word frequency in a ‘per billion’ format, fractions are limited to two decimal places, and the words are case insensitive.
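
To follow along, you can save the Wikipedia list as a simple two-column CSV, word and frequency per billion; that is the PGrank.csv layout I assume in the merge step below (the file name and layout are my convention, not a fixed standard). A minimal loader sketch:

import csv

# Load the PG list into a dict, assuming a plain two-column CSV:
# word,frequency-per-billion (the PGrank.csv layout used in the merge below)
pg_frequencies = {}
with open('PGrank.csv', encoding='utf-8') as f:
    for row in csv.reader(f):
        if len(row) == 2:
            pg_frequencies[row[0].lower()] = float(row[1])

print(pg_frequencies.get('tapestry'))  # 4099.65 per billion in the 2006 PG list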

Let us now extract data from our ChatGPT corpus, using the same format as in the Project Gutenberg list:

import re
import csv
from collections import Counter

def calculate_word_frequency(file_path, output_path):
    # Read the content of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Tokenize the text using a simple regex
    tokens = re.findall(r'\b\w+\b', text.lower())

    # Calculate word frequencies
    word_frequencies = Counter(tokens)

    # Total number of tokens
    total_tokens = len(tokens)

    # Calculate frequency per actual tokens and induced frequency per billion tokens
    result = []
    for word, frequency in word_frequencies.items():
        frequency_per_token = frequency / total_tokens
        frequency_per_billion_tokens = frequency_per_token * 1e9
        result.append((word, frequency, frequency_per_token, frequency_per_billion_tokens))

    # Sort the result by frequency in descending order
    result.sort(key=lambda x: x[1], reverse=True)

    # Write the result to a CSV file
    with open(output_path, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Word', 'Frequency', 'Frequency per Token', 'Frequency per Billion Tokens']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        # Write the header
        writer.writeheader()

        # Write each row
        for word, frequency, freq_per_token, freq_per_billion in result:
            writer.writerow({
                'Word': word,
                'Frequency': frequency,
                'Frequency per Token': freq_per_token,
                'Frequency per Billion Tokens': freq_per_billion
            })

if __name__ == "__main__":
    # Example usage: replace the input and output paths with your own
    calculate_word_frequency('/local-folder/EN/chat-gpt-EN.txt', 'gpt-results.csv')

The resulting file, gpt-results.csv, contains a ChatGPT word frequency list based on about 2000 texts harvested from the web, even if they all come from one particular website. This is also the way to expand the ChatGPT corpus by adding new AI-generated texts, should anyone want to do that.

At the next stage, we need to merge the PG word frequency list with our vocabulary file, gpt-results.csv.

You can do that by executing the following UNIX one-liner, so wait no more and go back to your command line prompt:

$ awk -F',' 'NR==FNR{a[$1]=$0; next} $1 in a {print a[$1] "," $0}' <(sort PGrank.csv) <(sort gpt-results.csv) > CompareRanksGPTvsPG.csv

On a side note, awk -F',' means ‘use comma as a field separator’. The idea is to sort the two lists we are going to merge, keep only the words that exist in both, and carry over the corresponding word frequency statistics from the two original files.
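
If you would rather stay in Python than shell, here is a rough equivalent of that join: it keeps only the words present in both files and concatenates their columns (same file names as above).

import csv

# Inner-join the two CSVs on the word column, mimicking the awk one-liner
with open('PGrank.csv', encoding='utf-8') as f:
    pg_rows = {row[0]: row for row in csv.reader(f) if row}

with open('gpt-results.csv', encoding='utf-8') as f, \
     open('CompareRanksGPTvsPG.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row and row[0] in pg_rows:
            writer.writerow(pg_rows[row[0]] + row)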

The datafile that I got (and you can download it later) looks like this:

Word | Frequency per Billion 1 | Frequency | Frequency per Token | Frequency per Billion 2 | Ratio
www | 39.5567 | 4033 | 0.00292420133702635 | 2924201.33702635 | 1.35273517247706E-05
breathtaking | 18.196 | 910 | 0.000659812352267289 | 659812.352267289 | 2.75775376703297E-05
belarusian | 41.9301 | 1548 | 0.00112240606737337 | 1122406.06737337 | 3.73573354767441E-05
iconic | 12.6581 | 301 | 0.000218245624211488 | 218245.624211488 | 5.79993300930232E-05
belarus | 211.232 | 4716 | 0.0034194231354863 | 3419423.1354863 | 6.1774162374894E-05
cinematic | 7.91134 | 171 | 0.000123986716744732 | 123986.716744732 | 6.38079643345031E-05
effortlessly | 40.3478 | 625 | 0.000453167824359402 | 453167.824359402 | 0.0000890350060864
innovative | 87.8159 | 1311 | 0.000950564828376282 | 950564.828376282 | 9.23828626712433E-05
customize | 13.4492 | 167 | 0.000121086442668832 | 121086.442668832 | 0.000111071063808383
gamer | 26.1074 | 298 | 0.000216070418654563 | 216070.418654563 | 0.000120828201114094
upcoming | 54.5882 | 533 | 0.000386461520613698 | 386461.520613698 | 0.000141251320217636
showcase | 101.265 | 863 | 0.000625734131875462 | 625734.131875462 | 0.00016183390811124
unleash | 71.9932 | 520 | 0.000377035629867022 | 377035.629867022 | 0.000190945349184616
maximize | 57.7528 | 417 | 0.000302353572412593 | 302353.572412593 | 0.000191010807443645
blockbuster | 8.70248 | 61 | 4.42291796574776E-05 | 44229.1796574776 | 0.000196758792891803
options | 329.903 | 2039 | 0.00147841471019011 | 1478414.71019011 | 0.000223146453918588
hassle | 37.1833 | 223 | 0.000161690279731435 | 161690.279731435 | 0.000229966204905829
immerse | 346.516 | 2004 | 0.00145303731202599 | 1453037.31202599 | 0.000238477014411177
powered | 116.296 | 665 | 0.000482170565118404 | 482170.565118404 | 0.00024119265756391
optimize | 29.2719 | 152 | 0.000110210414884207 | 110210.414884207 | 0.000265600125276315
informative | 117.087 | 596 | 0.000432140837309126 | 432140.837309126 | 0.000270946390369127

… and so on. There are 4604 words in it, each attested at least once in the ChatGPT corpus. That is slightly over 10 per cent of the top 40,000 common English words from Project Gutenberg books!

Frequency per Billion 1 is the Project Gutenberg probability of finding the word, and Frequency per Billion 2 is the same for our 2K-post ChatGPT corpus. Ratio is Frequency per Billion 1 divided by Frequency per Billion 2. I calculated it as a measure of how many times a word is overused in the ChatGPT word list compared to the Project Gutenberg word frequency: the smaller the ratio, the heavier ChatGPT leans on the word. The table is sorted by this ratio, so you can immediately recognize some of ChatGPT’s favourites, including the words ‘breathtaking’, ‘effortlessly’, ‘innovative’, ‘showcase’, ‘immerse’. So far, very ‘informative’ and ‘no hassle’!
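
If you want to reproduce the Ratio column and the sort programmatically, here is a sketch. It assumes the 6-column layout of the merged file above (PG per-billion in the second column, ChatGPT per-billion in the sixth) and writes to an output file name of my own choosing:

import csv

# Append Ratio = PG per-billion / ChatGPT per-billion and sort ascending,
# so the words most overused by ChatGPT come first
with open('CompareRanksGPTvsPG.csv', encoding='utf-8') as f:
    rows = [row for row in csv.reader(f) if len(row) == 6]

data = []
for row in rows:
    try:
        data.append(row + [float(row[1]) / float(row[5])])
    except ValueError:
        pass  # skip a possible header row

data.sort(key=lambda r: r[-1])

with open('CompareRanksGPTvsPG-ratio.csv', 'w', newline='', encoding='utf-8') as out:
    csv.writer(out).writerows(data)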

Tapestry is seriously overused by ChatGPT

Attention, now we come to the big fun part!

How about the word ‘tapestry’? Is it really used that often in ChatGPT posts? It turns out that in our sample of about 2000 ChatGPT-generated texts the word ‘tapestry’ is used at a rate of 102,959 words per billion, whereas in the Project Gutenberg corpus the same word is 25 times less common, occurring at 4099.65 words per billion.

Here are some more examples of English words overused by ChatGPT:

Word | Frequency per Billion 1 | Frequency | Frequency per Token | Frequency per Billion 2 | Ratio
tapestry | 4099.65 | 142 | 0.000102959729694456 | 102959.729694456 | 0.0398179949788733
intricate | 5167.69 | 824 | 0.000597456459635436 | 597456.459635436 | 0.00864948385218446
vibrant | 1134.48 | 1972 | 0.00142983511941879 | 1429835.11941879 | 0.000793434141176468
extravaganza | 160.6 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 0.110748154
gem | 4187.47 | 2018 | 0.00146318827129164 | 1463188.27129164 | 0.00286188051268582
unleash | 71.9932 | 520 | 0.000377035629867022 | 377035.629867022 | 0.000190945349184616
unlock | 1503.15 | 859 | 0.000622833857799562 | 622833.857799562 | 0.002413404443539
streamline | 51.4237 | 23 | 1.6676575936426E-05 | 16676.575936426 | 0.00308358863330435
blend | 2933.52 | 1062 | 0.000770022767151496 | 770022.767151496 | 0.00380965359096045
colorful | 329.111 | 124 | 8.99084963529053E-05 | 89908.4963529054 | 0.00366051055629032
testament | 2316.44 | 730 | 0.000529300018851782 | 529300.018851782 | 0.00437642153315068

The situation gets even more dramatic with some notorious words from the ‘Reddit Lexicon of ChatGPT’. Here are a few examples from the data. The word ‘intricate’ was used 115x more often in the ChatGPT sample than in Project Gutenberg books, and ‘vibrant’ shows a shocking 1260x increase. ‘Extravaganza’ gets a 9x overuse. The award goes to ‘breathtaking’: in the sample data it is the top word overused by GPT, at a rate of 659,812 words per billion in ChatGPT texts, which is 36,261 times more than in texts written by us, humans. ‘Testament’ is 228 times more frequent in GPT texts, ‘landscape’ 10 times, and so on; see the full data in the repo files.

What you read in this post is implemented as a vocabulary-based ChatGPT detector. You can try the ChatGPT Detector here.

Words systematically underused by ChatGPT

Let us now ‘delve into the intricacies’ of ChatGPT lingo. Our sample of AI-written texts reveals an interesting list of underused words: words that are unnaturally rare in ChatGPT posts when compared to the corpus of Project Gutenberg books.

This table lists words that occur in ChatGPT posts with a probability at least 200x lower than the same words in the Project Gutenberg corpus:

Word | Frequency per Billion 1 | Frequency | Frequency per Token | Frequency per Billion 2 | Ratio
copyright | 145244 | 1 | 7.2506851897504E-07 | 725.068518975043 | 200.31761992
round | 291647 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 201.116854729999
certain | 296795 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 204.666864049999
low | 149690 | 1 | 7.2506851897504E-07 | 725.068518975043 | 206.4494542
six | 151612 | 1 | 7.2506851897504E-07 | 725.068518975043 | 209.10023816
cut | 152625 | 1 | 7.2506851897504E-07 | 725.068518975043 | 210.4973475
nearly | 154001 | 1 | 7.2506851897504E-07 | 725.068518975043 | 212.39509918
none | 155743 | 1 | 7.2506851897504E-07 | 725.068518975043 | 214.79763074
south | 158664 | 1 | 7.2506851897504E-07 | 725.068518975043 | 218.82621552
purpose | 162154 | 1 | 7.2506851897504E-07 | 725.068518975043 | 223.63955372
began | 325327 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 224.342245929999
turned | 337367 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 232.644909529999
continued | 169086 | 1 | 7.2506851897504E-07 | 725.068518975043 | 233.20002948
door | 342388 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 236.107340919999
god | 552668 | 3 | 2.17520555692513E-06 | 2175.20555692513 | 254.076217413333
enough | 382266 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 263.606810939999
course | 385303 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 265.701095769999
could | 1571110 | 8 | 5.80054815180035E-06 | 5800.54815180035 | 270.855436225
french | 199969 | 1 | 7.2506851897504E-07 | 725.068518975043 | 275.79324542
mean | 211299 | 1 | 7.2506851897504E-07 | 725.068518975043 | 291.41935482
really | 211722 | 1 | 7.2506851897504E-07 | 725.068518975043 | 292.00274796
earth | 222546 | 1 | 7.2506851897504E-07 | 725.068518975043 | 306.93099228
reason | 229940 | 1 | 7.2506851897504E-07 | 725.068518975043 | 317.1286492
because | 465587 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 321.064139329999
hour | 237964 | 1 | 7.2506851897504E-07 | 725.068518975043 | 328.19518952
fact | 263613 | 1 | 7.2506851897504E-07 | 725.068518975043 | 363.56977734
received | 264606 | 1 | 7.2506851897504E-07 | 725.068518975043 | 364.93930308
person | 267878 | 1 | 7.2506851897504E-07 | 725.068518975043 | 369.45198004
children | 275607 | 1 | 7.2506851897504E-07 | 725.068518975043 | 380.11166226
till | 304735 | 1 | 7.2506851897504E-07 | 725.068518975043 | 420.2844173
death | 309653 | 1 | 7.2506851897504E-07 | 725.068518975043 | 427.06722454
man | 1573117 | 5 | 3.62534259487522E-06 | 3625.34259487522 | 433.922300812
morning | 330567 | 1 | 7.2506851897504E-07 | 725.068518975043 | 455.91139506
had | 6139336 | 17 | 1.23261648225757E-05 | 12326.1648225757 | 498.073495557648
did | 1185720 | 3 | 2.17520555692513E-06 | 2175.20555692513 | 545.1071032
knew | 413101 | 1 | 7.2506851897504E-07 | 725.068518975043 | 569.74063718
very | 1462382 | 3 | 2.17520555692513E-06 | 2175.20555692513 | 672.296002253333
my | 3277699 | 4 | 2.90027407590017E-06 | 2900.27407590017 | 1130.134226705
i | 11764797 | 11 | 7.97575370872548E-06 | 7975.75370872547 | 1475.07024786
his | 8799755 | 8 | 5.80054815180035E-06 | 5800.54815180035 | 1517.0557626125
her | 5202501 | 2 | 1.45013703795009E-06 | 1450.13703795009 | 3587.59266458999
said | 2637136 | 1 | 7.2506851897504E-07 | 725.068518975043 | 3637.08522848
he | 8397205 | 1 | 7.2506851897504E-07 | 725.068518975043 | 11581.2571919

The pronoun ‘he’ is over 11,500 times more frequent in natural texts than in ChatGPT replies, ‘her’ about 3600 times, ‘reason’ 317 times, ‘god’ 254 times, ‘purpose’ 223 times. So do not expect to see these words often in typical ChatGPT replies.

The word ‘woman’ was not seen in the ChatGPT sample at all. Surprisingly, the pronoun ‘she’, one of the most common words in the English language, was not attested either. You can check this by downloading the data used in this post; to do that, clone chatgpt_corpus.

Are word frequencies in ChatGPT texts and in natural texts correlated?

An important question about the quantitative nature of AI-generated texts is whether the rate of vocabulary use is at all related to what we find in natural language. The short preliminary answer is yes, but the correlation is not very strong. First, a few words on the quality of the Project Gutenberg word frequency list. It appears that the PG list from Wikipedia can be trusted, because if you compare word frequencies (per billion) in the PG list and in a comparatively large human text, the values are strongly correlated. Using the same script as in «Python script to create word frequency lists» above, I gathered statistics from Tolstoy’s War and Peace and compared the Frequency per Billion 1 column with Frequency per Billion 2. The first comes from the PG list, while the second was induced from War and Peace. Here is the result: Pearson’s r is very strong, 0.96, with a highly significant p-value < 0.00001.
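
For anyone reproducing the numbers, here is a minimal sketch with scipy, run against the per-billion columns of the merged file (column positions as in the Ratio sketch above; the same call works for the War and Peace comparison):

import csv
from scipy.stats import pearsonr

# Correlate PG per-billion frequencies with corpus per-billion frequencies
pg, corpus = [], []
with open('CompareRanksGPTvsPG.csv', encoding='utf-8') as f:
    for row in csv.reader(f):
        if len(row) != 6:
            continue
        try:
            a, b = float(row[1]), float(row[5])
        except ValueError:
            continue  # skip a possible header row
        pg.append(a)
        corpus.append(b)

r, p = pearsonr(pg, corpus)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")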

Now, the ChatGPT vocabulary compared with the PG list: Pearson’s correlation was r=0.78 with p < 0.00001. Seems strong. But if you look at the scatter plot below, you can see a handful of outliers, namely the most common English words: ‘the’, ‘and’, ‘to’, ‘of’. They inflate the statistic. These words occur at more or less the same rate in both lists (remember, we are comparing ChatGPT words against the Project Gutenberg list). For example, ‘the’ shows up at 56,271,872 words per billion in PG and 40,964,921 in the ChatGPT sample; that is at least the same order of magnitude. ‘And’ plus a few other function words behave the same way in the AI-generated sample of 1.325 mln tokens that we gathered on the web.

[Scatter plot: the ChatGPT sample compared with the Project Gutenberg frequency list]

[Scatter plot: War and Peace compared with the Project Gutenberg frequency list]

Finally, let’s look at some descriptive statistics for the word lists. In War and Peace, when compared against the PG list, expected vs observed word frequency (the ratio we were talking about earlier) has a mean of 1.82 and a median of about 0.97. This means that half of the words in War and Peace that are also among the 40,000 most frequent English words occur with roughly the same probability as in the PG list. The same cannot be said of the GPT corpus: in the AI-generated sample, the mean is 17.25 and the median is 1.64. The descriptive data suggests that AI-generated texts are fundamentally different from human writings.
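
These means and medians are plain descriptive statistics over the Ratio column; a quick sketch, assuming the ratio file produced in the earlier sketch:

import csv
from statistics import mean, median

# Mean and median of the expected-vs-observed frequency ratio (last column)
with open('CompareRanksGPTvsPG-ratio.csv', encoding='utf-8') as f:
    ratios = [float(row[-1]) for row in csv.reader(f) if row]

print(f"mean = {mean(ratios):.2f}, median = {median(ratios):.2f}")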

You should now be wondering what exactly the difference is between ChatGPT texts and natural data. We’ll continue investigating this question in the next post. I will show that the difference has to do with the distribution of word frequencies: a preliminary analysis of the textual data indicates that ChatGPT avoids the word aggregation, or ‘burstiness’, that is so common in natural language. So, please, read about it in the next post.

Oh, and since you’ve finally got here: what does ChatGPT have to say about the «intricate tapestry» phenomenon? I presented the results of our research to ChatGPT to see its replies and explanations, and what I learned from a few prompts is in the next post!

Text: Alexandre Sotov
Comments or Questions? Contact me on LinkedIn
