The Intricate Tapestry of ChatGPT Texts: Why do LLMs overuse some words at the expense of others?

In this Post

Ever wondered why ChatGPT texts are heavy on certain words? This article looks at word frequencies in fake posts. We'll review some examples and introduce a vocabulary-based ChatGPT detector.

Intro

Try free online ChatGPT detector

Try free online Keyword Extractor and Text Analyzer

If you’ve been wondering why ChatGPT texts are heavy on certain words, you are surely not alone. There is a recent thread on Reddit dedicated to OpenAI’s «intricate tapestry» phenomenon, as these are among the words you often see across its outputs. An anecdotal vocabulary of ChatGPT’s favourites also includes «intricacy», «vibrant», «breathtaking», «innovative», and so on. ChatGPT will write about «catering» to the needs of clients, making something «seamless», and suggesting «a no hassle solution»… As we shall see below, there is also a remarkable place for the «t-word» in ChatGPT’s replies. «Tapestry», yes, we are getting there.

In the vibrant, dynamic, multifaceted, kaleidoscopic and multidimensional world of AI, one linguistic generator program stood out from the rest: ChatGPT. A testament to its algorithm, the program sought to weave intricate threads of information to create a rich tapestry of knowledge. (Reddit user)

You can download the data used in this post by cloning chatgpt_corpus.

In the world of AI-generated texts there is a big issue: spam that originates from ChatGPT and other LLMs. Generally speaking, this is the problem of automatically detecting potentially useless texts generated by LLMs. In this post, I would like to share some of my findings from a collection of about 2K texts created by ChatGPT. We’ll consider certain lexical features that become evident when you compare AI texts with human ones.

You can view the ChatGPT collection in the semascope viewer, which shows graphically how words relate to each other: here.

Collecting data

Discovering a website that contained a large amount of generated content turned out to be an easy task. It was as simple as typing a search query, «collection of AI-generated texts». And that is exactly how I found a web page titled «Smarhon: A Journey into Belarus’ Untouched Cultural Heritage». It screams, ‘ChatGPT wrote me’, and you will appreciate it if I show you a sample:

As you wander through the charming streets of Smarhon, make sure to marvel at its architectural treasures. The Church of St. Michael the Archangel, built in the 18th century, stands as a symbol of the town’s religious heritage. Its stunning frescoes and intricate wood carvings are a sight to behold. Another architectural gem is the Smarhon Castle, which dates back to the 17th century. Once a powerful fortress, it now houses a museum that offers a glimpse into the town’s past. Explore its exhibition halls, which display artifacts highlighting Smarhon’s historical significance. Immersing in Nature’s Beauty…

There are many thousands of posts like this; they literally go on without end. I found myself looking at the site’s sitemap.xml. Reading through the URLs clearly suggests that the posts are iterations of machine-written and rewritten text, over and over again: a kind of fake promo info about a travel destination.

We can extract a list of URLs from sitemap.xml and pass it over to wget:

curl -s https://THEWEBSITE/wp-sitemap-posts.xml | grep -oP '<loc>\K[^<]*' > urls.txt
wget -i urls.txt -P ./local-directory --reject 'jpg,jpeg,png,gif,css,js'

The --reject option is used to exclude certain file types (images, CSS, and JS files), as we only need the texts.

Preparing text files

In a few hours, not without impatience, I got a collection of about 2000 posts from the «Smarhon» website. I went on to parse the HTML into plain text, as the layout contained the actual posts (without the navigation elements) inside <p> tags.

Here is the Python script that was used for parsing:

import os
from bs4 import BeautifulSoup

def extract_text_from_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    paragraphs = soup.find_all('p')
    return '\n'.join(paragraph.get_text(separator='\n') for paragraph in paragraphs)

def process_html_files(directory):
    file_counter = 111  # Starting number for output files
    for filename in os.listdir(directory):
        # Skip .txt files so that re-runs don't parse our own output
        if filename.endswith('.txt'):
            continue
        input_path = os.path.join(directory, filename)
        output_path = os.path.join(directory, f'{file_counter}.txt')

        with open(input_path, 'r', encoding='utf-8') as file:
            html_content = file.read()

        extracted_text = extract_text_from_html(html_content)

        with open(output_path, 'w', encoding='utf-8') as output_file:
            output_file.write(extracted_text)

        print(f"Processed: {input_path} -> {output_path}")

        file_counter += 1

if __name__ == "__main__":
    # Set HTML files directory here:
    html_directory = '/local-directory'
    process_html_files(html_directory)

Now we have the posts as clean txt files, although some additional parsing is required to get rid of noise. An inspection of the parsed data showed that some pages were in German, while most were in English. What should we do? At first I got stuck with this problem, because there was no way I was going to check every file manually.

The good thing is that we can use Python to check the language of each text, and you can do it automatically! First, install the needed module, called langdetect:

pip install langdetect

Well done. And here is the script to do language detection; the comments should give you a clue of what’s going on.

import os
from langdetect import detect, LangDetectException

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised for empty or otherwise undetectable input
        return "Unknown"

def process_txt_files(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            input_path = os.path.join(directory, filename)

            # Read the content of the file
            with open(input_path, 'r', encoding='utf-8') as file:
                content = file.read()

            # Detect the language of the content
            language = detect_language(content)

            if language == "en":
                # If the detected language is English, rename the file
                new_filename = f'EN-{filename}'
                output_path = os.path.join(directory, new_filename)
                os.rename(input_path, output_path)

                print(f"Renamed: {input_path} -> {output_path}")
            else:
                print(f"Ignored: {input_path} (Language: {language})")

if __name__ == "__main__":
    txt_directory = '/local-directory'
    process_txt_files(txt_directory)
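
One caveat worth knowing: langdetect is non-deterministic by default, so very short or ambiguous texts can receive different labels on different runs. If you need reproducible results, pin the library’s internal seed before calling detect():

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # makes detection deterministic across runs
print(detect("This is clearly an English sentence."))  # prints 'en'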

Please do not call the script langdetect.py, as such a name will shadow the module and break the import. I called the script detect_lang.py and ran it by doing python3 detect_lang.py from the Linux console. All worked surprisingly fast, the script properly inspected the data, and upon a brief checkup I copied the files starting with EN- to a separate folder:

cp ./local-directory/EN* ./local-directory/EN/

The preparation part is almost over, so read on.

Building a frequency list for the AI-generated corpus

Now we have a small corpus of AI-generated texts taken from what seems to be an SEO spam website with many thousands of ChatGPT junk posts. You can download it here.

And now, finally, we can start with our data mining, namely a corpus-linguistic analysis. Our research questions are as follows:

  1. Which words are underrepresented in the ChatGPT corpus?
  2. What lexis is overrepresented in AI-generated texts?
  3. Can we distinguish actual human texts from ChatGPT output? What is the difference?

Such questions are many, and they can only be answered with sufficient data, so let’s try playing around with our LLM spam corpus of about 1.3 mln words.

A few notes on the text file we are going to analyze. Each line contains a separate web article, so there are as many posts as there are lines in the file: exactly N=1922 shitposts. No two texts are literal copies of each other.

So, the big text file with all the GPT posts is prepared; let’s see how many tokens are in there:

$ wc -w ./chatgpt-corpus/chatgpt-replies-small-corpus.txt 
1325123 ./chatgpt-corpus/chatgpt-replies-small-corpus.txt

Python script to create word frequency lists

A wordlist is a frequency list: words are listed with the most frequent first, descending to the least frequent. But such data alone is often useless. Instead, we need to estimate the probability of encountering each word in our little corpus, a collection of SEO spam documents. For this purpose we take the number of tokens, i.e. individual word occurrences in the text, and calculate the relative frequency of each word: the ratio of the word’s count to the total number of words.
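
To make the arithmetic concrete, here is a back-of-the-envelope example using numbers that appear in the tables below. (The regex tokenizer in the script below counts about 1.38 mln tokens, a bit more than wc -w reports, presumably because \w+ splits hyphenated and apostrophized strings into several tokens.)

count = 910                # occurrences of 'breathtaking' in the corpus
total_tokens = 1_379_179   # tokens found by the regex tokenizer

relative_frequency = count / total_tokens   # ≈ 0.00066
per_billion = relative_frequency * 1e9      # ≈ 659,812 per billion tokens
print(per_billion)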

There are a number of word rank lists, or word frequency vocabularies, for different languages, and you can get this data from the web; Wikipedia is a good starting point. To compare how probable we are to find a word in AI-generated texts versus natural texts, I chose the list derived from Project Gutenberg books. It dates back to 2006, way before AI-generated texts flooded the web, so we can safely assume that the texts are «purely human». This is what Wikipedia writes about the list: «These lists are the most frequent words, when performing a simple, straight (obvious) frequency count of all the books found on Project Gutenberg».

The PG list that we are going to use contains the top 40,000 English words as seen in all Project Gutenberg books in April 2006, good enough for our purpose. The list gives word frequencies ‘per billion’ tokens, with fractions limited to two decimal places, and the words are case-insensitive.

Let us now extract data from our ChatGPT corpus, using the same format as in the Project Gutenberg list:

import re
import csv
from collections import Counter

def calculate_word_frequency(file_path, output_path):
    # Read the content of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Tokenize the text using a simple regex
    tokens = re.findall(r'\b\w+\b', text.lower())

    # Calculate word frequencies
    word_frequencies = Counter(tokens)

    # Total number of tokens
    total_tokens = len(tokens)

    # Calculate frequency per actual tokens and induced frequency per billion tokens
    result = []
    for word, frequency in word_frequencies.items():
        frequency_per_token = frequency / total_tokens
        frequency_per_billion_tokens = frequency_per_token * 1e9
        result.append((word, frequency, frequency_per_token, frequency_per_billion_tokens))

    # Sort the result by frequency in descending order
    result.sort(key=lambda x: x[1], reverse=True)

    # Write the result to a CSV file
    with open(output_path, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Word', 'Frequency', 'Frequency per Token', 'Frequency per Billion Tokens']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        # Write the header
        writer.writeheader()

        # Write each row
        for word, frequency, freq_per_token, freq_per_billion in result:
            writer.writerow({
                'Word': word,
                'Frequency': frequency,
                'Frequency per Token': freq_per_token,
                'Frequency per Billion Tokens': freq_per_billion
            })

# Example usage: replace the file paths with your actual input and output
calculate_word_frequency('/local-folder/EN/chat-gpt-EN.txt', 'gpt-results.csv')

The resulting file, gpt-results.csv, contains a ChatGPT word frequency list based on about 2000 texts harvested on the web, albeit from just one particular website. This is also the way to expand the ChatGPT corpus by adding new AI-generated texts, should anyone want to do that.

At the next stage, we need to merge the PG word frequency list with our vocabulary file, gpt-results.csv.

You can do that by executing the following UNIX one-liner, so wait no more and go back to your command line prompt:

$ awk -F',' 'NR==FNR{a[$1]=$0; next} $1 in a {print a[$1] "," $0}' <(sort PGrank.csv) <(sort gpt-results.csv) > CompareRanksGPTvsPG.csv

On a side note, awk -F',' means ‘use comma as a field separator’. The idea is to first sort the two lists we are going to merge, then keep only the words that exist in both lists, together with the corresponding word frequency statistics from the original two files.
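
If you prefer to stay in Python, the same inner join can be sketched with pandas. This is an illustration, not the exact command used; it assumes both CSVs have a ‘Word’ column (if PGrank.csv has no header, load it with names=[...] instead):

import pandas as pd

pg = pd.read_csv('PGrank.csv')
gpt = pd.read_csv('gpt-results.csv')

# An inner join keeps only the words attested in both lists,
# mirroring the awk one-liner above
merged = pg.merge(gpt, on='Word')
merged.to_csv('CompareRanksGPTvsPG.csv', index=False)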

The datafile that I got (and you can download it later) looks like this:

Word | PG per billion | GPT count | GPT per token | GPT per billion | Ratio (PG/GPT)
www | 39.56 | 4033 | 2.92e-03 | 2924201 | 1.35e-05
breathtaking | 18.20 | 910 | 6.60e-04 | 659812 | 2.76e-05
belarusian | 41.93 | 1548 | 1.12e-03 | 1122406 | 3.74e-05
iconic | 12.66 | 301 | 2.18e-04 | 218246 | 5.80e-05
belarus | 211.23 | 4716 | 3.42e-03 | 3419423 | 6.18e-05
cinematic | 7.91 | 171 | 1.24e-04 | 123987 | 6.38e-05
effortlessly | 40.35 | 625 | 4.53e-04 | 453168 | 8.90e-05
innovative | 87.82 | 1311 | 9.51e-04 | 950565 | 9.24e-05
customize | 13.45 | 167 | 1.21e-04 | 121086 | 1.11e-04
gamer | 26.11 | 298 | 2.16e-04 | 216070 | 1.21e-04
upcoming | 54.59 | 533 | 3.86e-04 | 386462 | 1.41e-04
showcase | 101.27 | 863 | 6.26e-04 | 625734 | 1.62e-04
unleash | 71.99 | 520 | 3.77e-04 | 377036 | 1.91e-04
maximize | 57.75 | 417 | 3.02e-04 | 302354 | 1.91e-04
blockbuster | 8.70 | 61 | 4.42e-05 | 44229 | 1.97e-04
options | 329.90 | 2039 | 1.48e-03 | 1478415 | 2.23e-04
hassle | 37.18 | 223 | 1.62e-04 | 161690 | 2.30e-04
immerse | 346.52 | 2004 | 1.45e-03 | 1453037 | 2.38e-04
powered | 116.30 | 665 | 4.82e-04 | 482171 | 2.41e-04
optimize | 29.27 | 152 | 1.10e-04 | 110210 | 2.66e-04
informative | 117.09 | 596 | 4.32e-04 | 432141 | 2.71e-04

… and so on. There are 4604 words in it, each attested at least once in the ChatGPT corpus. That is slightly over 10 per cent of the top 40,000 common English words from Project Gutenberg books!

PG per billion is the Project Gutenberg probability of finding the word, and GPT per billion is the same for our 2K-post ChatGPT corpus. Ratio is the PG rate divided by the GPT rate: a measure of how many times ChatGPT overuses a word compared to Project Gutenberg, so the smaller the ratio, the heavier the overuse. The table is sorted by this ratio, so you can immediately recognize some of ChatGPT’s favourites, including the words ‘breathtaking’, ‘effortlessly’, ‘innovative’, ‘showcase’, ‘immerse’. So far, very ‘informative’ and ‘no hassle’!

Tapestry is seriously overused by ChatGPT

Attention, now we come to the big fun part!

How about the word ‘tapestry’? Is it really used that often in ChatGPT posts? It turns out that in our sample of about 2000 texts generated by ChatGPT, the word ‘tapestry’ occurs at a rate of 102,959 words per billion, whereas in the Project Gutenberg corpus the same word is 25 times less common (4,099.65 words per billion).
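
That 25x figure is simply the ratio of the two per-billion rates; a quick sanity check:

gpt_rate = 102_959.7   # 'tapestry', per billion tokens, ChatGPT corpus
pg_rate = 4_099.65     # 'tapestry', per billion tokens, Project Gutenberg

print(gpt_rate / pg_rate)   # ≈ 25.1, a roughly 25-fold overuse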

Here are some more examples of English words overused by ChatGPT:

Word | PG per billion | GPT count | GPT per token | GPT per billion | Ratio (PG/GPT)
tapestry | 4099.65 | 142 | 1.03e-04 | 102960 | 3.98e-02
intricate | 5167.69 | 824 | 5.97e-04 | 597456 | 8.65e-03
vibrant | 1134.48 | 1972 | 1.43e-03 | 1429835 | 7.93e-04
extravaganza | 160.60 | 2 | 1.45e-06 | 1450 | 1.11e-01
gem | 4187.47 | 2018 | 1.46e-03 | 1463188 | 2.86e-03
unleash | 71.99 | 520 | 3.77e-04 | 377036 | 1.91e-04
unlock | 1503.15 | 859 | 6.23e-04 | 622834 | 2.41e-03
streamline | 51.42 | 23 | 1.67e-05 | 16677 | 3.08e-03
blend | 2933.52 | 1062 | 7.70e-04 | 770023 | 3.81e-03
colorful | 329.11 | 124 | 8.99e-05 | 89908 | 3.66e-03
testament | 2316.44 | 730 | 5.29e-04 | 529300 | 4.38e-03

The situation gets even more dramatic with some notorious words from the ‘Reddit Lexicon of ChatGPT’. Here are a few examples from the data. The word ‘intricate’ was used 115x more often in the ChatGPT sample than in Project Gutenberg books; ‘vibrant’ shows a shocking 1260x increase; ‘extravaganza’, a 9x overuse. The award goes to ‘breathtaking’: in the sample data it is the top word overused by GPT, with a rate of 659,812 words per billion in ChatGPT texts, that is 36,261 times more than in texts written by us, humans. ‘Testament’ is 228 times more frequent in GPT texts, ‘landscape’ 10 times, and so on; see the full data in the repo files.

What you read in this post is implemented as a vocabulary-based ChatGPT detector. You can try the ChatGPT Detector here.

Words systematically underused by ChatGPT

Let us further ‘delve into the intricacies’ of ChatGPT lingo. Our sample of AI-written texts reveals an interesting list of underused words. These are words that are unnaturally rare in ChatGPT posts when compared to the corpus of Project Gutenberg books.

This table lists words that occur in ChatGPT posts at least 200 times less often than the same words in the Project Gutenberg corpus:

Word | PG per billion | GPT count | GPT per token | GPT per billion | Ratio (PG/GPT)
copyright | 145244 | 1 | 7.25e-07 | 725 | 200.3
round | 291647 | 2 | 1.45e-06 | 1450 | 201.1
certain | 296795 | 2 | 1.45e-06 | 1450 | 204.7
low | 149690 | 1 | 7.25e-07 | 725 | 206.4
six | 151612 | 1 | 7.25e-07 | 725 | 209.1
cut | 152625 | 1 | 7.25e-07 | 725 | 210.5
nearly | 154001 | 1 | 7.25e-07 | 725 | 212.4
none | 155743 | 1 | 7.25e-07 | 725 | 214.8
south | 158664 | 1 | 7.25e-07 | 725 | 218.8
purpose | 162154 | 1 | 7.25e-07 | 725 | 223.6
began | 325327 | 2 | 1.45e-06 | 1450 | 224.3
turned | 337367 | 2 | 1.45e-06 | 1450 | 232.6
continued | 169086 | 1 | 7.25e-07 | 725 | 233.2
door | 342388 | 2 | 1.45e-06 | 1450 | 236.1
god | 552668 | 3 | 2.18e-06 | 2175 | 254.1
enough | 382266 | 2 | 1.45e-06 | 1450 | 263.6
course | 385303 | 2 | 1.45e-06 | 1450 | 265.7
could | 1571110 | 8 | 5.80e-06 | 5801 | 270.9
french | 199969 | 1 | 7.25e-07 | 725 | 275.8
mean | 211299 | 1 | 7.25e-07 | 725 | 291.4
really | 211722 | 1 | 7.25e-07 | 725 | 292.0
earth | 222546 | 1 | 7.25e-07 | 725 | 306.9
reason | 229940 | 1 | 7.25e-07 | 725 | 317.1
because | 465587 | 2 | 1.45e-06 | 1450 | 321.1
hour | 237964 | 1 | 7.25e-07 | 725 | 328.2
fact | 263613 | 1 | 7.25e-07 | 725 | 363.6
received | 264606 | 1 | 7.25e-07 | 725 | 364.9
person | 267878 | 1 | 7.25e-07 | 725 | 369.5
children | 275607 | 1 | 7.25e-07 | 725 | 380.1
till | 304735 | 1 | 7.25e-07 | 725 | 420.3
death | 309653 | 1 | 7.25e-07 | 725 | 427.1
man | 1573117 | 5 | 3.63e-06 | 3625 | 433.9
morning | 330567 | 1 | 7.25e-07 | 725 | 455.9
had | 6139336 | 17 | 1.23e-05 | 12326 | 498.1
did | 1185720 | 3 | 2.18e-06 | 2175 | 545.1
knew | 413101 | 1 | 7.25e-07 | 725 | 569.7
very | 1462382 | 3 | 2.18e-06 | 2175 | 672.3
my | 3277699 | 4 | 2.90e-06 | 2900 | 1130.1
i | 11764797 | 11 | 7.98e-06 | 7976 | 1475.1
his | 8799755 | 8 | 5.80e-06 | 5801 | 1517.1
her | 5202501 | 2 | 1.45e-06 | 1450 | 3587.6
said | 2637136 | 1 | 7.25e-07 | 725 | 3637.1
he | 8397205 | 1 | 7.25e-07 | 725 | 11581.3

The pronoun ‘he’ is over 11,500 times more frequent in natural texts than in ChatGPT replies, ‘her’ about 3600 times, ‘reason’ 317 times, ‘god’ 254 times, ‘purpose’ 223 times. One would not expect to see these words often in typical ChatGPT replies.

The word ‘woman’ was not seen in the ChatGPT sample at all. Surprisingly, the pronoun ‘she’, one of the most common words in the English language, was not attested either. You can check this yourself by downloading the data used in this post: just clone chatgpt_corpus.
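
You can verify this with the same tokenizer used earlier (the file path is as in the repo; if the observation above holds, both counts should come out as zero):

import re

with open('./chatgpt-corpus/chatgpt-replies-small-corpus.txt', encoding='utf-8') as f:
    tokens = re.findall(r'\b\w+\b', f.read().lower())

# Expect 0 and 0, per the observation above
print(tokens.count('she'), tokens.count('woman'))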

Are word frequencies in ChatGPT texts and in natural texts correlated?

An important question about the quantitative nature of AI-generated texts is whether their rate of vocabulary use is at all related to what we find in natural language. The short preliminary answer is yes, but the correlation is not very strong. Firstly, a few words on the quality of the Project Gutenberg word frequency list. It appears that the PG list from Wikipedia can be trusted: if you compare word frequencies (per billion) in the PG list and in a comparatively large human text, the values are strongly correlated. Using the same script as in «Python script to create word frequency lists» above, I gathered statistics from Tolstoy’s War and Peace and compared the PG per-billion rates with those induced from War and Peace. The result: Pearson’s r is very strong, 0.96, with a highly significant p-value < 0.00001.

Now for the ChatGPT vocabulary compared with the PG list: Pearson’s correlation was r = 0.078 with p < 0.00001. The p-value looks impressive, but the correlation itself is weak, and if you look at the scatter plot below, you can see a handful of outliers, namely the most common English words: ‘the’, ‘and’, ‘to’, ‘of’. They make an impact on the statistic. These words occur at more or less the same rate in both lists (we are comparing ChatGPT words and the Project Gutenberg list). For example, ‘the’ shows 56,271,872 words per billion in PG and 40,964,921 in the ChatGPT sample: at least the same order of magnitude. ‘And’, plus a few other function words, behaves the same way in the AI-generated sample of 1.33 mln tokens that we gathered on the web.

Scatter plot: ChatGPT sample vs. the Project Gutenberg frequency list

Scatter plot: War and Peace vs. the Project Gutenberg frequency list

Finally, let’s look at some descriptive statistics for the word lists. In War and Peace, compared against the PG list, the expected-vs-observed word frequency ratio (the one we discussed above) has a mean of 1.82 and a median of about 0.97. A median near 1 means that half of the words in War and Peace that are also among the 40,000 most frequent in English occur at roughly the expected rate. That cannot be said of the GPT corpus: in the AI-generated sample, the mean is 17.25 and the median is 1.64. The descriptive data, compared to Project Gutenberg, suggests that AI-generated texts are fundamentally different from human writing.
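
For the record, here is a minimal sketch of how these statistics can be recomputed from the merged file. The column names are an assumption; adjust them to whatever header your merged CSV actually carries (pandas and scipy assumed installed):

import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('CompareRanksGPTvsPG.csv')
# Assumed column names; rename to match your merged header if needed
pg, gpt = df['Frequency per Billion 1'], df['Frequency per Billion 2']

# Correlation between expected (PG) and observed (ChatGPT) rates
r, p = pearsonr(pg, gpt)
print(f'Pearson r = {r:.3f}, p = {p:.1e}')

# Expected-vs-observed ratio, summarized by its mean and median
ratio = pg / gpt
print(f'mean = {ratio.mean():.2f}, median = {ratio.median():.2f}')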

You should now be wondering: what exactly is the difference between ChatGPT texts and natural data? We’ll continue investigating this question in the next post. I will show that the difference lies in the distribution of word frequencies: a preliminary analysis of the textual data indicates that ChatGPT avoids the word aggregation, or ‘burstiness’, that is so common in natural language. So, please, read about it in the next post.

Oh, and since you’ve finally got here: what does ChatGPT have to say about the «intricate tapestry» phenomenon? I presented the results of our research to ChatGPT to see its replies and explanations. Here is what I learned from a few prompts: read about it in the next post!


Text: Alexandre Sotov
Comments or Questions? Contact me on LinkedIn

