We now have all the ingredients we need to check whether a piece of text is AI-generated:
- The text (sentence or paragraph) we wish to check.
- The tokenized version of this text, produced with the same tokenizer that was used to tokenize the model’s training dataset.
- The trained language model.
Using 1, 2, and 3 above, we can compute the following (a short code sketch follows the list):
- Per-token probability as predicted by the model.
- Per-token perplexity using the per-token probability.
- Total perplexity for the entire sentence.
- The perplexity of the model on the training dataset.
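To make these quantities concrete, here is a minimal PyTorch sketch of how they could be computed. The names used here are illustrative assumptions rather than code from this article: model is a causal language model that maps a batch of token ids to next-token logits, and tok is the tokenizer used on the training dataset.

import torch
import torch.nn.functional as F

def sentence_perplexity(model, tok, sentence):
    # Tokenize with the same tokenizer that was used on the training dataset.
    ids = torch.tensor(tok.encode(sentence).ids).unsqueeze(0)          # (1, T)

    # Per-token probability: the model predicts token t+1 from tokens 0..t.
    with torch.no_grad():
        logits = model(ids)                                            # (1, T, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Per-token perplexity: exp of each token's negative log-probability.
    token_ppx = torch.exp(-token_log_probs)                            # (1, T-1)

    # Total perplexity of the sentence: exp of the mean negative log-likelihood.
    sent_ppx = torch.exp(-token_log_probs.mean()).item()
    return sent_ppx, token_ppx.squeeze(0)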
To check if a text is AI-generated, we compare the sentence perplexity with the model’s training perplexity scaled by a fudge-factor, alpha. If the sentence perplexity is greater than the model’s scaled perplexity, it’s probably human-written text (i.e. not AI-generated). Otherwise, it’s probably AI-generated.

The reason for this is that we expect the model not to be perplexed by text it would generate itself, so if it encounters text that it would not generate, there’s reason to believe the text isn’t AI-generated. If the perplexity of the sentence is less than or equal to the model’s scaled training perplexity, it’s likely that it was generated using this language model, but we can’t be very sure. It’s possible for a human to have written that text, and it just happens to be something the model could also have generated. After all, the model was trained on a lot of human-written text, so in some sense the model represents an “average human’s writing”.
In other words: if ppx(x) > alpha · ppx(training dataset), we treat the text as probably human-written; otherwise, we treat it as possibly AI-generated. Here, ppx(x) means the perplexity of the input “x”.
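Expressed as code, the decision rule is a single comparison. The training perplexity and the fudge-factor alpha below are placeholder values, and sentence_perplexity() is the hypothetical helper sketched above.

# Hypothetical numbers: the model's perplexity on its training dataset and
# the fudge-factor alpha from the rule above.
model_training_ppx = 23.5
alpha = 1.1

sent_ppx, token_ppx = sentence_perplexity(model, tok, "Some text to check.")

if sent_ppx > alpha * model_training_ppx:
    print("Probably human-written: the model is more perplexed than expected.")
else:
    print("Possibly AI-generated, but a human could have written it too.")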
Next, let’s take a look at examples of human-written v/s AI-generated text.
Examples of AI-generated v/s human-written text
We’ve written some Python code that colours each token in a sentence based on its perplexity relative to the model’s perplexity. The first token is always coloured black, since we don’t compute a perplexity for it (there’s nothing before it to condition on). Tokens whose perplexity is less than or equal to the model’s scaled perplexity are coloured red, indicating that they may be AI-generated, whereas tokens with a higher perplexity are coloured green, indicating that they were definitely not AI-generated.
The numbers in the square brackets before the sentence indicate the perplexity of the sentence as computed using the language model. Note that some words are part red and part green. This is because we use a subword tokenizer, so a single word can be split across multiple tokens, each with its own perplexity.
Here’s the code that generates the HTML above.
def get_html_for_token_perplexity(tok, sentence, tok_ppx, model_ppx):
    # Tokenize the sentence with the same tokenizer that was used for training.
    tokens = tok.encode(sentence).tokens
    ids = tok.encode(sentence).ids
    cleaned_tokens = []
    for word in tokens:
        # The BPE tokenizer marks a leading space with 'Ġ' (code point 288);
        # swap it back to a regular space for display.
        m = list(map(ord, word))
        m = list(map(lambda x: x if x != 288 else ord(' '), m))
        m = list(map(chr, m))
        m = ''.join(m)
        cleaned_tokens.append(m)
    #
    # The first token has no perplexity (nothing precedes it), so it stays black.
    html = [
        f"<span>{cleaned_tokens[0]}</span>",
    ]
    for ct, ppx in zip(cleaned_tokens[1:], tok_ppx):
        color = "black"
        if ppx.item() >= 0:
            # At or below the model's scaled perplexity: possibly AI-generated (red).
            # Above it: human-written (green).
            if ppx.item() <= model_ppx * 1.1:
                color = "red"
            else:
                color = "green"
            #
        #
        html.append(f"<span style='color:{color};'>{ct}</span>")
    #
    return "".join(html)
#
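For reference, this is roughly how the function above might be invoked in a notebook. The tokenizer, the per-token perplexity tensor, and the training perplexity value are stand-ins rather than outputs from this article’s code; the per-token tensor could come from a sketch like sentence_perplexity() above.

from IPython.display import HTML, display

# Hypothetical inputs: `tok` is the training tokenizer, `token_ppx` is a 1-D
# tensor with one perplexity value per token after the first, and 23.5 stands
# in for the model's perplexity on its training dataset.
sentence = "The quick brown fox jumps over the lazy dog."
markup = get_html_for_token_perplexity(tok, sentence, token_ppx, 23.5)
display(HTML(markup))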
As we can see from the examples above, if a model detects some text as human-generated, it’s definitely human-generated, but if it detects the text as AI-generated, there’s a chance that it’s not AI-generated. So why does this happen? Let’s take a look next!
False positives
Our language model is trained on a LOT of text written by humans. It’s generally hard to tell whether a given piece of (digital) text was written by a specific person. The model’s training inputs comprise many, many different styles of writing, likely written by a large number of people, so the model learns a wide range of writing styles and content. It’s very likely that your writing style closely matches the writing style of some text the model was trained on. This is what causes false positives, and it’s why the model can’t be sure that some text is AI-generated. However, the model can be sure that some text was human-generated.
OpenAI: OpenAI recently announced that it would discontinue its tools for detecting AI-generated text, citing a low accuracy rate (Source: Hindustan Times).
The original version of the AI classifier tool had certain limitations and inaccuracies from the outset. Users were required to input at least 1,000 characters of text manually, which OpenAI then analyzed to classify as either AI or human-written. Unfortunately, the tool’s performance fell short, as it properly identified only 26 percent of AI-generated content and mistakenly labeled human-written text as AI about 9 percent of the time.
Here’s the blog post from OpenAI. It seems like they used a different approach compared to the one mentioned in this article.
Our classifier is a language model fine-tuned on a dataset of pairs of human-written text and AI-written text on the same topic. We collected this dataset from a variety of sources that we believe to be written by humans, such as the pretraining data and human demonstrations on prompts submitted to InstructGPT. We divided each text into a prompt and a response. On these prompts, we generated responses from a variety of different language models trained by us and other organizations. For our web app, we adjust the confidence threshold to keep the false positive rate low; in other words, we only mark text as likely AI-written if the classifier is very confident.
GPTZero: Another popular AI-generated text detection tool is GPTZero. It seems like GPTZero uses perplexity and burstiness to detect AI-generated text. “Burstiness refers to the phenomenon where certain words or phrases appear in bursts within a text. In other words if a word appears once in a text, it’s likely to appear again in close proximity” (source).
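To make the idea of burstiness a little more concrete, here is a toy measurement that is emphatically not GPTZero’s actual metric: it simply measures how closely repeated words cluster together, which is the phenomenon the quote describes.

from collections import defaultdict

def average_repeat_gap(text):
    # Toy burstiness illustration (not GPTZero's metric): the average gap, in
    # word positions, between successive occurrences of the same word.
    # Smaller gaps mean repeated words cluster together in "bursts".
    words = text.lower().split()
    positions = defaultdict(list)
    for i, w in enumerate(words):
        positions[w].append(i)
    gaps = [b - a
            for pos in positions.values() if len(pos) > 1
            for a, b in zip(pos, pos[1:])]
    return sum(gaps) / len(gaps) if gaps else float("inf")

print(average_repeat_gap("the cat sat near the cat because the cat was warm"))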
GPTZero claims to have a very high success rate. According to the GPTZero FAQ, “At a threshold of 0.88, 85% of AI documents are classified as AI, and 99% of human documents are classified as human.”
The generality of this approach
The approach mentioned in this article doesn’t generalize well. What we mean is that if you have 3 language models, for example, GPT3, GPT3.5, and GPT4, you must run the input text through all 3 models and check its perplexity under each of them to see if the text could have been generated by any one of them. Each model generates text slightly differently, so each one needs to evaluate the text independently. A rough sketch of such a multi-model check follows.
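The sketch below reuses the hypothetical sentence_perplexity() helper from earlier; the per-model tokenizers and training perplexities are assumptions for illustration.

def check_against_models(candidates, sentence, alpha=1.1):
    # `candidates` maps a model name to a (model, tokenizer, training_ppx)
    # tuple -- all hypothetical stand-ins. Returns the names of the models
    # that may have generated the sentence.
    suspects = []
    for name, (model, tok, training_ppx) in candidates.items():
        sent_ppx, _ = sentence_perplexity(model, tok, sentence)
        if sent_ppx <= alpha * training_ppx:
            suspects.append(name)
    return suspects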
With the proliferation of large language models in the world as of August 2023, it seems infeasible to check whether a given piece of text originated from any one of the many language models out there.
In fact, new models are being trained every day, and trying to keep up with this rapid progress seems hard at best.
The example below shows the result of asking our model to predict if the sentences generated by ChatGPT are AI-generated or not. As you can see, the results are mixed.
There are many reasons why this may happen.
- Training corpus size: Our model is trained on very little text, whereas ChatGPT was trained on terabytes of text.
- Data distribution: Our model is trained on a different data distribution as compared to ChatGPT.
- Fine-tuning: Our model is just a GPT model, whereas ChatGPT was fine-tuned for chat-like responses, making it generate text in a slightly different tone. If you had a model fine-tuned to generate legal text or medical advice, our model would perform poorly on text generated by that model as well.
- Model size: Our model is very small (less than 100M parameters compared to > 200B parameters for ChatGPT-like models).
It’s clear that we need a better approach if we hope to provide a reasonably high-quality result to check if any text is AI-generated.
Next, let’s take a look at some misinformation about this topic circulating around the internet.