GPT-4.5 Outperforms Humans in Controlled Turing Test, But There’s a Catch

The Illusion of Humanity in Text
For decades, the Turing test has served as the gold standard—or perhaps the cautionary tale—of artificial intelligence. Proposed by Alan Turing in 1950, the premise is deceptively simple: if a human judge cannot distinguish between a machine and another human during a text-based conversation, the machine has effectively simulated human intelligence.
While the tech industry has largely moved toward more rigorous benchmarks like MMLU or HumanEval, a new preprint study from researchers at Stony Brook University and UC San Diego suggests that large language models (LLMs) are becoming dangerously proficient at the art of deception. In a series of randomized, controlled, and preregistered experiments, OpenAI’s GPT-4.5 was judged to be the human 73% of the time, meaning judges picked it over the actual human in the same conversation significantly more often than not.
The experimental setup mirrored the classic three-party design. A judge engaged in five-minute text conversations with two participants: a large language model (LLM) and a human volunteer, typically an undergraduate student from UC San Diego. The judge’s task was binary: decide which participant was the human.
A Tale of Two Models
The results reveal a stark divide in how different models handle human mimicry. While GPT-4.5 led with its 73% success rate, Meta’s LLaMa-3.1-405B followed with a more modest 56%, only slightly above the 50% a coin flip would produce. Other models fared significantly worse: GPT-4o and the primitive ELIZA chatbot, a program created in the 1960s, tricked judges only 21% and 23% of the time, respectively.
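To make the "coin flip" comparison concrete, here is a minimal sketch of how a win rate can be checked against the 50% chance level with an exact two-sided binomial test. The trial count (N_TRIALS = 100) is a hypothetical placeholder, since the per-model sample sizes are not given here, and the function names are illustrative; this is not the study's own analysis.

```python
from math import comb


def win_rate(verdicts):
    """Fraction of trials in which the judge picked the AI as the human."""
    return sum(verdicts) / len(verdicts)


def binomial_two_sided_p(wins, n, p=0.5):
    """Exact two-sided binomial p-value against a chance level p.

    Sums the probability of every outcome that is no more likely
    than the observed number of wins.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small tolerance so floating-point ties are counted consistently.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# ASSUMPTION: 100 trials per model is a made-up number for illustration only.
N_TRIALS = 100

for label, wins in [("73% win rate", 73), ("56% win rate", 56), ("21% win rate", 21)]:
    p_value = binomial_two_sided_p(wins, N_TRIALS)
    verdict = "differs from chance" if p_value < 0.05 else "consistent with chance"
    print(f"{label}: p = {p_value:.4g} ({verdict})")
```

Under that assumed sample size, 56 wins out of 100 cannot be told apart from guessing at the conventional 0.05 threshold, while 73 wins (and, in the other direction, 21 wins) clearly can. With a different number of trials the conclusion for the borderline case could change.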
However, the most revealing detail isn’t the percentage of success, but the method of success. The researchers found that the high performance of GPT-4.5 and LLaMa-3.1 was tied to a specific persona prompt rather than to general intelligence alone. The models were instructed to adopt a persona described as a “young, introverted, and chronically-online” individual.
This suggests that the AI isn’t necessarily “becoming human,” but is instead leveraging its training data to mimic the specific linguistic quirks, slang, and brevity of Gen Z internet culture. By simulating the fragmented and often detached communication style of a digital native, the AI effectively masked the “over-politeness” and structured verbosity that typically give away an LLM.
The Subjectivity Gap
The study highlights a critical flaw in the Turing test: the human element. Because the test relies on the perception of the judge and the performance of the human control, it is inherently subjective. If the human volunteer in the study was too formal, too helpful, or simply too “boring,” the judge was more likely to take them for the bot and to take the AI pretending to be a moody teenager for the human.
This creates a paradoxical situation where the AI passes the test not by being an intelligent agent, but by being a convincing caricature. It underscores a growing trend in AI development where style is often mistaken for substance.
Implications for Digital Trust
While these results don’t signal the arrival of sentient machines, they do point toward a deepening crisis of authenticity in digital spaces. As LLMs become better at adopting niche personas, the ability to verify human identity in casual online interactions is rapidly evaporating.
The authors of the study note that this capability increases the risk of sophisticated social engineering and deception. When an AI can out-human a human by simply acting like a chronically online youth, the barrier between genuine human connection and algorithmic manipulation becomes nearly invisible.
As we integrate these models into social media and customer service interfaces, the question is no longer whether AI can pass the Turing test, but whether the test even matters in a world where the line between a person and a persona has already blurred.