GPT-4.5 Has Passed The Turing Test. What Does That Mean For Teachers?


Researchers at the University of California San Diego have conducted what they say is the most rigorous Turing test of AI models to date, and found it was nearly impossible for participants to distinguish humans from AI models in short conversations.

Introduced in a 1950 paper by computing pioneer Alan Turing, the “Turing test,” or what he called the “imitation game,” is a classic test of machine intelligence in which a judge converses with both a human and a machine and tries to determine which one is the human.

“Turing opens the paper with the question, 'Can machines think?' And then he says this is an unanswerable question, let's focus on an easier question, a practical question,” says Cameron Jones, a postdoc in the Language and Cognition Lab at UC San Diego. Jones adds, Turing goes on to suggest, “that if a machine can imitate a human on any topic, if it can produce behavior that's indistinguishable from a human, we shouldn't have any grounds for saying that the human is intelligent, but the machine isn't.”

Although Jones notes there is some debate over how serious Turing was, the test has become a commonly cited benchmark of machine intelligence.
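For readers who like to see the structure laid out, the imitation game can be sketched in a few lines of code. This is an illustrative toy, not the study’s actual setup: the `human_reply` and `model_reply` functions are hypothetical stand-ins for a recruited participant and a prompted model, and the coin-flip judge marks the 50% baseline that a human interrogator would need to beat.

```python
import random

# Hypothetical stand-ins for a real participant and a prompted LLM witness.
def human_reply(message: str) -> str:
    return "honestly not much, i was half asleep most of the weekend lol"

def model_reply(message: str) -> str:
    return "That's an interesting question! I mostly caught up on errands."

def one_round(judge) -> bool:
    """One round of the imitation game: the judge sees two unlabeled replies
    and guesses which position holds the human. Returns True if correct."""
    witnesses = [("human", human_reply), ("model", model_reply)]
    random.shuffle(witnesses)  # hide which witness sits in position A or B
    replies = [fn("What did you do last weekend?") for _, fn in witnesses]
    guess_index = judge(replies)  # judge returns 0 or 1
    return witnesses[guess_index][0] == "human"

def coin_flip_judge(replies):
    # Placeholder judge that guesses at random (expected accuracy: 50%).
    # In the study, human interrogators judged after a short conversation.
    return random.randrange(2)

if __name__ == "__main__":
    rounds = 1000
    wins = sum(one_round(coin_flip_judge) for _ in range(rounds))
    print(f"Human correctly identified in {wins / rounds:.0%} of rounds")
```

In the study’s terms, a model “passes” when interrogators pick the human no more often than chance, or, as with GPT-4.5, less often than chance.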

GPT-4.5 and The Turing Test

For their study, Jones and colleagues ran two separate experiments. First, they recruited 126 undergraduate participants through the psychology program at UC San Diego. They also recruited 158 paid participants from a study-participant platform called Prolific.

In these experiments, Jones and his collaborators tested multiple AI models. The research found that “when prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant.”

Given the same prompt, LLaMa-3.1 was judged to be human 56% of the time, about the same as the humans they were compared to. Meanwhile, GPT-4o was thought to be human just 21% of the time.

The results of these two experiments have been published as a preprint and have not yet been peer-reviewed. Nonetheless, Jones believes his findings have several implications for educators around the way we teach, test, and prepare students for the workforce.

Turing Test Results and Education

“The idea behind the Turing test is this kind of idea of indistinguishability. And so if models can produce behavior that's indistinguishable from human behavior, then we say that the models are as intelligent as people,” Jones says. “If people can't tell the difference between a human and a machine, then it's not clear that the human will have any marginal value at that task. So I think that's got to be a big worry in education: trying to think about what are the types of activities that will have a comparative advantage for humans in the future.”

He adds, “One thing that our results do suggest is that models have maybe already reached this stage for short conversations with strangers, and there might be quite a lot of jobs that have that component to them.”

What exactly these AI-proof jobs are is still a "million-dollar question." Broadly speaking, however, Jones says AI models still struggle with problems such as hallucinations. Most also have a tendency to fail at their "jobs" for unexplained reasons, and even a 5% failure rate can be a big problem in certain roles.

Most significantly, AI models tend to fail at "long-horizon planning and use of context," he says. "An employee who has been at the company for three years has just picked up a lot of implicit knowledge about where things are and why things are done the way they're done."

He adds, "Manufacturing and maintaining a context window to include all of this information for an LLM can be very challenging. This means that tasks which take a person longer than a few hours are often too complex for models, because they either lack sufficient context or their errors compound, or their context window gets too bloated."
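To make the context-window point concrete, here is a rough, hypothetical sketch of the problem Jones describes: packing accumulated institutional knowledge into a fixed token budget. The documents, the 8,000-token limit, and the words-to-tokens ratio are invented for illustration; real token counts depend on the model's tokenizer.

```python
# Illustrative only: shows how accumulated knowledge can outgrow a fixed budget.
MAX_CONTEXT_TOKENS = 8_000

institutional_knowledge = {
    "onboarding_notes": "Why the billing job runs at 2 a.m. ... " * 200,
    "style_guide": "House rules for naming branches and writing tickets ... " * 300,
    "meeting_history": "Decisions and reversals from three years of standups ... " * 2_000,
}

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)

def build_context(task: str) -> None:
    """Greedily pack documents into the prompt until the budget runs out."""
    budget = MAX_CONTEXT_TOKENS - estimate_tokens(task)
    included, dropped = [], []
    for name, doc in institutional_knowledge.items():
        cost = estimate_tokens(doc)
        if cost <= budget:
            included.append(name)
            budget -= cost
        else:
            dropped.append(name)  # the model simply never sees this material
    print(f"included: {included}")
    print(f"dropped for lack of space: {dropped}")

build_context("Draft the quarterly maintenance plan.")
```

In this toy example, the three years of meeting history never fits, so whatever the model produces is missing exactly the implicit knowledge a long-tenured employee would bring.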

So until AI programs gain long-term memory and institutional knowledge, and can integrate both consistently into their work, humans will still be needed.

Evaluations Going Forward

The inability to distinguish between human and machine creations on school assignments is an issue many teachers are already familiar with and battling. Jones’ research underscores that this cheating risk is real. It also raises questions about how we evaluate students.

In that vein, Jones says educators will need to start asking themselves questions such as, “What's the evaluation for? What is it that you're trying to learn if you're evaluating people on things that you can't distinguish between humans and models?”

These are questions that educators have been grappling with and debating since ChatGPT was released, but this type of research highlights the need to address them as AI continues to improve.

Erik Ofgang

Erik Ofgang is a Tech & Learning contributor. A journalist, author and educator, his work has appeared in The New York Times, the Washington Post, the Smithsonian, The Atlantic, and the Associated Press. He currently teaches at Western Connecticut State University’s MFA program. While a staff writer at Connecticut Magazine, he won a Society of Professional Journalists award for his education reporting. He is interested in how humans learn and how technology can make that more effective.