AI is failing humanity’s final test. So what does that mean for machine intelligence?



How do you translate ancient Palmyrene script from Roman tombstones? How many paired tendons are supported by a particular sesamoid bone in a hummingbird? Based on the latest research on the Tiberian pronunciation tradition, can you identify the closed syllables of Biblical Hebrew?

These are some of the questions in “Humanity’s Last Exam,” a new benchmark introduced in a study published this week in Nature. The collection of 2,500 questions is designed specifically to probe what today’s artificial intelligence (AI) systems still cannot do.

The benchmark is a global collaboration of roughly 1,000 experts across a wide range of academic disciplines. These scholars and researchers wrote questions drawn from the frontier of human knowledge, problems requiring graduate-level expertise in mathematics, physics, chemistry, biology, computer science, and the humanities. Importantly, every question was tested against leading AI models before being included: if the models could already answer a question correctly, it was rejected.
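That filtering step can be pictured as a simple loop. The following is a minimal sketch, not the benchmark’s actual pipeline: `models`, `ask_model`, and the exact-match grading are illustrative stand-ins for whatever tooling and grading the authors actually used.

```python
# Hypothetical sketch of adversarial filtering: keep a candidate question
# only if none of the frontier models answers it correctly.
# ask_model(model, question) is assumed to return the model's answer as text.

def filter_questions(candidates, models, ask_model):
    """Return only the (question, reference_answer) pairs that every model gets wrong."""
    accepted = []
    for question, reference_answer in candidates:
        answered_correctly = any(
            ask_model(model, question).strip().lower() == reference_answer.strip().lower()
            for model in models
        )
        if not answered_correctly:  # no model solved it, so it stays in the benchmark
            accepted.append((question, reference_answer))
    return accepted
```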

This selection process explains why the initial results look so different from other benchmarks. AI chatbots routinely score above 90% on popular tests, but when Humanity’s Last Exam was first released in early 2025, leading models struggled mightily. GPT-4o’s accuracy was just 2.7%. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s most powerful model at the time, o1, managed only an 8% success rate.

The low scores were the point: the benchmark was built to measure what AI cannot yet do. Some commentators have suggested that benchmarks like Humanity’s Last Exam point the way toward artificial general intelligence, or even superintelligence, meaning AI systems capable of performing any task at a human or superhuman level. We believe this is wrong, for three reasons.

Benchmarks measure task performance, not intelligence

If a student does well on the bar exam, we can reasonably predict that they will become a competent lawyer. That is because the test is designed to assess whether a human has acquired the knowledge and reasoning skills needed to practice law, and for humans it works: the understanding required to pass really does carry over into the job.

But AI systems are not humans preparing for a career.

If a large language model scores well on a bar exam, we know only that it can produce plausible-looking answers to legal questions. We do not know whether it understands the law, can counsel a nervous client, or can exercise professional judgment in ambiguous situations.

For humans, the test measures something real. For AI, it measures only performance on the test itself.

Using tests designed for humans to benchmark AI is common practice, but it is fundamentally misleading. Assuming that a higher test score makes the machine more human-like is a category error, much like concluding that a calculator “understands” mathematics because it can solve equations faster than we can.

Human intelligence and machine intelligence are fundamentally different

Humans continuously learn from experience. We have intentions, needs, and goals. We live lives, have physical bodies, and experience the world directly. Our intelligence evolved to help us survive as organisms and thrive as social creatures.

But AI systems are very different.

Large language models derive their capabilities from patterns in the text they were trained on. Once trained, they do not continue to learn from experience.

For humans, intelligence comes first and language serves as a communication tool: intelligence precedes language. For large language models, language is the intelligence, and there is nothing underneath it.

The creators of Humanity’s Last Exam acknowledge this limitation themselves:

High accuracy on [Humanity’s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.

Subbarao Kambhampati, a professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more plainly:

The essence of humanity is not captured by static tests, but by our ability to evolve and tackle previously unimaginable problems.

Developers love leaderboards

There’s another problem. AI developers optimize their models for leaderboard performance; in effect, they cram for the exam. And unlike humans, whose understanding genuinely deepens when they study for a test, an optimized AI has simply become better at that particular test.

And it’s working.

Since Humanity’s Last Exam was published online in early 2025, scores have risen dramatically. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.

Does this improvement mean these models are approaching human intelligence? No. It means they have become better at the kinds of questions that appear on the exam. The benchmark has itself become a target for optimization.

The industry recognizes this issue.

OpenAI recently introduced a measure called GDPval that is specifically designed to assess real-world usefulness.

Unlike academic benchmarks, GDPval focuses on tasks built from real-world work products: project documents, data analyses, and other artifacts produced in professional settings.

What this means for you

If you use AI tools at work, or are considering adopting them, don’t let benchmark scores sway you. Even models that ace Humanity’s Last Exam may still struggle with the specific tasks you need them to perform.

Also be aware that the exam’s questions are heavily weighted toward certain fields. Mathematics alone accounts for 41% of the benchmark, with physics, biology, and computer science making up most of the rest. If your job involves writing, communication, project management, or customer service, the exam tells you little about which model will serve you best.

A pragmatic approach is to devise your own tests based on what you actually need AI to do, and to evaluate new models against the criteria that matter to you, as in the sketch below. AI systems are genuinely useful, but talk of superintelligence remains science fiction, and it distracts from the real work of fitting these tools into people’s lives.
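Here is a minimal sketch of what such a do-it-yourself evaluation might look like, assuming you have a handful of tasks drawn from your own work and some way to call the model. `call_model`, the example prompts, and the keyword-based scoring are all illustrative assumptions; substitute whatever “good enough” means in your setting.

```python
# Hypothetical personal benchmark: a few tasks from your own work,
# scored by whether the model's output mentions the things that matter to you.
# call_model(prompt) is assumed to return the model's response as text.

my_tasks = [
    {"prompt": "Summarise this customer complaint in two sentences: ...",
     "must_mention": ["refund", "delivery delay"]},
    {"prompt": "Draft a project status update from these bullet points: ...",
     "must_mention": ["deadline", "blocker"]},
]

def evaluate(call_model, tasks):
    """Return the fraction of tasks whose output mentions every required term."""
    passed = 0
    for task in tasks:
        output = call_model(task["prompt"]).lower()
        if all(term in output for term in task["must_mention"]):
            passed += 1
    return passed / len(tasks)
```

Crude as it is, a checklist like this tells you more about a model’s fit for your work than a leaderboard score ever will, and it is easy to rerun whenever a new model is released.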


