Humanity’s Last Exam is a distraction.

# Introduction

Humanity’s Last Exam (HLE) is a benchmark designed to measure the reasoning and deep knowledge capabilities of modern AI systems. Its defining characteristic is the extreme difficulty of the underlying evaluation. Think of it as today’s evolution of the Turing Test from decades ago.

This article examines the benchmark, outlines why it was created, summarizes the opinions held by experts in the field, and concludes with the most widely accepted verdict.

# Why was it built and what does it consist of?

The traditional benchmarks used to test AI systems have become obsolete, as these systems have evolved to the point where they achieve near-perfect scores without much effort. For this reason, the Center for AI Safety, together with Scale AI and with the help of experts from around the world, created a new benchmark called HLE. The benchmark was published in Nature, one of the most prestigious scientific journals, in January 2026. It has been carefully designed to avoid repeating the patterns of previous evaluation frameworks.

So what is HLE? It is an exam for state-of-the-art AI systems, including language models, and consists of more than 2,500 expert-level questions across more than 100 academic disciplines, including physics, mathematics, biology, and the humanities. Importantly, the questions cannot be answered by memorization alone, nor are they limited to simple information lookups or multiple-choice answers. Instead, they require complex deductive reasoning and deep understanding.

Below are examples of two such questions.

Two examples of HLE questions. Image source: arXiv

Two examples of HLE questions. Image source: Center for AI Safety

Let’s talk about the results obtained so far with today’s state-of-the-art models. Even the most sophisticated frontier models, such as GPT, Gemini, and Claude, barely pass the 45-50% overall accuracy threshold. This number shows how difficult the exam is. In addition, models are often overconfident on the questions they answer incorrectly.
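To make the two quantities in this paragraph concrete, here is a minimal sketch of how accuracy and overconfidence could be measured together. It assumes each graded answer carries a self-reported confidence between 0 and 1. The `Answer` class, the function names, and the choice of a binned expected calibration error are illustrative assumptions, not HLE's official scoring code.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    correct: bool      # whether the model's answer matched the reference
    confidence: float  # model's self-reported confidence, assumed in [0, 1]


def accuracy(answers):
    """Fraction of answers that are correct."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(a.correct for a in answers) / len(answers)


def expected_calibration_error(answers, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if not answers:
        raise ValueError("need at least one answer")
    bins = [[] for _ in range(n_bins)]
    for a in answers:
        # Clamp the index so that confidence == 1.0 lands in the last bin.
        idx = min(int(a.confidence * n_bins), n_bins - 1)
        bins[idx].append(a)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(a.confidence for a in b) / len(b)
        bin_acc = sum(a.correct for a in b) / len(b)
        ece += (len(b) / len(answers)) * abs(mean_conf - bin_acc)
    return ece


# Hypothetical model: 90% confident on every question, right on one in four.
graded = [Answer(True, 0.9), Answer(False, 0.9),
          Answer(False, 0.9), Answer(False, 0.9)]
print(f"accuracy: {accuracy(graded):.2f}")                             # 0.25
print(f"calibration error: {expected_calibration_error(graded):.2f}")  # 0.65
```

A well-calibrated model would have a calibration error close to zero, meaning that answers given with 90% confidence turn out to be right about 90% of the time. The large gap in this toy example is what overconfidence looks like numerically.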

# What are the leading experts’ opinions on HLE?

The honest answer is that there is little consensus. Opinions are quite divided among the technology, developer, and academic communities, although a slight tendency to accept HLE's practical utility prevails. However, there are also critical nuances.

In general, experts and members of the broader public who are familiar with HLE do not consider it a pointless endeavor at all, but they do take issue with its pretentious and seemingly marketing-oriented name.

There are broadly three dominant opinion groups regarding HLE.

## 1. HLE is really useful and necessary

Approximately 60% of opinions lean toward this view, which holds that there are technical reasons why HLE matters so much right now. Previous benchmarks and testing frameworks for AI systems, including fairly recent language model benchmarks like Massive Multitask Language Understanding (MMLU), have become saturated or outdated, with nearly all modern AI systems scoring above 90% on them. As a result, it had become impossible to meaningfully compare the latest models and determine which one is best. One notable reason many experts praise HLE is that it can measure whether an AI is willing to say “I don’t know” rather than hallucinating on complex problems or questions it can’t handle.
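As a rough illustration of the “I don’t know” point, the sketch below tallies responses into three outcomes instead of two, so that an honest abstention is not counted the same way as a confident wrong answer. The abstention marker, the function names, and the exact-match grading are simplifying assumptions made for this example and do not reflect how HLE actually grades answers.

```python
from collections import Counter

ABSTAIN = "I don't know"  # hypothetical abstention marker


def tally(responses, references):
    """Classify each response as correct, wrong, or abstained."""
    counts = Counter(correct=0, wrong=0, abstained=0)
    for response, reference in zip(responses, references):
        normalized = response.strip().lower()
        if normalized == ABSTAIN.lower():
            counts["abstained"] += 1
        elif normalized == reference.strip().lower():
            # Exact string match is a simplification of real grading.
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts


def hallucination_rate(counts):
    """Share of missed questions answered wrongly instead of abstaining."""
    missed = counts["wrong"] + counts["abstained"]
    return counts["wrong"] / missed if missed else 0.0


responses = ["Paris", "I don't know", "42", "blue"]
references = ["paris", "7", "41", "Blue"]
counts = tally(responses, references)
print(dict(counts))                # {'correct': 2, 'wrong': 1, 'abstained': 1}
print(hallucination_rate(counts))  # 0.5
```

Two models with the same accuracy can differ sharply on this second number, which is the kind of distinction a saturated benchmark can no longer surface.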

## 2. HLE distracts from real AI

This skeptical view is held by approximately 30% of opinions. These experts believe that the test rests on overly academic and obscure knowledge and does not truly evaluate how well AI performs in everyday scenarios. Some engineers even go so far as to say, somewhat cynically, that as soon as AI systems routinely score 90% or more on HLE, companies will rush to develop something like an HLE 2, and the marketing cycle will keep turning in favor of large companies.

## 3. HLE is flawed

This is the third and smallest of the three major opinion groups, and it is voiced, for example, in data science forums. Its proponents claim that some HLE answers labeled as correct contain errors, especially in niche questions from fields such as chemistry and advanced mathematics. Rather poetically, it was the most powerful AI systems themselves that started detecting such errors in the benchmark.

# Summary

In summary, the usefulness of HLE is not denied, and although its name is widely considered a pure marketing stunt, many experts have emphasized its importance to some degree. It seems unlikely that this benchmark will mark the birth of super AI or the true emergence of artificial general intelligence (AGI), a concept that has been debated for years but is still more fiction than reality. Nevertheless, the benchmark is considered a very ambitious tool for identifying which AI company has the best models in terms of memory and logic capabilities.

Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and coaches others to leverage AI in the real world.
