The final test of humanity: AI doesn't stand a chance!
Ruhr University Bochum is contributing to “Humanity’s Last Exam”, a new benchmark for assessing the capabilities of AI.

On April 1, 2025, the “Humanity’s Last Exam” benchmark dataset was presented, designed specifically to test the capabilities of generative artificial intelligence (AI). The dataset contains 3,000 questions selected from more than 70,000 submissions; around 1,000 experts from 50 countries contributed to compiling them. Among them were the mathematicians Prof. Dr. Christian Stump and Prof. Dr. Alexander Ivanov of Ruhr University Bochum, who contributed three questions. To protect the integrity of the test, only unpublished questions were accepted, so that AI models cannot simply look up the answers on the Internet.
A particularly notable feature of the dataset is that 40 percent of the questions come from mathematics; some of them are demanding enough to serve as starting points for doctoral theses. It also turns out that the more abstract a question is, the more clearly it exposes the reasoning chains of the AI models. Despite this sophisticated design, the models tested were able to give meaningful answers to only about nine percent of the questions; they consistently produced unusable answers to the rest. This illustrates how challenging it is to test the intelligence and problem-solving abilities of artificial intelligence.
The importance of benchmarks for AI development
The introduction of “Humanity’s Last Exam” (HLE) marks an important step in the evaluation of large language models. Existing benchmarks have become insufficient for measuring further progress: on tests such as MMLU, current models already achieve over 90 percent accuracy, which leaves little room to realistically assess what they can actually do. The “HLE” dataset therefore aims to be the last closed-ended academic benchmark of its kind, covering a broad range of subjects.
“HLE” comprises a total of 3,000 questions from various disciplines, including mathematics, the humanities and the natural sciences. The dataset contains both multiple-choice and short-answer questions suitable for automatic grading. Every question has a clear, verifiable solution and cannot be answered quickly by a simple Internet search. Current language models show poor accuracy and calibration when tested on “HLE”, indicating a significant gap between the models' capabilities and human expert performance on closed-ended academic questions. This underlines how difficult it is to assess current AI development and why progress in this area needs to be reviewed continually.
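To illustrate what “accuracy and calibration” mean in this context, the following Python sketch computes both from a list of graded answers together with the models' self-reported confidences. It is not the official HLE evaluation code; the Prediction record and the simple binned calibration error are assumptions chosen for illustration.

```python
# Illustrative sketch only, not the official HLE grader: compute accuracy and a
# simple binned (expected) calibration error from graded answers and confidences.
from dataclasses import dataclass

@dataclass
class Prediction:           # hypothetical record for one graded question
    correct: bool           # did the model's answer match the reference solution?
    confidence: float       # model's self-reported confidence in [0, 1]

def accuracy(preds: list[Prediction]) -> float:
    return sum(p.correct for p in preds) / len(preds)

def calibration_error(preds: list[Prediction], n_bins: int = 10) -> float:
    """Average |mean confidence - accuracy| over confidence bins, weighted by bin size."""
    bins: list[list[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        bins[min(int(p.confidence * n_bins), n_bins - 1)].append(p)
    error = 0.0
    for bucket in bins:
        if bucket:
            acc = sum(p.correct for p in bucket) / len(bucket)
            conf = sum(p.confidence for p in bucket) / len(bucket)
            error += len(bucket) / len(preds) * abs(conf - acc)
    return error

# Example: a model that is confidently wrong is poorly calibrated.
preds = [Prediction(False, 0.9), Prediction(True, 0.8), Prediction(False, 0.95)]
print(accuracy(preds), calibration_error(preds))
```

A model that answers few questions correctly while reporting high confidence scores poorly on both measures, which is exactly the pattern the HLE authors describe for current models.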
For those interested, “HLE” is publicly available, and users are encouraged to cite the work whenever the dataset is used in research. The initiative could significantly shape future standards and expectations for AI-based education and assessment tools. Benchmarks of this kind are essential for continuing to monitor developments in AI and its performance critically.
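For a first hands-on look at the publicly available dataset, the short sketch below loads it with the Hugging Face datasets library. The dataset identifier “cais/hle” and the split name are assumptions; the exact values should be taken from the official release.

```python
# Sketch of loading the public dataset; the identifier "cais/hle" and the split
# name "test" are assumptions -- check the official release for the exact values.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(len(hle))        # number of questions in the public split
print(hle[0].keys())   # inspect the available fields before relying on their names
```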