Researchers analyzed the accuracy of five AI models on 500 everyday math prompts. One notable result was that the AI models had roughly a 40% chance of getting an answer wrong.
The study, Omni Research on Calculation in AI (ORCA), shows that when AI chatbots are asked to perform routine mathematics, their accuracy varies widely between AI companies and between different types of mathematical tasks.
The selected models include:
- Gemini 2.5 Flash (Google)
- ChatGPT-5 (OpenAI)
- DeepSeek V3.2 (DeepSeek AI)
- Grok-4 (xAI)
The results showed that no AI model scored higher than 63% on everyday math. Even the leader, Gemini (63%), still gets nearly 4 out of 10 questions wrong.
Grok achieved a similar score at 62.8%, while DeepSeek came in third at 52%, followed by ChatGPT at 49.4%.
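To make these figures concrete, the accuracy scores can be converted into an expected number of wrong answers per ten questions. The short Python sketch below does only that arithmetic. The names `accuracy` and `wrong_per_ten` are illustrative, and the values are the four scores quoted in this article.

```python
# Accuracy scores (percent) as quoted in this article.
accuracy = {
    "Gemini 2.5 Flash": 63.0,
    "Grok-4": 62.8,
    "DeepSeek V3.2": 52.0,
    "ChatGPT-5": 49.4,
}

def wrong_per_ten(score_percent):
    """Expected number of wrong answers out of 10 questions, to one decimal."""
    return round((100.0 - score_percent) / 10.0, 1)

errors = {model: wrong_per_ten(score) for model, score in accuracy.items()}

for model, wrong in errors.items():
    print(f"{model}: about {wrong} wrong answers per 10 questions")
```

On these numbers, even the top scorer misses close to four questions in ten, and the lowest of the four listed here misses about five.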
AI accuracy peaks in math and conversions, but hits its lowest point in physics tasks
Performance varies by category. For math and conversions (147 out of 500 prompts), Gemini leads with 83%, followed by Grok with 76.9% and DeepSeek with 74.1%.
According to Euronews, ChatGPT scored 66.7% in this category, and the combined accuracy of all five models was 72.1%, the highest among the seven categories.
To guard against errors in any case, users are advised to use a calculator or to double-check the answer with another prompt.
Four big mistakes made by AI models
Experts categorized mistakes into four types and noted that the main challenge lies in converting real-world situations into correct formulas.
Calculation errors
In such cases, the AI may understand the question and formula, but fail during the actual calculation. This category includes precision and rounding issues (35%) and calculation errors (33%).
Broken logic errors
This type of error is one of the most serious, as it indicates that the AI is struggling to understand the underlying logic of the problem. It includes errors in methods or formulas, such as the use of incomplete mathematical approaches (14%), and incorrect assumptions, which account for up to 12% of errors.
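The percentages quoted for these first two error types can be added up to see how much of the total they cover. The sketch below is plain arithmetic on the figures in this article. The grouping names are illustrative, and the 12% for incorrect assumptions is treated as an upper bound, as stated above.

```python
# Error shares (percent of all errors) as quoted in this article.
calculation_shares = {
    "precision and rounding issues": 35,
    "calculation errors": 33,
}
logic_shares = {
    "method or formula errors": 14,
    "incorrect assumptions (upper bound)": 12,
}

calculation_total = sum(calculation_shares.values())
logic_total = sum(logic_shares.values())
covered = calculation_total + logic_total

print(f"Calculation-related errors: {calculation_total}%")
print(f"Logic-related errors: up to {logic_total}%")
print(f"Share covered by these two types: up to {covered}%")
print(f"Share left for the other error types: at least {100 - covered}%")
```

Read this way, slips in the arithmetic itself make up roughly two thirds of all errors, well ahead of flawed reasoning.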
Misunderstanding instructions
Misreading instructions mainly occurs when the AI is unable to correctly interpret the question. Examples include using the wrong parameters, logic errors, and providing incomplete answers.
Question avoidance
The AI has been observed to simply refuse or sidestep a question rather than attempting a specific answer. Separately, rounding remains a weak spot, especially in multi-step calculations. If a rounding error occurs at any step, it carries through the rest of the calculation, and the final result can end up far from the correct value.
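The way rounding compounds across a multi-step calculation can be shown with a small Python sketch. The three-step problem here is hypothetical and is not taken from the study: split 1000 three ways, scale the result to 12 months, then add 7%. The helper `chain` is an illustrative name. Exact fractions are used so that the only source of error is the deliberate rounding.

```python
from fractions import Fraction

def chain(start, factors, round_each_step):
    """Multiply start by each factor in turn.

    If round_each_step is True, round to a whole number after every step;
    otherwise keep exact fractions and round only once at the end.
    """
    value = Fraction(start)
    for factor in factors:
        value = value * factor
        if round_each_step:
            value = Fraction(round(value))
    return round(value)

# Hypothetical problem: divide by 3, multiply by 12, then add 7%.
factors = [Fraction(1, 3), Fraction(12), Fraction(107, 100)]

rounded_once = chain(1000, factors, round_each_step=False)
rounded_every_step = chain(1000, factors, round_each_step=True)

print("Rounded once at the end:", rounded_once)
print("Rounded after every step:", rounded_every_step)
print("Difference:", rounded_once - rounded_every_step)
```

Rounding 333.33 down to 333 in the first step looks harmless, but the later multiplications enlarge that small loss, which is why carrying full precision until the final step is the safer habit.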
Notably, the study used the most advanced models that are freely available to the public.
The study concludes with the following insights: If you want accurate answers to difficult word problems, ChatGPT is great. Want to take a photo of your receipt or get an immediate response? Gemini wins. Finally, if you need speed and concise answers, Grok is a solid choice.
These results show that significant improvements are still needed before AI can deliver reliable mathematics and conversational logic.

