Generalist AI models top out at 77% accuracy in accounting

DualEntry, an AI-centered accounting ERP provider, tested some of the most popular AI models on a range of accounting workflows and found that their accuracy was 77.3% at best.


“Large language models are powerful drafting tools, but finance does not run on drafts; it runs on verified records,” said Santiago Nestares, co-founder of DualEntry. “Benchmarks show that AI can speed up accounting workflows, but without system-level controls and validation, errors can quickly cascade into financial reporting.”

The company tested 19 generalist AI models (ChatGPT, Claude, Gemini, etc.) on 101 accounting workflows that represent the core functionality of popular accounting systems. These include transaction classification, journal entries, accounts payable and receivable, bank reconciliation, financial reporting, month-end closing, and conceptual accounting knowledge. The workflows were distilled into a set of questions to pose to each AI model. When asked for an example, Ignacio Brasca, a staff software engineer who worked directly on the project, responded via email:

“Bright Ideas Marketing LLC received a bank transaction for $450 paid to Staples on March 15, 2025. Which account should this bank transaction be classified as? Please name the account and account type.” The actual question also has parenthetical instructions to guide the AI, which should answer something along the lines of “office supplies.”

The questions were designed around a provisioned chart of accounts and a minimal amount of context, enough to supply the information each question needed without overloading the prompt. Each benchmark ran in an isolated environment per organization, with no link to actual accounts within the system, and each run was agnostic of the others. Scoring was deterministic, so no “reasoning” was applied to the answers beyond simple binary pass/fail decisions. Each benchmark could be run multiple times.
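DualEntry has not published its scoring code, so the following Python sketch only illustrates what deterministic, binary scoring of a classification answer could look like. The `Classification` shape, the normalization rule, and the “Expense” account type are assumptions made for illustration, not details from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """Hypothetical answer shape: an account name plus an account type."""
    account: str
    account_type: str


def normalize(text: str) -> str:
    # Assumed rule: ignore case and extra whitespace when comparing.
    return " ".join(text.lower().split())


def score(expected: Classification, actual: Classification) -> int:
    """Return 1 only if both fields match after normalization, else 0."""
    same_account = normalize(expected.account) == normalize(actual.account)
    same_type = normalize(expected.account_type) == normalize(actual.account_type)
    return int(same_account and same_type)


def accuracy(results: list[int]) -> float:
    """Share of passed checks, as a percentage (0.0 for an empty run)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Illustrative data only; not taken from the benchmark.
expected = Classification("Office Supplies", "Expense")
answers = [
    Classification("office supplies", "expense"),     # matches
    Classification("Office Supplies", "Asset"),       # wrong account type
    Classification("Marketing", "Expense"),           # wrong account
    Classification(" Office  Supplies ", "Expense"),  # matches after normalization
]
results = [score(expected, a) for a in answers]
print(results, accuracy(results))
```

Because each check is a plain equality test, rerunning the same answers always yields the same score, which is the property the benchmark description emphasizes.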

The entire benchmark was task-oriented rather than trivia-based, giving the model the flexibility to perform actions such as “delegate_to_record_draft” and use the other tool systems expected of an agent. Additionally, a model can be run through DualEntry’s full CoPilot agent, allowing it to do things like call up accurate charts of accounts, create journals and draft invoices, and generate structured output.

“Essentially, this model is not doing the calculations, it’s doing the calculations before each test run using the tools you bring in during setup,” Brasca said in an email.

What they discovered was that the big generalist models weren’t very good at accounting. OpenAI’s ChatGPT 5.4 received the highest score with 77.3% accuracy, followed by Gemini 3.1 Pro at 66% and Z.ai’s GLM-5. Most models had an accuracy of less than 65%, and older models such as GPT-4 reached only 19.8%.

The test also found that while no model is particularly good at accounting, there are clear strengths and weaknesses. Most models scored very high on recalling information, such as answering GAAP/IFRS questions. When it came to actually creating structured records, however, scores dropped significantly.

“The most interesting split we saw: A model can score 92% on transaction classification (selecting the right account for bank charges) but drop to 30-40% on creating journal entries. Creating journal entries requires building multi-line entries with exact debits and credits. Classification is pattern matching, while record creation is constrained, structured inference. Bank reconciliation is another one. Models that are good at arithmetic tend to do well (more than 90% of the time), while those that aren’t tend to fail badly by ‘hallucinating’ intermediate steps or skipping in-transit deposit adjustments,” Brasca said, adding that he was surprised that many AI models were particularly bad at such tasks.
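To show why a multi-line entry with exact debits and credits is a stricter target than picking an account, here is a minimal Python sketch of the kind of checks a journal entry has to pass. The `Line` type, the `validate_entry` function, and the sample chart of accounts are hypothetical and are not DualEntry’s implementation.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    """Hypothetical journal line: one account, a debit or a credit."""
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


def validate_entry(lines: list[Line], chart: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if len(lines) < 2:
        problems.append("entry needs at least two lines")
    for i, line in enumerate(lines):
        if line.account not in chart:
            problems.append(f"line {i}: unknown account {line.account!r}")
        if line.debit < 0 or line.credit < 0:
            problems.append(f"line {i}: negative amount")
        # Exactly one side of each line must be non-zero.
        if (line.debit > 0) == (line.credit > 0):
            problems.append(f"line {i}: needs exactly one of debit or credit")
    total_debit = sum((l.debit for l in lines), Decimal("0"))
    total_credit = sum((l.credit for l in lines), Decimal("0"))
    if total_debit != total_credit:
        problems.append(f"unbalanced: debits {total_debit} != credits {total_credit}")
    return problems


# Illustrative chart and entries; the account names are assumptions.
chart = {"Office Supplies", "Cash"}
good = [
    Line("Office Supplies", debit=Decimal("450.00")),
    Line("Cash", credit=Decimal("450.00")),
]
bad = [
    Line("Office Supplies", debit=Decimal("450.00")),
    Line("Cash", credit=Decimal("405.00")),  # transposed digits
]
print(validate_entry(good, chart))
print(validate_entry(bad, chart))
```

A classification answer can only be wrong in one way, the account chosen. An entry like this can fail on the account, the side, the amount, or the balance, so a single transposed digit is enough to reject it.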

When asked why the models performed so poorly, he said a lack of domain context was a contributing factor, as typical models are trained on broad internet data rather than given deeper exposure to accounting standards, workflows, and edge cases. He also noted that generalist models have limited access to external tools and data, in contrast to specialized business and accounting AI (such as the AI offered by DualEntry), which is often integrated with databases, calculators, and search systems rather than relying solely on training data. Third, dedicated systems are typically fine-tuned on financial datasets and real-world accounting scenarios, giving them distinct advantages for these specialized tasks.

The results may be sobering for the 82% of people who said in a recent poll that they trust AI to provide financial advice and guidance. Additionally, nearly one in two respondents believe that AI is better than all the people in their life when seeking financial information and guidance.

Brasca said the point was not to crown the “best model.” Rather, the company wanted to understand where models actually succeed and fail when asked to perform real accounting workflows, and to offer some transparency so that their suitability for accounting work can be better assessed.

“Most public benchmarks test general reasoning and knowledge questions, which is very different from how accounting software actually works. Under the hood of an ERP, the model doesn’t just create text; it has to create structured financial records, such as journal entries, invoices, and reconciliations with correct accounts, amounts, and line items. So we built a benchmark that reflects how an accounting copilot works in practice,” he said.
