Whether artificial intelligence is reliable in the enterprise comes down to whether it can do real professional work to the standards of trained professionals.
That is the bar enterprise chief financial officers set when weighing productivity, cost savings and return on investment. Finance chiefs are under pressure to scrutinize every AI dollar and demand evidence that projects will move beyond experiments into measurable economic value. GDPval, a benchmark introduced by OpenAI, offers a concrete step in that direction by showing where AI is moving from experimentation to economic value.
GDPval is one of the first large-scale attempts to measure whether frontier AI models can perform professional-grade tasks. It evaluates leading AI models on 1,320 tasks drawn from real work across 44 occupations in nine industries that together account for $3 trillion in U.S. wages. These are not puzzles or trivia tests. They are professional deliverables such as financial forecasts, healthcare case analyses, legal memos and sales presentations. On average, human experts needed seven hours to complete each task, at an estimated cost of nearly $400.
What the benchmark shows
When judged blind against expert output, the leading models showed near-parity with human work. Claude Opus 4.1 produced deliverables rated as good as or better than the expert's in 47.6% of cases, standing out especially on aesthetics such as slide layouts. GPT-5 was strongest at following instructions and handling calculations accurately.
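A model's score in a blinded pairwise evaluation of this kind boils down to a win-or-tie rate over many judgments. The sketch below, using hypothetical judgment labels rather than GDPval's actual grading pipeline, shows the arithmetic behind a figure like 47.6%.

```python
from collections import Counter

def win_or_tie_rate(judgments):
    """Fraction of blinded pairwise comparisons in which the model's
    deliverable was rated as good as or better than the expert's
    (wins plus ties, divided by all judgments)."""
    counts = Counter(judgments)  # labels: "model", "expert", "tie"
    total = sum(counts.values())
    return (counts["model"] + counts["tie"]) / total if total else 0.0

# Hypothetical sample of 10 blinded judgments for one model
sample = ["model", "tie", "expert", "model", "expert",
          "tie", "expert", "model", "expert", "expert"]
print(f"win-or-tie rate: {win_or_tie_rate(sample):.1%}")  # → 50.0%
```

Because graders see the two deliverables without knowing which came from the model, a rate near 50% means reviewers could barely tell the model's work from the expert's.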
Pairing AI with human oversight also produced measurable returns. In scenarios where experts reviewed and edited AI output, tasks were completed 1.1 to 1.6 times faster and cheaper than when humans worked alone. Model-only work still fell short of expert-level consistency on average, but hybrid settings delivered quality gains of up to 30%.
The benchmark also revealed variation across industries. Performance was strongest in finance and professional services tasks, where structured data and well-defined deliverables dominate, and weaker in healthcare and education, where nuance and contextual judgment matter more.
Where leaders see rewards
This evidence matches PYMNTS reporting on how companies are beginning to restructure their workflows. In the CAIO Report, 98% of leaders said they expect generative AI to streamline their workflows, up from 70% last year. Nearly as many (95%) predict sharper decision-making. Similarly, healthcare is showing measurable ROI in early AI deployments for billing and coding, but executives consistently cite accuracy and accountability as gating factors.
External research supports the trajectory. A National Bureau of Economic Research study found that giving customer service agents access to generative AI increased productivity by an average of 14%, including a 34% improvement for junior staff, who saw the biggest gains. Meanwhile, McKinsey's analysis continues to place the economic benefits of generative AI in a similar range, estimating the technology could unlock $2.6 trillion to $4.4 trillion per year across 63 use cases.
Blind spots that require oversight
GDPval also highlights where AI still falls short. The most common failure mode across models was not following instructions. GPT-5's mistakes were often cosmetic, such as formatting glitches and overly verbose output, but about 3% of failures were catastrophic, meaning they could cause serious harm if deployed without monitoring, such as giving false medical advice or defaming a client. The study notes that even as models approach professional-level performance on many tasks, these errors remain a limiting factor.
This mirrors PYMNTS coverage of AI "hallucinations." In compliance and payments contexts, fabricated data and misstatements can quickly become regulatory minefields. Yet the trend shows steady improvement, closing gaps that each prior generation of models was once thought unable to bridge.
