OpenAI has announced a new evaluation framework, GDPval, to measure the performance of artificial intelligence on economically valuable tasks. The benchmark tests models on 1,320 real-world work tasks, bridging the gap between academic benchmarks and practical applications.
The GDPval framework evaluates how AI models handle 1,320 tasks drawn from 44 occupations. These are primarily knowledge-work occupations, selected from industries that each contribute more than 5% to US GDP. To compile the list of occupations, OpenAI used data from the US Bureau of Labor Statistics (BLS) and the Department of Labor's O*NET database as of May 2024. The resulting selection includes occupations frequently associated with AI, such as software engineers, lawyers, and video editors. It also extends to occupations less commonly discussed in an AI context, including detectives, pharmacists, and social workers, providing a broader assessment of potential economic impact.
According to the company, the tasks in the benchmark were created by professionals with an average of 14 years of experience in their fields. The tasks were designed to reflect “actual work products such as legal briefs, engineering blueprints, customer support conversations, nursing plans, and more.” OpenAI noted that GDPval's breadth across many tasks and occupations distinguishes it from other economically oriented benchmarks, which tend to concentrate on a single domain such as software engineering. The evaluation design goes beyond simple text prompts: models are given reference files to work from and are asked to produce multimodal deliverables such as presentation slides and formatted documents. This approach is intended to simulate how professionals actually interact with the technology at work. According to OpenAI, this realism makes GDPval a stronger test of how models can support experts.
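To make that design concrete, the sketch below shows what a GDPval-style task specification could look like in code. The field names, occupation, and file names are purely illustrative assumptions; OpenAI has not published this schema.

```python
# Hypothetical sketch of a GDPval-style task specification.
# Field names and example values are illustrative assumptions,
# not OpenAI's actual schema.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    occupation: str          # one of the 44 occupations
    sector: str              # the GDP-contributing industry
    prompt: str              # the work request, as an expert would phrase it
    reference_files: list[str] = field(default_factory=list)  # inputs the model must use
    deliverable: str = "document"  # e.g. "document", "slides", "spreadsheet"


task = TaskSpec(
    occupation="Lawyer",
    sector="Professional Services",
    prompt="Draft a legal brief responding to the attached motion to dismiss.",
    reference_files=["motion_to_dismiss.pdf", "case_timeline.docx"],
    deliverable="document",
)
```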
In the accompanying study, OpenAI used the GDPval framework to evaluate output from several of its own models, including GPT-4o, o4-mini, o3, and the most recent GPT-5. The evaluation also covered Anthropic's Claude Opus 4.1, Google's Gemini 2.5 Pro, and xAI's Grok 4. The core of the grading process involved experienced industry experts performing blind assessments of model output. These human graders compared AI-generated deliverables against work produced by human experts without knowing the origin of either, providing a direct quality benchmark.
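The blind pairwise setup can be sketched in a few lines: a judge sees two anonymized deliverables in random order and picks the better one, and the results aggregate into a win rate. This is a minimal reconstruction of the general technique, not OpenAI's actual grading pipeline.

```python
import random


def blind_comparison(model_output: str, expert_output: str, judge) -> bool:
    """Present two deliverables in random order, without provenance,
    and return True if the judge prefers the model's output."""
    pair = [("model", model_output), ("expert", expert_output)]
    random.shuffle(pair)  # hide which side is which
    # judge() sees only the two anonymized texts and returns 0 or 1
    choice = judge(pair[0][1], pair[1][1])
    return pair[choice][0] == "model"


def win_rate(results: list[bool]) -> float:
    """Fraction of comparisons in which the model's deliverable won."""
    return sum(results) / len(results)


# Example with a trivial length-based judge, for demonstration only
results = [blind_comparison("model draft", "expert draft",
                            judge=lambda a, b: 0 if len(a) >= len(b) else 1)
           for _ in range(10)]
print(f"model win rate: {win_rate(results):.0%}")
```

Randomizing the order before judging is what keeps the comparison blind: the judge never learns which deliverable came from the model.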
To complement this human-driven process, OpenAI developed an “autograder,” an AI system designed to predict how a human evaluator would score a given deliverable. The company said it intends to release the autograder as an experimental research tool for others to use. However, OpenAI cautioned that the autograder is not yet as reliable as human graders, and confirmed that the tool is not intended to replace human evaluation in the near future, given the nuanced judgment required to assess high-quality professional work.
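An autograder of this kind is typically just another model prompted to predict the human judgment. The sketch below uses the OpenAI Python SDK as one plausible shape for such a tool; the prompt wording and the model name are assumptions, and this is not the experimental autograder OpenAI described.

```python
# Minimal autograder sketch: an LLM predicts which of two deliverables
# a human expert grader would prefer. Prompt wording and model choice
# are assumptions for illustration, not OpenAI's released autograder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_preference(deliverable_a: str, deliverable_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # assumed model name
        messages=[
            {"role": "system",
             "content": "You are grading two anonymized work deliverables. "
                        "Answer with exactly 'A' or 'B' for the one a human "
                        "industry expert would judge higher quality."},
            {"role": "user",
             "content": f"Deliverable A:\n{deliverable_a}\n\n"
                        f"Deliverable B:\n{deliverable_b}"},
        ],
    )
    return response.choices[0].message.content.strip()
```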
The initial findings from GDPval testing show that current frontier AI is approaching the quality standards of human experts. “We've seen that today's best frontier models are already approaching the quality of work produced by industry experts,” OpenAI writes. Among the models tested, Anthropic's Claude Opus 4.1 was the best overall performer. Its particular strength was in aesthetics-related tasks, including professional document formatting and clear, effective layout of presentation slides. These qualities are often critical for client-facing material and effective communication in business contexts.
While Claude Opus 4.1 excelled at presentation, OpenAI's GPT-5 stood out for accuracy. This was particularly evident on tasks that require domain-specific knowledge to be found and applied correctly. The study also highlighted the rapid pace of model improvement: performance on GDPval tasks more than doubled from GPT-4o (released in spring 2024) to GPT-5 (released in summer 2025). Such a gain over a relatively short period points to a marked acceleration in the underlying AI technology.
The assessment also included an analysis of efficiency. OpenAI reported that frontier models can complete GDPval tasks roughly 100 times faster and 100 times cheaper than industry experts. The company quickly qualified this finding with a significant caveat: the numbers reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required to use the models in a real workplace. In other words, the calculation excludes the substantial time and cost of managing, refining, and implementing AI-generated work in practical business workflows.
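The 100x figures are simple ratios of expert time and cost to model inference time and API cost. A back-of-the-envelope calculation with assumed inputs shows how such ratios fall out:

```python
# Back-of-the-envelope check of how a "100x faster, 100x cheaper" ratio
# is computed. All input numbers are illustrative assumptions, not
# figures from the GDPval study.
expert_hours = 7.0            # assumed expert time per task
expert_hourly_rate = 100.0    # assumed expert billing rate, USD

model_minutes = 4.2           # assumed model inference time per task
api_cost = 7.0                # assumed API cost per task, USD

speedup = (expert_hours * 60) / model_minutes                # 420 / 4.2 = 100x
cost_ratio = (expert_hours * expert_hourly_rate) / api_cost  # 700 / 7 = 100x

print(f"~{speedup:.0f}x faster, ~{cost_ratio:.0f}x cheaper")
# Note: as OpenAI's caveat says, this excludes human oversight,
# iteration, and integration time.
```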
OpenAI acknowledged major limitations of the current version of the GDPval framework, describing it as “an early step that does not reflect the full nuances of many economic challenges.” One major constraint is the use of one-shot evaluations: the framework cannot measure a model's ability to handle iterative work, such as producing multiple drafts of a project. For example, the current tests cannot evaluate whether a model can successfully revise a legal brief based on client feedback, or redo a data analysis to account for newly discovered anomalies.
A further limitation the company pointed out is that professional work is rarely a tidy process with organized files and clear directives. The current framework cannot capture the more complex, unstructured aspects of many jobs, including interpersonal collaboration and the deeply contextual work of exploring problems through conversation, navigating ambiguity, and adapting to changing situations. These elements are often central to professional roles but are difficult to replicate in a standardized testing environment. OpenAI added that most jobs are more than just a collection of tasks that can be written down.
The company stated its intention to address these limitations in future iterations of the framework. Plans include expanding its scope to harder tasks across more industries, and developing evaluations of interactive workflows in which a model must engage in back-and-forth with the user. As part of this expansion, OpenAI is releasing a subset of GDPval tasks that researchers can use in their own work.
OpenAI's conclusion from these results is that AI will inevitably continue to reshape the job market. The company's position is that AI can take on routine “busywork,” freeing human workers to focus on more complex and strategic tasks. This perspective frames AI as a tool that augments human productivity rather than one that purely replaces it. OpenAI writes that, particularly on the subset of tasks where models are strongest, handing work to a model first and then refining it together with a human is expected to save time and money.
Alongside these findings, the company reiterated its stated commitment to its broader mission. This includes plans to democratize access to AI tools and efforts to “build a system that supports workers through change and rewards wider contributions.” The company concluded that its goal is to keep everyone on the AI “elevator.”