
The authors demonstrate their system using a dataset that has received little coverage so far: the 2026 FIFA World Cup schedule. From the schedule and the host cities, the system generates a climate-focused article with interactive maps.
Roughly four out of 10 games are scheduled in locations that the players’ association FIFPRO classifies as having a very high heat risk, with humidity rather than temperature being the main factor. The authors emphasize that these figures describe typical weather conditions and are not forecasts for the actual tournament.

“Inspector” panel makes all claims traceable
The core feature of the system is the Inspector, a panel that displays structured evidence for each sentence and asset. Every annotated sentence, chart, and interactive element has its own card showing the exact line of code (and the data file behind it) or an external URL that supports the claim.

As a result, 93% of all displayed statements can be traced to their origin. The researchers stress that this does not mean the statements are correct, only that they can be verified. Doubt the numbers? Run the code. The baseline for articles written by humans is 25%, in part because journalists rarely publish their analysis code. The researchers argue that this gap reflects both shortcomings in journalistic practice and strengths of the system.
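To make the idea of a per-claim evidence card concrete, here is a minimal Python sketch. The names `EvidenceCard`, `is_traceable`, and `traceable_share`, along with the example claims, file names, and URL, are hypothetical illustrations and are not taken from Data2Story's code. The sketch assumes a claim counts as traceable when it points either to a code line plus its data file or to an external URL; it says nothing about whether the claim is correct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceCard:
    """Hypothetical evidence record for one displayed claim."""
    claim: str
    code_file: Optional[str] = None   # script that produced the number
    code_line: Optional[int] = None   # exact line in that script
    data_file: Optional[str] = None   # data file the code reads
    url: Optional[str] = None         # external source, if any

    def is_traceable(self) -> bool:
        # Assumption: traceable means code + line + data file, or a URL.
        has_code = (
            self.code_file is not None
            and self.code_line is not None
            and self.data_file is not None
        )
        return has_code or self.url is not None


def traceable_share(cards: list[EvidenceCard]) -> float:
    """Fraction of claims with code-or-URL evidence (0.0 for an empty list)."""
    if not cards:
        return 0.0
    return sum(card.is_traceable() for card in cards) / len(cards)


# Made-up example claims, for illustration only.
cards = [
    EvidenceCard("Claim backed by code", "analysis.py", 42, "matches.csv"),
    EvidenceCard("Claim backed by a link", url="https://example.org/source"),
    EvidenceCard("Claim with no evidence"),
    EvidenceCard("Another code-backed claim", "analysis.py", 57, "matches.csv"),
]
share = traceable_share(cards)
print(f"{share:.0%} of claims are traceable")
```

A share computed this way measures verifiability only, which matches the researchers' caveat: a traceable claim can still be wrong, but a reader can check it.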
7 agents, 1 editorial workflow
Behind each story is a chain of seven specialized agents, which the team calls a “virtual newsroom.” Because tables rarely tell the whole story, the “detective” runs web searches for context. For the World Cup data, it links the host cities to FIFPRO’s heat risk assessment and Open-Meteo’s climate data.
The “analyst” runs code instead of guessing numbers. The “editor” chooses which findings drive the story. The “designer” selects the appropriate medium, such as a geographic map or a musical audio clip. The “programmer” builds the HTML pages, the “auditor” checks the layout for errors, and the “inspector” ties everything back to the source.
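The chain of seven roles can be pictured as a simple pipeline. The following Python sketch is only an illustration: `make_stage`, `run_pipeline`, the dictionary layout, and the dataset name are hypothetical, each stage is a placeholder that records what it would do, and the strictly linear hand-off is an assumption rather than a description of how Data2Story actually orchestrates its agents.

```python
from typing import Callable

# Shared state handed from one agent to the next (a plain dict in this sketch).
Story = dict


def make_stage(role: str, task: str) -> Callable[[Story], Story]:
    """Build a placeholder stage that only logs which role ran."""
    def stage(story: Story) -> Story:
        # A real agent would call a model or execute code here.
        return {**story, "log": story["log"] + [f"{role}: {task}"]}
    return stage


# Roles and tasks as described in the article, in the order they are listed.
PIPELINE = [
    make_stage("detective", "search the web for context"),
    make_stage("analyst", "run code to compute the numbers"),
    make_stage("editor", "choose the findings that drive the story"),
    make_stage("designer", "pick a medium for each finding"),
    make_stage("programmer", "build the HTML page"),
    make_stage("auditor", "check the layout for errors"),
    make_stage("inspector", "tie every claim back to its source"),
]


def run_pipeline(dataset_name: str) -> Story:
    """Pass an initially empty story through every stage in order."""
    story: Story = {"dataset": dataset_name, "log": []}
    for stage in PIPELINE:
        story = stage(story)
    return story


result = run_pipeline("world_cup_2026_schedule")  # hypothetical dataset name
for entry in result["log"]:
    print(entry)
```

Keeping every stage behind the same function signature means a stage could be swapped or reordered without touching the others, which is one plausible reason to structure such a workflow as a chain.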

The base model is Claude Opus 4.7 running in Claude Code. For images, video, and audio, the system calls models through OpenRouter, such as gpt-5.4-image-2, seedance-2.0, and lyria-3-pro-preview.
53 readers rate agent articles higher than human originals
The researchers paired 18 public datasets with the matching human-written originals from three different sources. They used concise briefings from The Economist, lavishly designed long-form articles from The Pudding, and community datasets from Tidy Tuesday. Fifty-three recruited readers rated both versions in five categories: visual design, narrative rhythm, data transparency, verifiability of claims, and insights gained.
Data2Story won all five categories. The biggest lead was in transparency, at +1.49 points on a 7-point scale. Overall, 74 percent preferred the agent article, 25 percent preferred the human article, and 2 percent called it a tie.
The picture varies by source. The agents clearly won on the data-heavy Economist briefings and the Tidy Tuesday pieces. For The Pudding’s articles, which often take design teams weeks to create, the result was a statistical tie: the agents could not beat a handcrafted presentation.

When the researchers measured which findings from the human-written articles also appear in the agent-written ones, Data2Story covered about half. Conversely, only 35% of the agents’ statements also appear in the human texts.
The agents add many perspectives of their own but only partially capture the editorial core. The overlap is largest for the brief, formulaic Economist briefings, where the agents reproduce 73 percent of the human findings. This is probably because these texts stay very close to the standard statistics that the agents calculate.
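The two overlap figures run in opposite directions, which is easy to mix up. The short Python sketch below shows the arithmetic on made-up data. The function name `coverage` and the letter-coded findings are hypothetical, and exact matching of identical labels is a simplifying assumption: matching real findings across two articles would need some form of semantic comparison, which is not described here.

```python
def coverage(reference: set[str], candidate: set[str]) -> float:
    """Share of `reference` findings that also appear in `candidate`."""
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


# Made-up findings, coded as letters for illustration only.
human = {"A", "B", "C", "D"}
agent = {"A", "B", "E", "F", "G", "H"}

human_covered = coverage(human, agent)  # 2 of 4 human findings
agent_shared = coverage(agent, human)   # 2 of 6 agent findings

print(f"Human findings also in agent article: {human_covered:.0%}")
print(f"Agent findings also in human article: {agent_shared:.0%}")
```

Because the denominators differ, the same shared findings produce two different percentages. An agent article that says more than the human one will show a lower share in the second direction even when it covers the same core.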
Where humans can still win
The researchers flag three areas where human authors continue to lead. On the editorial side, reporters explain what the data cannot. The Repair Cafe report found that repair rates are low because phone, car, and tractor manufacturers intentionally block access to diagnostic tools and parts. That explanation rests on reporting, not on the data. The agent shows what breaks, but the “why” remains hidden.

In terms of creative design, The Pudding’s piece on stand-up comedy turns the complete transcript of an Ali Wong show into a user interface. Next to each line is a circle sized to match the length of the laugh. For the same content, the agent simply embeds a static YouTube thumbnail.

When it comes to dense graphics, The Economist’s visualization of the space race layers government and private providers, success rates, and annotations into a single image. The agent spreads the same data across several charts, and the point is lost.

Collaborator, not replacement
The authors frame Data2Story as a newsroom tool. Humans provide perspective and reporting, and agents handle calculations, graphics, and machine-verifiable sourcing.
This may prove most useful for topics that newsrooms cannot cover for lack of capacity, or for niche datasets that would otherwise never become readable articles. One limitation is that Data2Story currently runs on full autopilot; a version with human feedback is left for future work. The project site is at data2story.github.io, and the code is on GitHub.
Machine verifiability is exactly where current AI systems continue to stumble. A recent Peking University benchmark found that leading models often gave the correct answer in document analysis but cited the wrong sources, a problem researchers called the “attribution illusion.”
Other research suggests that AI search agents do little actual research and mostly confirm what they already know from training. Data2Story attempts to close this gap by having the analyst calculate numbers with executable code instead of guessing, and by having the inspector link every statement to its source. Perplexity follows a similar philosophy with “search as code,” where models write their own web searches instead of calling black-box APIs.
