A new report from fact-checking service NewsGuard finds that generative AI systems struggle to distinguish truth from falsehood in real time.
Getty
The world's leading chatbots are handling more inquiries than ever before, but their accuracy has declined dramatically. NewsGuard, an online news fact-checking service, conducted an audit that found leading chatbots repeated false news claims 35% of the time in August 2025, up from 18% in 2024.
The push for instant responses has exposed a fundamental weakness: chatbots now pull information from an online environment filled with low-quality content, fabricated news and deceptive advertising.
“Instead of acknowledging limitations, citing data cutoffs or declining to weigh in on sensitive topics, models are drawing from a polluted online ecosystem,” NewsGuard spokesperson McKenzie Sadeghi wrote in an email exchange. “The result is authoritative-sounding but inaccurate responses.”
AI chatbot responsiveness is up – accuracy is down
The shift represents a fundamental breakdown in how these systems operate. The audit shows that large language models that used to decline certain questions now present inaccurate information, fail to identify authentic news reports and draw their answers from unreliable sources.
In addition, the models refused to answer 0% of current-events questions in August 2025, compared with 31% of such questions a year earlier. The result is more confidently stated false information, because the tested AI models are now willing to answer every question, including those for which they have no reliable answer.
The former best-in-class AI
Perplexity, one of the previous year's top performers, experienced the biggest performance decline. In NewsGuard's 2024 debunking test, Perplexity achieved a 100% success rate, but in this year's audit the system failed to answer correctly in almost half of all attempts.
Sadeghi said the cause is not entirely clear. “[Perplexity’s] Reddit forum has been filled with complaints about the decline in the chatbot's reliability. In an August 2025 column, technology analyst Derrick David pointed to Perplexity's loss of influential power users, subscription fatigue, audience inflation from bundled deals and competitive pressures. However, it is difficult to say whether these factors influenced the reliability of the model.”
The system's performance has dropped sharply. The audit cites an example in which Perplexity offered a story that had been flagged as debunked on a fact-checking site as one of its sources “verifying” a fabricated story about Ukrainian President Volodymyr Zelensky's billion-dollar real estate holdings. It also presented a valid fact check that disproves the claim, but framed it as one of several perspectives rather than as an authoritative source.
Sadeghi said that this false equivalence is part of a broader search problem. “Perplexity cites both false sources and reliable fact checks and treats them as equivalent,” she said. “Chatbots continue to give equal weight to propaganda outlets and reliable sources.”
Scoring popular AI models – for better and for worse
The audit reveals chatbot-specific performance data for the first time, and NewsGuard explained why it waited a year to disclose it.
For the first time, NewsGuard has published full scoring results for all 10 chatbots it tested. Previously, the organization released only general rankings rather than scores for specific models. Researchers needed time to collect enough data for model-level scores to be meaningful.
“A one-off score doesn't give the big picture,” Sadeghi said. “A company can point to one strong result from a single month to boost its reputation and promote its progress when the big picture is more complicated.”
The 12-month audit period, which spanned multiple model updates and testing in the US, Germany, Moldova and other countries, shows a clear trend.
Accuracy scores for the 10 most popular large language models.
Used with permission: NewsGuard
Some models are learning. Others aren't.
The two top-performing models, Claude and Gemini, shared a common behavior during the audit: restraint when providing answers. Both were more likely to recognize when trustworthy sources were lacking, and they avoided repeating false information when reliable information was not available.
“That can reduce responsiveness in some cases,” Sadeghi said.
Propaganda laundering keeps getting smarter – AI models can't keep up
The NewsGuard findings support what many in the AI safety community have suspected: Russian state-linked disinformation networks such as Storm-1516 and Pravda are building large content farms designed to poison AI systems rather than to reach people.
It's working.
The audit shows that Mistral's Le Chat, Microsoft's Copilot, Meta's Llama and others all repeated fake stories initially planted by these networks, often citing fake news articles and low-engagement social media posts on platforms like VK and Telegram.
“It shows how adaptive and persistent foreign influence operations are,” Sadeghi said. “If a model stops citing a particular domain, content from the same network can resurface through different channels.”
Laundering is more than just domain hopping. It also involves seeding narratives. “That means the same story can appear simultaneously on dozens of different websites and social media posts, in the form of photographs, videos and text, echoed by aligned actors.”
Volume is not verification, but AI models struggle to tell the difference
Even if a false claim originates from a sanctioned disinformation actor, it can trick a model if it is spread widely enough. That is the current blind spot: chatbots still struggle to detect the laundering of narratives across platforms and formats.
Sadeghi warns that without better evaluation and weighting of sources, and new ways to detect coordinated falsehoods, AI systems will remain at risk. “Taking action against a single site or one category of source does not solve the problem, because the same false claims persist in multiple forms.”
For now, real-time truth appears to remain elusive, even as AI companies compete to improve the reliability of real-time search results.

