According to a large British study, large language model (LLM) AI medical assistants cannot reliably help the public identify health conditions or decide when to seek treatment, highlighting a significant gap between technological performance and real-world safety.
Why AI medical assistants are attracting attention
Interest in AI medical assistants is growing as health systems face workforce shortages and demand for easily accessible advice increases. Large language models now achieve near-perfect scores on medical licensure-style exams, raising hopes that they can support patients beyond the clinical setting. However, whether expert-level knowledge translates into safe and understandable guidance for non-experts remains uncertain. To address this, researchers investigated whether LLM medical assistants actually improve how the public interprets symptoms and chooses appropriate actions, compared with their usual sources of information.
Testing LLMs as medical assistants with real users
In the prospectively registered, randomized study, 1,298 adults in the UK were asked to respond to one of 10 medical scenarios developed by doctors. Participants identified potential underlying health conditions and selected a recommended course of action on a five-point scale, ranging from staying home to calling an ambulance. They were randomly assigned to receive assistance from GPT-4o, Llama 3, or Command R+, or to use their preferred source of information.
When tested alone, the models performed well, correctly identifying the relevant condition in 94.9% of cases and the correct disposition in an average of 56.3%. However, when participants used the same tools, performance dropped sharply: users identified the relevant condition in less than 34.5% of cases and selected the correct disposition in less than 44.2%, no better than the control group. Despite interacting freely with the models, participants often provided incomplete information or misunderstood the responses. Even when the underlying model could produce the correct answer, participants' decision-making showed no significant improvement.
Implications for clinical practice and deployment
This finding raises important concerns about deploying AI medical assistants directly to the general public. Neither high benchmark scores nor simulated-patient tests predicted how much performance would degrade once real people interacted with these systems. This suggests that, in practice, unsupervised use may not improve safety and may instead create a false sense of security. The authors argue that future development must prioritize human-centered design, clearer communication, and rigorous user testing with diverse populations. Before LLM medical assistants are deployed at scale, health systems need evidence that they improve understanding and decision-making, not just technical accuracy.
Reference
Bean A, et al. Confidence of the LLM as a medical assistant in the general public: a randomized, preregistered study. Nat Med. 2026. DOI: 10.1038/s41591-025-04074-y.
