A large language model outperformed physicians on a variety of clinical reasoning tasks, according to a new study. However, the study’s authors cautioned that this result does not mean AI tools are ready to practice medicine autonomously.
Since LLMs began proliferating across healthcare settings in late 2022, whether AI tools can accurately perform clinical reasoning tasks has become a top concern. In general, research shows that LLMs’ clinical reasoning abilities are improving, but the models still struggle with certain tasks and should remain under human supervision.
However, few studies have compared the clinical reasoning abilities of advanced LLMs against the baseline performance of human physicians. So researchers at Harvard Medical School and Beth Israel Deaconess Medical Center set out to establish those baselines and evaluate an LLM’s performance against them in a new study.
The researchers evaluated the clinical reasoning capabilities of OpenAI’s o1 model series, comparing the AI model’s performance with that of hundreds of physicians across a variety of experiments. The experiments drew on publicly available patient records, evaluations of newly arrived emergency room patients, and clinical tasks including diagnosis and clinical management planning.
Overall, the AI model outperformed physicians across the experiments, including one using real, unstructured clinical data from an emergency department EHR. In the ER experiment, the model was presented with patient information at successive points in the diagnostic process, from triage through the admission decision, and asked to generate a likely diagnosis and treatment plan at each step. Overall, o1 outperformed both GPT-4o and two expert attending physicians when evaluated by two other attending physicians.
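To give a concrete sense of what such a staged evaluation might look like, here is a minimal sketch that feeds an LLM progressively more of a patient record and requests a differential diagnosis at each stage. The clinical snippets, prompts, and staging are hypothetical illustrations, not the study’s actual protocol or data.

```python
# Minimal sketch of a staged diagnostic evaluation, assuming the OpenAI
# Python client. The clinical details and prompts below are hypothetical
# and do NOT reproduce the study's actual protocol or data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical information revealed at each stage of the ER workflow.
stages = [
    ("triage", "58-year-old with acute chest pain radiating to the left arm."),
    ("initial workup", "ECG shows ST elevation in leads II, III, and aVF."),
    ("labs", "Troponin I elevated at 2.3 ng/mL."),
]

history = ""
for stage, new_info in stages:
    history += f"\n[{stage}] {new_info}"
    response = client.chat.completions.create(
        model="o1-preview",  # model name as reported in the article
        messages=[{
            "role": "user",
            "content": (
                "You are assisting with diagnostic reasoning. Given the "
                f"information available so far:{history}\n\n"
                "Provide a ranked differential diagnosis and a next-step plan."
            ),
        }],
    )
    print(f"--- {stage} ---")
    print(response.choices[0].message.content)
```

In the study itself, outputs like these were graded by attending physicians; a sketch like this only shows how information could be staged from triage to admission.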
In another experiment, the researchers used five clinical vignettes to test the AI model’s ability to recommend next steps in clinical management. Using a mixed-effects model, they found that o1-preview scored 41 percentage points higher than GPT-4 alone, 41.9 percentage points higher than physicians using GPT-4, and 48.4 percentage points higher than physicians using conventional resources.
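For readers unfamiliar with this kind of analysis, the sketch below shows how a mixed-effects comparison of scores across conditions might be set up with statsmodels. The scores and condition labels are made-up placeholders for illustration only; they are not the study’s data or results.

```python
# Illustrative mixed-effects comparison across conditions using
# statsmodels. All scores below are fabricated placeholders; they are
# NOT the study's data and exist only to show the model setup.
import pandas as pd
import statsmodels.formula.api as smf

# Each row: one graded response (score in %) for one vignette ("case")
# under one condition. A random intercept per case accounts for
# vignettes differing in difficulty.
data = pd.DataFrame({
    "case": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "condition": ["o1", "gpt4", "physician"] * 5,
    "score": [92, 55, 48, 88, 49, 41, 95, 60, 50,
              90, 47, 39, 86, 52, 44],
})

# Fixed effect: condition; random intercept: case (vignette).
model = smf.mixedlm(
    "score ~ C(condition, Treatment('physician'))",
    data,
    groups=data["case"],
)
result = model.fit()
print(result.summary())  # coefficients estimate percentage-point gaps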
“Our findings suggest that the LLM outperforms physicians on most clinical reasoning benchmarks,” the researchers concluded.
Does clinical AI still require a human in the loop?
In short, the answer is yes.
The researchers noted the study’s limitations, including that it investigated only six facets of clinical reasoning, while they identified dozens of others that could significantly affect real-world clinical care and warrant further research.
They also emphasized that the study evaluated only text-based performance for both the humans and the AI. Clinical medicine, however, is multifaceted, involving a variety of non-textual inputs, such as auditory and visual information.
“Models may make accurate diagnoses, but they can also suggest unnecessary tests that could put patients at risk,” Peter Brodeur, a clinical research fellow at Harvard Medical School and Beth Israel Deaconess Medical Center and co-lead author of the study, said in a press release. “Humans should be the ultimate standard when evaluating performance and safety.”
The researchers noted that as AI models evolve, new testing and research approaches are needed, including new benchmarks, human-computer interaction studies, and prospective clinical trials.
“Models are becoming more and more capable,” Brodeur said in the press release. “We used to evaluate models with multiple-choice tests, but now they consistently score close to 100%, and we can no longer track their progress because the benchmarks are already maxed out.”
Anuja Vaidya has been covering the healthcare industry since 2012. She currently covers virtual healthcare, including telemedicine, remote patient monitoring, and digital therapeutics.
