AI diagnostic reasoning approaches physician performance

Advanced reasoning-based AI systems have shown physician-level performance in selected diagnostic tasks, but researchers warn that real-world safety, bias, and clinical liability remain major barriers to medical deployment.

Study: AI can reason like a doctor – what happens next? Image credit: Thandon88 / Shutterstock.com

A Perspective article recently published in Science investigates whether advanced artificial intelligence (AI) systems are approaching physician-level reasoning and considers the implications and safety of integrating them into clinical practice.

Advances in AI and diagnostic reasoning

Large language models (LLMs) are AI algorithms trained on large amounts of data to learn patterns that they use to generate human-like responses. Reasoning models extend these capabilities by evaluating possible approaches before generating a response, thereby mimicking structured cognitive processing.
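To make this concrete, the sketch below shows how a clinician-facing tool might query a reasoning model for a differential diagnosis through the OpenAI Python SDK. The prompt, case summary, and surrounding setup are illustrative assumptions; the article does not describe a specific implementation.

```python
# A minimal, hypothetical sketch of querying a reasoning model for a
# differential diagnosis via the OpenAI Python SDK. The prompt and the
# case summary are illustrative, not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_summary = (
    "58-year-old male, 2 hours of crushing substernal chest pain "
    "radiating to the left arm, diaphoresis, history of hypertension."
)

response = client.chat.completions.create(
    model="o1-preview",  # the reasoning model discussed in the article
    messages=[
        {
            "role": "user",
            "content": (
                "Provide a ranked differential diagnosis for this case, "
                "with a brief justification for each item:\n" + case_summary
            ),
        }
    ],
)

print(response.choices[0].message.content)
```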

Numerous studies have evaluated the medical applications of LLMs, including their performance on medical licensure examinations and other related assessments. These assessments often go beyond standard tests to include simulated clinical scenarios such as diagnostic case summaries, specialty-specific tests, and problem-solving tasks designed to approximate the clinical decision-making process.

Discussing the findings of Brodeur et al., the authors note that OpenAI's GPT-4 achieved accurate or near-accurate diagnoses in up to 73% of cases, and that the company's first reasoning model, o1-preview, outperformed it with 88.6% accuracy in clinicopathological cases.

Additionally, o1-preview produced accurate or near-accurate diagnoses in 67% of emergency department (ED) cases at initial triage, exceeding the accuracy of two expert physicians in certain text-based diagnostic scenarios.

Reasoning models have continued to develop since then; their reasoning ability, deliberation time, and capacity to process diverse inputs have all improved considerably. While o1-preview accepted only text input, recent models can handle combinations of text, images, audio, and video, allowing them to support more complex clinical assessments.
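As an illustration of what such a multimodal interface looks like, the following sketch combines a text prompt with an image input (for example, a chest X-ray) using the same SDK. The model name and the image URL are placeholders chosen for the example, not details from the study.

```python
# A minimal, hypothetical sketch of a multimodal query combining text and
# an image. The model name and the image URL are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # a multimodal model, assumed here for illustration
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe notable findings on this chest X-ray "
                            "and suggest a differential diagnosis.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cxr.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```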


How AI is being integrated into clinical practice

It is important to emphasize that AI systems are not proposed as a replacement for doctors. Rather, research in this area treats LLMs and other advanced models as collaborative tools, with clinicians retaining accountability, oversight, and context-specific judgment.

However, the authors also note that some well-defined medical tasks may eventually be performed more efficiently by AI systems operating independently. AI applications in healthcare have the potential to significantly reduce the human and economic costs associated with diagnostic errors, delays, and limited access.

The Medical Holistic Evaluation of Language Models (Med-HELM) framework defines five healthcare domains for the use of AI: administrative workflow, clinical record generation, clinical decision support, patient communication, and medical research support. Across these areas, AI is evolving to enable analysis of patient records, monitoring of clinical practice, and interaction with predictive models, thereby minimizing delays, reducing diagnostic errors, and improving access to care.

Nevertheless, it remains unclear whether advanced AI models will prove more effective when confined to specific tasks or when operating independently across healthcare. As clinicians increasingly incorporate AI tools into their practice, some already doing so without institutional oversight, randomized trials are urgently needed to establish whether these models improve real-world outcomes.

Mandatory clinical certification of AI models has also been proposed as a way to expand the role of AI in healthcare while ensuring transparency and accountability. The proposed path would see AI systems gradually evolve from medical knowledge assistants to supervised clinical practice and, potentially, broader autonomous responsibilities. Implementing a robust monitoring framework could complement these efforts by further supporting the safety, efficiency, and cost-effectiveness of AI clinical decision support systems.
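What such a monitoring framework might record is sketched below: a minimal, hypothetical audit log that pairs each model recommendation with the clinician's final decision, so that agreement, overrides, and safety signals can be reviewed over time. All names and fields are assumptions for illustration, not a published framework.

```python
# A minimal, hypothetical audit log for AI-assisted decisions. Each record
# stores the model's recommendation alongside the clinician's final call,
# enabling retrospective review of agreement and override rates.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class DecisionRecord:
    case_id: str
    model_name: str
    model_recommendation: str
    clinician_decision: str
    clinician_overrode_model: bool
    timestamp: float

def log_decision(record: DecisionRecord, path: str = "audit_log.jsonl") -> None:
    """Append one decision record as a JSON line for later review."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example usage with invented values:
log_decision(DecisionRecord(
    case_id="ED-2024-00137",
    model_name="o1-preview",
    model_recommendation="acute coronary syndrome",
    clinician_decision="acute coronary syndrome",
    clinician_overrode_model=False,
    timestamp=time.time(),
))
```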

Despite these efforts, AI has had limited real-world success, owing to benchmarks that poorly reflect clinical practice and to unclear clinical benefit. Although new multimodal systems can now integrate images, audio, and video, many medical AI assessments still focus on text-only tasks, limiting their ability to capture complex clinical decision-making.

The authors also highlight concerns surrounding the rapid deployment of consumer-facing medical AI systems. In one example, an independent evaluation found that publicly available health-focused AI tools failed to appropriately prioritize more than half of the emergency cases presented to them.

Beyond diagnostic accuracy, the Perspective highlights the need for clinical AI systems to demonstrate real-world effectiveness, fairness, safety, transparency, and accountability before they can be widely adopted. The authors also point out that earlier medical algorithms have exhibited racial bias, and that biased AI systems can negatively influence clinician decision-making.

Without robustly demonstrated efficacy, fairness, and safety, many AI systems will remain inadequate for clinical use.
