Autonomous medical AI outperforms doctors in simulated EHR cases

A new study shows how MIRA translates clinical reasoning into structured EHR actions and outperforms doctors in simulated emergency situations while highlighting necessary safeguards before autonomous AI reaches actual care.

MIRA is an autonomous medical AI agent that operates within an EHR sandbox and uses a set of tools to simulate clinical workflows. It can order tests, integrate results, and create diagnoses and treatment plans while interacting via chat with a patient AI agent based on the documented history of present illness (HPI) extracted from retrospective notes from real cases. Left: an example conversation between a patient and MIRA with a tool call in between. Right: the FHIR-based architecture that executes tool calls and records medical outputs. Note: The data shown here has been shortened and slightly modified to comply with dataset privacy restrictions.

In a recent study published in the journal Nature, researchers introduced MIRA, an autonomous artificial intelligence (AI) agent built to operate within a sandboxed electronic health record (EHR) environment.

Unlike previous implementations, which primarily consisted of task-specific chat applications, MIRA is designed to independently capture a patient’s medical history, order relevant diagnostic tests, and use the resulting data to formulate a diagnosis and treatment plan within a controlled simulation.

This study revealed that MIRA achieved a diagnostic accuracy of 88.9% across 574 MIMIC-IV cases and 87.8% in a corresponding physician comparison of 311 cases, significantly outperforming experienced human physicians under identical simulation conditions, and demonstrated strong, although not perfect, safety and guideline performance.

Background

Large language models (LLMs) have already proven highly capable of passing standardized medical exams and answering complex clinical questions. However, the researchers noted that translating this raw clinical knowledge into hospital operational workflows remains a major challenge.

This discrepancy can be attributed to the architectural design of traditional medical AI tools. These tools serve as narrow, task-specific search or text generation utilities rather than active partners in healthcare.

In contrast, true clinical decision-making is characterized as a complex, multi-step process in which physicians repeatedly interview patients, order blood and imaging tests, integrate conflicting results, and update hypotheses before arriving at a final treatment plan.

Additionally, nearly all of this clinical work takes place within electronic health record (EHR) systems, which rely on complex standardized coding protocols. Until now, it had not been shown that automated systems can reliably handle this end-to-end clinical action space in a realistic EHR-style environment without unacceptable errors.

About the study

This study aimed to address this capability gap by developing MIRA, a novel AI tool designed to autonomously ingest and access patient medical records, identify knowledge gaps, order diagnostic tests to complement EHR records, and use the completed dataset to recommend clinical interventions.

The study used a sandbox built on the HL7 Fast Healthcare Interoperability Resources (FHIR) standard. Testing was performed on cases from the Medical Information Mart for Intensive Care IV (MIMIC-IV) database.

The cases covered eight different diagnoses across surgery (e.g., appendicitis), medicine (e.g., pneumonia), and oncology (e.g., pancreatic cancer). MIRA worked through them using 11 proprietary digital tools that together offered more than 85,000 possible actions. The agent was able to request physical exams, order targeted lab values, retrieve medical history, and generate medication orders within a simulated EHR, rather than during actual patient care.
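To make the idea of a tool call in a FHIR-based sandbox more concrete, the following is a minimal sketch, not the study's implementation. It assumes a hypothetical `order_lab` tool and `dispatch` function (neither name comes from the article) and shows how a structured request from an agent could be turned into a FHIR R4 `ServiceRequest` resource. The patient identifier is made up, and the LOINC code for C-reactive protein is used only as an illustration.

```python
import json

LOINC_SYSTEM = "http://loinc.org"


def order_lab(patient_id: str, loinc_code: str, display: str) -> dict:
    """Hypothetical tool: build a FHIR R4 ServiceRequest for one lab order."""
    return {
        "resourceType": "ServiceRequest",
        "status": "active",
        "intent": "order",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [
                {"system": LOINC_SYSTEM, "code": loinc_code, "display": display}
            ]
        },
    }


# Hypothetical tool registry; the article mentions 11 tools but does not name them.
TOOLS = {"order_lab": order_lab}


def dispatch(tool_call: dict) -> dict:
    """Look up the requested tool by name and run it with the given arguments."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**tool_call["arguments"])


# Illustrative tool call, as an agent might emit it (all values are made up).
call = {
    "name": "order_lab",
    "arguments": {
        "patient_id": "example-1",
        "loinc_code": "1988-5",
        "display": "C reactive protein [Mass/volume] in Serum or Plasma",
    },
}
resource = dispatch(call)
print(json.dumps(resource, indent=2))
```

Restricting the agent to a fixed registry of named tools is one way to keep every action structured and auditable, since anything outside the registry is rejected before it reaches the record.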

MIRA's outcomes were compared with those of two groups of human physicians managing exactly the same cases under identical conditions: (1) a cohort of four board-certified physicians and (2) a mixed-seniority team of four residents and two board-certified physicians.

Additionally, a separate (traditional) text-based AI agent was used to simulate the patient being treated by MIRA or by the human doctors. This agent was instructed to answer questions from MIRA or its human counterparts based solely on the authentic clinical history, while resisting adversarial attempts to make it divulge information prematurely. However, the authors noted that simulated patient conversations may be more structured than real emergency department conversations.

Study findings

The study found that MIRA outperformed experienced human doctors. MIRA achieved 88.9% diagnostic accuracy across a dataset of 574 cases and 87.8% in a matched physician comparison of 311 cases. By comparison, board-certified physicians had an average accuracy of 78.1% (p < 0.001), while the mixed-seniority cohort had an average accuracy of 71.1% (p < 0.001).

Additionally, MIRA was superior in identifying appendicitis and pancreatitis, achieving 100% recall for laparoscopic appendectomies. Its ability to diagnose pancreatic cancer was comparable to that of board-certified physicians, but diagnosing pneumonia and urinary tract infections remained difficult. Notably, MIRA did not achieve this superior accuracy by simply “ordering everything.” Although it requested a more extensive set of individual blood parameters than human physicians did, its overall test selection was still significantly below the baseline of the historical dataset.

The results also showed that the AI model avoided systematic over-ordering of high-cost radiology imaging procedures and matched or exceeded physicians on overall resource-use metrics.

Safety evaluations were similarly promising, but still preliminary. An independent blinded medical review of 56 patient-level outputs and a separate evaluation of 468 prescriptions produced by MIRA found zero high-severity drug-drug interactions, renal dosing incompatibilities, or drug-allergy mismatches. Route specification was the weakest prescription field, with 97% correctness.

Additionally, MIRA achieved a perfect recall score of 1.00 when making critical hospitalization decisions for pneumonia and pulmonary embolism. This means the AI tool did not miss a single patient requiring inpatient treatment. However, the analysis of pulmonary embolism also suggested a tendency toward overhospitalization, reflecting a cautious approach.
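A short worked example may help show how perfect recall and a tendency toward overhospitalization can coexist. The sketch below uses made-up toy labels, not study data, and a hypothetical helper named `recall_and_precision`. Recall is the share of patients who needed admission and were admitted, while precision is the share of admitted patients who actually needed it.

```python
def recall_and_precision(needed, predicted):
    """Return (recall, precision) for two equal-length lists of booleans."""
    pairs = list(zip(needed, predicted))
    tp = sum(1 for n, p in pairs if n and p)        # needed and admitted
    fn = sum(1 for n, p in pairs if n and not p)    # needed but sent home
    fp = sum(1 for n, p in pairs if not n and p)    # admitted unnecessarily
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return recall, precision


# Toy labels for six imaginary patients (illustrative only, not study data).
needed_admission = [True, True, True, False, False, False]
agent_admits = [True, True, True, True, False, False]

recall, precision = recall_and_precision(needed_admission, agent_admits)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case no patient who needed admission is missed, so recall is 1.00, but one unnecessary admission lowers precision to 0.75. That is the kind of cautious trade-off the pulmonary embolism analysis points to.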

Conclusions

The study introduced MIRA, an EHR-integrated AI agent that translates clinical intent into structured, safe, and accurate actions and could potentially support physicians. However, the authors caution that MIRA and similar AI agents are not a replacement for professional human staff.

The model was not perfect in all treatment choices, including the selection of specific antibiotics, which highlights the continued need for strict human oversight and patient-level safety measures. Future iterations of the model may improve performance by incorporating retrieval-based support and stronger governance, and evidence from real-world validation will be needed before clinical deployment.
