New AI-driven diagnostic frameworks combine clinical, genetic, and phenotypic data to help reduce diagnosis times for rare diseases while providing clinicians with transparent evidence-based inferences.

Study: An agentic system for rare disease diagnosis with traceable reasoning.
In a recent study published in the journal Nature, researchers developed DeepRare, an agentic system that leverages large language models (LLMs) for the diagnosis of rare diseases.
Global burden of rare diseases and diagnostic delays
Rare diseases affect more than 300 million people worldwide, but diagnosis remains difficult due to clinical heterogeneity, limited physician familiarity, and low disease prevalence. Patients often endure a diagnostic journey that can exceed five years, marked by repeated referrals, misdiagnoses, unnecessary interventions, treatment delays, and poor clinical outcomes. These delays place a significant financial and emotional burden on patients and families and underscore the urgent need for accurate, scalable rare disease diagnostic tools.
DeepRare Agentic System Architecture and Core Components
In this study, researchers introduced DeepRare, an LLM-based agentic system for rare disease diagnosis. DeepRare consists of three main components: (1) an LLM-powered central host with memory banks, (2) specialized agent servers that perform analytical tasks, and (3) heterogeneous data sources that supply diagnostic evidence from web-scale medical knowledge bases and the scientific literature. The system uses DeepSeek-V3 as the default LLM powering the central host.
DeepRare processes a variety of patient inputs, including genomic test results, free-text clinical descriptions, and Human Phenotype Ontology (HPO) terms. The central host coordinates the agent servers to retrieve evidence tailored to the patient's data, generate preliminary diagnostic hypotheses, and run a structured self-reflection phase that verifies or refutes those hypotheses through additional searches. If no hypothesis meets the predefined criteria, the system repeats the reasoning cycle until one does. The final output is a ranked list of rare disease candidates, with a traceable reasoning chain linking each inference to its supporting evidence.
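The iterate-until-confident loop described above can be outlined in simplified form. The sketch below is illustrative only, not DeepRare's actual implementation: the function names (query_agents, hypothesize, introspect), the confidence threshold, and the toy stubs standing in for the LLM host and agent servers are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    disease: str
    confidence: float
    evidence: list = field(default_factory=list)  # traceable evidence chain

def query_agents(phenotypes, memory):
    # Stub: a real agent server would search knowledge bases and literature.
    return [f"evidence for {p}" for p in phenotypes if p not in str(memory)]

def hypothesize(phenotypes, memory):
    # Stub: the central LLM host would generate ranked disease candidates here.
    return [Hypothesis("Disease A", 0.9, list(memory)),
            Hypothesis("Disease B", 0.6, list(memory))]

def introspect(hypothesis, memory):
    # Stub: self-reflection step that verifies or refutes a hypothesis
    # against the gathered evidence and returns a confidence score.
    return hypothesis.confidence

def diagnose(phenotypes, max_rounds=5, threshold=0.8):
    memory = []       # central host's memory bank of gathered evidence
    hypotheses = []
    for _ in range(max_rounds):
        memory.extend(query_agents(phenotypes, memory))   # agent servers
        hypotheses = hypothesize(phenotypes, memory)      # ranked candidates
        confirmed = [h for h in hypotheses
                     if introspect(h, memory) >= threshold]
        if confirmed:                 # criteria met: stop iterating
            return confirmed[:5]
    return hypotheses[:5]             # fall back to best-effort ranking

result = diagnose(["short stature", "scoliosis"])
print([h.disease for h in result])  # → ['Disease A']
```

Each returned Hypothesis carries its evidence list, mirroring how DeepRare links every candidate diagnosis back to supporting sources.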
Benchmark comparison with LLM, bioinformatics tools, and agent systems
Researchers evaluated DeepRare against state-of-the-art general-purpose LLMs, reasoning-enhanced LLM variants, medical domain-specific LLMs, bioinformatics diagnostic tools, and other agentic systems. The general-purpose models included Claude-3.7-Sonnet, GPT-4o, Gemini-2.0-Flash, and DeepSeek-V3, along with reasoning-enhanced versions such as Claude-3.7-Sonnet-Thinking, o3-mini, Gemini-2.0-Flash-Thinking, and DeepSeek-R1. The medical-specific LLMs included MMedS-Llama 3 and Baichuan-14B. The bioinformatics tools were PubCaseFinder and PhenoBrain, and the other agentic systems were MDAgents and DS-R1-search.
DeepRare was evaluated on 6,401 clinical cases spanning 2,919 diseases, drawn from seven public and two in-house datasets. The public datasets included the Deciphering Developmental Disorders study, Matchmaker Exchange (MME), MIMIC-IV-Rare, MyGene2, and three RareBench subsets (LIRICAL, HMS, and RAMEDIS). The in-house datasets consisted of clinical cases from Xinhua Hospital and a hospital in Hunan Province, China. Together, these datasets cover literature-derived case reports, curated repositories, and real-world clinical center data across diverse populations.
Diagnostic accuracy metrics and Recall@K performance
For each diagnostic task, the system generated five ranked predictions. Performance was evaluated using Recall@K, which measures the probability of the correct diagnosis appearing within the top K predictions. Recall@1 reflects the proportion of cases in which the correct diagnosis was ranked first, while Recall@3 and Recall@5 indicate whether it appeared within the top three or top five predictions, respectively.
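Recall@K as defined here can be computed directly from the ranked prediction lists. The sketch below uses made-up disease names purely for illustration; it is not the study's evaluation code.

```python
def recall_at_k(ranked_predictions, true_diagnoses, k):
    """Fraction of cases whose true diagnosis appears in the top-k predictions."""
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, true_diagnoses)
        if truth in preds[:k]
    )
    return hits / len(true_diagnoses)

# Toy example: three cases, five ranked predictions each (hypothetical data).
preds = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome",
     "Stickler syndrome", "Homocystinuria"],
    ["Fabry disease", "Gaucher disease", "Pompe disease",
     "Mucopolysaccharidosis I", "Niemann-Pick disease"],
    ["Rett syndrome", "Angelman syndrome", "CDKL5 deficiency",
     "Pitt-Hopkins syndrome", "FOXG1 syndrome"],
]
truths = ["Marfan syndrome", "Pompe disease", "Prader-Willi syndrome"]

print(recall_at_k(preds, truths, 1))  # 1/3: only case 1 is ranked first
print(recall_at_k(preds, truths, 3))  # 2/3: case 2's truth is ranked third
print(recall_at_k(preds, truths, 5))  # 2/3: case 3's truth is never predicted
```

Note that Recall@K is monotonically non-decreasing in K, which is why Recall@5 is always at least as high as Recall@1 in the results below.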
In the HPO-based analysis, DeepRare achieved 57.18% Recall@1, outperforming the second-best model, Claude-3.7-Sonnet-Thinking, by 23.79 percentage points. DeepRare maintained superior diagnostic performance across 14 body systems representing multiple medical specialties. When the analysis was stratified by disease representation, DeepRare performed well for both well-represented diseases (>10 cases per disease) and underrepresented diseases (<10 cases per disease), highlighting its robustness across different case distributions.
Performance against rare disease experts
DeepRare was evaluated against five rare disease experts using the same HPO inputs. The clinicians were permitted to use search engines but not AI-based diagnostic tools. DeepRare achieved Recall@1 and Recall@5 of 64.4% and 78.5%, respectively, versus the experts' average Recall@1 of 54.6% and Recall@5 of 65.6%. These results suggest that the system outperformed human experts under standardized benchmark conditions.
Integration of genetic data improves diagnostic accuracy
The researchers next evaluated DeepRare on combined genetic and HPO inputs, including whole-exome sequencing data from Xinhua Hospital and the Hunan Province hospital. Incorporating genetic data substantially improved performance: Recall@1 increased from 33.3% to 63.6% on the Hunan dataset and from 39.9% to 69.1% on the Xinhua dataset.
When compared with Exomiser, a bioinformatics tool that integrates genetic and HPO data, DeepRare achieved higher Recall@1 values of 63.6% (Hunan) and 69.1% (Xinhua) compared to Exomiser’s 58.0% and 55.9%, respectively.
Various LLMs were tested as the central host, including DeepSeek-R1, Gemini-2.0-Flash, Claude-3.5-Sonnet, and GPT-4o. The choice of LLM had minimal impact on overall performance, suggesting the architecture itself is robust. The authors noted that these findings reflect a controlled retrospective evaluation rather than prospective real-world deployment.
Implications for transparent reasoning and clinical decision support
DeepRare represents an LLM-powered agentic system that generates transparent reasoning chains for rare disease diagnosis. In retrospective benchmarks, it consistently outperformed existing LLMs, bioinformatics tools, agentic frameworks, and expert clinicians across a variety of datasets. Clinician review of the generated reasoning chains demonstrated high referential accuracy, although occasional hallucinations and irrelevant citations were observed.
Future studies may extend this framework to treatment selection, prognostic prediction, and prospective clinical validation to assess real-world clinical utility.
