Historical evolution of agent
The concept of the “agent” has its roots in philosophy and has since moved from theoretical reflection to practical technological implementation, crossing many academic disciplines along the way. Ancient Greek philosophers showed an interest in intelligent machines. During this period, philosophers began to describe entities that possessed desires, beliefs, intentions, and the capacity to act, an early form of the concept of the agent8. The concept of “telos” (end), as articulated by Aristotle, later provided a philosophical basis for the goal-directed characteristics that would come to define agents.
In the contemporary era, with advances in the natural sciences and computer technology, research in artificial intelligence has progressed from philosophical contemplation to practical application. In 1950, Alan Turing proposed the renowned “Turing Test”, which became a pivotal benchmark for evaluating machine intelligence. Expert systems, which emerged in the 1970s, used human expert knowledge to support reasoning and decision-making through computer programs9. Machine learning techniques, which acquire knowledge and skills from data, then substantially enhanced the intelligence of agents. In the 21st century, deep learning has achieved significant breakthroughs in perception, decision-making, and execution capabilities and has expanded the range of application scenarios, bringing revolutionary progress to the development of agents10. Notably, reinforcement learning (RL), particularly multi-agent reinforcement learning (MARL), has advanced substantially and has successfully addressed numerous sequential decision-making problems in machine learning11. These advances have enabled agents to make better decisions in complex environments.
Since 2022, AI has penetrated various aspects of society, particularly with the proliferation of LLMs, which have opened up new avenues for the advancement of agents. AI agents built on large AI models have entered a period of accelerated growth and development. Compared with reinforcement learning agents, LLM-based AI agents have a richer knowledge base, more natural human interaction capabilities, and better interpretability12. For instance, OpenAI has introduced the Custom GPT feature (GPTs), allowing users to create their own GPT by integrating knowledge, operations, and instructions. Google has launched an agent framework through the Gemini series of models, supporting multimodal task processing. The LLaMA series of models open-sourced by Meta has given rise to a large number of community-driven agent applications. Anthropic’s Claude model sets new standards for agents in terms of safety and controllability through the Constitutional AI framework. DeepMind’s Sparrow project demonstrates an innovative path of combining language models with reinforcement learning. These developments have paved the way for the widespread use of personalized AI assistants, marking the formation of a diverse ecosystem for AI agent technology.
The advent of LLMs has prompted numerous organizations to prioritize the development of LLM-based AI agents, particularly within the healthcare sector. For instance, IBM Watson Health uses natural language processing, machine learning, and big data technologies to provide healthcare organizations with a range of intelligent services, encompassing assisted diagnosis, patient care, and drug development13. The overall progression of AI agents, with key milestones, is shown in Fig. 1.

Interpretation of agent
There is currently no universally accepted definition of an AI agent in academic circles. In 1998, Cristiano Castelfranchi proposed the concept of an AI agent as an intelligent entity capable of goal-orientation, social intelligence, mind-reading, adaptability, and flexibility, and able to make decisions and take actions autonomously14. Weng defines an AI agent as an autonomous system with an LLM as its core controller, which handles complex tasks through the ability to plan, remember, and use tools, summarized as “LLM+memory+task planning+tool use”15. Fei-Fei Li’s team characterizes an AI agent as an intelligent entity capable of sensing its environment, making decisions, and performing actions, with a core focus on using LLMs or visual language models (VLMs) to enhance the system’s interactivity and adaptability, emphasizing its ability to plan for task execution and to use and reason about large-scale knowledge. Parisi describes the AI agent as a system that uses external API tools to extend the model’s capabilities16. The language model explored by Schick et al. improves its performance by learning to use tools, suggesting a view of the AI agent as an intelligent system that can autonomously perceive its environment, understand the task requirements, and select and perform appropriate actions based on those requirements, including invoking external tools17.
In the preceding definitions, Weng’s characterization, which foregrounds the LLM as the core controller and systematically integrates planning, memory, and tool use into a unified framework, is particularly well-suited for constructing autonomous systems capable of handling complex, multi-step tasks. Compared with Castelfranchi’s earlier macro-level cognitive-architecture perspective, Weng’s account is more operational and readily implementable. In contrast to the tool-calling and functional-extension emphases found in the work of Parisi, Schick, and others, it offers a more holistic and autonomous conception of the system. Therefore, this study adopts Weng’s definition, conceptualizing an AI agent as an autonomous intelligent system whose central controller is a large language model, supplemented by four key modules: planning, memory, tool use, and self-reflection, to ensure efficient and reliable execution of domain-specific tasks in healthcare15.
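To make the adopted definition more concrete, the following is a minimal sketch of how a controller, planning, memory, tool use, and self-reflection might fit together in code. It is an illustration only, not an implementation from Weng or from any cited system: the class `Agent`, the stub `fake_llm`, the `PLAN`/`CHECK` prompt prefixes, and the `lookup` tool are all hypothetical names introduced here, and the LLM is replaced by a deterministic function so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Toy agent: a controller plus planning, memory, tool use and self-reflection."""
    llm: Callable[[str], str]                  # stand-in for an LLM call (hypothetical)
    tools: Dict[str, Callable[[str], str]]     # tool name -> callable
    memory: List[str] = field(default_factory=list)

    def plan(self, task: str) -> List[str]:
        # Planning: ask the controller to split the task into "tool: argument" steps.
        return [s.strip() for s in self.llm(f"PLAN {task}").split(";") if s.strip()]

    def act(self, step: str) -> str:
        # Tool use: dispatch the step to the named tool, if it is registered.
        name, _, arg = step.partition(":")
        tool = self.tools.get(name.strip())
        return tool(arg.strip()) if tool else f"unknown tool '{name.strip()}'"

    def reflect(self, step: str, result: str) -> bool:
        # Self-reflection: ask the controller whether the result is acceptable.
        return self.llm(f"CHECK {step} -> {result}") == "ok"

    def run(self, task: str) -> List[str]:
        for step in self.plan(task):
            result = self.act(step)
            status = "accepted" if self.reflect(step, result) else "flagged"
            # Memory: keep a record of every step, its result and its review status.
            self.memory.append(f"{step} => {result} [{status}]")
        return self.memory


def fake_llm(prompt: str) -> str:
    """Deterministic stub (assumption) so the sketch runs without any model."""
    if prompt.startswith("PLAN"):
        return "lookup: patient 001; convert: 5.1"
    return "flagged" if "unknown tool" in prompt else "ok"


agent = Agent(llm=fake_llm, tools={"lookup": lambda q: f"record for {q}"})
log = agent.run("summarise the latest lab result")
for entry in log:
    print(entry)
```

In this sketch the second planned step names a tool that is not registered, so the reflection step flags it rather than accepting it, which is the kind of check the self-reflection module is meant to provide.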
Characteristics of agent
Understanding and generating text. When combined with LLMs, the AI agent demonstrates strong proficiency in understanding and generating text. This proficiency is evidenced by its in-depth understanding of the contextual information in a text, as well as its ability to generate natural and fluent text content13. This has brought revolutionary change to fields such as dialog systems and content creation. This combination of comprehension and generation capabilities enables AI agents to interact more intelligently and personally with humans, providing more sophisticated and tailored services.
Tool Use and Interactivity. In addition to their powerful learning and processing capabilities, AI agents can self-learn how to use external tools15. They are able to select the most appropriate tools for a given situation and obtain the required information or perform specific operations through API calls and other means, thus further enhancing the efficiency and accuracy of task processing18. The introduction of this tool-using capability enhances the autonomy of the AI agent and provides more possibilities for its interaction with humans or other systems.
Task Processing and Generalizability. The integration capability of AI agents is of great significance to the development of the field of AI15. The ability of AI agents to integrate seamlessly with other information systems and devices facilitates collaboration and information sharing among them17. For instance, in diagnostic support scenarios, AI agents can integrate with Electronic Health Record (EHR) systems, Picture Archiving and Communication Systems (PACS), and Laboratory Information Systems (LIS) to automatically extract patients’ multimodal data. This assists physicians in comprehensive decision-making, reduces human errors, and enhances diagnostic accuracy and efficiency19,20. The versatility and flexibility that LLM-based AI agents show in handling tasks is evidence that they can handle many different tasks and problems and switch freely between domains. This wide applicability and strong task-processing capability make AI agents an important means of solving complex problems and promoting intelligent transformation.
Logical Reasoning and Task Decomposition. Building AI agents on LLMs can be likened to giving the agents a more powerful “brain”. LLMs possess the capacity for logical reasoning, which agents can further strengthen through prompting strategies. If a prompt does not effectively stimulate the reasoning ability of the LLM, users may struggle to obtain satisfactory answers. In contrast, adding auxiliary reasoning prompts can significantly improve the reasoning effectiveness of the LLM12,21. The ability of autonomous agents to generate bespoke prompts that align with specific objectives underscores their potential to stimulate and leverage reasoning capabilities more effectively in complex reasoning tasks.
Learning and Adaptation Capability. In comparison with traditional AI technologies, AI agents based on LLMs demonstrate exceptional learning and adaptation capabilities. These agents learn autonomously from large-scale data, extracting key information and continuously optimizing their performance22. This process requires little reliance on large numbers of manual annotations or preset rules. Furthermore, these agents can acquire knowledge from limited or even zero samples, rapidly adapting to new tasks or small data sets while demonstrating commendable performance23. Moreover, the system’s highly scalable nature encourages continuous improvement in performance and self-driven evolution to meet the ever-growing demands of large-scale applications24. Fig. 2 summarizes these five core capabilities of AI agents.

Fig. 2: Characteristics of AI agent.
Application of agent
AI agents show significant application potential in a variety of fields, including education, industry, finance, transportation, and logistics, owing to their flexibility and intelligent processing capabilities. For instance, in financial investment, robo-advisors are a prominent example of automated investment advice, creating and managing diversified investment portfolios using technology, algorithms, and portfolio theory25. FinRobot, a novel open-source AI agent platform, employs LLMs to drive multiple AI agents. It specializes in finance, providing more effective financial advice, portfolio management, and risk prediction26. In the field of autonomous driving, the Agent-Driver researched by Jiageng Mao et al. empowers AI agents with intuitive common sense and powerful reasoning capabilities27. In the field of education, Khan Academy has launched the AI teaching assistant Khanmigo, which not only tutors students in their subjects but also provides real-time tracking and intelligent evaluation, and, in the role of a teacher, writes lesson plans and plans courses28.
AI agent applications in healthcare
The exploration of AI agent applications in healthcare focuses on assisted diagnosis, decision making, report generation, chatbots, healthcare management, and medical education. Figure 3 provides a detailed illustration of the applications of AI agents in these fields.

Fig. 3: AI agent applications in healthcare.
Assisted Diagnosis: Assisted diagnosis represents one of the most common applications of AI agents in healthcare. From a technical perspective, some studies have shown that multi-agent interactions can improve diagnostic accuracy and correct errors in historical records29,30. Therefore, researchers often leverage expert simulations, patient interaction, and multi-agent collaboration to enhance diagnostic performance. For instance, Tsinghua University built an agent hospital by simulating the actual scenarios of medical staff and patients in healthcare institutions, thus improving the intelligence of the interaction between doctors and patients31. Similarly, the assistant-driven expert consulting (AMSC) model from Harbin Institute of Technology simulates expert seminars through multiple agents with diverse knowledge backgrounds32. ClinicalAgent employs specialized LLMs to provide tailored departmental support, closely aligning simulations with real-world clinical environments33. In terms of target areas, in addition to general diagnostic assistance systems such as Stanford University’s MMedAgent, which uses multimodal imaging to detect, segment, and classify medical images34, AI agents are increasingly applied to specialty domains. ZODIAC, developed for cardiology, extracts clinically relevant features, detects arrhythmias, and contributes to diagnostic decisions35. Baidu’s AI agent can assist in the diagnosis of ear deformities in newborns36, and the MAGDA system integrates radiology images with clinical guidelines for enhanced reasoning37.
Assisted decision-making: decision-making is another key area where AI agents have shown significant potential in healthcare. As in assisted diagnosis, many medical scenarios involve multiple disciplines and roles, so researchers often integrate different data sources, establish distinct agents with complementary expertise, and enable role-based interactions, aiming to leverage multi-agent collaboration to enhance the quality, interpretability, and consensus of clinical decisions. For example, Yale’s MedAgents employ role-playing and multidisciplinary discussion to iteratively improve credibility and interpretability, ultimately facilitating consensus decision-making38. MDAgents establish different agents, such as general practitioners and specialists, according to scenario complexity, supporting decision-making through structured Multidisciplinary Team (MDT) collaboration39. Similarly, MEDAIDE improves understanding of clinical intent via query rewriting, intent recognition, and multi-agent collaboration, enhancing decision-making effectiveness in complex situations40. Assisted decision-making has also been applied in specialty domains. For example, in oncology treatment, the agent developed by Heidelberg University Hospital is capable of text, radiology, and histopathology image interpretation, genomic data processing, web search, and medical guideline document retrieval41. In emergency care, a multi-agent system comprising emergency physicians, triage nurses, pharmacists, and dispatchers integrated the Emergency Triage Assessment Scale (ETAS) to improve quality, efficiency, and safety in decision-making42. Furthermore, there are multi-agent systems specialized in clinical error detection and correction tasks that enable positive and negative analysis of medical decisions by breaking the task down into the steps of observation, evaluation, reflection, and formatting43.
Assisted Report Generation: Assisted report generation represents one of the earlier applications of AI agents in healthcare, with initial efforts primarily aimed at assisting radiologists in interpreting medical images and alleviating workforce shortages. For example, Stanford University’s CheXagent focuses on the interpretation of chest X-rays and is able to generate radiology reports through image analysis and textual response, with its performance on visual tasks exceeding that of the generalized domain model by 97.5%44. Similarly, CXR-agent is another agent focusing on chest X-rays, capable of achieving pathology detection, classification, localization, and generation of clinical reports45. In later developments, research attention has expanded to improving report quality, accuracy, readability, and patient-centered communication. For example, MGA employs existing medical reports to construct a medical dictionary and matches the most pertinent sentences to form a medical report, thereby enhancing the accuracy and professionalism of report generation46. It should be noted that due to the specialized nature of report generation, previous studies have mostly focused on single-agent systems, which can ensure computational efficiency and semantic consistency. As more attention is paid to the doctor-patient experience, recent research has begun to incorporate multi-agent architectures to optimize the report generation process. This approach enables the production of patient-friendly reports, thereby reducing clinicians’ workload while enhancing readability and improving the overall patient experience47.
Assisted health management: assisted health management has emerged as a prominent direction for the application of AI agents in healthcare, with conversational agents being the predominant form. Conversational agents, also referred to as chatbots, interact with humans using natural language. As machine learning has evolved, more capable conversational agents have emerged48. These can process more complex information and can therefore respond to health needs in a more personalized and precise manner49. Against this background, most studies in this area focus on mental health. For example, Agent Mental Clinic (AMC) is a conversational agent designed for depression diagnosis, which replicates doctor-patient interactions by establishing patient, psychiatrist, and supervisor roles50. MISHA targets students, providing psychoeducation on stress management and relaxation techniques, among other topics, and helps alleviate students’ perceived stress51. Other studies address topics such as alleviating suicidal thoughts52 and reducing post-traumatic isolation53. Beyond mental health, research is also being conducted on weight loss counseling and skin management54. With regard to interaction modality, conversational agents are predominantly text-based, although Polaris55 supports interaction via phone or voice and employs roles such as nurses, medical assistants, social workers, and nutritionists to achieve health management functions, including medication adherence, appointment inquiries, and dietary adjustments.
Assisting medical education: medical education represents a further application scenario. Researchers often use multi-agent systems to simulate various roles, such as patients or teachers, based on real medical scenarios, creating interactive settings that improve the abilities of medical students. For example, AI Patient, developed by the University of Michigan, simulates patients; by setting up knowledge graphs and incorporating multiple agent roles such as retrieval, reasoning, and generation, it enhances the validity and credibility of the simulation and contributes to the education of medical students56. MEDCO, developed by the Chinese University of Hong Kong, emphasizes clinical case training, using patient simulations, feedback from senior doctors and experts, and multi-student interactions to provide more personalized and accurate medical education57. ChatCoach helps medical students improve their ability to communicate with patients by setting up roles such as doctors, patients, and coaches and simulating conversations in medical settings58. In addition to the teaching of general medical knowledge, some studies have extended to specialized education. For example, LLM-based chatbots specifically designed for radiation oncology education have emerged as valuable tools for professional healthcare training, enhancing the accessibility, personalization, and interactivity of medical education59.
Assisting in medication management: In the domain of drug management, researchers have explored applications in prescription management, adverse event prevention, and prediction of drug efficacy in clinical trials by simulating the processes of different stages. Rx Strategist offers a multi-agent prescription validation concept, which performs indication and dosage validation through knowledge graph retrieval and drug information set retrieval60. MALADE prioritizes pharmacovigilance, identifying adverse drug reactions through multi-agent collaboration61. The ClinicalAgent system, a multi-agent system developed for clinical trials, analyzes and evaluates the potential efficacy of drugs on diseases and carries out drug safety analysis62.
Aiding hospital management: In the domain of hospital management, it is important to reduce the burden on doctors, improve efficiency, and optimize processes. The intricate computer operations and management responsibilities associated with Electronic Health Records (EHRs, or Electronic Medical Records, EMRs) have been identified as contributing factors to the burden and burnout experienced by physicians63. Consequently, numerous researchers have directed their attention towards this area, seeking agent-based solutions. EHRAgent facilitates direct communication between clinicians and EHR systems through autonomous code generation and execution, enhancing physician efficiency and experience64. Almanac Copilot can assist clinicians with EMR-related tasks by automating routine tasks and streamlining the documentation process65. ColaCare’s approach centers on EHR modeling and clinical prediction, using DoctorAgent and MetaAgent to emulate the collaborative decision-making process among doctors of diverse specialties. This facilitates enhanced clinical decision-making and the implementation of personalized precision medicine66. In addition, some researchers focus on prior authorization (PA), decomposing the task with a multi-agent system to automate it and reduce physician workloads67,68. In terms of medical insurance, one study has investigated International Classification of Diseases (ICD) coding within a multi-agent paradigm61.
Furthermore, research is also being conducted in the field of biomedical knowledge, encompassing areas such as biological experiment design, cell biology, chemical biology, and genetics69. At the primary health care level, research involves the establishment of task-difficulty-assessment agents, expert agents, and response-simplification agents, as well as the incorporation of regional cultures and local languages to provide references for primary healthcare70.
It is important to note that despite the broad potential of AI agents in the healthcare domain, their real-world implementation still faces several critical challenges: (1) Hallucinations. Diagnostic hallucinations may arise in the context of rare diseases or ambiguous clinical presentations, where the agent generates confident yet substantively incorrect conclusions, thereby posing clinical risks71. (2) Lack of interpretability. The decision-making processes of AI agents often lack transparency, making it difficult for clinicians to trace the underlying reasoning, which in turn undermines trust and limits adoption72. (3) Ambiguity in accountability. When AI agents generate diagnostic or therapeutic recommendations, the absence of clear definitions regarding legal and ethical responsibility in the event of erroneous outcomes remains a major challenge for clinical implementation and governance73. (4) Data-related issues. On one hand, training datasets may exhibit imbalances across dimensions such as gender, ethnicity, and geography, resulting in performance degradation for specific populations and generating inequitable decisions that compromise health equity. On the other hand, the use of medical data involves highly sensitive personal information; in the absence of robust data governance frameworks and security safeguards, there is a heightened risk of privacy breaches and ethical violations74,75.
In response to these challenges, the following sections will further explore a multi-dimensional evaluation framework designed to support the more scientific, robust, and trustworthy deployment of AI agents in healthcare.
Evaluation of AI agent in healthcare
As LLMs gain traction in healthcare, their potential to deliver clinical value depends critically on ensuring reliability, validity, and safety across every operational component. Without rigorous evaluation, AI agents may harbor latent flaws in medical reasoning that could translate into diagnostic inaccuracies or inappropriate treatment recommendations, thereby compromising patient safety. Even when designed for decision support, inadequately tested systems may generate ambiguous or inconsistent guidance, forcing clinicians to cross-check outputs and disrupting already burdened clinical workflows. Beyond these direct risks, insufficient evaluation also heightens concerns regarding bias, equity, and data privacy, all of which are crucial in sensitive healthcare environments. Against this backdrop, this section explores the evaluation subjects, comparison objects and dimensions and evaluation indicators of AI agents.
In the evaluation of LLMs in the medical field, the evaluation subjects (that is, the evaluators) are typically divided into three categories. The first is other LLMs, such as the frequently used GPT-4/GPT-4o76,33 and Gemini-Pro39. These models are used to assess the agent under evaluation in terms of performance, functionality, and other relevant metrics. The second is human evaluation77, in which professionals from the relevant medical fields are invited according to the type of agent, including doctors of various disciplines, specialists35, licensed nurses55, clinical pharmacists60, and radiology and imaging experts78. Clinical experts, drawing on their extensive professional knowledge and practical experience, assess the model’s outputs, such as the rationality of diagnostic results and the viability of treatment plans, from the perspective of medical specialties. Their evaluation results carry authority and professionalism. The third is fair test sets77, including MedQA, PubMedQA, MultiMedQA, and other datasets customized to specific requirements58,69,77. A test set provides a large number of standardized data samples, and the model’s performance on it allows a more objective and quantitative evaluation of its competence across different tasks and knowledge domains.
When assessing LLMs in healthcare, the main comparison objects are baseline models and expert behavioral results. The baseline models cover industry-leading LLMs, such as GPT-4/GPT-4o56,79, Gemini-Pro33,35, LLaMA55,60, and Mixtral79, as well as models specialized in healthcare, such as BioGPT35, Meditron40, Med-Flamingo34, and BioMistral40. These baseline models provide a frame of reference for the model under investigation. Comparing the evaluated model with the baselines on various performance metrics gives a clear picture of its position relative to similar models, and can also reveal its distinctive strengths or the areas that require improvement. Expert behavioral results, by contrast, focus on comparing the performance of LLM agents with the diagnostic results, treatment decisions, and question-answering scores of human clinical experts55. For instance, in disease diagnosis tasks, the model’s diagnoses are checked for consistency with those of clinical experts; in treatment plan recommendation, the rationality and effectiveness of the model’s plan are compared with those of the plan formulated by experts. Measuring the discrepancy or similarity between LLMs and human experts in medical judgment and decision-making allows the practical value and effectiveness of the model in the medical field to be determined more accurately, providing a clear goal and direction for optimizing and improving the model.
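As an illustration of how consistency between a model and clinical experts could be quantified, the sketch below computes raw percent agreement and Cohen's kappa for paired diagnostic labels. Cohen's kappa is not named in the text above; it is introduced here as one commonly used chance-corrected agreement statistic, and the function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(model: Sequence[str], expert: Sequence[str]) -> float:
    """Share of cases in which the model and the expert give the same label."""
    return sum(m == e for m, e in zip(model, expert)) / len(model)


def cohens_kappa(model: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(model)
    p_observed = percent_agreement(model, expert)
    model_counts, expert_counts = Counter(model), Counter(expert)
    # Chance agreement: probability that both raters pick the same label independently.
    p_expected = sum(
        model_counts[label] * expert_counts[label] for label in model_counts
    ) / (n * n)
    if p_expected == 1.0:          # both raters always use one identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical paired diagnoses for six cases (labels are placeholders).
model_dx = ["A", "B", "A", "C", "A", "B"]
expert_dx = ["A", "B", "B", "C", "A", "A"]

print(f"agreement = {percent_agreement(model_dx, expert_dx):.3f}")
print(f"kappa     = {cohens_kappa(model_dx, expert_dx):.3f}")
```

Percent agreement is easy to read but can look high simply because one diagnosis dominates the sample, which is why a chance-corrected statistic is often reported alongside it.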
The indicator dimensions and their corresponding evaluation indicators are summarized in Table 1 and fall into two categories: basic indicators and developmental indicators. Existing studies demonstrate that quantitative metrics such as accuracy and F1-score remain the most commonly used measures, offering precise statistical evaluations of model performance. However, recent research has increasingly emphasized additional aspects, including efficiency, ethical compliance, and the patient–clinician interaction experience20,80. Collectively, these indicators reflect an evolutionary progression from basic feasibility to comprehensive excellence. Basic feasibility corresponds to the basic indicators, representing the minimum standards required to ensure the safe and effective delivery of healthcare services, including objective correctness, semantic correctness, and task completion. Comprehensive excellence corresponds to the developmental indicators, reflecting the pursuit of high-quality, human-centered, and sustainable performance in complex clinical contexts, including efficiency level, content and presentation level, and humanistic care. For detailed explanations of the indicators, please see Supplementary Information A.
Objective correctness: includes indicators such as accuracy, precision, recall, F1-score, ROC, AUC, which are used to measure the correctness of the model’s prediction results. These metrics evaluate the extent to which the outcomes generated by AI agents are objectively consistent with verified medical facts, benchmark datasets, or other reference standards, thereby reflecting the quantitative reliability of the model across diverse healthcare tasks. For example, Yale University’s MEDAGENTS primarily uses accuracy to evaluate the performance of models38. ClinicalAgent62, which focuses on clinical trials, also assessed its outcomes using accuracy, ROC-AUC, precision, recall, and F1-score.
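The relationship between these indicators can be made explicit with a small worked example. The sketch below computes accuracy, precision, recall, and F1-score from the confusion-matrix counts of a binary task; the function name `binary_metrics` and the labels are hypothetical and are not taken from any of the cited studies.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference labels and agent predictions for eight cases.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75. On imbalanced clinical data the values usually diverge, which is one reason precision, recall, and F1-score are reported alongside accuracy.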
Semantic correctness: there are metrics such as BLEU/GLEU, METEOR, BERTScore and ROUGE that can be utilized to assess the semantic correctness of a model. These metrics ascertain the model’s capacity to comprehend and articulate semantics by evaluating the degree of similarity between the text generated by the model and the reference text with respect to vocabulary and semantic structure. For instance, Dingkang Yang40 validated the multi-dimensional health risk assessment capability of the medical agent by comparing its pre-diagnosis results with those of benchmark models using metrics such as BLEU-1/2 (%), ROUGE-1/2/L (%), and GLEU (%).
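To show what vocabulary-level similarity means in practice, the sketch below computes a clipped unigram overlap between a generated sentence and a reference sentence. The precision term corresponds to the unigram precision underlying BLEU-1 (without the brevity penalty), and the recall term corresponds to ROUGE-1 recall. It is a simplified illustration under the assumption of whitespace tokenization, not the full metric definitions used in the cited studies; the function name and the example sentences are hypothetical.

```python
from collections import Counter
from typing import Dict


def unigram_overlap(candidate: str, reference: str) -> Dict[str, float]:
    """Clipped unigram precision, recall and F1 between two texts."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, per token, the smaller of the two counts (clipping).
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical report sentences.
reference = "no acute cardiopulmonary abnormality is seen"
candidate = "no acute abnormality seen"

scores = unigram_overlap(candidate, reference)
print(scores)
```

All four candidate tokens appear in the six-token reference, so precision is 1.0 while recall is 4/6. A short output can therefore score well on precision alone, which is why BLEU adds a brevity penalty and why recall-oriented metrics such as ROUGE are reported together with it.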
Task completion: the completion rate and success rate are used as indicators to examine how well the model achieves a specific medical task. In more complex agentic settings, task completion may involve the ability to autonomously select, invoke, and coordinate external tools to achieve a given objective, reflecting the agent’s procedural reasoning and execution capability. For instance, the clinical decision-making agent developed by Heidelberg University Hospital for oncology uses the accuracy of identifying the required tools, together with the correctness of tool usage, as key evaluation metrics41. Similarly, tool utilization serves as a primary criterion for assessing task completion in Stanford University’s MMedAgent34.
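One possible way to operationalize these indicators is sketched below: a completion rate over evaluation episodes and a tool selection score defined as the share of required tools that the agent actually invoked. These definitions, the `Episode` structure, and the tool names are assumptions made for illustration; the cited studies may define their metrics differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    expected_tools: List[str]   # tools a reference solution would need
    called_tools: List[str]     # tools the agent actually invoked
    completed: bool             # whether the task objective was met


def completion_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the task was completed."""
    return sum(e.completed for e in episodes) / len(episodes)


def tool_selection_score(episodes: List[Episode]) -> float:
    """Fraction of required tools, pooled over episodes, that the agent invoked."""
    required = sum(len(set(e.expected_tools)) for e in episodes)
    hits = sum(len(set(e.expected_tools) & set(e.called_tools)) for e in episodes)
    return hits / required if required else 0.0


# Hypothetical evaluation log with placeholder tool names.
episodes = [
    Episode(["image_segmentation", "guideline_search"],
            ["image_segmentation", "guideline_search"], True),
    Episode(["web_search"], ["guideline_search"], False),
    Episode(["genomics_lookup", "web_search"], ["web_search"], True),
]

print(f"completion rate      = {completion_rate(episodes):.3f}")
print(f"tool selection score = {tool_selection_score(episodes):.3f}")
```

In this example two of three tasks are completed, and three of the five required tool calls are made. Reporting the two numbers separately distinguishes an agent that reaches the right outcome from one that also follows the expected procedure.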
Efficiency level: the emphasis is on response time and the number of interaction rounds, which together indicate how quickly the model operates and how easily users can interact with it. A shorter response time means that the model answers user inputs more quickly, which can improve service efficiency in scenarios such as medical consultation. The number of interaction rounds indicates how many exchanges with the user the model needs to complete a task; fewer rounds suggest that the model understands the user’s needs and provides accurate responses or solutions sooner. For instance, Lang Cao81 employed the “number of turns” metric, defined as the average number of turns required to complete a task-oriented dialogue, to test the agent’s dialogue quality across 20 scenarios. A lower number of turns signifies higher efficiency.
Content and presentation level: the overall readability, clarity, coherence, and practical value of the text produced by the model are assessed through indicators such as content richness, detail, usefulness, safety, and ethical compliance. These metrics examine whether outputs are clinically meaningful, understandable, and ethically appropriate. High-quality content and presentation should convey sufficient detail and depth for clinical tasks while remaining comprehensible to both professionals and patients. For example, CheXagent44 is an agent system for generating radiology reports. The research team not only assessed the reports’ completeness, correctness, and conciseness, but also invited radiologists to evaluate the text quality. Furthermore, the study evaluated the potential for bias related to gender, race, and age to ensure fairness.
Humanistic care: this dimension includes indicators such as accuracy under hidden symptoms, humanistic care, confidence, compliance, counseling, and satisfaction, which focus on the extent to which the model attends to the patient’s psychological and healthcare service needs in medical situations. These metrics reflect a humanistic concept in which medical AI agent systems respect patients’ emotions, autonomy, and social context. For instance, Samuel Schmidgall79 noted that implicit biases among doctors can influence diagnostic judgments and treatment planning, while patients’ biases affect trust and adherence. Accordingly, when evaluating AgentClinic, the team incorporated metrics related to doctor–patient interaction and doctor empathy to capture human-centered dimensions of care.
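Several of the quantitative indicators above reduce to simple computations. The following Python sketch is illustrative only: the function names, the toy labels, and the whitespace tokenization are assumptions made for this example and are not taken from any of the cited systems. In particular, `unigram_overlap` is a simplified stand-in for BLEU-1 and ROUGE-1 (clipped unigram precision and unigram recall, without a brevity penalty), not the official implementation of either metric.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def unigram_overlap(candidate, reference):
    """Clipped unigram precision and unigram recall on whitespace tokens.

    Assumption: a simplified proxy for BLEU-1 (no brevity penalty)
    and ROUGE-1 recall, used here only to show the idea of overlap
    between generated and reference text.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def mean_turns(turn_counts):
    """Average number of dialogue turns needed to complete a task."""
    return sum(turn_counts) / len(turn_counts)


if __name__ == "__main__":
    # Hypothetical toy data, not results from any cited study.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print("accuracy:", accuracy(y_true, y_pred))
    print("precision/recall/F1:", precision_recall_f1(y_true, y_pred))
    print("unigram overlap:", unigram_overlap(
        "patient reports mild chest pain",
        "the patient reports chest pain"))
    print("mean turns:", mean_turns([4, 6, 5]))
```

In practice, established libraries would normally be preferred over hand-written metrics, and indicators such as humanistic care or content quality still require human raters rather than a formula.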
It is important to note that the two-tiered framework proposed in this paper provides conceptual indicators for evaluating AI agents, and its application in real-world scenarios remains to be explored further. Developing official and actionable evaluation systems still faces numerous challenges, and no mature, widely adopted framework currently exists. Nevertheless, during this transitional phase, regulatory explorations for the evaluation of AI in healthcare have gradually emerged. For example, the UK MHRA’s “AI Airlock” is a regulatory sandbox designed to provide a controlled testing environment for AI medical devices, with evaluations emphasizing the following indicators: safety/quality, effectiveness, adoption, and equity/robustness82. Meanwhile, the EU’s CORE-MD project proposes an evaluation framework consisting primarily of the following indicators: valid clinical association score, valid technical performance score, and clinical performance score83. Additionally, China’s National Medical Products Administration has issued guidelines for the clinical evaluation and registration review of AI-assisted detection medical devices (software), proposing the following key evaluation elements: diagnostic accuracy indicators, such as sensitivity, specificity, and area under the ROC curve, and the construction of a clinical reference standard. These evaluation frameworks provide valuable references for assessing AI agents, and their practical experience and indicator systems offer important lessons for the development of future evaluation systems.
