Earlier this year, the US Food and Drug Administration (FDA), in partnership with the European Medicines Agency, released 10 guiding principles for the use of artificial intelligence (AI) in drug development.1 Among other things, the principles advise adhering to relevant legal, ethical, technical, scientific, cybersecurity, and regulatory standards; having a well-defined context of use; and using plain language to present clear, accessible, and contextually relevant information to the intended audience.
Even before the FDA released these principles, the use of AI in federal health agencies had been garnering increasing interest because of its potential to expedite processes and improve efficiency.2-4 Whether this potential will be realized, however, depends on many factors related to optimizing AI’s performance while minimizing its risks.2,3
According to a 2025 report from the US Government Accountability Office (GAO), the number of generative AI use cases submitted to the Office of Management and Budget by selected federal agencies increased 9-fold from 2023 to 2024.2 Forty-one percent of these cases were submitted by the Department of Health and Human Services (HHS).
“Each federal agency publishes an annual AI use case inventory that lists current AI tools in use at the agency along with many other fields including their stage of development, whether the tool is from a vendor or homegrown, and whether the use case may be high impact, meaning a decision is likely to be made from the AI’s output,” explained Kaustav Shah, MD, an internal medicine physician at the Perelman School of Medicine at the University of Pennsylvania and a fellow in the National Clinician Scholars Program (NCSP) there.
When the use case inventories were last updated for 2025, HHS listed 447 use cases and the Veterans Health Administration listed 367, spanning a range of administrative, research, and clinical tasks, Dr Shah said.5,6
Multiple use cases pertained to the development of ChatCDC, a generative AI chatbot used by the Centers for Disease Control and Prevention (CDC).7 The CDC reportedly was the “first federal agency to deploy a generative AI … chatbot to all staff” and has also generated 55 use cases “that demonstrate the power of AI in preventing outbreaks and enhancing operational efficiency,” as the CDC noted in an August 2025 update.8
Other examples of generative AI use reported by federal health agencies include a Department of Veterans Affairs (VA) “effort to automate various medical imaging processes to enhance veterans’ diagnostic services” and an HHS initiative to “support containment of the poliovirus … to extract information from publications and identify outbreaks in areas previously thought to be polio-free,” as described in the GAO report.2
Current AI Endeavors in Federal Health Agencies
Under the current Trump administration, there has so far been a dearth of published data on how federal health agencies are using AI.
“Since the administration change in January 2025, there has been a lot of press about the use of AI in federal agencies,” Dr Shah said. “A few prominent examples include Elsa, a generative AI tool that the FDA is using for scientific reviews,9 and plans for the VA to use an ambient dictation tool to assist with medical documentation.”10
According to the FDA’s press release announcing the launch of Elsa in June 2025, the “agency is already using Elsa to accelerate clinical protocol reviews, shorten the time needed for scientific evaluations, and identify high-priority inspection targets.”9 As examples of ways the tool will be used to improve operational efficiency at the FDA, the release stated that the tool “can summarize adverse events to support safety profile assessments, perform faster label comparisons, and generate code to help develop databases for nonclinical applications.”
“The FDA deals with huge volumes of textual information as part of their review process and other processes such as surveillance and pharmacovigilance,” noted Russ B. Altman, MD, PhD, a professor of bioengineering, genetics, medicine, and biomedical data science at Stanford University and associate director of the Stanford Institute for Human-Centered AI. As part of the FDA’s Centers of Excellence in Regulatory Science & Innovation (CERSI) program, Dr Altman and colleagues are currently collaborating with an office at the FDA to assist with the use and fine-tuning of large language models (LLMs) for specific purposes.
“There are several startups also trying to help sponsors create their applications with LLMs, so we are looking at a future where LLMs help write the applications and other LLMs help review them,” Dr Altman continued. “The general goal of speeding up both processes is laudable because then we can more quickly determine which new products are effective and safe and get them to the public more quickly.”
However, he noted that a substantial amount of research is needed to determine the accuracy of these tools and the actual time savings they deliver. While his team has not yet finalized its conclusions or reported findings from its work with the FDA, there are “some indications that ‘generic’ LLMs may need to be exposed to and ‘fine-tuned’ with FDA-specific documents to get the best performance,” he explained.
Dr Shah said he is unaware of any published scientific papers detailing how the newer AI tools used in federal health agencies, like Elsa, are performing thus far or how they were initially evaluated. However, the agencies have internal processes and standards they aim to follow, he noted.
Potential Benefits and Drawbacks
The use cases that have been publicly described illustrate the array of potential advantages of using AI in federal health agencies, according to Dr Shah.2 The reported benefits include “helping to summarize and synthesize large amounts of medical literature, detecting patterns from electronic health record data to improve drug safety, and powering chatbots that can answer common internal agency questions.”
For example, the CDC has estimated that use of its internal chatbot saved more than $3.7 million in labor costs and led to a 527% return on investment.8
However, inaccuracies and hallucinations produced by AI tools are among the drawbacks of using this technology. “There have been news reports about hallucinations from Elsa, the FDA tool, citing studies that do not exist,” Dr Shah said.11 “This is certainly a common problem when LLMs like ChatGPT are used to generate text and would be concerning given the importance of the FDA in rigorously evaluating drugs and medical devices.”
According to Dr Altman, much attention is currently focused on methods to reduce or remove hallucinations and to improve the accuracy of quantitative computations. “Many drug documents involve quantitative data, and LLMs are sometimes not good at precise mathematics, so we’re developing methods for ensuring that things like dosage calculations, safe dose recommendations, and the speed of metabolism and elimination from the system are accurately computed,” he said. “This is still an active research area and needs to be essentially solved before the tools are fully trustworthy.”
Other potential risks of using AI in federal health agencies include overreliance on AI tools and bias in situations that are not well-represented in the AI’s training data, Dr Shah added. He emphasized the importance of carefully studying each AI tool to ensure robust performance, as well as “developing thoughtful human-AI collaborative models and being transparent with the public about what AI is and is not being used for.”
Additionally, significant time and effort are required to train and test AI systems, posing a burden to employees of federal agencies already balancing numerous responsibilities while adhering to federal mandates. The work “may pay off eventually with decreased effort, but right now we are asking federal workers to both do their normal job and also help create and evaluate tools for the future,” Dr Altman said.
This scenario highlights the important role of academic-government collaborations like the CERSI programs. These collaborations “allow academic researchers to assist the federal government by bringing our AI expertise to help them with this current period of experimentation and evaluation,” Dr Altman explained. “I think the CERSI program and others like it are critical for providing this ‘surge’ capability in research and development.”
Strengthening Research on AI Use in Federal Health Agencies
Research is ongoing to test and optimize the use of AI in federal health agencies. Physician-researcher Irene Dankwa-Mullan, MD, affiliate professor of health policy and management at the Milken Institute School of Public Health at George Washington University in Washington, DC, described the following measures that could strengthen federal research on AI tools aimed at improving population health and outcomes.
- Task-specific, real-world evaluation. AI tools should be tested on real work and not solely on demos, and results should be reported by patient group to allow identification of outcome disparities between groups.
- External validation and post-deployment monitoring. AI performance can drift over time and should be checked routinely, and there should be clear rules for pausing or rolling back the use of AI tools if performance declines.
- Early identification and reporting of AI output mistakes. Teams can use a list of common errors to share lessons and avoid repeating reported problems.
- Data readiness and documentation. Data sources should be transparent, explainable, usable, and well-labeled so that results are trustworthy.
- Human-computer design studies and workflow studies. AI tools should be designed for people and workflows, and studies should examine how tools fit into daily work to ensure they really save time instead of creating more work.
- Studies on value and safety economics as well as health impact. Costs, benefits, and potential harms should be tracked to support evidence-based decisions regarding the use of AI tools.
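Two of the measures above, reporting results by group and setting a clear rule for pausing a tool when performance declines, can be illustrated with a short sketch. This is an illustration only, not a method described by Dr Dankwa-Mullan or used by any agency. The `Review` record, the function names, the group labels, and the 0.05 threshold are all hypothetical, and the sketch assumes that human reviewers have already marked each AI output as correct or incorrect.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    """One human-checked AI output (hypothetical record shape)."""
    group: str      # subgroup label, e.g. a patient group
    correct: bool   # whether a human reviewer judged the output correct


def accuracy_by_group(reviews):
    """Return {group: fraction of reviewed outputs judged correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in reviews:
        totals[r.group] += 1
        hits[r.group] += int(r.correct)
    return {g: hits[g] / totals[g] for g in totals}


def groups_to_pause(baseline, current, max_drop=0.05):
    """List groups whose accuracy fell more than max_drop below baseline.

    max_drop is an illustrative threshold, not a recommended value.
    """
    return sorted(
        g for g, base in baseline.items()
        if g in current and base - current[g] > max_drop
    )


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    baseline = {"group_a": 0.95, "group_b": 0.93}
    recent = [
        Review("group_a", True), Review("group_a", True),
        Review("group_a", True), Review("group_a", True),
        Review("group_b", True), Review("group_b", False),
        Review("group_b", True), Review("group_b", False),
    ]
    current = accuracy_by_group(recent)
    print(current)                             # per-group accuracy
    print(groups_to_pause(baseline, current))  # groups that breach the rule
```

Breaking results out by group matters because an overall average can hide a decline that affects only one group. In practice, a pause rule would also need a minimum sample size per group and a named person responsible for acting on the alert.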
Educational Needs
Many additional needs remain to ensure that AI tools are used appropriately and efficiently in federal health agencies. “Most importantly, we need to make sure that humans are in the loop ensuring no big errors or dangerous decisions,” Dr Altman advised.
The level of human involvement should be matched to the level of risk, according to Dr Dankwa-Mullan, who wrote a 2024 paper on the topic of equity and ethical considerations regarding the use of AI in public health and medicine.12 “Higher-stakes uses require human review and stronger safeguards, while lower-risk tasks can be lighter-touch.”
Dr Shah cited the importance of “educating federal workers on how to look for inaccuracies or hallucinations, especially when these tools are used in high-stakes settings such as clinical medicine, drug approvals, or claims review.” In addition, the workforce should be trained to discern which situations may lead to better or worse AI performance or a higher risk of bias to “help ensure that AI is used in the right settings for the right tasks,” he said.
Dr Altman believes that AI must ultimately “augment professionals but not replace their professional judgment or their ability to override AI recommendations that don’t make sense to them.” He added, “This will be the next big challenge — integrating powerful tools into the workflow so that it super-powers professionals and doesn’t take away their agency or override their expertise.”
Policy and Planning Considerations
Dr Dankwa-Mullan provided several suggestions regarding policy and governance as foundations for managing risks related to the use of AI in federal health agencies.
“Because AI tools vary widely in reliability, relevance, and risks, the most urgent needs or priorities should match the task, mission, and stakes,” she said. “This means real-world evaluations with fairness checks across diverse settings, role-based education so clinicians can safely verify outputs, and strong governance that guarantees transparency, auditability, and human accountability for any safety-critical use.”
Health agencies should set AI procurement standards that “require transparency, audit trails, and access to evidence before adoption,” she continued. Agencies should prioritize equity through early engagement of communities and clinicians and ongoing performance assessments of AI tools by subgroup.
To support security, privacy, and accountability with the use of AI, agencies should “follow HIPAA and cyber best practices, define who is responsible, report incidents, and have sunset and rollback plans,” Dr Dankwa-Mullan advised. Agencies should also invite independent, third-party testing to detect potential bias and other issues.
“Overall, AI has tremendous potential to augment the capabilities of the federal health agencies, but it is important to be strategic, thoughtful, and deliberate in evaluation and implementation to ensure that its use meets the stated goals of the agency and serves the public,” Dr Shah concluded.
Disclosures: Dr Altman reported that he is supported by an FDA grant that includes work on AI with the FDA. He stated, “My comments are my own and do not necessarily represent the official views of, nor an endorsement, by FDA, HHS, or the US government.” Dr Altman also reported that he has provided expert testimony for Anthropic. Dr Shah and Dr Dankwa-Mullan reported having no relevant disclosures.
