A laboratory technician holds a tube containing a swab sample for serological testing for coronavirus infection at the Roimit Health Services Institute in Or Yehuda, Israel, on July 16, 2020 – Copyright AFP Rostislav NETISOV
AI in medicine is set to transform drug discovery, clinical trials, manufacturing, and marketing by analyzing vast data sets to speed up processes, reduce costs, and enable personalized medicine.
Applications currently being worked on include everything from drug candidate identification and protein structure prediction to supply chain optimization and regulatory automation, but challenges such as data quality and transparency remain. There are examples where AI can help discover new targets, design molecules faster, better recruit trial patients, and create customized treatments, making drug development more efficient and accurate.
But how will this technological revolution be reconciled with drug regulatory authorities overseeing the pharmaceutical sector at national and supranational levels?
The European Medicines Agency (EMA) has become the first medicines regulatory authority to produce draft guidance on the use of artificial intelligence applied to the development and manufacture of medicines. This comes at a critical juncture, with both the benefits and the errors associated with AI coming into sharp focus.
This draft document, known as Annex 22, represents a new regulatory annex focused on the governance, validation, and monitoring of AI/ML systems used in Good Manufacturing Practice (GMP) environments. This draft is strictly complementary to Annex 11 (set out for computerized systems). These two documents are designed to prevent unsafe use of adaptive or opaque models in critical GxP processes.
My assessment of the contents of Annex 22 is as follows.
Scope – very strict boundaries
Annex 22 applies only to static, deterministic AI/ML models used in critical GMP processes. A static model does not continue to learn once in use, and a deterministic model gives the same output for the same input. Even these models are allowed in critical applications only under strict controls.
By contrast, dynamic or self-learning models, stochastic models, and generative AI, including large language models (LLMs), are explicitly excluded from critical applications. The Annex specifically states that generative AI and LLMs are permissible only for non-critical GMP tasks, and then only with human-in-the-loop (HITL) monitoring. HITL is an approach to AI and machine learning that integrates human interaction and intelligence into a system’s training, testing, and operating cycles.
This is a very high bar, and many commercial AI tools will fail to meet it unless they are configured with severe restrictions.

Emphasis on cross-functional accountability
The Annex mandates that subject matter experts, data scientists, quality assurance (QA), IT, and vendors all collaborate, from algorithm selection through to operation. Demonstrating this requires clear documentation, whether the model is built in-house or by a supplier. Quality risk management must underpin all decisions.
Additionally, each pharmaceutical organization using AI must develop and implement a strong governance framework for AI.
Intended use – must be very clearly defined
Pharmaceutical acceptance testing consists of formal, documented, GMP-compliant device validation, particularly through Factory Acceptance Testing (FAT) and Site Acceptance Testing (SAT). FAT validates equipment at the vendor’s site before shipping, while SAT ensures functional, integrated performance in the final operating environment.
In this context, the annex indicates that a complete characterization of the input sample space, including identification of rare variations, is required before acceptance testing begins. To achieve this, subgroups (sites, equipment, defect types, etc.) need to be identified and HITL responsibilities explicitly defined and monitored.
Acceptance Criteria – Statistical Expectations
To assess the success of AI, the Annex requires:
- Clear test metrics (accuracy, sensitivity, etc.).
- Pass criteria set by subject matter experts before testing begins.
- The performance of the AI model must be better than the process it replaces.
This assumes that the current manual or automated processes that the AI is intended to replace have known and documented performance metrics.
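As a rough illustration of how such criteria might be checked, the sketch below compares a model against the process it replaces using accuracy and sensitivity. The class name, function names, thresholds, and counts are all hypothetical and are not taken from the annex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a labelled test set (all numbers below are hypothetical)."""
    tp: int  # defects correctly flagged
    fp: int  # good units wrongly flagged
    tn: int  # good units correctly passed
    fn: int  # defects missed


def accuracy(c: ConfusionCounts) -> float:
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    return c.tp / (c.tp + c.fn)


def meets_acceptance(model: ConfusionCounts, baseline: ConfusionCounts,
                     min_accuracy: float, min_sensitivity: float) -> bool:
    """Pass only if the predefined thresholds are met AND the model
    outperforms the existing process on both metrics."""
    return (
        accuracy(model) >= min_accuracy
        and sensitivity(model) >= min_sensitivity
        and accuracy(model) > accuracy(baseline)
        and sensitivity(model) > sensitivity(baseline)
    )


# Hypothetical results: the existing manual inspection versus the AI model.
manual = ConfusionCounts(tp=90, fp=20, tn=880, fn=10)
model = ConfusionCounts(tp=97, fp=8, tn=892, fn=3)

# Thresholds are fixed before the test is run, not tuned afterwards.
passed = meets_acceptance(model, manual, min_accuracy=0.98, min_sensitivity=0.95)
print(passed)
```

The check is only meaningful if the baseline counts for the existing process have actually been measured and documented, which is the assumption noted above.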

Test data – high statistical and procedural rigor
The test data used to evaluate the AI must cover the whole input space (including rare edge cases). The annex also requires that the dataset be large enough to support statistically sound conclusions, and that it be labeled with a very high degree of accuracy.
Interestingly, the annex also states that users should avoid evaluating the AI with test data that was itself generated by AI.
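One way to see why dataset size matters is to look at the uncertainty around an observed accuracy. The sketch below uses the Wilson score interval, a standard statistical method chosen here for illustration; the annex does not prescribe it, and the sample sizes are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# The same observed accuracy (98%) on two hypothetical test set sizes.
small = wilson_interval(successes=98, n=100)
large = wilson_interval(successes=9800, n=10000)
print(small)
print(large)
```

With 100 samples the lower bound falls below 0.95, while with 10,000 samples it stays above 0.97, so the same headline accuracy supports a much weaker claim on the smaller set.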
Test data independence – strong separation of duties
To ensure that the AI development process is not biased, the Annex introduces a series of controls. These include:
- Training and test data must be kept separate and never shared between the two sets (to ensure that the test data is not contaminated).
- Access-controlled and audited repositories.
- Developers should never have access to test data.
- Staff members who have seen the test data cannot train the same model unless the four-eyes principle is applied.
- Physical objects used for testing cannot be reused for training.
Together, these requirements enforce strict data separation.
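A simple automated check can support this separation by flagging any test record that also appears in the training set. The sketch below is illustrative only: the record identifiers are hypothetical, and the normalization applied (trimming whitespace and lowercasing) is an assumption about what counts as the same record.

```python
import hashlib


def fingerprint(record: str) -> str:
    """Stable fingerprint of a record identifier.

    Trimming and lowercasing is an assumed normalization rule.
    """
    return hashlib.sha256(record.strip().lower().encode("utf-8")).hexdigest()


def overlapping_records(train, test):
    """Return the test records whose fingerprint also occurs in the training set."""
    train_prints = {fingerprint(r) for r in train}
    return [r for r in test if fingerprint(r) in train_prints]


# Hypothetical identifiers for physical samples.
train_ids = ["vial-0001", "vial-0002", "vial-0003"]
test_ids = ["vial-0100", "VIAL-0002 ", "vial-0101"]

leaks = overlapping_records(train_ids, test_ids)
print(leaks)  # a non-empty list means the test set is contaminated
```

A check like this covers only the data itself. The controls on who may access the test repository remain procedural.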
Running the test
To test the suitability of an AI, the Annex requires:
- Demonstration of generalization (no over- or underfitting).
- Fully predefined test plan including metrics, test scripts, and data references.
- Deviations handled in the same way as under the standard GMP deviation process.
- Retention of all test artifacts, including audit trails and physical test objects.
Explainability – essential for critical applications
Each AI model must provide feature attribution. This refers to explainable AI (sometimes abbreviated as XAI) techniques that assign importance scores to input features and quantify their impact on a machine learning model’s predictions. These methods help determine how specific inputs, such as drug yield, drive the model’s behavior when making predictions. The purpose is to provide model transparency and insight into how decisions are made.

To demonstrate “explainability,” SHAP and LIME are two common model-agnostic techniques for understanding the predictions of machine learning models; they differ primarily in their approach.
- LIME (Local Interpretable Model-agnostic Explanations) fits a simple local linear model around a specific prediction.
- SHAP (SHapley Additive exPlanations) uses game theory (Shapley values) to obtain more robust and mathematically grounded feature attributions, providing both local and global insights.
The message is that AI “black boxes” are unacceptable in a GMP environment.
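To make the Shapley idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy model by averaging each feature's marginal contribution over every ordering of the features. It does not use the SHAP library, which relies on faster approximations. The model, feature names, and baseline are hypothetical, and the brute-force approach is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values via all feature orderings.

    Features not yet "added" keep their baseline value.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous  # marginal contribution
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_yield_model(x):
    """Hypothetical model: two linear terms plus one interaction term."""
    return 2.0 * x["temperature"] + 1.0 * x["ph"] + 0.5 * x["temperature"] * x["stir_rate"]


baseline = {"temperature": 0.0, "ph": 0.0, "stir_rate": 0.0}
instance = {"temperature": 1.0, "ph": 2.0, "stir_rate": 4.0}

phi = shapley_values(toy_yield_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved.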
Confidence – control over uncertain predictions
To have confidence that your AI models are working as intended, each model must:
- Log a confidence score for each prediction.
- Use thresholds to avoid unreliable output.
- Output “Undetermined” if confidence is low.
These features are intended to prevent inappropriate automated decisions.
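The pattern can be sketched in a few lines. In the example below, the function name, the class labels, and the 0.9 threshold are hypothetical; in practice the threshold would be justified and fixed during validation.

```python
def decide(probabilities: dict, threshold: float = 0.9):
    """Return (outcome, confidence) for one prediction.

    The top class is returned only if its probability reaches the
    threshold; otherwise the outcome is "Undetermined" so that a
    human can review the case.
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    outcome = label if confidence >= threshold else "Undetermined"
    return outcome, confidence


audit_log = []  # stand-in for a proper audit trail
for probs in ({"accept": 0.97, "reject": 0.03}, {"accept": 0.60, "reject": 0.40}):
    outcome, confidence = decide(probs)
    audit_log.append({"outcome": outcome, "confidence": confidence})

print(audit_log)
```

The second prediction is not forced into either class. It is recorded as “Undetermined” together with its confidence score.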
Operations – Rigorous lifecycle governance
To ensure that AI models remain under control throughout their intended lifecycle, the annex requires that each change be documented and evaluated, and that configuration controls be implemented to detect unauthorized changes.
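One common way to detect an unauthorized change to a model is to compare a cryptographic digest of the deployed artifact with the digest recorded when it was approved. The sketch below assumes SHA-256 and uses in-memory bytes as a stand-in for a model file; neither choice comes from the annex.

```python
import hashlib


def artifact_digest(data: bytes) -> str:
    """SHA-256 digest of a model artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, approved_digest: str) -> bool:
    """True if the artifact still matches the digest recorded at approval."""
    return artifact_digest(data) == approved_digest


# Hypothetical artifact; in practice these bytes would be read from the model file.
approved = artifact_digest(b"model-weights-v1")

print(is_unchanged(b"model-weights-v1", approved))
print(is_unchanged(b"model-weights-v1-tampered", approved))
```

A digest check only detects that something changed. Documenting and evaluating the change remains a matter for change control.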
The use of AI in the pharmaceutical sector looks set to accelerate. The draft annex provides some clarity on what drug regulators within the European Union will expect. The public comment period for the annex recently closed, and the final version is expected to be published in late 2026.
