
Drug safety signals are often hidden in clinical notes and other text in electronic health records (EHRs). Finding them requires either costly manual chart abstraction or natural language processing using software tailored to the specific drug being documented in a particular center. This is the case for immune checkpoint inhibitors (ICIs), a type of anti-cancer drug that first became commercially available in 2011, causing a variety of immune-related adverse events (irAEs) that affect the colon, liver, lungs, heart, nervous system, skin, and endocrine system.
Large-scale language models (LLMs), an increasingly inevitable form of artificial intelligence (AI), are being explored as a solution to speed the identification of drug safety signals buried in text. The multicenter study was reported April 6. e-biomedicine We test an LLM for detecting irAEs using a model from San Francisco-based Open AI.
The researchers use so-called zero-shot learning. In this study, LLMs are given one detailed prompt without examples. The prompt created by the team includes the message, “You are a clinical expert in identifying immune-related adverse events caused by immune checkpoint inhibitors…” and includes a list of six ICIs and dozens of irAEs. This prompt was applied to randomly selected clinical records of patients exposed to ICIs published by Vanderbilt Health (100 patients) and the University of California, San Francisco (70 patients), as well as records from seven ICI trials (272 patients) sponsored by Roche, a pharmaceutical company based in Basel, Switzerland.
“Manual patient record abstraction to monitor the safety and efficacy of already marketed drugs is resource-intensive and slows the pace of discovery in precision medicine, especially for immune checkpoint inhibitors, where adverse events are highly variable. If zero-shot learning with LLM can help with these notes, it has the potential to significantly reduce time and costs for all involved,” said Vanderbilt, corresponding author of the report. said Dr. Cosmin Bijan, assistant professor of biomedical informatics in the College of Health.
The team studied three LLMs: GPT-3.5, GPT-4, and GPT-4o, with the last one offering the best performance. As a primary performance measurement, teams use F1 scores. This score ranges from 0 to 1 and is sensitive to both false positives and false negatives. An F1 score of 90% or higher is considered good, and predictive models with a score of 80% or higher may be eligible to drive automated clinical decision support.
For patient-level irAE detection, the average F1 scores from GPT-4o across Vanderbilt Health and UCSF EHRs and Roche study notes were 56%, 66%, and 62%, respectively. The model showed a systematic bias in overpredicting irAEs. When detecting 17 irAEs at the single-note level (using GPT-4o on 667 notes from Vanderbilt Health), the average F1 score was 57%.
“These results demonstrate that zero-shot learning with a powerful LLM can help detect these adverse events,” Bejan said. “Although this performance is not at the level required for clinical decision support, this method is valuable for automated irAE extraction across multiple sites, potentially accelerating discovery and increasing the safety and efficacy of cancer immunotherapy.”
Other Vanderbilt Health participants in the study include Yaomin Xu, MD, Eric Mukherjee, MD, Matthew Krantz, MD, Douglas Johnson, MD, MSCI, Elizabeth Phillips, MD, and Justin Balko, MD. This research was supported in part by the National Institutes of Health under awards R01CA227481 and R01HL156021.
In this regard, in a research letter from December last year, JAMA OncologyUsing logistic regression with adverse event reports collected by the Food and Drug Administration, Mukherjee, Phillips, and colleagues found that ICIs were independently associated with an increased risk of a dangerous skin reaction, SJS/TEN (Stevens-Johnson syndrome/toxic epidermal necrolysis), and that this increased risk may occur in association with patient exposure to human leukocyte antigen-restricting drugs.
