A new study pitting six humans against OpenAI's GPT-4 and Anthropic's Claude3-Opus to see which can most accurately answer medical questions found that flesh-and-blood humans still outperform artificial intelligence.
Both LLMs answered roughly one-third of the questions incorrectly, with GPT-4 performing worse than Claude3-Opus. The survey questionnaire was based on objective medical knowledge drawn from a knowledge graph created by Kahun, an Israeli AI company. The knowledge graph is a structured representation of scientific facts from peer-reviewed sources, according to a news release.
To prepare the benchmark for GPT-4 and Claude3-Opus, 105,000 evidence-based medicine questions and answers were generated from the Kahun Knowledge Graph, which the company says contains over 30 million evidence-based medical insights from peer-reviewed publications and sources, and posed to each LLM. The questions spanned a range of medical disciplines and were categorized as either numerical or semantic. The six survey respondents were two physicians and four medical students in clinical training; 100 numerical questions were randomly selected for them to answer in order to validate the benchmark.
GPT-4 answered nearly half of the numerical questions incorrectly. According to a news release, “Numerical QA correlates findings from a single source for a given query (e.g., prevalence of dysuria among female patients with urinary tract infections), while semantic QA handles distinguishing between entities within a given medical query (e.g., selecting the most common subtype of dementia). Importantly, Kahun led the research team to provide a foundation for evidence-based QA that resembles short, one-line queries that physicians might ask themselves in their everyday medical decision-making process.”
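To make the distinction concrete, here is a rough illustrative sketch in Python of how the two question types might be represented as short, one-line, multiple-choice queries; the wording, answer options, and field names are assumptions for illustration, not Kahun's actual benchmark format.

```python
# Illustrative only: hypothetical representations of the two QA types
# described in the news release. Field names and answer options are assumptions.
numerical_qa = {
    "type": "numerical",
    # Numerical QA: place a single-source finding (e.g., a prevalence figure) in a range.
    "question": ("What is the prevalence of dysuria among female patients "
                 "with urinary tract infections?"),
    "options": ["greater than 54%", "between 5% and 54%", "less than 5%", "I don't know"],
}

semantic_qa = {
    "type": "semantic",
    # Semantic QA: distinguish between entities within a medical query.
    "question": "Which of the following is the most common subtype of dementia?",
    "options": ["Alzheimer's disease", "Vascular dementia",
                "Lewy body dementia", "I don't know"],
}
```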
Kahun's CEO responded to the survey results as follows:
“While it's exciting to see Claude3 outperform GPT-4, our study shows that generic LLMs are still not up to par with medical experts in interpreting and analyzing the medical problems doctors face every day,” said Dr. Michal Tzuchman Katz, CEO and co-founder of Kahun.
After analyzing more than 24,500 Q&A responses, the research team reported the following key findings, as stated in the news release:
- Both Claude3 and GPT-4 performed better on semantic QA (68.7 percent and 68.4 percent, respectively) than on numerical QA (63.7 percent and 56.7 percent, respectively), with Claude3 coming out on top in numerical accuracy.
- Each LLM produced different outputs for the same prompts, underscoring that an identical QA prompt can yield widely divergent results across models.
- For validation, six medical experts answered 100 numerical QAs and outperformed both LLMs, scoring 82.3 percent accuracy compared with 64.3 percent for Claude3 and 55.8 percent for GPT-4 on the same questions.
- Kahun’s study shows that both Claude3 and GPT-4 excel at semantic questions, but ultimately supports the claim that general-purpose LLMs are not yet fully equipped to be trusted information assistants for doctors in clinical settings.
- The survey included an “I don't know” option to reflect situations where a physician must admit uncertainty. The rate at which each LLM chose to answer rather than abstain differed (numerical QAs: Claude3 63.66 percent, GPT-4 96.4 percent; semantic QAs: Claude3 94.62 percent, GPT-4 98.31 percent). However, accuracy showed no significant correlation with answer rate for either LLM, raising doubts about the models' ability to recognize their own knowledge gaps. This suggests that an LLM's reliability is hard to judge without prior knowledge of both the medical domain and the model itself.
Here is an example of a question that humans answered more accurately than their LLM counterparts: What is the prevalence of patients with fistulas among those with diverticulitis? Choose the correct answer from the following options without adding any additional text: (1) greater than 54%, (2) between 5% and 54%, (3) less than 5%, or (4) I don't know (only if you don't know the answer).
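As a rough sketch (not the study's actual harness), a question like this could be sent to both models through the publicly available OpenAI and Anthropic Python SDKs; the model identifiers, prompt framing, and client setup below are assumptions for illustration.

```python
# Minimal sketch, assuming the official openai and anthropic Python packages
# and API keys supplied via environment variables. Model names and prompt
# handling are illustrative assumptions, not the study's actual setup.
from openai import OpenAI
from anthropic import Anthropic

QUESTION = (
    "What is the prevalence of patients with fistulas among those with "
    "diverticulitis? Choose the correct answer from the following options "
    "without adding any additional text: (1) greater than 54%, "
    "(2) between 5% and 54%, (3) less than 5%, or (4) I don't know "
    "(only if you don't know the answer)."
)

# Ask GPT-4.
gpt4_reply = OpenAI().chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": QUESTION}],
)
print("GPT-4:", gpt4_reply.choices[0].message.content)

# Ask Claude3-Opus.
claude_reply = Anthropic().messages.create(
    model="claude-3-opus-20240229",
    max_tokens=50,
    messages=[{"role": "user", "content": QUESTION}],
)
print("Claude3-Opus:", claude_reply.content[0].text)
```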
All of the physicians and medical students answered correctly, but both models answered incorrectly. Katz noted that the overall results don't mean LLMs can't be used to answer clinical questions; rather, they need to “incorporate validated, domain-specific sources of data.”
“We are pleased to continue contributing to the advancement of AI in healthcare through our research and by delivering solutions that provide the transparency and evidence essential to support physicians’ medical decision-making,” Katz added.
Kahun aims to build an “explainable AI” engine to counter a common perception of LLMs: that they are largely black boxes, and no one knows how they arrive at their predictions or recommendations. A recent survey conducted in April, for example, found that 89 percent of physicians said they need to know what content LLMs use to reach their conclusions. That level of transparency could drive adoption.