Chatbots powered by artificial intelligence (AI) answer everyday health-related questions from ordinary users with about 76% accuracy, raising concerns about their reliability in real-world, patient-facing applications, according to a new study led by researchers at Penn State University.
Researchers wanted to understand how the public uses AI for health-related concerns and how accurately AI responds to everyday medical inquiries. They found that when it comes to medicine, particularly specialties such as neurology and dermatology, AI tools may work best in the hands of trained physicians rather than patients. The team will present their findings at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency (FAccT) conference, June 25-28 in Montreal, Canada.
“Our study explicitly focuses on medical scenarios that an average internet user might ask an AI about, a perspective not covered by previous research on large language models (LLMs) and health care. We wanted to understand whether people are using LLMs like ChatGPT as symptom checkers, as they have historically used Google, how accurately LLMs can answer those queries, and how harmful those responses can be.”
Amulya Yadav, study co-author, Associate Professor of Informatics and Intelligent Systems, Penn State College of Information Sciences and Technology (IST)
To understand how accurate or harmful health-related LLM responses are for the average internet user, the researchers held an AI competition called Diagnose-a-thon at Penn State University. A total of 34 participants, including faculty, undergraduates, and graduate students, submitted 212 prompts and the AI-generated responses to them. The prompts described real and imagined health concerns and were written from both patient and physician perspectives. Participants could choose one of four LLMs (ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b) to use for the contest.
“One of the strengths of our study is that by asking participants to select an LLM of their own choice and use it as they would on a normal day, we are essentially trying to recreate the real-life use of an LLM,” said Bonham Mingole, lead author of the study and a doctoral candidate in information science and technology. “This type of participatory research is critical to understanding how ordinary people use AI in their daily lives.”
The researchers then asked nine board-certified physicians to rate the AI-generated responses for accuracy and for harmfulness on a six-point scale from very low to very high. The competition committee gave prizes to the top eight submissions that produced the most medically accurate information and to the submissions that produced the responses most likely to cause harm.
Overall, the researchers found that 76.2% of the LLM-generated responses provided accurate information. The LLMs performed best in specialties such as obstetrics and gynecology and otolaryngology (the treatment of diseases that affect the ear, nose, and throat), with high accuracy scores and low harm scores. According to the researchers, AI performed worst in internal medicine, neurology, and dermatology, with lower accuracy scores and higher harm scores. They added that highly specific prompts, and prompts between 60 and 250 characters long, resulted in more accurate LLM output.
The researchers then took a base model for each LLM and gave it additional training on medical textbooks, clinical guidelines, and peer-reviewed research articles included in medical school curricula to see whether the extra training raised accuracy scores and lowered harm scores. They asked a panel of seven medical professionals and trainees (a board-certified physician, two second-year internal medicine residents, two fourth-year medical students, and two third-year medical students) to compare responses from the base and further-trained versions of each LLM and judge which was more clinically appropriate. The panel preferred responses from the base models over those from the further-trained models for Gemini and Llama, with a weaker preference in the case of ChatGPT.
“We are entering a new era of health care, and AI is a big part of it,” said study co-author Jennifer Kraschnewski, director of the Penn State Clinical and Translational Science Institute and professor of internal medicine at Penn State College of Medicine. “There is a real opportunity for health care to transform and integrate these new tools so that clinicians like me can use them to improve patient care.”
The researchers also noted that, despite the LLMs' overall accuracy, the AI error rate is still over 20%, roughly twice the error rate of human doctors. They said these mistakes could be harmful to patients.
“I don’t think AI will replace human doctors, but I think there’s a huge opportunity to help today’s doctors upskill in ways that haven’t been done before,” Kraschnewski said, suggesting that current LLMs may be a better tool for medical professionals than for patients.
Overall, the researchers say, the study highlights both the potential benefits and the potential harms of AI in important aspects of people’s lives.
“Whether we like it or not, people will continue to use AI to diagnose health problems,” said study co-author S. Shyam Sundar, Evan Pugh University Professor and James P. Jimirro Professor of Media Effects at Penn State University. “By understanding AI usage patterns and testing the adequacy of AI performance, our project will help improve literacy about the best and worst uses of AI in medical advice.”
Penn State IST doctoral students Aditya Majumdar and Firdaus Ahmed Choudhury also contributed to this research. Penn State’s Center for Socially Responsible Artificial Intelligence hosted the Diagnose-a-thon competition.
Source:
Journal reference:
Mingole, B., et al. (2026). Dr. GPT would like to see you now, but should you? Using crowdsourced clinical cases to investigate the benefits and drawbacks of large language models in medical diagnosis. DOI: 10.48550/arXiv.2506.13805. https://arxiv.org/abs/2506.13805
