Data collection
MRI reports were collected from 6,174 tumor patients who had been scanned at three locations between January 1, 2019 and December 31, 2024. The reports were written by radiologists specializing in multiple anatomical systems, including but not limited to the nervous, digestive, and urinary systems, and therefore varied in length and complexity. Each report described in detail the normal anatomy of the scanned area and any abnormal lesion signals observed, and provided a brief preliminary diagnosis. Two independent reviewers analyzed each original MRI report along with the corresponding scans and categorized the findings as benign, atypical, or malignant (Table 1). In cases of disagreement, a third oncologist made the final decision. To maintain data reliability, no changes were made to the content of the original reports. Additionally, all identifiable information, such as patient details, exam date, registration number, and physician name, was anonymized to protect patient confidentiality.
Research design
This study used two chatbots: GPT O1-Preview (developed by OpenAI), hereafter referred to as Chatbot 1, and a model developed by DeepSeek, designated as Chatbot 2. To standardize the readability comparison, all original MRI reports and submitted queries were in English only. Each chatbot was asked to perform four tasks in sequence: first, to interpret the report in a way that patients with no medical background can understand; second, to classify the lesion as benign, atypical, or malignant; third, to assess the need for surgical intervention; and fourth, to recommend a treatment plan based on the content of the report (Table 2). A new chat session was launched for each analyzed report to minimize bias. The chatbot's response to each prompt was recorded.
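The querying protocol described above (four prompts in sequence, one fresh session per report) can be sketched as follows. This is an illustrative outline only, not the procedure actually used: the prompt wording is paraphrased from the text rather than taken from Table 2, the `Session` interface and the `new_session` factory are hypothetical stand-ins for whichever chatbot interface is used, and sending the report text together with the first prompt is an assumption.

```python
from typing import Callable, List

# Paraphrased from the four tasks in the text; the exact wording is assumed.
PROMPTS: List[str] = [
    "Interpret this MRI report so that a patient with no medical "
    "background can understand it.",
    "Classify the lesion as benign, atypical, or malignant.",
    "Assess whether surgical intervention is needed.",
    "Recommend a treatment plan based on the content of the report.",
]

# Hypothetical interface: a session takes one message, returns the reply,
# and keeps the conversation context between calls.
Session = Callable[[str], str]


def run_report(report_text: str, new_session: Callable[[], Session]) -> List[str]:
    """Ask the four questions in order inside a fresh session."""
    session = new_session()  # one new session per report
    responses: List[str] = []
    for index, prompt in enumerate(PROMPTS):
        # Assumption: the report is attached to the first prompt only.
        message = f"{prompt}\n\n{report_text}" if index == 0 else prompt
        responses.append(session(message))  # record every response
    return responses
```

Passing the session factory as an argument keeps the sketch independent of any particular vendor API and makes the one-session-per-report rule explicit.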
In this study, the readability of both the original MRI reports and the explanatory reports generated by the chatbots was assessed using the online tool available at https://www.webfx.com/tools/read-able/. Three widely recognized readability indexes were calculated: the Flesch-Kincaid Reading Ease (FRE) score, the Flesch-Kincaid Grade Level (FKGL), and the Gunning Fog Score (GFS). The readability assessment was completed during the study period from February 1 to March 31, 2025. The responses provided by the chatbots subsequently underwent medical review. Each response thread generated by a chatbot was assessed independently by two medical reviewers. In the event of disagreement, a third oncologist was consulted to resolve the inconsistency.
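For reference, the three readability indexes follow standard published formulas based on words per sentence, syllables per word, and the share of words with three or more syllables. The sketch below applies those formulas with a deliberately simple tokenizer and a heuristic syllable counter. Both are assumptions made for illustration, so its scores will not necessarily match those returned by the online tool used in this study.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return FRE, FKGL, and GFS for a block of English text."""
    # Naive splitting: decimals such as "3.5 cm" are treated as a sentence break.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # "Complex" words for the Gunning Fog index: three or more syllables.
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)

    return {
        "FRE": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "FKGL": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "GFS": 0.4 * (words_per_sentence + 100 * complex_share),
    }
```

A higher FRE indicates easier text, whereas higher FKGL and GFS values indicate that more years of education are needed to understand it.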
The medical review of the explanatory report, i.e., the answer to the first question, graded each response on four levels. “Correct” indicates that all content from the original report is included accurately and without errors. “Partially correct” refers to the omission of details that do not affect patient management, such as failing to explain normal variations in sulci or gyri. “Partially incorrect” covers errors that affect patient management only slightly, such as minor inaccuracies in describing tumor size or shape that are not significant enough to change diagnostic or treatment recommendations. “Incorrect” denotes an error that has a major impact on patient management, such as misrepresenting the location of the tumor.
The second and third questions aim to classify the nature of the tumor and to determine the need for surgical intervention, respectively. In these classification tasks, each result was evaluated as either correct or incorrect. Additionally, reviewers used Likert scales to assess the quality of the treatment suggestions and the empathy demonstrated by the chatbots in their responses.
Ethical considerations
The Academic Ethics Review Board of Beijing Union Medical University Hospital, the Academy of Medical Sciences, China, granted an exemption from ethical review for this cross-sectional study (exemption number SZ-3192). All data used in this study were anonymized to ensure the privacy and confidentiality of human subjects. The institutional review board of Beijing Union Medical University Hospital, the Academy of Medical Sciences, China, waived the requirement for informed consent and permitted a secondary analysis of the data without additional consent. This study adhered to the principles of the Declaration of Helsinki and followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.
Statistical analysis
The Friedman test was used to compare readability among the original reports and the two sets of chatbot-generated explanatory reports. Wilcoxon signed-rank tests were used to compare readability, quality of treatment recommendations, and empathy between the reports generated by the two chatbots. Additionally, the chi-square test was applied to assess differences in medical review performance between the two chatbots. All statistical analyses were performed using SPSS software (version 26.0, IBM), and two-tailed p-values below 0.05 were considered statistically significant.
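The analyses in this study were run in SPSS. For readers who prefer an open-source route, the same three tests can be sketched with SciPy as shown below. All numbers in the sketch are synthetic placeholders, not study data, and the sample size, score distributions, and contingency counts are assumptions chosen only to make the example runnable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_reports = 30  # hypothetical sample size, not the study's

# Synthetic FRE scores for the same reports under three conditions.
original = rng.normal(30, 5, n_reports)
chatbot1 = original + rng.normal(25, 5, n_reports)
chatbot2 = original + rng.normal(28, 5, n_reports)

# Friedman test: three related samples (original vs. both chatbots).
friedman = stats.friedmanchisquare(original, chatbot1, chatbot2)

# Wilcoxon signed-rank test: paired comparison of the two chatbots.
wilcoxon = stats.wilcoxon(chatbot1, chatbot2)

# Chi-square test on medical review grades (synthetic counts).
# Rows: Chatbot 1, Chatbot 2.
# Columns: correct, partially correct, partially incorrect, incorrect.
grades = np.array([[120, 50, 20, 10],
                   [130, 45, 17, 8]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(grades)

alpha = 0.05  # two-tailed significance threshold stated in the text
print(f"Friedman p = {friedman.pvalue:.3g}")
print(f"Wilcoxon p = {wilcoxon.pvalue:.3g}")
print(f"Chi-square p = {p_chi2:.3g} (dof = {dof})")
```

`scipy.stats.wilcoxon` is two-sided by default, which matches the two-tailed threshold above.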
