Guidelines for reporting research on generative artificial intelligence applications: What to use and when?

The growing number of publications on the use of generative artificial intelligence (GAI), including large language models (LLMs), for health purposes has created a need to guide authors on transparent reporting practices.^1,2 Although LLMs are currently the dominant approach, other GAI applications such as diffusion models and large multimodal models are also gaining popularity.³ A key difference between GAI and traditional AI is that GAI can generate new content based on its training data. Differences in methodology and incomplete reporting among studies applying GAI for health purposes impair readers' ability to interpret study results accurately.³ This is particularly relevant when evaluating the effectiveness of complex GAI platforms in medicine.

GAI models are currently being used to address a variety of research questions across different study designs, which calls for new reporting guidelines.⁴ Although more than 25 reporting guidelines address research applying artificial intelligence and machine learning in medicine, few apply to research on GAI applications, and few of those adhere to contemporary methodological standards.^{5, 6, 7, 8} As journal editors adopt these reporting standards, researchers may be encouraged to complete and submit checklists and methodological diagrams alongside their manuscripts to support transparent reporting of their methods. Authors applying GAI models in medicine should therefore carefully identify the most appropriate reporting guideline for their study, because these guidelines include items tailored to studies involving GAI models.^{5, 6, 7, 8} The purpose of this article is to summarize the current rigorous GAI reporting guidelines and highlight those in development.

GAI Reporting Guidelines

Choosing the most appropriate reporting guideline usually depends on the purpose of the study. Figure 1 lists the research objectives currently addressed by reporting guidelines. At the time of writing, LLMs are the primary GAI models being evaluated in medicine, but other common examples include diffusion models and large multimodal models.⁹ Research on LLMs is addressed by the Chatbot Assessment Reporting Tool (CHART), the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD)-LLM, or the Generative Artificial intelligence tools in MEdical Research (GAMER) reporting guideline.^{5, 6, 7, 8}

Summarizing clinical evidence and providing health advice

CHART provides reporting recommendations for studies that evaluate GAI models or GAI-driven chatbots that summarize clinical evidence and provide health advice (referred to as chatbot health advice (CHA) studies).^6,8 CHART can also be applied to studies of standalone GAI models if the model interacts with the user in natural language, for example through an application programming interface. Researchers should apply CHART to CHA studies that evaluate a single GAI model or GAI-driven chatbot, as well as to comparative studies of multiple GAI models or chatbots.^6,8 The framework is also relevant to the evaluation of tuned or fine-tuned GAI models and chatbots built for customized evidence summaries and health advice. As shown in Figure 1, the scope of CHART covers clinical evidence or health advice related to prevention, screening, diagnosis, treatment, prognosis, and general health information.^6,8

Developing models, generating documentation, and predicting outcomes

Authors can apply TRIPOD-LLM to a wide range of use cases, from de novo LLM development to using LLMs to generate medical documentation and predict outcomes from patient data.⁵ The TRIPOD-LLM authors also recommend using TRIPOD-LLM in studies that evaluate an LLM's ability to perform tasks such as:

○ Text processing (e.g., identifying predefined categories of objects within a body of data, also known as named entity recognition).⁵
○ Classification (e.g., determining whether patient pronouns are used correctly in medical records).
○ Information retrieval (e.g., training a GAI model to respond to user queries using relevant publications).⁵
○ Summarization (e.g., translating clinical documents into language tailored to patients).

Figure 1 outlines further use cases, consistent with the original TRIPOD-LLM publication.⁵ The reporting recommendations are suitable for evaluating a single LLM or comparing multiple LLMs.

Applying GAI to manuscript writing

The studies described so far evaluate the performance of GAI models for specific research objectives. However, there is growing interest in applying GAI models to support manuscript writing across traditional study designs.⁷ Rather than focusing on model performance, the GAMER reporting guideline provides recommendations for medical research studies in which all or part of the manuscript is written by a GAI model.⁷ For example, authors can apply GAMER when using a GAI model to assist in writing case reports. Figure 1 shows additional examples.
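The pairing of study aim and reporting guideline described in the three sections above can be sketched as a simple lookup. This is only an illustrative sketch: the `StudyAim` labels and the `suggest_guideline` helper are hypothetical names introduced here, not terms or tools from the guidelines themselves, and a real study may need more than one guideline.

```python
from enum import Enum


class StudyAim(Enum):
    """Hypothetical labels for the study aims discussed in the text."""

    HEALTH_ADVICE = "summarizing clinical evidence or providing health advice"
    MODEL_TASKS = "LLM development, documentation, or outcome prediction"
    MANUSCRIPT_WRITING = "GAI-assisted manuscript writing"


# Pairings taken from the text; the dictionary itself is an illustration,
# not an official decision tool published by any guideline group.
GUIDELINE_BY_AIM = {
    StudyAim.HEALTH_ADVICE: "CHART",
    StudyAim.MODEL_TASKS: "TRIPOD-LLM",
    StudyAim.MANUSCRIPT_WRITING: "GAMER",
}


def suggest_guideline(aim: StudyAim) -> str:
    """Return the reporting guideline the text associates with a study aim."""
    return GUIDELINE_BY_AIM[aim]


if __name__ == "__main__":
    for aim in StudyAim:
        print(f"{aim.value}: {suggest_guideline(aim)}")
```

A lookup like this is only a starting point; authors should still check each guideline's stated scope and Figure 1 before settling on a checklist.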

Strengths and limitations of current reporting guidelines

All reporting guidelines listed above follow methodological guidance from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network, an international effort to improve transparency in health research.^{10, 11} Although these reporting guidelines currently apply to LLMs, CHART and TRIPOD-LLM are designed as living documents and will be updated regularly to keep pace with advances in the field.^{5, 6, 8} Authors applying traditional study designs such as randomized controlled trials or cohort studies should adhere to the guidelines described here and continue to follow related tools such as the CONsolidated Standards Of Reporting Trials (CONSORT) and STrengthening the Reporting of OBservational Studies in Epidemiology (STROBE) reporting guidelines.^5,12

One strength of the CHART reporting guideline is the input it received from a broad, interdisciplinary group of stakeholders, with 531 participants in the Delphi consensus process. Its applicability to CHA studies is high, but its scope is narrow. In contrast, TRIPOD-LLM applies to a large number of use cases where LLMs are relevant, but the applicability of each checklist item may depend on the specific use case. Although the GAMER checklist is concise and particularly relevant to medical research, it may be missing important items included in other reporting guidelines.

Reporting guidelines under development

Several reporting guidelines are in development, including the ChatGPT, Generative Artificial Intelligence and Natural Large Language Models for Accountable Reporting and Use (CANGARU) reporting guidelines.¹³ CANGARU is being developed according to robust methodological standards, including living systematic reviews, Delphi consensus, and panel consensus meetings among international multidisciplinary stakeholders.¹⁴ Once published, the CANGARU guidelines may be of interest to researchers who use LLMs in academic research and scientific writing. They apply not only to medical research but also to research in other scientific fields that uses LLMs for manuscript writing.¹⁴

In health economics, researchers have proposed the ELEVATE-GenAI framework, which includes 10 preliminary checklist items, following a targeted literature review, iterative discussion, and usability testing for both systematic reviews and health economic modeling.¹⁵ It currently consists of a structured framework and a checklist for practical implementation, with a scoring system that awards up to 3 points per domain. The authors plan to consult stakeholders across disciplines through Delphi consensus to improve the effectiveness of the tool.¹⁵

Separately, the COnsolidated criteria for REporting Qualitative research (COREQ) extension for LLMs (COREQ-LLM) addresses studies that use LLMs in qualitative research.¹⁶ COREQ-LLM is being developed through a systematic scoping review and Delphi consensus to identify checklist items that support transparent reporting of qualitative research involving LLMs. This reporting guideline is expected to address current trends in qualitative research in which LLMs are used to support study design, data processing, analysis, interpretation, and direct interaction with qualitative data.¹⁶

These guidelines represent the first iteration of reporting standards for GAI research in healthcare. They address not only the development of GAI models but also their use for manuscript writing, summarizing clinical evidence, providing health advice, and predicting health outcomes from electronic health records. Clinicians, researchers, journal editors, and publishers should be aware of these reporting guidelines and apply them to any study evaluating the use of GAI models for health purposes. Future iterations, enhancements, and new reporting guidelines are expected as the field continues to change. As we work toward integrating GAI technology into medicine safely and responsibly, researchers must stay informed of the latest literature and continue to apply the most appropriate reporting standards to their research. Journal editors and publishers should likewise keep up with developments in the GAI field and continue to encourage authors to adhere to relevant reporting standards. We are conducting a systematic review of GAI-oriented reporting guidelines to keep readers up to date with the rapidly evolving GAI literature.
