Human researchers still outperform AI when it comes to writing reliable systematic reviews.



Despite rapid advances in large language models, the study shows that AI is best suited as a supervised support tool rather than an independent author, and that human expertise remains essential for producing rigorous systematic reviews.

Study: Human researchers outperform large language models in writing medical systematic reviews in comparative multitask evaluation. Image credit: Summit Art Creations / Shutterstock.com

A recent study published in the journal Scientific Reports shows that human researchers perform better than large language models (LLMs) when preparing systematic literature reviews.

What is an LLM?

An LLM is an advanced artificial intelligence (AI) system that uses deep learning techniques to analyze vast amounts of input data and generate human-like language. Since the introduction of OpenAI's ChatGPT in 2022, LLMs have attracted considerable attention for their ability to perform a wide range of everyday tasks, including text generation, language translation, and email composition.

LLMs can both interpret and generate text, making them increasingly important tools in medicine, education, and research. Indeed, several studies have demonstrated that LLMs such as GPT-4 and BERT can perform a wide range of medical tasks, including annotating ribonucleic acid (RNA) sequence data, summarizing content, and creating medical reports.

In scientific research, LLMs have been used to screen and summarize literature, analyze data, and write reports. Despite their immense potential to accelerate the scientific process, the responsible integration of LLMs into healthcare, education, and research requires a comprehensive analysis of potential challenges, such as ensuring data consistency, mitigating bias, and maintaining transparency in their application.

Research design

To elucidate the risks and benefits of incorporating LLMs into major scientific disciplines, the current study investigated whether LLMs can outperform human researchers in conducting systematic literature reviews. To this end, six different LLMs were tasked with the literature search, article screening and selection, data extraction and analysis, and drafting of the final systematic review.

All results were compared with an original systematic review written by human researchers on the same topic. The process was repeated twice to assess changes between model versions and improvements in LLM performance over time.
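To make the setup concrete, the sketch below shows what a single LLM-driven screening step could look like in code. This is a minimal illustration, not the authors' actual pipeline: the OpenAI client is used only as an example interface, and the model name, review question, and prompt wording are all hypothetical.

```python
# Illustrative sketch of LLM-based article screening; NOT the study's code.
# Assumes the OpenAI Python client; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

REVIEW_QUESTION = "Does intervention X improve outcome Y?"  # hypothetical topic

def screen_abstract(title: str, abstract: str) -> bool:
    """Ask the model whether a paper meets the review's inclusion criteria."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You screen papers for a systematic review. "
                        "Answer strictly INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Review question: {REVIEW_QUESTION}\n"
                        f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("INCLUDE")
```

In the study itself, the output of each task was scored against the human-authored reference review rather than judged in isolation.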

Key findings and significance

On the first task, which involved searching for and selecting literature, Gemini performed best, retrieving 13 of the 18 papers included in the original human-authored systematic review. Nevertheless, significant limitations were observed in the LLMs' ability to perform key tasks such as literature searching, data summarization, and drafting of the final manuscript.

These limitations likely reflect the fact that many LLMs do not have access to electronic databases of scientific papers. Furthermore, the training datasets used for these models may contain relatively few original research papers, further reducing accuracy.

Despite their unsatisfactory overall performance on the first task, the LLMs identified some relevant papers more quickly than the human researchers. Their time efficiency could therefore be exploited for initial literature screening, run in parallel with the standard database and reference cross-searches performed by human researchers.
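For readers who want the screening result above as a single figure, it amounts to recall against the human-authored reference review; the following sketch simply reproduces that arithmetic from the numbers reported in the article.

```python
# Screening recall for the best-performing model (figures from the article).
papers_in_reference_review = 18  # papers included by the human researchers
papers_also_found_by_llm = 13    # of those, papers Gemini also selected

recall = papers_also_found_by_llm / papers_in_reference_review
print(f"Screening recall: {recall:.0%}")  # prints "Screening recall: 72%"
```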

On the second task, data extraction and analysis, DeepSeek performed best, with 93% of entries correct overall and seven of the 18 original articles extracted completely correctly. Three of the LLMs showed satisfactory performance on this task, although they required slow and complex prompting and multiple uploads to obtain results, suggesting lower time efficiency compared with human work.
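Note that the two extraction figures above measure different things: entry-level accuracy versus articles with every entry correct. A short sketch with the article's numbers makes the distinction explicit.

```python
# Two ways of scoring data extraction (DeepSeek figures from the article).
total_articles = 18
fully_correct_articles = 7     # articles where every extracted entry was correct
entry_level_accuracy = 0.93    # share of individual entries that were correct

print(f"Entry-level accuracy: {entry_level_accuracy:.0%}")
print(f"Fully correct articles: {fully_correct_articles}/{total_articles} "
      f"({fully_correct_articles / total_articles:.0%})")
```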

On the third task, drafting the final manuscript, none of the tested LLMs achieved satisfactory performance. Specifically, the LLMs produced short and uninspired full-text articles that did not fully adhere to the standard template for systematic reviews.

Although the tested LLMs produced articles that were well structured and used correct scientific language, their output could mislead non-expert readers. Because systematic reviews and meta-analyses are considered the gold standard of evidence-based medicine, critical appraisal of the published literature by human experts remains essential to effectively guide clinical practice.

Conclusions

Modern LLMs cannot yet produce medical systematic reviews without a dedicated prompt engineering strategy. Nevertheless, the improvements observed in the LLMs between the two evaluation rounds indicate that, with appropriate supervision, they can provide valuable support to researchers in certain aspects of the review process. In this context, recent evidence suggests that guided prompting strategies, such as knowledge-based prompting, can improve LLM performance on some review tasks.
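As an illustration of what a knowledge-based prompt might look like, the sketch below embeds explicit methodological rules into the instruction before stating the task. The wording, rules, and placeholder studies are assumptions for demonstration, not the prompting scheme evaluated in the cited work.

```python
# Hypothetical knowledge-based prompt: domain knowledge (reporting standards,
# citation rules) is written into the prompt before the task itself.
GUIDED_PROMPT = """You are assisting with a medical systematic review.
Follow these rules, in order:
1. Structure output according to PRISMA 2020 reporting guidance.
2. Cite only from the provided list of included studies; never invent references.
3. Flag any value you are unsure of with [VERIFY].

Included studies:
{study_list}

Task: {task}
"""

prompt = GUIDED_PROMPT.format(
    study_list="- Smith 2021 (RCT, n=240)\n- Lee 2023 (cohort, n=1102)",  # placeholders
    task="Draft the Methods section describing the search and screening strategy.",
)
print(prompt)
```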

The current study used a single medical systematic review as the reference for comparison, which may limit the generalizability of these findings to other scientific fields. Future studies evaluating multiple systematic reviews across diverse biomedical and non-biomedical areas are therefore needed to improve robustness and external validity.

Journal reference:

  • Solini, M., Pini, C., Lazar, A., et al. (2025). Human researchers outperform large language models in writing medical systematic reviews in comparative multitask evaluation. Scientific Reports. doi:10.1038/s41598-025-28993-5. https://www.nature.com/articles/s41598-025-28993-5


