Study shows ChatGPT falls short on accuracy and reliability in scientific abstracts

In a recent study published in npj Digital Medicine, researchers compared abstracts of studies published in high-impact medical journals with abstracts generated by ChatGPT, a large language model (LLM), to evaluate the accuracy and reliability of using such models for scientific writing.

Study: Comparing scientific abstracts generated by ChatGPT to real abstracts using detectors and blinded human reviewers. Image credit: NicoElNino / Shutterstock.com

Background

OpenAI’s recent release of ChatGPT has received a lot of attention due to controversy surrounding its usefulness and its use in academia. While many users have had positive experiences with ChatGPT, others have expressed concern about its increasing use and the decline of traditional writing methods.

ChatGPT is a large language model: a neural network trained on massive amounts of text to generate natural, readable content. It is built on Generative Pre-trained Transformer-3 (GPT-3), which was trained with 175 billion parameters and can produce coherent, fluent text that is difficult to distinguish from human-written content.

ChatGPT is a free, openly accessible platform that is being widely used to create scholarly content across all areas of science, including biomedical research. Given the long-lasting impact of biomedical research on many aspects of human health and medicine, it is essential to assess the accuracy and reliability of content created with ChatGPT.

About the study

In the current study, the researchers obtained 50 abstracts from five high-impact medical journals and used them as controls. The five journals from which abstracts and titles were obtained were Nature Medicine, The Lancet, BMJ, JAMA, and NEJM.

For the test group, ChatGPT was used to generate 50 abstracts based on the selected journals and titles. To this end, the researchers prompted ChatGPT to produce an abstract for a study with the given title in the style of the given journal.

The two sets of abstracts were compared using an AI output detector, the GPT-2 Output Detector, which assigns high scores to text that appears to have been generated by an AI language tool. Free and paid plagiarism-checking tools were also used to measure plagiarism rates in the ChatGPT-generated and original abstracts.

Blinded human reviewers were also asked to assess whether they could distinguish the actual abstracts from those produced by ChatGPT. Each reviewer was given 25 abstracts, a mix of original and ChatGPT-generated ones, and asked to score each abstract as generated or original. The ChatGPT-generated abstracts were also evaluated for compliance with the journals’ formatting guidelines.

ChatGPT-generated abstracts are unreliable

Abstracts generated through ChatGPT were flagged as likely AI-generated, with a median detector score of 99.89%. By comparison, the median score for the original abstracts was 0.02%, indicating that these abstracts were unlikely to have been generated with AI language tools.
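The detector assigns each abstract a score reflecting how likely it is to be machine-generated, and the study summarizes each group by its median score. A minimal sketch of that comparison step, using hypothetical detector scores chosen only for illustration (the individual values are not from the paper; the medians match the reported figures):

```python
from statistics import median

# Hypothetical detector scores (percent probability the text is AI-generated).
# These values are invented for illustration; the study reports a median of
# 99.89% for ChatGPT-generated abstracts versus 0.02% for the originals.
generated_scores = [99.95, 99.89, 99.70, 99.92, 99.80]
original_scores = [0.01, 0.02, 0.05, 0.02, 0.03]

print(median(generated_scores))  # -> 99.89
print(median(original_scores))   # -> 0.02
```

A higher median for one group indicates the detector separates the two sets well, even when individual scores overlap.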

However, the plagiarism tools reported a higher percentage-match score for the original abstracts: the AI-generated abstracts had a median plagiarism similarity score of 27, compared with a score of 100 for the original abstracts.

Blinded human reviewers correctly identified approximately 68% of the ChatGPT-generated abstracts as generated. Of the original abstracts, 86% were correctly identified as original by the human reviewers.

Human reviewers misidentified approximately 32% of the AI-generated abstracts as originals. By contrast, the GPT-2 Output Detector assigned similarly high scores to all ChatGPT-generated abstracts.

About 14% of the original abstracts were misidentified as ChatGPT-generated, indicating that human reviewers have difficulty distinguishing original scientific literature from AI-generated abstracts. In addition, for abstracts they correctly identified as ChatGPT-generated, reviewers relied heavily on cues such as alternate spellings of some words and the clinical trial registration numbers, and commented that the generated abstracts felt vague and superficial.

Conclusion

While the AI output detection tool successfully identified ChatGPT-generated abstracts and distinguished them from the originals, human reviewers often had difficulty telling the two apart. These findings demonstrate the usefulness of AI output detectors in helping journals and scientific publishers maintain scientific standards in their publications.

Journal reference:

  • Gao, C. A., Howard, F. M., Markov, N. S., et al. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. npj Digital Medicine 6(75). doi:10.1038/s41746-023-00819-6.
