Artificial intelligence has changed the way scholars write, think and share knowledge. In late 2022, OpenAI released ChatGPT, and Google's Bard (now known as Gemini) soon followed. Within a few months, these large language models (LLMs) became everyday tools. People used them to brainstorm ideas, edit drafts, clean data, and write full paragraphs of academic papers.
Many researchers have embraced this technology. For non-native English speakers, LLMs provided a lifeline. English dominates academia. Journals require polished writing, and authors often have to pay for costly editing services. LLMs became a cheaper and faster alternative, helping scholars improve clarity and style while saving money.
However, this rapid adoption raised ethical issues. Some authors copied and pasted AI text. Others listed AI tools as co-authors, sparking a heated debate about responsibility and originality. Ultimately, journals agreed that an LLM cannot be an author, but can assist with language editing when used transparently.
Despite this clarity, not everyone discloses the use of AI. Some people think disclosure is unnecessary for grammar editing. Others fear that admitting AI involvement will cast doubt on their work or make it seem unoriginal.
Issues with AI detection tools
As AI-generated writing spread, detection tools emerged to catch undisclosed use. Schools, publishers and reviewers wanted to ensure academic integrity. Tools like GPTZero, ZeroGPT, and DetectGPT claim to spot AI-written texts with high accuracy.
However, new research published in PeerJ Computer Science reveals the dark side of these tools. The study, entitled “Accuracy Bias Tradeoffs in AI Text Detection Tools and Impact on Academic Publications' Equity,” found that these tools often misclassify human writing, especially when it has been enhanced with AI.
The researchers found that high accuracy does not imply fairness. Ironically, the tools with the highest overall accuracy showed the greatest bias against particular groups. Non-native English speakers were hit hardest. Their abstracts were flagged as AI-written more frequently, even when they were original or only lightly edited.
“This research highlights the limitations of a detection-focused approach and encourages a shift towards the ethical, responsible and transparent use of LLM in academic publications,” the researchers said.
Inside the research
The team wanted to answer three questions:
- How accurate are AI detection tools on human, AI-generated, and AI-assisted text?
- Is there a trade-off between accuracy and fairness?
- Do certain groups face disadvantages?
They tested popular tools on abstracts of peer-reviewed articles. The dataset included 72 abstracts from three areas: technology and engineering, social sciences, and interdisciplinary research. The authors came from English-speaking countries such as the US, the UK, and Australia, and from countries where English is neither an official language nor widely spoken.
The researchers used ChatGPT o1 and Gemini 2.0 Pro Experimental to generate AI versions of these abstracts. They also created AI-assisted versions by running the original abstracts through these models to improve readability without changing the meaning.
Important findings
In the first test, the researchers compared human-written abstracts with AI-generated abstracts. Here, the detection tools worked best because the difference was clearer. The metrics included:
- Accuracy: How often the tools classified texts correctly.
- False positive rate: How often human abstracts were mistakenly labeled as AI.
- False negative rate: How often AI text went undetected.
- False accusation rate: Percentage of human abstracts that received a false positive.
- Majority false accusation rate: Percentage of human abstracts that received more false positives than correct classifications.
Even with this clear test, non-native speakers faced a higher rate of false accusations.
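The first three metrics in the list above follow from a standard confusion matrix, and the group comparison is simply the false positive rate computed separately per author group. The sketch below is illustrative only: the function names, the group labels and the toy records are hypothetical and are not taken from the study, and the two false accusation rates are left out because the article does not spell out how they are calculated.

```python
from collections import defaultdict


def detection_metrics(records):
    """Compute accuracy, false positive rate and false negative rate.

    records: iterable of (true_label, predicted_label) pairs,
    where each label is either "human" or "ai".
    """
    tp = fp = tn = fn = 0
    for truth, pred in records:
        if truth == "ai" and pred == "ai":
            tp += 1  # AI text correctly flagged
        elif truth == "human" and pred == "ai":
            fp += 1  # human text wrongly flagged as AI
        elif truth == "human" and pred == "human":
            tn += 1  # human text correctly passed
        else:
            fn += 1  # AI text that went undetected
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


def false_positive_rate_by_group(records):
    """False positive rate per author group, over human-written texts only.

    records: iterable of (group, true_label, predicted_label) triples.
    """
    flagged = defaultdict(int)
    humans = defaultdict(int)
    for group, truth, pred in records:
        if truth == "human":
            humans[group] += 1
            if pred == "ai":
                flagged[group] += 1
    return {group: flagged[group] / humans[group] for group in humans}


# Made-up toy records, NOT data from the study.
toy = [
    ("native", "human", "human"),
    ("native", "human", "human"),
    ("native", "ai", "ai"),
    ("native", "ai", "ai"),
    ("non_native", "human", "ai"),
    ("non_native", "human", "human"),
    ("non_native", "ai", "ai"),
    ("non_native", "ai", "human"),
]

overall = detection_metrics((truth, pred) for _, truth, pred in toy)
by_group = false_positive_rate_by_group(toy)
print(overall)
print(by_group)
```

With these toy records, overall accuracy is 0.75, yet the false positive rate is 0.0 for one group and 0.5 for the other. A single headline accuracy figure can hide that kind of gap, which is why the per-group breakdown matters.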
In the second test, the researchers looked at AI-assisted texts, where human writing was enhanced by AI. This hybrid text is common in real life, but it poses challenges for detectors. The metrics included:
- Summary Statistics: Distribution of AI detection scores.
- Under-detection rate (UDR): How often AI-assisted text was labeled as purely human.
- Over-detection rate (ODR): How often AI-assisted text was flagged as fully AI-written.
The detection tools struggled here. Many AI-assisted texts were labeled 100% AI-generated, ignoring the human effort behind them. This creates real risks for scholars who use AI responsibly.
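As a rough illustration of the second-test metrics, the sketch below assumes each detector returns an "AI score" between 0 and 100 for every AI-assisted text. The cut-offs (a score of 0 counts as purely human, a score of 100 as fully AI-written), the function name and the toy scores are all assumptions made for this example. The article does not state the thresholds the study actually used.

```python
from statistics import mean, median


def assisted_text_rates(scores, human_max=0.0, ai_min=100.0):
    """Summarize detector scores for AI-assisted texts.

    scores: AI scores from 0 to 100, one per AI-assisted text.
    human_max: scores at or below this count as "purely human" (assumed cut-off).
    ai_min: scores at or above this count as "fully AI-written" (assumed cut-off).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        # Under-detection: AI-assisted text treated as purely human.
        "udr": sum(1 for s in scores if s <= human_max) / n,
        # Over-detection: AI-assisted text treated as fully AI-written.
        "odr": sum(1 for s in scores if s >= ai_min) / n,
    }


# Made-up toy scores, NOT data from the study.
toy_scores = [0.0, 12.0, 55.0, 100.0, 100.0, 0.0, 87.0, 100.0]
rates = assisted_text_rates(toy_scores)
print(rates)
```

For these toy scores the under-detection rate is 0.25 and the over-detection rate is 0.375. Neither extreme label describes a text that a person wrote and an AI tool then polished, which is the gap the study points to.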
Impact on non-native authors
Historically, non-native English speakers have faced barriers to academic publication. Professional editing costs are high. LLMs help fill this gap, offering instant language improvements at minimal cost.
However, if journals use AI detectors to police writing, these same authors could be unfairly targeted. Their improved writing style, helped by AI, can appear “too perfect” and trigger false positives. This could mean more rejections, accusations of dishonesty, and damage to their careers.
Some academic disciplines also face particular risks. The humanities and social sciences use subtle, interpretive language. AI models and detection tools trained on simpler data may misread such texts, reinforcing bias against specific fields.
Additionally, LLMs tend to replicate patterns in their training data. This risks amplifying existing inequalities by promoting uniform language and ideas while silencing diverse voices.
Beyond detection: Calling for change
The study emphasizes that detection tools alone cannot resolve the ethical issues around AI in writing. The tools work as black boxes. They do not explain why they classify a text as AI or human. This lack of transparency makes it difficult to challenge their decisions.
Furthermore, the line between human and AI writing is blurred. Some researchers write their own drafts, use AI for editing, then revise manually. Others co-write entire sections with AI input. Detection tools struggle to assess these real-world practices accurately.
The team is urging journals, universities and policymakers to rethink their dependence on AI detectors. Ethical guidelines should promote honest disclosure, recognizing the benefits of AI, particularly for non-native speakers. A blanket ban or strict detection policy can do more harm than good.
Moving forward
AI tools continue to evolve. This study used state-of-the-art models available in late 2024, but new versions will emerge. Detection tools will need to adapt, and fairness must remain central.
The authors call for more research into biases in AI detection and how they affect underrepresented groups. They also recommend creating standards for responsible AI use in academia that balance integrity and fairness.
For now, it's clear that AI detection is not a magical solution. It's another tool with strengths and flaws. Human judgment, transparency, and inclusiveness are just as important as technology to build a fair academic system.
