A flood of AI peer reviews is making it difficult to judge science, editors say

The pile-up started quietly and then turned into something that journal editors could no longer ignore.

Organization Science, one of the leading journals in management research, has seen a 42% increase in submissions since ChatGPT’s introduction in late 2022. On its own, that might have looked like a surge in productivity. But the journal’s editors say the additional papers were not, on average, better papers. Many were hard to read, dense with jargon, and unlikely to survive review.

A new analysis by the journal’s AI Task Force puts numbers to what editors have been observing across the peer review pipeline. Drawing on 6,957 initial submissions and 10,389 text-based reviews processed from January 2021 to February 2026, the team found that rising AI use was associated with weaker writing, higher rejection rates, and growing strain on the unpaid academics who keep the system running.

The numbers point to a problem bigger than clumsy phrasing: the same tools that can speed up writing also collide with an academic culture that often rewards output over attention.

Monthly submissions to Organization Science from January 2013 to the end of 2025. (Credit: Organization Science)

“We were not trying to make a point,” said Claudine Gartenberg, the task force’s senior editor and a Wharton professor. “I just said, let’s add some facts to this sentiment.”

She is not arguing from the sidelines. “I use Claude Code and Codex all day long,” she said. “Every aspect of my research program over the last year.”

As submissions rise, the prose gets worse

The editorial team used Pangram, an AI detection tool, to score submissions and reviews on a continuous scale from 0 to 1. Rather than trying to definitively classify any particular paper as human- or machine-written, the analysis looked for broad shifts across thousands of texts.

The share of submissions judged to contain little or no AI-generated text declined, while AI-assisted and AI-heavy submissions grew. By February 2026, a majority of papers submitted to the journal showed at least some AI involvement. The fastest-growing segment was also the most machine-intensive: manuscripts with an AI score above 70%.
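As a rough illustration of how such a continuous score might translate into categories, consider the following minimal Python sketch. The 15% and 70% cutoffs are the bands the analysis mentions elsewhere; the category labels and the function itself are hypothetical, and Pangram’s actual API is not shown.

```python
# Illustrative only: bucketing a continuous 0-1 detector score into the
# usage bands described in the article. The 15% and 70% cutoffs come from
# the article; the labels and this function are hypothetical.
def usage_category(score: float) -> str:
    if score < 0.15:
        return "little or no AI"
    if score <= 0.70:
        return "AI-assisted"
    return "AI-heavy"

for s in (0.02, 0.40, 0.88):
    print(f"{s:.2f} -> {usage_category(s)}")
```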

The quality of the writing moved in the opposite direction. Flesch Reading Ease, a standard measure of readability, has fallen sharply since late 2022. The journal reports that by January 2026, the readability of submitted manuscripts was 1.28 standard deviations below its January 2021 level.

Across the journal’s data, the higher the AI score, the lower the readability. The correlation between AI scores and Flesch Reading Ease was negative (Spearman’s rho = -0.4, p < .001). AI-heavy texts also tended to demand higher reading levels, use more technical terminology, and lean heavily on nominalization, the habit of turning verbs into abstract nouns that can render simple actions as bureaucratic fog.
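To make the relationship concrete, here is a small Python sketch of this kind of check, using the textstat package for Flesch Reading Ease and SciPy for Spearman’s rho. The texts and scores below are invented for illustration; the journal’s actual pipeline and data are not reproduced.

```python
# A minimal sketch of the readability-vs-AI-score comparison the task
# force describes. The (score, text) pairs are placeholder data.
from scipy.stats import spearmanr
import textstat

manuscripts = [
    (0.05, "We asked managers what they did. Most said they improvised."),
    (0.40, "The operationalization of improvisational praxis necessitates "
           "contextual recalibration of managerial sensemaking."),
    (0.85, "The institutionalization of algorithmic facilitation engenders "
           "the problematization of organizational knowledge codification."),
]

ai_scores = [score for score, _ in manuscripts]
readability = [textstat.flesch_reading_ease(text) for _, text in manuscripts]

# With thousands of texts, the journal reports rho of about -0.4 (p < .001);
# on three toy examples the p-value is meaningless.
rho, p_value = spearmanr(ai_scores, readability)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```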

Monthly submission volume by AI-usage category over time. (Credit: Organization Science)

Not every indicator worsened. The analysis found that AI-heavy prose hedged less, used less passive voice, and was more specific. But the net effect was still prose that felt denser and harder to read through.

Gartenberg compared it to the language criticized in George Orwell’s famous essay on political writing: bloated, abstract, and strangely slippery.

Incentive issues behind the surge

The editors argue that AI alone is not the culprit.

Their main argument is that generative tools amplify incentives already built into academic life, particularly the pressure to produce large numbers of papers. At business schools, one of the strongest symbols of that pressure is the UT-Dallas Journal Ranking List, which tracks faculty publications across 24 designated journals.

The researchers investigated whether the schools that historically responded most strongly to the ranking system changed their behavior the most after ChatGPT’s debut. They did.

Schools classified as stronger “UTD responders” increased their submissions after ChatGPT, and the increase was concentrated in papers with AI writing scores above 15%. Even after excluding schools in mainland China and Hong Kong in one version of the analysis, the pattern pointed in the same direction.

This matters because it suggests that heavy AI use is neither random nor evenly spread across the field; it appears tied to institutional reward systems.

“AI as used today conflicts with institutional incentives to produce more research, not better research,” Gartenberg said. “This is not just AI; it’s AI plus publish-or-perish incentives.”

Long-term trends in AI usage categories. (Credit: Organization Science)

The journal also found that using AI this way does not seem to do authors much good. Papers with heavy AI-generated text were more likely to be rejected at the desk and more likely to be rejected after review. The break point looked especially sharp once a manuscript exceeded roughly 30% AI usage.

After ChatGPT’s launch, 11.9% of papers in the 0%-15% AI category received a revise-and-resubmit decision. For papers in the 70%-and-above category, that figure was 3.2%.

The same trend appears in peer review

Submissions are only half the story. The journal found the same pattern creeping into peer review itself.

More than 30% of text-based reviews now show detectable AI use; before ChatGPT, the figure was close to zero. And, like manuscripts, reviews became harder to read as AI scores rose: technical terms grew more frequent, noun density increased, and readability declined.

The content of the review has also changed.

Using word-frequency measures, the editors found that AI-heavy reviews focused more on theory than on data. In their regression analysis, AI scores were positively correlated with emphasis on theory and negatively correlated with emphasis on data. The team also used principal components analysis to show that AI-written reviews covered a narrower evaluative range than human reviews.
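The task force’s code is not public, but a toy version of that PCA comparison might look like the sketch below: vectorize review texts by term frequency, project them onto principal components, and compare how widely each group spreads. The example reviews are invented.

```python
# Hypothetical sketch of the breadth comparison, not the task force's code:
# term-frequency vectors -> principal components -> per-group spread.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

human_reviews = [
    "The data collection is thorough but the regression omits firm size.",
    "Theory is thin; however, the survey instrument is carefully validated.",
]
ai_heavy_reviews = [
    "The theoretical contribution could be further strengthened and clarified.",
    "The theoretical framing could be enhanced to strengthen the contribution.",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(human_reviews + ai_heavy_reviews).toarray()

components = PCA(n_components=2).fit_transform(X)
human, ai = components[:2], components[2:]

# Wider spread along the leading components = broader range of judgments.
print("human spread:", human.std(axis=0))
print("AI-heavy spread:", ai.std(axis=0))
```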

That narrowness matters. Reviews that lean heavily toward abstract theory and away from methods and evidence give editors and authors less insight into what a paper actually gets right or wrong.

AI scores for each section of the sampled manuscripts, stratified by overall AI score into low (below 15%) and high (above 70%) categories. (Credit: Organization Science)

Most striking of all, AI-heavy reviews did not seem to inform editorial outcomes. Human-written reviews correlated with final decisions; AI-heavy reviews did not.

“It’s not like editors know those are AI reviews and throw them out,” Gartenberg said. “They’re reading them, but they’re not informing the editor’s final recommendation.”

This means that editors do much of the reviewing themselves, which protects the journal’s standards but puts a strain on a system built on volunteer work.

Humans still draw the line

For now, gatekeeping is still working.

The journal found that the overwhelming majority of published papers are still written by humans, at least judging by detectable signals in their abstracts. Manuscripts written with heavy AI use rarely make it through the funnel; editors catch most of the weak work before it reaches print.

But it takes a person to catch it.

To handle the increased workload, Organization Science has expanded its roster of associate editors from six to 11. The number of active senior editors grew from about 30 in the first term to about 60 in the second. One associate editor now handles more than 250 manuscripts a year.

The report’s conclusion is not that AI has no place in science. In fact, the authors argue the opposite. They used AI themselves while preparing the editorial, for coding, outlining, phrasing, and comparing the essay to previous work. They note that even with such support, the editorial scored 8.8% on Pangram, still within the range associated with human writing.

Scatterplot of AI usage in the abstract (x-axis) against average AI usage in the manuscript body (introduction, theory, methods, results, discussion, and conclusion). (Credit: Organization Science)

They argue that the problem is not the tools themselves. It is what happens when researchers offload too much of the thinking and writing process.

“You think as you write, so if you don’t write, you’re not thinking deeply about it,” Gartenberg said.

The study also has limitations. Although Pangram is treated here as a strong detector, the authors stress that no detection system is fully reliable for classifying individual texts; their claims apply to aggregate patterns, not single manuscripts or reviews. The analysis also covers one field and one outlet, Organization Science, though the authors suspect similar patterns are more widespread. And much of the writing in the dataset likely came from older models such as GPT-3.5 and GPT-4, whose prose quirks were clunkier and easier to spot than those of newer systems.

So this is not a final verdict on AI in research. It is a snapshot of a fast-changing moment.

Where technology could help next

The irony is that the same technologies that currently bloat pipelines could eventually help manage them.

The report argues that the real bottleneck in publishing is no longer producing papers but evaluating them. Journals struggle to find reviewers, and editors are drowning in submissions. In that environment, AI may serve better as a screening and triage tool than as a ghostwriter.

The authors raise several possibilities, sketched below. Journals could use AI to flag hard-to-read prose, high jargon density, or weak alignment between argument and method before editors invest much time in a manuscript. That could steer reviewers toward neglected questions about data and evidence rather than replace their judgment, acting as scaffolding for expertise rather than a substitute for it.
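What might that triage look like in practice? Here is a deliberately simple Python sketch, assuming the textstat package for readability and a hand-picked jargon list. The thresholds, the list, and the triage_flags function are all hypothetical, not anything the journal has deployed.

```python
# A sketch of the triage idea under stated assumptions: flag manuscripts
# whose prose is hard to read or jargon-dense before an editor invests
# time. Thresholds and the jargon list are illustrative, not journal policy.
import textstat

JARGON = {
    "operationalization", "problematization", "contextualization",
    "instantiation", "recalibration",
}

def triage_flags(text: str) -> list[str]:
    flags = []
    words = [w.strip(".,;:") for w in text.lower().split()]
    if textstat.flesch_reading_ease(text) < 30:  # "very difficult" on the Flesch scale
        flags.append("low readability")
    if words and sum(w in JARGON for w in words) / len(words) > 0.05:
        flags.append("high jargon density")
    return flags

sample = ("The operationalization of the construct necessitates the "
          "problematization of extant measurement instantiation.")
print(triage_flags(sample))  # -> ['low readability', 'high jargon density']
```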

The team stops short of calling for automated gatekeeping. The report warns that disclosure rules or outright bans will not fix the deeper problem: institutions that reward paper counts and journal placement rather than sustained intellectual contribution.

It is here that a long battle over tenure decisions, hiring criteria, journal lists, and the very culture of academic productivity could begin.

Practical implications of the research

The most immediate lesson is not that AI should be kept out of science. It is that science may need to be clearer about what it wants AI to do.

For journals, the findings point to a practical need for better triage, better support for peer reviewers, and policies that curb low-quality, high-volume submissions before they consume scarce editorial attention.

For universities, the study raises a harder question: are incentives built around publication counts and journal lists actively encouraging lower-value output? For researchers, the message is straightforward: using AI to save time can backfire if it replaces the thinking that strong writing reflects.

The journal’s own data suggests that heavy AI writing does not improve a paper’s chances and may well hurt them. If AI is to enhance research rather than overwhelm it, the tools will need to be aimed at better research, not just more of it.





