A large-scale randomized study of large language model feedback in peer review

Machine Learning


  • Alberts, B., Hanson, B. & Kelner, K. L. Reviewing peer review. Science 321, 15 (2008).

  • Kelly, J., Sadeghieh, T. & Adeli, K. Peer review in scientific publications: Benefits, critiques, & a survival guide. EJIFCC 25, 227 (2014).

  • Publons. Global State of Peer Review 2018 (Clarivate Analytics, 2018).

  • Azad, A. & Banu, A. Artificial intelligence conference publication trends: The rise of hyper-prolific authors. Preprint at https://doi.org/10.48550/arXiv.2412.07793 (2024).

  • McCook, A. Is peer review broken? Submissions are up, reviewers are overtaxed, and authors are lodging complaint after complaint about the process at top-tier journals. What’s wrong with peer review? The Scientist (February 1, 2006).

  • Thakkar, N. et al. Assisting ICLR 2025 reviewers with feedback. ICLR 2025 Review Feedback Blog post by the agent team and program chair. ICLR Blog https://blog.iclr.cc/2024/10/09/iclr2025-assisting-reviewers/ (2024).

  • Rogers, A. & Augenstein, I. What can we do to improve peer review in NLP? In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Cohn, T., He, Y. & Liu, Y.) 1256–1262 (ACL, 2020).

  • Rogers, A., Karpinska, M., Boyd-Graber, J. & Okazaki, N. Program chairs’ report on peer review at ACL 2023. In Proc. 61st Annual Meeting of the Association for Computational Linguistics Vol. 1, xl–lxxv (ACL, 2023).

  • Arns, M. Open access is tiring out peer reviewers. Nature 515, 467 (2014).

  • Cortes, C. & Lawrence, N. D. Inconsistency in conference peer review: Revisiting the 2014 NeurIPS experiment. Preprint at https://doi.org/10.48550/arXiv.2109.09774 (2021).

  • Claude 3.5 Sonnet (Anthropic, 2024).

  • Liang, W. et al. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. NEJM AI 1, AIoa2400196 (2024).

  • Yuksekgonul, M. et al. Optimizing generative AI by backpropagating language model feedback. Nature 639, 609–616 (2025).

  • Madaan, A. et al. Self-Refine: Iterative refinement with self-feedback. Adv. Neural Inf. Process. Syst. 36, 46534–46594 (2023).

  • Hosseini, M. & Horbach, S. P. J. M. Fighting reviewer fatigue or amplifying bias? Considerations and recommendations for use of ChatGPT and other large language models in scholarly peer review. Res. Integr. Peer Rev. 8, 4 (2023).

  • Liang, W. et al. Monitoring AI-modified content at scale: A case study on the impact of ChatGPT on AI conference peer reviews. In Proc. 41st International Conference on Machine Learning 29575–29620 (ICML, 2024).

  • Zhang, Y. et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. Comput. Linguist. 51, 1373–1418 (2025).

  • Zhou, J. et al. Instruction-following evaluation for large language models. Preprint at https://doi.org/10.48550/arXiv.2311.07911 (2023).

  • Liu, R. & Shah, N. B. ReviewerGPT? An exploratory study on using large language models for paper reviewing. Preprint at https://doi.org/10.48550/arXiv.2306.00622 (2023).

  • Biswas, S., Dobaria, D. & Cohen, H. L. ChatGPT and the future of journal reviews: A feasibility study. Yale J. Biol. Med. 96, 415–420 (2023).

  • Liang, W. et al. Mapping the increasing use of LLMs in scientific papers. In Proc. 1st Conference on Language Modeling (COLM, 2024).

  • Shah, N. B. Challenges, experiments, and computational solutions in peer review. Commun. ACM 65, 76–87 (2022).

  • Price, S. & Flach, P. A. Computational support for academic peer review: A perspective from artificial intelligence. Commun. ACM 60, 70–79 (2017).

  • Kankanhalli, A. Peer review in the age of generative AI. J. Assoc. Inf. Syst. 25, 76–84 (2024).

  • Kuznetsov, I. et al. What can natural language processing do for peer review? Preprint at https://doi.org/10.48550/arXiv.2405.06563 (2024).

  • Leung, T. I., de Azevedo Cardoso, T., Mavragani, A. & Eysenbach, G. Best practices for using AI tools as an author, peer reviewer, or editor. J. Med. Internet Res. 25, e51584 (2023).

  • Checco, A., Bracciale, L., Loreti, P., Pinfield, S. & Bianchi, G. AI-assisted peer review. Humanit. Soc. Sci. Commun. 8, 25 (2021).

  • Kousha, K. & Thelwall, M. Artificial intelligence to support publishing and peer review: A summary and review. Learn. Publ. 37, 4–12 (2024).

  • Goldberg, A. et al. Usefulness of LLMs as an author checklist assistant for scientific papers: NeurIPS’24 experiment. Preprint at https://doi.org/10.48550/arXiv.2411.03417 (2024).

  • Su, X., Wambsganss, T., Rietsche, R., Neshaei, S. P. & Käser, T. Reviewriter: AI-generated instructions for peer review writing. In Proc. 18th Workshop on Innovative Use of NLP for Building Educational Applications (ed. Kochmar, E.) 57–71 (ACL, 2023).

  • D’Arcy, M., Hope, T., Birnbaum, L. & Downey, D. MARG: Multi-agent review generation for scientific papers. Preprint at https://doi.org/10.48550/arXiv.2401.04259 (2024).

  • GPT-4 Technical Report (OpenAI, 2024).

  • Goldberg, A. et al. Peer reviews of peer reviews: A randomized controlled trial and other experiments. PLoS ONE 20, e0320444 (2025).

  • Kocak, B., Onur, M. R., Park, S. H., Baltzer, P. & Dietzel, M. Ensuring peer review integrity in the era of large language models: A critical stocktaking of challenges, red flags, and recommendations. Eur. J. Radiol. Artif. Intell. 2, 100018 (2025).

  • Ye, R. et al. Are we there yet? Revealing the risks of utilizing large language models in scholarly peer review. Preprint at https://doi.org/10.48550/arXiv.2412.01708 (2024).

  • Shin, H. et al. Mind the blind spots: A focus-level evaluation framework for LLM reviews. In Proc. Conference on Empirical Methods in Natural Language Processing 35630–35656 (EMNLP, 2025).

  • Luo, M. et al. Benchmark on peer review toxic detection: A challenging task with a new dataset. Preprint at https://doi.org/10.48550/arXiv.2502.01676 (2025).

  • Tamkin, A. et al. Clio: Privacy-preserving insights into real-world AI use. Preprint at https://doi.org/10.48550/arXiv.2412.13678 (2024).

  • Saad-Falcon, J. et al. LMUnit: Fine-grained evaluation with natural language unit tests. In Findings of the Association for Computational Linguistics 3303–3324 (ACL, 2025).

  • Prasad, A., Stengel-Eskin, E., Chen, J. C.-Y., Khan, Z. & Bansal, M. Learning to generate unit tests for automated debugging. Preprint at https://doi.org/10.48550/arXiv.2502.01619 (2025).

  • Charlin, L., Zemel, R. S. & Boutilier, C. A framework for optimizing paper matching. In Proc. 27th Conference on Uncertainty in Artificial Intelligence 1186–95 (AUAI Press, 2011).

  • ICML 2023 Reviewer Tutorial (ICML 2023 Program Committee, 2023).

  • How to be a good reviewer? ICML 2022 Reviewer Tutorial (ICML 2022 Program Chair, 2022).

  • Last minute review advice (ACL PC Chair, 2017).

  • Baldenegro, M. LXCV @ CVPR 2021 Reviewer Mentoring Program: How to Write a Good Review. Presentation at LatinX in Computer Vision (LXCV) Workshop, CVPR 2021 (2021).

  • Rogers, A. ARR reviewer guidelines (Association for Computational Linguistics, 2021).

  • Silbiger, N. J. & Stubler, A. D. Unprofessional peer reviews disproportionately harm underrepresented groups in STEM. PeerJ 7, e8247 (2019).

  • Fenniak, M. et al. The pypdf library. https://pypi.org/project/pypdf/ (2024).

  • Ribeiro, M. T. & Lundberg, S. Testing language models (and prompts) like we test software (Medium, 2023).

  • Thakkar, N. zou-group/review_feedback_agent: First release. Zenodo https://doi.org/10.5281/zenodo.17903957 (2025).


