AI-assisted teams outperform AI-led teams but not human-only teams in assessing research reproducibility in quantitative social science

  • Abel Brodeur
  • David Valenta
  • Alexandru Marcoci
  • Juan P. Aparicio
  • Derek Mikola
  • Bruno Barbarioli
  • Rohan Alexander
  • Lachlan Deer
  • Tom Stafford
  • Lars Vilhuber
  • Gunther Bensch
  • Fabio Motoki
  • Mohamed Abdelhady
  • Yousra Abdelmoula
  • Ghina Abdul Baki
  • Tomás Aguirre
  • Sriraj Aiyer
  • Shumi Akhtar
  • Farida Akhtar
  • Melle R. Albada
  • Micah Altman, David Angenendt, Zahra Arjmandi Lari, Jorge Armando De León Tejada, David Rodriguez Arana, Igor Asanov, Anastasiya-Mariya Noha, Rebecca Ashong, Tobias Auer, Francisco J. Bahamonde-Birke, Bradley J. Baker, Söhnke M. Bartram, Dongqi Bao, Lucija Batinovic, Tommaso Batistoni, Monica Beeder, Louis-Philippe Beland, Carsten Gero Bienz, Christ Billy Aryanto, Cylcia Bolibaugh, Carl Bonander, Ramiro Bravo, Egor Bronnikov, Stephan Bruns, Nino Buliskeria, Sara Caicedo-Silva, Andrea Calef, Juan Sebastian Cano Arias, Gustavo A. Castillo Alvarez, Solomon Caulker, Simonas Cepenas, Arthur Chatton, Zirou Chen, Ngozi Chioma Ewurum, Anda-Bianca Ciocîrlan, Felix J. Clouth, Jason Collins, Nikolai Cook, Cesar Cornejo, João Craveiro, Jonathan Créchet, Jing Cui, Niveditha Chalil Vayalabron, Christian Czymara, Carlos Daniel Bermúdez Jaramillo, Hannes Datta, Lien Denoo, Arshia Dhaliwal, Nency Dhameja, Elodie Djemai, Erwan Dujeancourt, Uǧurcan Dündar, Thibaut Duprey, Yasmine Eissa, Youssef El Fassi, Ismail El Fassi, Keaton Ellis, Ali Elminejad, Mahmoud Elsherif, Aysil Emirmahmutoglu, Giulian Etingin-Frati, Emeka Eze, Jan Fabian Dollbaum, Jan Feld, Andres Felipe Rengifo Jaramillo, Guidon Fenig, Victoria Fernandes, Lenka Fiala, Lukas Fink, Mojtaba Firouzjaeiangalougah, Sara Fish, Jack Fitzgerald, Rachel Forshaw, Alexandre Fortier-Chouinard, Louis Fréget, Joris Frese, Jacopo Gabani, Sebastian Gallegos, Max C. Gamill, Attila Gáspár, Romain Gauriot, Evelina Gavrilova, Diogo Geraldes, Giulio Giacomo Cantone, Grant Gibson, Dirk Goldschmitt, Amélie Gourdon-Kanhukamwe, Andrea Gregor de Varda, Idaliya Grigoryeva, Alexi Gugushvili, Aaron H. A. Fletcher, Florian Habermann, Márton Hablicsek, Joanne Haddad, Jonathan D. Hall, Olle Hammar, Malek Hassouneh, Carina I. Hausladen, Sophie C. F. Hendrikse, Matthew Hepplewhite, Anson T. Y. Ho, Senan Hogan-Hennessy, Elliot Howley, Gaoyang Huang, Héloïse Hulstaert, Zlatomira G. Ilchovska, Paola Jaimes Santamaria, Niklas Jakobsson, Joakim Jansson, Ewa Jarosz, Hossein Jebeli, Yanchen Jiang, Hiba Junaid, Rohan Kalluraya, Sunny Karim, Edmund Kelly, Eva Kimel, Sorravich Kingsuwankul, Valentin Klotzbücher, Daniel Krähmer, Pijus Krūminas, Nicholas Kruus, Essi Kujansuu, Christoph F. Kurz, Stephan Küster, Blake Lee-Whiting, Felix Lewandowski, Tongzhe Li, Ruoxi Li, Dan Liu, Jiacheng Liu, Helix Lo, Katharina Loter, Felipe Macedo Dias, Christopher R. Madan, Nicolas Mäder, Marco Mandas, Cesar Mantilla, Jan Marcus, Diego Marino Fages, Xavier Martin, Ryan McWay, Daniel Medina-Gaspar, Sisi Meng, Lingyu Meng, Simon Merz, Alex P. Miller, Thibault Mirabel, Dibya Deepta Mishra, Sumit Mishra, Belay W. Moges, Morteza Mohandes Mojarrad, Myra Mohnen, Louis-Philippe Morin, Lucija Muehlenbachs, Gastón Mullin, Andreea Musulan, Sara Muzzì, James A. C. Myers, Florian Neubauer, Tuan Nguyen, Ali Niazi, Ardyn Nordstrom, Bartłomiej Nowak, Daneal O’Habib, Tim Ölkers, Justin Ong, Valeria Orozco Castiblanco, Ömer Özak, Ali I. Ozkes, Mikael Paaso, Shubham Pandey, Varvara Papazoglou, Romeo Penheiro, Linh Pham, Ulrike Phieler, Peter Pütz, Quan Qi, Jingyi Qiu, Manuel T. Rein, David A. Reinstein, Juuso Repo, Nicolas Rudolf, Shree Saha, Orkun Saka, Chiara Saponaro, Georg Sator, Martijn Schoenmakers, Raffaello Seri, Meet Shah, Paul Sibille, Christoph Siemroth, Vladimir Skavysh, Ben Slater, Wenting Song, Stefan Staubli, Tobias Steindl, Nomwendé Steven Waongo, Paul Stott, Stephenson Strobel, Roshini Sudhaharan, Pu Sun, Scott D. Swain, Oleksandr Talavera, Hanz M. Tantiangco, Georgy Tarasenko, Boyd Tarlinton, Mariam Tarraf, Ken Teoh, Rémi Thériault, Bethan Thompson, Tonghui Tian, Wenjie Tian, Emmanuel Tolani, Nicolai Borgen, Solveig Topstad Borgen, Javier Torralba, Carolina Velez-Ospina, Man Wai Mak, Lukas Wallrich, Zeyang Wang, Leah Ward, Matthew D. Webb, Duncan Webb, Bryan S. Weber, Christoph Weber, Wei-Chien Weng, Christian Westheide, Tom Wilkinson, Kwong-Yu Wong, Marcin Wroński, Zhuangchen Wu, Qixia Wu, Victor Y. Wu, Bohan Xiao, Feihong Xu, Cong Xu, Pranav Yadav, Yu Yang Chou, Luther Yap, Myra Yazbeck, Bo Yao, Zuzanna Zagrodzka, Tahreen Zahra, Mirela Zaneva, Xiaomeng Zhang, Ziwei Zhao, Han Zhong, Aras Zirgulis, Jiacheng Zou, Floris Zoutman, Christelle Zozoungbo

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

Large Language Models (LLMs) such as ChatGPT are transforming how scientists conduct and validate research, offering promise as tools to improve scientific reproducibility. However, computational reproducibility and error detection remain expensive and labor-intensive. We experimentally test how collaboration between researchers and LLM assistants influences the reproduction of quantitative social science findings across different levels of AI autonomy. We randomly assigned 288 researchers to 103 teams working under three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). Teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. However, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification.

Original language: English
Pages (from-to): e2524747123
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 123
Issue number: 22
Publication status: Published - 2 Jun 2026

Keywords

  • AI
  • reproducibility
  • large language models
