Estimating the reproducibility of psychological science

Open Science Collaboration, R.M. Rahal

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Introduction
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. Scientific claims should not gain credence because of the status or authority of their originator but by the replicability of their supporting evidence. Even research of exemplary quality may have irreproducible empirical findings because of random or systematic error.
Rationale
There is concern about the rate and predictors of reproducibility, but limited evidence. Potentially problematic practices include selective reporting, selective analysis, and insufficient specification of the conditions necessary or sufficient to obtain the results. Direct replication is the attempt to recreate the conditions believed sufficient for obtaining a previously observed finding and is the means of establishing reproducibility of a finding with new data. We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
Results
We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. There is no single standard for evaluating replication success. Here, we evaluated reproducibility using significance and P values, effect sizes, subjective assessments of replication teams, and meta-analysis of effect sizes. The mean effect size (r) of the replication effects (Mr = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (Mr = 0.403, SD = 0.188), representing a substantial decline. Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
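To make the confidence-interval indicator above concrete, here is a minimal sketch in Python (illustration only, not the project's analysis code, which is not included in this record). It checks whether an original effect size r falls inside the 95% confidence interval of a replication effect size, using the Fisher z transformation; the effect sizes and sample size in the example are hypothetical.

import math

def fisher_z(r):
    # Fisher z transformation of a correlation coefficient r
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    # Back-transform a Fisher z value to a correlation
    return math.tanh(z)

def original_in_replication_ci(r_original, r_replication, n_replication, z_crit=1.96):
    # True if the original r lies within the 95% CI of the replication effect size
    se = 1.0 / math.sqrt(n_replication - 3)   # standard error of Fisher z
    z_rep = fisher_z(r_replication)
    lower = inverse_fisher_z(z_rep - z_crit * se)
    upper = inverse_fisher_z(z_rep + z_crit * se)
    return lower <= r_original <= upper

# Hypothetical values: original r = 0.40, replication r = 0.20 with n = 120.
print(original_in_replication_ci(0.40, 0.20, 120))  # False for these values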
Conclusion
No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise). The latter factors certainly can influence replication success, but they did not appear to do so here.

Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication. Innovation is the engine of discovery and is vital for a productive, effective scientific enterprise. However, innovative ideas become old news fast. Journal reviewers and editors may dismiss a new test of a published idea as unoriginal. The claim that “we already know this” belies the uncertainty of scientific evidence. Innovation points out paths that are possible; replication points out paths that are likely; progress relies on both. Replication can increase certainty when findings are reproduced and promote innovation when they are not. This project provides accumulating evidence for many findings in psychological research and suggests that there is still more work to do to verify whether we know what we think we know.
Original language: English
Article number: aac4716
Journal: Science
Volume: 349
Issue number: 6251
DOI: 10.1126/science.aac4716
Publication status: Published - 2015

Cite this

Open Science Collaboration, & Rahal, R. M. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
@article{4b1e6c373ee2414bb79066024d631d4a,
title = "Estimating the reproducibility of psychological science",
author = "{Open Science Collaboration} and Aarts, {Alexander A.} and Anderson, {Joanna E.} and Anderson, {Christopher J.} and Attridge, {Peter R.} and Angela Attwood and Jordan Axt and Molly Babel and Stepan Bahnik and Erica Baranski and Michael Barnett-Cowan and Elizabeth Bartmess and Jennifer Beer and Raoul Bell and Heather Bentley and Leah Beyan and Grace Binion and Denny Borsboom and Annick Bosch and Bosco, {Frank A.} and Bowman, {Sara D.} and Brandt, {Mark J.} and Erin Braswell and Hilmar Brohmer and Brown, {Benjamin T.} and Kristina Brown and Jovita Bruening and Ann Calhoun-Sauls and Callahan, {Shannon P.} and Elizabeth Chagnon and Jesse Chandler and Chartier, {Christopher R.} and Felix Cheung and Christopherson, {Cody D.} and Linda Cillessen and Russ Clay and Hayley Cleary and Cloud, {Mark D.} and Michael Cohn and Johanna Cohoon and Simon Columbus and Andreas Cordes and Giulio Costantini and Chris Hartgerink and Job Krijnen and Nuijten, {Michele B.} and {van 't Veer}, {Anna E.} and {Van Aert}, Robbie and {van Assen}, M.A.L.M. and Joeri Wissink and Marcel Zeelenberg and R.M. Rahal",
year = "2015",
doi = "10.1126/science.aac4716",
language = "English",
volume = "349",
journal = "Science",
issn = "0036-8075",
publisher = "American Association for the Advancement of Science",
number = "6251",

}
