Many Labs 2: Investigating variation in replicability across samples and settings

Many Labs 2

Research output: Contribution to journal › Article › Scientific › peer review


Abstract

We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. 
Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.
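The heterogeneity statistics reported in the abstract (Cohen's d, Cochran's Q, and tau) can be illustrated with a minimal sketch. This is not the Many Labs 2 analysis code; the function names and inputs are hypothetical conveniences, and tau is estimated here with the DerSimonian-Laird method as one common choice:

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations of per-sample effect
    estimates from the inverse-variance-weighted pooled effect."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

def dersimonian_laird_tau(effects, variances):
    """Between-sample standard deviation (tau) via DerSimonian-Laird."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    q = cochran_q(effects, variances)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau_sq = max(0.0, (q - (k - 1)) / c)
    return math.sqrt(tau_sq)
```

Under this sketch, identical per-sample effects give Q = 0 and tau = 0, while greater spread across samples inflates both, mirroring the abstract's use of Q for significance testing and tau for gauging the magnitude of heterogeneity.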
Original language: English
Pages (from-to): 443-490
Journal: Advances in Methods and Practices in Psychological Science
Volume: 1
Issue number: 4
DOI: 10.1177/2515245918810225
Publication status: Published - 2018

Cite this

@article{81f9a2883e264fd2a4ffc77ade701bba,
title = "Many Labs 2: Investigating variation in replicability across samples and settings",
abstract = "We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54{\%}) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50{\%}) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25{\%}) of the replications yielded effect sizes larger than the original ones, and 21 (75{\%}) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57{\%}), and 9 effects (32{\%}) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39{\%}) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. 
Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.",
author = "{Many Labs 2} and Klein, {Richard A.} and Michelangelo Vianello and Fred Hasselman and Adams, {Byron G.} and Adams, {Reginald B.} and Sinan Alper and Mark Aveyard and Axt, {Jordan R.} and Babalola, {Mayowa T.} and Štěp{\'a}n Bahn{\'i}k and Rishtee Batra and Mih{\'a}ly Berkics and Bernstein, {Michael J.} and Berry, {Daniel R.} and Olga Bialobrzeska and Binan, {Evans Dami} and Konrad Bocian and Brandt, {Mark J.} and Robert Busching and R{\'e}dei, {Anna Cabak} and Huajian Cai and Fanny Cambier and Katarzyna Cantarero and Carmichael, {Cheryl L.} and Francisco Ceric and Jesse Chandler and Jen-ho Chang and Armand Chatard and Chen, {Eva E.} and Winnee Cheong and Cicero, {David C.} and Sharon Coen and Coleman, {Jennifer A.} and Brian Collisson and Conway, {Morgan A.} and Corker, {Katherine S.} and Curran, {Paul G.} and Fiery Cushman and Dagona, {Zubairu K.} and Ilker Dalgar and {Dalla Rosa}, Anna and Davis, {William E.} and {De Bruijn}, Maaike and {De Schutter}, Leander and {De Vries}, Marieke and Hans Ijzerman and Yoel Inbar and Esther Maassen and Pollmann, {Monique M. H.} and {Van Aert}, {Robbie C. M.}",
note = "All data and materials have been made publicly available via the Open Science Framework and can be accessed at https://osf.io/8cd4r/. The design and analysis plans were preregistered at the Open Science Framework and can be accessed at https://osf.io/c97pd/. The complete Open Practices Disclosure for this article can be found at http://journals.sagepub.com/doi/suppl/10.1177/2515245918810225. This article has received badges for Open Data, Open Materials, and Preregistration. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges.",
year = "2018",
doi = "10.1177/2515245918810225",
language = "English",
volume = "1",
pages = "443--490",
journal = "Advances in Methods and Practices in Psychological Science",
issn = "2515-2459",
number = "4",

}

Many Labs 2: Investigating variation in replicability across samples and settings. / Many Labs 2.

In: Advances in Methods and Practices in Psychological Science, Vol. 1, No. 4, 2018, p. 443-490.


TY - JOUR

T1 - Many Labs 2

T2 - Investigating variation in replicability across samples and settings

AU - Many Labs 2

AU - Klein, Richard A.

AU - Vianello, Michelangelo

AU - Hasselman, Fred

AU - Adams, Byron G.

AU - Adams, Reginald B.

AU - Alper, Sinan

AU - Aveyard, Mark

AU - Axt, Jordan R.

AU - Babalola, Mayowa T.

AU - Bahník, Štěpán

AU - Batra, Rishtee

AU - Berkics, Mihály

AU - Bernstein, Michael J.

AU - Berry, Daniel R.

AU - Bialobrzeska, Olga

AU - Binan, Evans Dami

AU - Bocian, Konrad

AU - Brandt, Mark J.

AU - Busching, Robert

AU - Rédei, Anna Cabak

AU - Cai, Huajian

AU - Cambier, Fanny

AU - Cantarero, Katarzyna

AU - Carmichael, Cheryl L.

AU - Ceric, Francisco

AU - Chandler, Jesse

AU - Chang, Jen-ho

AU - Chatard, Armand

AU - Chen, Eva E.

AU - Cheong, Winnee

AU - Cicero, David C.

AU - Coen, Sharon

AU - Coleman, Jennifer A.

AU - Collisson, Brian

AU - Conway, Morgan A.

AU - Corker, Katherine S.

AU - Curran, Paul G.

AU - Cushman, Fiery

AU - Dagona, Zubairu K.

AU - Dalgar, Ilker

AU - Dalla Rosa, Anna

AU - Davis, William E.

AU - De Bruijn, Maaike

AU - De Schutter, Leander

AU - De Vries, Marieke

AU - Ijzerman, Hans

AU - Inbar, Yoel

AU - Maassen, Esther

AU - Pollmann, Monique M. H.

AU - Van Aert, Robbie C. M.

N1 - All data and materials have been made publicly available via the Open Science Framework and can be accessed at https://osf.io/8cd4r/. The design and analysis plans were preregistered at the Open Science Framework and can be accessed at https://osf.io/c97pd/. The complete Open Practices Disclosure for this article can be found at http://journals.sagepub.com/doi/suppl/10.1177/2515245918810225. This article has received badges for Open Data, Open Materials, and Preregistration. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges.

PY - 2018

Y1 - 2018

N2 - We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. 
Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.

AB - We conducted preregistered replications of 28 classic and contemporary published findings, with protocols that were peer reviewed in advance, to examine variation in effect magnitudes across samples and settings. Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories. Using the conventional criterion of statistical significance (p < .05), we found that 15 (54%) of the replications provided evidence of a statistically significant effect in the same direction as the original finding. With a strict significance criterion (p < .0001), 14 (50%) of the replications still provided such evidence, a reflection of the extremely high-powered design. Seven (25%) of the replications yielded effect sizes larger than the original ones, and 21 (75%) yielded effect sizes smaller than the original ones. The median comparable Cohen’s ds were 0.60 for the original findings and 0.15 for the replications. The effect sizes were small (< 0.20) in 16 of the replications (57%), and 9 effects (32%) were in the direction opposite the direction of the original effect. Across settings, the Q statistic indicated significant heterogeneity in 11 (39%) of the replication effects, and most of those were among the findings with the largest overall effect sizes; only 1 effect that was near zero in the aggregate showed significant heterogeneity according to this measure. Only 1 effect had a tau value greater than .20, an indication of moderate heterogeneity. Eight others had tau values near or slightly above .10, an indication of slight heterogeneity. Moderation tests indicated that very little heterogeneity was attributable to the order in which the tasks were performed or whether the tasks were administered in lab versus online. 
Exploratory comparisons revealed little heterogeneity between Western, educated, industrialized, rich, and democratic (WEIRD) cultures and less WEIRD cultures (i.e., cultures with relatively high and low WEIRDness scores, respectively). Cumulatively, variability in the observed effect sizes was attributable more to the effect being studied than to the sample or setting in which it was studied.

U2 - 10.1177/2515245918810225

DO - 10.1177/2515245918810225

M3 - Article

VL - 1

SP - 443

EP - 490

JO - Advances in Methods and Practices in Psychological Science

JF - Advances in Methods and Practices in Psychological Science

SN - 2515-2459

IS - 4

ER -