Comments on “Statistics and the replication crisis”

A team led by Brian Nosek

Kaj Sotala 2020-07-03

A team led by Brian Nosek repeated a hundred well-known psychology experiments, and found a statistically significant result in less than half. (“Estimating the reproducibility of psychological science,” Science, 28 August 2015, 349:6251.)

This sounds somewhat misleading; “a hundred well-known experiments” implies that they were particularly famous or important studies, but the selection criteria were “articles published in 2008 in three important psychology journals, for which we could find replication teams with relevant interests and expertise”. Most of the studies in question had been cited fewer than a hundred times; one had as few as 6 citations. ( https://osf.io/fgjvw/ )

In fairness to psychology, it might also be worth noting that at least one project to specifically investigate famous and well-known findings ended up replicating all of them ( https://digest.bps.org.uk/2017/06/05/these-nine-cognitive-psychology-findings-all-passed-a-stringent-test-of-their-replicability/ ).

Revised accordingly

David Chapman 2020-07-04

Thanks, footnote updated.

Over 50% of claims made on the internet are false

Sian 2020-07-31

“There are no right or wrong answers (so long as you stay within the formal domain of applicability)” … “Science can’t be reduced to any fixed method, nor evaluated by any fixed criterion.”

Colour me confused, but don’t you make lots of rather strong claims that most/many (> 50%) of research claims in (some/many/most?) sciences are false, with a right or wrong answer based on a fixed method and criterion?

“In large-scale replication efforts, the false positive rate typically comes out greater than 50%.”

Well, that is not really “true”. There are a growing number of large-scale replication studies, with different headline numbers attached to them. One, for example, as a previous commenter notes, shows 100% success in replication. Though convenient for science-trashing polemical purposes, there is no “typically”, because the studies are quite different, with varying goals. And sadly we can’t calculate what the real false positive rate is (as compared to Ioannidis’s implausible and fairly well discredited models), because aside from some reasonably rare exceptions (e.g. ESP) we don’t normally know whether an underlying effect is “true” or not.

I think what you really mean to say is that when a study is replicated (say, done many years later with a different sample and non-exact methods) and the replication was not significant while the original was, then the original finding was “false”. This view is not really tenable, for a number of reasons. One is the methodological issue that many studies (including some of those in the replication projects) are not actually falsifiable, as the original studies are so vague and underpowered that they can’t be tested and verified (see “Why most of psychology is statistically unfalsifiable” by Lakens and Morey). So this would be a case of “not even false”, rather than false. But these days it is common to refer to replications being consistent or inconsistent with a previous finding (though even that can be tricky).
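
(For what it’s worth, one rough way to check “consistent or inconsistent” is to ask whether the replication estimate falls inside a prediction interval built from the original estimate and both studies’ standard errors. A minimal sketch in Python, with made-up effect sizes and standard errors, and a hypothetical helper name:)

    import math

    def consistent_with_original(orig_est, orig_se, rep_est, rep_se, z=1.96):
        # 95% prediction interval for the replication estimate, given the
        # original estimate and the sampling error of both studies.
        margin = z * math.sqrt(orig_se**2 + rep_se**2)
        return abs(rep_est - orig_est) <= margin

    # Made-up numbers: original d = 0.45 (SE 0.15); replication d = 0.15 (SE 0.12).
    # The replication is not significant on its own (z is about 1.25), yet it is
    # still consistent with the original estimate rather than a clear contradiction.
    print(consistent_with_original(0.45, 0.15, 0.15, 0.12))  # True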

Now obviously this is still a significant problem for some scientific disciplines, and it doesn’t detract massively from the argument you are making, but I find the framing interesting.

So you suggest the deep cause of the replication crises in many fields is incentives, which seems about right. But in terms of solutions we have a problem, because “form” or “certainty” is a powerful incentive, no matter how comfortable you are with uncertainty. We can see this in your post here (and probably in my response too), where you draw stronger conclusions than are warranted, presumably due to some incentive to put forward that view.

I am reminded of someone I get into arguments with (as much as I try to avoid them). They have some deeply held convictions on the state of the world, but when I offer my own view they play the nebulosity card, and I am told “well, we don’t really know that for sure, who really knows”. This can be quite annoying.

"Typically"

David Chapman 2020-08-02

Thanks; I’ve changed that footnote to say “In several large-scale replication efforts, the false positive rate was found to be greater than 50%.”

Could you point me to papers discrediting Ioannidis’ methodology? I don’t know about this.

Plausibility isn't as sexy as bold

Sian 2020-08-03

I may have over-egged “discredited” (because it is still widely cited, though when it comes to his contrarian Covid claims it is quite applicable: https://www.wired.com/story/prophet-of-scientific-rigor-and-a-covid-contrarian/), but I will stand by “implausible”.

Jager and Leek (2014) put their estimate at 14% (based on evidence) for biomedical research. This is from a special issue devoted to the question, and I think the summary of that issue is that >50% is implausible (and 14% optimistic). But importantly, Ioannidis is a medical researcher, and his model was based on a hypothetical genomic study, which didn’t take into account the corrections for multiple comparisons that actually take place in this research field, leading to an “unrealistic straw man” (Samsa, 2013).

Critical to the Ioannidis example was the prior probability of a hypothesis being true. As noted in the comment above, this is unknown (and not subject to cross-field generalisations), but I suspect that the number Ioannidis used was geared to get a > 50% result, so that he could get a sexy “most” paper title (and it sure has earned him citations).
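
(The arithmetic behind headline numbers like this is just the false-discovery-rate calculation, and the answer is dominated by that assumed prior. A minimal sketch in Python, with illustrative values for alpha, power, and the prior rather than estimates for any real field:)

    def false_discovery_rate(prior, alpha=0.05, power=0.8):
        # Among "significant" results, what fraction are false positives?
        true_positives = power * prior
        false_positives = alpha * (1 - prior)
        return false_positives / (true_positives + false_positives)

    for prior in (0.01, 0.10, 0.25, 0.50):
        print(f"prior = {prior:.2f}  ->  FDR = {false_discovery_rate(prior):.0%}")
    # With these (made-up) alpha and power values, a 1% prior gives an FDR of
    # about 86%, while a 50% prior gives about 6%: the assumed prior does the work.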

This prior can’t apply to “most” fields as a general rule: Ioannidis uses an atheoretical example with a very low discovery rate, which likely isn’t very reflective of most scientific endeavour, such as social psychology (Stroebe, 2016). The prior probability will vary a lot across fields. Based on an empirically derived estimate, Schimmack puts the false discovery risk for some fields of psychology between 8.6% and 17.6% (see:
https://replicationindex.com/2019/01/15/ioannidis-2005-was-wrong-most-published-research-findings-are-not-false/ ).
Of course, even if it is 20% instead of > 50%, that is still pretty bad. But it doesn’t make for as good a headline.

Absence of evidence

Erik 2024-02-17

You wrote: “But science doesn’t work, most of the time, even. The replication crisis consists of the realization that, in many sciences, most of what had been believed, based on statistical analyses, was actually false.”

That may be true, but if a replication experiment does not reach “p < 0.05”, that does not imply that the original claim was false. After all, the absence of evidence is not the same as evidence of absence!
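
(A toy simulation makes the point, with made-up numbers: even when a real effect exists, a small replication will often fail to reach p < 0.05 simply because it is underpowered.)

    import math
    import random
    import statistics

    def two_sample_p(xs, ys):
        # Rough two-sample z-test p-value; good enough for a toy illustration.
        se = math.sqrt(statistics.variance(xs) / len(xs) +
                       statistics.variance(ys) / len(ys))
        z = (statistics.mean(xs) - statistics.mean(ys)) / se
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    random.seed(0)
    n, effect, runs, misses = 30, 0.3, 2000, 0   # true d = 0.3, n = 30 per group
    for _ in range(runs):
        control = [random.gauss(0.0, 1.0) for _ in range(n)]
        treated = [random.gauss(effect, 1.0) for _ in range(n)]
        if two_sample_p(treated, control) >= 0.05:
            misses += 1
    print(f"Real effect, yet p >= 0.05 in {misses / runs:.0%} of replications")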