Leiter Reports: A Philosophy Blog

News and views about philosophy, the academic profession, academic freedom, intellectual culture, and other topics. The world’s most popular philosophy blog, since 2003.

Only about a third of published psychological findings (even less in social psychology) are reliable

NY Times reports, including a link to the Science article.  Haven't had time to read the article carefully; any readers care to comment on which results of interest to philosophers failed the reliability tests?

14 responses to “Only about a third of published psychological findings (even less in social psychology) are reliable”

  1. This isn't a comment about which results failed the reliability test, just some general thoughts on the relevant issues.

    I've seen a figure quoted that 10% of hypotheses are true (not that they come out "true" when tested, but as a science-independent fact about the class of hypotheses scientists are willing to test). On the assumption that all published correlations are reported at p = .05, it follows that at minimum about 1/3 of published positive results are actually wrong. (Consider 20 studies. At a 10% base rate, 2 of the tested hypotheses are true on average, and imperfect power means you detect at most those 2; of the other 18 false hypotheses, the 1-in-20 false-positive rate yields roughly 1 spurious "finding." So of roughly 3 positive results, at least 1 is false.) If your studies have less power (ability to detect true positives) and you're not adequately controlling for biases, the proportion of published "true" results that really are true goes down even further. (A rough numerical sketch of this arithmetic appears after the comment thread.)

    What's more, it wouldn't be surprising if in psychology or social psychology the "base rate" of true hypotheses is even lower. Things are complicated, and there are lots and lots of variables. So I would think the true takeaway is not that such-and-such researcher got it wrong, or is bad, but that we should really cast a much more skeptical eye toward such research. Not that it shouldn't be done! It is genuine evidence. It's just that a philosopher, for instance, would be utterly irresponsible to take a published "true" correlation (with p = .05) as anything more than something like a 50-50 coin flip in favor of the hypothesis. There's nothing wrong with the research; it just needs to be seen differently. (NB this is highly oversimplified, and as the NYTimes article states, effect size matters, and large effects were most often replicated; plus, lower p-values are better, etc. etc.)

  2. One study of interest to the free will crowd that failed to replicate is Vohs and Schooler's 2008 study "The Value of Believing in Free Will." That's the one where participants read a passage by Francis Crick endorsing a form of determinism and were then more likely to cheat on subsequent tests than those who didn't read the passage.

  3. Vohs and Schooler's (2008) finding re: belief in determinism and cheating did not replicate. That is of interest to some moral philosophers.

    Perusing the list of studies targeted for replication, I came across one author (Förster) who is under suspicion of fraudulent behavior on 8 papers and has already had one retracted.

    It would be interesting to know how many of the studies under investigation are by authors in similar situations. The sample is already tiny (100 studies from 98 articles… all from 2008, it appears), so making generalizations might be… hasty.

  4. One non-replicated result of possible philosophical import: inducing people to think that free will might be an illusion caused them to cheat more.

  5. One factor not mentioned in the NYTimes: everyone who follows experimental work knows of cases of important results that people try to replicate and fail, even with lots of help from the original experimenters who got the effect. Then…some labs succeed in replicating the phenomenon. Some important phenomena are fragile and it is hard to make them manifest.

  6. I would love to know what the rates of non-replicability are in the physical sciences, especially in chemistry and physics.

  7. Frankly I think that, from a philosophical point of view, the most interesting aspect of these difficulties in replication is the existence of the difficulties themselves.

    How do we account for them? How much is publication bias, how much is statistical fluke, how much is "fragile" phenomena, how much is bad methodology, how much is ideological corruption, how much is out-of-control ambition, how much is outright fraud? How seriously should we take a given result, or a set of related results, in the face of these difficulties? If philosophy of science is to be put to important use, this might be a good place to start.

  8. From the Reproducibility Report: "The most important point is that a failure to replicate does not directly indicate that the original effect is false. It may also not replicate because of insufficient power, design problems, or known and unknown limiting conditions. As such, the Reproducibility Project is investigating factors such as replication power, the evaluation of the study design by the original authors, and the original study’s sample and effect sizes as predictors of reproducibility. Identifying the contribution of these factors to reproducibility is useful because each has distinct implications for interventions to improve reproducibility."

    And the abstract from a February 2015 study of repeatability in computer systems research: "We describe a study into the extent to which Computer Systems researchers share their code and data and the extent to which such code builds. Starting with 601 papers from ACM conferences and journals, we examine 402 papers whose results were backed by code. For 32.3% of these papers we were able to obtain the code and build it within 30 minutes; for 48.3% of the papers we managed to build the code, but it may have required extra effort; for 54.0% of the papers either we managed to build the code or the authors stated the code would build with reasonable effort. We also propose a novel sharing specification scheme that requires researchers to specify the level of sharing that reviewers and readers can assume from a paper."

  9. I'm surprised this didn't become news when the Economist reported a similar effect (along with some of the analysis for which Lexington asks) in 2013: http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble. I suppose sometimes these things hit a nerve, though, and other times not.

  10. Part of the issue is the appropriate size of the replication study, too. The first study may have detected a true effect with the advantage of some degree of luck. In contexts I am familiar with, repeating the same study may have only a 20% chance of replicating a true finding. There are a few papers on "reproducibility probability," going back to Goodman (1992), that address just the statistical test itself, never mind external variables. (A sketch of that sort of calculation appears after the comment thread.)

  11. If the results of only 1/3 of such replicability studies were themselves replicable, what should we conclude?

  12. Just want to mention that failing to find an effect is not at all the same as finding there is no effect. So a single failure of replication does not of itself imply the original finding was false.

  13. Good question! As the physical sciences drift toward topics of present social/funding importance (IMHO a good thing), many projects ultimately are dead ends with no further analysis. Nobody knows what percentage of those papers contain nonsense published just for the sake of publication after time spent. In hot topics such as catalysis in chemistry, renewable energy in materials, etc., work tends to piggyback on recent research fairly closely, so those results are likely legit. Certainly there are high-profile debacles, such as high-performing "platinum-free" catalysts created in platinum-containing crucibles, or the "oxo-wall" mishap, but those are high-profile because of their rarity. This doesn't answer the question, but it does lend honest weight to socially important topics.

  14. On a related note, this recent New Yorker story on "methods videos" points to a hurdle for replication in the "hard" sciences: http://www.newyorker.com/tech/elements/how-methods-videos-are-making-science-smarter The problem that methods videos attempt to solve (the difficulty of providing precise, clear instructions for future researchers interested in reproducing a study) seems distinct from the one identified in the Reproducibility Report, namely, that the incentives are skewed against embarking on a reproduction study at all.

    I'll throw another element into the mix: this LGM post about a paper by Kieran Healy taking social-science colleagues to task for demanding more complexity (what he refers to as "nuance") in their theories (or at least in their colleagues' theories): http://www.lawyersgunsmoneyblog.com/2015/08/fuck-nuance Healy may be right, he may be wrong. I'm inclined to agree with one of the comments at LGM suggesting that Healy's paper is a roundabout way of saying that the perfect is the enemy of the good.

    But it is kind of funny that the NYT story ends with a cell biologist who laments "this pressure to publish cleaner, simpler results that don’t tell the entire story, in all its complexity."
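
Below is a rough numerical sketch of the base-rate argument in comment 1, under that commenter's stated assumptions (10% of tested hypotheses are true; every positive is reported at p = .05) plus an assumed statistical power figure, which the comment leaves unspecified:

```python
# Sketch of comment 1's base-rate arithmetic.
# Assumed inputs (the commenter's figures plus illustrative power values):
#   base_rate = 0.10  -> 10% of tested hypotheses are actually true
#   alpha     = 0.05  -> each false hypothesis has a 5% chance of a false positive
#   power     is not given in the comment; several values are tried below

def positive_predictive_value(base_rate: float, alpha: float, power: float) -> float:
    """Probability that a published positive result reflects a true hypothesis."""
    true_positives = base_rate * power          # true hypotheses correctly detected
    false_positives = (1 - base_rate) * alpha   # false hypotheses that slip through
    return true_positives / (true_positives + false_positives)

if __name__ == "__main__":
    for power in (1.0, 0.8, 0.5):
        ppv = positive_predictive_value(base_rate=0.10, alpha=0.05, power=power)
        print(f"power = {power:.2f}  ->  P(hypothesis true | positive result) = {ppv:.2f}")
```

With perfect power this gives a positive predictive value of about 0.69, i.e. roughly a third of positive results are false, matching the "at minimum 1/3" figure; at a power of 0.5 it drops to about 0.53, close to the "50-50 coin flip" the commenter mentions.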
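
And on the "reproducibility probability" point in comment 10: the sketch below is one simple version of that kind of calculation (assuming a normally distributed test statistic and a same-sized replication), not Goodman's (1992) exact treatment. The shrinkage values are illustrative assumptions about how much the original study may have overestimated the effect.

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def replication_probability(z_obs: float, shrinkage: float = 1.0, alpha: float = 0.05) -> float:
    """Chance that an identically sized replication reaches two-sided significance,
    assuming the true effect is `shrinkage` times the originally observed effect
    (shrinkage = 1.0 means the original estimate was exactly right)."""
    z_crit = _N.inv_cdf(1 - alpha / 2)            # about 1.96 for alpha = 0.05
    return 1 - _N.cdf(z_crit - shrinkage * z_obs)

if __name__ == "__main__":
    z_obs = 1.96  # an original result sitting right at p = .05
    for shrinkage in (1.0, 0.8, 0.6):
        p_rep = replication_probability(z_obs, shrinkage)
        print(f"true effect = {shrinkage:.2f} x observed  ->  P(replication significant) = {p_rep:.2f}")
```

Even with the effect estimated exactly right, a result sitting at p = .05 replicates only about half the time; if the original overestimated the effect (the "degree of luck" the commenter mentions), the chance falls further, into the neighborhood of the 20% figure cited.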
