Leiter Reports: A Philosophy Blog

News and views about philosophy, the academic profession, academic freedom, intellectual culture, and other topics. The world’s most popular philosophy blog, since 2003.


On the replication crisis in psychology

An interesting read; from the conclusion:

The replication crisis, if nothing else, has shown that productivity is not intrinsically valuable. Much of what psychology has produced has been shown, empirically, to be a waste of time, effort, and money. As Gibson put it: our gains are puny, our science ill-founded. As a subject, it is hard to see what it has to lose from a period of theoretical confrontation.

Thoughts from readers?


17 responses to “On the replication crisis in psychology”

  1. This is a very broad attack, seemingly motivated by scepticism regarding ANY attempt to understand the mind empirically. It's worth noting the area of psychology in which replication has proved to be a problem: studies of unconscious influences on personal attitudes such as belief, evaluation, motivation, and social behaviour. Thus, the Open Science Collaboration (Aarts et al) write that their sample of 100 problematic psychology articles were "coded as representing cognitive (n = 43 studies) or social-personality (n = 57 studies)." Is there a corresponding problem in e.g. perceptual psychology? For example, in the experiments that use statistical methods (excluding brain-probing methods such as single-neuron or MRI studies) on visual attention, perceptual illusions, cross-modal effects, etc.? I think not. Psychology isn't in crisis; social psychology isn't the whole of the discipline.

  2. The incentives/pressures that presumably led to the problems are not isolated to social psych (e.g., the speed of publishing and the need to secure grants), so I think we need more details to avoid default global worries (about psychology and other fields too).

    I am not an expert, but if anyone is interested in the presumptive case for global worries, I recommend starting with Brian Nosek's publications (https://scholar.google.com/citations?user=ztt_j28AAAAJ&hl=en&oi=ao).

    For example, a highly cited 2013 paper he co-authored is titled "Power failure: why small sample size undermines the reliability of neuroscience", and it starts: "It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others…" https://www.nature.com/articles/nrn3475#ref1

    I think at this point the burden is on optimists (such as Matthen above) to show that the problem is isolated to social psych, e.g. that there is not "a corresponding problem in e.g. perceptual psychology." It would be great to learn that some sub-areas are safe, but it might be a mixed bag wherever one looks (even if some areas are less rife with false result claims than others).

    I would be happy to be shown that there are no problems in specific areas, but I think we need pointers to the details.
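
    To make the "power failure" worry concrete, here is a minimal sketch (my own illustration with assumed numbers, not taken from the Nosek paper) of how the power of an ordinary two-sample t-test depends on sample size, assuming a medium true effect of d = 0.5:

    ```python
    # Minimal sketch (illustrative, not from the cited paper): power of a
    # two-sample t-test at several per-group sample sizes, assuming a
    # true effect of d = 0.5 and alpha = 0.05.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for n in (10, 20, 50, 100):
        power = analysis.power(effect_size=0.5, nobs1=n, alpha=0.05)
        print(f"n = {n:3d} per group -> power = {power:.2f}")

    # Roughly: n=10 gives ~0.18, n=20 ~0.33, n=50 ~0.70, n=100 ~0.94.
    # Small studies mostly miss true effects, and the significant results
    # they do produce tend to overestimate effect sizes.
    ```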

  3. I agree with what Mohan Matthen says. Putting aside dishonesty, which can happen in any field, most areas of perceptual psychology do not have a problem with power, because running many trials is relatively cheap and easy.

  4. I'm not an optimist (or a pessimist either). I am just pointing out that the replication crisis, as exposed by such authors as Aarts et al, isn't as wide as sometimes advertised. (And of course there have been widely publicized instances of fraud in some of the problematic areas, which has increased public perception of unreliability.) There is no justification for saying that in cognitive psychology in general, "our gains are puny; our science ill-founded."
    I don't particularly want to go into substantive arguments about methodology in the psychology of perception, and I certainly agree that there are pressures to publish (just as there are in medicine and in physics, and even in moral philosophy, for that matter). But I do want to point out that lower sample sizes are tolerated in some fields than in others . . . perhaps because the phenomena themselves are assumed to be more uniform across subjects. So for example with metacontrast or colour opponency studies, the sample is sometimes just the authors themselves plus a few grad students. Look at the classic studies by Hurvich and Jameson. These studies have never been challenged, and are considered well-established.

  5. Thanks. I took (perhaps inaptly) your first comment to imply that the problems are constrained to social psych. I think that’s a mistake, but I defer to your knowledge on the science of perception. And I agree that there is no reason to think our gains are puny, our science is ill-founded.

  6. A comment on Mohan Matthen: "But I do want to point out that lower sample sizes are tolerated in some fields than in others . . . perhaps because the phenomena themselves are assumed to be more uniform across subjects. So for example with metacontrast or colour opponency studies, the sample is sometimes just the authors themselves plus a few grad students. Look at the classic studies by Hurvich and Jameson. These studies have never been challenged, and are considered well-established."
    Many of the best established phenomena in perception used small numbers of subjects because power was achieved by running many trials with those subjects. For example, the landmark mental rotation article of 1971 by Shepard and Metzler used only 8 subjects but each of them had to make judgments about 1600 pairs of figures. When this article came out, everyone knew it was right (I remember!!) and of course it has been replicated many times in many different ways.
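
    To illustrate the trials-versus-subjects point, here is a toy simulation (my own, with invented numbers, not Shepard and Metzler's data). Averaging many trials per subject makes each subject's estimate nearly noise-free, so even a handful of subjects can pin down an effect precisely, provided the effect is fairly uniform across people:

    ```python
    # Toy simulation (invented numbers): precision of a group estimate as
    # a function of trials per subject. True per-trial effect 50 ms,
    # trial-to-trial noise SD 300 ms, between-subject SD 20 ms.
    import numpy as np

    rng = np.random.default_rng(0)
    n_subjects, true_effect, trial_sd, subject_sd = 8, 50.0, 300.0, 20.0

    for n_trials in (10, 100, 1600):
        subject_means = []
        for _ in range(n_subjects):
            subj_effect = true_effect + rng.normal(0, subject_sd)
            subject_means.append(
                (subj_effect + rng.normal(0, trial_sd, n_trials)).mean())
        subject_means = np.array(subject_means)
        se = subject_means.std(ddof=1) / np.sqrt(n_subjects)
        print(f"{n_trials:5d} trials/subject -> "
              f"mean = {subject_means.mean():5.1f} ms, SE = {se:4.1f} ms")
    ```

    With 1600 trials each, the per-subject means are dominated by the (small) between-subject variability, so 8 subjects suffice.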

  7. Peter Carruthers

    There is a nice paper by Van Bavel and colleagues in PNAS showing that the extent of psychological-study reproducibility is well predicted by a priori estimates (made blind to whether each study had in fact replicated) of the extent to which the phenomenon in question is likely to be influenced by contextual factors (or have "hidden moderators"). Even an attempted replication that keeps the same methods may unwittingly change stuff that makes a difference. That doesn't necessarily mean that the original study was flawed or that the phenomenon isn't real. See: Context sensitivity in scientific reproducibility, Jay J. Van Bavel, Peter Mende-Siedlecki, William J. Brady, Diego A. Reinero, Proceedings of the National Academy of Sciences, June 2016, 113 (23) 6454–6459; DOI: 10.1073/pnas.1521897113

  8. Peter Carruthers, concerning changing stuff that makes a difference, do you think that means that the core issue is one of poor experimental design (designs permitting too many potential contextual variables)? Is it a lack of care in controlling relevant variables? Or genuine ignorance about what variables might be relevant?

    Also, pushing against the study, why aren't the contextual factors they studied too broad to be meaningful? Why doesn't that study just repeat the claim "these studies do not replicate" but with irrelevant noise attached, so that the contextual estimates just reflect the underlying lack of replicability? They seem too broad to be interesting, don't they? Wouldn't it be better to try to think of more specific factors that might influence the results?

  9. Mohan is right that the replication crisis doesn't extend to all areas of psychology. Personality psychology – in particular, the relations between the Big Five and life outcomes – also has fairly high replication rates (see Soto, C. J. (2019). How replicable are links between personality traits and consequential life outcomes? The Life Outcomes of Personality Replication Project. Psychological Science, 30(5), 711-727. https://journals.sagepub.com/doi/abs/10.1177/0956797619831612).

    David Funder (UC Riverside) has a nice post on his blog where he gives a few reasons for why this might be the case: https://funderstorms.wordpress.com/2016/05/12/why-doesnt-personality-psychology-have-a-replication-crisis/

  10. Hurvich and Jameson used exactly two "observers," one labeled H and the other J! (And, of course, their four-part paper was considered revolutionary, and though their theory has been challenged, their results haven't been significantly challenged.)

  11. If you look at the Many Labs 2 replication (Klein et al 2018; DOI: 10.1177/2515245918810225), they repeated 28 studies, including a few x-phi classics, with a view to looking at contextual factors. Around half replicated, which is about what I would expect given the usual sample sizes for initial findings in the area and their prior plausibility:

    "For 12 of the 28 effects, moderators or sample characteristics that may be necessary to observe the effect were identified…[e]vidence consistent with the hypothesized moderation was obtained for just 1 of the 12 effects…and weak or partial evidence was obtained for 2".

    However, the I² statistic, a measure of between-lab "non-random" differences for the same experiment, was above 50% ("moderate heterogeneity") for two-thirds of the studies. To study such differences, which for variables such as "Moral foundations of liberals versus conservatives" (overall standardized effect 0.3, labs varying between −0.1 and +1, I² = 64%) are undoubtedly due to the cultural background of participants, one needs even larger sample sizes.
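
    For readers who have not met I² before, here is a small sketch of how it is computed from per-lab effect estimates (the numbers are invented for illustration, not taken from Many Labs 2):

    ```python
    # Sketch of Cochran's Q and the I^2 heterogeneity statistic from
    # per-lab effect estimates (invented numbers, not Many Labs 2 data).
    import numpy as np

    effects = np.array([-0.10, 0.15, 0.30, 0.45, 0.70, 1.00])  # per-lab d
    ses = np.array([0.12, 0.10, 0.15, 0.11, 0.13, 0.14])       # their SEs

    w = 1.0 / ses**2                          # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)  # fixed-effect pooled estimate
    Q = np.sum(w * (effects - pooled) ** 2)   # Cochran's Q
    df = len(effects) - 1
    I2 = max(0.0, (Q - df) / Q)  # share of variance beyond sampling error
    print(f"pooled d = {pooled:.2f}, Q = {Q:.1f}, I^2 = {I2:.0%}")
    ```

    Values above roughly 50% are conventionally read as moderate-to-high heterogeneity: the labs are not all estimating a single underlying effect.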

  12. I'm an outsider, no longer working in philosophy and only peripherally involved in these things even back when I was a philosophy student. However, I would point out that "replication" comes in degrees, and there the "not enough rationalism and too much empiricism" problem (so to speak) does in a way play a role. This is because some finding may not be *quite* what someone thinks it is (hence, taken pessimistically, false) – but it might, so to speak, have a grain of truth in it. For example, imagine what would have happened if someone had claimed to have failed to replicate Galileo's findings on falling bodies, and concluded that they weren't true (which, strictly, they aren't, since the finding treats what we now call "g" as constant), because they had, e.g., dropped objects from slightly greater heights. This is one way to understand Peter Carruthers' remarks. Only "linking theories" (as far as I can tell) can help to explore what might count as variables to control. (How one does this in general I really do not know, but it does seem to be the way to approach the matter.)

    As it happens, I encounter this "too much empiricism" amongst some of my programmer colleagues (I work in software security), who often do not have a good mental model of how their software system is built and hence of how it can "go wrong" – so the problem outlined in the referenced article is, IMO, not limited to psychology but seems to apply more generally. In software we are now realizing how hard it is to have a good model of modern systems (memory, for example); perhaps this applies here too.

    The confound of "get it done fast" does seem to apply in both disciplines as well, alas, so there's also that.

  13. A concern perhaps similar to that expressed by GradStudent:

    The point of psychology is presumably to uncover generalizable truths about the human mind. If the results of studies that fail to replicate are so highly contingent on minor differences in circumstances, why should we have any confidence that they tell us anything interesting about how people's minds work in our highly complex everyday situations?

    It would have behooved van Bavel et al. to address this concern, but they do not.

  14. Why isn't behavior being "highly contingent on minor differences in circumstances" something "interesting about how people's minds work in our highly complex everyday situations"?

  15. I am not sure if this addresses your concern, and maybe I have failed to grasp the nub of the issue, but it seems to me that there are (at least) two sorts of cases of replication failure.

    In the first, more pernicious kind of case, the facts are inherently variable and any positive result is just a fluke. In the less pernicious, or more redeemable, kind of case, replication failure is the result of bad methodology and could be avoided given better experimental design.

    As I understand van Bavel et al., they are saying that studies of a certain type are more likely to fall into the first kind of case. In other words, they are saying that given optimal methodology, many cases of replication failure must be traced to variability of "time, culture, or location" and cannot be remedied. So, they are not saying that you can never place any faith in generalizations about the human mind. Rather, they are saying that only some (not all) facts about the human mind are highly variable.

    Does such unreliability in one area infect every other area? Not if we have good criteria of demarcation. I was suggesting (comments 1 and 4 above) that perceptual psychology was not fatally infected because these phenomena are more uniform across history and across culture.
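
    A toy simulation (my own illustration) may make the distinction vivid: from the bare fact of a failed replication, a fluke under a null effect and a real but context-bound effect can look exactly the same; only varying the context systematically separates them.

    ```python
    # Toy simulation: two ways a replication can fail. Either there is
    # no real effect (the original was a fluke), or the effect is real
    # but present only when a hidden contextual moderator is in place.
    import numpy as np

    rng = np.random.default_rng(1)
    n, runs = 30, 2000  # participants per group, simulated studies

    def significant(effect):
        """One two-group study; True if |z| > 1.96 (p < .05)."""
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        z = (b.mean() - a.mean()) / np.sqrt(2.0 / n)
        return abs(z) > 1.96

    fluke = np.mean([significant(0.0) for _ in range(runs)])
    with_ctx = np.mean([significant(0.8) for _ in range(runs)])
    without_ctx = np.mean([significant(0.0) for _ in range(runs)])
    print(f"null effect: significant {fluke:.0%} of the time (flukes)")
    print(f"context-bound effect: {with_ctx:.0%} with the moderator, "
          f"{without_ctx:.0%} without it")
    ```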

  16. Thanks for the replies, Anon and Mohan Matthen.

    Anon: I agree with you, but that the human mind is “highly contingent on minor differences in circumstances” was not the conclusion of all those studies that failed to replicate. That’s why the replication crisis is so bad.

    Indeed, one of van Bavel et al.'s conclusions was that responses to some studies are highly context-sensitive, and they want to get psychologists to focus more on how contextual factors can influence results; to that degree I applaud their work.

    But my sense is that van Bavel et al. missed a larger point: if a phenomenon is observed only in the confines of specific test circumstances (which are often highly artificial – e.g., they often involve "WEIRD" university students filling out forms or doing prescribed activities in an academic building), then the value of the first-order results of the studies is much lower than had previously been supposed. Instead of being able to conclude, e.g., that people walk more slowly when they think about their grandparents, all we can conclude is that a certain subclass of people walk more slowly in a certain experimental context when thinking about their grandparents. So the replication crisis is still a crisis, because it undermines at the very least a lot of the generality of the first-order results of the studies that failed to replicate.

    Mohan Matthen – from what I can see, I think you are precisely right that some areas have had much more success in replication and are less context-sensitive. My main concern is that there is a third kind of case of replication failure in addition to the two you list: the result that failed to replicate is not a fluke – it is a robust result, but one that exists only in a limited context. A replication effort whose experimental design was identical to the original study's would have reproduced the same result; but if a replication effort that used a quite similar yet slightly different methodology got a different result, that is very bad news for the generalizability and value of the original finding. As above, there is value in learning that certain phenomena are context-sensitive, but that is far from what was taken to be the value of the results of the original studies.

  17. All social sciences are in bad shape for anyone with a background in the "hard sciences". We need lots more whistleblowers and people who are willing to call a spade a spade.


