Wednesday, January 28, 2015

The Dripping Stone Fallacy: Confirmation Bias in the Roman Empire and Beyond

What to do when the crops are failing because of a drought? Why, we persuade the Gods to send rain of course! I'll let the fourth Roman Emperor, Claudius, explain:

Derek Jacobi stuttering away as 
Claudius in the TV series I Claudius
There is a black stone called the Dripping Stone, captured originally from the Etruscans and stored in a temple of Mars outside the city. We go in solemn procession and fetch it within the walls, where we pour water on it, singing incantations and sacrificing. Rain always follows--unless there has been a slight mistake in the ritual, as is frequently the case.*
It sounds an awful lot as if Claudius is weighing in on the replication debate, coming down squarely on the side of replication critics, researchers who raise the specter of hidden moderators as soon as a non-replication materializes. Obviously, when a replication attempt omits a component that is integral to the original study (and was explicitly mentioned in the original paper), omission of that component borders on scientific malpractice. But hidden moderators are only invoked after the fact--they are "hidden" after all and so could by definition not have been omitted. Hidden moderators are slight mistakes or imperfections in the ritual that are only detected when the ritual does not produce the desired outcome. As Claudius would have us believe, if the ritual is performed correctly, then rain always follows. Similarly, if there are no hidden moderators, then the effect will always occur, so if the effect does not occur, there must have been a hidden moderator.**

And of course nobody bothers to look for small errors in the ritual when it is raining cats and dogs, or for hidden moderators when p<.05

I call this the Dripping Stone Fallacy.

Reviewers (and readers) of scientific manuscripts fall prey to a mild(er) version of the Dripping Stone Fallacy. They scrutinize the method and results sections of a paper if they disagree with its conclusions and tend to give these same sections a more cursory treatment if they agree with the conclusions. Someone surely must have investigated this already. If not, it would be rather straightforward to design an experiment and test the hypothesis. One could measure the amount of time spent reading the method section and memory for it in subjects who are known to agree or disagree with the conclusions of an empirical study.

Even the greatest minds fall prey to the Dripping Stone Fallacy. As Raymond Nickerson describes: Louis Pasteur refused to accept or publish results of his experiments that seemed to tell against his position that life did not generate spontaneously, being sufficiently convinced of his hypothesis to consider any experiment that produced counterindicative evidence to be necessarily flawed.

Confirmation bias comes in many guises and the Dripping Stone Fallacy is one of them. It makes a frequent appearance in the replication debate. Granted, the Dripping Stone Fallacy didn't prevent the Romans from conquering half the world but it is likely to be more debilitating to the replication debate.


* Robert Graves, Claudius the God, Penguin Books, 2006, p. 172.
** This is and informal fallacy; it is formally correct (modus tollens) but is based on a false premise.

Sunday, January 18, 2015

When Replicating Stapel is not an Exercise in Futility

Over 50 of Diederik Stapel’s papers have been retracted because of fraud. This means that his “findings,” have now ceased to exist in the literature. But what does this mean for his hypotheses?*

Does the fact that Stapel has committed fraud count as evidence against his hypotheses? Our first inclination is perhaps to think yes. In theory, it is possible that Stapel ran a number of studies, never obtained the predicted results, and then decided to take matters into his own hands and tweak a few numbers here and there. If there were evidence of a suppressed string of null results, then yes, this would certainly count as evidence against the hypothesis; it would probably be a waste of time and effort to try to “replicate” the “finding.” Because the finding is not a real finding, the replication is not a real replication. However, by all accounts (including Stapel’s own), once he got going, Stapel didn’t bother to run the actual experiment. He just made up all the data.

This means that Stapel’s fraud has no bearing on his hypotheses. We simply have no empirical data that we can use to evaluate his hypotheses. It it still possible that a hypothesis of his is supported in a proper experiment. Whether or not it makes sense to test that hypothesis is purely a matter of theoretical plausibility. And how do we evaluate replication attempts that were performed before the fraud had come to light? At the time the findings were probably seen as genuine--they were published, after all.

Prior to the exposure of Stapel’s fraudulent activities, Dutch social psychologist Hans IJzerman and some of his colleagues had embarked on a cross-cultural project, involving Brazilian subjects, that built on one of Stapel's findings. They then found out that another researcher in the Netherlands, Nina Regenberg, had already tried—and failed—to replicate these same findings in 9 direct and conceptual replications. As IJzerman and colleagues wryly observe:

At the time, these disconfirmatory findings were seen as ‘failed studies’ that were not worthy of publication. In hindsight, it seems painfully clear that discarding null effects in this manner has hindered scientific progress.

Ironically, the field that made it possible for Stapel to publish his made-up findings also made it impossible to publish failed replications of his work that involved actual findings. 

But the times they are a-changin’. IJzerman and Regenberg joined forces and together with their colleagues Justin Saddlemyer and Sander Koole they have written a paper, currently in press in Acta Psychologica, that reports 12 replications of a—now retracted—series of experiments published by Diederik Stapel and Gün Semin. Semin, of course, was unaware of Stapel's deception.**

Here is the hypothesis that was advanced by Stapel and Semin: priming with abstract linguistic categories (adjectives) should lead to a more abstract perceptual focus, whereas priming with concrete linguistic categories (action verbs) should lead to a more concrete perceptual focus. This linguistic category priming hypothesis is based on the uncontroversial observation that specific linguistic terms are recurrently paired with specific situations. As a result, Stapel and Semin hypothesized, linguistic terms may form associative links with cognitive processes. Because these associative links are stored in memory, they may be activated or “primed” whenever people encounter the relevant linguistic terms.

Stapel and Semin further hypothesized that verbs are associated with actions at a more concrete level than nouns. A verb like hit is used in a context like Harry is hitting Peter whereas an adjective like agressive is used in a more abstract description of the situation, as in Harry is being aggressive toward Peter. Because abstract information is more general, it may be associated with global perceptions, whereas concrete information may become with local perceptions. So far so good; I bet that many psychologists can follow this reasoning. Due to these associations, Stapel and Semin reason, priming verbs may elicit a focus on local details (i.e., the trees), while priming adjectives may elicit a focus on the global whole (i.e., the forest). This is a bit of a leap for me but let’s follow along.

Stapel and Semin reported four experiments in which they found evidence supporting their hypothesis. Priming with verbs led to more concrete processing than priming with adjectives. But of course these experiments were actually never performed and the findings were fabrications.

Let's look at some real data. Here is IJzerman et al.'s forest plot of the standardized mean difference scores between verb and adjective primes on global vs. local focus in twelve replications of the Stapel and Semin study.

Of the 12 studies, only one showed a significant effect (and it was not in the predicted direction). Overall, the standardized mean difference between the condition was practically zero. No shred of support for the linguistic category priming hypothesis, in other words.

Are these findings the death blow (to use the authors’ term) to the notion of linguistic category priming? IJzerman and his colleagues don’t think so. In perhaps a surprise twist, they conclude:

[I]t remains to be seen whether the effect we have investigated does not exist, or whether it depends on identifying the right contexts and measurements for the linguistic category priming effects among Western samples.

My own conclusions are the following.

  1. Replications of findings proven to be fraudulent are important. Without replications, the status of the hypotheses remains unclear. After all, the findings were previously deemed publishable by peer reviewers, presumably based in part on theoretical considerations. Without relevant empirical data, the area of research will remain tainted and researchers will steer clear from it. While this may not be bad in some cases, it might be bad in others.
  2. The Pottery Barn rule should hold in scientific publishing: you break it, you buy it. If you published fraudulent findings, you should also publish their nonreplications. Many journals do not adhere to this rule. Sander Koole informed me that the Journal of Social and Personality Psychology (JPSP) congratulated IJzerman and colleagues on their replication attempts but rejected their manuscript nonetheless, even though they had previously published the Stapel and Semin paper. It is a good thing the editors at Acta Psychologica have taken a more progressive stance on publishing failed replications.***
  3. It is a good sign that the climate for the publications failed replications is improving somewhat. Dylan’s right, the times are a-changin'. I am glad that the authors persevered and that their work is seeing the light of day.

*     I thank Hans IJzerman and Sander Koole for feedback on a previous version of this post. 
**   Semin was the doctoral advisor of both IJzerman and Regenberg and was initially involved in the replication attempts but let his former students use the data.
*** Until January 2014 I was Editor-in-Chief at Acta Psychologica. I was not involved in the handing of the IJzerman et al. paper and am therefore not patting myself on the back.