In my previous post I described our replication attempt of Experiment 1 from Vohs and Schooler (2008). They found large effects of a manipulation of belief in free will (via the reading of passages) on people’s reported belief in free will and on subsequent cheating behavior. We tried to replicate these findings using Mechanical Turk but obtained null results.
What might account for the stark differences between our findings and those of V&S? And, in the spirit of the educational roots of this project, what lessons can we learn from this attempt at replication?
One obvious difference between our findings and those of V&S is in subject populations. Our subjects had an average age of 33 (range 18-69) and were native speakers of English residing in the US (75 males and 77 females). The distribution of education levels was as follows: high school (13%), college no-degree (33%), associate’s degree (13%), bachelor (33%), and master’s/PhD (8%).
How about the subjects in the original study? V&S used… 30 undergraduates (13 females, 17 males); that’s all it says in the paper. Kathleen Vohs informed us via email that the subjects were undergraduates at the University of Utah. Specifically, they were smart, devoted adults about half of whom were active in the Mormon Church. One would think that it is not too trivial to mention in the paper. After all, free will is not unimportant to Mormons, as is shown here and here. It is quite true that Psychological Science imposes rather stringent word limits but still…
Lesson 1: Critical information should not be omitted from method sections. (This sounds like flogging a dead horse, but try to replicate a study and you’ll see how much information is often missing.)
So there clearly is a difference between our subject population and that of the original experiment. We did not ask about religious affiliation (we did not know this was important, as it was not mentioned in the original paper), but I doubt that we are going to find 30 Mormons, 15 among them active in the Mormon Church, in our sample.
What we can do, however, is match our sample in terms of age (this is also not specified in the original article, but let’s assume late teens to mid-twenties) and level of education. In an analysis of 30 subjects meeting these criteria, we found no significant effects on either the manipulation check or cheating behavior.
So differences in age and level of education from the original sample do not seem to account for our null findings. We cannot be sure, however, whether membership in the Mormon Church plays a role.
Another big difference between our experiment and the original is that our experiment was conducted online and the original in the lab. It has been demonstrated that many classical findings in the psychological literature can be replicated using online experiments (e.g., here) but this doesn’t mean online experiments are suitable for any task.
An obvious issue is that an online study cannot control the environment. To get some idea about the subjects' environment we always ask them to indicate on a 9-point scale the amount of noise in their environment, with 1 being no noise and no distractions and 9 noise and many distractions. The average score was 1.6 on this scale. The majority of subjects (73%) indicated that they were in a quiet environment with no distractions. An additional 11% indicated they were in a quiet environment with some distractions. Very few people indicated being in a noisy environment with distractions. Of course, these are self-report measures but they do suggest that environmental distractions are not a factor.
Perhaps subjects did not read the manipulation-inducing passages. There is no information on this in the original study, but we measured reading times. The average reading time for the passages was 380 ms/word, which is quite normal for texts of this type. There were a few subjects with unusually short reading times. Eliminating their data did not change the results. So from what we can tell, the subjects read the texts and did not click through them. In fact, it would have been even better (for both the original study and the replication attempt) to also have comprehension questions about the passages at the end of the experiment.
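The reading-time check described above can be sketched roughly as follows. This is only an illustration, not our actual analysis script: the cutoff of 150 ms/word, the field names, and the example records are all hypothetical, since the post does not specify what counted as "unusually short."

```python
# Hypothetical cutoff; the post does not state what counted as "unusually short".
MIN_MS_PER_WORD = 150


def ms_per_word(total_reading_ms, word_count):
    """Return the reading time per word in milliseconds."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return total_reading_ms / word_count


def split_by_reading_time(subjects, word_count, threshold=MIN_MS_PER_WORD):
    """Split subject records into (kept, flagged) by reading speed.

    Subjects reading faster than `threshold` ms/word are flagged so the
    main analysis can be rerun with and without them.
    """
    kept, flagged = [], []
    for subject in subjects:
        rate = ms_per_word(subject["reading_ms"], word_count)
        record = {**subject, "ms_per_word": rate}
        if rate >= threshold:
            kept.append(record)
        else:
            flagged.append(record)
    return kept, flagged


# Made-up records for illustration only; these are not data from the study.
example = [
    {"id": "s1", "reading_ms": 190_000},
    {"id": "s2", "reading_ms": 20_000},
    {"id": "s3", "reading_ms": 210_000},
]
kept, flagged = split_by_reading_time(example, word_count=500)
```

With these made-up numbers, the second record works out to 40 ms/word and is flagged, while the other two are kept. The point of the split is that the analysis is run twice, once on everyone and once on the kept subjects only, to see whether the conclusion changes.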
Lesson 2: Gather as much information about your manipulation-inducing stimuli as possible.
Another potential problem, which was pointed out by a commenter on the previous post, is that some subjects on Mechanical Turk, “Turkers,” may already have participated in similar experiments and thus not be naïve to the manipulation (see here for a highly informative paper on this topic).
We always ask subjects about their perceptions of the manipulation and this experiment is no exception. We coded a perception as “aware of the manipulation” if it mentioned “honesty,” “integrity,” “pressing the space bar,” “looking at the answer,” “following instructions,” or something similar. We coded someone as “unaware” if they explicitly stated that they had no idea or if they mentioned a different purpose of the experiment. Some examples are: (1) The study was about judgments and quickness, (2) Deterioration of short term memory, and (3) How quickly people can solve math problems.
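To make the coding criteria concrete, here is a minimal first-pass sketch. It is an assumption-laden simplification: our coding was done by reading the responses, and the "or something similar" clause requires human judgment that a keyword match cannot supply. The function name and the "uncoded" label for empty responses are hypothetical.

```python
# Phrases quoted in the post. "Something similar" was a judgment call in the
# actual coding, so this list is only a first-pass approximation.
AWARE_PHRASES = [
    "honesty",
    "integrity",
    "pressing the space bar",
    "looking at the answer",
    "following instructions",
]


def code_response(text):
    """Assign a first-pass awareness code to a free-text response.

    Returns "aware" if any listed phrase occurs, "uncoded" for an empty
    response, and "unaware" otherwise (no idea, or a different purpose).
    """
    cleaned = text.strip().lower()
    if not cleaned:
        return "uncoded"
    if any(phrase in cleaned for phrase in AWARE_PHRASES):
        return "aware"
    return "unaware"


# The three "unaware" examples from the post, plus one invented "aware" response.
responses = [
    "The study was about judgments and quickness",
    "Deterioration of short term memory",
    "How quickly people can solve math problems",
    "I think it was testing honesty",
]
codes = [code_response(r) for r in responses]
```

A pass like this is best treated as a way to pre-sort responses; borderline cases still need to be read by a person.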
According to these criteria, about half the subjects were “aware” of the manipulation. We performed a separate follow-up analysis on the “unaware” subjects. There still was no effect of the manipulation on the amount of cheating. We did find a slightly higher number of incidences of cheating among the “aware” subjects than among the “unaware” subjects. All in all, though, the level of cheating was much lower than in the original study.
So does awareness of the manipulation explain our null findings? I don’t think so. Some commenters on the previous post decried our study for having so many “aware” subjects. They should realize that we don't even know if all 30 subjects in the original study believed the cover story; there is no information on this in the article.
Lesson 3: Always ask subjects about their perceptions of the purpose of the experiment.
I find it hard to believe that the subjects in the original experiment all bought the cover story. Unlike in our experiment, the original study has no information on how many people disbelieved the cover story. Some commenters have suggested that it is easier to convince people of the cover story if you have an actual experimenter. This seems plausible although it still doesn't seem likely to me that everyone would have believed the story. And of course it would be an awful case of circular reasoning to say that the subjects must have believed the manipulation simply because there was a large effect.
But there is a bigger point. If the large effect reported in the original study hinges on the acting skills of the experimenter, then there should be information on this in the paper. The article merely states that the subjects were told of the glitch. We incorporated what the students were told in our instructions. But if it is not the contents of what they were told that is responsible for the effect but rather the manner in which it is told, then there should be information on this. Did the experimenter act distraught, confused, embarrassed, or neutral? And was this performance believable and delivered with any consistency? If the effect hinges on the acting skills of an experimenter, experimentation becomes an art and not a science. In addition to voodoo statistics, we would have voodoo experimentation. (A reader of this post pointed me to this highly relevant article on the ubiquity of voodoo practices in psychological research.)
It should be obvious, but I’d like to state it explicitly anyway: I’m not saying that V&S performed voodoo experimentation. I am just saying that if the claim is that the effect relies on factors that are not (or cannot be) articulated and documented, and I’ve heard people (not V&S) make this claim, then we have voodoo experimentation.
Lesson 4: Beware of Voodoo Experimentation
It is striking that we were not even able to replicate the manipulation check that V&S used. I was told by another researcher (who is also performing a replication of the V&S experiment) that the reliability of the original manipulation check is low (we had not thought to examine this, but we did use the updated version of this scale, the FAD-plus). I do not want to steal this researcher’s thunder, and so will not say anything more about this issue at this point (I will provide an update as soon as the evidence from that researcher's experiment is available). But the fact that we did not replicate the large effect on the manipulation check that was reported in the original study might not count as a strike against our replication attempt.
So where does this leave us? The fact that the large (!) effect of the original study completely evaporated in our experiment cannot be due to (1) the age or education levels of the subjects, (2) subjects not reading the manipulation-inducing passages (if reading times are any indicator), or (3) subjects’ awareness of the manipulation. The original paper provides no evidence regarding these issues.
The evaporation of the effect could, however, be due to (1) the special nature of the original sample, (2) the undocumented acting skills of a real-life experimenter (voodoo experimentation), or of course (3) the large effect being a false positive. I am leaning towards the third option, although I would not find a small effect implausible (in fact, that is what I was initially expecting to find).