My previous post was about the why of replication studies. This one is about my first foray into the replication business: my first venture outside the file drawer, where several nonreplications of other people's work reside, along with studies of my own that were never submitted because we could not replicate our own initial finding. I'm coming out of the file drawer, so to speak.
I’m not going to discuss the contents of the study here. I’m just going to talk about a couple of things my co-author, Diane Pecher, and I learned from our replication efforts.
I’ve got the power!
Psychology experiments are chronically underpowered. Simmons, Nelson, and Simonsohn suggest you need at least 20 subjects per condition, which is more than many psychology experiments have. At a recent symposium, a statistician even said that to be informative, experiments should have at least 100 subjects; otherwise they are merely exploratory (I'm paraphrasing). I have heard people scoff at these suggestions (they may not be feasible for studies using special populations, and not necessary for psychophysics experiments), but whatever the right number is, it is true that Ns are too small in the vast majority of psychology experiments, including my own.
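To get a feel for why 20 per condition is thin, here is a quick power calculation (my own illustration, not from the Simmons et al. paper), using the standard noncentral-t formula for a two-sided, two-sample t-test:

```python
# Power of a two-sided, two-sample t-test, computed from the noncentral
# t distribution. A standard textbook calculation, sketched for illustration.
import math
from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    """Power to detect standardized effect size d with n subjects per group."""
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # P(|T| > t_crit) when the true effect is d
    return (1 - stats.nct.cdf(t_crit, df, ncp)
            + stats.nct.cdf(-t_crit, df, ncp))

# A "medium" effect (d = 0.5) with 20 subjects per condition:
print(ttest_power(0.5, 20))   # well under the conventional 0.80
print(ttest_power(0.5, 100))  # comfortably above 0.90 with 100 per condition
```

With 20 subjects per condition you have roughly a one-in-three chance of detecting a medium-sized effect, which is why the 100-subject suggestion is less outlandish than it sounds.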
Keeping false positives at Bayes
With a large N, classic null-hypothesis significance tests become oversensitive: an inconsequential difference might show up as statistically significant. An alternative is to compute the Bayes factor, a likelihood ratio that allows you to assess the strength of the evidence for the alternative hypothesis versus the null hypothesis (or the other way around). To be conclusive, the Bayes factor requires more evidence for the alternative hypothesis at larger samples than does, for example, a t-test, but it also allows you to determine whether a small effect is consequential. Bayes factors can easily be computed using Jeffrey Rouder's web site at the University of Missouri. You just put in a t-value and the sample size and it will return the Bayes factor (actually three of them; we used the JZS Bayes factor).
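If you would rather compute it locally than through the web site, the JZS Bayes factor for a one-sample (or paired) t-test comes down to a one-dimensional integral, published in Rouder and colleagues' 2009 paper. Here is a sketch of that formula in Python (my own implementation of the published integral, not the code behind Rouder's site; for real analyses, check your numbers against the site or an established package):

```python
# JZS Bayes factor for a one-sample (or paired) t-test, following the
# integral in Rouder et al. (2009). Illustration only.
import math
from scipy.integrate import quad

def jzs_bf10(t, n, r=math.sqrt(2) / 2):
    """BF10 from a t-value and sample size n.

    r is the scale of the Cauchy prior on effect size (default ~0.707).
    BF10 > 1 favors the alternative; BF10 < 1 favors the null.
    """
    nu = n - 1
    # Marginal likelihood under H0 (effect size fixed at 0), up to a
    # constant shared with the H1 term below
    h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Under H1, average over g with an inverse-gamma(1/2, 1/2) prior,
    # which induces a Cauchy prior on the effect size
    def integrand(g):
        a = 1 + n * g * r**2
        return (a ** -0.5
                * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
                * (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g)))

    h1, _ = quad(integrand, 0, math.inf)
    return h1 / h0

print(jzs_bf10(t=3.0, n=50))  # a healthy t: evidence for H1 (BF10 > 1)
print(jzs_bf10(t=0.5, n=50))  # a small t: evidence for the null (BF10 < 1)
```

Note the asymmetry with p-values: a t of 0.5 doesn't merely fail to reject the null, it yields positive evidence for it.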
Unlike standard hypothesis-testing statistics, Bayesian statistics don't force you to define your sampling plan ahead of time. According to a very insightful paper by Wagenmakers and colleagues (in a must-read special issue of Perspectives on Psychological Science), you can continue collecting data until the Bayes factor seems to stabilize (I must admit the article is a bit hazy on this part, or maybe I am). In our case this meant that we could compute a combined Bayes factor over two experiments that were essentially identical, which gave us even more power. This move was suggested to us by Eric-Jan Wagenmakers, an expert in Bayesian statistics (which I am most definitely not).
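To make the "keep sampling until the evidence is compelling" idea concrete, here is a toy simulation. To keep it short it uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the full JZS integral, and the stopping thresholds (10 and 1/10) and batch size are illustrative choices of mine, not a recommendation from the paper:

```python
# Sequential Bayes-factor monitoring (toy simulation). Uses the BIC
# approximation to the Bayes factor for a one-sample t-test; thresholds
# and batch size are illustrative, not prescriptive.
import math
import random
import statistics

def bf10_bic(sample):
    """Approximate BF10 for a one-sample t-test via the BIC approximation."""
    n = len(sample)
    t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(n))
    bf01 = math.sqrt(n) * (1 + t**2 / (n - 1)) ** (-n / 2)
    return 1 / bf01

random.seed(1)
data = []
# Simulate a true effect of 0.5 SD; check the evidence after every
# 10 new subjects and stop once it is compelling either way
while True:
    data.extend(random.gauss(0.5, 1.0) for _ in range(10))
    bf = bf10_bic(data)
    if bf > 10 or bf < 1 / 10 or len(data) >= 500:
        break
print(len(data), bf)  # stop as soon as the Bayes factor crosses a threshold
```

Under optional stopping a p-value would be invalidated, but the Bayes factor retains its evidential interpretation, which is what makes this sampling plan defensible.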
Two heads are better than one
Armed with our large samples and Bayes factors, we were ready to analyze the data. And here we did something that I think is highly unusual in psychological research. We each performed our own analysis of the data and then compared our results. We were humbled to see that on several occasions we didn’t get the same outcome. True, we weren’t far apart and the differences were inconsequential and easy to resolve, but it taught us a good lesson. It is important to have multiple people analyze the data—an error is easily made (my bet is that the literature is replete with them). The files that I created to analyze the data (which include the raw data) can be found here.
Taking the experimenter and the lab out of the loop
One big advantage of on-line experiments is that there is no experimenter involved, so there cannot be any experimenter effects. Whatever results you obtain, they cannot be caused by the professional demeanor, friendly attitude, white lab coat, or short skirt of your research assistant.
There is another advantage. Turkers don’t go to the lab to participate in experiments. They might be at home on the couch, in the office pretending to do their regular job, on the train, in the airport, or in a coffee shop (though preferably not in what the Dutch call a coffee shop). We ask subjects about their environment and the noise level in it, and they generally tell us that they work in quiet environments. We tend to believe them because they are highly conscientious subjects. They often provide thoughtful feedback on our experiments.
But how can lack of control be an advantage? It is an advantage in terms of reproducibility. Evidently, results like ours were not caused by the academic setting of the experiment, the color of the walls in the experiment room, the close confines of a cubicle (though some Turkers probably operate from cubicles), the red light on the door of the experiment room, and so on.
This means that replication attempts of on-line studies are relatively straightforward. For example, if anyone wants to replicate us, they can get our data-collection programs (contact me, as I still need to post them online), create a link to them on Mechanical Turk, and with a couple hundred dollars they're in business. They will have their data within a day or so.
But what did you find and what does it mean?