My previous post was prompted by a new paper by Andrew Gelman and Eric Loken (GL), but it did not discuss its main thrust because I had planned to defer that discussion to the present post. However, several comments on the previous post (by Chris Chambers and Andrew Gelman himself) leapt ahead of the game, so there already is an entire discussion about the topic of our story here in the comment section of the previous post. But I'm putting the pedal to the metal to come out in front again.
Simply put, GL's basic claim is that researchers often unknowingly create false positives. Or, in their words: "it is possible to have multiple potential comparisons, in the sense of a data analysis whose details are highly contingent on data, without the researcher performing any conscious procedure of fishing or examining multiple p-values."
[Image: my copy of the Dutch translation]
Here is one way in which this might work. Suppose we have a hypothesis that two groups differ from each other and we have two dependent measures. What constitutes evidence for our hypothesis? If the hypothesis is not more specific than that, we could be tempted to interpret a main effect as evidence for the hypothesis. If we find an interaction with the two groups differing on only one of the two measures, we would also count that as evidence. So we actually have three bites at the apple (a main effect, a difference on only the first measure, or a difference on only the second), but we're working under the assumption that we only have one. And this is all because our hypothesis was rather unspecific.
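To make this concrete, here is a minimal simulation sketch of the three-bites scenario. It is my own illustration, not code from the GL paper, and the particulars (two groups of 30, measures correlated at 0.3, a t-test on the average of the two measures as a stand-in for the main effect) are assumptions chosen for simplicity. Under a true null, counting any of the three patterns as evidence pushes the false-positive rate well above the nominal 5%:

```python
# Simulation of the "three bites at the apple" under a true null effect.
# Assumed setup: two groups, two correlated dependent measures, no real
# group difference anywhere. We declare "evidence" if any of three tests
# reaches p < .05: the main effect, or a difference on either measure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2014)
n_sims, n_per_group = 10_000, 30
cov = [[1.0, 0.3], [0.3, 1.0]]  # moderately correlated measures (assumed)
hits = 0

for _ in range(n_sims):
    g1 = rng.multivariate_normal([0, 0], cov, size=n_per_group)
    g2 = rng.multivariate_normal([0, 0], cov, size=n_per_group)

    # Bite 1: "main effect" -- groups differ on the average of the measures.
    p_main = stats.ttest_ind(g1.mean(axis=1), g2.mean(axis=1)).pvalue
    # Bites 2 and 3: groups differ on measure 1, or on measure 2.
    p_m1 = stats.ttest_ind(g1[:, 0], g2[:, 0]).pvalue
    p_m2 = stats.ttest_ind(g1[:, 1], g2[:, 1]).pvalue

    if min(p_main, p_m1, p_m2) < 0.05:
        hits += 1

print(f"False-positive rate: {hits / n_sims:.3f}")  # roughly 0.10-0.12, not 0.05
```

The researcher running any one of these analyses would see a single p-value below .05 and, with an unspecific hypothesis, no obvious sign that anything is amiss.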
GL characterize the problem succinctly: there is "a one-to-many mapping from scientific to statistical hypotheses." I would venture to guess that this form of inadvertent p-hacking is extremely common in psychology, perhaps especially in applied areas, where the research is less theory-driven than in basic research. The researchers may not be deliberately p-hacking, but they're increasing the incidence of false positives nonetheless.
In his comment on the previous post, Chris Chambers argues that this constitutes a form of HARKing (Hypothesizing After the Results are Known). This is true. However, it is a very subtle form of HARKing. The researcher isn't thinking "Well, I really didn't predict this, but Daryl Bem (pp. 2-3) has told me that I need to go on a fishing expedition in the data, so I'll make it look like I'd predicted this pattern all along." The researcher is simply working with a hypothesis that is consistent with several potential patterns in the data.
GL note that articles they had previously characterized as products of fishing expeditions might actually have a more innocuous explanation: inadvertent p-hacking. In the comments on my previous post, Chris Chambers took issue with this conclusion, arguing that GL looked at the study, and the field in general, through rose-tinted glasses.
The point of my previous post was that, on the basis of a single study, we often cannot reverse-engineer from the published results the processes that generated them. We cannot know for sure whether the authors of the studies initially accused by Gelman of having gone on a fishing expedition really cast out their nets or whether they arrived at their results in the innocuous way GL describe in their paper, although GL now assume it was the latter. Chris Chambers may be right that this picture is on the rosy side. My point, however, is that we cannot know given the information provided to us. There often simply aren't enough constraints to make inferences about the procedures that led to the results of a single study.
However, I take something different from the GL paper. Even though we cannot know for sure whether a particular set of published results was the product of deliberate or inadvertent p-hacking, it seems extremely likely that, overall, many researchers fall prey to inadvertent p-hacking. This is a source of false positives that we as researchers, reviewers, editors, and post-publication reviewers need to guard against. Even if researchers are on their best behavior, they may still produce false positives. GL suggest a remedy for the problem, namely pre-registration, but point out that it may not always be an option in applied research. It is, however, an option in experimental research.
GL have very aptly named their article after a story by the Argentine writer Jorge Luis Borges (who happens to be one of my favorite authors): The Garden of Forking Paths. As is characteristic of Borges, the story contains the description of another story. The embedded story describes a world where an event does not lead to a single outcome; rather, all of its possible outcomes materialize at the same time. The events then multiply at an alarming rate, as each new event spawns a plethora of others.
I found myself in a kind of garden of forking paths when my previous post drew both responses to that post itself and responses I had anticipated would only come after this one. I'm not sure it will be as easy for the field to escape from the garden as it was for me here, but we should definitely try.