Thursday, May 7, 2015

p=.20, what now? Adventures of the Good Ship DataPoint

You’ve dutifully conducted a power analysis, determined your sample size, and run your experiment. Alas, p=.20. What now? Let’s find out.

The Good Ship DataPoint*
Perspectives on Psychological Science’s first registered replication project, RRR1, targeted verbal overshadowing: the phenomenon that describing a visual stimulus, in this case a human face, is detrimental to later recognition of that face compared to not describing it. A meta-analysis of 31 direct replications of the original finding provided evidence of verbal overshadowing: subjects who described the suspect were 16% less likely to make a correct identification than subjects who performed a filler task.

One of my students wanted to extend (or conceptually replicate) the verbal overshadowing effect for her master’s thesis by using different stimuli and a different distractor task. I’m not going to talk about the contents of the research here. I simply want to address the question posed in the title of this post: p=.20, what now? Because p=.20 is what we found after running 148 subjects, obtaining a verbal overshadowing effect of 9% rather than RRR1’s 16%.**

Option 1. The effect is not significant, so this conceptual replication “did not work”; let’s file-drawer the sucker. This response is probably still very common, but it contributes to publication bias.

Option 2. We treat this as a pilot study, perform a power analysis based on it, and run a new (and much larger) batch of subjects. The old data are now meaningless for hypothesis testing. This is better than option 1 but rather wasteful: why throw away a perfectly good data set?

Option 3. Our method wasn’t sensitive enough. Let’s improve it and then run a new study. Probably a very common response. But it may be premature and is not guaranteed to lead to a more decisive result. And you’re still throwing away the old data (see option 1).

Liverpool FC, victorious in the 2005 Champions League final 
in Istanbul after overcoming a 3-0 deficit against AC Milan
Option 4. The effect is not significant, but if we also report the Bayes factor, we can at least say something meaningful about the null hypothesis and maybe get it published. This seems to be becoming more common nowadays. It is not a bad idea as such, but it is likely to get misinterpreted (even by the researchers themselves) as: H0 is true. The Bayes factor tells us something about the support for one hypothesis relative to another given the data such as they are. And what the data are here is: too few. We found BF10=.21, which translates to about 5 times more evidence for H0 than for H1, but this is about as meaningful as the score in a soccer match after 30 minutes of play. Sure, H0 is ahead, but H1 might well score a come-from-behind victory. There are, after all, 60 more minutes to play!

Option 5. The effect is not significant, but we’ll keep on testing until it is. Simmons et al. have provided a memorable illustration of how problematic such optional stopping is. In his blog, Ryne Sherman describes a Monte Carlo simulation of p-hacking, showing that it can inflate the false positive rate from 5% to 20% (a toy simulation below, after the description of COAST, illustrates this inflation). Still, the intuition that it would be useful to test more subjects is a good one. And that leads us to…

Option 6. The result is ambiguous, so let’s continue testing, in a way that does not inflate the Type I error rate, until we have decisive information or we’ve run out of resources. Researchers have proposed several sequential testing procedures that do preserve the nominal error rate. Eric-Jan Wagenmakers and colleagues show how repeated testing can be performed in a Bayesian framework, and Daniël Lakens has described sequential testing as it is performed in the medical sciences. My main focus will be on a little-known method proposed in psychology by Frick (1998), which to date has been cited only 17 times in Google Scholar; I will report Bayes factors as well. The method described by Lakens could not be used in this case because it requires one to specify the number of looks a priori.

Frick’s method is called COAST (composite open adaptive sequential test). The idea is appealingly simple: if your p-value is >.01 and <.36, keep on testing until it crosses one of these limits.*** Frick’s simulations show that this procedure keeps the overall alpha level under .05. Given that after the first test our p was between the lower and upper limits, our Good Ship DataPoint was in deep waters. Therefore, we continued testing. We decided to add subjects in batches of 60 (barring exclusions), so as not to overshoot and yet make each addition substantial. If DataPoint failed to reach shore before we’d reached 500 subjects, we would abandon ship.
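To make the difference between options 5 and 6 concrete, here is a toy simulation of my own (not Sherman’s and not Frick’s): a two-group t-test on data generated under H0, with arbitrary starting sample, batch size, and resource cap, stopped either by naive peeking at p<.05 or by COAST’s two limits.

# Toy simulation under H0 (no true effect): naive optional stopping
# ("keep testing until p<.05") versus COAST ("keep testing until p<.01 or p>.36").
# The starting sample, batch size, and resource cap are arbitrary illustrative choices.
set.seed(123)

run_study <- function(rule, n_start = 40, n_step = 20, n_max = 300) {
  a <- rnorm(n_start)
  b <- rnorm(n_start)                                  # same population, so H0 is true
  repeat {
    p <- t.test(a, b)$p.value
    if (rule == "naive") {
      if (p < .05) return(TRUE)                        # declare victory at the first p<.05
      if (length(a) >= n_max) return(FALSE)            # out of resources
    } else {                                           # COAST
      if (p < .01) return(TRUE)                        # crossed the lower limit: reject H0
      if (p > .36 || length(a) >= n_max) return(FALSE) # upper limit (or out of resources): retain H0
    }
    a <- c(a, rnorm(n_step))                           # add another batch to each group
    b <- c(b, rnorm(n_step))
  }
}

mean(replicate(5000, run_study("naive")))              # false positive rate well above .05
mean(replicate(5000, run_study("coast")))              # stays in the vicinity of .05 or below

The exact numbers depend on the batch size and the cap, but the qualitative pattern is the one Frick reports: naive peeking inflates the false positive rate substantially, whereas COAST keeps it near the nominal .05.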

Voyage of the Good Ship DataPoint on the Rectangular Sea of Probability

Batch 2: Ntotal=202, p=.047. People who use optional stopping would stop here and declare victory: p<.05! (Of course, they wouldn’t mention that they’d already peeked.) We’re using COAST, however, and although the Good Ship DataPoint is in the shallows of the Rectangular Sea of Probability, it has not reached the coast. And BF10=0.6, still leaning toward H0.

Batch 3: Ntotal=258, p=.013, BF10=1.95. We’re getting encouraging reports from the crow’s nest. The DataPoint crew will likely not succumb to scurvy after all! And the BF10 now favors H1.

Batch 4: Ntotal=306, p=.058, BF10=.40. What’s this??? The wind has taken a treacherous turn and we’ve drifted away from shore. Rations are getting low; mutiny looms. And if that wasn’t bad enough, BF10 is <1 again. Discouraged but not defeated, DataPoint sails on.

Batch 5: Ntotal=359, p=.016, BF10=1.10. Heading in the right direction again.

Batch 6: Ntotal=421, p=.015, BF10=1.17. Barely closer. Will we reach shore before we all die? We have to ration the food.

Batch 7: Ntotal=479, p=.003, BF10=4.11. Made it! Just before supplies ran out and the captain would have been keelhauled. The taverns will be busy tonight.
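For anyone who wants to keep the same kind of logbook: the per-batch bookkeeping is nothing more than rerunning the two tests on all data collected so far after each batch and applying the COAST limits. Here is a minimal sketch, assuming a hypothetical data frame d with one row per subject and factor columns condition and response (the BayesFactor settings are the ones listed in the second footnote); this is an illustration of the workflow, not our actual analysis script.

# Rerun the chi-square test and the contingency-table Bayes factor on the
# accumulated data after each batch, then apply the COAST limits.
# The data frame d is hypothetical; condition and response are factors.
library(BayesFactor)

check_batch <- function(d) {
  tab  <- table(d$condition, d$response)
  p    <- chisq.test(tab)$p.value
  bf10 <- extractBF(contingencyTableBF(tab,
                                       sampleType = "indepMulti",
                                       fixedMargin = "rows",
                                       priorConcentration = 1))$bf
  decision <- if (p < .01) "stop: reject H0" else if (p > .36) "stop: retain H0" else "keep sailing"
  data.frame(N = nrow(d), p = p, BF10 = bf10, decision = decision)
}

Each of the batch reports above is essentially the output of one such call on the data available at that point.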

Some lessons from this nautical exercise:

(1) More data=better.

(2) We have now successfully extended the verbal overshadowing effect, although our effect was smaller than RRR1’s 16%: 9% after the first 148 subjects and 10% at the end of the experiment.

(3) Although COAST gave us an exit strategy, BF10=4.11 is encouraging but not very strong. And who knows if it will hold up? Up to this point it has been quite volatile.

(4) Our use of COAST worked because we were using Mechanical Turk. Adding batches of 60 subjects would be impractical in the lab.

(5) Using COAST is simple and straightforward. It preserves an overall alpha level of .05. I prefer to use it in conjunction with Bayes factors.

(6) It is puzzling that methodological solutions to a lot of our problems are right there in the psychological literature, yet so few people are aware of them.

Coda

In this post, I have focused on the application of COAST and, for didactic purposes, largely ignored the fact that this study was a conceptual replication. More about that in the next post.



Footnotes

Acknowledgements: I thank Samantha Bouwmeester, Peter Verkoeijen, and Anita Eerland for helpful comments on an earlier version of this post. They don't necessarily agree with me on all of the points raised in the post.
*Starring in the role of DataPoint is the Batavia, a replica of a 17th-century Dutch East Indies ship, well worth a visit.
** The original study, Schooler and Engstler-Schooler (1990), had a sample of 37 subjects; the RRR1 studies typically had 50-80 subjects. We used chi-square tests to compute p-values. Unlike the replication studies, we did not collapse the conditions in which subjects made a false identification and in which they claimed the suspect was not in the lineup, because we considered these two different kinds of responses (the sketch below illustrates the two tabulations). Separating false alarms from misses did, however, preclude us from using one-sided tests. I computed Bayes factors with the BayesFactor package in R, using the contingencyTableBF function with sampleType = "indepMulti", fixedMargin = "rows", and priorConcentration = 1.
*** For this to work, you need to decide a priori to use COAST. This means, for example, that when your p-value is >.01 and <.05 after the first batch, you need to continue testing rather than conclude that you've obtained a significant effect.
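Finally, to make the tabulation issue in the second footnote concrete, here is what the two ways of tallying lineup responses look like. The counts are made up purely for illustration.

# Hypothetical counts, purely to illustrate the two tabulations.
full <- matrix(c(40, 20, 14,    # describe: correct, false ID, "not present"
                 48, 16, 10),   # control:  correct, false ID, "not present"
               nrow = 2, byrow = TRUE,
               dimnames = list(c("describe", "control"),
                               c("correct", "false_id", "not_present")))

# RRR1-style analysis: lump false IDs and "not present" responses together
collapsed <- cbind(correct   = full[, "correct"],
                   incorrect = full[, "false_id"] + full[, "not_present"])

chisq.test(full)       # 2 x 3 test: false alarms and misses kept apart (our analysis)
chisq.test(collapsed)  # 2 x 2 test, as in the replication studies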