Monday, August 7, 2017

Publishing an Unsuccessful Self-replication: Double-dipping or Correcting the Record?

Collabra: Psychology has a submission option called streamlined review. Authors can submit papers that were previously rejected by another journal for reasons other than a lack of scientific, methodological, or ethical rigor. Authors request permission from the original journal and then submit their revised manuscript with the original action letters and reviews. Editors like me then make a decision about the revised manuscript. This decision can be based on the ported reviews or we can solicit further reviews.

One recent streamlined submission had previously been rejected by an APA journal. It is a failed self-replication. In the original experiment, the authors had found that a certain form of semantic priming, forward priming, can be eliminated by working-memory load, which suggests that forward semantic priming is not automatic. This is informative because it contradicts theories of automatic semantic priming. When they tried to follow up on this work for a new paper, however, the researchers were unable to obtain this elimination effect in two experiments. Rather than relegating the study to the file drawer, they decided to submit it to the journal that had also published their first paper on the topic. Their submission was rejected. It is now out in Collabra: Psychology. The reviews can be found here.

[Side note: I recently conducted a little poll on Twitter asking whether or not journals should publish self-nonreplications. A staggering 97% of the respondents said journals should indeed publish self-nonreplications. However, if anything, this is evidence of the Twitter bubble I’m in. Reality is more recalcitrant.]

I thought the other journal’s reviews were thoughtful. Nevertheless, I reached a different conclusion from the original editor's. A major criticism in the reviews was the concern about “double-dipping”: if an author publishes a paper with a significant finding, it is unfair to let that same author then publish a paper that reports a nonsignificant finding, as this gives the researcher two bites at the apple.

I understand the point. What drives this perception of unfairness is our current incentive system.
People are (still) rewarded for the number of articles they publish, so letting someone first publish a finding and then a nonreplication of this finding is unfair. It is as if in football (the real football, where you use your feet to propel the ball) you get a point for scoring a goal and then an additional point for missing a shot from the same position.

However understandable, this idea loses its persuasive power once we take the scientific record into account. As scientists, we want to understand the world and lay a foundation for further research. It is therefore important to have good estimates of effect sizes and the confidence we should have in them. A nonreplication serves to correct the scientific record. It tells us that the effect is less robust than we initially thought. This is useful information for meta-analysts, who can now include both findings in their collection. And even more importantly, it is very useful for researchers who want to build on this research. They now know that the finding is less reliable than they previously thought. It might prevent them from wandering into a potential blind alley.

As with anything in science, allowing the publication of self-nonreplications opens the door to gaming the system. People could p-hack their way to a significant finding, publish it, and then fail to “replicate” the finding in a second paper. As an added bonus, the self-nonreplication will also give them the aura of earnest, self-critical, and ethical researchers. Moreover, the self-nonreplication pretty much inoculates the finding against “outside” replication efforts. Why try to replicate something that even the authors themselves could not replicate?

That’s not two, not three, but four birds with one stone! You might think that I’m making up the inoculation motive for dramatic effect. I’m not. A researcher I know actually suspects another researcher of using the inoculation strategy.

How worried should we be about the misuse of self-nonreplications? I’m not sure. One potential safeguard is to have the authors explain why they performed the replication. Did they think there was something wrong with the original finding or were they just trying to build on it and were surprised to discover they couldn’t reproduce the original finding? And if a researcher makes a habit of publishing self-nonreplications, I’m sure people would be on to them in no time and questions would be asked.

So I think we should publish self-nonreplications. (1) They help to make the scientific record more accurate. (2) They are likely to prevent other researchers from ending up in a cul-de-sac.

The concern about double-dipping is only a concern given our current incentive system, which is one more indication that this system is detrimental to good science. But that’s a topic for a different post.

Wednesday, July 26, 2017

Defending .05: It’s Not Enough to be Suggestive

Today another guest post. In this post, Fernanda Ferreira and John Henderson respond to the recent and instantly (in)famous multi-authored proposal to lower the level of statistical significance to .005. If you want to discuss this post, Twitter is the medium for you. The authors' handles are @fernandaedi and @JhendersonIMB.

Fernanda Ferreira
John M. Henderson

Department of Psychology and Center for Mind and Brain
University of California, Davis

The paper “Redefine Statistical Significance” (henceforth, the “.005 paper”), written by a consortium of 72 authors, has already made quite a splash even though it has yet to appear in Nature Human Behaviour. The call for a redefinition of statistical significance from .05 to .005 would have profound consequences across psychology, and it is not clear to us that the broad implications across the field have been thoroughly considered. As cognitive psychologists, we have major concerns about the advice and the rationale for this severe prescription.

In cognitive psychology we test theories motivated by a body of established findings, and the hypotheses we test are derived from those theories. It is therefore rarely the case that all experimental outcomes are treated as equally likely. Our field is not effects-driven: we’re in the business of building and testing functional theories of the mind and brain, and effects are always connected back to those theories.

Standard practice in our subfield of psychology has always been based on replication. This has been extensively discussed in the literature and in social media, but it seems helpful to repeat the point: All of us were trained to design and conduct a theoretically motivated experiment, then design and conduct follow-ups that replicate and extend the theoretically important findings, often using converging operations to show that the patterns are robust across measures. This is why the stereotype emerged that cognitive psychology papers were typically three experiments and a model, where “model” is the subpart of the theory tested and elaborated in this piece of research.

Standard practice is also to motivate new research projects from theory and existing literature; the idea for a study doesn’t come out of the blue. And the first step when starting a new project is to make sure the finding or phenomenon to be built upon replicates. Then the investigator goes on to tweak it, play with it, push it, etc., all in response to refined hypotheses and predictions that fall out of the theory under investigation.*

Now, at this point, even if you agree with us, you might be thinking, “Well what would be the harm in going to a more conservative statistical criterion? Requiring .005 would only have benefits, because then we guard against Type I error and we avoid cluttering up the literature with non-results.” Unfortunately, as many have pointed out in informal discussions concerning the .005 paper, and as the .005 paper acknowledges as well, there are tradeoffs.

First, if you do research on captive undergraduates or you use M-Turk samples, then Ns in the hundreds might be no big deal. In the article, the authors estimate that a shift to .005 will necessitate at least a 70% increase in sample sizes, and they suggest this is not too high a price to pay. But this estimate is for simple effects, and we’re rarely looking for simple effects. In our business it’s all about statistical interactions, and for those, this recommendation can lead to much larger increases in sample size. And if your field requires you to test non-convenience samples such as heritage language learners, or people with a neuropsychological condition such as aphasia, or people with autism, dyslexia, or ADHD, or even just typically developing children, then these Ns might be unattainable. And yet the research might be theoretically and practically important. Testing such participants is costly: it requires trained research assistants and real money to pay the subjects and the staff doing the testing, and that money is almost always going to come from research grants. And we all know what the situation is with respect to research funding: there’s very little of it. But even if you had the money, and you didn’t care that it came at the expense of the funding of maybe some other scientist’s project, where would you find the large numbers of people that this shift in alpha level would require? What this means in practice is that some potentially important research will not get done.
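For what it’s worth, the roughly 70% figure can be reproduced with a textbook normal-approximation power calculation. This is a rough sketch, not the .005 paper’s exact method; the effect size of 0.5 and power of .80 are purely illustrative assumptions:

```python
from statistics import NormalDist

def n_per_group(alpha, power, d):
    """Approximate per-group N for a two-sided, two-sample test
    detecting standardized effect size d (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return 2 * ((z_alpha + z_beta) / d) ** 2

# d = 0.5 is just an illustrative medium-sized effect
n_05 = n_per_group(0.05, 0.80, 0.5)    # roughly 63 per group
n_005 = n_per_group(0.005, 0.80, 0.5)  # roughly 107 per group
increase = 100 * (n_005 / n_05 - 1)    # roughly a 70% increase
```

Note that the percentage increase is the same whatever effect size you plug in, because it depends only on the ratio of the squared critical values; what changes with the effect size is the absolute N, which is exactly where interactions and hard-to-recruit samples bite.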

Let’s turn now to Type II error. The authors of the .005 piece, to their credit, discuss the tradeoff between settling for Type I versus Type II error, and they come down on the side that Type I is costlier. But this can’t be true as a blanket statement. Missing a potential effect because you’ve set the false positive rate so conservatively could have major implications for theory development as well as for practical interventions. A false positive is a thing that a researcher might follow up and discover to be illusory; but a false negative is not a thing and therefore is likely to be ignored and never followed up, which means that a potentially important discovery will be missed.

Some have noted that the negative reaction to the .005 article has been surprisingly strong. A response we’ve heard to the kinds of concerns we’ve expressed is that the advocates of the .005 paper are not urging .005 as a publication standard, but merely as the alpha level that permits the use of the word “significant” to describe results. However, it is easy to foresee a world in which (if these recommendations are adopted) editors and reviewers start demanding .005 for significance and use it as a publication standard. After all, the goal of the piece presumably isn’t just to fiddle with terminology.

We think the strong reaction against .005 is also in part because common practice in different areas of psychology is not well represented by those advocating for major changes to research practice, like the .005 proposal. Relatedly, we think it’s unfortunate that, today, in the popular media, one frequently sees references to “the crisis in psychology”, when those of us inside psychology know that the entire field is not in crisis. The response from these advocates might be to say that we’re in denial, but we’re not: as we outlined earlier, the approach to theory building, testing, replication, and cumulative evidence that’s standard in cognitive psychology (and other subareas of psychology) makes it unlikely that a cute but illusory effect will survive.

So our frustration is real. We would like to see the conversation in psychology about scientific integrity broadened to include our subfield, and many others.

*When we say these are standard practices in cognitive psychology, we don’t intend to imply that these practices are not standard in other areas; we’re simply talking about cognitive psychology because it’s the area with which we’re most familiar.

Tuesday, May 16, 2017

Sometimes You Can Step into the Same River Twice

A recurring theme in the replication debate is the argument that certain findings don’t replicate or cannot be expected to replicate because the context in which the replication is carried out differs from the one in which the original study was performed. This argument is usually made after a failed replication.

In most such cases, the original study did not provide a set of conditions under which the effect was predicted to hold, although the original paper often did make grandiose claims about the effect’s relevance to a variety of contexts including industry, politics, education, and beyond. If you fail to replicate this effect, it's a bit like you've just bought a car that was touted by the salesman as an "all-terrain vehicle," only to have the wheels come off as soon as you drive it off the lot.*

As this automotive analogy suggests, the field has two problems: many effects (1) do not replicate and (2) are grandiosely oversold. Dan Simons, Yuichi Shoda, and Steve Lindsay have recently made a proposal that provides a practical solution to the overselling problem: researchers need to include in their paper a statement that explicitly identifies and justifies the target populations for the reported findings, a constraints on generality (COG) statement. Researchers also need to state whether they think the results are specific to the stimuli that were used and to the time and location of the experiment. Requiring authors to be specific about the constraints on generality is a good idea. You probably wouldn't have bought the car if the salesman had told you its performance did not extend beyond the lot.

A converging idea is to systematically examine which contextual changes might impact which (types of) findings. Here is one example. We always assume that subjects are completely naïve with regard to an experiment, but how can we be sure? On the surface, this is primarily a problem that vexes online research using databases such as Mechanical Turk, which has forums on which subjects discuss experiments. But even with the good old lab experiment we cannot always be sure that our subjects are naïve to the experiment, especially when we try to replicate a famous experiment. If subjects are not blank slates with regard to an experiment, a variation in population has occurred relative to the original experiment. We've gone from sampling from a population of completely naïve subjects to sampling from one with an unknown percentage of repeat subjects.

Jesse Chandler and colleagues recently examined whether prior participation in experiments affects effect sizes. They tested subjects on a number of behavioral economics tasks (such as sunk cost and anchoring-and-adjustment) and then retested these same individuals a few days later. Chandler et al. found an estimated 25% reduction in effect size, suggesting that the subjects’ prior experience with the experiments did indeed affect their performance in the second wave. A typical characteristic of these tasks is that they require reasoning, which is a controlled process. How about tasks that tap more into automatic processing?

To examine this question, my colleagues and I examined nine well-known effects in cognitive psychology: three from the domain of perception/action, three from memory, and three from language. We tested our subjects in two waves, the second wave three days after the first. In addition, we used either the exact same stimulus set or a different set (with the same characteristics, of course).

As we expected, all effects replicated easily in an online environment. More importantly, in contrast to Chandler and colleagues' findings, repeated participation did not lead to a reduction in effect size in our experiments. Nor did it make a difference whether the exact same stimuli were used or a different set.

Maybe you think that this is not a surprising set of findings. All I can say is that before running the experiments, our preregistered prediction was that we would obtain a small reduction of effect sizes (smaller than the 25% of Chandler et al.). So we, at least, were a little surprised to find no reduction.

A couple of questions are worth considering. First, do the results indicate that the initial participation left no impression whatsoever on the subjects? No, we cannot say this. In some of the response-time experiments, for example, we obtained faster responses in wave 2 than in wave 1. However, because responses also became less variable, the effect size did not change appreciably. A simple way to put it would be to say that the subjects became better at performing the task (as they perceived it) but remained equally sensitive to the manipulation. In other cases, such as the simple perception/action tasks, responses did not speed up, presumably because subjects were already performing at asymptote.

Second, how non-naïve were our subjects in wave 1? We have no guarantee that the subjects in wave 1 were completely naïve with regard to our experiments. What our data do show, though, is that the nine effects replicate in an online environment (wave 1) and that repeating the experiment a mere few days later (wave 2) by the same research group does not reduce the effect size.

So, in this sense, you can step into the same river twice.

* Automotive metaphors are popular in the replication debate, see also this opinion piece in Collabra: Psychology by Simine Vazire.


Monday, May 8, 2017

Concurrent Replication

I’m working on a paper with Alex Etz, Rich Lucas, and Brent Donnellan. We had to cut 2,000 words and the text below is one of the darlings we killed. I’m reviving it as a blog post here because even though it made sense to cut the segment from the manuscript (I cut it myself, the others didn’t make me), the notion of concurrent replication is an important one.

The current replication debate has, for various reasons, construed replication as a retrospective process. A research group decides to replicate a finding that is already in the published literature. Some of the most high-profile replication studies have focused on findings published decades earlier, for example the registered replication projects on verbal overshadowing (Alogna et al., 2014) and facial feedback (Wagenmakers et al., 2016). This retrospective approach, however timely and important, might be partially responsible for the controversial reputation that replication currently enjoys.
A form of replication that has not yet received much attention is what I will call concurrent replication. The basic idea is this. A research group formulates a hypothesis that they want to test. At the same time, they desire some reassurance about the reliability of the finding they expect to obtain. They decide to team up with another research group. They provide this group with a protocol for the experiment, the program and stimuli to run the experiment, and the code for the statistical analysis of the data. The experiment is preregistered. Both groups then run the experiment and analyze the data independently. The results of both studies are included in the article, along with a meta-analysis of the results. This is the simplest variant; a concurrent replication effort can involve more groups of researchers.
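To make the meta-analytic step concrete: in the simplest two-lab case it can be as light as inverse-variance (fixed-effect) pooling, with Cochran's Q as a basic heterogeneity check. Here is a minimal sketch; the effect estimates and standard errors are made up purely for illustration:

```python
from math import sqrt

def fixed_effect_meta(estimates, ses):
    """Inverse-variance (fixed-effect) pooling of study estimates,
    plus Cochran's Q as a rough heterogeneity check."""
    weights = [1 / se ** 2 for se in ses]  # weight = 1 / variance
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    # Q sums the weighted squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q

# Hypothetical numbers: the originating lab and the replicating lab
# report similar effect estimates with their standard errors.
pooled, pooled_se, q = fixed_effect_meta([0.45, 0.35], [0.12, 0.15])
```

With these hypothetical numbers the pooled estimate is about 0.41, and Q is far below the .05 critical value for one degree of freedom (3.84), so the two results would be reported as homogeneous.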
A direct exchange of experiments (a straight “study swap”) is the simplest model of concurrent replication. It is possible to accomplish such study swaps on a larger scale, where participating labs offer and request subject hours. This will likely result in a network of labs, each potentially engaged simultaneously in forming and testing novel hypotheses and in concurrently replicating hypotheses formed by other labs. The Open Science Framework features a site that was recently developed to facilitate concurrent replication, Study Swap; see also this article. At the time of this writing, there are four projects listed on Study Swap. We hope this number will increase soon.
Aside from this, there already are several large-scale concurrent replication efforts. An example is the Pipeline Project, a systematic effort to conduct prepublication replications, independently performed by separate labs. The first installment was recently published (Schweinsberg et al., 2016) and a second project is underway.
Concurrent replication has several advantages. First, researchers have a better sense of the reliability of their findings prior to publication. After all, the results have been independently replicated before submission of the article. Likewise, journal editors and reviewers will have more confidence in the findings reported in the manuscript they are asked to evaluate. Journals have the luxury of publishing findings that have already been independently replicated. As a result, the reproducibility of the findings in the literature will start to increase. The Schweinsberg et al. (2016) study demonstrates that concurrent replication is not only possible but also useful.
Concurrent replication forces researchers to be explicit about the procedure by which they expect to obtain the effect. If they do indeed obtain the finding both in the original study and in an independent replication, they have what amounts to a scientific finding according to the criteria established by Popper: They can describe a procedure by which the finding can reliably be produced. It will be easy and natural to include the protocol in the method section of the article. A positive side-effect of this will be a marked improvement in the quality of method sections in the literature. As a result, researchers who want to build on these findings have two advantages that researchers currently do not enjoy. First, they can build on a firmer foundation. After all, the reported finding has already been independently replicated. Second, a replication recipe doesn’t have to be laboriously reconstructed. It is readily available in the article.
Of course, concurrent replication is not without challenges. For instance, how should authorship be determined given such an arrangement? A flexible approach is best here. At one extreme the original group’s hypothesis might be very close to the replicating group’s own interest. In this case it would therefore be logical to make members of both groups co-authors; each group may have something to add to the paper both in terms of data and analysis and in terms of theory. At the other extreme, the second group has no direct interest in the hypothesis but may be willing to run a replication, perhaps in exchange for a replication of one of their own experiments. In this case it might be sufficient to acknowledge the other group’s involvement without offering co-authorship.
Thus far, the discussion here has only involved a scenario in which the hypothesis is supported in both the initiating and the replicating lab. However, other scenarios are also possible. A second scenario is one in which the hypothesis is supported in one of the labs but not in the other. If the meta-analysis shows heterogeneity among the findings, researchers might hypothesize about a potential difference between the experiments, preregister that hypothesis, and test it, again with a direct replication. If the meta-analysis does not show heterogeneity, it might be decided that it is sufficient to report the meta-analytic effect. If neither lab shows the effect, the research groups might report the results without engaging in follow-up studies. Alternatively, they might decide the experimental procedure was suboptimal, revise it, preregister the new experiment, and run it, along with one or more concurrent replications.
To summarize, concurrent replication is an underrepresented but potentially extremely valuable form of replication. Several large-scale concurrent replication efforts are currently underway, and a platform that also facilitates smaller-scale projects is available for use. The fact that concurrent replications are often viewed positively by the field is further evidence of the importance of replication for scientific endeavors.


Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., ... Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Schweinsberg, M., et al. (2016). The pipeline project: Pre-publication independent replications of a single laboratory's research pipeline. Journal of Experimental Social Psychology, 66, 55–67.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., . . . Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.