Thursday, May 12, 2016

Disentangling Reputation from Replication

With increasing attention paid to reproducibility in science, a natural worry for researchers is, “What happens if my finding does not replicate?” With this question, Charles Ebersole, Jordan Axt, and Brian Nosek open their new article on perceptions of novelty and reproducibility, published today in PLoS Biology.

There are several ways to interpret this question, but Ebersole and colleagues are most concerned with reputational issues. In an ideal world, they note, reputations shouldn’t matter; the focus should be on the findings. But reality is different: findings are treated as possessions.

Ebersole and his co-authors draw a contrast between innovation and reproducibility in evaluating reputations. Drawing this contrast is not without precedent. Some years back, I served on the National Science Foundation program Perception, Action, and Cognition. We were told that innovation was to be an overriding criterion in evaluating proposals. Up to that point, as I understood it, the program’s predecessor had been perceived as an “old-boys-network” in which researchers who had been funded before pretty much had a ticket to renewed funding, whereas younger researchers were struggling to get in on the funding. In our program discussions the word “solid” in a review was a kiss of death for the proposal, it being a code word for “more of the same old boring stuff.”

In the last decade, we have seen the pendulum swing from “solid” to “innovative.”* The pendulum metaphor invites us to align reproducible with boring and nonreproducible with innovative. Ebersole and colleagues create this stark contrast in their survey. Enter AA and BB, two scientists in some unspecified field. AA produces “boring but certain” results; BB produces “exciting but uncertain” results. Ebersole and colleagues asked two large samples from the general public several questions about these scientific opposites. When presented with this stark choice, the general public clearly preferred AA over BB. Good for AA.

However, Ebersole and his co-authors are quick to point out that AA and BB are caricatures; after all, nobody embarks on a career to produce boring or uncertain results. The contrast is misleading because there are temporal dependencies at play. You first obtain an exciting finding and then you decide what to do next: replicate and extend this exciting finding or move on to the next exciting finding? And if our reputation is at stake, how should we respond when others attempt to replicate our findings to increase certainty independently?

The authors investigated these questions in a further survey featuring the researchers X and Y. The respondents read several scenarios involving X and Y after having received an introduction about the scientific publication process. The respondents first rated researcher X’s ability, ethics, and the level of truth of the finding. The average rating of the researcher’s ability was then used as a baseline for several scenarios that introduced researcher Y as someone who replicated or failed to replicate X’s original finding. Of interest were the reputational consequences of this for X. This figure displays the results.



I have to admit that the figure is giving me bouts of OCD (am I alone in feeling compelled to pull apart the superimposed letters?), but the message is clear. Reputation depends not so much on whether your finding is true but rather on how you respond to failed replication.
If Y does not replicate the finding, then the original result is perceived as less true. X suffers some reputational damage as well, being perceived as somewhat less ethical and less capable than before. However, what matters crucially is how X responds to the failed replication. For example, there is considerably more reputational damage if X discredits Y’s replication result. I suspect this would vary as a function of whether or not X’s criticism was perceived as justified, but this was not investigated. In contrast, there is a big reputational gain if X accepts Y’s result (see here for an actual example) and concludes that the original result might not be correct; the original effect is perceived as less true, of course. Interestingly, the finding is perceived as less true than when X criticizes the replication. The reputation gain is even bigger if X starts a replication attempt to investigate the difference between the original and replication results. Curiously, the original result is now perceived as truer than before the failed replication. The reputation gain is somewhat smaller than this if X fails to self-replicate the original finding and the original finding is perceived as less true. There is considerable reputational damage if X performs an unsuccessful self-replication and decides not to report it or doesn’t follow up on the finding at all. The former is a bit hypothetical, of course, because if X doesn’t report the failed self-replication, no one is the wiser. And if X doesn’t follow up, it is unclear whether people would pick up on the lack of a follow-up.

So much for the general public. How about students and scientists? Ebersole and colleagues presented the same scenarios to 428 students and 313 researchers (from graduate students to full professors). It turns out that scientists are more forgiving than the general public, especially when it comes to pursuing new ideas rather than following up on an initially published finding. The authors attribute this to the aforementioned drive toward innovation.

Not surprisingly, the researchers displayed a more realistic (pessimistic?) assessment of the current job market than the general population. They viewed the exciting, uncertain scientist as more likely to get a job, keep a job, and be more celebrated by wide margins.

“Despite that,” the authors note, “researchers were slightly more likely to say that they would rather be, and more than twice as likely to say that they should be, the boring, certain scientist.” Demand characteristics are likely to have played a role here. As I said earlier, who wants to be boring? The students responded more like the general public than like the scientists.

What do we make of this set of results? Clearly, it is quite artificial to present respondents with a set of idealized and decontextualized scenarios. On what basis are respondents, especially the general public, making judgments when presented with these scenarios? On the other hand, the convergence among the responses from the three different groups (general public, students, researchers) is reassuring.

The set of scenarios that was used is not only idealized but also limited. It does not exhaust the space of possible scenarios, as the authors acknowledge. For example, there is no scenario that involves a (failed) replication that is flawed because it distorts or omits (either accidentally or intentionally) parts of the original experiment. It would be important to include such a scenario in a follow-up study and then ask questions about the ability and ethics of the replicator and truth of the replication finding as well. After all, just as original experiments can be flawed, so can replications. So it only makes sense to approach replications critically.

What I take away from the article is this.

(1) We should disentangle reputation from replication. This becomes easier if we self-replicate.

(2) We should stop seeing innovation and replication as opposites. The drive to innovate means that we are bound to pursue wrong leads in most cases. Competently performed replications are a reality check. Innovation and replication are not enemies. They are two necessary components of the best mechanism at our disposal to learn about the world: science.


Note:

*Although some might see this as the main reason for the reproducibility crisis, the only way we can tell for sure is if there are more replication attempts of “boring” research. I’m willing to bet that there are considerable reproducibility issues with that kind of research as well.

6 comments:

  1. Psychologists already are self-replicating (and extending at the same time, which is even better, right?); look at the introductions to Studies 2 through N in almost any article in JPSP. And amazingly, all these replication and extension studies tend to have the same positive results as the first one.

    I read somewhere that the multi-study article format was introduced in the first place as an antidote to a previous replication crisis. If that's true, it seems like it got (in effect) comprehensively gamed.

    Replies
    1. A pre-registered self-replication is a good way to go when submitting a manuscript. This can then be followed up by independent replications. The bigger point I wanted to make is that there aren't two camps: innovators vs. replicators. Everyone should be both.

  2. It would be interesting to see if introducing pre-registration would change people's judgement about how true the finding is and how ethical the researcher is. I imagine it wouldn't make much difference to the public (unless it was explained to them?) but it could have an impact on the students and scientists.

    PS I agree that there shouldn't be a divide between innovators & replicators. Hopefully we will get to a stage where pre-registered replications are the norm (though may not be feasible for all studies of course).

    Replies
    1. Good point. Preregistration would make a huge difference. In fact, preregistration rather than replication is the way to solve a lot of our problems.

  3. Rolf:

    What Nick said.

    I agree that self-replication is a good idea. But it can be done well or poorly. An example of self-aware replication is Nosek et al.'s famous "50 shades of gray" article. Other times I've seen purported self-replications that aren't replications at all, where the conditions are changed, where questionable data-selection rules are used, or even where a replicated result is not at the .05 level but it is still reported as a success. I recently have been involved in a replication of one of my own papers and it's hard. So I see the virtue of external replications.

    Replies
    1. I agree, Andrew. The key is preregistration. The plus is that it gives researchers a better idea of the reproducibility of their own findings. Even better is a replication by another lab, of course. A group of us is working on something called StudySwap, which is intended as a platform for facilitating concurrent cross-lab replications. More about this soon.
