Friday, October 24, 2014

ROCing the Boat: When Replication Hurts

“Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen.” This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé, to be published in Psychonomic Bulletin & Review.

Replication hurts in such cases because it reinforces artifactual results. Rotello and colleagues marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning, social psychology, and studies of child welfare. In each of these domains researchers make the same mistake by using the same wrong dependent measure.

Common across these domains is that subjects have to make detection judgments: was something present or was it not present? For example, subjects in eyewitness memory experiments decide whether or not the suspect is in a lineup. There are four possibilities.
  • Hit: The subject responds “yes” and the suspect is in the lineup.
  • False alarm: The subject responds “yes” but the suspect is not in the lineup.
  • Miss: The subject responds “no” but the suspect is in the lineup.
  • Correct rejection: The subject responds “no” and the suspect is not in the lineup.

If we want to determine decision accuracy, it is sufficient to consider only the positive responses, hits and false alarms; the negative responses are complementary to them (the miss rate is one minus the hit rate, and the correct-rejection rate is one minus the false-alarm rate). But the question is how we compute accuracy from hits and false alarms. And this is where Rotello and colleagues say the literature has gone astray.

To see why, let’s continue with the lineup example. Lineups can be presented simultaneously (all faces at the same time) or sequentially (one face at a time). A meta-analysis of data from 23 labs and 13,143 participants concludes that sequential lineups are superior to simultaneous ones: sequential lineups yield a diagnosticity ratio of 7.72 and simultaneous ones only 5.78; in other words, sequential lineups appear to be 1.34 (7.72/5.78) times as accurate as simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research, but that is what they imply.

The diagnosticity ratio is computed by dividing the hit rate by the false-alarm rate; the higher the ratio, the better the detection performance is supposed to be. So the notion of sequential superiority rides on the assumption that the diagnosticity ratio is an appropriate measure of accuracy. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But as Rotello and colleagues demonstrate, that may be all it has going for it.

If you compute the ratio of hits to false alarms (or the difference between them, as is often done), you’re assuming a linear relation between the two. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. The lowest line, for example, connects the subjects who are at chance performance and thus have a diagnosticity ratio of 1 (hit rate = false-alarm rate). The important point is that you get this ratio for a conservative responder with 5% hits and 5% false alarms but also for a liberal responder with 75% hits and 75% false alarms.
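To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of how the diagnosticity ratio is computed, using the hypothetical responders above and the meta-analytic ratios mentioned earlier:

```python
def diagnosticity_ratio(hit_rate, false_alarm_rate):
    """Diagnosticity ratio: hit rate divided by false-alarm rate."""
    return hit_rate / false_alarm_rate

# Two hypothetical responders who are both at chance: one rarely says "yes",
# the other says "yes" most of the time. Their ratios are identical.
print(diagnosticity_ratio(0.05, 0.05))  # conservative responder -> 1.0
print(diagnosticity_ratio(0.75, 0.75))  # liberal responder -> 1.0

# The sequential "superiority" factor from the meta-analysis:
print(7.72 / 5.78)  # ~1.34
```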

The lines in the figure are called Receiver Operating Characteristics (ROCs). (So now you know what that ROC is doing in the title of this post.) The ROC was developed by engineers in World War II who were trying to improve the detection of enemy objects on the battlefield; the concept was later introduced to psychophysics.

Now let’s look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. The lines that best fit these data points are curved; every point on such a curve reflects the same accuracy but a different tendency to respond “yes.” Rotello and colleagues note that curved ROCs are consistent with the empirical reality, whereas the straight lines assumed by the diagnosticity ratio are not.

Several large-scale studies have used ROCs rather than the diagnosticity ratio and found no evidence whatsoever for a sequential superiority effect in lineups. In fact, all of these studies found the opposite pattern: simultaneous lineups were superior to sequential ones. So what is the problem with the diagnosticity ratio? As you might have guessed by now, it does not control for response bias. Witnesses presented with a sequential lineup are simply less likely to respond “yes, I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data unconfound accuracy and response bias, and they show a simultaneous superiority effect.
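For readers who want to see sensitivity and bias pulled apart numerically, here is a sketch using the standard equal-variance Gaussian signal detection model. This is the textbook approach rather than the ROC-based analyses used in the lineup studies themselves, and the hit and false-alarm rates below are invented for illustration:

```python
from scipy.stats import norm

def sdt_measures(hit_rate, fa_rate):
    """Equal-variance Gaussian SDT: sensitivity d' and criterion c."""
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa               # discriminability, separated from bias
    criterion = -0.5 * (z_hit + z_fa)    # positive = conservative, negative = liberal
    return d_prime, criterion

# Two hypothetical witnesses with the same sensitivity but a different
# willingness to say "yes" (hit rate, false-alarm rate):
witnesses = {"liberal": (0.80, 0.30), "conservative": (0.50, 0.086)}

for label, (h, f) in witnesses.items():
    d, c = sdt_measures(h, f)
    print(f"{label:12s} d' = {d:.2f}  c = {c:+.2f}  diagnosticity = {h / f:.2f}")
# Both witnesses come out at a d' of about 1.37, yet the conservative one has a
# diagnosticity ratio of about 5.8 versus about 2.7 for the liberal one:
# the ratio rewards a reluctance to say "yes", not better memory.
```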

Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post, but the broader point is clear. As they put it: “This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time.” Unless we use the proper dependent measure, replications will only aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases massive replication will produce what in polite company is called a Charlie Foxtrot.

Rotello and colleagues conclude by considering the consequences of their analysis for ongoing replication efforts such as the Reproducibility Project and the first Registered Replication Report on verbal overshadowing that we are all so proud of. They refer to a submitted paper that argues the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the perpetrator is not in the lineup. I haven’t read this study yet and so can’t say anything about it, but it sure will make for a great topic for a future post (although I’m already wondering whether I should start hiding under a ROC).

Rotello and colleagues have produced an illuminating analysis that invites us once more to consider how valid our replication attempts are. Last year, I had an enjoyable blog discussion about this very topic with Dan Simons; it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.


I thank Evan Heit for alerting me to the article and for feedback on a previous draft of this post.

The Diablog on Replication with Dan Simons

Last year, I had a very informative and enjoyable blog dialogue, or diablog, with Dan Simons about the reliability and validity of replication attempts. Unfortunately, there has never been an easy way for anyone to access this diablog. It only occurred to me today (!) that I could remedy the situation by creating a meta-post. Here it is.

In my first post on the topic, I argued that it is important to consider not only the reliability but also the validity of replication attempts, because it might be problematic if we try to replicate a flawed experiment.

Dan Simons responded to this, arguing that deviations from the original experiment, while interesting, would not allow us to determine the reliability of the original finding.

I then had some more thoughts.

To which Dan wrote another constructive response.

My final point was that direct replications should be augmented with systematic variations of the original experiment.

Thursday, September 18, 2014

Verbal Overshadowing: What Can we Learn from the First APS Registered Replication Report?

Suppose you witnessed a heinous crime being committed right before your eyes. Suppose further that a few hours later, you’re being interrogated by hard-nosed detectives Olivia Benson and Odafin Tutuola. They ask you to describe the perpetrator. The next day, they call you in to the police station and present you with a lineup. Suppose the suspect is in the lineup. Will you be able to pick him out? A classic study in psychology suggests that Benson and Tutuola have made a mistake by first having you describe the perpetrator, because the very act of describing him will make it more difficult for you to pick him out of the lineup.

This finding is known as the verbal overshadowing effect and was discovered by Jonathan Schooler. In the experiment that is of interest here, he and his co-author, Tonya Engstler-Schooler, found that verbally describing the perpetrator led to a 25% accuracy decrease in identifying him. This is a sizeable difference with practical implications. Based on these findings, we’d be right to tell Benson and Tutuola to lay off interviewing you until after the lineup identification.

Here is how the experiment worked.


Subjects first watched a 44-second video clip of a (staged) bank robbery. Then they performed a filler task for 20 minutes, after which they either wrote down a description of the robber (experimental condition) or listed names of US states and their capitals (control condition). After 5 minutes, they performed the lineup identification task.

How reliable is the verbal overshadowing effect? That is the question that concerns us here. A 25% drop in accuracy seems considerable. Schooler himself observed that subsequent research yielded progressively smaller effects, something he referred to as “the decline effect.” This clever move created a win-win situation for him: if the original finding replicates, the verbal overshadowing hypothesis is supported; if it doesn’t, the decline effect hypothesis is supported.

The verbal overshadowing effect is the target of the first massive Registered Replication Report, conducted under the direction of Dan Simons (Alex Holcombe is leading the charge on the second project), which was just published. Thirty-one labs were involved in direct replications of the verbal overshadowing experiment I just described. Our lab was one of the 31. Due to the large number of participating labs and the laws of the alphabet, my curriculum vitae now boasts an article on which I am 92nd author.

Due to an error in the protocol, the initial replication attempt had the description task and the filler task in the wrong order before the lineup task, which made the first set of replications, RRR1, a fairly direct replication of Schooler’s Experiment 4 rather than, as was the plan, his Experiment 1. A second set of experiments, RRR2, was performed to replicate Schooler’s Experiment 1. You can see the alternative orderings here.



In Experiment 4, Schooler found that subjects in the verbal description condition were 22% less accurate than those in the control condition. A meta-analysis of the RRR1 experiments yielded a considerably smaller, but still significant, 4% deficit. Of note, all of the replication studies found a smaller effect than the original study, but the original estimate was also less precise because of its smaller sample size.
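As an aside, here is a bare-bones sketch of how a pooled estimate of this kind of accuracy deficit can be computed with a fixed-effect, inverse-variance meta-analysis. The numbers are simulated purely for illustration; the RRR used its own, more sophisticated meta-analytic procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 31 labs, each with ~120 participants per condition and a true
# 4-percentage-point accuracy deficit in the description condition.
# All numbers are invented; see the RRR paper for the real data.
n_labs, n_per_cell = 31, 120
p_control, p_verbal = 0.54, 0.50

effects, weights = [], []
for _ in range(n_labs):
    correct_c = rng.binomial(n_per_cell, p_control)  # correct IDs, control condition
    correct_v = rng.binomial(n_per_cell, p_verbal)   # correct IDs, description condition
    pc, pv = correct_c / n_per_cell, correct_v / n_per_cell
    diff = pv - pc                                   # negative = verbal overshadowing
    var = pc * (1 - pc) / n_per_cell + pv * (1 - pv) / n_per_cell
    effects.append(diff)
    weights.append(1 / var)                          # inverse-variance weight

effects, weights = np.array(effects), np.array(weights)
pooled = np.sum(weights * effects) / np.sum(weights)
se = np.sqrt(1 / np.sum(weights))
print(f"pooled deficit: {pooled:.3f}, 95% CI: "
      f"[{pooled - 1.96 * se:.3f}, {pooled + 1.96 * se:.3f}]")
# Most individual labs will have a CI that includes 0; the pooled estimate is
# far more precise and typically recovers the small deficit.
```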

Before I tell you about the results of the replication experiments I have a confession to make. I have always considered the concept of verbal overshadowing plausible, even though I might have a somewhat different explanation for it than Schooler (more about this maybe in a later post), but I thought the experiment we were going to replicate was rather weak. I had no confidence that we would find the effect. And indeed, in our lab, we did not obtain the effect. You might argue that this null effect was caused by the contagious skepticism I must have been oozing. But I did not run the experiment. In fact, I did not even interact about the experiment with the research assistant who ran it (no wonder I’m 92nd author on the paper!). So the experiment was well-insulated from my skepticism.

Let's get back on track. In Experiment 1, Schooler found a 25% deficit. The meta-analysis of RRR2 yielded a 16% deficit, somewhat smaller but still in the same ballpark. Verbal overshadowing appears to be a robust effect. Also interesting is the finding that the position of the filler task in the sequence mattered: the verbal overshadowing effect is larger when the lineup identification immediately follows the description and when there is more time between the video and the description. Either of those factors, or a combination of them, could be responsible for the difference in effect sizes between RRR1 and RRR2.

Here are the main points I take away from this massive replication effort.

1. Our intuitions about effects may not be as good as we think. My intuitions were wrong because a meta-analysis of all the experiments finds strong support for the effect. Maybe I’m just a particularly ill-calibrated individual or an overly pessimistic worrywart but I doubt it. For one, I was right about our own experiment, which didn’t find the effect. At the same time, I was clearly wrong about the overall effect. This brings me to the second point.

2. One experiment does not an effect make (or break). This goes for the original experiment, which did find a big effect, as well as for our replication attempt (and the 30 others). One experiment that shows an effect doesn’t mean much, and neither does one unsuccessful replication. We already knew this, of course, but the RRR drives the point home nicely.

3. RRRs are very useful for estimating effect sizes without having to worry about publication bias. But it should be noted that they are very costly. Using 31 labs was probably overkill, although it was nice to see all the enthusiasm for a replication project.

4. More power is better. As the article notes about the smaller effect in RRR1: “In fact, all of the confidence intervals for the individual replications in RRR1 included 0. Had we simply tallied the number of studies providing clear evidence for an effect […], we would have concluded in favor of a robust failure to replicate—a misleading conclusion. Moreover, our understanding of the size of the effect would not have improved."

5. Replicating an effect against your expectations is a joyous experience.  This sounds kind of sappy but it’s an accurate description of my feelings when I was told by Dan Simons about the outcome of the meta-analyses. Maybe I was biased because I liked the notion of verbal overshadowing but it is rewarding to see an effect materialize in a meta-analysis. It's a nice example of “replicating up.”

Where do we go from here? Now that we have a handle on the effect, it would be useful to perform coordinated and preregistered conceptual replications (using different stimuli, different situations, different tasks). I'd be happy to think along with anyone interested in such a project.

Update September 24, 2014. The post is the topic of a discussion on Reddit.

Wednesday, July 9, 2014

Developing Good Replication Practices

In my last post, I described a (mostly) successful replication by Steegen et al. of the “crowd-within effect.” The authors felt it would be nice to mention all the good replication research practices they had implemented in their study.

And indeed, positive psychologist that I am, I would be remiss if I didn’t extol the virtues of the approach in that exemplary replication paper, so here goes.

Make sure you have sufficient power.
We all know this, right?
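As a quick illustration of what such a calculation might look like for a two-condition lineup study (the accuracy levels below are assumptions for the sake of the example, not values from any particular paper), one option is the power module in statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed identification accuracies: 54% correct in the control condition vs
# 44% after describing the perpetrator (a 10-point deficit, purely illustrative).
effect_size = proportion_effectsize(0.54, 0.44)  # Cohen's h

n_per_condition = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"about {n_per_condition:.0f} participants per condition")  # roughly 390
```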

Preregister your hypotheses, analyses, and code.
I like how the replication authors went all out in preregistering their study. It is certainly important to have the proposed analyses and code worked out up front.

Make a clear distinction between confirmatory and exploratory analyses.
The authors did exactly as the doctor (A. D. de Groot, in this case) ordered. It is very useful to perform exploratory analyses, but they should be clearly separated from the confirmatory ones.

Report effect sizes.
Yes.

Use both estimation and testing, so your data can be evaluated more broadly, by people from different statistical persuasions.

Use both frequentist and Bayesian analyses.
Yes, why risk being pulled over by a Bayes trooper or having a run-in with the Frequentist militia? Again, using multiple analyses allows your results to be evaluated more broadly.
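As a concrete illustration (mine, not from the replication paper), the pingouin package reports a frequentist t-test and a default Bayes factor side by side, so a single analysis can speak to both camps:

```python
import numpy as np
import pingouin as pg

rng = np.random.default_rng(1)
# Two made-up groups of scores, for illustration only.
group_a = rng.normal(0.55, 0.10, 60)
group_b = rng.normal(0.50, 0.10, 60)

# The resulting table includes the t statistic, p-value, effect size,
# and a default Bayes factor (BF10) in one go.
print(pg.ttest(group_a, group_b).round(3))
```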

Adopt a co-pilot multi-software approach.
A mistake in data analysis is easily made and so it makes sense to have two or more researchers analyse the data from scratch. A co-author and I used a co-pilot approach as well in a recent paper (without knowing the cool name for this approach, otherwise we would have bragged about it in the article). We discovered that there were tiny discrepancies between our analyses with each of us making a small error here and there. The discrepancies were easily resolved but the errors probably would have gone undetected had we not used the co-pilot approach. Using a multi-software approach seems a good additional way to minimize the likelihood of errors.

Make the raw and processed data available.
When you ask people to share their data, they typically send you the processed data but the raw data are often more useful. The combination is even more useful as it allows other researchers to retrace the steps from raw to processed data. 

Use multiple ways to assess replication success.
This is a good idea in the current climate where the field has not settled on a single method yet. Again, it allows the results to be evaluated more broadly than with a single-method approach.

“Maybe these methodological strengths are worth mentioning too?”, the first author of the replication study, Sara Steegen, suggested in an email.

Check.


I thank Sara Steegen for feedback on a previous version of this post.

Thursday, July 3, 2014

Is There Really a Crowd Within?

In 1907 Francis Galton (two years prior to becoming “Sir”) published a paper in Nature titled “Vox populi” (voice of the people). With the rise of democracy in the (Western) world, he wondered how much trust people could put in public judgments. How wise is the crowd, in other words?

As luck would have it, a weight-judging competition was carried on at the annual show of the West of England Fat Stock and Poultry Exhibition (sounds like a great name for a band) in Plymouth. Visitors had to estimate the weight of a prize-winning ox when slaughtered and “dressed” (meaning that its internal organs would be removed).

Galton collected all 800 estimates. He removed thirteen (and nicely explains why) and then analyzed the remaining 787. He computed the median estimate and found that it was within 1% of the ox’s actual weight. Galton concludes: “This result is, I think, more creditable to the trust-worthiness of a democratic judgment than might have been expected.”

This may seem like a small step for Galton and a big step for the rest of us, but later research has confirmed that, when people make estimates, the average of the group is more accurate than the estimates of most of the individuals. The effect hinges on the errors in the individual estimates being at least partly statistically independent of one another.

In 2008 Edward Vul and Hal Pashler gave an interesting twist to the wisdom of the crowd idea. What would happen, they wondered, if you allow the same individual to make two independent estimates? Would the average of these estimates be more accurate than each of the individual estimates?

Vul and Pashler tested this idea by having 428 subjects guess the answers to questions such as “What percentage of the world’s airports are in the United States?” They further reasoned that the more the two estimates differed from each other, the more accurate their average would be. To test this idea, they manipulated the time between the first and second guess: one group second-guessed themselves immediately, whereas the other group made their second guess three weeks later.
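To build an intuition for why averaging two guesses from the same person should help, here is a small simulation sketch (with made-up numbers, not Vul and Pashler's data): each guess is the truth plus a stable personal bias plus independent noise, and averaging cancels part of the independent noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, truth = 428, 30.0           # true answer to some percentage question

# Each guess = truth + a stable personal bias + independent noise. Only the
# independent part of the error can be averaged away; the shared bias cannot.
bias = rng.normal(0, 8, n_subjects)     # error shared by both guesses
guess1 = truth + bias + rng.normal(0, 10, n_subjects)
guess2 = truth + bias + rng.normal(0, 10, n_subjects)
average = (guess1 + guess2) / 2

def mse(guesses):
    """Mean squared error relative to the true answer."""
    return np.mean((guesses - truth) ** 2)

print(f"MSE first guess:  {mse(guess1):.1f}")
print(f"MSE second guess: {mse(guess2):.1f}")
print(f"MSE average:      {mse(average):.1f}")  # lowest of the three
```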

Here is what Vul and Pashler found.   



They did indeed observe that the average of the two guesses was more accurate than each of the guesses separately (the green bars representing the mean squared error are lower than the blue and red ones). Furthermore, the effect of averaging was larger in the 3-week-delay condition than in the immediate condition.

Vul and Pashler conclude that forcing a second guess leads to higher accuracy than is obtained by a first guess and that this gain is enhanced by temporally separating the two guesses. So "sleeping on it" works.

How reproducible are these findings? That is what Sara Steegen, Laura Dewitte, Francis Tuerlinckx, and Wolf Vanpaemel set out to investigate in a preregistered replication of the Vul and Pashler study in a special issue of Frontiers in Cognition that I’m editing with my colleague René Zeelenberg. 

Steegen and colleagues tested Flemish psychology students rather than a more diverse sample. They obtained the following results.


Like Vul and Pashler, they obtained a crowd-within effect. The average of the two guesses was more accurate than each of the guesses separately both in the immediate and in the delayed condition. Unlike in Vul and Pashler (2008), the accuracy gain of averaging both guesses compared to guess 1 was not significantly larger in the delayed condition (although it was in the same direction). Instead, the accuracy gain of the average was larger in the delayed condition than in the immediate condition when it was compared to the second guess.

So this replication attempt yields two important pieces of information: (1) the crowd-within effect seems robust, (2) the effect of delay on accuracy gain needs to be investigated more closely. It's not clear yet whether or when "sleeping on it" works.

Edward Vul, the first author of the original crowd-within paper, was a reviewer of the replication study. I like how he responded to the results when recommending acceptance of the paper:

“The authors carried out the replication as they had planned. I am delighted to see the robustness of the Crowd Within effect verified (a couple of non-preregistered and thus less-definitive replications had also found the effect within the past couple of years). Of course, I'm a bit disappointed that the results on replicating the contrast between immediate and delayed benefits are mixed, but that's what the data are.

The authors have my thanks for doing this service to the community" [quoted with permission]

 Duly noted. 

Thursday, June 5, 2014

Who’s Gonna Lay Down the Law in Psytown?

These are troubled times in our little frontier town called Psytown. The priest keeps telling us that deep down we’re all p-hackers and that we must atone for our sins.

If you go out on the streets, you face arrest by any number of unregulated police forces and vigilantes.

If you venture out with a p-value of .065, you should count yourself lucky if you run into deputy Matt Analysis. He’s a kind man and will let you off with a warning if you promise to run a few more studies, conduct a meta-analysis, and remember never to use the phrase “approaching significance” ever again.

It could be worse.

You could be pulled over by a Bayes Trooper. “Please step out of the vehicle, sir.” You comply. “But I haven’t done anything wrong, officer, my p equals .04.” He lets out a derisive snort “You reckon that’s doin’ nothin’ wrong? Well, let me tell you somethin’, son. Around these parts we don’t care about p. We care about Bayes factors. And yours is way below the legal limit. Your evidence is only anecdotal, so I’m gonna have to book you.”

Or you could run into the Replication Watch. “Can we see your self-replication?” “Sorry, I don’t have one on me but I do have a p<.01.” “That’s nice but without a self-replication we cannot allow you on the streets.” “But I have to go to work.” “Sorry, can’t do, buddy.” “Just sit tight while we try to replicate you.”

Or you could be at a party when suddenly two sinister people in black show up and grab you by the arms. Agents from the Federal Bureau of Pre-registration. “Sir, you need to come with us. We have no information in our system that you’ve pre-registered with us.” “But I have p<.01 and I replicated it” you exclaim while they put you in a black van and drive off.

Is it any wonder that the citizens of Psytown stay in most of the day, fretting about their evil tendency to p-hack, obsessively stepping on the scale worried about excess significance, and standing in front of the mirror checking their p-curves?

And then when they are finally about to fall asleep, there is a loud noise. The village idiot has gotten his hands on the bullhorn again. “SHAMELESS LITTLE BULLIES” he shouts into the night. “SHAMELESS LITTLE BULLIES.”

Something needs to change in Psytown. The people need to know what’s right and what’s wrong. Maybe they need to get together to devise a system of rules. Or maybe a new sheriff needs to ride into town and lay down the law.

Thursday, May 29, 2014

My Take on Replication

There are quite a few comments on my previous post already, both on this blog and elsewhere. That post was my attempt to make sense of the discussion that all of a sudden dominated my Twitter feed (I’d been offline for several days). Emotions were running high and invective was flying left and right. I wasn’t sure what the cause of this fracas was, so I tried to make sense of where people were coming from and to suggest a way forward.

Many of the phrases in the post that I used to characterize the extremes of the replication continuum are paraphrases of what I encountered online rather than figments of my own imagination. What always seems to happen when you write about extremes, though, is that people rush in to declare themselves moderates. I appreciate this. I’m a moderate myself. But if we were all moderates, then the debate wouldn’t have spiralled out of control. And it was this derailment of the conversation that I was trying to understand.

But before (or more likely after) someone mistakes one of the extreme positions described in the previous post for my own let me state explicitly how I view replication. It's rather boring.
  • Replication is by no means the solution to all of our problems. I don’t know if anyone seriously believes it is.
  • Replication attempts should not be used or construed as personal attacks. I have said this in my very first post and I'm sticking with it.
  • A failed replication does not mean the original author did something wrong. In fact, a single failed replication doesn’t mean much, period. Just like one original experiment doesn’t mean much. A failed replication is just a data point in a meta-analysis, though typically one with a little more weight than the original study (because of the larger N). The more replication attempts the better.
  • There are various reasons why people are involved in replication projects. Some people distrust certain findings (sometimes outside their own area) and set out to investigate. This is a totally legitimate reason. In the past year or so I have learned that I’m personally more comfortable with trying to replicate findings from theories that I do find plausible but that perhaps don’t have enough support yet. I call this replicating up. Needless to say, this can still result in a replication failure (but at least I’m rooting for the effect). And then there are replication efforts where people are not necessarily invested in a result, such as the Reproducibility Project and the Registered Replication Reports. Maybe this is the way of the future. Another option is adversarial replication.
  • Direct vs. conceptual replication is a false dichotomy. Both are necessary but neither is sufficient. Hal Pashler and colleagues have made it clear why conceptual replication by itself is not sufficient. It’s biased against the Null. If you find an effect you'll conclude the effect has replicated. If you don’t, you’ll probably conclude that you were measuring a different construct after all (I’m sure I must have fallen prey to this fallacy at one point or another). Direct replications have the opposite problem. Even if you replicate a finding many times over, it might be that what you’re replicating is, in fact, an artifact. You’ll only find out if you conduct a conceptual replication, for example with a slightly different stimulus set. I wrote about the reliability and validity of replications earlier, which resulted in an interesting (or so we thought) “diablog” with Dan Simons on this topic (see also here, here, and here).
  • Performing replications is not something people should be doing exclusively (at least, I’d recommend against it). However, it would be good if everyone were involved in doing some of the work. Performing replications is a service to the field. We all live in the same building and it is not as solid as we once thought. Some even say it’s on fire.