Friday, June 21, 2013

The Tyrion Lannister Paradox: How Small Effect Sizes can be Important


There has been a lot of debate lately about effect sizes. On the one hand, there are effects in the social priming literature that seem surprisingly large given the subtlety of the manipulation, the between-subjects design, and the (small) sample size. On the other hand, some researchers can be heard complaining about small effect sizes in other areas of study (for example cognitive psychology). Why would we want to study small effects?

This is not a new question.  We could go further back in history but let’s stop in 1992, the year in which an insightful article on small effect sizes appeared, authored by Deborah Prentice and Dale Miller. Prentice and Miller argue that there are two valid reasons why psychologists study small effects.

The first reason is that researchers are trying to establish the minimal conditions under which an effect can be found. They accomplish this by minimally manipulating the independent variable.  The second reason is that researchers are invested in showing that an effect occurs even under very unfavorable conditions.

According to this analysis, there are two modes of experimentation. One is aimed at accounting for maximal variance and is therefore interested in big effects. The other aims to provide the most stringent test of a hypothesis.

So researchers who study small effects generally (generally being the operative word here) aren’t doing this because they enjoy being obscure, esoteric, fanciful, eccentric, absurd, ludicrous, kooky, or wacky. They are simply trying to be good scientists. An experiment might look farfetched but this doesn’t mean it is. It might very well be the product of rigorous scientific thought.

If we accept experiments with small effect sizes as scientifically meaningful, then the next question becomes how to evaluate these experiments. Here Prentice and Miller make an important observation. They point out that researchers who perform small-effect-size experiments are not committed to a specific operationalization of a finding; it is one out of many operationalizations that might have been used.

Take for example (my example, not that of Prentice and Miller) a simple semantic priming experiment. The hypothesis is that words (e.g., doctor) are more easily recognized when preceded by a semantically related word (e.g., nurse) than when preceded by a semantically unrelated word (e.g., bread).

There are many ways semantic priming (and more generally the theory of semantic memory) can be tested. For example, we could present the prime words on a list and then present target words as word stems (e.g., do---r). Our prediction then would be that subjects are more likely to complete the word stem as doctor (as opposed to, say, dollar, or dormer) when primed with nurse than when primed with bread.

We could test the same idea in a response-time paradigm, for instance by using a lexical-decision task—in which subjects decide as quickly and as accurately as possible whether a given string of letters is a genuine word—or a naming task, in which subjects merely read the words aloud. The prediction is that lexical decisions and naming are faster for primed words (nurse-doctor) than for words that are not primed (bread-doctor).

Such response-time paradigms open up a plethora of options. It is possible to vary: the amount of time that elapses between the presentation of the prime and that of the target, the presentation duration of the prime, whether or not the prime is masked, the nature of the words being used, the strength of the semantic relation between prime and target, the number of words in the entire experiment, font size, capitalization, font color, and so on.

Combine this with the various ways in which response times can be trimmed or transformed before analysis and you’ve got a huge number of options. Each combination of options will yield a different effect size. But effect size is not the name of the game here. At issue is whether semantic priming occurs or not.
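To make this concrete, here is a minimal sketch in Python with simulated response times (all numbers invented). It treats the two priming conditions as independent samples for simplicity and shows how the very same data yield a different Cohen’s d depending on which trimming or transformation rule is applied before analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

# Simulated lexical-decision RTs in ms: primed targets are ~20 ms faster on average,
# and both conditions contain the occasional slow outlier.
primed = rng.lognormal(mean=np.log(580), sigma=0.25, size=400)
unprimed = rng.lognormal(mean=np.log(600), sigma=0.25, size=400)

def preprocess(rt, rule):
    """Apply one of several common (and somewhat arbitrary) preprocessing rules."""
    if rule == "none":       # analyze raw RTs
        return rt
    if rule == "absolute":   # drop RTs outside a fixed window
        return rt[(rt > 200) & (rt < 1500)]
    if rule == "sd":         # drop RTs more than 2.5 SD from the condition mean
        return rt[np.abs(rt - rt.mean()) < 2.5 * rt.std(ddof=1)]
    if rule == "log":        # log-transform instead of trimming
        return np.log(rt)
    raise ValueError(rule)

for rule in ["none", "absolute", "sd", "log"]:
    d = cohens_d(preprocess(primed, rule), preprocess(unprimed, rule))
    print(f"preprocessing = {rule:9s} Cohen's d = {d:.3f}")
```

The priming effect is “there” in every row of the output; only its size shifts with the analysis choices. That is exactly the point: the effect size is a property of the whole experimental and analytic package, not of semantic priming itself.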

Any combination of options may give rise to an experiment that is diagnostic with respect to the semantic priming hypothesis. The most diagnostic experiment will not be the one with the largest effect size. Rather, it will be the one in which the effect is least likely to occur. There's a good chance this will be the experiment with the smallest effect size. Let’s look at some evidence for this claim.

In a lexical decision task subjects make judgments about words. In a naming task they simply read the words aloud; there is no decision involved and access to the word’s meaning is not necessary to perform the task. This absence of a need to access meaning makes it more difficult to find semantic priming effects in naming than in lexical decision. And indeed, a meta-analysis shows that semantic-priming effects are about twice as large in lexical decision experiments (Cohen’s d = .33) as in naming experiments (Cohen’s d = .16). Still, priming effects are more impressive in naming than in lexical decision.

Prentice and Miller argue that authors should consider the two different goals of experimentation (accounting for maximal variance vs. using the most minimal manipulation) when designing and reporting their studies. I can't recall ever having come across such reporting in the papers I have read but it seems like a good idea.

The take-home message is that we should not dismiss small effects so easily. Tyrion Lannister may be the character that is smallest in stature in Game of Thrones but he is also one of the game’s biggest players.



Thursday, June 13, 2013

Wacky Hermann and the Nonsense Syllables: The Need for Weirdness in Psychological Experimentation


Earlier this week I attended a symposium in Nijmegen on Solid Science in Psychology. It was organized mostly by social psychologists from Tilburg University along with colleagues from Nijmegen. (It is heartening to see members of Stapel’s former department step up and take such a leading role in the reformation of their field.) The first three speakers were Uri Simonsohn, Leif Nelson, and Joe Simmons of false-positive psychology fame. I enjoyed their talks (which were not only informative but also quite humorous for such a dry subject) but I already knew their recent papers and I agree with them, so their talks did not change my view much.

Later that day the German social psychologist Klaus Fiedler spoke. He offered an impassioned view that ran somewhat contrary to the current replication rage. I didn’t agree with everything Fiedler said but I did agree with most of it. What’s more, he got me to think, and getting your audience to think is what a good speaker wants.

Fiedler’s talk was partly a plea for creativity and weirdness in science. He likened the scientific process to evolution. There are phases of random generation and phases of selection. If we take social psychology, many would say that this field has been in a state of relatively unconstrained generation of ideas (see my earlier posts on this). According to Fiedler, this is perfectly normal.

Also perfectly normal is the situation that we find ourselves in now, a phase in which many people express doubts about the reliability and validity of much of this research. These doubts are finding their way into various replication efforts as ways to select the good ideas from the bad ones. As I’ve discussed earlier (and in the ensuing blogalog with Dan Simons, here, here, here, here, and here), direct replications are a good start, but somewhat less direct replications are also necessary to select the most valid ideas. 

So I’m glad we’re having this revolution. At the same time, I confess to having an uneasy feeling. During Fiedler’s talk, I had a Kafkaesque vision of an army of researchers dutifully but joylessly going about their business: generating obvious hypotheses guaranteed to yield large effect sizes, performing power analyses, pre-registering their experiments, reporting each and every detail of their experiments, storing and archiving their data, and so on. Sure, this is science. But where is the excitement? Remember, we’re scientists, not librarians or accountants. To be sure, I have heard people wax poetic about initiatives to archive data. But are these people for real? Archiving your data is about as exciting as filing your taxes.

Wacky Hermann
Creativity and weirdness are essential for progress in science. This is what Fiedler argued and I agree. Heck, people at the time must have found it pretty silly that Hermann Ebbinghaus spent hours each day memorizing completely useless information (nonsense syllables) by reciting them—with the same neutral voice inflection each time—to the sound of a metronome. Try telling that at a party when asked what you do for a living! And yet psychology would not have been the same if Ebbinghaus had decided to spend his time in a more conventional manner, for example by discussing the latest political issues in the local Biergarten, by taking his dog on strenuous walks, or by forming his own garage band (though Wacky Hermann and the Nonsense Syllables would have been a killer name).

So I agree with Klaus Fiedler. We need creativity and weirdness in science. We need to make sure that the current climate does not stifle creativity. But we also need to have mechanisms in place to select the most valid ideas. I think we can have our cake and eat it too by distinguishing between exploratory and confirmatory research, as others have already suggested.

It is perfectly okay to be wacky and wild (the wackier and wilder the better as far as I’m concerned), as long as you indicate that your research is exploratory (perhaps there should be journals or journal sections devoted to exploratory ideas). But if your research is confirmatory (and I think each researcher should do both exploratory and confirmatory research), then you do need to do all the boring things that I described earlier. Because boring as they might be, they are also unavoidable if we want to have a solid science.


Wednesday, June 5, 2013

The Diablog on Replications and Validity Continues


In the latest conversational turn in my ongoing dialog (diablog?) with Dan Simons about replications and validity, Dan provides some useful insights into what qualifies as a direct replication:

a direct replication can be functionally the same if it uses the same materials, tasks, etc. and is designed to generalize across the same variations as the original.

I agree completely. As Dan notes, no replication can be exact and some changes are inevitable for the experiment to make sense. At Registered Replication Reports (RRR), Dan and his colleague Alex Holcombe have instituted some interesting procedures:

Our approach with Registered Replication Reports is to ask the original authors to specify a range of tolerances on the parameters of the study. 

This is a great idea. What I like even more is that Dan not only talks the talk but also walks the walk. He is using this approach in his own papers by adding a paragraph to the method section in which he states the generalization target for his experiments. It would be tremendously useful if we all did this. For example, I’d be very interested to know whether authors think their experiments can be extended to Mechanical Turk.

Of course, allowing authors to define the scope provides them with a way to obstruct replication attempts. For example, an author could claim that the effect can only be found in cubicles of such-and-such dimensions on the 12th floor of a building in a medium-sized Dutch city on a sunny Thursday afternoon. Fortunately, RRR has a robust way of dealing with such shenanigans:

…we should then treat the original effect as unreliable and unreproducible, and it should not factor into larger theorizing…

This sounds exactly right to me.

I think there is only one issue that needs to be clarified. It is quite possible that I did not make myself clear enough. Dan states the bottom line of my previous post as follows:

Rolf's larger point, though, is that it should be considered a direct replication to vary things in ways that are consistent with the theory that governs the study itself. 

I’m not sure this is how I would characterize my larger point. Rather, my point was that a direct replication could be augmented with slight variations in the manner I described. This would enable us to transcend idiosyncrasies of the original study to produce an authoritative test of the hypothesis. So the direct replication would be part of a larger constellation of studies. This way we would have (direct replication) our cake and eat (validity) it too. This is exactly what Dan describes in his last sentence.

So the way I see it, we are in complete agreement. Direct replications are the first important step but they should be followed up or combined with slight variations to enhance the validity of our findings. Commenting on my first post on this subject, Etienne LeBel suggests that this goal can often be achieved by adding one or more conditions to the direct replication. If the original experiment has a between-subjects design, this is a great idea (though its execution may not always be feasible). For within-subjects designs, it would clearly be a bad idea, as it would significantly alter the experiment.

I look forward to hearing more about the thinking behind RRR as it progresses. I already think it is a major positive development in our field and it is likely to become even better.

Tuesday, June 4, 2013

More Thoughts on Validity and Replications


In my previous post I described how direct replications provide insight into the reliability of findings but not so much their validity. Dan Simons yesterday wrote an insightful post in response to this (I love the rapid but thoughtful scientific communication afforded by blogs). As Dan said, we are basically in agreement: it is very important to conduct direct replications and it is also important to assess the validity of our findings.

Dan’s post made me think about this issue a little more and I think I can articulate it more clearly now (even though I have just been sitting in the sun drinking a couple of beers). To clarify, let me first quote Dan when he describes my proposal:

This approach, allowing each replication to vary provided that they follow a more general script, might not provide a definitive test of the reliability of a finding. Depending on how much each study deviated from the other studies, the studies could veer into the territory of conceptual replication rather than direct replication. Conceptual replications are great, and they do help to determine the generality of a finding, but they don't test the reliability of a finding.

There is no clear dividing line between direct and conceptual replications but what I am advocating are not conceptual replications. Here is why.

Most of the studies that have featured in discussions of reproducibility are what you might call one-shot between-subjects studies. Examples are the typical social priming studies, such as the by now notorious professor prime and bingo walking speed experiments. Another example is the free will study by Vohs and Schooler that I discussed in an earlier post. The verbal overshadowing experiment that was the topic of my previous post is yet another example and the one I want to focus on here.

In these one-shot experiments the manipulation is between subjects and there is essentially only one prime-target pair. People see one video. Then they either describe the bank robber or name the capitals of U.S. states, and then both groups perform the same line-up test.

It is instructive to contrast this type of design with that of a typical cognitive psychology experiment. Let’s take a Stroop experiment. In a one-shot variant of this type of study, one group of subjects sees the word red printed in green ink (the incongruent, experimental condition) and the other group sees the word red printed in red ink (the congruent, control condition). We compute the mean naming time (across subjects) for each condition, compare them and voilà: Bob’s our uncle.

However, people would justifiably complain about this experiment. Is red really a representative color word? How about blue, yellow, green, purple, orange, magenta, turquoise, and so on? And how do we know our one group of subjects is comparable to the other?

This is why many cognitive experiments have a within-subjects repeated-measures design. In such a design each subject would see the word red printed in green ink as well as in red ink, and likewise for green and the other color words, with order of presentation counterbalanced across subjects. This design allows us to assess for each word whether and by how much it is named faster in the congruent than in the incongruent condition.

The benefit of this design is that we will be able to assess whether our finding generalizes across items. In the typical experiment not all items will show the effect, just like not all subjects will show the effect. But thanks to early work by Herb Clark and recent work by others, we have methods at our disposal to assess the generalizability of our findings across items.
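For readers who like to see the mechanics, here is a minimal sketch (Python, simulated data, all numbers invented) of the classic by-subjects and by-items analyses for a within-subjects Stroop-style design. A modern treatment would instead fit a mixed-effects model with crossed random effects for subjects and items, but the underlying question is the same: does the effect hold up when the items, and not just the subjects, are treated as a random sample?

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_subjects, n_items = 30, 12                   # items = color words (red, green, blue, ...)
subj_speed = rng.normal(0, 40, n_subjects)     # some subjects are simply slower overall
item_diff = rng.normal(0, 15, n_items)         # some words are simply harder than others
true_effect = 50                               # congruency effect in ms

# rt[s, i, c]: subject s, item i, condition c (0 = congruent, 1 = incongruent)
rt = (600
      + subj_speed[:, None, None]
      + item_diff[None, :, None]
      + true_effect * np.array([0, 1])[None, None, :]
      + rng.normal(0, 60, (n_subjects, n_items, 2)))

# By-subjects (F1) analysis: average over items, paired test across subjects
subj_means = rt.mean(axis=1)                   # shape (n_subjects, 2)
t1, p1 = stats.ttest_rel(subj_means[:, 1], subj_means[:, 0])

# By-items (F2) analysis: average over subjects, paired test across items
item_means = rt.mean(axis=0)                   # shape (n_items, 2)
t2, p2 = stats.ttest_rel(item_means[:, 1], item_means[:, 0])

print(f"by subjects: t({n_subjects - 1}) = {t1:.2f}, p = {p1:.4f}")
print(f"by items:    t({n_items - 1}) = {t2:.2f}, p = {p2:.4f}")
```

Only if the effect survives the by-items test (or its mixed-model equivalent) are we entitled to generalize beyond the particular words we happened to pick.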

It is logically impossible to analyze the generalizability across items for one-shot studies. And I guess that this is what makes me uncomfortable about them and what has prompted the proposal I made in my previous post. According to this proposal, in the verbal-overshadowing study a direct replication of the original study would be one item pair, for example red-red. Another study would include a video and line-up that meet pre-specified constraints; this would constitute the second item pair (green-green) and so on. The next step would involve taking a meta-perspective. A composite effect size of all the experiments (or something like it) would be the decisive test of the effect.

This is the analogy I am thinking of. Dan’s post made me realize that in a way you could call green-green a conceptual replication of red-red. However, I prefer to think of it as assessing the validity of a finding across a set of items that are pulled from a pool of possible items. Such a concerted replication effort plus meta-analytic approach would have higher validity than a set of direct replications and the loss in reliability would be relatively small.
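To illustrate what such a concerted-replication-plus-meta-analysis step might look like numerically, here is a minimal fixed-effect, inverse-variance-weighted combination of per-study effect sizes in Python. The study results below are entirely made up, and a real analysis would more likely use a random-effects model and a dedicated meta-analysis package; the point is only that the decisive quantity is the composite estimate, not any single study’s effect.

```python
import numpy as np
from scipy import stats

# Hypothetical replication results: each study fills the "slots" (video, line-up)
# differently. d = standardized difference favoring the control condition.
studies = [
    {"d": 0.45, "n1": 50, "n2": 50},
    {"d": 0.10, "n1": 80, "n2": 80},
    {"d": 0.30, "n1": 60, "n2": 60},
    {"d": 0.22, "n1": 70, "n2": 70},
]

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

ds = np.array([s["d"] for s in studies])
vs = np.array([d_variance(s["d"], s["n1"], s["n2"]) for s in studies])
w = 1 / vs                                     # inverse-variance weights

d_combined = np.sum(w * ds) / np.sum(w)        # composite effect size
se_combined = np.sqrt(1 / np.sum(w))
z = d_combined / se_combined
p = 2 * stats.norm.sf(abs(z))

print(f"composite d = {d_combined:.3f} (SE = {se_combined:.3f}), z = {z:.2f}, p = {p:.4f}")
```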

To be sure, I am not advocating following this approach instead of performing direct replications. Rather, I propose that we do this in addition to direct replications, possibly as a next step for findings that have proven directly replicable. Obviously, this approach is especially relevant with regard to one-shot studies but it might be applied more broadly.

Reliability is important but so is validity. Ultimately, we want to know how strong our theories are.

Monday, June 3, 2013

How Valid are our Replication Attempts?


Direct replications are very useful, especially given the current state of our field. However, direct replications do have their limitations.

The other day, I was talking with my colleagues Samantha Bouwmeester and Peter Verkoeijen about the logistics of a direct replication that we have signed on to do for Perspectives on Psychological Science. We ended up agreeing that a direct replication informs us about the original finding but not so much about the theory that predicted it. We're obviously not the only ones who are aware of this limitation of direct replications, but here is the gist of our discussion infused with some of my afterthoughts.

We are scheduled to perform a direct replication of Jonathan Schooler’s verbal overshadowing effect. In the original study, subjects were shown a 30-second video clip of a bank robbery. Subsequently, they either wrote down a description of the robber’s face (experimental condition), or they listed the names of the capitals of American states (control condition). Then the subjects solved a crossword puzzle. Finally, they had to pick the bank robber’s face out of a line-up. The subjects in the experimental condition performed significantly worse than those in the control condition, an effect that was attributed to verbal overshadowing.

In this replication project we—along with several other groups—are following a protocol that is modeled after the original study. This makes perfect sense given that we are trying to replicate the original finding. The protocol requires researchers to test subjects between the ages of 18 and 25. They will be shown the same 30-second video clip as was shown in the original study. They will also be shown the same line-up pictures as in the original study. The experiment will be administered in person rather than online.

My colleagues and I wondered how many of these requirements are intrinsic to the theory. For example, the theory does not postulate that verbal overshadowing only occurs in 18-25 year olds. In fact, it would be bordering on the absurd to predict that a 25-year-old will fall prey to verbal overshadowing whereas a 26-year-old will not. Verbal overshadowing is a theory about different types of cognitive representations (verbal and visual) and the conditions under which they interfere with one another. So what do we buy by limiting the sample to a specific age group? It is clear that we are not testing the theory of verbal overshadowing; rather, we are testing the reproducibility of the original finding, which is not the same as testing whether that finding says something useful about the theory.

Let’s look at the protocol again. As I just said, the control condition (which, incidentally, was not described in the original study but is described in the protocol) is one in which subjects generate the capitals of American states. The idea behind the control condition evidently is to give the subjects something to do that involves retrieval from memory and language production, which is what they are assumed to do in the experimental condition as well.

But a nitpicker like me might argue that even if you find a difference between the experimental condition and the control condition—which the original study did and which the replication attempts might find as well—this does not provide evidence of overshadowing. Perhaps it is merely the task of describing something—whatever it is—that is responsible for the effect, and not the more specific task of describing the robber’s face. I’m not saying this is true, but we won’t be able to rule it out empirically.

A better control condition might be one in which subjects are required to describe some target that was not in  the video they just saw. For example, they could describe a famous landmark or the face of a celebrity. After all, the theory is not that describing per se is responsible for the effect. The theory is that describing the face that you’re supposed to recognize later from a line-up is going to interfere with your visual memory for that particular face.

So even if all of our replication attempts nicely converge on the finding that the control condition outperforms the experimental condition (and effect sizes are similar), this does not necessarily mean that we’ve strengthened the support for verbal overshadowing. It is still possible that a third condition in which people describe something other than the bank robber would also perform more poorly than the state capital condition. This would lead to the conclusion that simply describing something, anything really, causes verbal overshadowing.

So the question is what we want to achieve with replications. Replications as they are being (or about to be) performed right now—in Perspectives on Psychological Science (PoPS), the Open Science Framework, or elsewhere (e.g., here and here)—inform us about the reproducibility of specific empirical findings. In other words, they tell us about the reliability of those findings. They don’t tell us much about their validity. Direct replications largely have a meta-function by providing insight into the way we do experiments. It is extremely useful to conduct direct replications and I think the editors of PoPS have done an excellent job in laying out the rules of the direct replication game.

But let’s take a look at this picture that I stole from Wikipedia. Even if all replication attempts reproduce the original finding, we might be in a situation represented by the lower left panel. Sure, all of our experiments show similar effects but none have hit the bull’s eye. The findings are reliable but not valid. Where we want to be is in the bottom right panel where high reliability is coupled with high validity.

How do we get there? Here is an idea: by extending the protocol-based paradigm. For example, a protocol could be extracted from the work on verbal overshadowing that is consensually viewed as the optimal or most incisive way to test this theory. This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then need to specify the criteria for how each slot should be filled.

We’d want the slots to be filled slightly differently across studies; this would prevent the effect from being attributable to quirks of the original stimuli and thus enhance the validity of our findings. To stick with verbal overshadowing, across studies we’d want to use different videos. We’d also want to use different line-ups. By specifying the constraints that stimuli and subjects need to meet we would end up with a better understanding of what the theory does and does not claim. 
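As a toy illustration of what a script-with-slots protocol could look like in practice, here is a short Python sketch. All field names, constraints, and values are hypothetical; the point is simply that a protocol can fix the procedure while explicitly stating which slots may vary and what any filler must satisfy.

```python
from dataclasses import dataclass, field

@dataclass
class Slot:
    """One fillable slot in the protocol, plus the constraints any filler must meet."""
    name: str
    constraints: dict  # maps a property name to a predicate over the proposed value

@dataclass
class Protocol:
    """A script-like protocol: a fixed procedure plus slots that replications may vary."""
    effect: str
    procedure: list                       # ordered steps every replication keeps fixed
    slots: list = field(default_factory=list)

    def violated_constraints(self, slot_name, filler):
        """Return the constraints that a proposed filler fails to satisfy."""
        slot = next(s for s in self.slots if s.name == slot_name)
        return [k for k, ok in slot.constraints.items()
                if k not in filler or not ok(filler[k])]

# A sketch of what a verbal-overshadowing protocol might contain (values invented).
overshadowing = Protocol(
    effect="verbal overshadowing",
    procedure=["watch video of a crime",
               "describe the perpetrator's face OR do an unrelated verbal task",
               "filler task",
               "identify the perpetrator in a line-up"],
    slots=[
        Slot("video", {"duration_s": lambda x: 20 <= x <= 60,
                       "face_clearly_visible": lambda x: x is True}),
        Slot("line_up", {"n_faces": lambda x: 6 <= x <= 10,
                         "includes_target": lambda x: x is True}),
        Slot("sample", {"age_min": lambda x: x >= 18,
                        "normal_or_corrected_vision": lambda x: x is True}),
    ],
)

# A replication team proposes a new video; the protocol says whether it qualifies.
proposed_video = {"duration_s": 45, "face_clearly_visible": True}
print(overshadowing.violated_constraints("video", proposed_video))  # [] -> acceptable
```

Specifying the slots this explicitly would also make it harder for an author to claim, after the fact, that the effect only occurs under conditions nobody else can reproduce.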

So while I am fully supportive of (and engaged in) direct replication efforts, I think it is also time to start thinking a bit more about validity in addition to reliability. In the end, we’re primarily interested in having strong theories.