Wednesday, January 27, 2016

Linguistic Cues and Firing Guns: Some Background on a Registered Replication Report

Today the second Registered Replication Report (RRR) was published in Perspectives on Psychological Science.  I am one of its authors. Here is some background on the project. I'll take you on a brief trip across three fields: linguistics, cognitive psychology, and social psychology. I'll discuss the results of the RRR and conclude with some lessons learned.

Language is a tool with which we can “shape events in each other’s brains with exquisite precision” [1]. Linguistic analysis shows how subtle a tool language is. Take for example grammatical aspect. We can describe the same event by saying He was running to the finish line or by saying He ran to the finish line. Linguists have argued that these utterances “construe” the event differently [2]. Whereas the past progressive (was running) opens up the internal structure of the event, the simple past (ran) describes the event as completed.

To get a better feel for this distinction, let's append a clause to the sentence: He was running to the finish line but he tripped and broke his ankle. This makes perfect sense—he was in the middle of something and then something else happened and he never got to finish the first thing. But now let's append the clause to the perfective sentence: He ran to the finish line but he tripped and broke his ankle. This doesn’t make sense; he’s already at the finish line and so the event’s internal structure is not accessible. There is no room, so to speak, for him to trip and break his ankle.

Unlike linguists, cognitive psychologists are charged with figuring out whether such distinctions impact language processing. A long time ago, I took a stab at this. Take the sentence He stirred/was stirring his coffee. The action of stirring coffee normally involves a spoon. The idea was to measure priming for the word spoon as a function of grammatical aspect, the hypothesis being that there should be more priming after was stirring than after stirred. My student and I found initial evidence for this hypothesis: subjects responded faster to spoon after a was stirring sentence than after a stirred sentence (we obviously used many such sentences). We presented the findings at a conference [3]. The plan was to conduct follow-up experiments, but all of a sudden my student informed me of his decision to move back to his home state and the project was abandoned.

Fortunately, other researchers took the idea and ran with it [4]. The evidence showed that grammatical aspect indeed seems to modulate how much we activate about an event. For example, there is more activation for arena after was skating than after skated. My student Carol Madden and I later found that comprehenders focus on the endpoint of the action in the perfective but don’t show a clear temporal focus with the imperfective [5]. Joe Magliano and his student showed that in-progress activities had a higher likelihood of being perceived as ongoing in the subsequent context than completed activities [6].

Social psychologists have taken this work one step further [7]. The researchers hypothesized that grammatical aspect can modulate the attribution of intentionality, which could be crucial in legal cases. In their key vignette, the protagonist fired (or was firing) a gun. Varying grammatical aspect had a large effect on judgments of criminal intentionality, as well as on two other dependent measures. For example, criminal intentionality was rated higher in the imperfective case (was firing) than in the perfective case (fired).

At a conference in Seattle in 2011, a group of us were talking about this study. We found it an interesting extension of our own work on aspect. We were also somewhat puzzled by the effect sizes reported in the paper, given the more modest effect sizes usually found in cognitive experiments and the fact that the intention attribution presupposes a cognitive representation of the event. If anything, you’d expect the more indirect effect to be smaller than the more direct effect.

We reasoned that the indirect effect might be relatively large for three reasons. First, the study only used one vignette and had a between-subjects design. Had a cognitive psychologist designed the experiment, they’d probably have had each subject evaluate something like 24 vignettes, 12 per aspect condition, with a counterbalancing scheme and an equal number of fillers thrown in to hide the manipulation. We suspected that within-subjects designs might depress the effect sizes in cognitive experiments: subjects somehow get used to the manipulation, so that its effects weaken over time. What spoke against this is that in informal analyses of other experiments, we’d never seen much of a decline of the effect across trials, just a general speed-up. But maybe knowing ahead of time whether you’d be reading a single vignette or 48 of them made the difference.
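To make the contrast with the between-subjects design concrete, here is a minimal sketch of the kind of counterbalancing scheme a cognitive psychologist would use. All names and counts are illustrative, not taken from any of the actual experiments: each critical item appears in the imperfective condition on one presentation list and in the perfective condition on the other, so every subject sees 12 items per aspect condition without ever seeing both versions of the same item, and fillers dilute the manipulation.

```python
import random

def build_lists(n_items=24, n_fillers=24):
    """Build two counterbalanced presentation lists: each critical item is
    imperfective on one list and perfective on the other, giving every
    subject 12 trials per aspect condition plus an equal number of fillers."""
    items = [f"item_{i:02d}" for i in range(n_items)]
    lists = {"A": [], "B": []}
    for i, item in enumerate(items):
        # Alternate which list gets the imperfective version of each item.
        if i % 2 == 0:
            first, second = "imperfective", "perfective"
        else:
            first, second = "perfective", "imperfective"
        lists["A"].append((item, first))
        lists["B"].append((item, second))
    # Fillers hide the manipulation; they are identical on both lists.
    fillers = [(f"filler_{i:02d}", "filler") for i in range(n_fillers)]
    for lst in lists.values():
        lst.extend(fillers)
        random.shuffle(lst)  # randomize trial order per subject
    return lists

lists = build_lists()
```

Across the two lists, every item contributes data to both conditions, so item idiosyncrasies cannot masquerade as an aspect effect; this is exactly what a single-vignette between-subjects design cannot rule out.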

Second, vignettes in cognitive psychology characteristically are rather bland affairs (people stirring coffee or skating on ice rinks), whereas here we were dealing with extreme violence. Maybe higher emotional content generates larger effects.

And third, the cognitive experiments assessed response times and one cannot reliably assess response times in a single-observation between-subjects design. We speculated that maybe ratings were more stable measures.

We decided to build on this research given our combined interests in language and legal decision making (some of us even blogged about it), using the same paradigm but larger samples of subjects. We were particularly interested in manipulating the provocation of a violent event. Would people judge the crime differently as a function of the severity of the provocation? We manipulated severity via grammatical aspect. And we were interested in verdicts. Would people’s ultimate judgments of first- vs. second-degree murder be swayed by verb aspect? To cut a long story short, the evidence for intentionality was disappointingly weak and mixed, although we found strong effects on a cognitive measure [8]. Based on the results reported in [7], we’d expected to find more robust effects, given that we used upwards of 2.5 times their number of subjects.

Was it the case, perhaps, that people were insensitive to aspect altogether in this paradigm? To examine this, we included (from Experiment 2 onwards) a measure that asked subjects: how many times did X hit Y? Here we consistently found highly robust effects of grammatical aspect. If the progressive was used (e.g., was hitting), subjects estimated more instances of hitting than with the simple past (e.g., hit). Evidently, people were sensitive to grammatical aspect in this paradigm but this did not extend convincingly to intentionality judgments. Because of this, our fourth experiment was a close replication of Experiment 3 in [7]. We found no effect.

At this point, we decided to take a few steps back to check the foundations on which we were building. We initiated a registered replication project, performing direct replications of Experiment 3 in [7]. The results have now been published in Perspectives on Psychological Science [9].

The forest plot below represents one of the three main dependent measures, criminal intentionality. It shows the original study at the top (the difference between the simple past and progressive in terms of intentionality ratings). Underneath are the 12 replication studies. The meta-analytic effect (which excluded the MTurk study and the original experiment, as dictated by the protocol) is practically zero; if anything, it goes in the opposite direction from what was predicted. [Note how the one MTurk study is spot on with regard to the meta-analytic effect, whereas the original effect is not.] None of the replication experiments showed an effect in the predicted direction, although two showed an effect in the opposite direction. The same pattern emerged for the two other dependent measures, intention attribution and detailed processing: no effects in the predicted direction.
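The pooling behind such a forest plot can be sketched in a few lines. The numbers below are purely illustrative, not the RRR data, and the sketch uses a simple inverse-variance fixed-effect model for clarity (the RRR protocol specifies the actual meta-analytic model): each study's effect is weighted by the inverse of its squared standard error, so large, precise studies dominate the pooled estimate.

```python
import math

def fixed_effect_meta(effects, ses):
    """Inverse-variance-weighted (fixed-effect) meta-analytic estimate:
    weight each study by 1/SE^2, then compute the weighted mean effect
    and the standard error of that pooled estimate."""
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Illustrative numbers only: small effects scattered around zero
# pool to an estimate near zero with a narrower interval than any
# single study provides.
effects = [0.10, -0.05, 0.02, -0.12, 0.04, -0.03]
ses = [0.15, 0.14, 0.16, 0.15, 0.13, 0.14]
est, se = fixed_effect_meta(effects, ses)
ci = (est - 1.96 * se, est + 1.96 * se)
```

This is why a dozen null results are so much more informative than one: the pooled standard error shrinks with every added study, so a true effect of any notable size would be very unlikely to hide inside the resulting interval.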



There was one change made to the original vignette. The editor, Alex Holcombe, noted that it is odd to state he was pulling out and he was pointing because you cannot be doing both at the same time. This makes sense in light of the description of the imperfective aspect that I gave at the beginning of this post. In the vignette, was pulling indicates that we're in the middle of a pulling event. Given the temporal dependency between pulling and pointing--you first have to finish pulling a gun before you can point it--you cannot be pulling and pointing at the same time. So the proper way to convey that the action was completed is to use the perfective, pulled. Both the original authors and replicators agreed with the editor's proposed change to the protocol. After all, it would be silly if the original effect hinged on a nonsensical phrase.

And the evidence we have suggests that it doesn't. As the Replication Report mentions, one more replication was run, by Chris Kurby. This experiment used the original text. It could not be included in the meta-analysis because it was not preregistered, but it is available on the Open Science Framework. This study did not yield significant effects for any of the three dependent measures (all ps > .28). For example, the means for criminal intentionality were 3.79 for the imperfective and 3.86 for the perfective, which is in line with the meta-analytic effect shown above.

So what are we to make of this? Evidently, the large original effect was thoroughly non-replicated. Yet our own findings, with much more power, showed some, but inconsistent, effects of grammatical aspect on intentionality judgments and legal decision making, although not in the experiment (Experiment 4) that was modeled after Hart and Albarracin (2011). At the same time, we found strong evidence that our subjects were sensitive to the semantics of the aspect manipulation. This sensitivity, however, clearly failed to extend consistently to intentionality judgments, most notably in experiments that were modeled after the original experiment.

So I think the most reasonable conclusion at this point is this. Linguistic analysis and cognitive processing data suggest that grammatical aspect can have an impact on event comprehension. However, we need to think of a different paradigm to investigate whether these effects extend to social cognition.

There are several lessons we can draw from this.
  1. First conduct a direct replication before you decide to build on a finding. Of course, Richard Feynman already said this (pp. 12-13) a long time ago.
  2. Even if you find large cognitive effects based on a linguistic manipulation, these do not readily translate to consistent effects at the social level. We need to know more about the mechanisms involved.
  3. A single nonreplication may not mean much, but a dozen surely does.
  4. Don’t call something by the wrong name in a scientific contribution, as I did by saying verb aspect rather than grammatical aspect, because later authors will copy your mistake (see references).

References

[1] Pinker, S. (1994). The Language Instinct. New York, NY: Harper Perennial Modern Classics, p. 15.

[2] Langacker, R.W. (1987). Foundations of cognitive grammar. Vol. 1, Theoretical prerequisites. Stanford, CA: Stanford University Press.

[3] Truitt, T. P., & Zwaan, R. A. (1997, November). Verb aspect affects the generation of instrument inferences. Paper presented at the 38th Annual Meeting of the Psychonomic Society, Philadelphia, PA.



[6] Magliano, J. P., & Schleich, M. C. (2000). Verb aspect and situation models. Discourse Processes, 29, 83-112.

[7] Hart, W., & Albarracín, D. (2011). Learning about what others were doing: Verb aspect and attributions of mundane and criminal intent for past actions. Psychological Science, 22, 261-266. doi: 10.1177/0956797610395393

Tuesday, December 8, 2015

Stepping in as Reviewers

Some years ago, when I served on the Academic Integrity Committee investigating allegations of fraud against Dirk Smeesters, it fell upon me to examine Word documents of some of his manuscripts (the few that were not “lost”). The “track changes” feature afforded me a glimpse of earlier versions of the manuscripts as well as of comments made by Smeesters and his co-authors. One thing that became immediately obvious was that while all authors had contributed to the introduction and discussion sections, Smeesters alone had pasted in the results sections. Sometimes, the results elicited comments from his co-authors: “Oh, I didn’t know we also collected these measures” to which Smeesters replied something like “Yeah, that’s what I routinely do.” Another comment I vividly remember is: “Wow, these results look even better than we expected. We’re smarter than we thought!” More than a little ironic in retrospect.

On the one hand I found these discoveries reassuring. I had spent many hours talking in person or via Skype with some of Smeesters’ co-authors. Their anguish was palpable and had given even me a few sleepless nights. The Word documents seemed to exonerate them. We had asked Smeesters to indicate for each study who had had access to the data, which he had dutifully done. For each study deemed problematic, he indicated he had sole access to the data and the Word documents confirmed this. I was relieved on behalf of the co-authors.

On the other hand, I found the co-authors’ lack of access to the data disturbing. You could fault them for apparently being uninterested in seeing the data and Smeesters for not sharing them. But how common is it to share data among co-authors anyway, I wondered? Smeesters obviously had his reasons for not sharing the data, but there are also far more innocent reasons why co-authors may not want to share data. For example, researchers may find it unpleasant to have somebody looking over their shoulder, as it might imply a perceived lack of competence on their part. “I’m a Ph.D. now, I can analyze my own data, thank you very much.” Not wanting to cause offense may make co-authors reluctant to ask for the data. Sometimes, the researcher analyzing the data may have used idiosyncratic steps in the process that are not easy for others to follow. Sharing the data would be onerous for such a person because it would require making every step that is second nature to them explicit for the benefit of someone else. The perceived burdensomeness of the task could be another barrier against sharing data.

If there are barriers against sharing data among co-authors, then one might expect that the barriers against sharing data with third parties, such as reviewers, and other interested researchers are substantially higher. Indeed, this turns out to be the case in psychology, even after the turmoil that the field has recently gone through.

It seems that we like to play it close to the vest where our data are concerned. But science is not a poker game. When we take a few steps back from our own concerns, this becomes clear. We need to back up our claims with data and not with a poker face. We also have a responsibility towards our fellow researchers. Sure, they may be our competitors in some respects, but together we’re in the business of knowledge acquisition. This process is greatly facilitated when there is open access to data. And finally, we have a responsibility to society at large, which funds our research.

For these reasons, I’m proud to be part of the Peer Reviewers’ Openness Initiative. The basic idea behind the Initiative is that reviewers can step in to enhance the openness of our science. They do this by pledging not to offer comprehensive review for, nor recommend the publication of, any manuscript that does not meet several minimal requirements, which you can find on the website. I’ll just highlight three of them here.

(1) The data should be made publicly available.

We just discussed this.

(2) The stimuli should be made publicly available.

Just as we all benefit from access to data, we also benefit from access to stimulus materials. I cannot speak for other areas in psychology but in cognitive psychology, the sharing of stimuli has been common for decades. Back in the day it was not possible to have a printed journal article with a 10-page appendix with stimulus materials. Authors would provide a few sample stimuli and there would be a note that the complete materials were available from the corresponding author upon request. In my experience, the stimuli were always promptly sent when requested. Since the advent of the internet, there are no physical or financial limits to posting stimuli. At least for cognitive psychologists, therefore, this second PRO requirement should not be different from what is already common practice in the area.

(3) If there are reasons why data and/or stimuli cannot be shared, these should be specified.

It is important to note here that under the PRO Initiative, reviewers provide no evaluation of these reasons. In other words, under the PRO Initiative, reviewers are by no means arbiters of what counts as a valid reason and what does not. The only requirement is that the reasons become part of the scientific record.

My father was a chain smoker for most of his life until he declared at one point:  “smoking is a filthy habit!” (“You can say that again!” I remember replying.) After this epiphany, my father never touched a cigarette again. I hope that the PRO Initiative will contribute to the field reaching a similar epiphany about lack of openness.  

If you’ve already had this epiphany, you may wish to sign the Initiative here.

For other views related to the Initiative, see blog posts by Richard Morey and Candice Morey.


Thursday, June 25, 2015

Diederik Stapel and the Effort After Meaning

Sir Frederic, back when professors still looked like professors.
Take a look at these sentences:

A burning cigarette was carelessly discarded.
Several acres of virgin forest were destroyed.

You could let them stand as two unrelated utterances. But that’s not what you did, right? You inferred that the cigarette caused a fire, which destroyed the forest. We interpret new information based on what we know (that burning cigarettes can cause fires) to form a coherent representation of a situation. Rather than leaving the sentences unconnected, we impose a causal connection between the events described by the sentences.

George W. Bush exploited this tendency to create coherence by continuously juxtaposing Saddam and 9-11, thus fooling three-quarters of the American public into believing that Saddam was behind the attacks, without stating this explicitly.

Sir Frederic Bartlett proposed that we are continuously engaged in an effort after meaning. This is what remembering, imagining, thinking, reasoning, and understanding are: efforts to establish coherence. We try to forge connections between what we see and what we know. Often, we encounter obstacles to coherence and we strive mightily to overcome them. 


Take for example the last episode of Game of Thrones. One of the characters, Stannis Baratheon, barely survives a battle and is shown wounded and slumped against a tree. Another character strikes at him with a sword. But right before the sword hits, there is a cut to a different scene. So is Stannis dead or not? This question is hotly debated in news groups (e.g., in this thread). The vigor of the debate is testament to people's intolerance for ambiguity and their effort after meaning.

Stannis Baratheon: will he make it or not?

The arguments pro or contra Stannis being dead are made at different levels. Some people try to resolve the ambiguity at the level of the scene. No, Stannis could not have been killed: the positioning of the characters and the tree suggests that the sword would have struck the tree rather than Stannis. Other people jump up to the level of the story world. No, Stannis cannot be dead because his arc is not complete yet. Or: yes, he is dead because there is nothing anymore for him to accomplish in the story—let’s face it, he even sacrificed his own daughter, so what does he have left to live for! Yet other people take the perspective of the show. No, he is not dead because so far every major character on the show that is dead has been shown to have been killed; there are no off-screen deaths. Finally, some people take a very practical view. No, Stannis cannot be dead because the actor, Stephen Dillane, is still under contract with HBO.

The internet is replete with discussions of this type, on countless topics, from interpretations of Beatles lyrics to conspiracy theories about 9-11. All are manifestations of the effort after meaning.

Science is another case in point. In a recent interview in the Chronicle for Higher Education, Diederik Stapel tries to shed light on his own fraud by appealing to the effort after meaning:

I think the problem with some scientists […], is you’re really genuinely interested. You really want to understand what’s going on. Understanding means I want to understand, I want an answer. When reality gives you back something that’s chaos and is not easy to understand, the idea of being a scientist is that you need to dig deeper, you need to find an answer. Karl Popper says that’s what you need to be happy with — uncertainty — maybe that’s the answer. Yet we’re trained, and society expects us to give an answer.

You don’t have to sympathize with Stapel to see that he has a point here. Questionable research practices are ways to establish coherence between hypothesis and data, between different experiments, and between data and hypothesis. Omitting nonsignificant findings is a way to establish coherence between hypothesis and data and among experiments. You can also establish coherence between data and hypothesis simply by inventing a new hypothesis in light of the data and pretending it was your hypothesis all along (HARKing). And if you don’t do any of these things and submit a paper with data that don’t allow you to tell a completely coherent story, your manuscript is likely to get rejected.

So the effort after meaning is systemic in science. As Stapel says, when nature does not cooperate, there is a perception that we have failed as scientists. We have failed to come up with a coherent story and we feel the need to rectify this. Because if we don't, our work may never see the light of day.

Granted, data fabrication is taking the effort after meaning to the extreme--let’s call it the scientific equivalent of sacrificing your own daughter. Nevertheless, we would do well to acknowledge that as scientists we are beholden to the effort after meaning. The simple solution is to arrange our science such that we let the effort after meaning roam free where it is needed—in theorizing and in exploratory research—and curb it where it has no place, in confirmatory research. Preregistration is an important step toward accomplishing this.

Meanwhile, if you want to give your effort after meaning a workout, don’t hesitate to weigh in on the Stannis debate.