Wednesday, January 2, 2013

Replicating pencils and eagles but not steaks

What happens when we understand language?  What kind of mental representations do we form? Are they word-like or more perception-like? Traditionally, cognitive psychology has been leaning toward the first option but a little more than a decade ago, people started thinking that the second option might actually be closer to the mark.

Intuitively, the second idea makes a great deal of sense. I recently saw The Hobbit and it looked a lot the way I had imagined it during reading. Okay, perhaps I hadn’t pictured the wizard Radagast’s hair being plastered with bird shit when I read the book but otherwise the resemblance was pretty close. It sure looks like we create perception-like representations when we read.

How to test this intuition?
In my lab we came up with the sentence-picture verification task. Subjects read a sentence like “He put the pencil in the cup” and then saw a picture of a pencil. The pencil could be oriented either vertically or horizontally. A vertical pencil is more consistent with the sentence than a horizontal one, but this is irrelevant to the task we asked the subjects to perform. They merely decided whether or not the pictured object was mentioned in the sentence.

If people create word-like representations, the pencil’s orientation should be irrelevant but if they create perception-like representations, they should be faster in responding to the matching pencil than to the mismatching one.

What did the data tell us?

In the first experiment we ever did along these lines, my graduate student Rob Stanfield and I found support for perception-like representations. Subjects were faster on matching trials than on mismatching ones. We then tried to explore the limits of this effect. If it works for orientation, would it also work for shape? For example, if subjects read “The ranger saw the eagle in the sky,” would they respond faster to a flying eagle than to a perched one? They did.

Others started getting into the game as well. One study wondered about color. Subjects read “John looked at the steak in the butcher's window” and then judged a picture of a steak that was either raw or cooked. And here something interesting happened. In contrast to the earlier studies, subjects now responded faster to the mismatching pictures than to the matching ones. Evidently, the sentences still influenced responses to the pictures, but in the opposite way.

These are the three experiments that we set out to replicate. The goal was to get a better understanding of the role of perception-like mental representations in language comprehension as viewed through the lens of sentence-picture verification.

Penciling in the results
The original orientation study found a small match effect. In both of our replications, we also found a small, but significant effect. But as I mentioned in my previous post, this doesn’t mean much if you have a large sample. And, in fact, the Bayes factor suggested the evidence was ambiguous at best for both replication attempts. However, when we combined the two experiments, the Bayes factor clearly pointed towards the alternative hypothesis. The evidence that people responded faster to the matching picture was about 25 times stronger than the evidence for the Null hypothesis.
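To see why a small but significant effect can still amount to ambiguous evidence in a large sample, here is a rough sketch in Python. It uses the BIC approximation to the Bayes factor for a one-sample (or paired) t test, which is an assumption made for illustration and not necessarily the Bayes factor computed in our paper. The function name and the t value of 2.2 are hypothetical.

```python
import math


def bic_bayes_factor_10(t: float, n: int) -> float:
    """Approximate BF10 (evidence for H1 over H0) for a one-sample or
    paired t test, using the BIC approximation:

        BF10 ~= (1 + t^2 / (n - 1)) ** (n / 2) / sqrt(n)

    This is an illustrative approximation, not the analysis from the paper.
    """
    if n < 2:
        raise ValueError("need at least two observations")
    # Work on the log scale to avoid overflow for large n.
    log_bf10 = (n / 2) * math.log1p(t * t / (n - 1)) - 0.5 * math.log(n)
    return math.exp(log_bf10)


if __name__ == "__main__":
    # The same "significant" t value, with increasingly large samples.
    for n in (20, 200, 2000):
        print(n, round(bic_bayes_factor_10(2.2, n), 3))
```

With the t value held fixed, the approximate Bayes factor shrinks as the sample grows, and for the largest sample here it drops below 1, meaning the data lean toward the Null hypothesis even though the t test would be called significant.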

It required a lot of power to replicate the effect. Yet, we didn’t have this much power in the original study. Did we fudge the data? Did we use “researcher degrees of freedom”? No, we didn’t. In fact, we didn’t throw out any response times at all because we used median response times. We also didn’t run several experiments and just report the “best” one (which seems a pretty silly thing to do anyway, if you ask me). We were just lucky. Had we not stumbled upon the effect, I’m pretty sure we would have abandoned the entire enterprise, especially because at the time I was still fully on board with the traditional cognitivist notion of word-like mental representations.
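The point about medians can be made concrete with a small sketch. The code below computes a match effect from per-subject median response times, so that a single very slow trial does not need to be thrown out. The data layout, the function name, and the numbers are all made up for illustration; this is not our actual analysis script.

```python
from statistics import mean, median


def match_effect(trials):
    """Return the mean across subjects of
    (median mismatch RT - median match RT), in milliseconds.

    `trials` is an iterable of (subject, condition, rt_ms) tuples, where
    condition is "match" or "mismatch". Every subject is assumed to have
    at least one trial in each condition (otherwise a KeyError is raised).
    """
    by_cell = {}
    for subject, condition, rt_ms in trials:
        by_cell.setdefault((subject, condition), []).append(rt_ms)

    subjects = sorted({subject for subject, _ in by_cell})
    differences = [
        median(by_cell[(s, "mismatch")]) - median(by_cell[(s, "match")])
        for s in subjects
    ]
    return mean(differences)


# Hypothetical data: subject s1 has one extreme 2500 ms trial.
example_trials = [
    ("s1", "match", 600), ("s1", "match", 620), ("s1", "match", 2500),
    ("s1", "mismatch", 640), ("s1", "mismatch", 650), ("s1", "mismatch", 660),
    ("s2", "match", 690), ("s2", "match", 700), ("s2", "match", 710),
    ("s2", "mismatch", 720), ("s2", "mismatch", 730), ("s2", "mismatch", 740),
]

if __name__ == "__main__":
    print(match_effect(example_trials))
```

A positive value means faster responses to matching pictures. Had means been used instead of medians, the one 2500 ms trial would have reversed the sign for subject s1 unless it was trimmed.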

Shaping up to be a solid effect
The original shape experiment produced a convincing effect. We replicated this effect twice. The evidence for the alternative hypothesis was at least 100 times stronger than for the Null hypothesis. A resounding victory? Maybe. But one puzzling thing is that the effect size of the replications was about half that of the original experiment. We’re not sure why. There was less variability in the original lab-based experiment than in our Mechanical Turk replications. Also, the original experiment had a smaller sample and might thus have shown a more extreme effect.

Turning colors
The most interesting effect occurred in the color studies. Our replications yielded an almost perfect mirror image of the original results. The original results showed that mismatching pictures were responded to faster than matching pictures. Our replications showed the opposite: matching pictures were responded to faster, in line with the orientation and shape studies.

Throughout the process, we have been in contact with Louise Connell, the author of the original color study, who has been very helpful by providing her original stimulus materials and by giving useful feedback on our results and manuscript. Did someone make a labeling error? We went back to our data and couldn’t find anything wrong. Louise couldn’t find anything wrong with the labeling in her data either. Do Mechanical Turk experiments for some weird reason show the opposite of lab experiments? The orientation and shape experiments suggest this is not the case. The original experiment was run with English subjects and the replications with Americans (at least, people residing in the US). So is it just a You say "potato," I say "patattah" thing? Unlikely.

The color study has considerably fewer items than the shape and orientation experiments. It also yielded longer response times. It is possible that this makes the color data less stable.  Might this explain the reversal of the pattern? We don’t know. We talked with Louise Connell about a collaborative effort to further investigate this question.

And in the end
This walk down replication lane has proven useful. We now have more confidence that when language implies features of objects, such as their orientation, shape, or color, responses to subsequent pictures are facilitated (at least in the context of our experimental tasks). This is consistent with the idea that we form perception-like mental representations while reading. Perhaps a far more complex form of this is at work when we read novels like The Hobbit. This would explain why we experience a déjà vu when the movie adaptation shows us the lands of Middle Earth.

We also have developed a powerful and relatively fast way of doing replication studies that removes experimenter effects, has safeguards against false positives, and uses a broader sample of the population than the typical lab-based experiment. We are currently performing other replication studies with this method. And of course we’re also using it to study novel ideas. I will describe those in a later post.


  1. You note in the paper that Connell was happy that her original results supported an embodied account, even though they were the reverse of the typical pattern. You then find the typical pattern, and are happy that this is evidence for embodied cognition.

    My question is, what are your thoughts about what this says about this style of embodied cognition? Two published papers reporting opposite patterns of results both claim to support the same hypothesis, and no one seems that worried (well, except me :). Do you think this is a problem? If so, how big a problem, and if not, why not?

  2. It would not be appropriate for me to be open about the review process (many years ago I was a reviewer on Connell's CogSci paper). But let's just say that, like you, I was not at all happy that a finding that was the opposite of earlier findings was interpreted as supporting the same hypothesis.

    In Connell's defense, the amodal view predicts a Null effect so the fact that she got an effect suggests that reading the sentence influenced judging the pictures.

    In her second paper on this topic (of which I was not a reviewer), Connell points out why color might be different from shape and orientation. I think this is an interesting hypothesis but as we say in our replication paper, we're not totally convinced.

    So our paper basically solves the problem and your and my concern: all patterns are in the same direction.

  3. To play devil's advocate for a moment: you say your paper solves the problem. Why isn't the score 1-all between results that make sense and results that don't?

    The reason I ask is that I'm not that impressed by small, fragile effects; I think they are a hint that you've asked the wrong question. I'm interested in what people actually doing this research feel about it; are you worried that the effect bounced around, or are you happy now it's going in the expected direction?

    Sorry if this comes off as a bit rude; I'm genuinely interested in the answer but the question is inherently fairly pointed :)

  4. This is something we're going to have to investigate, hopefully in collaboration with Louise Connell. Right now, I'd put my money on our results because of the high power.

    I don't think the sentence-verification shape effect is fragile at all. It's been replicated several times by several labs and twice here, with awesome power. :) The orientation effect is weaker but was also replicated twice. We're currently investigating why the orientation effect is smaller. The main point is that the shape and orientation effects did not bounce around.

    It's important to stress that these experiments are very much pitted AGAINST finding an effect. Subjects are not forced to understand each sentence and are not questioned about the location (let alone orientation) of objects.

    We do this because otherwise we'd run into the standard psycholinguistic complaint that subjects are using "special strategies" (as if presenting them with dozens of ungrammatical sentences is somehow normal). Also, our subjects are reading impoverished materials (just like in 99% of language experiments). It is plausible that you'd find larger effects if the task involved connected discourse. This is also something we're going to investigate.

    I'd say it is pretty impressive that we got these effects against these odds. Moreover, Bayesian analysis guards against false positives. So we have very strong evidence--replicated several times in high-powered experiments--for shape and orientation.

    I appreciate your interest. These are important issues to discuss. It just seems like you think we have not thought of the issues you raise ourselves. I cannot speak for others, but I have thought and spoken about them many times.

    The replication study is an attempt to clarify the issue. As such, your comments describe pretty much our motivation for running the study in the first place. More studies are under way.

  5. "It just seems like you think we have not thought of the issues you raise ourselves. I cannot speak for others, but I have thought and spoken about them many times."
    No, I don't think that, and I'm not trying to ride in with my wisdom :)

    That said, I don't see a lot of evidence of concern about this variation in the published literature. You're replicating results, sure, but you also need the extension and the probing of potential mechanisms, including ones that aren't from the conceptualisation embodiment hypothesis. For example, can you break the effect by experimentally perturbing some critical element of the task? What would that element be?

    The upcoming big 14 experiment paper sounds more like this, though, is that right? Lots of poking around?

  6. Indeed, you cannot rely on a single task. This is why we have used other tasks. We have used tasks where the pictures precede the sentences and are ostensibly unrelated to them. The dependent measures are memory, reading times, or the N400. So there are extensions in the literature already. Pictures were found to affect these dependent measures in predictable ways.

    With regard to the sentence-picture verification task, we're investigating right now whether making the object more or less action relevant modulates the effect. If the object is completely irrelevant to the action that is being described, people may not bother to represent it or its orientation. Is this an example of the kind of perturbation you have in mind?

    Many of our studies were attempts to find the limits of the effect. If you find it with orientation, do you also find it with shape? Do beginning comprehenders, who are not yet efficient at decoding, show the effect? Can a similar effect be found with different tasks and different dependent measures? And so on.

    The new study (that we're working on as we speak) is indeed more like probing. However, that study is more about language comprehension in general than it is about embodiment. It relates to my earlier work on situation models, although there are some embodiment aspects to it.

    I'm not interested in embodiment per se, to be honest. I'm interested in it only to the extent that it can help me answer the question of how we understand language. That's what I'm really interested in. So I see embodiment as a means and not as an end.

  7. "So I see embodiment as a means and not as an end."
    That's a useful context, actually. Thanks, it clarifies a few things nicely, as do the rest of your thoughts. Thanks :)

  8. Actually, blogging and responding to comments like these is useful for me as well in terms of articulating views.

    1. I've found it exceedingly useful too. The turnaround is nice and fast compared with published work, which helps resolve all those little confusions before they turn into major fights :)

  9. The advantage of a blog over an article is that you can provide background. Otherwise people are likely to infer intentions that might be completely alien to the author of the paper.

    I guess this is what I was trying to articulate in my post on funny article titles. Scientific articles will likely become dry, workman-like descriptions of confirmatory research (the typical JEP paper;)). Blogs will provide more context and more flavor. The middle category (e.g., Psych Science) might disappear or become more blog-like.

    1. I like JEP and similar precisely because you have the room to provide that context with the data. But after a few years blogging I now find myself writing papers and wanting to just link to a post I've written, so I know the feeling :)