Monday, March 18, 2013

The Value of Believing in Free Will: A Replication Attempt

Update (February 26, 2014): In early March we'll be submitting a manuscript that includes both the experiment described here and another replication attempt run in the lab.

Earlier this year I taught a new course titled Foundations of Cognition. The course is partly devoted to theoretical topics and partly to methodological issues. One of the theoretical topics is free will and one of the methodological topics is replication. There is a lab associated with the course and I thought we’d be killing two birds with one stone if we’d try to replicate a study that was discussed in the first, theoretical, part of the course. The students would then have hands-on experience with replication of a study that they were familiar with. Moreover, we could discuss the results in the context of the methodological literature that we read in the second part of the course.

The experiment I had selected for our replication attempt was Experiment 1 from Vohs & Schooler (2008) on whether a lowered belief in free will would lead people to cheat more. I thought that this was a relatively simple experiment—in terms of programming—that could be run on Mechanical Turk (we needed to be able to collect the data fast, given that it was a five-week course). My first impression after a cursory reading of the article was that we might replicate the result.

In the experiment, subjects read one of two texts, both passages from Francis Crick's 1994 book The Astonishing Hypothesis. One passage argues that free will is an illusion and the other passage discusses consciousness but does not mention free will. These texts were cleverly chosen, as they are similar in terms of difficulty and writing style. After reading the passages, the subjects complete the Free Will and Determinism scale and the PANAS.

Next comes the meat of the experiment. Subjects solve 20 mental-arithmetic problems (e.g., 1 + 8 + 18 - 12 + 19 - 7 + 17 - 2 + 8 - 4 = ?) but are told that due to a programming glitch, the correct answer will appear on the screen and that they can make it disappear by pressing the spacebar. So if the subject does not press the spacebar we know they are cheating. Vohs and Schooler (V&S) found that the subjects who had read the anti-free-will text cheated more often than those who had read the neutral text. More about the results later.

My graduate student, Lysanne Post, who is collaborating with me on this, contacted the first author of the paper, informing her about our replication attempt. She was helpful in providing information that could not be gleaned from the paper. It turns out the experiment was run in 2003 and the first author did not remember all of the details of that study. But with the information that was provided and some additional sleuthing we were able to reconstruct the experiment.

We ran the experiment on Mechanical Turk, using 150 subjects. This should give us awesome power, because the original experiment used only 30 subjects and the effect size was large (d = .82).
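
As a quick back-of-the-envelope check (not from the original post or paper), the power of a two-sided two-sample t-test at d = .82 can be sketched with the noncentral t distribution; the function and the equal group sizes are my own illustration:

```python
# A sketch of a priori power for a two-sided independent-samples t-test,
# computed via the noncentral t distribution. Equal group sizes assumed.
import math
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test for effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
    # probability of landing beyond either critical value under H1
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(power_two_sample_t(0.82, 15))   # original study: 30 subjects total
print(power_two_sample_t(0.82, 75))   # replication: 150 subjects total
```

With 75 subjects per condition, power at d = .82 is essentially at ceiling, whereas the original 15-per-condition design sits well below the conventional .80 threshold.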

In V&S's study, subjects in the AFW condition reported weaker free will beliefs (M = 13.6, SD = 2.66) than subjects in the control condition (M = 16.8, SD = 2.67).  In contrast, we found no difference between the AFW condition (M = 25.90, SD = 5.35) and the control condition (M = 25.11, SD = 5.37), p = .37. Also, our averages are noticeably higher than V&S’s.

How about the effect on cheating?

V&S found that subjects in the AFW condition cheated more often (M = 14.00, SD = 4.17) than subjects in the control condition (M = 9.67, SD = 5.58), p < .01, an effect of almost one standard deviation! In contrast, we found no difference in cheating behavior between the AFW condition (M = 4.53, SD = 5.66) and the control condition (M = 5.97, SD = 6.83), p = .158. Clearly, we did not replicate the main effect. It is also important to note that the average level of cheating we observed was much lower than that in the original study.
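
For comparison, Cohen's d can be computed directly from the summary statistics quoted above. This is a sketch on my part, using the pooled SD and assuming equal group sizes; the paper's exact effect-size computation may differ:

```python
# A sketch: Cohen's d from reported means and SDs, using the pooled SD
# for two groups assumed to be of equal size.
import math

def cohens_d(m1, sd1, m2, sd2):
    """Standardized mean difference (pooled SD, equal group sizes assumed)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Cheating: original study (AFW vs. control) versus our replication
d_original = cohens_d(14.00, 4.17, 9.67, 5.58)
d_replication = cohens_d(4.53, 5.66, 5.97, 6.83)
print(round(d_original, 2), round(d_replication, 2))
```

The original difference works out to nearly a full standard deviation, while the replication difference is small and in the opposite direction.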

V&S reported a .53 correlation between scores on the Free Will subscale and cheating behavior. We, on the other hand, observed a nonsignificant .03 correlation.

There was a further issue. About half our subjects indicated they did not believe the story about the programming glitch (we kind of feared that this might happen). We analyzed the data separately for “believers” and “nonbelievers” but found no effect of condition in either group.

What might account for this series of stark differences between our findings and those of V&S? I will discuss some ideas in my next post, where I will also talk about some lessons we learned from this replication attempt. Meanwhile, it might be good to revisit my first post, which discusses the why of doing replication studies.

39 comments:

  1. I look forward to hearing your thoughts on this. In my own research I directly compared a priming study conducted in person and online, and found significant results for the in-person version but not online (not even close). Do you consider the Vohs and Schooler study to be "priming"? My best explanation is that, for any priming effects that actually exist, they could be wiped out by any number of distractions experienced by online participants (e.g., television).

    ReplyDelete
    Replies
    1. Interesting point. I'll consider it in my next post.

      Delete
  2. Plus, how likely is it that one really believes such an online programming glitch?

    ReplyDelete
  3. I think the failed manipulation check and the high number of people who reported not believing the story make it difficult to draw *any* conclusions from this replication attempt, other than that it might not be worthwhile to run all types of studies on MTurk.

    ReplyDelete
    Replies
    1. So your conclusion is that MTurk is not suitable for running these experiments. Surely this is not the ONLY logically possible reason why we were unable to replicate the manipulation check (which produced a large effect in the original study)? What could be so different about MTurk that we couldn't even replicate this huge effect? More about this in my post tomorrow.

      Delete
    2. Of course there are other reasons, including that the effect is essentially non-existent, that the populations of the two samples are sufficiently different, etc.

      There are two issues, however, that I think are at least as likely.

      1. Although MTurk is suitable for running all sorts of experiments (the literature is now quite filled with them), the nature of the DV (especially) does not seem suitable for MTurk (or any other entirely online environment, e.g., ProjectImplicit). Your own participants point this out.

      2. MTurk is not a contamination-free subject pool. It could be that MTurk workers who participate in these studies have seen similar manipulations (if not the same manipulation) and so have become less reactive to them. I think this is a real problem; psychology has gone from the study of college sophomores to the study of MTurk workers.

      Delete
    3. Your first point is a good one, which I will discuss in my upcoming post. Your second one I have a hard time believing. This seems much more likely for psych intro students than for Turkers.

      Delete
    4. Rolf, actually I think there is good evidence for believing that a very large fraction of Turkers (in some cases perhaps even a majority) participating in any given psychology experiment have participated in literally hundreds of previous online psychology experiments, and that those Turkers have probably seen most of the "common" study paradigms in some form or other in the past.

      Here are some resources that get into this and other issues with Turk samples:

      http://experimentalturk.wordpress.com/2012/10/09/slides-from-acr-2012/

      http://experimentalturk.wordpress.com/2012/02/02/experimenting-on-amt-fundamental-articles/

      In particular see the slides "Non-Naivety among Experimental Participants on Amazon Mechanical Turk" from the first link, and the manuscript "Methodological Concerns and Advanced Uses of Amazon Mechanical Turk in Psychological Research" from the second link.

      --Jake

      Delete
    5. Thanks Jake, very useful info! Some of it even from someone at my own university whom I don't even know! I'll definitely need to address this in my next post.

      Delete
    6. I very much agree with Mark on both counts. In particular, the computer-failure cover story in an online environment seems like a flaw in the replication that was not present in the original study (although some people dismiss social psych as a bunch of cheap shots, others have problems replicating the basic designs;)). That the 'failure' of the replication is due to this flaw seems much more likely than that the original finding was not real (although, admittedly, replications can help resolve the true effect size; .82 seems rather high).

      Gabriele indeed also points out in very nice work that MTurkers take zillions of studies and are probably better informed than many psychology students (and, admittedly, many of the 'classics' make it difficult to run studies again). There's also a lot of cross talk between Turkers on experiment, in particular when they do the same studies over and over again (social and moral dilemmas in particular).

      Delete
    7. Just curious, where in the original paper did you read that everyone believed the cover story about the computer glitch? So how do you know the "flaw" wasn't present there? And where in the paper does it say that there was no crosstalk among the subjects? In other words, how can you be so sure (as you appear to be) that the basic design wasn't replicated?

      The cover story is likely less credible in an online environment, but unlike the original study, at least we have information on who did not believe it. We simply have to take it on faith that everyone in the original study believed the cover story. Among the subjects in our study who did appear to believe it, there was also no effect. And the manipulation check occurred before the cheating task, so our nonreplication there is immune to whatever problems there might be with the cover story.

      The potential problems with the Turkers are more serious.

      Delete
    8. Simply the difference in environments introduces a big factor. I CAN believe that, with some role play, participants believe the glitch in a lab environment. I cannot believe it in the online environment.

      I am not saying replication is not a worthy cause and it SHOULD be done (on the contrary, you know that both Mark and I are very much in favor of it and are doing as much as possible to use it as an important tool). However, replications should be done right. I am also not saying that the original study is perfect. I don't have any stake in it and for all I care you don't replicate it. Yes, you have more information, but why not then actually DO the appropriate replication? It's like the hooligan study in Canadian ice hockey contexts....

      Delete
    9. Also, to be clear: the point about cross talk and taking zillions of studies is not related to the most important point (that is, the design). I am not sure whether cross talk and zillions of studies are necessarily a problem here. It was mostly to underline points made earlier.

      Delete
    10. You can believe it but you don't know it. That's what I meant by taking it on faith. You have no foundation for your claim that the design was not replicated other than something you CAN believe.

      Delete
    11. Experiment 1, there's an effect. Experiment 2, there's no effect. Experiment 2 is a conceptual replication, not a close replication, because the design is different. That so much is clear. Why not run the close replication?

      Delete
    12. What is a "close replication" with studies of this type? There is no easy answer. I'll talk about this in my next post. (One drawback of having a two-part post and having people comment after the first part is that people will say things or prompt you to say things that you were going to address in the second part anyway).

      I'm still unclear as to which meaning of "design" you are using. Surely the experiment has the exact same design in the usual sense of the word.

      Delete
    13. I'll wait until you do your next post then:). But if I had seen this study a priori, I would not have predicted it to work. For many reasons.

      Delete
    14. It would have been good to know how much people in the original study believed the cover story. I totally agree.

      Rolf, I also think that your replication is useful (despite my snarky first reaction). It helps to understand how the effect (and more basically the procedures) work in an online environment. In the original lab study participants had face-to-face contact with the experimenter, in the replication they did not. This may have made people in the original study more committed to the research (e.g., pay more attention to the manipulation), which in turn produced the effects. Etc. etc. etc.

      The differences between replication attempts do help us learn about an effect. I am just hesitant to make many judgments about the effect itself based on one or two replication attempts - especially when there are potential issues related to the samples' experience and the interpretation of the cover-story.

      Anyways, I'll stop commenting now since I'm sure you will cover many of these issues in your next post. I just wanted to follow up with a slightly more reasonable update than my original comment :-)

      Delete
    15. Thanks, Mark, for this more reasonable comment.;) I am indeed going to cover these things. A key point of my next post will be that it is difficult to say with experiments of this type what is essential for the effect to occur. The research assistant's acting skills? A key other point is that if this stuff is so relevant, why is it not mentioned in the paper?

      Delete
    16. I think we all agree on that part. We are underreporting, and the whole issue with replication is pointing this out. Some of the reasons for finding an effect we may not even know, so we may not even record that information. But journals need to change in terms of what's being reported. Experimenter A may have a broad cover story and extensive pilot testing, while Experimenter B accidentally finds effects. We don't know why.

      Delete
    17. You're stealing my thunder, Hans...

      Delete
    18. Sorry:). But Mark and I are intending to cover part of this in a paper as well. We can forward it to you at some point for comments as well. So in a way I am stealing your thunder of the thunder that you are stealing from us:).

      Delete
    19. I'll be happy to take a look at your paper, looks interesting. And there probably is plenty of thunder to go around.

      Delete
  4. I think you should change the mechanic.

    I would recommend that you have a puzzle game, where the subject puts together a puzzle of different shapes. (http://goo.gl/SDdA8)

    They should be presented a card with the shape on the screen and told that the answer is on the back and that they can flip the card to check it. This will allow them the choice to cheat, and you can check the state of the puzzle before they flip the card.

    You could also present the card "accidentally" with the answer facing the user. This would allow them the chance to flip it or discard it. We could build in a function to discard, as well as just to move a card to the completed pile and full-scale cheat. This could be pushed with a timer and completion counter.

    You can also introduce an aspect to see how a multiplayer or monitored option affects the outcome.

    I'll make the game for you.

    ReplyDelete
    Replies
    1. The URL was corrupted. It should point to a puzzle game like Tangoes.

      Delete
    2. Thanks, that sounds pretty cool.

      Delete
  5. I think the most curious aspect of the replication attempt is demonstrated in the comments. I suspect people would not have been so critical of the methods and MTurk if the result had shown the expected effect. That is a troubling reaction, because the validity of the experiment should be judged by the methods, not by the results.

    ReplyDelete
    Replies
    1. If we judge by the methods, we end up with the same result. Garbage in, garbage out. What's especially troubling is conducting methodologically weak replication attempts under the guise of doing something important for the discipline.

      Delete
    2. Great point, Greg. Also puzzling is the lack of criticism of the original study. Have people even read the paper?

      Delete
  6. Right. Not true, but nice for confirmatory evidence if you are seeking to get confirmation of your hypothesis that (other) people are bad researchers.

    But, that is why the a priori registration (which Rolf is actually doing in his Frontiers Special Issue) is so nice. Rolf has told me about this experiment before (and before telling me the results) and my response was the same.

    ReplyDelete
    Replies
    1. As you well know, Hans, the goal of the replication attempt was not to show that other people are bad researchers. That goal is stated in the post. The post also states that I initially believed we were going to replicate the effect (why else would I have assigned the paper to my class?).

      Delete
  7. I'm confused. What's the value here? There's no measurement or control of the population sampled, environment, experimenter demands, individual differences, etc. So why are these results so different than those collected in the lab?

    Until we begin studies specifically aimed at answering that question, I fail to see any value in compiling a list of failed replications on M-Turk. It's like asking why polar ice cap melting rates can't be replicated in my kitchen. Context matters.

    ReplyDelete
    Replies
    1. Have you read the original study? I'm looking forward to hearing about all the details you are going to find in there about the population. All I can find is this: "Participants were 30 undergraduates (13 females, 17 males)", but maybe you can do better.

      As for our own data, we have all sorts of information about the subjects (age, gender, level of education, native language, rating of the amount of noise in the environment, operating system, browser, ideas about the purpose of the experiment, and reading time for the passages). I didn't report all of this in the blog because it's a blog and not an article. I will discuss some of it in my next post. I will then also address the question why our results are so different.

      I wonder why you are criticizing the replication attempt for not having all sorts of measures (which, as I've just told you, it does have) that the original study clearly does not have.

      Delete
  8. This is a worthwhile effort and discussion. I must say that I've also failed on two related tasks when it comes to online administration on MTurk: (1) I couldn't make this specific free will manipulation work (meaning the manipulation checks didn't work out), and (2) I couldn't get meaningful variance in various cheating measures that have worked for me in the lab.

    With that said, I'm not entirely sure this is all about the MTurkers or running this online. Though I know of a few studies that have successfully run this manipulation with similar or conceptually related DVs (aggressive behavior, prosocial behavior, etc.), I also know of quite a few failed runs. It would be worthwhile to try to aggregate those findings (a meta-analysis?) to understand when and why this works or doesn't.

    ReplyDelete
    Replies
    1. Thanks. Good to hear something constructive. These are interesting observations. In my next post, which I plan to write tomorrow, I want to systematically address these and other points. Meta-analyses are definitely useful.

      Delete
  9. Meta-analysis is key here - failure to replicate is way too common for my comfort level, even within my own lab! I recall one study I did as a post-doc, finding completely opposite (significantly so) results before and after spring recess. And it was a "simple" memory study. My advisor at the time told me "welcome to human behavior." As they say... if the brain were simple enough to understand, we'd be too simple to understand it.

    ReplyDelete
  10. When a remarkable finding is published, few challenge it. But when the finding does not replicate, readers bend over backwards to explain why the effect might be real. In reports of priming, effects are large and robust, so why do small procedural changes wipe them out entirely? Moreover, a number of online priming effects have been published, including studies with MTurk subjects (e.g., Caruso, Vohs, et al., JEPG 2013).

    ReplyDelete
  11. I found the experimental results a little difficult to understand.
    What is the difference between the two texts?
    What causes the difference in cheating?

    ReplyDelete