When effect sizes matter: The internal (in?)coherence of much of social psychology

This is a guest post by Lee Jussim.  It was originally posted as a comment to the Beginning of History Effect, but it seemed too important to leave as a comment. It has been slightly edited to help it stand alone.

Effect sizes may matter in some but not all situations, and reasonable people may disagree.

This post is about one class of situations where: 1) they clearly do matter; and 2) they are largely ignored. That situation: when scientific articles, theories, or other writings make explicit or implicit claims about the relative power of various phenomena (see also David F.'s comments on ordinal effect sizes).

If you DO NOT care about effect sizes, that is fine. But, then, please do not make claims about the “unbearable automaticity of being.” I suppose automaticity could be an itsy bitsy teenie weenie effect size that is unbearable (like a splinter of glass in your foot), but that is not my reading of those claims. And it is not just about absolute effect sizes; it is also about the relative effects of conscious versus unconscious processes, something almost never compared empirically.

If you do not care about relative effect sizes, please do not declare that “social beliefs may create reality more than reality creates social beliefs” (or the equivalent), as have lots of social psychologists.

If you do not care about at least relative effect sizes, please do not declare stereotypes to be some extraordinarily difficult-to-override “default” basis of person perception and argue that only under extraordinary conditions do people rely on individuating information (the relative effect sizes of stereotypes versus individuating information in person perception are roughly r = .10 and r = .70, respectively; see the sketch below).

If you do not care about at least relative effect sizes, please do not make claims about error and bias dominating social perception, without comparing such effects to accuracy, agreement, and rationality.

If one is making claims about the power and pervasiveness of some phenomenon — which social psychologists often seem to want to do — one needs effect sizes.
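
A minimal sketch of the kind of comparison at stake, using simulated data. The variable names and weights are mine, chosen only to echo the r = .10 and r = .70 figures above; it illustrates the logic, not anyone's actual dataset:

```python
# Simulate judgments driven mostly by individuating information and
# only weakly by stereotypes (weights chosen to echo r ~ .70 vs. r ~ .10)
import numpy as np

rng = np.random.default_rng(0)
n = 500
individuating = rng.normal(size=n)   # e.g., a target's observed behavior
stereotype = rng.normal(size=n)      # e.g., a group-based expectation
judgment = 0.7 * individuating + 0.1 * stereotype + 0.7 * rng.normal(size=n)

r_ind = np.corrcoef(individuating, judgment)[0, 1]
r_st = np.corrcoef(stereotype, judgment)[0, 1]
print(f"r(individuating, judgment) = {r_ind:.2f}")  # roughly .70
print(f"r(stereotype, judgment)    = {r_st:.2f}")   # roughly .10
# The claim "stereotypes dominate person perception" entails r_st > r_ind;
# with values like these, the claim fails on its own terms.
```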

Two concrete examples:
Rosenhan’s famous “being sane in insane places” study:
It CLAIMED that the insane were “indistinguishable from the sane”; the diagnostic label was supposedly extraordinarily powerful. In fact, his own data showed that the psychiatrists and staff were over 90% accurate in their judgments.

Hastorf & Cantril’s famous “they saw a game” study:
This was interpreted both by the original authors and by pretty much everyone who has ever cited their study thereafter as demonstrating the power of subjective, “constructive” processes in social perception. It actually found far — and I do mean FAR — more evidence of agreement than of bias.

Both of these examples — and many more — can be found in my book (you can get the first chapter, plus abstracts and excerpts, here: http://www.rci.rutgers.edu/~jussim/TOC.html). (It is very expensive, so, if you are interested, I cannot in good faith recommend buying it, but there is always the library.)

If (and I mean this metaphorically, to refer to all subsequent social psychological research, and not just these two studies) all Rosenhan and Hastorf & Cantril want to claim is “bias happens” then they do not need effect sizes. If they want to claim that labels and ingroup biases dominate perception and judgment — which they seemed very much to want to do — they need not only an effect size, but to compare effect sizes for bias to those for accuracy, agreement, rationality, and unbiased responding.

Lee Jussim


11 Responses to When effect sizes matter: The internal (in?)coherence of much of social psychology

  1. Dan Simons says:

    I completely agree with the central theme of Lee’s commentary, but I want to quibble with one point. The post noted that the relative effects of conscious and unconscious processes are almost never compared explicitly. That may well be true of the social cognition literature on purportedly implicit or automatic processes. But it is not true of studies of implicit/subliminal perception within the perception/cognition literature. In that literature (which is over 100 years old and uses far more refined measures of awareness than are typical of studies in other areas), implicit effects are routinely compared to explicit ones.

    One of the central findings of that literature is that effect sizes for allegedly implicit processes diminish with increasingly rigorous controls for awareness. That older literature is one reason some cognitive psychologists have been skeptical of claims of strong and automatic implicit influences on behavior — the controls for awareness typically are minimal, the effect sizes are not compared to those obtained with awareness, and the effect sizes are much larger than the ones typically found in implicit perception studies.

    Minor note to Brent: Experimental psychology includes a wide range of approaches and subdisciplines. Vision research is a form of experimental psychology, but it largely avoids NHST and focuses on parameter estimation and effect sizes. Cognitive psychology also uses experimental methods and NHST, but there tends to be more of an emphasis on effect sizes than in other areas of experimental psychology. Clinical psychology can be considered experimental in many cases as well, and some approaches/studies are better than others. It would be good to be more precise in defining the target of the aspersions you’re casting. Not all subdisciplines of experimental psychology are equivalent, so it’s not really appropriate to discuss the problems with “experimental” psychology without identifying the particular methods or claims under scrutiny.

    • Lee Jussim says:

      Hey would you send me some of those refs? (jussim at rci.rutgers. with an edu at the end) It is a point well-taken. I should have written “almost never compared in SOCIAL psychology.” At least informally, I have heard nasty stories that I consider credible of how, when folks have tried to compare them, the papers have been rejected because finding evidence of conscious and controlled processes is considered “not interesting” by reviewers and editors. (It is so easy to digress into the many ways in which social psychology has been distorted, that it is, sometimes, hard to stay on topic, which, in this case, is effect size, not substantive bias). Although in fairness to social psychology, my “almost never done” point holds true more for the priming literature than for the IAT literature. IAT research does indeed routinely examine relations between the IAT and explicit processes and measures. (Of course, the IAT lit has its own controversies; it is so hard to stay on topic…).

      • Dan Simons says:

        In my view, the interesting debate (at least in the implicit perception world) is not about whether something is conscious or implicit, but about whether we can dissociate two processes by showing qualitative differences as a function of the extent of awareness. Both processes might require some degree of awareness, but the difference might still be important.

        For some of the implicit effects reported in social psychology, I think many of those studying implicit perception just assumed that the effects were occurring with some modicum of awareness (given the methods used to assess awareness). They could still be interesting effects if that were true, but the strongest claims depend on the influences happening without awareness. As I noted earlier, my sense is that the skepticism emerged from the claims being about implicit processing and not from the particular domain of inquiry. There’s a long tradition of skepticism about claims of implicit perception within cognition/perception literature.

        As for references, the classic critical paper is:
        Holender, D. (1986). Semantic activation without conscious identification in dichotic listening, parafoveal vision, and visual masking: A survey and appraisal. Behavioral and Brain Sciences, 9, 1-23.

        There are many other reviews and empirical papers trying to address the criticisms he raised or proposing alternative ways to measure awareness. For example, Phil Merikle and Eyal Reingold fall more toward the proponent end of the implicit-processing spectrum (within cognitive psychology) and argue for using a subjective threshold for awareness (along with other criteria). I think most agree that it is necessary to use signal detection measures to rule out awareness, and that post-experiment questioning is inadequate for that purpose.
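
        To make the signal detection point concrete, here is a minimal sketch of the logic, assuming a simple yes/no awareness-check task. The function, the correction, and the trial counts are illustrative, not drawn from any of the papers cited:

        ```python
        # d' for an awareness check: can observers discriminate
        # prime-present from prime-absent trials better than chance?
        from scipy.stats import norm

        def d_prime(hits, misses, false_alarms, correct_rejections):
            # Log-linear correction avoids infinite z-scores at 0% or 100%
            hit_rate = (hits + 0.5) / (hits + misses + 1)
            fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
            return norm.ppf(hit_rate) - norm.ppf(fa_rate)

        # Hypothetical check: 100 prime-present and 100 prime-absent trials
        print(d_prime(hits=55, misses=45, false_alarms=48, correct_rejections=52))
        # ~0.17: near-chance detection, though a confidence interval around d'
        # would still be needed before calling the prime truly subliminal
        ```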

        I’ve written a couple review papers (not completely current, but close enough) that cover the history and controversies (both available from http://goo.gl/C9ZXR). The reference section in the first one is fairly extensive.

        Simons, D. J., Hannula, D. E., Warren, D. E., & Day, S. W. (2007). Behavioral, neuroimaging, and neuropsychological approaches to implicit perception. In P. Zelazo, M. Moscovitch, & E. Thompson (Eds.), Cambridge Handbook of Consciousness (pp. 207-250). New York: Cambridge University Press.

        Hannula, D., Simons, D. J., & Cohen, N. (2005). Imaging implicit perception: Promise and pitfalls. Nature Reviews Neuroscience, 6, 247-255.

  2. Quick question, which I assume would fit into a longer post than this one — in your Rosenhan example, you say that the doctors were over 90% accurate — is that good? A baseball player would kill for a 90% average, but an air traffic controller wouldn’t last very long landing planes with 90% accuracy. On Hastorf & Cantril — when partisans of both teams are looking at the same events, presumably it’s not surprising that there’s substantial agreement — it’s really only the amount of disagreement relative to what we’d expect (which in itself is a very ambiguous term) that matters.

    So this doesn’t suggest that we shouldn’t or don’t need to measure effect sizes, but it does seem to suggest that effect sizes will only give you a limited understanding of a finding’s real significance. I guess you really need to know the effect size relative to other relevant effect sizes.

    Would love some clarification on that point since I’m probably either missing something you said, or didn’t have space to say. Overall, the point that you can’t make claims that A is important, or bigger than B without knowing anything about their relative sizes is well taken.

    • Neal Roese says:

      I agree with this post: even comparative effect sizes give little insight when you are speaking of accuracy vs. bias. One solution is to benchmark the error rate against something people care about, such as money. On an assembly line, for example, you can monetize the error rate. Is a 1% error rate small or large on a particular assembly line? If you translate it into cost per year, you can consider it in terms of the savings from improving accuracy to 0.1%; you could then compute ROI (return on investment) and decide whether it is “worth it” to invest in assembly-line improvements that would lower the error rate. Two key points: a) the meaning of error rates is domain specific, and b) the value of considering comparative effect sizes hinges entirely on context (domain).
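
      A minimal sketch of that monetization logic, with invented numbers throughout:

      ```python
      # Toy assembly-line example: translate an error rate into dollars,
      # then into ROI for an improvement (all figures invented)
      units_per_year = 1_000_000
      cost_per_error = 25.0               # rework + scrap per defective unit

      def annual_error_cost(error_rate):
          return units_per_year * error_rate * cost_per_error

      savings = annual_error_cost(0.01) - annual_error_cost(0.001)  # $225,000
      investment = 500_000.0              # hypothetical cost of the improvement
      roi = savings / investment          # 0.45; pays for itself in ~2.2 years
      print(f"Annual savings: ${savings:,.0f}, ROI: {roi:.2f}")
      ```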

  3. Lee Jussim says:

    Dave,

    My point is only that one determinant of whether effect sizes are needed is based on the CLAIMS being made. So, whether or not there are objective standards about reporting effect size (these issues were extensively thrashed out in the discussion of Brent’s Beginning of History Effect post), my points are only that:

    1. IF researchers are making claims about the power of some effect, then yes, it behooves them to report effect sizes; and
    2. IF researchers make claims about relative effect sizes, even implicitly, then, again, it behooves them to report effect sizes for the phenomena they are (perhaps implicitly) comparing.

    On Rosenhan, the short version (really, if you are interested, you should get a copy of my book for the gory details): I am not making claims about what constitutes the degree of error or bias that justifies righteous outrage, concern, judgments of “badness,” or whatever. I am pointing out that internal coherence constitutes a clear criterion for knowing when effect sizes are called for. If the claim is that “the sane are indistinguishable from the insane,” we need some estimate of the sizes of the successes and failures in diagnosing patients. If doctors get the distinction right 90% of the time, as they did in Rosenhan’s second study (one cannot estimate accuracy in the first study), Rosenhan’s claim is dead wrong.

    Once we have established that the sane are clearly (if not perfectly) distinguishable from the insane, and that Rosenhan’s conclusion was disconfirmed by his own data, it then becomes a matter of scientific and practical judgment regarding what can be substantively salvaged from the study. As you say, the psychiatrists were not perfect. How much to care about that is a matter of judgment that is beyond the scope of my original entry.

    On Hastorf & Cantril. Same type of thing. They declared social perception to be entirely subjective, going so far as to write: “There is no such ‘thing’ as a ‘game’ existing ‘out there’ in its own right which people merely ‘observe'”. Note the quotes around ‘thing’, ‘game’, ‘out there’ and ‘observe’ — all of which appeared in the original. Why? It helps emphasize their claim that there is no reality out there, it is all subjective perception. If that is the claim, then it behooved them to demonstrate HUGE bias effects and little or no accuracy or unbiased responding, which they did not do.

    Of course, anyone is welcome to make an entirely different claim as you did. You wrote:
    “when partisans of both teams are looking at the same events, presumably it’s not surprising that there’s substantial agreement — it’s really only the amount of disagreement relative to what we’d expect (which in itself is a very ambiguous term) that matters.”

    Your claim, in contrast to theirs and to those of others who cite the study as a testament to subjective and constructive processes (see, e.g., Ross’s 2010 Handbook chapter), includes the assumption that the event actually is the same much of the time. You assume that social reality exists, out there, in its own right, and, except when it is ambiguous, which in this case is not very often, people do indeed merely observe. To which I say: “Exactly!”

    Focusing on ambiguous situations can be theoretically useful. But if one wants to make claims about social perception generally (as did H&C and many of those citing them), one cannot restrict one’s domain to ambiguous situations. Everything counts.

    In contrast to their claims, their results showed at least 95% unbiased agreement, and at most, 5% bias. That is a HUGE social reality effect. It may not be surprising to those of us who believe that social reality exists and is often unambiguous. But it is huge nonetheless. (Although given the relentless drumbeat about subjective and constructive processes, and the extent to which this study has been cited as a testament to and demonstration of such processes, I would argue that it should be surprising to discover that, in fact, most of what they found was unbiased agreement).
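
    A back-of-the-envelope sketch of that kind of decomposition, with invented counts (chosen only to land on the 95/5 split; they are not H&C’s actual data):

    ```python
    # Decompose partisan judgments of the same game into agreement vs. bias;
    # every count below is invented for illustration
    total_acts = 200        # codable acts in the filmed game
    princeton_calls = 12    # acts Princeton partisans coded as infractions
    dartmouth_calls = 6     # acts Dartmouth partisans coded as infractions
    shared_calls = 4        # infraction calls the two sides made in common

    # Acts coded identically by both sides (infraction or not) are agreement;
    # only the non-overlapping calls reflect partisan divergence
    disputed = (princeton_calls - shared_calls) + (dartmouth_calls - shared_calls)
    agreed = total_acts - disputed

    print(f"Unbiased agreement: {agreed / total_acts:.0%}")   # 95%
    print(f"Bias:               {disputed / total_acts:.0%}")  # 5%
    ```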

    Barbara Mellers would describe this analysis as “counter counter-intuitive.” Both laypeople and many of our colleagues love counter-intuitive findings. Countering counter-intuitive findings is, I suspect, often experienced as subversive. But, as I once wrote about stereotype accuracy, the issue should not be about whether the conclusions are palatable — it should be about whether they are valid.
    Best,

    • Thanks for the thoughtful response. The only thing still niggling at me, and I hope I can articulate it clearly, is that even if we agree on 99% of what “a thing” is, it’s often precisely in the remaining 1% that the action is.

      When Hastorf and Cantril say there is no such thing as a game, they’re clearly not claiming that when 60,000 fans show up to a stadium in team paraphernalia they all have wildly different interpretations of what’s about to happen.

      When we’re judging whether a penalty should be assessed on a play, presumably we agree on the vast majority of the mundane details. Yes, there were two teams and they were playing football, and all the players were human, and had arms, and legs, and heads. It was the 2nd quarter, it was 3rd and 4, and Dartmouth tried to run the ball up the middle. But if there’s disagreement at the very margins — Dartmouth fans all say there was holding and Princeton fans say there wasn’t — then the 1% matters a lot. Particularly if the differences there are systematic and predictable. If one player shoves another, you can even agree on the physical substrates of everything that happened, but still disagree about the meaning, depending on your other psychological commitments.

      In that sense it seems less of a stretch to say that there’s no objective “game” out there, despite vast amounts of intersubjective agreement, even between people who don’t agree in the end. It’s often enough for there to be huge overlap and a small amount of disagreement to provoke very meaningful differences in important outcomes.

      I agree that we shouldn’t oversell reality, and to the extent that we quantifiably document the effect sizes (and compare them to reasonable benchmarks, whatever those may be), people can judge things for themselves.

      • Lee Jussim says:

        Dave,

        Two things:
        I. First, on the merits of your points as YOUR points. Of course I agree that bias can and does occur in ambiguous situations; and of course I further agree that those can be very socially important and meaningful. Your point that small effects can be important is well-taken and I am not contesting that.

        II. What One “Means” Versus What One Writes
        That, however, is what YOU mean. I see no evidence in the H&C article that that is what THEY meant. I also have never seen the article cited in the way in which you just described it. Instead, it is usually cited as a testament to subjectivity.

        I have no idea what H&C or anyone else writing about these issues “means” other than what they wrote. I think mindreading is a losing business. If H&C did not mean “There is no such thing as a game” and, instead, meant “For the most part, people’s beliefs are nicely in touch with reality, but there are a tiny number of situations in which social reality is ambiguous, and in those situations subjectivity plays a big role; and, furthermore, even though those situations are few, they can be very meaningful and socially important” — then, with all due respect to H&C, they should not have written “There is no such thing as a game.” They should have written something much more nuanced, balanced, and true to their data. They were clearly smart scientists. I would guess they actually meant to write what they wrote. Most of us usually do. In scientific literature, some interpretation is necessary, but, I think the starting point for interpreting meaning is what the authors actually wrote. I see no ambiguity in what they wrote. I do not see how to translate “There is no such thing as a game” into “Ambiguity occurs in about 1% of the game and that 1% is very important.”

        If others do see how to make that translation, then we will have to agree to disagree. But, with all due respect, I think you may be conflating YOUR interpretation of what they found (which is well-justified) with their own (which is not). This type of extreme and unjustified claim comes up again and again and again and again. Here are just a few more modern examples:

        When Miller & Turnbull (1986) wrote: “Teachers’ expectancies influence students’ academic performance to a greater degree than students’ performance influences teachers’ expectancies,” we should interpret it to mean that the TE –> SPerf relation is larger than the SPerf –> TE relation (it isn’t — see my book). If they really meant “Oh, well, uhh, self-fulfilling prophecies can be important, but we all know that teacher expectations are based on student performance far more than they cause student performance; it is just that self-fulfilling prophecies are so much more interesting and important,” they should have said so.

        When Hare-Mustin & Marecek (1988) wrote: “Constructivism asserts that we do not discover reality, we invent it,” I think we should interpret their intended meaning literally. Science writing is not, in my view, the place for phantasmagoric/metaphorical statements interpretable as pretty much anything the reader wants to be true. If they “really” meant “Oh, well, sometimes constructive processes lead to the creation of social reality, but we all know that most of the time beliefs are based on social reality,” they should have said so.

        When Jost & Kruglanski (2003) wrote: “The thrust of dozens of experiments on the self-fulfilling prophecy and expectancy-confirmation processes, for example, is that erroneous impressions tend to be perpetuated rather than supplanted, because of the impressive extent to which people see what they want to see and act as others want them to act …” I think we take them at face value. If they really meant to say, “Even though people are generally nicely in touch with reality most of the time, and even though biases and self-fulfilling prophecies tend to be quite modest in size and readily reduced or eliminated by disconfirming social information, even the modest effects of biases and self-fulfilling prophecies can sometimes be important” then that far more nuanced statement is what they should have written.

        I have taken to using an image of a puffin in talks I give around the country on this stuff. It is the Puffin of Puffery, to refer to extreme and exaggerated claims (“there is no such thing as a game”) based on tiny biases/SFP effects, which are rarely, if ever, actually provided or compared to unbiased/accuracy/rationality effects.

        I partially agree with Neal Roese’s comment:
        “the value of considering comparative effect sizes hinges entirely on context (domain).”
        I agree that the importance of effect sizes hinges, in part, on domain, and that may, sometimes, provide an objective criterion for deciding whether to report effect sizes.

        But I do not think it entirely rests just on domain. It also hinges on exactly what scientists are claiming. If Dr. X claims “THIS is huge” (think: reign of error; powerful and pervasive expectancy effects; automaticity dominates attitudes, motivation, affect, and behavior; and much more), then Dr. X has branded this theoretical/scientific claim as warranting comparison of effect sizes. To scientifically evaluate Dr. X’s claims, even if X did not do so him/herself and even if X denies the need to do so, it needs to be done. Or, X can retract the claims and say, “Oh, never mind, I did not really mean to say THIS is huge. I just meant to say this happens and sometimes it is consequential.” “THIS is huge” is not equivalent to “THIS is small but important.”

        Just like doctors’ first rule is “Do no harm,” scientists’ first rule should be “Do not promulgate claims that are manifestly false or distorted.”

        If scientists routinely “really” mean “This is important” when they write “This is huge” or “This is huger than that” then we have an epidemic of bad writing on our hands. I have too much respect for our colleagues’ writing, though, to believe that. I think they usually write exactly what they meant (though, of course, this does not preclude any of us from having our beliefs evolve and change and clarify over time — which is, in fact, what this type of discussion, and science more generally, is all about).

        Reporting effect sizes is not a panacea. It just constitutes one means of evaluating (and, therefore, holding scientists accountable for, and, therefore, reining in many of) our field’s (often unjustifiably) unbalanced claims about all sorts of psychological phenomena.

  4. That makes sense — the distinction between what they say and what I say is certainly well taken. What’s interesting to me is that I’ve often heard the criticism that psychologists are too unwilling to speculate beyond their data. Perhaps the key point is that when we do speculate beyond our data, we should make it clear in big bold letters the line where the data ends and our speculation begins.

  5. Sanjay Srivastava says:

    A lot of the examples in the post and discussion are about sweeping, global claims about the wholesale construction of social cognition, about the relative importance of persons vs. situations, accuracy vs. bias, etc. It’s not hard to find examples of those kinds of broad, almost meta-theoretical claims, and I think they’re fair game for this kind of critique.

    But a very common way that researchers form and test more focused hypotheses about relative effect sizes is in interactions. If you are saying that Y is a function of an X-by-M interaction, you are essentially saying that the effect of X on Y is different at one level of M than at another. Sometimes interactions are conceptualized purely in terms of the sign of the effect and not its absolute magnitude (e.g., some crossover interactions). But there are also cases of interactions that are about relative magnitude. My sense is that researchers aren’t necessarily using the language of effect size to characterize them, but de facto that is what they are.
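
    A minimal sketch of that point with simulated data; the variable names and coefficients are illustrative:

    ```python
    # An X-by-M interaction is, de facto, a relative effect size claim:
    # the slope of x differs in magnitude across levels of m
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 400
    x = rng.normal(size=n)
    m = rng.integers(0, 2, size=n)            # binary moderator
    y = 0.2 * x + 0.5 * x * m + rng.normal(size=n)
    df = pd.DataFrame({"x": x, "m": m, "y": y})

    fit = smf.ols("y ~ x * m", data=df).fit()
    # The x:m coefficient is the difference between the two simple slopes
    print("slope of x at m=0:", fit.params["x"])
    print("slope of x at m=1:", fit.params["x"] + fit.params["x:m"])
    ```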

  6. Pingback: I don’t care about effect sizes — I only care about the direction of the results when I conduct my experiments | The Trait-State Continuum
