The Beginning of History Effect

Doctor, my eyes have seen the years
And the slow parade of fears without crying
Now I want to understand
I have done all that I could
To see the evil and the good without hiding
You must help me if you can
Doctor, my eyes
Tell me what is wrong
Was I unwise to leave them open for so long
                              Jackson Browne
 

I’m having a hard time reading scientific journal articles lately.  No, not because I’m getting old, or because my sight is failing, though both are true.  No, I’m having trouble reading journals like JPSP and Psychological Science because I don’t believe, can’t believe the research results that I find there.

Mind you, nothing has changed in the journals. You find tightly tuned articles that portray a series of statistically significant findings testing subtle ideas using sample sizes that are barely capable of detecting whether men weigh more than women (Simmons, Nelson, & Simonsohn, 2013). Or, in our new and improved publication motif, you find single, underpowered studies with huge effects that are presented without replication (e.g., short reports).  What’s more, if you bother to delve into our history and examine any given “phenomenon” that we are famous for in social and personality psychology, you will find a history littered with similar stories: publication after publication with troublingly small sample sizes and commensurate, unbelievably large effect sizes.  As we now know, in order to have a statistically significant finding when you employ the typical sample sizes found in our research (n = 50), the effect size must not only be large, but also overestimated.  Couple that with the fact that the average power to detect even the unbelievably large effect sizes that we do report is 50%, and you arrive at the inevitable conclusion that our current and past research simply does not hold up to scrutiny. Thus, much of the history of our field is unbelievable.  Or, to be a bit less hyperbolic, some unknown proportion of our findings can’t be trusted.  That is to say, we have no history, or at least no history we can trust.
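
To make that arithmetic concrete, here is a minimal sketch; the n = 50 and the two-tailed alpha of .05 are just the conventional values, and the rest is the standard conversion between t and r:

# What is the smallest correlation that can reach p < .05 with N = 50?
from math import sqrt
from scipy.stats import t

N = 50
df = N - 2
t_crit = t.ppf(0.975, df)               # two-tailed critical t at alpha = .05
r_crit = t_crit / sqrt(t_crit**2 + df)  # convert the critical t to a correlation
print(round(r_crit, 2))                 # ~ .28

# Any "significant" correlation from a sample this small is at least ~.28,
# already larger than the typical true effect in our literature (roughly r = .2),
# so the published estimate is an overestimate almost by construction.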

This was brought home for me recently when a reporter asked me to weigh in on a well-known psychological phenomenon that he was writing about.  I poked around the literature and found a disconcertingly large number of “supportive” studies using remarkably small sample sizes and netting (without telling, of course) amazingly large effect sizes, despite the fact that the effect was supposed to be delicate.  I mentioned this in passing to a colleague who was more of an expert on the topic and he said “well, the real effect for that phenomenon is much smaller.”  His comment reflected the fact that he, unlike the reporter, or the textbook writer, or the graduate student, or the scholar from another field, or me, knew about all the failed studies that had never been published.  However, if you took the history lodged in our most elite journals you would have to come to a different conclusion—the effect size was huge in the published literature.  If you bother to look at many of our most prized ideas, you will find a similar pattern.

The Beginning of History Effect is, of course, a play on the End of History idea put forward by Fukuyama: that the end of the overt and subtle battles of the Cold War, and the transition to almost universal liberal democracy, would essentially end the tension requisite for the narrative of history to continue.  The Beginning of History Effect (no, unfortunately, it is not an illusion) is an attempt to put a positive spin on the fact that we can’t rely on our own scientific history. The most positive take on this situation is that we have the chance of making history from here on out by conducting more reliable research.  I guess the most telling question is whether there is any reason to be optimistic that we will begin our history anew, or whether we will continue to fight for ideas and questionable methods that have left us little empirical edifice on which to rest our weary bones.

To bring the point home, and to illustrate just how difficult it will be to begin our history over again, I thought it would be instructive to highlight a set of personality findings that are evidently untrue, but still get published in our top journals.  Specifically, any study that has been published showing a statistically significant link between a candidate gene and any personality phenotype is demonstrably wrong.  How do I know?  If you spend a little time examining these studies, you will find a very consistent profile.  The original study will have what we think is a relatively large sample—hundreds of people—and no replication.  Ever.  If you go to the supporting literature to find replications, you find none or the typical “inconsistent” pattern.  More tellingly, if you go to the genome-wide association studies, you will find that they have never, ever replicated any of the candidate gene studies that litter the history of personality psychology, despite the fact that they contain tens of thousands of participants.

What this means in the terminology of the current replication crisis in the field of social and personality psychology is that the effect sizes associated with any given candidate gene polymorphism are so small that they cannot be detected without a sample size in the tens of thousands (if not hundreds of thousands).  It is the same low power issue plaguing experimental psychology, just playing out on a different scale.  This should caution against any blanket prescriptions for a priori acceptable sample sizes for any kind of research.  The sample size you need is dictated by your effect size, and that can’t always be known beforehand.  Who would have known that the correct sample size for candidate gene research was 50K?  Many people still don’t know, including reviewers at our top journals.
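
For the curious, here is a rough sketch of why the numbers get that big; the assumed per-variant effect of r = .02 is my illustrative placeholder, not an empirical estimate:

from math import atanh
from scipy.stats import norm

def n_required(r, alpha, power=0.80):
    """Approximate N to detect a correlation r (two-tailed), via Fisher's z."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ((z_alpha + z_power) / atanh(r)) ** 2 + 3

print(round(n_required(0.02, alpha=0.05)))  # ~20,000 at the usual alpha = .05
print(round(n_required(0.02, alpha=5e-8)))  # ~99,000 at a genome-wide significance threshold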

The interesting, and appalling, thing about the genetics research in personality psychology is that the geneticists knew all along that the modal approach used in our research linking candidate genes or even GWAS data to phenotypes was wrong from the beginning (Terwilliger & Goring, 2000).  In fact, the current arguments in genetics revolve around whether the right genetic model is a “rare variant” or an “infinitesimal” model (Gibson, 2012).  Either model accepts the fact that there are almost no common genetic variants that have a notable relation to any phenotype of interest, in personality, or psychology, or otherwise.  And by notable, I mean an effect size that is detectable using standard operating procedures in personality psychology (e.g., N of 100 to 500).

What this means in practical terms is that a bunch of research, some done by close friends and colleagues, is patently wrong.  And by close friends, I mean really close friends—award winning close friends.  What are we going to do about that?  What am I supposed to do about that? Simply ignore it?  Talk about it in the hallways of conferences and over drinks at the bar?  Tell people quietly that they shouldn’t really do that type of research?

Multiply this dilemma across our subfields and you see the problem we face.  So, maybe we should hit the reboot button and start our history over again.  At the very least, we need to confront our history.  Our current approach to the replication crisis is to either deny it or recommend changes to the way we do our current research.  Given our history of conducting unreliable research we need to do more.  In other essays I’ve called for a USADA of psychology to monitor our ongoing use of improper research methods.  I think we need something more like a Consumer Reports for Psychology.  We need a group of researchers to go back and redo our key findings to show that they are reliable—to evaluate the sturdiness and value of our various concepts year in and year out.  Brian Nosek’s Reproducibility Project has started in this direction, but we need more.  We need to vet our legacy, otherwise our research findings are of unknown value, at best.

Brent W. Roberts


57 Responses to The Beginning of History Effect

  1. Jeff Sherman says:

    Small effects are not patently wrong effects. And small effects are sometimes important. Perhaps we need to embrace that more emphatically. I fear that the end result of this way of thinking is that we can’t know anything that important about people with much certainty, so we might as well stop studying them. My prescription is two stiff drinks.

    • pigee says:

      Small effects are the bomb. Well, maybe that’s my second glass of wine speaking. There are entire funding agencies at NIH organized around small effect sizes–reliable, but small effect sizes.

      • Ruben says:

        I believe pre-registered (for now “actually theory-driven” if I feel able to tell) small effects more than out-of-the-blue, theory-is-an-obvious-fantasy-concocted-after-trying-analysis-variation-#25 large effects.
        Unfortunately this nice simple idea of pre-registration won’t work equally well across all fields.
        Who will believe you when you say you didn’t look at the data before registering your hypotheses about some MIDUS data? Many big data sets in personality psychology aren’t collected after every idea that could be explored with that data has been given thought.
        What are your ideas for improving scientific practice in personality and developmental psychology? Much of the debate and work in the OSF has focused on social psychology with its obvious problems.
        By the way, at least for GWAS analyses the negative results and nonreplications are implicitly published along with the over-emphasised positive results, so that, lo and behold, the research is actually turning to rare variants and de novo mutations (e.g., the recent autism studies, Iossifov et al.). Took a while, yes, but the newer research seems more solid to me. Of course, genetics-of-personality studies lag behind clinical psychology somewhat in this area, but it’ll come around, I hope.

  2. Jeff Sherman says:

    To further elaborate, this seems like a good argument to pay close attention to our collective empirical history–including unpublished data to whatever extent possible. I don’t see a good argument to start over rather than to work hard on estimating effect sizes accurately.

  3. What a downer of a blog post. Let me guess, it must have been raining the day you wrote this.

    • pigee says:

      Just call me Professor Buzz Kill. Just think what would happen if I lived somewhere like Oregon.

      • Suzanne Segerstrom says:

        Harsh on publication bias (I just finished p-curving data from a meta-analysis I did a few years ago, with interesting results), but do NOT harsh on Oregon.

      • pigee says:

        Rain-soaked tree hugger. Don’t tease us. What did you find with the p-curve analysis? The fact that our meta-analyses are potentially flawed too should be a serious concern to all.

    • Suzanne Segerstrom says:

      Data from Segerstrom & Miller (2004, Psych Bull). Five effects with sufficient numbers of studies: Effects of acute stress on NK cell #, cytotoxic T cell #, and NK cell cytotoxicity, and effects of longer-term, naturalistic stress on NK cell cytotoxicity and T cell proliferation. All had significant effect sizes in the meta-analysis. Four of the five had robust positive skew, but the effect of naturalistic stress on cytotoxic T cell # was not significantly different from a flat distribution. I’m guessing that when people found that effect incidentally (most people would have primarily been looking at NK cell #), they reported it, and when they didn’t, they didn’t.

  4. Michael says:

    B-Rob! I think one piece of silver lining in all of this is that right now is a great opportunity for researchers employing stellar methods to challenge existing theories– particularly those with weak data.

  5. MB says:

    There are some straw men here, however. Goal priming and candidate gene studies may be plagued by underpowered designs that are rarely replicated, but other topics are not. For example, political psychology has adopted a large sample size approach for much of its history. Similarly, in some areas of personality psychology large samples are the norm.

    As to replication specifically, a number of effects in psychology are well replicated (behaviorism, as one example) and are integrated into the field (and related fields).

    The point is, psychology has well-replicated effects supported by well-powered and underpowered studies alike. Choosing some bad examples without recognizing the good unfairly taints a lot of good work. I don’t think the sins of select types of research need to cause a reboot for an entire field.

  6. jasonjjones says:

    One reason people use small samples (n=50) is that large samples are expensive in time and money. A single study with 50 subjects could represent a semester’s worth of hard work for a marginally-funded graduate student. There is a lot of pressure to publish, and if they can get that study published, they will.

    We expect graduate students to have published by the time they look for a job, but it’s not reasonable to expect graduate students to collect 50,000 subjects on their own.

    Especially in areas such as GWAS studies, it is probably more appropriate to expect large consortia of labs to pool resources and data to generate studies that are adequately powered. But the current incentives for individual scientists – maximize publication count; maximize first-author count – do not encourage the appropriate cooperation.

  7. I think you hit the nail on the head, Brent! Just two days ago an esteemed psychologist told me he is not interested in merging his data with mine on exactly the same topic because “heaven knows that the review process is hard enough, and there’s really no reason to think that my data are any less robust than data from any other first-demonstration study.” It’s that (widespread) attitude that makes me disregard new articles more and more often. But molecular genetics is a good example, where candidate gene results are almost not publishable anymore, and even GWAS are required to have replications and are expected to come from consortia, where everyone who worked on the topic puts their data together.

  8. Interested Humanities Faculty says:

    I think it might be a bad idea to align “incorrect” or “bad” results with a lack of history. There is, in fact, a very rich history of psychology, but I have doubts that it could be explored from within the field itself. What I mean to say is that failures reveal a set of unsolved problems, and multiple attempts to arrive at a/the truth of this problem have incredible value for understanding the variety of historical interests that have framed these investigations. I am turning (empirical) failure into something of value because I believe that we have much to learn from the past. Rejecting the past ignores the ways that these previous studies might still speak to our present interests and concerns. I’ll cite and append William James here: “You can give humanistic value to almost anything by teaching it historically. Geology, economics, mechanics, are humanities when taught with reference to the successive achievements [and failures] of the geniuses to which these sciences owe their being.”

  9. rcfraley says:

    I agree with almost everything you’ve written, Brent (B Rob). In my view, the problem boils down to two issues.

    1. The field doesn’t value “null” results. As a consequence, the empirical literature does not represent a systematic summary of what is known on any question. We have a literature full of empirical anecdotes rather than systematic data.

    2. A surprising number of studies that are published in our top journals are underpowered to detect the kinds of effects we would consider worthy of investigation.

    You wrote, “what are we going to do about that? What am I supposed to do about that? Simply ignore it?”

    I think many of your solutions are sensible. (Although the possibility of having oversight committees brings out the anti-bureaucracy side of my personality and makes me want to “accidentally” spill hot coffee on your shoes.) Here is how I would answer those questions based on the two issues described above. I’ll focus both on how we can make sense of data from the past and what we can do in the future.

    1. In my view, there are three key questions we should ask ourselves when we learn or read about research: (a) Is the question important for practical or theoretical reasons?, (b) Are the methods rigorous enough to answer the question well?, and (c) Were the data analyzed in a way that adheres to “best practices” as we currently understand them? If the answer to those questions is “yes”, then there is little reason to disregard the research–whether that is older research from the literature or newer research.

    What can we do moving forward? Editors and reviewers can begin to evaluate studies on these bases too. If we weight the “front end” of the research process more heavily, we will have fewer articles accepted or rejected on the basis of the data/findings themselves. Moreover, we will have fewer incentives among researchers to torture the data into unsustainable confessions.

    There are a lot of papers in the literature that exploit researcher degrees of freedom and small sample sizes to publish potential Type I errors and/or gross estimates of effects. But there are some solid pieces of work out there too. I don’t know if my rule of thumb will accurately sort among them, but it is a principled way of addressing the problem, both for the literature as it stands now and for moving forward.

    2. Put most of your trust in (and base your teaching on) studies based on large sample sizes and established (vs. ad hoc) methods of assessment.

    Moving forward: Editors, reviewers, and dissertation committees can raise their threshold for the sample sizes that should be expected from research published in top journals. I don’t think we need a “one size fits all” solution to this problem. But, at the very least, the selection of sample size needs to be decided on a priori grounds (e.g., Simmons et al., 2011).

    I appreciate that sometimes it can be hard to obtain large sample sizes. But, let’s face it: there is more published research out there than any of us can read. We might as well adjust the way we allocate our resources so that we publish fewer papers, but make the ones we do publish and read as strong as possible.

    • Aaron Weidman says:

      For those of you who have served (or are currently serving) as editors, how feasible would it be to implement a policy at journals to require researchers to justify their sample size in each study of a paper? These types of justifications could take many forms depending on the researcher. For example, a faculty member working off a grant might justify sample size based on a pre-ordained power calculation, whereas a grad student writing a thesis might justify sample size based on a pragmatic concern such as “we collected data until the end of the semester prior to the spring in which I had to write my thesis.” Or, in the event that data is previously available (e.g., MIDUS), the researcher could simply say “we used all available data.”

      Justifications would hopefully guard against both a) samples that are small because data collection was terminated the moment the eager researcher detected a significant effect, and b) samples that are too large because the researcher kept collecting data until a small effect became significant. I bet this second issue would be remedied most in the context of multi-study papers, in which conceptual “replications” are often demonstrated with substantially larger sample sizes than the original study. For example, if Study 1 has a d = .65, and Study 2 then replicates this effect with a d = .30 and a sample size three times as large, the researcher would have some explaining to do.

      I’d love to hear thoughts!

      • Jeff Sherman says:

        I am an editor and not a big fan of the pre-determined sample size. We don’t have many effects for which we can confidently estimate power. Of course, that is more true for studies that are testing a novel effect (even if it’s just a novel moderator). And, if you know the power/effect size before doing the study, why would you bother to do it? Just report your calculated estimate: there; no data needed.

        To me, this obsession with sample size reifies the almighty p < .05, which is a bigger problem. If we don't care so much about .05, this issue loses most of its relevance. If the effect size is stable after 25 subjects, I don't care if you run another 25 subjects to ensure that your significant result wasn't spurious.

        I am proposing that we adopt the Psychonomic Society guidelines for Social Cognition, including statements about pre-determined sample size. But, I will be content with: that's the sample size per condition that is normally used in these kinds of studies. Frankly, that's about the best metric we've got for many hypotheses.

      • pigee says:

        Huzzah to Jeff for adopting the Psychonomic Society guidelines at Social Cognition. Those were well thought out.

        I have two thoughts on pre-determined sample size. On the first study in a series, I don’t see any reason to be strict. Presumably, you find an effect, otherwise you wouldn’t be reporting it. It is in the second, third, and fourth study that some justification is really needed and typically lacking. This is why effect sizes are so important. Regardless of the theoretical goals or ecological validity of the research, the first study nets you some effect size and, in the absence of other information, it should be used to guide the design of the subsequent research. How much power did you have to detect that effect and should you increase it for the second study? What you see now is a seemingly random walk of sample sizes across studies, which implicates data-peeking behavior.

        Second, I think Jeff is correct that sample size can reify p values. I think the solution to this reification is not only reporting, but interpreting effect sizes. When the average effect size in social psychology is equal to a correlation of .2 (Richard et al., 2003), then you have to question how the hell you got an effect equal to a correlation of .6 with 20 people per cell. It is unlikely to happen if the average effect size is .2. Also, as your sample size grows larger you hit a threshold where p-values become an embarrassment in the context of effect sizes. For example, when we used the Dunedin data (N = 1000 or so), we came up with a different way of evaluating the effects because Beta weights of .07 were statistically significant, but clearly pretty small. Or, take our current dilemma, which is a sample of 380,000 people. Everything is significant. Always. Even birth order (dig: Fraley). As we increase our sample sizes and use effect sizes, our p-value obsession will abate.
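
        To put some quick numbers on both points (illustrative values, using the Fisher z approximation):

        from math import atanh, sqrt
        from scipy.stats import norm

        # (1) With N = 380,000, even a trivially small correlation is "significant."
        N = 380_000
        z = atanh(0.005) * sqrt(N - 3)
        print(2 * norm.sf(z))    # ~ .002 for r = .005

        # (2) If the true effect is r = .2, how often would a study with 20 per cell
        #     (N = 40) observe a correlation of .6 or larger?
        n = 40
        z_score = (atanh(0.6) - atanh(0.2)) * sqrt(n - 3)
        print(norm.sf(z_score))  # ~ .001 -- a result like that should raise eyebrows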

      • rcfraley says:

        Jeff: You wrote “I am an editor and not a big fan of the pre-determined sample size. We don’t have many effects for which we can confidently estimate power.”

        I agree that this is a problem that makes the consideration of sample size and power more challenging.

        The way I encourage people to do it is to ask themselves the following question: How “big” does the effect need to be for me to be willing to interpret it as consistent with the hypothesis? (Or alternatively, how big does it need to be for me to “care” about it given the trade offs involved in time, N, power, and effect sizes.)

        Regardless of the *actual* size of the effect, people can answer this question. And, yes, it requires a highly subjective response for most people, but it can be a valuable question to consider. And if people cannot answer it, it is certainly principled to simply use the average effect size in an area of research as a threshold.

        Here are some rules of thumb:

        If you think an association of r = .10 or higher would be compatible with the hypothesis, then you need 617 participants to detect it with 80% power.

        If you think an association of r = .20 (the median correlation in personality research; Fraley & Marks, 2007) or higher would be compatible with the hypothesis, then you need 153 participants to detect it with 80% power.

        If you think an association of r = .30 or higher would be compatible with the hypothesis, then you need 67 participants to detect it with 80% power.

        Joe Simmons (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2205186) had an interesting set of suggestions that he offered at SPSP 2013. His argument was that, because it is difficult for people to think about power and sample size selection, rules of thumb might be a useful way forward.

        And one simple rule of thumb is this:

        Is the effect I’m expecting bigger or smaller than the association between gender and weight? If it is probably smaller, then I need at least 100 people.

        What’s clever about this is that the weight difference between men and women is an effect that is large enough that lay people can appreciate it without the need for systematic quantitative analysis. Moreover, based on data, we know the effect is equivalent to a Cohen’s d of approximately .50. To detect an effect of this magnitude with 80% power, you would need to sample 50 men and 50 women (total N = 100; n = 50 per cell).

        Thus, one simple rule that reviewers and editors can use is this: Researchers need *at least* enough power in their studies to detect effects that lay people would consider self-evident. Studying something that is not blatantly obvious with inadequate power is a recipe for trouble, regardless of whether you have a basis for estimating power a priori.
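
        For what it’s worth, those figures can be reproduced with the normal-approximation shortcut below, assuming a one-tailed test at alpha = .05 and 80% power (assumptions I am adding; exact power routines will differ by a participant or two):

        from math import atanh
        from scipy.stats import norm

        def n_for_r(r, alpha=0.05, power=0.80):
            """Approximate N needed to detect a correlation of r (one-tailed)."""
            z = (norm.ppf(1 - alpha) + norm.ppf(power)) / atanh(r)
            return z ** 2 + 3

        def n_per_cell_for_d(d, alpha=0.05, power=0.80):
            """Approximate n per group needed to detect a mean difference of d."""
            z = (norm.ppf(1 - alpha) + norm.ppf(power)) / d
            return 2 * z ** 2

        for r in (0.10, 0.20, 0.30):
            print(r, round(n_for_r(r)))       # ~617, ~153, ~68
        print(round(n_per_cell_for_d(0.50)))  # ~49-50 per cell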

      • Jeff Sherman says:

        This reply is to Chris’s post. Guess I can’t reply directly.

        You know, I don’t think I can answer your question about how big the effect size needs to be to matter. As I mentioned above, small effects are sometimes important effects (for theoretical reasons or otherwise).

        It also frequently is not possible to know the average effect size for an area. What area? All priming effects? Only evaluative priming effects? Only evaluative priming effects that use pictures as stimuli? Only evaluative priming effects that use pictures as stimuli and an SOA < 500ms? And on and on. In most cases, the exact study I am doing right now has never been done by anyone (to my knowledge). Even seemingly minor variations in operationalizations affect outcomes and effect sizes. Unfortunately, the subject of our research is damn complicated.

        Sorry, the lay person standard strikes me as an astoundingly bad idea. Again, that thing about small effect sizes mattering. You think a lay person would find the effect size of aspirin on heart disease self-evident? Never mind the geeky behavioral outcomes we are usually studying. You think a lay person understands interactions, never mind what the effect size of an interaction means? We are not usually testing simple main effects of variables that are easy to understand for a lay person. Terrible, terrible idea.

      • Jeff Sherman says:

        This is in response to Brent’s post. This inability to reply sucks.

        I agree with you about estimating effect sizes if you are doing an exact replication of the first study. Most of the time, we are not. We are usually replicating and extending, and, usually it is a conceptual replication. I am a big, big fan of the conceptual replication. I am not interested in nailing down the effect size of one specific operationalization of my I.V. (Construct A) on one specific operationalization of my D.V. (Construct B). Frankly, I’m a bit baffled by the sudden elevation of the direct replication, but that’s another argument.

        As for using the first effect size as a guide for non-direct replications, I’m afraid that I have little faith. Unfortunately, the strength of a relationship between 2 constructs in our field can vary widely depending on the specific operationalizations of the constructs. Start adding in novel moderators or tests of mediating processes, and who knows? I’m not opposed to anyone doing power estimation, I just don’t put a lot of faith in our ability to do it with much accuracy, outside of a direct replication. Unfortunately, human behavior is damn complicated, and our job is not an easy one.

      • pigee says:

        Hi Jeff,

        Sorry about the comment structure–attribute it to my lack of blog management experience.

        In terms of replication or conceptual replication, it really shouldn’t matter. If, in fact, as you say, “human behavior is damn complicated,” then you should anticipate that most of our manipulations or IVs will have modest effects on any DV we are interested in because the outcome, by definition, will be overdetermined–complex. That would mean a good bet on the second study would be that you should plan for a smaller effect size–regression to the mean and all–and run more participants. This becomes even more important when you insert “novel moderators,” as moderator effects demand much greater power to detect. The fact that we do not automatically increase sample sizes when we run moderators is a huge problem. But, as you say, that is a discussion for another day.

        Effect sizes are not guides to replication. They are vehicles to get at accurate estimates of how your IV affects your DV. Accumulated over time, they will let us know whether your theory works. NHST cannot and does not give you that judgment, as it should never be, nor has it ever been, a dichotomous decision.

        The reason for the push to do direct replications is that there is no longer trust in the system. If we can publish 9-study research papers showing that ESP exists, then we can’t be trusted. It is not really that onerous, nor is it at all a perfect solution. It is just one simple baby step on the way to regaining trust.

      • Jeff Sherman says:

        The trick is to hit “reply” in the email notice!

        What you say is fine and true, but is based on fealty to p < .05. As an editor, if you replicated your study and found the same effect size as in the original study, I don't care about your p-value. You have demonstrated the reliability of the effect size.

        We seem to largely agree about the importance of effect sizes but not about the implications for determining sample size.

        Trust in the system, exactly. For me, my trust in the system is enhanced by a conceptual replication using different operationalizations. I don't care how reliable your effect is if I don't know what it means. And, if it means what you say it means, then the effect should show on a conceptual replication.

      • rcfraley says:

        Jeff: You wrote “You know, I don’t think I can answer your question about how big the effect size needs to be to matter. As I mentioned above, small effects are sometimes important effects (for theoretical reasons or otherwise).”

        I agree that small effects can sometimes (perhaps even most of the time) be important for theoretical or practical reasons. If you believe that, however, it seems particularly important to ensure that your research design can detect those effects with decent precision.

        As an editor, wouldn’t you want to create incentives for researchers to use more powerful designs? I think it is terribly risky for the field to disregard this issue merely because it is difficult to know what effect sizes to expect as we wade into uncharted territory.

      • Jeff Sherman says:

        Being able to know the effect size with decent precision is very different from running a lot of subjects to make sure that you don’t mistakenly cross the sacred boundary of p < .05, which I believe is the primary argument being advanced for pre-determining your sample size. I’m all for knowing your effect size. Do you know how many subjects I need to run to know that? It’s not the size I need to produce a “significant” effect.

  10. mbdonnellan says:

    Great post. Hewitt (2012) has an interesting editorial for Behavior Genetics in which he demands replication studies to help weed out false positives in his journal. Contrast this hard-line stance with some of the push-back against reform in social/personality. We have to get more stringent about things, and the first step is admitting there is a problem.

  11. Sanjay Srivastava says:

    Are GWAS studies really the right comparison here? GWAS tests lots and lots of effects, most of which have true values that are extremely close to zero. The widespread presence of true zero effects makes null/nil hypothesis testing a reasonable approach, and because of the large number of tests it requires massive power to avoid false positives.

    By contrast, in many areas of personality psychology it would be surprising for the true effect to be zero. Paul Meehl wrote about this years ago when he talked about the crud factor – “everything correlates to some extent with everything else” (p. 204). I’d argue that that also applies to many areas of experimental social psychology that deal with broad, system-level variables (like emotions/affect, self processes, etc.). That requires a different framework for thinking about hypotheses and evidence, because the null hypothesis is never really true. I like Andrew Gelman’s suggestion that in areas where true zero effects are likely to be rare, we abandon the NHST language of Type I and II errors and “false positives” etc. and instead talk about Type S and Type M errors, for “sign” and “magnitude” respectively.

    And that leads to a different diagnosis. If a hypothesis makes only a sign prediction and not a magnitude prediction, we can fault the theory for not offering up a riskier hypothesis (Meehl again). But very often the investigator only formulates hypotheses and draws conclusions about sign (as Brent D pointed out in the comment thread on Funder’s post about effect sizes — in many instances people report effect sizes because APA makes them do it, but they devote zero discussion to them). And in a topic domain with an appreciable crud factor (unlike GWAS), you don’t need as much power to correctly detect sign. So yes, most studies are indeed underpowered (even given a crud factor), but not nearly as extremely as the GWAS comparison suggests. Moreover, researchers who ignore effect sizes are (somewhat ironically) avoiding making Type M errors, and they’re probably often reaching correct conclusions about S. In some cases that might be enough — some interesting theories can probably be tested just on the sign of predictions. But there are probably many more interesting theories and/or research questions that will require us to think about effect size as well. So it’s not so much that we need to throw out our history, as that we need to stop repeating it.
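
    For anyone who wants to play with that framing, here is a minimal sketch of the Gelman & Carlin “retrodesign” calculation; the true effect and standard error below are illustrative placeholders, not estimates from any particular study:

    import numpy as np
    from scipy.stats import norm

    def retrodesign(D, s, alpha=0.05, n_sims=100_000, seed=0):
        """Power, Type S error rate, and Type M (exaggeration) ratio for a study
        with true effect D and standard error s."""
        z = norm.ppf(1 - alpha / 2)
        power = norm.sf(z - D / s) + norm.cdf(-z - D / s)
        type_s = norm.cdf(-z - D / s) / power            # wrong sign, given significance
        est = np.random.default_rng(seed).normal(D, s, n_sims)
        significant = np.abs(est) > z * s
        type_m = np.mean(np.abs(est[significant])) / D   # average exaggeration when significant
        return power, type_s, type_m

    # An underpowered study: true effect 0.1 with standard error 0.1
    print(retrodesign(0.1, 0.1))  # power ~ .17, Type S < 1%, significant estimates ~2.5x too large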

    • Galen Bodenhausen says:

      Sanjay makes great points here. In promoting his “perspectivism” approach to science Bill McGuire argued that most psychological hypotheses are likely to be true under some circumstances and false under others. For this reason he urged researchers not to hide their pilot studies (and “failed” studies) from public view but to incorporate them into published reports — and he urged the field to expect complex rather than simple conclusions to emerge from sustained research on a given topic.

      • pigee says:

        Galen, let me put forth a “risky” hypothesis: Bill’s assertion that “most psychological hypotheses are likely to be true under some circumstances and false under others” emerges from doing too many small, underpowered studies that you falsely believe to be informative. Actually, I think it is a testable hypothesis….

    • pigee says:

      Sanjay, I used GWAS because it was convenient and it was based in personality psychology. I could have used any number of examples, including my own. For example, in Helson, Roberts, and Agronick (1995) we report a .48 (!!!) correlation between creative temperament in college and creative success in work 30 years later in a sample of 122 women. This is both statistically significant and a gross overestimate of the true effect size.

      My goal was two-fold in using GWAS. First, we really do know the answer to the effect size issue given the GWAS results. There is no crud factor here. Second, I’m tired of picking on the experimental psychologists. It is about time for personality psychologists to confess their own sins and fix their own house.

      I also agree with Tal below. There is no sound argument against taking responsibility for your effect sizes, as they are, in part, what determines whether you have a statistically significant effect. Your study can be perversely ecologically invalid (wire vs. mesh monkeys, anybody?), but there is still an effect size in there that becomes very important when considering the merits of a replication. I will go further and argue that people who make the “I’m only interested in the direction and/or significance of my effect” argument are either lying or rationalizing. And, in this case, the rationalization is more of a problem because it leads people to believe their own BS.

      • Jeff Sherman says:

        I can personally guarantee you that many, many studies in social psychology are done with zero concern for effect size. Your statement about lying or rationalizing is harsh and inaccurate. We are not trying to predict school GPA from some subscale of the MMPI. We are not trying to nail down the effect size of some new drug on a medical outcome. We are not trying to describe the increase in sales one can expect from using one ad versus another. We are building theories of how variables are related to one another. Effect sizes can be important components of that process, but they are not always. I recommend you go back and read Mook.

      • pigee says:

        Sorry, it was a bit harsh. Nonetheless, it did do what I wanted it to do, which was to elicit the response I wanted (not necessarily from you). So, go back and re-read the examples you used. They were all applied. It is tremendously difficult not to interpret this as saying “The experimentalist is the real scientist who tests theories. Effect sizes are for the unwashed types who don’t really have ideas and who just predict things”. That too, is harsh. This is an unnecessarily harsh attitude towards using effect sizes, which gets in the way of experimentalists using and interpreting them. As I noted in the other comment, effect sizes are not perfect or the ultimate answer, but to ignore them is silly and regressive. Use them. They will become your friend. With time, they may actually tell you what you want to know.

      • Jeff Sherman says:

        I did not say anything about one type of research being better or more important than the other. I don’t have a harsh attitude toward effect sizes. My point is that knowing precise effect sizes matters more in some kinds of research than in others. The choice of applied examples was intended.

        I am all for knowing and using effect sizes. But, frequently, it is not my *primary* goal to know the effect size, which you claimed was my goal, even if I denied it.

      • pigee says:

        My point is that knowing your effect size should be everyone’s goal as much as knowing whether something is statistically significant. If you don’t use effect size information you become a victim of it. For example, one of the most conspicuous features of the priming research that is currently being called into question is the enormous effect sizes associated with the string of studies that now can’t be replicated. If some theoretical idea has a d of 1, then a monkey should be able to directly replicate the effect without much difficulty. The fact that we can’t just adds to the conclusion that too many of our colleagues have scammed us and the entire field. As I alluded to in another comment, the average effect size on the r-metric in social psychology is equal to a correlation of .2. Which, by the way, has to be an overestimate because experimental social psych studies have been historically underpowered. As an editor, you should be asking questions like “hmmm, they have a cell sample size of 20 and they have an interaction effect equivalent to an r of .6, is that believable?” The answer, I fear, will be no.

      • Jeff Sherman says:

        In doing the research, you will learn your effect size. I am not suggesting that we should not use that information. I am merely reporting to you that learning the effect size is often not the primary goal of conducting the research (which was your assertion). I will maintain that knowing the precise effect size is more important for some kinds of research than others.

    • Jeff Sherman says:

      Well said.

  12. MB, I think Brent was pretty clear that these are generalizations that apply to varying degrees to different subfields, and won’t be applicable to all.

    Sanjay, I couldn’t agree more with the suggestion to favor a Type S/M framework over conventional null hypothesis testing. That said, I think very few people really, truly care only about sign and not about magnitude. My sense, having discussed these issues with people often, is that the typical move is to sneak effect size in through the backdoor under the pretense that one cares only about rejecting the null. In other words, what people typically mean when they say “I don’t care about how big an effect is, just that it’s statistically significant” is really something more like “I can feel pretty confident that any effect I manage to detect must be large enough to care about, because I’m using small samples, so I only need to worry about the sign”.

    Now in truth we know it doesn’t work that way (i.e., because of various factors, we often end up detecting inconsequential effects even with tiny samples), but even if it did, the point is that when you force the issue, I don’t think very many people would really get out of bed and go to work if the best they could hope for is to demonstrate the sign of the effect and say nothing about the magnitude. A big part of the reason is precisely the crud factor you allude to: given that we expect everything to correlate with everything for (largely) uninteresting reasons, how much have we actually learned if we find that X is reliably positively associated with Y but can’t say anything about the magnitude? After all, even if there was no interesting systematic reason for that association whatsoever (i.e., it was all crud), there would be a 50% chance of getting the sign right. So with maybe some isolated exceptions, I just can’t see people in most domains of psychology caring about sign to the exclusion of magnitude. Actually (and this may well just be my own failure of imagination), I can’t think of a single effect in personality psychology where I’d be content to know just the sign.

    As far as the general issue of how to deal with these problems, I think that the best long-term solution may be to remove barriers to publication of findings (i.e., conventional pre-publication review) and emphasize centralized post-publication review platforms where there’s a permanent record of discussion and people are free to raise any methodological concerns they have about anyone’s work. I suspect a big part of the problem we have at the moment is that if you happen to land three uncritical reviewers who aren’t aware of some of the issues Brent brings up, your paper gets through the filter and thereafter appears to have the field’s seal of approval, with no real opportunity for correction. Now envision a system where there is no gatekeeper, and the process of evaluation really begins after ‘publication’, and you’re in a much better position to have iterative (and very rapid) evaluation of a paper’s true merit. I’ll shamelessly plug my own work and point to this paper I wrote recently, which delves into the details of how this could work (it’s part of a broader special topic on post-publication review platforms, and it turned out that pretty much everyone else who contributed had more or less the same ideas).

    • Jeff Sherman says:

      In social psychology, many of the hypotheses are about direction and not size. Will people obey? Does positive mood increase persuasion? Does cognitive load increase stereotyping? Does mere social categorization produce intergroup bias? Does mere exposure produce positivity? Do people make normative attributions about others’ behavior? We did not know the existence/direction of these effects when they were first demonstrated.

      It is certainly useful to know the effect sizes in determining likely theoretical extensions, but it is not the primary question of interest.

      Outside of social psychology, there are numerous examples of such “can this happen at all?” studies: Does biofeedback work? Can apes learn sign language? This kind of science happens all the time in the physical sciences, as well. Mook (1983) has a great discussion about this.

      No gatekeeping: shudder. I appreciate the intent, but we’re already drowning in crap. I am all for post-publication review, but I want to retain the gatekeeping.

      • rcfraley says:

        Jeff: I appreciate that many of the hypotheses that are tested in social psychology are directional.

        But notice that each of your statements is not a hypothesis per se, but a “research question.”

        Once we start thinking about research questions, it is really quite trivial to frame those in ways in which magnitude matters. For example, instead of simply asking “Does mere exposure produce positivity?” we can ask “How much does mere exposure affect positivity?” or “How much of a change in positivity is observed when we manipulate exposures in the following ways?”

        In my opinion, this way of thinking allows us to build a better knowledge base, provides more constraints for our theories, encourages better measurement, and allows us to better understand the way in which context may impact our estimates.

        I keep trying to imagine where other sciences would be if they had been content with claims that “pulling on things makes them longer” or “pushing things makes them go faster.” We can build airplanes, land on the moon, and understand the distribution of earthquake magnitudes because quantities and parameters are critical to theoretical and applied scientific developments.

      • Jeff Sherman says:

        Of course, during the research process, you will learn the size as well as the direction of the effect, and it should be reported. As for follow up questions, usually we are going to be looking at novel moderators or process mediators and then, once again, you are in the realm of having directional predictions but not strong a priori ideas about effect size. Again, learning and reporting and using those effect sizes for building knowledge and theory is important–but it is rarely the central goal of the study.

        Those other sciences you imagine don’t have to worry about whether the effect size differs depending on how you measure gravity, for example. Must be nice.

      • As I said above, the kinds of examples you give are only about direction and not magnitude if you ignore crud and quietly sneak effect size in through the back door. The fact of the matter is that the null hypothesis is always false just about everywhere (except maybe, say, ESP). Which means that, a priori, there is a 50% chance that any hypothesis about sign will be correct even if it’s just a guess. But it’s worse than that, because we have the crud factor to worry about: many hypotheses will be ‘correct’ only in the nominal sense that the effect in the population happens to go in the predicted direction, and not for any interesting reason at all. In other words, in the vast majority of cases you will have learned next to nothing by showing that an effect is reliably positive or reliably negative, because every effect has to go one way or the other, and there will be a million uninteresting reasons for that particular sign in every single case.

        Take your example of positive mood increasing persuasion. Let’s suppose that positive mood does in fact increase persuasion. Let’s also suppose that the true population effect is r = 0.001. Now, if researchers run enough studies, they will eventually establish that, hey, you know what, it looks like positive mood increases persuasion. The trouble is that an effect size of r = 0.001 cannot be considered large enough to say anything about anyone’s theory, simply because of crud. In other words, you could tell all the stories you want about how your elaborate and elegant psychological theory predicts that positive mood should in fact make it easier to persuade people, but the simple (and perfectly correct) rebuttal will be for me to say “you must be kidding–with an effect size that small, any constellation of random factors could have just as well produced that effect, and there’s a 50% chance I would have been right just by guessing, without needing your theory at all”. So no, it simply isn’t true that social psychologists don’t care about effect size. It’s just that they sneak effect size in through the back door by running relatively small samples and implicitly assuming that if an effect attains significance in enough relatively small studies, it must be important enough to evade the crud factor. But this is a very bad way to do things in the long run, because it leaves us completely unable to determine which effects are actually worth theorizing about, and which could just as well be completely uninteresting.

      • Jeff Sherman says:

        If I can show that positive mood increases persuasion with multiple operationalizations of mood and persuasion, I feel pretty good that it’s not random crud. Fortunately, none of the effects I described has an effect size of r = .001. Nevertheless, small effects (probably not that small) can be important.

        Obviously, it would not be accurate to say that social psychologists don’t care about effect size. However, it is entirely accurate to say that determining the effect size is rarely the primary goal of a given study conducted by an experimental social psychologist (or an experimental cognitive or perceptual psychologist, for that matter).

        Do you think the criterion for publication should be effect size alone? What size? Also, I’d be interested if you can tell me how many subjects per condition you need to run in the above examples (or when testing a novel hypothesis) to attain a stable estimate of effect size? How do you know?

      • Jeff Sherman says:

        Also, I’d like to point out that much of the conversation here seems predicated on a search for main effects or simple comparisons. The vast majority of the time in experimental social psychology (in all experimental psychology?), we are testing interactions, where the simple directional argument is not so simple.

        Somehow, in arguing that effect size isn’t necessarily the primary goal of some research, I seem to have come to be viewed as someone who doesn’t think effect sizes are important. Just to clarify, I think they are important, and I am no strong adherent to NHST. At the same time, sometimes it is important to know that a result is unlikely due to chance alone. I hope that we will eventually arrive at a place where publication decisions are based on combinations of information about effect sizes, p-values, and/or confidence intervals (and maybe more). The particular importance of each piece of information varies, depending on the nature of the research.

      • pigee says:

        Sorry Jeff, but I’m going to beat this horse a bit more. I’m not sure if the horse is dead yet. You say “The particular importance of each piece of information varies, depending on the nature of the research” in reference to effect sizes, p-values, confidence intervals, and, of course, power. I’m going to argue the following:

        1. At this point in time, we don’t know how important each piece of evidence is for anything.
        2. Effect sizes, p-values, confidence intervals, and power are all so interdependent that they are inseparable. If you know your p-value and sample size, you know your effect size and your power, for example. There is little or no excuse for not reporting effect sizes at this point in our history.
        3. We do know that the p-values associated with any given test statistic in any given study tell us almost nothing about the replicability of our research.
        4. Adding more information to the evaluation of a study, such as effect size indicators, gives us additional information on the replicability and likelihood of a finding. It does this by letting us use prior information about effect sizes to gauge the likelihood of an effect. At a gross level, if 90% of the studies in psychology find d-scores between .25 and .50 (which I believe to be fact; Meyer et al., 2001), then finding a d-score of 1 across 4 different studies should raise some questions about the likelihood of that set of findings–a point that Greg Francis and Uli Schimmack have been pushing as of late.
        5. Given the fact that we don’t know the direct relevance of each indicator to any given program of research, to proceed by ignoring any of these indicators (power, p-values, effect sizes, and N) is to willfully ignore potentially important information. Systematically requiring effect sizes and confidence intervals is so easy. If we begin to use them systematically we will only gain more knowledge than we have now, which is a good thing, right?

        You also add that “…we are testing interactions, where the simple directional argument is not so simple.” Exactly. This is another reason to report, interpret, and plan for effect sizes. We know, for example, that interaction effects tend to be small, and therefore require exponentially larger sample sizes (Cohen, 1996). In fact, Cohen makes it pretty clear that you need upwards of 400 participants to detect a small effect in a 4 group ANOVA (this is not the interaction case, BTW). The fact that we do not systematically increase our sample sizes when testing interaction effects, which as you say, are the main focus of our research, makes the reported significant interaction effects even less likely.

        In another post you argue that some of us are accusing researchers of running small studies to manipulate Type I errors (Sampling Error Abuse: SEA). Just to be provocative, I’ll stand by that accusation and add a little anecdote to back it up. I had a conversation with an Experimental Social Psychologist (ESP) last year about these exact issues. Much to my surprise, he knew quite well what the power requirements were for detecting an interaction effect and said something to the effect of “Have you seen Cohen’s numbers? To hell if I’m going to run 400 subjects for one study.”

        So, think about this a bit more and you will identify why we don’t trust “conceptual” replications. The ESP would rather run a string of underpowered studies with average N of 80, than one study with a big sample size. It makes sense, right? Why invest so much in one permutation of your original design? Also, you were taught that 20 people per cell is an acceptable “rule of thumb” for designing an experiment back in grad school. If he or she runs 4 studies per semester with 80 subjects each, then the ESP can try 4 different “conceptual” replications of the original idea. Now, let’s be charitable and say that one of the four studies nets a statistically significant effect (if our true Type I error rate is 50% then he or she hits twice). If the ESP already has one study in the bag and adds this one study as a “conceptual” replication he or she has a two-study paper. Run four more studies in the spring and you should get another hit. The package of studies sent to JPSP or Social Cognition will be the three studies that netted statistically significant findings without mention of the studies that failed. The argument for the conceptual nature of the replications is actually a tacit acknowledgement of SEA rather than proof that an idea is robust.

        So, clearly, some of our colleagues are manipulating Type I errors. To be kind, I would argue that many are doing it out of ignorance rather than malice. Doing what you were taught by your advisor is not a crime. That said, many researchers know what they are doing and persist because that is the game we are playing–good will hunting for p-values.

        Mind you, this is where effect sizes can be miserably misleading: we never get an accurate picture of the true effect, because the failed studies are never reported.

      • Jeff Sherman says:

        Brent, I’m happy to continue flogging this horse. I feel like you’re misrepresenting my position.

        You continue to suggest that my claim is that we shouldn’t care about effect sizes, when that is not my claim. Again, I have merely reported to you the fact that, contrary to your claims, estimating the effect size has not been the primary concern of experimental psychologists. A different argument is whether estimating the effect size *should* be the primary concern. I agree that estimating effect sizes is an important component of research, but not necessarily always the most important one. My point about effect sizes, p-values, and confidence intervals is the same one being made by Tal and Simine (in funderblog): we want to have a sense of the uncertainty surrounding the effect size. In agreement with you, my point was that we should be using all of these kinds of information.

        I have certainly never argued that we should not be reporting effect sizes. I don’t know why you continue to suggest that I am opposed to that. In my opinion, there is all but universal consensus on that matter. I believe that confidence intervals also should be regularly reported.

        I think part of the confusion about effect sizes may stem from disagreements about the primary purposes of our research and the role of effect sizes in those purposes. For me, effect sizes are mostly something to be revealed and reported in the course of conducting research, for all the important reasons you state. If I understand you correctly, for you, determining the effect size is the (only?) reason for doing the research in the first place.

        I admit that I am not as enthusiastic about the planning or interpreting aspects of effect sizes you mention because: 1) I don’t have faith in our ability to pre-estimate effect sizes very accurately, despite all the assurances that we should just rely on previous related research; and 2) I am generally not comfortable interpreting whether or not an effect size is large enough to care about, an issue that we’ve been kicking around throughout this thread.

        The remainder of your post highlights what I consider to be the most regrettable aspect of the recent turmoil in our field. It is one thing to recognize problems in our research practices and advocate for improving them. It is another thing to attribute dishonest motives to those who are not behaving as you would have them behave. Not only do I think your claims of motive are wrong, I think they are counter-productive to the goal of changing things for the better because this framing tends to make people angry, defensive, and dismissive.

        The person who told you “to hell with running 400 subjects” is responding to a reality in our incentive structures, not to a desire to cheat. As you note, the main consequence of under-powered studies is the failure to identify small effects. This is why there is a negative correlation between sample size and effect size: only large effects show up in small samples. The truth is that, before anyone is going to run 100 subjects/condition, the incentives in our profession are going to have to change. People want to do science as effectively as possible while also getting and keeping their jobs. In my opinion, changing the incentive structures may be the single most important problem to address.
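        A quick simulation sketch of the mechanism behind that negative correlation, using made-up but plausible numbers (a modest true effect of d = 0.3 and per-group sample sizes anywhere from 15 to 200):

```python
# A simulation sketch of the significance filter. Illustrative assumptions:
# a modest true effect (d = 0.3) and per-group sample sizes from 15 to 200.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d_true, n_studies = 0.3, 20000
n_per_group = rng.integers(15, 200, n_studies)      # each study's per-group n
se = np.sqrt(2 / n_per_group)
d_hat = rng.normal(d_true, se)                      # observed effect sizes
p = 2 * stats.norm.sf(np.abs(d_hat) / se)           # large-sample two-sided p
published = p < 0.05                                # keep only the "hits"

r_all = np.corrcoef(n_per_group, d_hat)[0, 1]
r_pub = np.corrcoef(n_per_group[published], d_hat[published])[0, 1]
print(f"corr(n, d-hat), all studies:       {r_all:+.2f}")  # about zero
print(f"corr(n, d-hat), significant only:  {r_pub:+.2f}")  # clearly negative
```

        Nothing dishonest has to happen in any single study for the published record to end up looking like this; the significance filter alone produces the pattern.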

        Yes, I would much rather conceptually replicate the same effect with under-powered studies than be certain of the effect of my one operationalization of my IV on my one operationalization of my DV. First, let me repeat that I am skeptical of our ability to know a priori when a study is sufficiently powered, especially when testing interactions. Second, we just disagree about the value of different kinds of evidence. For me, conceptual replications provide the greatest feeling of trust that we have truly learned something, and I’ve read nothing to dissuade me from this view. You make it sound as though producing conceptual replications is trivially easy, and it is not.

        I consider your example of the researcher running multiple studies and cherry-picking the ones that work irrelevant to this discussion. What you describe is simple dishonesty, a form of fraud. The behavior you describe is not due to ignorance and is, in my opinion, a scientific “crime.” Any researcher pursuing that strategy will not be bound by any guidelines you propose. Let me also state that I am highly doubtful that that strategy is a common one. First, most people know that it is wrong. Second, most people will not waste precious time and resources on continuing to pursue such an obviously unreliable finding.

        Finally, we are in full agreement that there needs to be a means of publicizing null results. At this point, my preference is for some kind of on-line repository categorized by topic and the common publication of meta-analyses.

      • Jeff Sherman says:

        And, finally, I think it’s interesting that half of the people arguing passionately for pre-determining sample sizes want to make sure that no one accidentally achieves a significant result, whereas the other half are arguing that NHST is useless and we only need to know the effect size.

      • “If I can show that positive mood increases persuasion with multiple operationalizations of mood and persuasion, I feel pretty good that it’s not random crud. Fortunately, none of the effects I described has an effect size of r = .001. Nevertheless, small effects (probably not that small) can be important.”

        Well, this is exactly my point. In practice, if you manage to replicate an effect repeatedly using conventional (small) sample sizes, the effect must be big enough to care about–and not just any value other than zero. (Or at least, that would be true if not for the pernicious effects of various selection biases that operate to greatly inflate the false positive rate; but we can ignore that for the moment.) But it’s disingenuous to say that this means social psychologists only care about the sign of the effect. What it really means is that social psychologists (and most other psychologists) are nearly always making tacit assumptions about effect sizes that are sometimes justified but are very often not.

        “Obviously, it would not be accurate to say that social psychologists don’t care about effect size. However, it is entirely accurate to say that determining the effect size is rarely the primary goal of a given study conducted by an experimental social psychologist (or an experimental cognitive or perceptual psychologist, for that matter).”

        Well, it’s certainly not the primary stated purpose, but again, that’s only because we’re sneaking effect size assumptions in through the back door. If you make the Sign/Magnitude distinction explicit, I think it’s very clear that no social psychologist is interested in identifying effects within the bounds of the crud factor. I would characterize the true primary purpose of most studies as being the detection or testing of effects that are large enough to be meaningful.

        “Do you think the criterion for publication should be effect size alone?”

        Well, as I said above, I think the notion of official “publication” as a discrete step in scientific evaluation is an artifact of history that’s going to go away in the next few years. But if you mean, more generally, criteria for evaluating a finding, then no. I think having a sense of the uncertainty around the point estimate of effect size is also very important. And p-values are certainly helpful in that respect (though not as informative as confidence intervals).

        “What size?”

        Obviously this is a very difficult problem, and it will vary on a domain-by-domain and question-by-question basis. But if you think that opting to not think about effect size at all is a solution, you’re mistaken. If I pick a sample size by fiat and say “I just care about whether the result is significant,” I am in effect making a claim about what effect size I find meaningful. I’m just making a blatantly ridiculous claim, which is that I care about any effect that isn’t exactly zero. Again, nobody gets out of bed for effects of r = 0.000001. The fact that you don’t have an excellent a priori estimate of effect size is not a reason not to at least ask yourself two important questions: (a) what ballpark estimate is reasonable given the closest literature to the question I’m asking (cf. Brent’s point about genetic effect sizes), and (b) what is the minimum effect size I would consider worthwhile to pursue. In many domains (including many in social psychology), taking those two questions seriously would result in researchers going about their research in a very different way.
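        One way to make that implicit claim concrete is to invert the usual power calculation: fixing alpha and a per-group n by fiat amounts to committing to a smallest effect you can reliably detect. A hedged sketch, with illustrative numbers:

```python
# Inverting the power calculation: the smallest d detectable with 80% power
# for a two-group comparison at alpha = .05, for several per-group n's.
# The n's are illustrative; nothing here is a prescription.
from scipy import stats, optimize

def power_two_group(d, n, alpha=0.05):
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ncp = d / (2 / n) ** 0.5
    return stats.nct.sf(t_crit, df, ncp)

for n in (20, 50, 100, 400):
    d_min = optimize.brentq(lambda d: power_two_group(d, n) - 0.80, 0.01, 3)
    print(f"n = {n:>3} per group -> 80% power only for d >= {d_min:.2f}")
# n =  20 per group -> 80% power only for d >= 0.91
# n =  50 per group -> 80% power only for d >= 0.57
# n = 100 per group -> 80% power only for d >= 0.40
# n = 400 per group -> 80% power only for d >= 0.20
```

        Read this way, “20 per cell and p < .05” is not agnosticism about effect size; it is a bet that the effect is very large.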

        “Also, I’d be interested if you can tell me how many subjects per condition you need to run in the above examples (or when testing a novel hypothesis) to attain a stable estimate of effect size? How do you know?”

        I’m not sure if this is a serious question. None of these particular examples are something I know much of anything about, so there’s no reason for me to be able to answer them. Now if you ask me the same question about an effect that falls within my area of interest (e.g., neural correlates of personality dimensions in specific contexts), I expect to be able to give you a reasonable ballpark answer. Not a precise one, but an educated guess that basically amounts to a weakly informative prior.
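        On the “stable estimate” part of the question, one rough way to put numbers on it is to look at how the 95% confidence interval around a correlation shrinks with n (the r = .30 value below is just an illustrative assumption):

```python
# A rough gauge of "stable enough": the 95% CI around a correlation via the
# Fisher z transform. The r = .30 value is only an illustrative assumption.
import numpy as np
from scipy import stats

def r_ci(r, n, conf=0.95):
    z = np.arctanh(r)                    # Fisher z transform
    se = 1 / np.sqrt(n - 3)
    zc = stats.norm.ppf(0.5 + conf / 2)
    return np.tanh(z - zc * se), np.tanh(z + zc * se)

for n in (30, 80, 250, 1000):
    lo, hi = r_ci(0.30, n)
    print(f"n = {n:>4}: r = .30, 95% CI [{lo:+.2f}, {hi:+.2f}]")
# n =   30: r = .30, 95% CI [-0.07, +0.60]
# n =   80: r = .30, 95% CI [+0.09, +0.49]
# n =  250: r = .30, 95% CI [+0.18, +0.41]
# n = 1000: r = .30, 95% CI [+0.24, +0.36]
```

        The sign of a .30 correlation is barely pinned down at n = 30, and getting the estimate within roughly ±.05 takes samples closing in on a thousand; that is the sense in which a ballpark answer here is only a weakly informative prior.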

        Note that no one in this thread is suggesting that there’s a one-size-fits-all approach or a magic wand you can wave to answer these difficult questions. The argument is simply that ignoring effect size entirely is not actually an alternative approach (even though it may feel that way); it simply amounts to making the worst possible assumption about effect size, namely, that you really, truly care about any effect that isn’t exactly zero. What I’m pointing out is that literally any other estimate, no matter how bad, is better than that. And in most cases, we can actually come up with reasonable ballpark estimates. E.g., priming effects are small (r’s < 0.1), genetic effects are tiny, and so on. Attacking such claims on grounds that they’re imprecise is silly, because being imprecise is surely better than being ludicrously wrong and assuming a completely uniform prior effect size distribution. (And note that even people who claim not to care about effect sizes do in fact indirectly base their sample size selection on effect size estimates. For instance, personality psychologists will typically do “exploratory” studies with much larger samples than cognitive psychologists, which is a testament to the fact that there are established conventions in these fields about how big effects should plausibly be, even when they’re not made explicit. They may be lousy conventions, but they’re still better than nothing.)

      • Jeff Sherman says:

        I guess part of my reaction to this is a reaction to what seem like silly claims about both the abilities and motivations of researchers, specifically, that they have been intentionally running small sample sizes to manipulate the likelihood of Type I errors. First of all, most experimental psychologists are not (or were not) statistically savvy enough to have planned that intentionally. The standard use of sample sizes of 20-30 subjects/condition is likely due to a simple desire to use a large enough sample to generate stable means/variances but not so large as to guarantee a significant result (I vividly remember critiques of sample sizes that were too large). So, there is no need to imply insidious manipulation.

        Only in the last 25 years or so has the idea of meta-analysis caught on as a means of reliably establishing effect sizes. If the idea now is to establish reliable effect sizes within each individual study, then things will have to change.

  13. dcfunder says:

    Can I just say how much I am enjoying this exchange between Jeff and Tal. They are managing to explain their sharp disagreement while at the same time maintaining a tone of mutual respect and, most of all, intellectual honesty. So rare. It is much easier to (a) get nasty and attribute beliefs to your debating partner that they don’t really hold but are easily rebutted or (b) start mushing together very real disagreements so that the issue goes away by becoming vaguer, rather than clearer. As much as I dislike (a), in a scientific context (b) is worse.

    About the substance of the debate, I think each has contributed enormously by finding the weakest point in the other’s argument. Tal found the weakest point of Jeff’s argument: No matter how much one might want to be able to justify it, it’s literally **impossible** to not put effect size front and center because as soon as you choose a p-level and a sample size, you have automatically chosen the effect size you will regard as “important”– whether you compute it or not. Jeff found the weakest point of Tal’s argument; it really is difficult-bordering-on-impossible to say how “big” an effect size has to be to matter. Is .01 enough? It is, if we are talking about aspirin’s effect on second heart attacks, because wide prescription can save thousands of lives a year (notice, though that you need effect size to do this calculation). Probably not, though, for other purposes. But really, I don’t know how small is too small.

    Maybe there is another way to think about effect size, as ordinal rather than in terms of absolute size. There are many, many contexts in which we care which of two things matters **more**. Personality psychologists routinely publish long (and to some people, boring) lists of correlates but such lists do draw attention to the personality variables that appear to be more and less related to the outcome of interest, even if the actual numerical value isn’t necessarily all that informative.

    Social psychological theorizing is also often, often, phrased in terms of relative effect size, even if the effect sizes aren’t always computed or reported. The whole point of Ross & Nisbett’s classic book “The Person and the Situation” is that the effects of situational variables are larger than the effects of personality variables, and they draw theoretical implications from that comparison that — read any social psychology textbook or social psych. section of any intro textbook — goes to the heart of how social psychology is theoretically framed at the most general level. The (in)famous “Fundamental Attribution Error” is expressed in terms of effect size — situational variables allegedly affect behavior “more” than people think. How do you even talk about that claim without comparing effect sizes? The theme of Jenny Crocker’s address at the presidential symposium at the 2012 SPSP was that “small” manipulations can have “large” effects; this is also effect size language expressing a theoretical view. Going back further, when attitude change theorists talked about direct and indirect routes to persuasion, this raised a key theoretical question of the relative influence of the two effects. Lee Jussim wrote a whole (and excellent) book about the size of expectancy effects, comparing them to the effects of prior experience, valid information, etc., and building a theoretical model from that comparison. I could go on, but, in short, the relative size of effects matters in social psychological theorizing whether the effects are computed and reported, or not. When they aren’t, of course, the theorizing is proceeding in an empirical vacuum that might not even be noticed – and this happens way too often, including in some of the examples I just listed. My point is that effect size comparisons, usually implicit, are ubiquitous in psychological theorizing so it would probably be better if we remembered to explicitly calculate them, report them, and consider them carefully.

    • Glenn I. Roisman says:

      A very interesting discussion–thanks to all, and to David for his summary. That said, as someone who had the good fortune to take a class in Philosophical Psychology with P.E. Meehl, I thought I’d comment on the following by David:

      “About the substance of the debate, I think each has contributed enormously by finding the weakest point in the other’s argument. Tal found the weakest point of Jeff’s argument: No matter how much one might want to be able to justify it, it’s literally **impossible** to not put effect size front and center because as soon as you choose a p-level and a sample size, you have automatically chosen the effect size you will regard as “important”– whether you compute it or not. Jeff found the weakest point of Tal’s argument; it really is difficult-bordering-on-impossible to say how “big” an effect size has to be to matter. Is .01 enough? It is, if we are talking about aspirin’s effect on second heart attacks, because wide prescription can save thousands of lives a year (notice, though that you need effect size to do this calculation). Probably not, though, for other purposes. But really, I don’t know how small is too small.”

      I agree wholeheartedly with David’s assessment of Tal’s correct emphasis on why effect size matters, whether we think it does or not in the context of NHST. On the other hand, I think Meehl had the answer to David’s comment that “it really is difficult-bordering-on-impossible to say how ‘big’ an effect size has to be to matter.”

      In fact, effect sizes matter for (at least) two DISTINCT reasons that ought not to be confounded. In the first–and relevant to the aspirin example–we are using an effect size (presumably one we have a great deal of trust in due to large N and small SEs around the estimate) to do PARAMETER ESTIMATION. We are not testing some theory about aspirin. No really. We already knew it was safe and cheap and we wondered how many lives would be saved if we used aspirin prophylactically for those at risk for heart disease (or who have already had a CVA), due to its expected role in alleviating inflammation (as I understand it). Because aspirin use/dosage and mortality exist in non-arbitrary metrics, we can answer David’s question easily. 1000 lives sounds worth it to me in light of the cost of aspirin.
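      To make the parameter-estimation use concrete, here is a small sketch using heart-attack rates of roughly the size usually cited for the aspirin trial; treat the two rates as illustrative assumptions rather than the trial’s exact figures.

```python
# Parameter estimation in non-arbitrary units. The two heart-attack rates are
# illustrative assumptions of roughly the size usually cited for the aspirin
# trial, not the trial's exact figures.
import math

rate_placebo, rate_aspirin = 0.0171, 0.0094   # assumed event rates per arm

risk_diff = rate_placebo - rate_aspirin
avoided_per_100k = risk_diff * 100_000
nnt = 1 / risk_diff                           # number needed to treat

# phi / r for a 2 x 2 outcome table with a 50/50 treatment split
p_bar = (rate_placebo + rate_aspirin) / 2
r_phi = risk_diff / (2 * math.sqrt(p_bar * (1 - p_bar)))

print(f"r (phi):                                   {r_phi:.3f}")  # about .03
print(f"heart attacks avoided per 100,000 treated: {avoided_per_100k:.0f}")
print(f"number needed to treat:                    {nnt:.0f}")
```

      Because the metrics (people, heart attacks, pennies per pill) are non-arbitrary, an r of about .03 translates directly into hundreds of events avoided per hundred thousand people treated, and the cost-benefit question nearly answers itself; no such translation exists when a theory predicts only “some nonzero effect.”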

      The other use to which we might want to put effect sizes is THEORY TESTING. What the Big N examples with non-directional and directional hypotheses demonstrate is that effect sizes are basically useless in testing most of PSYCHOLOGY’s substantive theories, because the theories don’t care whether the effect is tiny or humongous. In short, there is no rational basis to identify a lower ES threshold for accepting a theory when the hypothesis being tested is deficient/quasi-unfalsifiable (e.g., any positive [negative] value is consistent or, often enough, any non-zero value is consistent). Meehl realized this long ago in emphasizing that most “theory testing” in psychological science amounts to “Did I have a large enough N?” because we are treating NHST as a form of theory testing, which it is absolutely not. In other words, our field rarely does real theory testing at all (see below) and that is why we can never identify on rational grounds a lower threshold for ES (other than the one embodied in the NHST test–that is, not exactly r=.00000000000000000000).

      Meehl and other philosophers of science also have the answer to this problem–which is to follow the lead of scientists elsewhere and actually test falsifiable theories, typically ones that lead to relatively narrow ranges of expected effects, patterns in the data, etc. Salmon called this the search for theories that made predictions that, if reflected in the observed data, would be “damn strange coincidences.” Until we are testing such theories, we’re stuck in a (in my opinion) much bigger hole than the one created by people gaming NHST with small samples (intentionally or not).

  14. Pingback: How High is the Sky? Well, Higher than the Ground | funderstorms

  15. Lee Jussim says:

    WHEN EFFECT SIZES MATTER: THE INTERNAL (IN?)COHERENCE OF MUCH SOCIAL PSYCHOLOGY

    Effect sizes may matter in some but not all situations (as the discussion above has made clear), and reasonable people may disagree.

    This post is about one class of situations where 1) they clearly do matter and 2) they are largely ignored: namely, when scientific articles, theories, and other writing make explicit or implicit claims about the relative power of various phenomena (see also David F’s comments on ordinal effect sizes).

    If you DO NOT care about effect sizes, that is fine. But, then, please do not make claims about the “unbearable automaticity of being.” I suppose automaticity could be an itsy bitsy teenie weenie effect size that is unbearable (like a splinter of glass in your foot), but that is not my reading of those claims. And it is not just about absolute effect sizes. It would be about the relative effects of conscious versus unconscious processes, something almost never compared empirically.

    If you do not care about relative effect sizes, please do not declare that “social beliefs may create reality more than reality creates social beliefs” (or the equivalent), as have lots of social psychologists.

    If you do not care about at least relative effect sizes, please do not declare stereotypes to be some extraordinarily difficult-to-override “default” basis of person perception and argue that only under extraordinary conditions do people rely on individuating information (the relative effect sizes of stereotypes versus individuating information in person perception are r’s = .10 and .70, respectively).

    If you do not care about at least relative effect sizes, please do not make claims about error and bias dominating social perception, without comparing such effects to accuracy, agreement, and rationality.

    If one is making claims about the power and pervasiveness of some phenomenon — which social psychologists apparently often seem to want to do — one needs effect sizes.

    Two concrete examples:
    Rosenhan’s famous “being sane in insane places” study:
    CLAIMED that the “sane were indistinguishable from the insane.” The diagnostic label was supposedly extraordinarily powerful. In fact, his own data showed that the psychiatrists and staff were over 90% accurate in their judgments.

    Hastorf & Cantril’s famous “they saw a game” study:
    This was interpreted both by the original authors and by pretty much everyone who has ever cited their study thereafter as demonstrating the power of subjective, “constructive” processes in social perception. It actually found far — and I do mean FAR — more evidence of agreement than of bias.

    Both of these examples — and many more — can be found in my book (you can get the first chapter, and abstracts and excerpts, here: http://www.rci.rutgers.edu/~jussim/TOC.html).
    (it is very expensive, so, if you are interested, I cannot in good faith recommend buying it, but there is always the library).

    If (and I mean this metaphorically, to refer to all subsequent social psychological research and not just these two studies) all Rosenhan and Hastorf & Cantril want to claim is “bias happens,” then they do not need effect sizes. If they want to claim that labels and ingroup biases dominate perception and judgment — which they seemed very much to want to do — they need not only effect sizes, but also comparisons of the effect sizes for bias to those for accuracy, agreement, rationality, and unbiased responding.

  16. I have had editors on more than one occasion require 1) that a replication experiment be removed from a paper “as it just shows the same effect as the first experiment,” and 2) that null-effect experiments be removed even when they constrain the extent of an effect (i.e., not all conceptual replications work). There is very much a culture in which null results and replications are seen as something to bin rather than even keep in the file drawer. As others have noted, moving to support the publication and reporting of all experiments would help.

  17. Pingback: The Berkeley Science Review » Have your cake and eat it, too! Practical reform in social psychology

  18. Pingback: Grappling with the Past – Erika Salomon

  19. Pingback: Don’t stand on the shoulders of giants | Psychology and Neuroscience stuff
