Eyes wide shut or eyes wide open?

There has been a slew of systematic replication efforts and meta-analyses with rather provocative findings of late. The ego depletion saga is one of those stories. It is an important story because it demonstrates the clarity that comes with focusing on effect sizes rather than statistical significance.

I should confess that I’ve always liked the idea of ego depletion and even tried my hand at running a few ego depletion experiments.* And, I study conscientiousness which is pretty much the same thing as self-control—at least as it is assessed using the Tangney et al self-control scale (2004) which was meant, in part, to be an individual difference complement to the ego depletion experimental paradigms.

So, I was more than a disinterested observer as the “effect size drama” surrounding ego depletion played out over the last few years. First, you had the seemingly straightforward meta-analysis by Hagger et al (2010), showing that the average effect size of the sequential task paradigm of ego-depletion studies was a d of .62. Impressively large by most metrics that we use to judge effect sizes. That’s the same as a correlation of .3 according to the magical effect size converters. Despite prior mischaracterizations of correlations of that magnitude as small**, that’s nothing to cough at.
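For anyone who wants to see what the "magical effect size converters" are doing, here is a minimal sketch. It assumes two equal-sized groups, which is the usual textbook conversion; the function names `d_to_r` and `r_to_d` are mine, not from any particular package.

```python
import math


def d_to_r(d: float) -> float:
    """Convert Cohen's d to a correlation, assuming two equal-sized groups."""
    return d / math.sqrt(d * d + 4.0)


def r_to_d(r: float) -> float:
    """Inverse conversion, under the same equal-group-size assumption."""
    return 2.0 * r / math.sqrt(1.0 - r * r)


# Effect sizes that come up in the ego depletion story.
for d in (0.62, 0.08, 0.04):
    print(f"d = {d:.2f}  ->  r = {d_to_r(d):.3f}")
```

With unequal group sizes the constant 4 changes, so treat the output as an approximation rather than an exact equivalence.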

Quickly on the heels of that meta-analysis were new meta-analyses and re-analyses of the meta-analytic data (e.g., Carter et al, 2015). These new meta-analyses and re-analyses concluded that there wasn’t any “there” there. Right after the Hagger et al paper was published, the quant jocks came up with a slew of new ways of estimating bias in meta-analyses. What happens when you apply these bias estimators to ego depletion data? There seemed to be a lot of bias in the research synthesized in these meta-analyses. So much so that the bias-corrected estimates included a zero effect size as a possibility (Carter et al., 2015). These re-analyses were then re-analyzed because the field of bias correction was moving faster than basic science and these initial corrections were called into question because apparently bias corrections are, well, biased… (Friese et al., 2018).

Not to be undone by an inability to estimate truth from the prior publication record, another, overlapping group of researchers conducted their own registered replication report—the most defensible and unbiased method of estimating an effect size (Hagger et al., 2016). Much to everyone’s surprise, the effect across 23 labs was something close to zero (d = .04). Once again, this effort was criticized for being a non-optimal test of the ego depletion effect (Friese et al., 2018).

To address the prior limitations of all of these incredibly thorough analyses of ego depletion, yet a third team took it upon themselves to run a pre-registered replication project testing two additional approaches to ego depletion using optimal designs (Vohs, Schmeichel & others, 2018). Like a broken record, the estimate across 40 labs resulted in effect size estimates that ranged from 0 (if you assumed zero was the prior) to about a d of .08 if you assumed otherwise***. If you bothered to compile the data across the labs and run a traditional frequentist analysis, this effect size, despite being minuscule, was statistically significant (trumpets sound in the distance).

So, it appears the best estimate of the effect of ego depletion is around a d of .08, if we are being generous.

Eyes wide shut

So, there were a fair number of folks who expressed some curiosity about the meaning of the results. They asked questions on social media, like, “The effect was statistically significant, right? That means there’s evidence for ego depletion.”

Setting aside effect sizes for a moment, there are many reasons to see the data as being consistent with the theory. Many of us were rooting for ego depletion theory. Countless researchers were invested in the idea either directly or indirectly. Many wanted a pillar of their theoretical and empirical foundational knowledge to hold up, even if the aggregate effect was more modest than originally depicted. For those individuals, a statistically significant finding seems like good news, even if it is really cold comfort.

Another reason for the prioritization of significant findings over the magnitude of the effect is, well, ignorance of effect sizes and their meaning. It was not too long ago that we tried in vain to convince colleagues that a Neyman-Pearson system was useful (balance power, alpha, effect size, and N). A number of my esteemed colleagues pushed back on the notion that they should pay heed to effect sizes. They argued that, as experimental theoreticians, their work was, at best, testing directional hypotheses of no practical import. Since effect sizes were for “applied” psychologists (read: lower status), the theoretical experimentalist had no need to sully themselves with the tools of applied researchers. They also argued that their work was “proof of concept” and the designs were not intended to reflect real world settings (see ego depletion) and therefore the effect sizes were uninterpretable. Setting aside the unnerving circularity of this thinking****, what it implies is that many people have not been trained on, or forced to think much about, effect sizes. Yes, they’ve often been forced to report them, but not to really think about them. I’ll go out on a limb and propose that the majority of our peers in the social sciences think about and make inferences based solely on p-values and some implicit attributes of the study design (e.g., experiment vs observational study).

The reality, of course, is that every study of every stripe comes with an effect size, whether or not it is explicitly presented or interpreted. More importantly, a body of research in which the same study or paradigm is systematically investigated, as has been done with ego depletion, provides an excellent estimate of the true effect size for that paradigm. The reality of a true effect size in the range of d = .04 to d = .08 is a harsh reality, but one that brings great clarity.

Eyes wide open

So, let’s make an assumption. The evidence is pretty good that the effect size of sequential ego depletion tasks is, at best, d = .08.

With that assumption, the inevitable conclusion is that the traditional study of ego depletion using experimental approaches is dead in the water.


First, because studying a phenomenon with a true effect size of d = .08 is beyond the resources of almost all labs in psychology. To have 80% power to detect an effect size of d = .08 you would need to run more than 2500 participants through your lab. If you go with the d = .04 estimate, you’d need more than 9000 participants. More poignantly, none of the original studies used to support the existence of ego depletion were designed to detect the true effect size.
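If you want to check the order of magnitude yourself, here is a rough sketch using the normal approximation for a two-sided, two-group comparison of means. The function name `n_per_group` and the defaults (80% power, alpha = .05) are my assumptions for illustration; a dedicated power program will give slightly different numbers, and the result is per condition, not total N.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate participants needed PER GROUP to detect a standardized
    mean difference d with a two-sided test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_power = z.inv_cdf(power)              # quantile matching the desired power
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


for d in (0.62, 0.08, 0.04):
    print(f"d = {d:.2f}: about {n_per_group(d)} participants per group")
```

Under these assumptions the requirement runs into the thousands per condition for d = .08 and close to ten thousand per condition for d = .04, which is the same ballpark as the figures above.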

These types of sample size demands violate most of our norms in psychological science. The average sample size in prior experimental ego depletion research appears to be about 50 to 60. With that kind of sample size, you have 6% power to detect the true effect.

What about our new rules of thumb, like do your best to reach an N of 50 per cell, or use 2.5 times the N of the original study, or crank the N up above 500 to test an interaction effect? Power is 8%, 11%, and 25% in each of those situations, respectively. If you ran your studies using these rules of thumb, you would be all thumbs.
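The power figures above are easy to approximate. The sketch below uses a normal approximation for a two-sided, two-group test; `power_two_sample` is a hypothetical helper name, and the cell sizes in the loop are illustrative choices of mine. Exact percentages depend on how N is split into cells and on the test used, so the output need not match the numbers in the text digit for digit.

```python
import math
from statistics import NormalDist


def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test of a standardized
    mean difference d with n_per_group participants in each cell."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    ncp = d * math.sqrt(n_per_group / 2.0)  # expected value of the test statistic
    # Probability of landing beyond either critical value.
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


# Illustrative cell sizes, assuming a true effect of d = .08.
for n in (25, 30, 50, 125, 500):
    print(f"n per group = {n:3d}: power = {power_two_sample(0.08, n):.3f}")
```

Whatever the exact split, power stays far below the conventional 80% until the cells hold thousands of people.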

But, you say, I can get 2500 participants on mTurk. That’s not a bad option. But, you have to ask yourself: To what end? The import of ego depletion research and much experimental work like it, is predicated on the notion that the situation is “powerful,” as in, it has a large effect. How important is ego depletion to our understanding of human nature if the effect is minuscule? Before you embark on the mega study of thousands of mTurkers, it might be prudent to answer this question.

But, you say, some have argued that small effects can cumulate and therefore be meaningful if studied with enough fidelity and across time. Great. Now all you need to do is run a massive longitudinal intervention study where you test how the minuscule effect of the manipulation cumulates over time and place. The power issue doesn’t disappear with this potential insight. You still have to deal with the true effect size of the manipulation being a d of .08. So, one option is to use a massive study. Good luck funding that study. The only way you could get the money necessary to conduct it would be to promise doing an fMRI of every participant. Wait. Oh, never mind.

The other option would be to do something radical like create a continuous intervention that builds on itself over time—something currently not part of ego depletion theory or traditional experimental approaches in psychology.

But, you say, there are hundreds of studies that have been published on ego depletion. Exactly. Hundreds of studies have been published that had an average d-value of .62. Hundreds of studies have been published showing effect sizes that cannot, by definition, be true given the true effect size is d = .08. That is the clarity that comes with the use of accurate effect sizes. It is incredibly difficult to get d-values of .62 when the true d is .08. Look at the distribution of d-values around zero with sample sizes of 50. The likelihood of landing a d of .62 or higher is about 3%. This fact invites some uncomfortable questions. How did all of these people find this many large effects? If we assume they found these relatively huge, highly unlikely effects by chance alone, this would mean that there are thousands of studies lying about in file drawers somewhere. Or it means people used other means to dig these effects out of the data….
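Here is one way to approximate that 3% figure. The sketch treats the observed d as roughly normal around the true d, uses the standard large-sample formula for its standard error, and assumes a total N of 50 split into two cells of 25. The function name `prob_d_at_least` is hypothetical.

```python
import math
from statistics import NormalDist


def prob_d_at_least(observed_d: float, true_d: float, n_per_group: int) -> float:
    """Approximate chance of observing a d at or above observed_d when the
    population effect is true_d (two equal groups, normal approximation)."""
    # Large-sample variance of d for equal cell sizes.
    variance = 2.0 / n_per_group + true_d ** 2 / (4.0 * n_per_group)
    z_score = (observed_d - true_d) / math.sqrt(variance)
    return 1.0 - NormalDist().cdf(z_score)


p = prob_d_at_least(observed_d=0.62, true_d=0.08, n_per_group=25)
print(f"P(d >= .62 | true d = .08, N = 50) is roughly {p:.3f}")
```

A simulation with t-distributed sampling error would shift the number a little, but not enough to change the story.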

Setting aside the motivations, strategies, and incentives that would net this many findings that are significantly unlikely to be correct (p < .03), the import of this discrepancy is huge. The fact that hundreds of studies with such unlikely results were published using the standard paradigms should be troubling to the scientific community. It shows that psychologists, as a group using the standard incentive systems and review processes of the day, can produce grossly inflated findings that lend themselves to the appearance of an accumulated body of evidence for an idea when, by definition, it shouldn’t exist. That should be more than troubling. It should be a wakeup call. Our system is more than broken. It is spewing pollution into the scientific environment at an alarming rate.

This is why effect sizes are important. Knowing that the true effect size of sequential ego depletion studies is a d of .08 leads you to conclude that:

1. Most prior research on the sequential task approach to ego depletion is so problematic that it cannot and should not be used to inform future research. Are you interested in those moderators or boundary mechanisms of ego depletion? Great, you are now proposing to see whether your new condition moves a d of .08 to something smaller. Good luck with that.

2. New research on ego depletion is out of reach for most psychological scientists unless they participate in huge multi-lab projects like the Psychological Science Accelerator.

3. Our field is capable of producing huge numbers of published reports in support of an idea that are grossly inaccurate.

4. If someone fails to replicate one of my studies, I can no longer point to dozens, if not hundreds of supporting studies and confidently state that there is a lot of backing for my work.

5. As has been noted by others, meta-analysis is fucked.

And don’t take this situation as anything particular to ego depletion. We now have reams of studies that either through registered replication reports or meta-analyses have shown that the original effect sizes are inflated and that the “truer” effect sizes are much smaller. In numerous cases, ranging from GxE studies to ovulatory cycle effects, the meta-analytic estimates, while statistically significant, are conspicuously smaller than most if not all of the original studies were capable of detecting. These updated effect sizes need to be weighed heavily in research going forward.

In closing, let me point out that I say these things with no prejudice against the idea of ego depletion. I still like the idea and still hold out a sliver of hope that the idea may be viable. It is possible that the idea is sound and the way prior research was executed is the problem.

But, extrapolating from the cumulative meta-analytic work and the registered replication projects, I can’t avoid the conclusion that the effect size for the standard sequential paradigms is small. Really, really small. So small that it would be almost impossible to realistically study the idea in almost any traditional lab.

Maybe the fact that these paradigms no longer work will spur some creative individuals on to come up with newer, more viable, and more reliable ways of testing the idea. Until then, the implication of the effect size is clear: Steer clear of the classic experimental approaches to ego depletion. And, if you nonetheless continue to find value in the basic idea, come up with new ways to study it; the old ways are not robust.

Brent W. Roberts


* p < .05: They failed.  At the time, I chalked it up to my lack of expertise.  And that was before it was popular to argue that people who failed to replicate a study lacked expertise.

** p < .01: See “personality coefficient” Mischel, W. (2013). Personality and assessment. Psychology Press.

*** p < .005: that’s a correlation of .04, but who’s comparing effect sizes??

**** p < .001: “I’m special, so I can ignore effect sizes—look, small effect sizes—I can ignore these because I’m a theoretician. I’m still special”


Posted in Uncategorized | Leave a comment

Making good on a promise

At the end of my previous blog “Because, change is hard”, I said, and I quote: “So, send me your huddled, tired essays repeating the same messages about improving our approach to science that we’ve been making for years and I’ll post, repost, and blog about them every time.”

Well, someone asked me to repost theirs.  So here it is: http://www.nature.com/news/no-researcher-is-too-junior-to-fix-science-1.21928.  It is a nice piece by John Tregoning.

Speaking of which, there were two related blogs posted right after the change is hard piece that are both worth reading.  The first by Dorothy Bishop is brilliant and counters my pessimism so effectively I’m almost tempted to call her Simine Vazire: http://deevybee.blogspot.co.uk/2017/05/reproducible-practices-are-future-for.html

And if you missed it, James Heathers has a spot-on post about the New Bad People: https://medium.com/@jamesheathers/meet-the-new-bad-people-4922137949a1



Because, change is hard

I reposted a quote from a paper on twitter this morning entitled “The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research.” The quote, which is worth repeating, was “reliable conclusions on replicability…of a finding can only be drawn using cumulative evidence from multiple independent studies.”

An esteemed colleague (Daniël Lakens @lakens) responded “I just reviewed this paper for PeerJ. I didn’t think it was publishable. Lacks structure, nothing new.”

Setting aside the typical bromide that I mostly curate information on twitter so that I can file and read things later, the last clause “nothing new” struck a nerve. It reminded me of some unappealing conclusions that I’ve arrived at about the reproducibility movement that lead to a different conclusion—that it is very, very important that we post and repost papers like this if we hope to move psychological science towards a more robust future.

From my current vantage, producing new and innovative insights about reproducibility is not the point. There has been almost nothing new in the entire reproducibility discussion. And, that is okay. I mean, the methodologists (whether terroristic or not) have been telling us for decades that our typical approach to evaluating our research findings is problematic. Almost all of our blogs or papers have simply reiterated what those methodologists told us decades ago. Most of the papers and activities emerging from the reproducibility movement are not coming up with “novel, innovative” techniques for doing good science. Doing good science necessitates no novelty. It does not take deep thought or creativity to pre-register a study, do a power analysis, or replicate your research.

What is different this time is that we have more people’s attention than the earlier discussions did. That means we have a chance to make things better instead of letting psychology fester in a morass of ambiguous findings meant more for personal gain than for discovering and confirming facts about human nature.

The point is that we need to create an environment in which doing science well—producing cumulative evidence from multiple independent studies—is the norm. To make this the norm, we need to convince a critical mass of psychological scientists to change their behavior (I wonder what branch of psychology specializes in that?). We know from our initial efforts that many of our colleagues want nothing to do with this effort (the skeptics). And, these skeptical colleagues count in their ranks a disproportionate number of well-established, high status researchers who have lopsided sway in the ongoing reproducibility discussion. We also know that another critical mass is trying to avoid the issue, but seems to be grudgingly okay with taking small steps like increasing their N or capitulating to new journal requirements (the agnostics). I would even guess that the majority of psychological scientists remain blithely unaware of the machinations of scientists concerned with reproducibility (the naïve) and think that it is only an issue for subgroups like social psychology (which we all know is not true). We know that many young people are entirely sympathetic to the effort to reform methods in psychological science (the sympathizers). But, these early career researchers face withering winds of contempt from their advisors or senior colleagues and problematic incentives for success that dictate they continue to pursue poorly designed research (e.g., the prototypical underpowered series of conceptual replication studies, in which one roots around for p < .05 interaction effects).

So why post papers that reiterate these points? Even if those papers are derivative or maybe not as scintillating as we would like? Why write blogs that repeat what others have said for decades before?

Because, change is hard.

We are not going to change the minds of the skeptics. They are lost to us. That so many of our most highly esteemed colleagues are in this group simply makes things more challenging. The agnostics are like political independents. Their position can be changed, but it takes a lot of lobbying and they often have to be motivated through self-interest. I’ve seen an amazingly small number of agnostics come around after six years of blog posts, papers, presentations, and conversations. These folks come around one talk, one blog, or one paper at a time. And really, it takes multiple messages to get them to change. The naïve are not paying attention, so we need to repeat the same message over and over and over again in hopes that they might actually read the latest reiteration of Jacob Cohen. The early career researchers often see clearly what is going on but then must somehow negotiate the landmines that the skeptics and the reproducibility methodologists throw in their way. In this context, re-messaging, re-posting, and re-iterating serve to create the perception that doing things well is supported by a critical mass of colleagues.

Here’s my working hypothesis. In the absence of wholesale changes to incentive structures (grants, tenure, publication requirements at journals), one of the few ways we will succeed in making it the norm to “produce cumulative evidence from multiple independent studies” is by repeating the reproducibility message. Loudly. By repeating these messages we can drown out the skeptics, move a few agnostics, enlighten the naïve, and create an environment in which it is safe for early career researchers to do the right thing. Then, in a generation or two, psychological science might actually produce useful, cumulative knowledge.

So, send me your huddled, tired essays repeating the same messages about improving our approach to science that we’ve been making for years and I’ll post, repost, and blog about them every time.

Brent W. Roberts


A Most Courageous Act

The most courageous act a modern academic can make is to say they were wrong.  After all, we deal in ideas, not things.  When we say we were wrong, we are saying our ideas, our products so to speak, were faulty.  It is a supremely unsettling thing to do.

Of course, in the Platonic ideal, and in reality, being a scientist necessitates being wrong a lot. Unfortunately, our incentive system militates against being honest about our work. Thus, countless researchers choose not to admit or even acknowledge the possibility that they might have been mistaken.

In a bracingly honest post in response to a blog by Uli Schimmack, the Nobel Prize-winning psychologist Daniel Kahneman has done the unthinkable. He has admitted that he was mistaken. Here’s a quote:

“I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.”
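The arithmetic behind "unanimity of underpowered studies" is worth seeing once. This is a toy sketch, not anything from Kahneman's post: it assumes the effect is real, the studies are independent, and every study has the same power. The power values and study counts are made up for illustration, and `prob_all_significant` is a hypothetical name.

```python
def prob_all_significant(power: float, k: int) -> float:
    """Chance that k independent studies, each with the given power,
    ALL reject the null when the effect is real."""
    return power ** k


# Hypothetical literatures: per-study power and number of published studies.
for power in (0.5, 0.8):
    for k in (5, 10):
        chance = prob_all_significant(power, k)
        print(f"power = {power:.1f}, {k:2d} studies: P(all significant) = {chance:.4f}")
```

Even with decent power, an unbroken run of significant results gets improbable quickly, which is why such a run points to missing studies.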

My respect and gratitude for this statement by Professor Kahneman know no bounds.

Brent W. Roberts


A Commitment to Better Research Practices (BRPs) in Psychological Science

Scientific research is an attempt to identify a working truth about the world that is as independent of ideology as possible.  As we appear to be entering a time of heightened skepticism about the value of scientific information, we feel it is important to emphasize and foster research practices that enhance the integrity of scientific data and thus scientific information. We have therefore created a list of better research practices that we believe, if followed, would enhance the reproducibility and reliability of psychological science. The proposed methodological practices are applicable for exploratory or confirmatory research, and for observational or experimental methods.

  1. If testing a specific hypothesis, pre-register your research[1], so others can know that the forthcoming tests are informative. Report the planned analyses as confirmatory, and report any other analyses or any deviations from the planned analyses as exploratory.
  2. If conducting exploratory research, present it as exploratory. Then, document the research by posting materials, such as measures, procedures, and analytical code so future researchers can benefit from them. Also, make research expectations and plans in advance of analyses—little, if any, research is truly exploratory. State the goals and parameters of your study as clearly as possible before beginning data analysis.
  3. Consider data sharing options prior to data collection (e.g., complete a data management plan; include necessary language in the consent form), and make data and associated meta-data needed to reproduce results available to others, preferably in a trusted and stable repository. Note that this does not imply full public disclosure of all data. If there are reasons why data can’t be made available (e.g., containing clinically sensitive information), clarify that up-front and delineate the path available for others to acquire your data in order to reproduce your analyses.
  4. If some form of hypothesis testing is being used or an attempt is being made to accurately estimate an effect size, use power analysis to plan research before conducting it so that it is maximally informative.
  5. To the best of your ability maximize the power of your research to reach the power necessary to test the smallest effect size you are interested in testing (e.g., increase sample size, use within-subjects designs, use better, more precise measures, use stronger manipulations, etc.). Also, in order to increase the power of your research, consider collaborating with other labs, for example via StudySwap (https://osf.io/view/studyswap/). Be open to sharing existing data with other labs in order to pool data for a more robust study.
  6. If you find a result that you believe to be informative, make sure the result is robust. For smaller lab studies this means directly replicating your own work or, even better, having another lab replicate your finding, again via something like StudySwap.  For larger studies, this may mean finding highly similar data, archival or otherwise, to replicate results. When other large studies are known in advance, seek to pool data before analysis. If the samples are large enough, consider employing cross-validation techniques, such as splitting samples into random halves, to confirm results. For unique studies, checking robustness may mean testing multiple alternative models and/or statistical controls to see if the effect is robust to multiple alternative hypotheses, confounds, and analytical approaches.
  7. Avoid performing conceptual replications of your own research in the absence of evidence that the original result is robust and/or without pre-registering the study. A pre-registered direct replication is the best evidence that an original result is robust.
  8. Once some level of evidence has been achieved that the effect is robust (e.g., a successful direct replication), by all means do conceptual replications, as conceptual replications can provide important evidence for the generalizability of a finding and the robustness of a theory.
  9. To the extent possible, report null findings. In science, null news from reasonably powered studies is informative news.
  10. To the extent possible, report small effects. Given the uncertainty about the robustness of results across psychological science, we do not have a clear understanding of when effect sizes are “too small” to matter. As many effects previously thought to be large are small, be open to finding evidence of effects of many sizes, particularly under conditions of large N and sound measurement.
  11. When others are interested in replicating your work be cooperative if they ask for input. Of course, one of the benefits of pre-registration is that there may be less of a need to interact with those interested in replicating your work.
  12. If researchers fail to replicate your work continue to be cooperative. Even in an ideal world where all studies are appropriately powered, there will still be failures to replicate because of sampling variance alone. If the failed replication was done well and had high power to detect the effect, at least consider the possibility that your original result could be a false positive. Given this inevitability, and the possibility of true moderators of an effect, aspire to work with researchers who fail to find your effect so as to provide more data and information to the larger scientific community that is heavily invested in knowing what is true or not about your findings.

We should note that these proposed practices are complementary to other statements of commitment, such as the commitment to research transparency (http://www.researchtransparency.org/). We would also note that the proposed practices are aspirational.  Ideally, our field will adopt many, if not all, of these practices.  But, we also understand that change is difficult and takes time.  In the interim, it would be ideal to reward any movement toward better research practices.

Brent W. Roberts

Rolf A. Zwaan

Lorne Campbell

[1] van ’t Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12. doi:10.1016/j.jesp.2016.03.004


Andrew Gelman’s blog about the Fiske fiasco

Some of you might have missed the kerfuffle that erupted in the last few days over a pre-print of an editorial written by Susan Fiske for the APS Monitor about us “methodological terrorists”.  Andrew Gelman’s blog reposts Fiske’s piece, puts it in historical context, and does a fairly good job of articulating why it is problematic beyond the terminological hyperbole that Fiske employs.  We are reposting it for your edification.

What has happened down here is the winds have changed


The Power Dialogues

The following is a hypothetical exchange between a graduate student and Professor Belfry-Roaster.  The names have been changed to protect the innocent….

Budlie Bond: Professor Belfry-Roaster, I was confused today in journal club when everyone started discussing power.  I’ve taken my grad stats courses, but they didn’t teach us anything about power.  It seemed really important. But it also seemed controversial.  Can you tell me a bit more about power and why people care so much about it?

Prof. Belfry-Roaster: Sure, power is a very important factor in planning and evaluating research. Technically, power is defined as the long-run probability of rejecting the null hypothesis when it is, in fact, false. Power is typically considered to be a Good Thing because, if the null is false, then you want your research to be capable of rejecting it. The higher the power of your study, the better the chances are that this will happen.

The concept of power comes out of a very specific approach to significance testing pioneered by Neyman and Pearson. In this system, a researcher considers 4 factors when planning and evaluating research: the alpha level (typically the threshold you use to decide whether a finding is statistically significant), the effect size of your focal test of your hypothesis, sample size, and power.  The cool thing about this system is that if you know 3 of the factors you can compute the last one.  What makes it even easier is that we almost always use an alpha value of .05, so that is fixed. That leaves two things: the effect size (which you don’t control) and your sample size (which you can control). Thus, if you know the effect size of interest, you can use power analysis to determine the sample size needed to reject the null, say, 80% of the time, if the null is false in the population. Similarly, if you know the sample size of a study, you can calculate the power it has to reject the null under a variety of possible effect sizes in the population.
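To make the "know three, compute the fourth" idea concrete, here is a small sketch that fixes alpha, power, and sample size and solves for the remaining factor: the smallest effect a two-group study could reliably detect. It relies on a normal approximation, and the name `detectable_d` and the cell sizes are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def detectable_d(n_per_group: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest standardized mean difference a two-sided, two-group test can
    detect with the requested power (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(2.0 / n_per_group)


# Hypothetical cell sizes.
for n in (25, 100, 400, 2500):
    print(f"n per group = {n:4d}: detectable d is about {detectable_d(n):.2f}")
```

The same relationship can be rearranged to solve for sample size or for power instead; which one you solve for depends on which three you are willing to pin down.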

Here’s a classic paper on the topic for some quick reading:

Cohen J. (1992). Statistical power analysis. Current Directions in Psychological Science, 1, 98-101.

Budlie Bond:  Okay, that is a little clearer.  It seems that effect sizes are critical to understanding power. How do I figure out what my effect size is? It seems like that would involve a lot of guess work. I thought part of the reason we did research was because we didn’t know what the effect sizes were.

Prof. Belfry-Roaster: Effect sizes refer to the magnitude of the relationship between variables and can be indexed in far too many ways to describe. The two easiest and most useful for the majority of work in our field are the d-score and the correlation coefficient.  The d-score is the standardized difference between two means—simply the difference divided by the pooled standard deviation. The correlation coefficient is, well, the correlation coefficient. 

The cool thing about these two effect sizes is that they are really easy to compute from the statistics that all papers should report.  They can also be derived from basic information in a study, like the sample size and the p-value associated with a focal significance test.  So, even if an author has not reported an effect size you can derive one easily from their test statistics. Here are some cool resources that help you understand and calculate effect sizes from basic information like means and standard deviations, p-values, and other test statistics:


Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191.
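To make the arithmetic concrete, here is a minimal sketch of three common calculations: a d-score from means and standard deviations, a d-score recovered from an independent-samples t statistic, and the conversion from d to a correlation. The function names are made up for illustration, and the d-to-r conversion assumes two groups of equal size.

```python
import math


def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def d_from_t(t, n1, n2):
    """Recover d from an independent-samples t statistic and the group sizes."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def d_to_r(d):
    """Convert d to a correlation (assumes two equal-sized groups)."""
    return d / math.sqrt(d**2 + 4)


# Hypothetical study: means of 10 and 8, SDs of 5, 30 people per group.
print(round(cohens_d(10, 8, 5, 5, 30, 30), 2))
# A reported t of 2.0 with 50 people per group.
print(round(d_from_t(2.0, 50, 50), 2))
# A d of .4 expressed as a correlation.
print(round(d_to_r(0.4), 2))
```

Under these assumptions a d of .4 lands very close to a correlation of .2, which is why the two benchmarks tend to be quoted together.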

Budlie Bond:  You said I can use effect size information to plan a study.  How does that work?

Prof. Belfry-Roaster: If you have some sense of what the effect size may be based on previous research, you can always use that as a best guess for selecting the appropriate sample size. But, many times that information isn’t available because you are trying something new.  If that is the case, you can still draw upon what we generally know about effect sizes in our field.  There are now five reviews that show that the average effect sizes in social, personality, and organizational psychology correspond roughly to a d-score of .4 or a correlation of .2.

Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431.

Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. Handbook of research methods in personality psychology, 149-169.

Paterson, T. A., Harms, P. D., Steel, P., & Credé, M. (2016). An assessment of the magnitude of effect sizes: Evidence from 30 years of meta-analysis in management. Journal of Leadership & Organizational Studies, 23(1), 66-81.

Hemphill, J. F. (2003). Interpreting the magnitudes of correlation coefficients. American Psychologist, 58, 78-80.

Richard, F. D., Bond Jr, C. F., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7(4), 331-363

There are lots of criticisms of these estimates, but they are not a bad starting point for planning purposes.  If you plug those numbers into a power calculator, you find that you need about 200 subjects to have 80% power for an average simple main effect (e.g., d = .4).  If you want to be safe and either have higher power (e.g., 90%) or plan for a smaller effect size (e.g., d of .3), you’ll need more like 250 to 350 participants.  This is pretty close to the sample size when effect sizes get “stable”.

Schoenbrodt & Perugini, 2013; http://www.nicebread.de/at-what-sample-size-do-correlations-stabilize/
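If you don't have a power calculator handy, the sample sizes quoted above can be roughly reproduced with a few lines of code. This is a sketch under a normal approximation for a two-sided, two-group comparison at alpha = .05; the function name is made up for illustration, and exact t-based software will typically ask for a person or two more per group.

```python
import math
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided comparison of two equal groups."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d**2)


# The three planning scenarios discussed above.
for d, power in [(0.4, 0.80), (0.4, 0.90), (0.3, 0.80)]:
    n = n_per_group(d, power)
    print(f"d = {d}, power = {power:.0%}: {n} per group, {2 * n} total")
```

The totals come out at roughly 200, 260, and 350 participants, in line with the ranges given above.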

However, for other types of analyses, like interaction effects, some smart people have estimated that you’ll need more than twice as many participants—in the range of 500.  For example, Uri Simonsohn has shown that if you want to demonstrate that a manipulation can make a previously demonstrated effect go away, you need twice as many participants as you would need to demonstrate the original main effect (http://datacolada.org/17).

Whatever you do, be cautious about these numbers.  Your job is to think about these issues not to use rules of thumb blindly. For example, the folks who study genetic effects found out that the effect sizes for single nucleotide polymorphisms were so tiny that they needed hundreds of thousands of people to have enough power to reliably detect their effects.  On the flip side, when your effects are big, you don’t need many people.  We know that the Stroop effect is both reliable and huge. You only need a handful of people to figure out whether the Stroop main effect will replicate. Your job is to use some estimated effect size to make an informed decision about what your sample size should be.  It is not hard to do and there are no good excuses to avoid it.

Here are some additional links and tables that you can use to estimate the sample size you will need to reach in order to achieve 80 or 90% power once you’ve got an estimate of your effect size:

For correlations:


For mean differences:


Here are two quick and easy tables showing the relation between power and effect size for reference:



Budlie Bond:  My office-mate Chip Harkel says that there is a new rule of thumb that you should simply get 50 people per cell in an experiment.  Is that a sensible strategy to use when I don’t know what the effect size might be?

Prof. Belfry-Roaster:  The 50 person per cell is better than our previous rules of thumb (e.g., 15 to 20 people per cell), but, with a bit more thought, you can calibrate your sample size better. If you have reasons to think the effect size might be large (like the Stroop Effect), you will waste a lot of resources if you collect 50 cases per cell. Conversely, if you are interested in testing a typical interaction effect, your power is going to be too low using this rule of thumb.
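One way to see what a per-cell rule of thumb buys you is to flip the question around and ask what the smallest effect is that a given cell size can detect with 80% power. Here is a minimal sketch, again using a normal approximation for a two-sided, two-group comparison; the function name and the cell sizes are chosen for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_d(n_per_cell, power=0.80, alpha=0.05):
    """Smallest d a two-cell design can detect with the requested power."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * math.sqrt(2 / n_per_cell)


for n in (20, 50, 200):
    print(f"{n} per cell: d of about {minimum_detectable_d(n):.2f}")
```

Under this approximation, 20 per cell is only good for effects near d = .9, 50 per cell for effects a bit above d = .5, and it takes something like 200 per cell to get down to d = .3.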

Budlie Bond: Why is low power such a bad thing?

Prof. Belfry-Roaster:  You can think about the answer several ways.  Here’s a concrete and personal way to think about it. Let’s say that you are ready to propose your dissertation.  You’ve come up with a great idea and we meet to plan out how you are going to test it.  Instead of running any subjects I tell you there’s no need.  I’m simply going to flip a coin to determine your results.  Heads, your findings are statistically significant; tails, insignificant.  Would you agree to that plan?  If you find that to be an objectionable plan, then you shouldn’t care for the way we typically design our research because the average power is close to 50% (a coin flip).  That’s what you do every time you run a low powered study: you flip a coin.  I’d rather that you have a good chance of rejecting the null if it is false than be subject to the whims of random noise.  That’s what having a high powered study can do for you.

At a broader level low power is a problem because the findings from low powered studies are too noisy to rely on. Low powered studies are uninformative. They are also quite possibly the largest reason behind the replication crisis.  A lot of people point to p-hacking and fraud as the culprits behind our current crisis, but a much simpler explanation of the problems is that the original studies were so small that they were not capable of revealing anything reliable. Sampling variance is a cruel master. Because of sampling variance, effects in small studies bounce around a lot. If we continue to publish low powered studies, we are contributing to the myth that underpowered studies are capable of producing robust knowledge. They are not.

Here are some additional readings that should help to understand how power is related to increasing the informational value of your research:

Lakens, D., & Evers, E. R. K. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278–292. http://doi.org/10.1177/1745691614528520

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59(1), 537–563. http://doi.org/10.1146/annurev.psych.59.103006.093735

Budlie Bond: Is low power a good reason to dismiss a study after the fact?

Prof. Belfry-Roaster:  Many people assume that statistical power is not necessary “after the fact.” That is, once we’ve done a study and found a significant result, it would appear that the study must have been capable of detecting said effect. This is based on a misunderstanding of p-values and significance tests (see Fraley & Marks, 2007 for a review).

Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. Handbook of research methods in personality psychology, 149-169.

What many researchers fail to appreciate is that a literature based on underpowered studies is more likely to be full of false positives than a literature that is based on highly powered studies. This sometimes seems counterintuitive to researchers, but it boils down to the fact that, when studies are underpowered, the relative ratio of true to false positives in the literature shifts (see Ioannidis 2008). The consequence is that a literature based on underpowered studies is quite capable of containing an overwhelming number of false positives—much more than the nominal 5% that we’ve been conditioned to expect. If you want to maximize the number of true positives in the literature relative to false leads, you would be wise to not allow underpowered studies into the literature.

Ioannidis JPA (2008) Why most discovered true associations are inflated.  Epidemiology, 19, 640-648.
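The arithmetic behind that shifting ratio is simple enough to sketch. The one thing you have to assume is the share of tested hypotheses for which the null is actually false; the value of .25 used below is purely hypothetical, and the function name is made up for illustration.

```python
def share_of_true_positives(power, prior, alpha=0.05):
    """Proportion of significant results that reflect a false null.

    prior is the assumed share of tested hypotheses where the null is false.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical literature in which one in four tested ideas is correct.
for power in (0.20, 0.50, 0.80):
    share = share_of_true_positives(power, prior=0.25)
    print(f"power = {power:.0%}: {share:.0%} of significant findings are true positives")
```

With that assumed prior, dropping from 80% to 20% power moves the share of significant findings that are false positives from about one in six to more than four in ten, even though alpha never budged from .05.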

In fact, I’d go one step further and say that low power is an excellent reason for why a study should be desk rejected by an editor.  An editor has many jobs, but one of those is to elevate or maintain the quality of the work that the journal publishes. Given how poorly our research is holding up, you really need a good excuse to publish underpowered research because doing so will detract from the reputation of the journal in our evolving climate.  For example, if you are studying a rare group or your resources are limited you may have some justification for using low power designs.  But if that is the case, you need to be careful about using inferential statistics.  The study may have to be justified as being descriptive or suggestive, at best.  On the other hand, if you are a researcher at a major university with loads of resources like grant monies, a big subject pool, and an army of undergraduate RAs, there is little or no justification for producing low-powered research.  Low power studies simply increase the noise in the system making it harder and harder to figure out whether an effect exists or not and whether a theory has any merit.  Given how many effects are failing to replicate, we have to start taking power seriously unless we want to see our entire field go down in replicatory flames.

Another reason to be skeptical of low powered studies is that, if researchers are using significance testing as a way of screening the veracity of their results, they can only detect medium to large effects.  Given the fact that on average most of our effects are small, using low powered research makes you a victim of the “streetlight effect”—you know, where the drunk person only looks for their keys under the streetlight because that is the only place they can see? That is not an ideal way of doing science.

Budlie Bond: Ok, I can see some of your points. And, thanks to some of those online power calculators, I can see how I can plan my studies to ensure a high degree of power. But how do power calculations work in more complex designs, like those that use structural equation modeling or multi-level models?

Prof. Belfry-Roaster:  There is less consensus on how to think about power in these situations. But it is still possible to make educated decisions, even without technical expertise. For example, even in a design that involves both repeated measures and between-person factors, the between-persons effects still involve comparisons across people and should be powered accordingly. And in SEM applications, if the pairwise covariances are not estimated with precision, there are lots of ways for those errors to propagate and create estimation problems for the model.

Thankfully, there are some very smart people out there and they have done their best to provide some benchmarks and power calculation programs for more complex designs.  You can find some of them here.

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175-191.

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1(2), 130-149.

Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97(5), 951-966.

Muthén, B. O., & Curran, P. J. (1997). General longitudinal modeling of individual differences in experimental designs: A latent variable framework for analysis and power estimation. Psychological Methods, 2(4), 371-402.

Budlie Bond: I wasn’t sure how to think about power when I conducted my first study. But, in looking back at my data, I see that, given the sample size I used, my power to detect the effect size I found (d = .50) was over 90%. Does that mean my study was highly powered?

Prof. Belfry-Roaster: When power is computed based on the effect size observed, the calculation is sometimes referred to as post hoc power or observed power. Although there can be value in computing post hoc power, it is not a good way to estimate the power of your design for a number of reasons. We have touched on some of those already. For example, if the design is based on a small sample, only large effects (true large effects and overestimates of smaller or null effects) will cross the p < .05 barrier. As a result, the effects you see will tend to be larger than the population effects of interest, leading to inflated power estimates.
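A small simulation makes the inflation easy to see. The sketch below assumes a true effect of d = .2, 20 people per group, and normally distributed scores, and it uses a z approximation in place of a proper t-test to decide significance; all of those choices, along with the function name, are illustrative assumptions rather than a model of any particular study.

```python
import math
import random
from statistics import NormalDist, mean, variance


def simulate_significant_effects(true_d, n_per_group, n_studies=2000, seed=1):
    """Return (share of studies reaching p < .05, mean observed d among them)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    se = math.sqrt(2 / n_per_group)  # approximate standard error of d
    significant = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pooled_sd = math.sqrt((variance(treatment) + variance(control)) / 2)
        d_obs = (mean(treatment) - mean(control)) / pooled_sd
        if abs(d_obs) / se > z_crit:  # z approximation to the significance test
            significant.append(d_obs)
    return len(significant) / n_studies, mean(significant)


rate, mean_sig_d = simulate_significant_effects(true_d=0.2, n_per_group=20)
print(f"share of studies significant: {rate:.2f}")
print(f"mean observed d among significant studies: {mean_sig_d:.2f}")
```

In a setup like this, only a small share of the studies reach significance, and the ones that do report effects several times larger than the true d of .2. Plugging one of those observed effects back into a power calculation would make the design look far better than it is.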

More importantly, however, power is a design issue, not a data-analytic issue. Ideally, you want to design your studies to be capable of detecting the effects that matter for the theory of interest. Thus, when designing a study, you should always ask “How many subjects do I need to have 80% power to detect an effect if the population effect size is X or higher,” where X is the minimum effect size of interest. This value is likely to vary from one investigator to another, but given that small effects matter for most directional theories, it is prudent to set this value fairly low.

You can also ask about the power of the design in a post hoc way, but it is best to ask not what the power was to detect the effect that was observed, but to ask what the power was to detect effects of various sizes. For example, if you conducted a two-condition study with 50 people per cell, you had 17% power to detect a d of .20, 51% to detect a d of .40, and 98% to detect a d of .80. In short, you can evaluate the power of a study to detect population effects of various sizes after the fact. But you don’t want to compute post hoc power by asking what the power of the design was for detecting the effect observed. For more about these issues, please see Daniel Lakens’s great blog post on post-hoc power: http://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html
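Here is a minimal sketch of that kind of after-the-fact evaluation for the 50-per-cell example. It uses a normal approximation to the two-sided, two-group test, so the values can run about a percentage point above what t-based software reports; the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two equal-sized groups."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for d in (0.20, 0.40, 0.80):
    print(f"d = {d:.2f}: power of about {approx_power(d, 50):.2f}")
```

The point of running it over a range of effect sizes is that the answer describes the design, not the particular result you happened to observe.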

Budlie Bond:  Thanks for all of the background on power. I got in a heated discussion with Chip again and he said a few things that made me think you are emphasizing power too much.  First he said that effect sizes are for applied researchers and that his work is theoretical. The observed effect sizes are not important because they depend on a number of factors that can vary from one context to the next (e.g., the strength of the manipulation, the specific DV measured). Are effect sizes and power less useful in basic research than they are in applied research?

Prof. Belfry-Roaster:  With the exception of qualitative research, all studies have effect sizes, even if they are based on contrived or artificial conditions (think of Harlow’s wire monkeys, for example).  If researchers want a strong test of their theory in highly controlled laboratory settings, they gain enormously by considering power and thus effect sizes.  They need that information to design the study to test their idea well.

Moreover, if other people want to replicate your technique or build on your design, then it is really helpful if they know the effect size that you found so they can plan accordingly.

In short, even if the effect size doesn’t naturally translate into something of real world significance given the nature of the experimental or lab task, there is an effect size associated with the task. Knowing it is important not only for properly testing the theory and seeing what kinds of factors can modulate the effect, but for helping others plan their research accordingly. You are not only designing better research by using effect sizes, you are helping other researchers too.

Another reason to always estimate your effect sizes is that they are a great reality check on the likelihood and believability of your results. For example, when we do individual difference research, we start thinking that we are probably measuring the same thing when the correlation between our independent and dependent variable gets north of .50.  Well, a correlation of .5 is like a d-score of .8.  So, if you are getting effect sizes above .5 or above a d of .8 your findings warrant a few skeptical questions.  First, you should ask whether you measured the same thing twice.  In an experiment d’s around 1 should really be the exclusive domain of manipulation checks, not an outcome of some subtle manipulation.  Second, you have to ask yourself how you are the special one who found the “low hanging fruit” that is implicit in a huge effect size.  We’ve been at the study of psychology for many decades.  How is it that you are the genius who finally hit on a relationship that is so large that it should be visible to the naked eye (Jacob Cohen’s description of a medium effect size) and all of the other psychologists missed it? Maybe you are that observant, but it is a good question to ask yourself nonetheless.

And this circles back to our discussion of low power.  Small N studies only have enough power to detect medium to large effect sizes with any reliability.  If you insist on running small N studies and ignore your effect sizes, you are more likely to produce inaccurate results simply because you can’t detect anything but large effects, which we know are rare. If you then report those exaggerated effect sizes, other people who attempt to build on your research will plan their designs around an effect that is too large. This will lead them to underpower their studies and fail to replicate your results. The underpowered study thus sets in motion a series of unfortunate events that lead to confusion and despair rather than progress.

Choosing to ignore your effect sizes in the context of running small N studies is like sticking your head in the sand.  Don’t do it.

Budlie Bond: Chip’s advisor also says we should not be so concerned with Type 1 errors.  What do you think?

Prof. Belfry-Roaster: To state out loud that you are not worried about Type 1 errors at this point in time is inconceivable.  Our studies are going down in flames one-by-one.  The primary reason is that we didn’t design the original studies well: typically they were underpowered and never directly replicated.  If we continue to turn a blind eye to powering our research well, we are committing to a future where our research will repeatedly not replicate.  Personally, I don’t want you to experience that fate.

Budlie Bond:  Chip also said that some people argue against using large samples because doing so is cheating.  You are more likely to get a statistically significant finding that is really tiny.  By only running small studies they say they protect themselves from promoting tiny effects.

Prof. Belfry-Roaster: While it is true that small studies can’t detect small effects, the logic of this argument does not add up. The only way this argument would hold is if you didn’t identify the effect size in your study, which, unfortunately, used to be quite common.  Researchers used to and still do obsess over p-values.  In a world where you only use p-values to decide whether a theory or hypothesis is true, it is the case that large samples will allow you to claim that an effect holds when it is actually quite small. On the other hand, if you estimate your effect sizes in all of your studies then there is nothing deceptive about using a large sample.  Once you identify an effect as small, then other researchers can decide for themselves whether they think it warrants investment.  Moreover, the size of the sample is independent of the effect size (or should be).  You can find a big effect size with a big sample too.

Ultimately, the benefits of a larger sample outweigh the costs of a small sample.  You gain less sampling variance and a more stable estimate of the effect size.  In turn, the test of your idea should hold up better in future research than the results from a small N study.  That’s nothing to sneeze at.
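To put a number on "more stable," here is a quick sketch of how the 95% confidence interval around a correlation of .2 tightens as the sample grows. It uses the standard Fisher z approximation; the function name and the sample sizes are chosen for illustration.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z = math.atanh(r)                          # Fisher transformation of r
    half_width = z_crit / math.sqrt(n - 3)     # standard error is 1 / sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


for n in (50, 250, 1000):
    low, high = correlation_ci(0.2, n)
    print(f"N = {n}: 95% CI from {low:.2f} to {high:.2f}")
```

With 50 people the interval around an average-sized correlation still includes zero; by 250 it no longer does, and it keeps narrowing from there.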

You can also see how this attitude toward power and effect sizes creates a vicious cycle.  If you use small N studies evaluated solely by p-values rather than power and effect sizes, you are destined to lead a chaotic research existence where findings come and go, seemingly nonsensically.  If you then argue that 1) all theories are not true under certain conditions, or that 2) the manipulation is delicate, or 3) that there are loads of hidden moderators, you can quickly get into a situation where your claims cannot be refuted.  Using high powered studies with effect size estimates can keep you a little more honest about the viability of your ideas.

Budlie Bond: Chip’s advisor says all of this obsession with power is hindering our ability to be creative.  What do you think?

Prof. Belfry-Roaster:  Personally, I believe the only thing standing between you and a creative idea is gray matter and some training. How you go about testing that idea is not a hindrance to coming up with the idea in the first place.  At the moment we don’t suffer from a deficit of creativity.  Rather we have an excess of creativity combined with the deafening roar of noise pollution.  The problem with low powered studies is they simply add to the noise. But how valuable are creative ideas in science if they are not true?

Many researchers believe that the best way to test creative ideas is to do so quickly with few people.  Actually, it is the opposite.  If you really want to know whether your new, creative idea is a good one, you want to overpower your study. One reason is that low power leads to Type II errors—not detecting an effect when the null is false.  That’s a tragedy.  And, it is an easy tragedy to avoid—just power your study adequately.

Creative ideas are a dime a dozen. But creative ideas based on robust statistical evidence are rare indeed. Be creative, but be powerfully creative.

Budlie Bond:  Some of the other grad students were saying that the sample sizes you are proposing are crazy large.  They don’t want to run studies that large because they won’t be able to keep up with grad students who can crank out a bunch of small studies and publish at a faster rate.

Prof. Belfry-Roaster:  I’m sympathetic to this problem as it does seem to indicate that research done well will inevitably take more time, but I think that might be misleading.  If your fellow students are running low powered studies, they are likely finding mixed results, which given our publication norms won’t get a positive reception.  Therefore, to get a set of studies all with p-values below .05 they will probably end up running multiple small studies.  In the end, they will probably test as many subjects as you’ll test in your one study.  The kicker is that their work will also be less likely to hold up because it is probably riddled with Type 1 errors.
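You can sketch the bookkeeping for this yourself. Suppose, hypothetically, that each small study has 35% power and the goal is a three-study package in which every study is significant. The power figure, the 40 participants per study, and the function names below are assumptions made for illustration, and the calculation treats the studies as independent tests of a real effect.

```python
def chance_all_significant(k_studies, power):
    """Probability that k independent studies all reach significance."""
    return power ** k_studies


def expected_studies_needed(k_significant, power):
    """Expected number of studies run before k of them are significant."""
    return k_significant / power


power = 0.35        # hypothetical power of each small study
n_per_study = 40    # hypothetical total participants per small study

print(f"chance three studies all work: {chance_all_significant(3, power):.3f}")
studies = expected_studies_needed(3, power)
print(f"expected studies to get three hits: {studies:.1f}")
print(f"expected participants tested: {studies * n_per_study:.0f}")
```

Under those assumptions, the "fast" strategy ends up testing several hundred people on average, which is about what one well-powered study would have required in the first place.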

Will Gervais has conducted some interesting simulations comparing research strategies that focus on slower, but more powerful studies against those that focus on faster, less powerful samples. His analysis suggests that you’re not necessarily better off doing a few quick and under-powered studies. His post is worth a read.


Budlie Bond:  Chip says that your push for large sample sizes also discriminates against people who work at small colleges and universities because they don’t have access to the numbers of people you need to run adequately-powered research.

Prof. Belfry-Roaster:  He’s right.  Running high powered studies will require potentially painful changes to the way we conduct research.  This, as you know, is one reason why we often offer up our lab to friends at small universities to help conduct their studies.  But they also should not be too distraught.  There are creative and innovative solutions to the necessity of running well-designed studies (e.g., high powered research).  First, we can take inspiration from the GWAS researchers.  When faced with the reality that they couldn’t go it alone, they combined efforts into a consortium in order to do their science properly. There is nothing stopping researchers at both smaller and larger colleges and universities from creating their own consortia. It might mean we have to change our culture of worshiping the “hero” researcher, but that’s okay.  Two or more heads are always better than one (at least according to most groups research.  I wonder how reliable that work is…?).  Second, we are on the verge of technological advances that can make access to large numbers of people much easier, MTurk being just one example.  Third, some of our societies, like SPSP, APS, and APA, are rich.  Well, rich enough to consider doing something creative with their money.  They could, if they had the will and the leadership, start thinking about doing proactive things like creating subject pool panels that we can all access and run our studies on and thus conduct better powered research.

Basically, Bud, we are at a critical juncture.  We can continue doing things the old way, which means we will continue to produce noisy, unreplicable research, or we can change for the better.  The simplest and most productive thing we can do is to increase the power of our research.  In most cases, this can be achieved simply by increasing the average sample size of our studies.  That’s why we obsess about the power of the research we read and evaluate.  Any other questions?



