An interesting little discussion popped up in the wild and woolly new media world in science (e.g., podcasts and Twitter) concerning the relative merits of “descriptive” vs. “hypothesis-driven” designs. All, mind you, indirectly caused by the paper that keeps on giving: Tal Yarkoni’s generalizability crisis paper.
Inspired by Tal’s paper, a small group of folks endorsed the merits of descriptive work and argued that psychology would do well to conduct more of this type of research (Two Psychologists Four Beers; Very Bad Wizards). In response, Paul Bloom argued/opined for hypothesis testing–more specifically, theoretically informed hypothesis testing of a counterintuitive hypothesis.
I was implicated in the discussion as someone whose work exemplifies descriptive research. In fact, Tal Yarkoni himself has disparaged my work in just such a way.* And, I must confess, I’ve stated similar things in public, especially when I give my standard credibility crisis talk.
So, it might come as a surprise to hear that I completely agree with Bloom that a surgical hypothesis test using experimental methods that arrives at what is described as a “counterintuitive” finding can be the bee’s knees. It is, and probably should be, the ultimate scientific achievement. If it is true, of course.
That being said, I think there is some slippage in the verbiage being employed here. There are deeper meanings lurking under the surface of the conversation like sharks waiting to upend the fragile scientific dinghy we float in.
First, let’s take on the term that Bloom uses, “counterintuitive,” which is laden with so much baggage it needs four porters to be brought to the room. It is both unnecessary and telling to use that exact phrase to describe the hypothetical ideal research paradigm. It is also, arguably, the reason why so many researchers are now clambering to the exits to get a breath of fresh, descriptive air.
Why is it unnecessary? There is a much less laden term, “insight,” that could be used instead. Bloom partially justifies his argument for counterintuitive experiments with the classic discovery by Barry Marshall that ulcers are not caused by stress, as once was thought, but by a simple bacterium. Marshall famously gave himself an ulcer first, then successfully treated it with antibiotics. Bloom describes Marshall’s insight as counterintuitive. Was it? There was a fair amount of work by others pointing to the potential of antibiotics to treat peptic ulcers for several decades before Marshall’s work. An alternative take on the entire process of that discovery is that Barry Marshall had an insight that led to a “real” discovery that helped move the scientific edifice forward–as in, we acquired a better approximation of the truth and the truth works. As scientists, we all strive to have insights that move the dial closer to truth. Calling those insights counterintuitive is unnecessary.
It is also telling that Bloom uses the term counterintuitive because it has serious sentimental value. It is a term that reflects the heady, Edge-heavy decades pre-Bem 2011 where we could publish counterintuitive after counterintuitive finding in Psychological Science using the Deathly Hallows of Psychological Science (because that was what that journal was for after all) that in retrospect were simply not true. Why were they not true? Because our experimental methods were so weak and our criteria for evaluating findings so flawed that our research got unmoored from reality. With a little due diligence–a few QRPs, a series of poorly powered studies, and some convenient rationalizations–(e.g., some of my grad students don’t have the knack to find evidence for what is clearly a correct hypothesis….), one could cook up all sorts of counterintuitive findings. There was so much counterintuitive that counterintuitive became counterintuitive. And why did we do this? Because Bloom is right. The coolest findings in the history of science have that “aha” component demonstrated with a convincing experimental study.
This is not to say that p-hacking and counterintuitive experimental methods are synonymous, just that as a field we valued counterintuitive findings so much that we employed problematic methods to arrive at them. Because of this unfortunate cocktail, the “counterintuitive” camp still has a serious, painful reckoning to face. We got away with methodological malpractice for several decades in the service of finding counterintuitive results. And, it was so cool. We were stars and pundits and public intellectuals riding a wave of confection that went poof. We ate ice cream for breakfast, lunch, and dinner. Even a small dose of methodological rigor dished up in the form of “eating your vegetables” is going to feel like punishment after that. But since the reproducibility rate of all of those counterintuitive findings is holding steady at less than 50%, I believe some vegetables are in order–or maybe an antibiotic. Having had an ulcer, I know firsthand the robustness of Marshall’s work. The relief that occurred after the first dose of antibiotics was profound. Psychology does not currently produce robust findings like that. To pine for the old days when we could flog the data to produce counterintuitive results without first cleaning up our methods, while understandable, is also counterproductive.
Naturally, many folks have reacted to the credibility crisis, which the counterintuitive paradigm helped to foster, with something akin to revulsion and have gone in search of alternatives or fixes, some conceptual, some methodological. One consistent line of thinking is that we should prioritize a range of methods roughly described with terms like descriptive, observational, and exploratory. I’m going to go out on a slightly nerdy, psychometrically-inspired interpretive limb here and say that these are all manifest indicators of the true latent factor behind these terms–reality-based research. A bunch of us would prefer that the work we do is grounded in reality–findings that are robust or, even more provocatively, findings that reflect the true nature of human nature.
Chris Fraley put it to me well. He said that the call for more descriptive and exploratory research is grounded in the concern that we don’t have a sound foundation for understanding how and why people behave the way they do. The theory of evolution by natural selection, for example, would not have come about but for a huge repository of direct observations of animal behavior and morphology. Why not psychology? It seems reasonable that psychology should have a well-documented picture of the important dimensions of human thought, feeling, and behavior that is descriptively rich, grounded in the lives that people lead, accurate, and repeatable. When I hear colleagues say that we should do more descriptive work, this is what I’m hearing them say.
Preferably, this real understanding of human behavior would then be the basis upon which insightful experiments would be designed.
Of course, Bloom is partially right that descriptive work can and should put people to sleep. Much of my work does.** Just ask my students. And just by being descriptive, it may not be any more useful than a counterproductive counterintuitively motivated experiment. What of all of that descriptive, observational work on ulcers before Barry Marshall’s work? It had come to the conclusion that stress caused ulcers. Would another observational study of stress and ulcer symptoms have brought insights to this situation? How about a fancy longitudinal, cross-lagged panel model? Ooooh, even better, a longitudinal, growth mixture model of stress and ulcer groups. I’m getting the vapors just thinking about it. No, sorry, given my experience with ulcers I prefer a keen insight into the mechanisms that allowed ulcers to be treated quickly and easily, thank you.
That said, the fetishizing of clever counterintuitiveness and demeaning of descriptive work as boring also smacks of elitism. After all, the truth doesn’t care if it is boring. I remember watching in bemused wonderment back in grad school when Oliver John would receive ream after ream of factor structures in the snail mail from Lew Goldberg who was at the time cranking out the incredibly boring analyses that arrived at the insight that most of how we describe each other can be organized into five domains. It was like watching an accountant get really excited about a spreadsheet. On the other hand, the significance of the Big Five and the revolutionizing effect it has had on the field of personality psychology cannot be overstated. If there was an aha moment it wasn’t the result of anything counterintuitive.
And the trope that observation and description of humans are intrinsically boring is possibly more of an indictment of psychologists’ lack of imagination and provincialism than anything else. After all, there is an entire field across the quad from most of us called Anthropology that has been in the practice of describing numerous cultures, countries, and tribes across the globe. Human ethology, cultural anthropology, and behavioral ecology are remarkably interesting fields with often surprising insights into the uniquenesses and commonalities of all peoples. One could argue that we could get a head start on the whole description thing by reading some of their work instead of cooking up our own stew of descriptive research.
If there is a little homily to end this essay I guess it would be not to lionize either description or counterintuitive methods. Neither method has the market cornered on providing insight.
* Just kidding. I think he meant it as a compliment.
** They say sleep is good for you. Therefore, my research can and does have a positive impact on society.
Contributors to this blog (in alphabetical order a la the economists)
David Condon, Chris Fraley, Katie Corker, Rodica Damian, M. Brent Donnellan, Grant Edmonds, David Funder, Don Lynam, Dan Mroczek, Uli Orth, Alexander Shackman, Uli Schimmack, Chris Soto, Brent Roberts, Jennifer Tackett, Brenton Wiernik, Sara Weston
Scientific personality psychology has had a bit of a renaissance in the last few decades, emerging from a period of deep skepticism and subsequent self-reflection to a period where we believe there are robust findings in our field.
The problem is that many people, scientists included, don’t follow scientific personality psychology and remain blithely unaware of the field’s accomplishments. In fact, it is quite common to do silly things like equate the field of scientific personality psychology with the commodity that is the MBTI.
With this situation in mind, I recently asked a subset of personality psychologists to help identify what they believed to be robust findings in personality psychology. You will find the product of that effort below.
We are not assuming that we’ve identified all of the robust findings. In fact, we’d like you to vote on each one to see whether these are consensually defined “robust findings.” Moreover, we’d love you to comment and suggest other candidates for consideration. All we ask is that you characterize the finding and suggest some research that backs up your suggestion. We’ve kept things pretty loose to this point, but the items below can be characterized as findings that replicate across labs and have a critical mass of research that is typically summarized in one or more meta-analyses. We are open to suggestions about making the inclusion criteria more stringent.
Regardless of your feelings about this effort, I found the experience to be illuminating. At one level, I personally believe that every field should do this even if the result is not convincing. As people have noted, as self-described scientists we are in the enterprise of discovering reliable facts. If we can’t readily identify the provisional facts we’ve come up with and communicate them to others in simple language something is really, really wrong with our field.
If it is the case that you believe these findings are mundane or obvious, we look forward to the link to your post laying out what you thought was mundane and obvious from 2 years ago, or any time in the past for that matter. Lacking that, we suspect your contempt says more about you than about these findings.
Personality traits partially predict longevity at an equal level to, and above and beyond, socioeconomic status and intelligence.
Graham, E.K., Rutsohn, J.P., Turiano, N.A., Bendayan, R., Batterham, P., Gerstorf, D., Katz, M., Reynolds, C., Schoenhofen, E., Yoneda, T., Bastarache, E., Elleman, Zelinski, E.M., Johansson, B., Kuh, D., Barnes, L.L., Bennett, D., Deeg, D., Lipton, R., Pedersen, N., Piccinin, A., Spiro, A., Muniz-Terrera, G., Willis, S., Schaie, K.W., Roan, C., Herd, P., Hofer, S.M., & Mroczek, D.K. (2017). Personality predicts mortality risk: An integrative analysis of 15 international longitudinal studies. Journal of Research in Personality, 70, 174-186.
Personality factors are partially heritable with most of the variance being from non-shared environmental influences and only a small portion being the result of shared environmental influences, like all other psychological constructs.
The infamous personality coefficient compares favorably to other effect sizes studied in many areas of Psychology and related fields. Large effects are not expected when considering multiply-determined, consequential life outcomes.
Personality shows both consistency (rank relative to others) and change (level relative to younger self) across time. Personality continues to change across the lifespan (largest changes between ages 18 and 30, but continues later on) and the mechanisms of change include social investment, life experiences, therapy, and one’s own volition.
Personality-descriptive language, psychological tests, and pretty much every other form of describing or measuring individual differences in behavior can be organized in terms of five or six broad trait factors.
Crowe, M.L., Lynam, D.R., Campbell, W.K., & Miller, J.D. (2019). Exploring the structure of narcissism: Towards an integrated solution. Journal of Personality, 87, 1151-1169.
Hur, J., Stockbridge, M. D., Fox, A. S. & Shackman, A. J. (2019). Dispositional negativity, cognition, and anxiety disorders: An integrative translational neuroscience framework. Progress in Brain Research, 247, 375-436.
Kotov, R., Gamez, W., Schmidt, F., & Watson, D. (2010). Linking “big” personality traits to anxiety, depressive, and substance use disorders: a meta-analysis. Psychological bulletin, 136(5), 768.
Lynam, D.R., & Miller, J.D. (2015). Psychopathy from a basic trait perspective: The utility of a five-factor model approach. Journal of Personality, 83, 611-626.
Lynam, D.R. & Widiger, T. (2001). Using the five factor model to represent the personality disorders: An expert consensus approach. Journal of Abnormal Psychology, 110, 401-412.
Miller, J.D., Lynam, D.R., Widiger, T., & Leukefeld, C. (2001). Personality disorders as extreme variants of common personality dimensions: Can the Five Factor Model adequately represent psychopathy? Journal of Personality, 69, 253-276.
Shackman, A. J., Tromp, D. P. M., Stockbridge, M. D., Kaplan, C. M., Tillman, R. M., & Fox, A. S. (2016). Dispositional negativity: An integrative psychological and neurobiological perspective. Psychological Bulletin, 142, 1275-1314.
Vize, C.E., Collison, K.L., Miller, J.D., & Lynam, D.R. (2019). Using Bayesian methods to update and expand the meta-analytic evidence of the Five-Factor Model’s relation to antisocial behavior. Clinical Psychology Review, 67, 61-77.
Widiger, T. A., Sellbom, M., Chmielewski, M., Clark, L. A., DeYoung, C. G., Kotov, R., … & Samuel, D. B. (2019). Personality in a hierarchical model of psychopathology. Clinical Psychological Science, 7(1), 77-92.
Wright, A. G., Hopwood, C. J., & Zanarini, M. C. (2015). Associations between changes in normal personality traits and borderline personality disorder symptoms over 16 years. Personality Disorders: Theory, Research, and Treatment, 6(1), 1.
Personality partially predicts financial and economic outcomes, such as annual earnings, net worth, and consumer spending.
Denissen, J. J. A., Bleidorn, W., Hennecke, M., Luhmann, M., Orth, U., Specht, J., & Zimmermann, J. (2018). Uncovering the power of personality to shape income. Psychological Science, 29, 3-13. http://dx.doi.org/10.1177/0956797617724435
Judge, T. A., Livingston, B. A., & Hurst, C. (2012). Do nice guys—and gals—really finish last? The joint effects of sex and agreeableness on income. Journal of Personality and Social Psychology, 102, 390-407. doi: 10.1037/a0026021
Moffitt, T. E., Arseneault, L., Belsky, D., Dickson, N., Hancox, R. J., Harrington, H., … & Sears, M. R. (2011). A gradient of childhood self-control predicts health, wealth, and public safety. Proceedings of the National Academy of Sciences, 108(7), 2693-2698.
Nyhus, E. K., & Pons, E. (2005). The effects of personality on earnings. Journal of Economic Psychology. 26(3), 363-384.
Roberts, B., Jackson, J. J., Duckworth, A. L., & Von Culin, K. (2011, April). Personality measurement and assessment in large panel surveys. In Forum for health economics & policy (Vol. 14, No. 3). De Gruyter.
Weston, S. J., Gladstone, J. J., Graham, E. K., Mroczek, D. K., & Condon, D. M. (2019; published advance access online September 13, 2018). Who Are the Scrooges? Personality Predictors of Holiday Spending. Social Psychological and Personality Science, 10, 775-782.
Birth order is functionally unrelated to personality traits and only modestly related to cognitive ability.
Damian, R. I., & Roberts, B. W. (2015). The associations of birth order with personality and intelligence in a representative sample of US high school students. Journal of Research in Personality, 58, 96-105.
Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2015). Examining the effects of birth order on personality. Proceedings of the National Academy of Sciences, 112(46), 14224-14229.
Rohrer, J. M., Egloff, B., & Schmukle, S. C. (2017). Probing birth-order effects on narrow traits using specification-curve analysis. Psychological Science, 28(12), 1821-1832.
Personality traits, especially conscientiousness and emotional stability, are partially related to reduced risk for Alzheimer’s disease syndrome.
Chapman, B. P., Huang, A., Peters, K., Horner, E., Manly, J., Bennett, D. A., & Lapham, S. (2019). Association Between High School Personality Phenotype and Dementia 54 Years Later in Results From a National US Sample. JAMA Psychiatry.
Terracciano, A., Sutin, A. R., An, Y., O’Brien, R. J., Ferrucci, L., Zonderman, A. B., & Resnick, S. M. (2014). Personality and risk of Alzheimer’s disease: new data and meta-analysis. Alzheimer’s & Dementia, 10(2), 179-186.
Wilson, R. S., Arnold, S. E., Schneider, J. A., Li, Y., & Bennett, D. A. (2007). Chronic distress, age-related neuropathology, and late-life dementia. Psychosomatic Medicine, 69(1), 47-53.
Wilson, R. S., Schneider, J. A., Arnold, S. E., Bienias, J. L., & Bennett, D. A. (2007). Conscientiousness and the incidence of Alzheimer disease and mild cognitive impairment. Archives of general psychiatry, 64(10), 1204-1212.
Barrick, M. R., Mount, M. K., & Judge, T. A. (2001). Personality and performance at the beginning of the new millennium: What do we know and where do we go next? International Journal of Selection and Assessment, 9(1/2), 9–30. https://doi.org/10/frqhf2
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85(6), 869–879. https://doi.org/10/bc7959
van Aarde, N., Meiring, D., & Wiernik, B. M. (2017). The validity of the Big Five personality traits for job performance: Meta-analyses of South African studies. International Journal of Selection and Assessment, 25(3), 223–239. https://doi.org/10/cbhv
Particularly more motivation-driven behaviors (e.g., helping, rule breaking):
Berry, C. M., Carpenter, N. C., & Barratt, C. L. (2012). Do other-reports of counterproductive work behavior provide an incremental contribution over self-reports? A meta-analytic comparison. Journal of Applied Psychology, 97(3), 613–636. https://doi.org/10/fzktph
Berry, C. M., Ones, D. S., & Sackett, P. R. (2007). Interpersonal deviance, organizational deviance, and their common correlates: A review and meta-analysis. Journal of Applied Psychology, 92(2), 410–424. https://doi.org/10/b965s7
Chiaburu, D. S., Oh, I.-S., Berry, C. M., Li, N., & Gardner, R. G. (2011). The five-factor model of personality traits and organizational citizenship behaviors: A meta-analysis. Journal of Applied Psychology, 96(6), 1140–1166. https://doi.org/10/fnfd2q
Bono, J. E., & Judge, T. A. (2004). Personality and transformational and transactional leadership: A meta-analysis. Journal of Applied Psychology, 89(5), 901–910. https://doi.org/10/ctfhf9
Judge, T. A., Bono, J. E., Ilies, R., & Gerhardt, M. W. (2002). Personality and leadership: A qualitative and quantitative review. Journal of Applied Psychology, 87(4), 765–780. https://doi.org/10/bhfk7d
DeRue, D. S., Nahrgang, J. D., Wellman, N. E. D., & Humphrey, S. E. (2011). Trait and behavioral theories of leadership: An integration and meta-analytic test of their relative validity. Personnel Psychology, 64(1), 7–52. https://doi.org/10/fwzt2t
Among the Big Five, conscientiousness (C) has the largest and most consistent relationships:
Wilmot, M. P., & Ones, D. S. (2019). A century of research on conscientiousness at work. Proceedings of the National Academy of Sciences. https://doi.org/10/ggcjvr
There is a hierarchy of consistency in personality with cognitive abilities at the top followed by personality traits and then subjective evaluations like subjective well-being and life satisfaction
Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5(1), 11-25.
Fujita, F., & Diener, E. (2005). Life satisfaction set point: stability and change. Journal of personality and social psychology, 88(1), 158.
Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766.
How about a little blast from the past? In rooting around in an old hard drive searching for Pat Hill’s original CV, I came across a document that we wrote way back in 2006 on how to write more effectively. It was a compilation of the collective wisdom at that time of Roberts, Fraley, and Diener. It was interesting to read after 13 years. Fraley and I have updated our opinions a bit. We both thought it would be good to share if only for the documentation of our pre-blogging, pre-Twitter thought processes.
Manuscript Acronyms from Hell: Lessons We’ve Learned on Writing the Empirical Research Article
By Brent Roberts (with substantial help from Ed Diener and Chris Fraley)
Originally written sometime in 2006
Updated 2019 thoughts in blue
Here is a set of subtle lessons that we’ve culled from our experience writing journal articles. They are intended as a short list of questions that you can ask yourself each time you complete an article. For example, before you submit your paper to a journal, ask yourself whether you have created a clear need for the study in the introduction, or whether everything is parallel, etc. This list is by no means complete, but we do hope that it is useful.
Create The Need (CTN). Have you created the need? Have you made it clear to the reader why your study needs to be done and why he or she should care? This is typically done in one of two ways. The first way is to show that previous research has failed to consider some connection or some methodological permutation or both. This means reviewing previous research in a positive way with a bite at the end in which you explain that, despite the excellent work, this research failed to consider several things. The second way is to point out that you are doing something completely unique. Even if you are taking this approach, you should review the “analogue” literature. The analogue literature is a line of research that is conceptually similar in content or method, but not exactly like your study.
Fraley 2019: I try to encourage my students to do this based on the ideas themselves. Specifically, the question should be so important, either for theoretical reasons or due to their natural appeal, that the “so what, who cares, why bother” (i.e., the “Caroline Trio”) is unambiguous.
I don’t like it when authors justify the need by saying something along the lines of “No one has addressed this yet.” Research doesn’t examine the association between coffee consumption and the use of two vs. one spaces after a period either. Thus, there is a gap in the literature. But the gap is appropriate: There is no need to address that specific question.
I mention this simply because, imo, the “need” should emerge not only from holes/limitations in the literature. The “need” should also be clear independently of what has or has not been done to date.
Always Be Parallel (ABP). Every idea that is laid out in the introduction should be in the methods, results, and discussion. Moreover, the order of the ideas should be exactly the same in each section. Assume your reader is busy, tired, bored, or lazy, or some combination of these wonderful attributes. You don’t want to make your reader work too hard, otherwise they will quickly become someone who is not your reader. Parallelism also refers to emphasis. If you spend three pages discussing a topic in the introduction and two sentences in the results and discussion on the same topic, then you either have to 1) cut the introductory material, or 2) enhance the material in the results and discussion.
Correlate Ideas and Method (CIM). The methods that you choose to adopt in your study should be clearly linked to the concepts and ideas that inspire your research. Put another way, the method you are going to use (e.g., correlation, factor analysis, text analysis, path model, repeated measures experiment, between-subject experiment) should be clear to the readers before they get to the method section.
Eliminate All Tangents (EAT). If you introduce an idea that is not directly germane to your study, eliminate it. That is, if it is not part of your method or not tested in your results then eliminate it from your introduction. If it is important for future research, put it in your discussion. Remember Bem’s maxim: If you find a tangent in your manuscript make it a footnote. In the next revision of the paper, eliminate all footnotes.
Roberts 2019: It is interesting looking back now and seeing that I cited, without issue, Bem’s chapter that everyone now excoriates for containing a recipe for p-hacking. Yes, I used to assign that chapter to my students. In retrospect, and even now, his p-hacking section did not bother me, largely because I’ve always been a fan of exploratory and abductive approaches to research: explore first, then validate. If you are going to explore, then it is typically good to report what your data show you. Of course, you should not then “HARK the herald angels sing” and make up your hypotheses after the fact.
Always Be Deductive (ABD). Papers that start with a strong thesis/research question read better than papers that have an inductive structure. The latter build to the study through reviewing the literature. After several pages the idea for the study emerges. The deductive structure starts with the goal of the paper and then often provides an outline or advance organizing section at the beginning of the article informing the reader of what is to come.
Fraley 2019: I don’t endorse this claim strongly (but I think it has its uses). I think this mindset puts the author in the position of “selling” an idea. When authors use a deductive structure, I start to question their biases and whether they are more committed to the idea that motivates the deduction or to the facts/data that could potentially challenge that framework.
No Ad Hominem Attacks (NAHA). Don’t point out the failings or foibles of researchers, even if they are idiots. This will needlessly piss off the researcher, who is most likely going to be a reviewer. Or, it will piss off friends of the researcher, who are also likely to be reviewers. If you are going to attack anything, then attack ideas.
Fraley 2019: Only an idiot would make this recommendation.
Roberts 2019: Have you considered being more active on Twitter?
Contrast Two Real Hypotheses (CTRH). Although not attainable in every instance, we like to design studies and write papers that contrast two theoretical perspectives or hypotheses in which one of the hypotheses is not the null hypothesis. This accomplishes several goals at once. First, it helps to generate a deductive structure. Second, it tends to diminish the likelihood of ad hominem attacks, as you have to give both theoretical perspectives their due. In terms of analyses, it tends to force you into contrasting two models rather than throwing yourself against the shoals of the null hypothesis every time, which is relatively uninteresting.
Writing Is Rewriting (WIR). There is no such thing as a “final” draft. There is simply the paper that you submit. This is not to say that you should be nihilistic about your writing and submit slipshod prose because there is no hope of attaining perfection. Rather, you should strive for perfection and learn to accept the fact that you will never achieve it.
Two-Heads-Are-Better-Than-One (THABTO). Have someone else read your paper before turning it in or submitting it. A second pair of eyes can detect flaws that you have simply habituated to after reading through the document for the 400th time. This subsumes the recommendation to always proofread your document. In general, we recommend collaborating with someone else. Oftentimes, a second person possesses skills that you lack. Working with that person leverages your combined skills. This inevitably leads to a better paper.
Use Active Language (UAL). Where possible, eliminate the passive voice.
Define Your Terms (DYT). Make sure you define your concepts when they are introduced in the paper.
One Idea Per Sentence (OIPS).
Review Ideas Not People (RINP). When you have the choice of saying “Smith and Jones (1967) found that conscientiousness predicts smoking,” or “Conscientiousness is related to a higher likelihood of smoking (Smith & Jones, 1967),” choose the latter.
Don’t Overuse Acronyms (DOA).
Ed Diener summarizes much of this more elegantly. When writing your paper make the introduction lead up to the questions you want to answer; don’t raise extra issues in the introduction that you don’t answer. Make it seem like what you are doing follows as the next direct and logical thing from what has already been done. Moreover, emphasize that what you are doing is not just a nice thing to do, but THE next thing that is essential to do.
Fraley 2019: Another idea worth adding: Write in a way that would allow non-experts to understand what you’re doing and why. Also, many of your readers might not be native English speakers. As such, it is best to write directly and avoid turns of phrase or idioms. Focus on communicating ideas rather than showing off your vocabulary or your knowledge of obscure ideas.
Roberts 2019: I can’t help but think about this list of recommendations in light of the reproducibility crisis. The question I would ask now is whether these recommendations apply as well to a registered report as they would to the typical paper from 2006. I think the 2006 list implicitly accepted some of the norms of the time, especially that the null is never accepted, at least for publication, and that HARKing was “good rhetoric.” Where the list might go astray now is not with registered reports, but in writing up exploratory research. I think we need some new acronyms and norms for exploratory studies. Of course, that would assume that the field actually decides to honor honestly depicted exploratory work, which it has yet to do. If we aren’t going to publish those types of papers, we don’t need norms, do we?
Fraley 2019: Building off your latest comment, I don’t see anything in here that wouldn’t apply to registered reports or (explicitly) exploratory research. In each case, it is helpful to build a need for the work, to articulate alternative perspectives on what the answer might be (even if it is exploratory), to write clearly, eliminate tangents, not make personal attacks on other scholars, etc.
Having said that, I think most authors still operate under the assumption that, if they are testing a hypothesis (in a non-competing hypotheses scenario), they have to be “right” (“As predicted, …”) in order to get other people to value their contribution. I think we lack a framework for how to write about and discuss findings that are difficult to reconcile with existing models or which do not line up with expectations.
Do you have any 2019 recommendations on how to approach this issue?
My off-the-cuff initial suggestion is that we need to find a way, especially in our Discussion sections, to get comfortable with uncertainty. A study doesn’t need to provide “clean results” to make a contribution, and not every study needs to make a definitive contribution.
Roberts 2019. I think there are really deceptively simple ways to get comfortable with uncertainty. First, we could change the norms from valuing “The Show” which says publish clever, counterintuitive ideas that lead directly to the TED-Gladwellian Industrial Complex (e.g., a book contract and B-School job or funding from a morally questionable benefactor) to getting it right. And, by getting it right, I mean honestly portraying your attempts to test ideas and reporting on those attempts regardless of their “success.” Good luck with that one.
A second deceptively simple way to grow comfortable with uncertainty is to work on important ideas that matter to people and society rather than what your advisor says is important. Who cares whether attachment is a type or continuum or whether the structure of conscientiousness includes traditionalism? What matters is whether childhood attachment really has any consequential effects on outcomes we care about—likewise for conscientiousness. Instead of asking, “Does conscientiousness matter to _____”, we could ask “How do I help more adults avoid contracting Alzheimer’s disease.” When asked that way, finding out what doesn’t work (e.g., a null effect) is just as important as finding out what does.
By the way, I just found a typo in the original text….13 years and countless readings by multiple people and it was still not “perfect.”
 Pat claims he only had 2 publications when he applied for our post doc. I remember 7 to 9. Needless to say, he went on to infamy by publishing at a rate during the post doc that no-one to my knowledge has matched. I’d like to take credit for that but given the fact that he continues to publish at that rate, I’m beginning to think it was Pat….
 This, of course, is a bit of an overstatement. As Chris Fraley points out, the judicious use of footnotes can assuage the concerns of reviewers that you failed to consider their research. By eliminating tangents, I mean getting rid of entire paragraphs that are not directly relevant to your paper.
 This is not to say that the motivation for a line of research should not be inspired by a negative reaction to someone or someone’s ideas. It is okay to get your underwear in a bunch over someone’s aggressive ignorance and then do something about it in your research. Just don’t write it up that way.
I seem to replicate the same conversation on Twitter every time a different sliver of the psychological guild confronts open science and reproducibility issues. Each conversation starts and ends the same way as conversations I’ve had or seen 8 years ago, 4 years ago, 2 years ago, last year, or last month.
In some ways that’s a good sign. Awareness of the issue of reproducibility and efforts to improve our science are reaching beyond the subfields that have been at the center of the discussion.
Greater engagement with these issues is ideal. The problem is that each time a new group realizes that their own area is subject to criticism, they raise the same objections based on the same misconceptions, leading to the same mistaken attack on the messengers: They claim that scholars pursuing reproducibility or meta-science issues are a highly organized phalanx of intransigent, inflexible authoritarians who are insensitive to important differences among subfields and who want to impose a monolithic and arbitrary set of requirements on all research.
In these “conversations,” scholars recommending changes to the way science is conducted have been unflatteringly described as sanctimonious, despotic, authoritarian, doctrinaire, and militant, and creatively labeled with names such as shameless little bullies, assholes, McCarthyites, second stringers, methodological terrorists, fascists, Nazis, Stasi, witch hunters, reproducibility bros, data parasites, destructo-critics, replication police, self-appointed data police, destructive iconoclasts, vigilantes, accuracy fetishists, and human scum. Yes, every one of those terms has been used in public discourse, typically by eminent (i.e., senior) psychologists.
Villainizing those calling for methodological reform is ingenious, particularly if you have no compelling argument against the proposed changes*. It is a surprisingly effective, if corrosive, strategy.
Unfortunately, the net effect of all of the name calling is that people develop biased, stereotypical views of anyone affiliated with promoting open and reproducible science**. Then, each time a new group wrestles with reproducibility, we hear the same “well those reproducibility/open science people are militant” objection, as if it is at all relevant to whether you pre-register your study or not***. And this is not to say that all who promote open and reproducible science are uniformly angelic. Far from it. There are really nasty people who are also proponents of open science and reproducibility, and some of them are quite outspoken.
Just. Like. Every. Other. Group. In. Psychology****.
And, just like every other group in psychology, the majority of those advocating for reform are modest and reasonable. But as seems to be the case in our social media world, the modest and reasonable ones are lost in the flurry of fury caused by the more noisy folk. More importantly, the existence of a handful of nasty people has no bearing on the value of the arguments themselves. Regardless of whether you hear it from a nasty person or a nice one, it would improve the quality of our scientific output if we aspired to more often pre-register, replicate, post our materials, and properly power our studies.
The other day on Twitter, I had the conversation again. My colleague Don Lynam (@drl54567) likened the sanctimoniousness of the reproducibility brigade to that of ex-smokers, which at first blush was a compelling analogy. Maybe we do get a bit zealous about reproducibility because we’ve committed ourselves to the task. Who hasn’t met a drug and alcohol counselor or ex-smoker who is a tad bit too passionate about helping us to quit drinking and smoking?
But, as I told Don, a better analogy is water sanitation.
The job of a water sanitation engineer is to produce good, clean water. Some of us, circa 2011 or so*****, noticed a lot of E. coli in the scientific waters and concluded that our filtration system was broken. Some countered that a high amount of E. coli is normal in science and of no concern. Many of us disagreed. We pointed out how easily the filtration system could be improved to reduce the amount of E. coli–pre-registering our efforts, making our data and methods more open and transparent, directly replicating our own work, adequately powering our studies so that they actually can work as a filter–you get my point.
When you replace “scientific reform” with “water filtration” and “our subfield” with “our water source”, it reveals why having these same conversations over and over is so frustrating:
Them: “The water in our well is clean. There is no problem.”
Us: “Have you tested your water (e.g., registered replication report)?”
Them: “No.”
Us: “Then you can’t really be confident that your water is clean.”
Them: “Stop being so militant.”
Them: “I haven’t noticed any problems with our well, so there’s no reason to doubt the effectiveness of our filtration system.”
Us: “Has anyone else applied your filtration system to another well to make sure it works (direct replication)?”
Them: “No. Having other people do the same thing we do isn’t necessary (it’s a waste of time).”
Us: “But if you haven’t tested the effectiveness of your filtration system, how can you be sure that your filter works?”
Them: “Stop being so sanctimonious.”
Them: “Look at my shiny, innovative filtration system that I just created.”
Us: “Has it been tested in different wells (pre-registered study)?”
Them: “No. My job is to create new and shiny filters, not test whether they work for other people.”
Us: “But the water still has E. coli in it.”
Them: “Stop being such an asshole.”
Us: “Your well doesn’t give off enough water to even test (power your research better).”
Them: “What little water we have has always been perfectly clean.”
Us: “How about if we dig your well deeper and bigger so we can get more water out of it to test?”
Them: “How dare you question the quality of my water you terrorist.”
Them: “We get pure, clean water from every well we dig.”
Us: “Awesomesauce. Can you share your filtration system (open science)?”
Them: “With you? You’re not even an expert. You wouldn’t understand our system.”
Us: “If you post it in the town square we’ll try and figure it out with your help.”
Much of the frustration that I see on the part of those trying to clean the water, so to speak, is that the changes are benign and the arguments against the changes are weak, but people still attack the messenger rather than testing their water for E. coli. We have students getting sick (losing ground in graduate school) from drinking the polluted water (wasting time on bogus findings), and they blame themselves for drinking from non-potable water sources.
In the end, it would be lovely if everyone were kind and civil. It would be great if folks would stop using overwrought, historically problematic monikers for people they don’t like. But we know from experience that one person’s sober and objective criticism of a study is another person’s methodological terrorism. We know that being the target of replication efforts is intrinsically threatening. The emotions in science have run raw and will continue to do so. When these conversations focus on the tone or the unsavory personal qualities of those suggesting change, it shows how powerfully people want to avoid cleaning up the water.
Of course, emotional reactions and name calling are immaterial to whether there is E. coli in the water. And, it is in every scientist’s long-term interest to fix our filtration system******. Because it is broken. Those promoting open science and the techniques of reproducibility are motivated to improve the drinking water of science. Tools like pre-registration, posting materials, direct replication, and increased power are not perfect and they merit ongoing discussion and improvement. Yet presently, if you happen to be sitting on what you believe to be an unspoiled well-spring of scientific ideas, there is no better way to prove it than to have another team of scientists test your ideas in a well-powered, pre-registered, direct replication. When the results of that effort come in, we will be happy to discuss the findings, preferably in civil tones with no name calling.
Brent W. Roberts
*I’m not sure it was a deliberate decision, but if you want to avoid changing your methods, making the people the issue, not the ideas, is a brilliant strategy.
**In one very awkward, tragic dinner conversation one of my most lovely, kind colleagues described another one of my lovely, kind colleagues as a bully based solely on second hand rumors based on the name calling.
***A pre-registered hypothesis in need of testing–those who tell you the open science cabal is a cabal or militant or nasty or any other bad thing are scholars who have not attempted the reforms themselves and are looking for reasons not to change.
****And in science as a whole. And in life for that matter.
*****Some way before that.
******It might not be in every scientist’s short-term interest to do things well….
A while back, Michael Kraus (MK), Michael Frank (MF) and I (Brent W Roberts, or BWR; M. Brent Donnellan–MBD–is on board for this discussion so we’ll have to keep our Michaels and Brents straight) got into a Twitter-inspired conversation about the niceties of using polytomous rating scales vs yes/no rating scales for items. You can read that exchange here.
The exchange was loads of fun and edifying for all parties. An over-simplistic summary would be that, despite passionate statements made by psychometricians, there is no Yes or No answer to the apparent superiority of Likert-type scales for survey items.
We recently were reminded of our prior effort when a similar exchange on Twitter pretty much replicated our earlier conversation–I’m not sure whether it was a conceptual or direct replication….
In part of the exchange, Michael Frank (MF) mentioned that he had tried the 2-point option with items they commonly use and found the scale statistics to be so bad that they gave up on the effort and went back to a 5-point option. To which, I replied, pithily, that he was using the Likert scale and the systematic errors contained therein to bolster the scale reliability. Joking aside, it reminded us that we had collected similar data that could be used to add more information to the discussion.
But, before we do the big reveal, let’s see what others think. We polled the Twitterati about their perspective on the debate and here are the consensus opinions which correspond nicely to the Michaels’ position:
Most folks thought moving to a 2-point rating scale would decrease reliability.
Most folks thought it would not make a difference when examining gender differences on the Big Five, but clearly there was less consensus on this question.
And, most folks thought moving to a 2-point rating scale would decrease the validity of the scales.
Before I could dig into my data vault, M. Brent Donnellan (MBD) popped up on the Twitter thread and forwarded amazingly ideal data for putting the scale option question to the test. He’d collected a lot of data varying the number of scale options from 7 points all the way down to 2 points using the BFI2. He also asked a few questions that could be used as interesting criterion-related validity tests including gender, self-esteem, life satisfaction and age. The sample consisted of folks from a Qualtrics panel with approximately 215 people per group.
So, does moving to a dichotomous rating scale affect internal consistency?
Here are the average internal consistencies (i.e., coefficient alphas) for 2-point (Agree/Disagree), 3-point, 5-point, and 7-point scales:
Just so you know, here are the plots for the same analysis from a forthcoming paper by Len Simms and company (Simms, Zelazny, Williams, & Bernstein, in press):
This one is oriented differently and has more response options, but pretty much tells the same story. Agreeableness and Openness have the lowest reliabilities when using the 2-point option, but the remaining BFI domain scales are just fine–as in well above recommended thresholds for acceptable internal consistency that are typically found in textbooks.
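For readers who want to see what goes into the alphas discussed above, here is a minimal sketch of coefficient alpha using only the Python standard library. The function name `cronbach_alpha` and the toy Agree/Disagree responses are made up for illustration; they are not MBD's data or analysis code.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                          # number of items
    items = list(zip(*responses))                  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 2-point answers (0 = Disagree, 1 = Agree): 6 people x 4 items
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(round(cronbach_alpha(toy), 2))
```

The same function works for 3-, 5-, or 7-point responses, since alpha only depends on the item variances and the variance of the total score.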
What’s going on here?
BWR: Well, agreeableness is one of the most skewed domains–everyone thinks they are nice (News flash: you’re not). It could be that finer grained response options allow people to respond in less extreme ways. Or, the Likert scales are “fixing” a problematic domain. Openness is classically the most heterogeneous domain that typically does not hold together as well as the other Big Five. So, once again, the Likert scaling might be putting lipstick on a pig.
MK: Seeing this mostly through the lens of a scale user rather than a scale developer, I would not be worried if my reliability coefficients dipped to .70. When running descriptive stats on my data I wouldn’t even give that scale a second thought.
Also I think we can refer to BWR as “Angry Brent” from this point forward?
BWR: I prefer mildly exasperated Brent (MEB). And what are we to do with the Mikes? Refer to one of you as “Nice Mike” and the other as “Nicer Mike”? Which one of you is nicer? It’s hard to tell from my angry vantage point.
MBD: I agree with BWR. I also think the alphas reported with 2-point options are still more or less acceptable for research purposes. The often cited rules of thumb about alpha get close to urban legends (Lance, Butts & Michels 2006). Clark and Watson (1995) have a nice line in a paper (or at least I remember it fondly) about how the goal of scale construction is to maximize validity, not internal consistency. I also suspect that fewer scale points might prove useful when conducting research with non-college student samples (e.g. younger, less educated). And I like the simplicity of the 2-PL IRT model so the 2-point options hold some appeal. (The ideal point folks can spare me the hate mail). This might be controversial but I think it would be better (although probably not dramatically so) to use fewer response options and use the saved survey space/ink to increase the number of items even by just a few. Content validity will increase and the alpha coefficient will increase assuming that the additional items don’t reduce the average inter-item correlation.
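MBD's last point, that alpha rises with more items as long as the average inter-item correlation holds steady, follows from the standardized alpha formula, alpha = k·r̄ / (1 + (k − 1)·r̄). A quick sketch, where the average correlation of .25 and the item counts are hypothetical values chosen only for illustration:

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Hypothetical average inter-item correlation of .25, held constant
for k in (8, 10, 12, 16):
    print(k, round(standardized_alpha(k, 0.25), 3))
```

With the average correlation fixed, every added item nudges alpha upward, which is the trade MBD is proposing: fewer response options, a few more items.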
BWR: BTW, we have indirect evidence for this thought–we ran an online experiment where people were randomly assigned to conditions to rate items using a 2-point scale vs a 5-point scale. We lost about 300 people (out of 5000) in the 5-point condition due to people quitting before the end of the survey–they got tuckered out sooner when forced to think a bit more about the ratings.
MF: Since MK hasn’t chosen “nice Mike,” I’ll claim that label. I also agree that BWR lays out some good options for why the Likerts are performing somewhat better. But I think we might be able to narrow things down more. In the initial post, I cited the conventional cognitive-psych wisdom that more options = more information. But the actual information gain depends on the way the options interact with the particular distribution of responses in the population. In IRT terms, harder questions are more informative if everyone in your sample has high ability, but that’s not true if ability varies more. I think the same thing is going on here for these scales – when attitudes vary more, the Likerts perform better (are more reliable, because they yield more information).
In the dataset above, I think that Agreeableness is likely to have very bunched up responses up at the top of the scale. Moving to the two-point scale then loses a bunch of information because everyone is choosing the same response. This is the same as putting a bunch of questions that are too easy on your test.
I went back and looked at the dataset that I was tweeting about, and found that exactly the same thing was happening. Our questions were about parenting attitudes, and they are all basically “gimmes” – everyone agrees with nearly all of them. (E.g., “It’s important for parents to provide a safe and loving environment for their child.”) The question is how they weight these. Our 7-point scale version pulls out some useful signal from these weightings (preprint here, whole-scale alpha was .90, subscales in the low .8s). But when we moved to a two-point scale, reliability plummeted to .20! The problem was that literally everyone agreed with everything.
I think our case is a very extreme example of a general pattern: when attitudes vary widely in a population, a 2-point scale is fine. When they are very homogeneous, you need more scale points.
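MF's pattern can be illustrated with a small simulation. This is only a sketch under made-up assumptions: each person has a latent trait, each item response is that trait plus noise, and a `shift` parameter pushes everyone toward the "agree" end to mimic "gimme" items. The cut points, sample size, and seed are arbitrary choices, not anything estimated from the datasets discussed here.

```python
import random
from statistics import variance

def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    item_var_sum = sum(variance(col) for col in zip(*responses))
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def seven_point(y):
    """Cut a continuous response into 1..7, neutral category centered on 0."""
    cuts = (-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)
    return 1 + sum(y > c for c in cuts)

def two_point(y):
    """Agree (1) if the continuous response is above the neutral point, else 0."""
    return 1 if y > 0 else 0

def simulate(shift, n_people=2000, n_items=10, seed=1):
    """Return (alpha for 7-point, alpha for 2-point) for the same simulated answers."""
    rng = random.Random(seed)
    likert, binary = [], []
    for _ in range(n_people):
        theta = rng.gauss(0, 1)                    # latent trait
        ys = [theta + rng.gauss(0, 1) + shift for _ in range(n_items)]
        likert.append([seven_point(y) for y in ys])
        binary.append([two_point(y) for y in ys])
    return cronbach_alpha(likert), cronbach_alpha(binary)

for shift in (0.0, 2.0):                           # 0 = balanced items, 2 = "gimmes"
    a7, a2 = simulate(shift)
    print(f"shift={shift}: alpha 7-point={a7:.2f}, alpha 2-point={a2:.2f}")
```

Under these assumptions the 2-point version loses more reliability relative to the 7-point version once nearly everyone agrees with every item, which is the direction MF describes. How large the gap is in real data depends on the items and the population.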
What about validity?
Our first validity test is convergent validity–how well does the BFI2 correlate with the Mini-IPIP set of B5 scales?
BWR: From my vantage point we once again see the conspicuous nature of agreeableness. Something about this domain does not work as well with the dichotomous rating. On the other hand, the remaining domains look like there is little or no issue with moving from a 7-point to a 2-point scale.
MK: If all of you were speculating about why agreeableness doesn’t work as a two-point scale, I’d be interested in your thoughts. What dimensions of a scale might lead to this kind of reduced convergent validity? I can see how people would be unwilling to answer FALSE to statements like “I see myself as caring, compassionate” because, wow, harsh. Another domain might be social dominance orientation because most people have largely egalitarian views about themselves (possible willful ignorance), and so saying TRUE to something like “some groups of people are inherently inferior to other groups” might be a big ask for the normal range of respondents.
BWR: I would assume that in highly evaluative domains you might run into distributional troubles with dichotomously rated items. With really skewed distributions you would get attenuated correlations among the items and lower reliability. On the other hand, you really want to know who those people are who say “no” to “I’m kind”.
MBD: I agree with BWR’s opening points. When I first read your original blog post, I was skeptical. But then I dug around and found a recent MMPI paper (Finn, Ben-Porath, & Tellegen, 2015) that was consistent with BWR’s points. I was more convinced but I still like seeing things for myself. Thus, I conducted a subject pool study when I was at TAMU and pre-registered my predictions. Sure enough, the convergent validity coefficients were not dramatically better for a 5-point response option versus T/F for the BFI2 items. I then collected additional data to push that idea but this is a consistent pattern I have seen with the BFI2 – more options aren’t dramatically better when it comes to response options. I have no clue if this extends beyond the MMPI/BFI/BFI-2 items or not. But my money is on these patterns generalizing.
As for Agreeableness, there is an interesting pattern that supports the idea that the items get more difficult to endorse/reject (depending on their polarity) when you constrain the response options to 2. If we convert all of the observed scores to the Percentage of Maximum Possible scores (see Cohen, Cohen, Aiken, & West, 1999), one could loosely compare across the formats. The average score for A in the 2-Point version was 82.78 (SD = 17.40) and it drops to 70.86 (SD = 14.26) in the 7 point condition. So this might be a case where giving more response options allows people to admit to less desirable characteristics (The results for the other composites were less dramatic). So, I think MK has a good point above that might qualify some of my enthusiasm for the 2-pt format for some kinds of content.
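The Percentage of Maximum Possible (POMP) conversion MBD mentions is a simple rescaling: 100 × (observed − minimum) / (maximum − minimum). A minimal sketch follows; the function name `pomp` and the example scores are hypothetical, and the 1-to-2 and 1-to-7 codings are assumptions about how the response options were scored.

```python
def pomp(score, scale_min, scale_max):
    """Rescale a score to 0-100: percentage of the maximum possible score."""
    return 100 * (score - scale_min) / (scale_max - scale_min)

# Hypothetical mean item scores on two response formats
print(round(pomp(1.8, 1, 2), 1))    # 2-point format coded 1-2
print(round(pomp(5.25, 1, 7), 1))   # 7-point format coded 1-7
```

Because both formats land on the same 0 to 100 metric, means from the 2-point and 7-point conditions can be loosely compared, as MBD does above.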
MF: OK, so this discussion above totally lines up with my theory that agreeableness is less variable, especially the idea that range on some of these variables might be restricted due to social desirability. MBD, BWR, is this something that’s generally true that agreeableness has low variance? (A histogram of responses for each variable in the 7 point case would be useful to see this by eye).
More generally, just to restate the theory: 2-point is good when there is a lot of variance in the population. But when variance is compressed – whether due to social desirability or true homogeneity – more scale points are increasingly important.
BWR: I don’t see any evidence for variance issues, but I am aware of people reporting skewness problems with agreeableness. Most of us believe we are nice. But, there are a few folks who are more than willing to admit to being not nice–thus, variances look good, but skewness may be the real culprit.
How about gender differences?
BWR: I see one thing in this table: sampling error. There is no rhyme nor reason to the way these numbers bounce around to my read, but I’m willing to be convinced.
MBD: I should give credit to Les Morey (creator of the PAI) for suggesting this exploratory question. I am still puzzled why the effect sizes bounce around (and have seen this in another dataset). I think a deeper dive testing invariance would prove interesting. But who has the time?
At the very least, there does not seem to be a simple story here. And it shows that we need a bigger N to get those CIs narrower. The size of those intervals make me kind of ill.
MF: I love that you guys are upset about CIs this wide. Have you ever read an experimental developmental psychology study? On another note, I do think it’s interesting that you’re seeing overall larger effects for the larger numbers of scale points. If you look at the mean effect, it’s .20 for the 7-pt, and .10 for the 2-pt, 15.5 for the 3-pt, and .2 for the 5-pt. So sure, lots of sampling error, but still some kind of consistency…
MK: Despite all the bouncing around, there doesn’t seem to be anything unusual about the two-option scale confidence intervals.
And now the validity coefficients for self-esteem (I took the liberty of reversing the N scores to ES scores so everything was positive).
BWR: On this one the True-False scales actually do better than the Likert scales in some cases. No strong message here.
MK: This is shocking to me! Wow! One question though — could the two-point scale items just be reflecting this overall positivity bias and not the underlying trait construct? That is, if the two point scales were just measures of self-esteem would this look just like it does here? I guess I’m hoping for some discriminant validity… or maybe I’d just like to see how intercorrelated the true-false version is across the five factors and compare that correlation to the longer Likerts.
BWR: Excellent point MK. To address the overall positivity bias inherent in a bunch of evaluative scales, we correlated the different B5 scales with age down below. Check it out.
MK: That is so… nice of you! Thanks!
BWR: I wish you would stop being so nice.
MF: I agree that it’s a bit surprising to me that we see the flip, but, going with my theory above, I predict that extraversion is the scale with the most variance in the larger Likert ratings. That’s why the 2-pt is performing so well – people really do vary in this characteristic dramatically AND there’s less social desirability coming out in the ratings, so 2-point is actually useful.
And finally the coefficients for life satisfaction
MK: I’m a believer now, thanks Brent and Angry Brent!
MBD: Wait, which Brent is Angry! 😉
MF: Ok, so if I squint I can still say some stuff about variance etc. But overall it is true that the validity for the 2-point scale is surprisingly reasonable, especially for these lower-correlation measures. In particular, maybe the only things that really matter for life-satisfaction correlations are the big differences; so you accentuate these characteristics in the 2-pt and get rid of minor variance due to other sources.
How about age?
As was noted above, self-esteem and life satisfaction are rather evaluative, as are the Big Five and that might create too much convergent validity and not enough discriminant validity. What about a non-evaluative outcome like age? Each of the samples was on average in their 50s with age ranges from young adulthood through old age. So, while the sample sizes were a little small for stable estimates (we like 250 minimum), age is not a bad outcome to correlate to because it is clearly not biased from social desirability. Unless, of course, we lie systematically about our age….
If you are keen on interpreting these coefficients, the confidence intervals for samples of this size are about + or – .13. Happy inferencing.
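The "+ or – .13" figure can be reproduced with the usual Fisher z approximation for a correlation's confidence interval. A sketch, assuming roughly 215 people per group (the approximate cell size reported above) and a 95% interval:

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate CI for a correlation via the Fisher z transformation."""
    z = math.atanh(r)                 # Fisher z of the observed correlation
    se = 1 / math.sqrt(n - 3)         # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lower, upper = r_confidence_interval(0.0, 215)
print(round(lower, 2), round(upper, 2))
```

The interval is symmetric only around a correlation of zero; for larger correlations it narrows slightly and becomes asymmetric, so ±.13 is a convenient rule of thumb rather than an exact band.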
BWR: I find these results really interesting. Despite the apparent issues with the true-false version of agreeableness, it actually has the largest correlation with age–actually higher than most prior reports, which admittedly are based on 5-point rating scale measures of the Big Five. I’m tempted to interpret the 3-Point scales as problematic, but I’m going to go with sampling error again. It was probably just a funky sample.
MK: OK then. I agree, I think the 3-point option is being the strangest for agreeableness.
MBD: I have a second replication sample where I used 2,3,4,5,6, and 7 response formats. The cell sizes are a bit smaller but I will look at those correlations in that one as well.
MBD: This was super fun and I appreciate that you three let me join the discussion. I admit that when I originally read the first exchange, I thought something was off about BWR’s thinking [BWR–you are not alone in that thought]. I was in a state of cognitive dissonance as it went against a ”5 to 7 scale points are better than alternatives” heuristic. Reading the MMPI paper was the next step toward disabusing myself of my bias. Now after collecting these data, hearing a talk by Len Simms about his paper, and so forth, I am not as opposed to using fewer scale points than I was in the past. This is especially true if it allows one to collect additional items. That said, I think more work about content by scale point interactions is needed for the reasons brought up in this post. However, I am a lot more positive toward 2-point scales than I was in the past. Thanks!
MF: Agreed – this was an impressive demonstration of Angry Brent’s ideas. Even though 7-pt sometimes is still performing better, overall the lack of problems with 2-pt is really food for thought. Even I have to admit that sometimes the 2-pt can be simpler and easier. On the other hand, I will still point to our parenting questionnaire – which is much more tentative and early stage in terms of the constructs it measures than the B5! In that case, it essentially destroyed the instrument to use a 2-pt scale because there was so much consensus (or social desirability)! So while I agree with the theoretical point from the previous post – consider 2-pt scales! – I also want to sound a cautious note here because not every domain is as well-understood.
MK: Agree on the caution that MF alludes to, but wow, the 2-point scale performed far better than I anticipated. Thanks for doing this all!
BWR: I love data. It never conforms perfectly to your expectations. And, as usual, it raises as many questions as it answers. For me, the overriding question that emerges from these data is whether 2-point scales are problematic with less coherent and skewed domains or whether 2-point scales are excellent indicators that you have a potentially problematic set of items that you are papering over by using a 5-point scale? It may be that the 2-point scale approach is like the canary in the measurement coal mine–it will alert us to problems with our measures that need tending to.
These data also teach the lesson Clark and Watson (1995) provide that validity should be paramount. My sense is that those of us in the psychometric trenches can get rather opinionated about measurement issues (use omega rather than Cronbach’s alpha; use IRT rather than classical test theory, etc.) that translate into nothing of significance when you condition your thinking on validity. Our reality may be that when we ask questions, people are capable of telling us a crude “yeah, that’s like me” or “no, not really like me” and that’s about the best we can do regardless of how fine grained our apparent measurement scales are.
MBD: Here’s a relevant quote from Dan Ozer: “It seems that it is relatively easy to develop a measure of personality of middling quality (Ashton & Goldberg, 1973), and then it is terribly difficult to improve it.” (p. 685).
Thanks MK, MF, and MBD for the nerdfest. As usual, it was fun.
P.S. George Richardson pointed out that we did not compare even numbered response options (e.g., 4-point) vs odd numbered response options (e.g., 5-point) and therefore do not confront the timeless debate of “should I include a middle option.” First, Len Simms’ paper does exactly that–it is a great paper and shows that it makes very little difference. Second, we did a deep dive into that issue for a project funded by the OECD. Like the story above, it made no difference for Big Five reliability or validity if you used 4 or 5 point scales. If you used an IRT model (GGUM) in some cases you got a little more information out of the middle option that was of value (e.g., neuroticism). It never did psychometric damage to have a middle option as many fear. So, you may want to lay to rest the argument that everyone will bunch to the middle when you include a middle option.
There have been a slew of systematic replication efforts and meta-analyses with rather provocative findings of late. The ego depletion saga is one of those stories. It is an important story because it demonstrates the clarity that comes with focusing on effect sizes rather than statistical significance.
I should confess that I’ve always liked the idea of ego depletion and even tried my hand at running a few ego depletion experiments.* And, I study conscientiousness which is pretty much the same thing as self-control—at least as it is assessed using the Tangney et al self-control scale (2004) which was meant, in part, to be an individual difference complement to the ego depletion experimental paradigms.
So, I was more than a disinterested observer as the “effect size drama” surrounding ego depletion played out over the last few years. First, you had the seemingly straightforward meta-analysis by Hagger et al (2010), showing that the average effect size of the sequential task paradigm of ego-depletion studies was a d of .62. Impressively large by most metrics that we use to judge effect sizes. That’s the same as a correlation of .3 according to the magical effect size converters. Despite prior mischaracterizations of correlations of that magnitude being small**, that’s nothing to sneeze at.
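For readers who want to see what the “magical effect size converters” are doing, here is a minimal sketch of the usual d-to-r conversion. It assumes two equal-sized groups, which is the textbook simplification and not necessarily what any particular online converter uses; the function names are mine.

```python
import math


def d_to_r(d: float) -> float:
    """Convert Cohen's d to a correlation, assuming two equal-sized groups."""
    return d / math.sqrt(d ** 2 + 4)


def r_to_d(r: float) -> float:
    """Inverse conversion under the same equal-group assumption."""
    return 2 * r / math.sqrt(1 - r ** 2)


# Effect sizes that come up in the ego depletion story.
for d in (0.62, 0.08, 0.04):
    print(f"d = {d:.2f} -> r = {d_to_r(d):.3f}")
```

Under that assumption a d of .62 comes out just under r = .30, and a d of .08 comes out at about r = .04.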
Quickly on the heels of that meta-analysis were new meta-analyses and re-analyses of the meta-analytic data (e.g., Carter et al, 2015). These new meta-analyses and re-analyses concluded that there wasn’t any “there” there. Right after the Hagger et al paper was published, the quant jocks came up with a slew of new ways of estimating bias in meta-analyses. What happens when you apply these bias estimators to ego depletion data? There seemed to be a lot of bias in the research synthesized in these meta-analyses. So much so that the bias-corrected estimates included a zero effect size as a possibility (Carter et al., 2015). These re-analyses were then re-analyzed because the field of bias correction was moving faster than basic science and these initial corrections were called into question because apparently bias corrections are, well, biased… (Friese et al., 2018).
Not to be undone by an inability to estimate truth from the prior publication record, another, overlapping group of researchers conducted their own registered replication report—the most defensible and unbiased method of estimating an effect size (Hagger et al., 2016). Much to everyone’s surprise, the effect across 23 labs was something close to zero (d = .04). Once again, this effort was criticized for being a non-optimal test of the ego depletion effect (Friese et al., 2018).
To address the prior limitations of all of these incredibly thorough analyses of ego depletion, yet a third team took it upon themselves to run a pre-registered replication project testing two additional approaches to ego depletion using optimal designs (Vohs, Schmeichel & others, 2018). Like a broken record, the estimate across 40 labs resulted in effect size estimates that ranged from 0 (if you assumed zero was the prior) to about a d of .08 if you assumed otherwise***. If you bothered to compile the data across the labs and run a traditional frequentist analysis, this effect size, despite being minuscule, was statistically significant (trumpets sound in the distance).
So, it appears the best estimate of the effect of ego depletion is around a d of .08, if we are being generous.
Eyes wide shut
So, there were a fair number of folks who expressed some curiosity about the meaning of the results. They asked questions on social media, like, “The effect was statistically significant, right? That means there’s evidence for ego depletion.”
Setting aside effect sizes for a moment, there are many reasons to see the data as being consistent with the theory. Many of us were rooting for ego depletion theory. Countless researchers were invested in the idea either directly or indirectly. Many wanted a pillar of their theoretical and empirical foundational knowledge to hold up, even if the aggregate effect was more modest than originally depicted. For those individuals, a statistically significant finding seems like good news, even if it is really cold comfort.
Another reason for the prioritization of significant findings over the magnitude of the effect is, well, ignorance of effect sizes and their meaning. It was not too long ago that we tried in vain to convince colleagues that a Neyman-Pearson system was useful (balance power, alpha, effect size, and N). A number of my esteemed colleagues pushed back on the notion that they should pay heed to effect sizes. They argued that, as experimental theoreticians, their work was, at best, testing directional hypotheses of no practical import. Since effect sizes were for “applied” psychologists (read: lower status), the theoretical experimentalist had no need to sully themselves with the tools of applied researchers. They also argued that their work was “proof of concept” and the designs were not intended to reflect real world settings (see ego depletion) and therefore the effect sizes were uninterpretable. Setting aside the unnerving circularity of this thinking****, what it implies is that many people have not been trained on, or forced to think much about, effect sizes. Yes, they’ve often been forced to report them, but not to really think about them. I’ll go out on a limb and propose that the majority of our peers in the social sciences think about and make inferences based solely on p-values and some implicit attributes of the study design (e.g., experiment vs observational study).
The reality, of course, is that every study of every stripe comes with an effect size, whether or not it is explicitly presented or interpreted. More importantly, a body of research in which the same study or paradigm is systematically investigated, as has been done with ego depletion, provides an excellent estimate of the true effect size for that paradigm. The reality of a true effect size in the range of d = .04 to d = .08 is a harsh reality, but one that brings great clarity.
Eyes wide open
So, let’s make an assumption. The evidence is pretty good that the effect size of sequential ego depletion tasks is, at best, d = .08.
With that assumption, the inevitable conclusion is that the traditional study of ego depletion using experimental approaches is dead in the water.
First, because studying a phenomenon with a true effect size of d = .08 is beyond the resources of almost all labs in psychology. To have 80% power to detect an effect size of d = .08 you would need to run more than 2500 participants through your lab. If you go with the d = .04 estimate, you’d need more than 9000 participants. More poignantly, none of the original studies used to support the existence of ego depletion were designed to detect the true effect size.
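To make the sample size arithmetic concrete, here is a rough sketch of the standard calculation. It assumes a simple two-group between-subjects design, a two-sided alpha of .05, and a normal approximation to the t-test; those are my assumptions, not a description of how the numbers above were produced, and the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate n per group for a two-sided, two-group comparison.

    Normal-approximation formula: n = 2 * ((z_alpha/2 + z_power) / d) ** 2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.62, 0.08, 0.04):
    n = n_per_group(d)
    print(f"d = {d:.2f}: about {n} per group, {2 * n} in total")
```

Under these assumptions, d = .08 needs roughly 2,450 people per group and d = .04 roughly 9,800 per group, so the totals are about double that. Either way you count it, the demands are in the thousands.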
These types of sample size demands violate most of our norms in psychological science. The average sample size in prior experimental ego depletion research appears to be about 50 to 60. With that kind of sample size, you have 6% power to detect the true effect.
What about our new rules of thumb, like do your best to reach an N of 50 per cell, or use 2.5 times the N of the original study, or crank the N up above 500 to test an interaction effect? Power is 8%, 11%, and 25% in each of those situations, respectively. If you ran your studies using these rules of thumb, you would be all thumbs.
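If you want to check power figures like these yourself, a back-of-the-envelope version looks like the sketch below. It assumes a two-group design, a two-sided alpha of .05, and a normal approximation, so it will not reproduce a proper t-test power routine to the decimal. The per-group sample sizes in the loop are illustrative choices of mine; the exact percentage for each rule of thumb depends on how you translate it into participants per cell.

```python
from statistics import NormalDist


def power_two_group(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d.
    ncp = d * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


# Illustrative per-group sample sizes (assumed, not taken from any study).
for n in (25, 30, 50, 125, 250):
    print(f"n per group = {n:>3}: power = {power_two_group(0.08, n):.1%}")
```

With 25 to 30 people per group, which is what a total N of 50 to 60 amounts to in a two-cell design, power to detect d = .08 sits at about 6%, barely above the 5% you would get if the effect were exactly zero.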
But, you say, I can get 2500 participants on mTurk. That’s not a bad option. But, you have to ask yourself: To what end? The import of ego depletion research, and much experimental work like it, is predicated on the notion that the situation is “powerful,” as in, it has a large effect. How important is ego depletion to our understanding of human nature if the effect is minuscule? Before you embark on the mega study of thousands of mTurkers, it might be prudent to answer this question.
But, you say, some have argued that small effects can cumulate and therefore be meaningful if studied with enough fidelity and across time. Great. Now all you need to do is run a massive longitudinal intervention study where you test how the minuscule effect of the manipulation cumulates over time and place. The power issue doesn’t disappear with this potential insight. You still have to deal with the true effect size of the manipulation being a d of .08. So, one option is to use a massive study. Good luck funding that study. The only way you could get the money necessary to conduct it would be to promise doing an fMRI of every participant. Wait. Oh, never mind.
The other option would be to do something radical like create a continuous intervention that builds on itself over time—something currently not part of ego depletion theory or traditional experimental approaches in psychology.
But, you say, there are hundreds of studies that have been published on ego depletion. Exactly. Hundreds of studies have been published that had an average d-value of .62. Hundreds of studies have been published showing effect sizes that cannot, by definition, be true given the true effect size is d = .08. That is the clarity that comes with the use of accurate effect sizes. It is incredibly difficult to get d-values of .62 when the true d is .08. Look at the distribution of d-values around zero with sample sizes of 50. The likelihood of landing a d of .62 or higher is about 3%. This fact invites some uncomfortable questions. How did all of these people find this many large effects? If we assume they found these relatively huge, highly unlikely effects by chance alone, this would mean that there are thousands of studies lying about in file drawers somewhere. Or it means people used other means to dig these effects out of the data….
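Here is one way to sketch that 3% figure. I am assuming “sample sizes of 50” means a total N of 50 split into two groups of 25, and I am using the large-sample normal approximation to the sampling distribution of d; both are assumptions on my part, and the function name is invented for the example.

```python
from statistics import NormalDist


def prob_d_at_least(d_obs: float, d_true: float, n_per_group: int) -> float:
    """Chance of observing d >= d_obs when the true effect is d_true.

    Uses the large-sample standard error of d for two equal groups:
    sqrt(2 / n + d_true ** 2 / (4 * n)).
    """
    se = (2 / n_per_group + d_true ** 2 / (4 * n_per_group)) ** 0.5
    return 1 - NormalDist(mu=d_true, sigma=se).cdf(d_obs)


p = prob_d_at_least(0.62, 0.08, n_per_group=25)
print(f"P(d >= .62 | true d = .08, n = 25 per group) = {p:.3f}")
# If such results arose by chance alone, this is roughly how many studies
# would have to be run for each one that came out that large.
print(f"Studies run per 'hit': about {1 / p:.0f}")
```

Under those assumptions the probability lands just under 3%, which means something like 35 studies run for every one that produces a d that large. Multiply that by hundreds of published studies and you get the thousands sitting in file drawers.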
Setting aside the motivations, strategies, and incentives that would net this many findings that are significantly unlikely to be correct (p < .03), the import of this discrepancy is huge. The fact that hundreds of studies with such unlikely results were published using the standard paradigms should be troubling to the scientific community. It shows that psychologists, as a group using the standard incentive systems and review processes of the day, can produce grossly inflated findings that lend themselves to the appearance of an accumulated body of evidence for an idea when, by definition, it shouldn’t exist. That should be more than troubling. It should be a wakeup call. Our system is more than broken. It is spewing pollution into the scientific environment at an alarming rate.
This is why effect sizes are important. Knowing that the true effect size of sequential ego depletion studies is a d of .08 leads you to conclude that:
1. Most prior research on the sequential task approach to ego depletion is so problematic that it cannot and should not be used to inform future research. Are you interested in those moderators or boundary mechanisms of ego depletion? Great, you are now proposing to see whether your new condition moves a d of .08 to something smaller. Good luck with that.
2. New research on ego depletion is out of reach for most psychological scientists unless they participate in huge multi-lab projects like the Psychological Science Accelerator.
3. Our field is capable of producing huge numbers of published reports in support of an idea that are grossly inaccurate.
4. If someone fails to replicate one of my studies, I can no longer point to dozens, if not hundreds of supporting studies and confidently state that there is a lot of backing for my work.
And don’t take this situation as anything particular to ego depletion. We now have reams of studies that either through registered replication reports or meta-analyses have shown that the original effect sizes are inflated and that the “truer” effect sizes are much smaller. In numerous cases, ranging from GxE studies to ovulatory cycle effects, the meta-analytic estimates, while statistically significant, are conspicuously smaller than most if not all of the original studies were capable of detecting. These updated effect sizes need to be weighed heavily in research going forward.
In closing, let me point out that I say these things with no prejudice against the idea of ego depletion. I still like the idea and still hold out a sliver of hope that the idea may be viable. It is possible that the idea is sound and the way prior research was executed is the problem.
But, extrapolating from the cumulative meta-analytic work and the registered replication projects, I can’t avoid the conclusion that the effect size for the standard sequential paradigms is small. Really, really small. So small that it would be almost impossible to realistically study the idea in almost any traditional lab.
Maybe the fact that these paradigms no longer work will spur some creative individuals on to come up with newer, more viable, and more reliable ways of testing the idea. Until then, the implication of the effect size is clear: Steer clear of the classic experimental approaches to ego depletion. And, if you nonetheless continue to find value in the basic idea, come up with new ways to study it; the old ways are not robust.
Brent W. Roberts
* p < .05: They failed. At the time, I chalked it up to my lack of expertise. And that was before it was popular to argue that people who failed to replicate a study lacked expertise.
** p < .01: See “personality coefficient” Mischel, W. (2013). Personality and assessment. Psychology Press.
*** p < .005: that’s a correlation of .04, but who’s comparing effect sizes??
**** p < .001: “I’m special, so I can ignore effect sizes—look, small effect sizes—I can ignore these because I’m a theoretician. I’m still special”
At the end of my previous blog “Because, change is hard“, I said, and I quote: “So, send me your huddled, tired essays repeating the same messages about improving our approach to science that we’ve been making for years and I’ll post, repost, and blog about them every time.”
Well, someone asked me to repost theirs. So here it is: http://www.nature.com/news/no-researcher-is-too-junior-to-fix-science-1.21928. It is a nice piece by John Tregoning.
Speaking of which, there were two related blogs posted right after the change is hard piece that are both worth reading. The first by Dorothy Bishop is brilliant and counters my pessimism so effectively I’m almost tempted to call her Simine Vazire: http://deevybee.blogspot.co.uk/2017/05/reproducible-practices-are-future-for.html
And if you missed it, James Heathers has a spot-on post about the New Bad People: https://medium.com/@jamesheathers/meet-the-new-bad-people-4922137949a1