Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?

A while back, Michael Kraus (MK), Michael Frank (MF) and I (Brent W. Roberts, or BWR; M. Brent Donnellan–MBD–is on board for this discussion, so we’ll have to keep our Michaels and Brents straight) got into a Twitter-inspired conversation about the niceties of using polytomous rating scales vs yes/no rating scales for items.  You can read that exchange here.

The exchange was loads of fun and edifying for all parties.  An over-simplistic summary would be that, despite passionate statements made by psychometricians, there is no Yes or No answer to the apparent superiority of Likert-type scales for survey items.

We recently were reminded of our prior effort when a similar exchange on Twitter pretty much replicated our earlier conversation–I’m not sure whether it was a conceptual or direct replication….

In part of the exchange, Michael Frank (MF) mentioned that he had tried the 2-point option with items they commonly use and found the scale statistics to be so bad that they gave up on the effort and went back to a 5-point option. To which I replied, pithily, that he was using the Likert scale, and the systematic errors contained therein, to bolster the scale reliability.  Joking aside, it reminded us that we had collected similar data that could be used to add more information to the discussion.

But, before we do the big reveal, let’s see what others think.  We polled the Twitterati about their perspective on the debate and here are the consensus opinions, which correspond nicely to the Michaels’ position:

Most folks thought moving to a 2-point rating scale would decrease reliability.

 

Most folks thought it would not make a difference when examining gender differences on the Big Five, but clearly there was less consensus on this question.

 

And, most folks thought moving to a 2-point rating scale would decrease the validity of the scales.

Before I could dig into my data vault, M. Brent Donnellan (MBD) popped up on the Twitter thread and forwarded amazingly ideal data for putting the scale option question to the test. He’d collected a lot of data varying the number of scale options from 7 points all the way down to 2 points using the BFI-2.  He also asked a few questions that could be used as interesting criterion-related validity tests, including gender, self-esteem, life satisfaction, and age. The sample consisted of folks from a Qualtrics panel with approximately 215 people per group.

So, does moving to a dichotomous rating scale affect internal consistency?

Here are the average internal consistencies (i.e., coefficient alphas) for 2-point (Agree/Disagree), 3-point, 5-point, and 7-point scales: 

 

Just so you know, here are the plots for the same analysis from a forthcoming paper by Len Simms and company (Simms, Zelazny, Williams, & Bernstein, in press):

This one is oriented differently and has more response options, but pretty much tells the same story.  Agreeableness and Openness have the lowest reliabilities when using the 2-point option, but the remaining BFI domain scales are just fine–as in well above recommended thresholds for acceptable internal consistency that are typically found in textbooks.  
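If you want to compute these for yourself, coefficient alpha is easy to get from raw item responses. Here is a minimal sketch in Python; the function name and the toy 0/1 responses are purely illustrative, not MBD’s data:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents x n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Toy example: 5 respondents answering 4 dichotomous (0 = No, 1 = Yes) items
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(responses), 2))
```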

 

What’s going on here?

BWR: Well, agreeableness is one of the most skewed domains–everyone thinks they are nice (News flash: you’re not). It could be that finer-grained response options allow people to respond in less extreme ways. Or, the Likert scales are “fixing” a problematic domain. Openness is classically the most heterogeneous domain and typically does not hold together as well as the other Big Five.  So, once again, the Likert scaling might be putting lipstick on a pig.

MK: Seeing this mostly through the lens of a scale user rather than a scale developer, I would not be worried if my reliability coefficients dipped to .70. When running descriptive stats on my data I wouldn’t even give that scale a second thought.  

Also I think we can refer to BWR as “Angry Brent” from this point forward?

BWR: I prefer mildly exasperated Brent (MEB).  And what are we to do with the Mikes? Refer to one of you as “Nice Mike” and the other as “Nicer Mike”?  Which one of you is nicer? It’s hard to tell from my angry vantage point.

MBD: I agree with BWR. I also think the alphas reported with 2-point options are still more or less acceptable for research purposes. The often-cited rules of thumb about alpha get close to urban legends (Lance, Butts, & Michels, 2006). Clark and Watson (1995) have a nice line in a paper (or at least I remember it fondly) about how the goal of scale construction is to maximize validity, not internal consistency. I also suspect that fewer scale points might prove useful when conducting research with non-college student samples (e.g., younger, less educated). And I like the simplicity of the 2-PL IRT model, so the 2-point options hold some appeal. (The ideal point folks can spare me the hate mail.) This might be controversial, but I think it would be better (although probably not dramatically so) to use fewer response options and use the saved survey space/ink to increase the number of items, even by just a few. Content validity will increase, and the alpha coefficient will increase assuming that the additional items don’t reduce the average inter-item correlation.
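That last point is easy to make concrete with the standardized alpha formula, alpha = k * r-bar / (1 + (k - 1) * r-bar), where k is the number of items and r-bar is the average inter-item correlation. A quick sketch (the correlation of .25 is illustrative, not an estimate from the BFI-2 data):

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Holding the average inter-item correlation at .25, a few extra items
# buy a noticeable bump in alpha.
for k in (8, 10, 12):
    print(k, round(standardized_alpha(k, 0.25), 2))   # 0.73, 0.77, 0.80
```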

BWR: BTW, we have indirect evidence for this thought–we ran an online experiment where people were randomly assigned to rate items using either a 2-point or a 5-point scale.  We lost about 300 people (out of 5000) in the 5-point condition due to people quitting before the end of the survey–they got tuckered out sooner when forced to think a bit more about the ratings.

MF: Since MK hasn’t chosen “nice Mike,” I’ll claim that label. I also agree that BWR lays out some good options for why the Likerts are performing somewhat better. But I think we might be able to narrow things down more. In the initial post, I cited the conventional cognitive-psych wisdom that more options = more information. But the actual information gain depends on the way the options interact with the particular distribution of responses in the population. In IRT terms, harder questions are more informative if everyone in your sample has high ability, but that’s not true if ability varies more. I think the same thing is going on here for these scales – when attitudes vary less, the Likerts perform better (are more reliable, because they yield more information).

In the dataset above, I think that Agreeableness is likely to have responses bunched up at the top of the scale. Moving to the two-point scale then loses a bunch of information because everyone is choosing the same response. This is the same as putting a bunch of questions that are too easy on your test.

I went back and looked at the dataset that I was tweeting about, and found that exactly the same thing was happening. Our questions were about parenting attitudes, and they are all basically “gimmes” – everyone agrees with nearly all of them. (E.g., “It’s important for parents to provide a safe and loving environment for their child.”) The question is how they weight these. Our 7-point scale version pulls out some useful signal from these weightings (preprint here, whole-scale alpha was .90, subscales in the low .8s). But when we moved to a two-point scale, reliability plummeted to .20! The problem was that literally everyone agreed with everything.

I think our case is a very extreme example of a general pattern: when attitudes are highly variable in a population, a 2-point scale is fine. When they are very homogeneous, you need more scale points.
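The IRT framing can be illustrated with the 2PL item information function, I(theta) = a^2 * P(theta) * (1 - P(theta)): a dichotomous item is most informative for people whose trait level sits near the item’s location b and tells you little about people far from it. The parameters below are made up; b = -2 is meant to stand in for an agreeableness-style item that nearly everyone endorses:

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1 / (1 + np.exp(-a * (theta - b)))   # probability of endorsing the item
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)
# An item nearly everyone endorses (b = -2) is informative only at the low end...
print(np.round(item_information_2pl(theta, a=1.5, b=-2.0), 2))
# ...whereas an item located mid-trait (b = 0) spreads its information more usefully.
print(np.round(item_information_2pl(theta, a=1.5, b=0.0), 2))
```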

 

What about validity?

Our first validity test is convergent validity–how well does the BFI-2 correlate with the Mini-IPIP set of B5 scales?

BWR: From my vantage point we once again see the conspicuous nature of agreeableness.  Something about this domain does not work as well with the dichotomous rating. On the other hand, the remaining domains look like there is little or no issue with moving from a 7-point to a 2-point scale.

MK: If all of you were speculating about why agreeableness doesn’t work as a two-point scale, I’d be interested in your thoughts. What dimensions of a scale might lead to this kind of reduced convergent validity? I can see how people would be unwilling to answer FALSE to statements like “I see myself as caring, compassionate” because, wow, harsh. Another domain might be social dominance orientation: most people have largely egalitarian views about themselves (possibly willful ignorance), so saying TRUE to something like “some groups of people are inherently inferior to other groups” might be a big ask for the normal range of respondents.

BWR: I would assume that in highly evaluative domains you might run into distributional troubles with dichotomously rated items. With really skewed distributions you would get attenuated correlations among the items and lower reliability. On the other hand, you really want to know who those people are who say “no” to “I’m kind”.

MBD: I agree with BWR’s opening points. When I first read your original blog post, I was skeptical.  But then I dug around and found a recent MMPI paper (Finn, Ben-Porath, & Tellegen, 2015) that was consistent with BWR’s points. I was more convinced, but I still like seeing things for myself. Thus, I conducted a subject pool study when I was at TAMU and pre-registered my predictions. Sure enough, the convergent validity coefficients were not dramatically better for a 5-point response option versus T/F for the BFI-2 items. I then collected additional data to push on that idea, but this is a consistent pattern I have seen with the BFI-2 – more response options aren’t dramatically better. I have no clue if this extends beyond the MMPI/BFI/BFI-2 items or not. But my money is on these patterns generalizing.

As for Agreeableness, there is an interesting pattern that supports the idea that the items get more difficult to endorse/reject (depending on their polarity) when you constrain the response options to 2. If we convert all of the observed scores to Percentage of Maximum Possible scores (see Cohen, Cohen, Aiken, & West, 1999), we can loosely compare across the formats. The average score for A in the 2-point version was 82.78 (SD = 17.40) and it drops to 70.86 (SD = 14.26) in the 7-point condition. So this might be a case where giving more response options allows people to admit to less desirable characteristics (the results for the other composites were less dramatic). So, I think MK has a good point above that might qualify some of my enthusiasm for the 2-pt format for some kinds of content.
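For anyone unfamiliar with POMP scoring (Cohen et al., 1999), the conversion is just a linear rescaling of each score onto a 0-100 metric. A minimal sketch, with made-up mean item scores chosen only to land near the values above:

```python
def pomp(score, scale_min, scale_max):
    """Percentage of Maximum Possible: rescale a score onto a 0-100 metric."""
    return 100 * (score - scale_min) / (scale_max - scale_min)

# A mean item score of 1.83 on a 1-2 (True/False) format and 5.25 on a 1-7 format
# end up on the same 0-100 metric and can be loosely compared.
print(round(pomp(1.83, 1, 2), 1))   # ~83
print(round(pomp(5.25, 1, 7), 1))   # ~70.8
```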

MF: OK, so this discussion above totally lines up with my theory that agreeableness is less variable, especially the idea that range on some of these variables might be restricted due to social desirability. MBD, BWR, is it generally true that agreeableness has low variance? (A histogram of responses for each variable in the 7-point case would be useful to see this by eye.)

More generally, just to restate the theory: 2-point is good when there is a lot of variance in the population. But when variance is compressed – whether due to social desirability or true homogeneity – more scale points are increasingly important.

BWR: I don’t see any evidence for variance issues, but I am aware of people reporting skewness problems with agreeableness.  Most of us believe we are nice. But, there are a few folks who are more than willing to admit to being not nice–thus, variances look good, but skewness may be the real culprit.

 

How about gender differences?

 

 

BWR: I see one thing in this table: sampling error.  There is no rhyme or reason to the way these numbers bounce around, to my read, but I’m willing to be convinced.

MBD: I should give credit to Les Morey (creator of the PAI) for suggesting this exploratory question. I am still puzzled why the effect sizes bounce around (and have seen this in another dataset). I think a deeper dive testing invariance would prove interesting. But who has the time?

At the very least, there does not seem to be a simple story here.  And it shows that we need a bigger N to get those CIs narrower. The size of those intervals makes me kind of ill.

MF: I love that you guys are upset about CIs this wide. Have you ever read an experimental developmental psychology study? On another note, I do think it’s interesting that you’re seeing overall larger effects for the larger numbers of scale points. If you look at the mean effect, it’s .20 for the 7-pt, .10 for the 2-pt, .15 for the 3-pt, and .20 for the 5-pt. So sure, lots of sampling error, but still some kind of consistency…

MK: Despite all the bouncing around, there doesn’t seem to be anything unusual about the two-option scale confidence intervals.

And now the validity coefficients for self-esteem (I took the liberty of reversing the Neuroticism scores into Emotional Stability scores so everything was positive).

BWR: On this one the True-False scales actually do better than the Likert scales in some cases.  No strong message here.

MK: This is shocking to me! Wow! One question though — could the two-point scale items just be reflecting this overall positivity bias and not the underlying trait construct? That is, if the two-point scales were just measures of self-esteem, would this look just like it does here? I guess I’m hoping for some discriminant validity… or maybe I’d just like to see how intercorrelated the true-false version is across the five factors and compare that correlation to the longer Likerts.

BWR: Excellent point MK. To address the overall positivity bias inherent in a bunch of evaluative scales, we correlated the different B5 scales with age down below. Check it out.

MK: That is so… nice of you! Thanks!

BWR: I wish you would stop being so nice.

MF: I agree that it’s a bit surprising to me that we see the flip, but, going with my theory above, I predict that extraversion is the scale with the most variance in the larger Likert ratings. That’s why the 2-pt is performing so well – people really do vary in this characteristic dramatically AND there’s less social desirability coming out in the ratings, so the 2-point is actually useful.

 

And finally, the coefficients for life satisfaction:

MK: I’m a believer now, thanks Brent and Angry Brent!

MBD: Wait, which Brent is Angry! 😉

MF: Ok, so if I squint I can still say some stuff about variance etc. But overall it is true that the validity for the 2-point scale is surprisingly reasonable, especially for these lower-correlation measures. In particular, maybe the only things that really matter for life-satisfaction correlations are the big differences; so you accentuate these characteristics in the 2-pt and get rid of minor variance due to other sources.

 

How about age?

As was noted above, self-esteem and life satisfaction are rather evaluative, as are the Big Five, and that might create too much convergent validity and not enough discriminant validity. What about a non-evaluative outcome like age?  Each of the samples was on average in their 50s, with age ranges from young adulthood through old age. So, while the sample sizes were a little small for stable estimates (we like 250 minimum), age is not a bad outcome to correlate with because it is clearly not biased by social desirability.  Unless, of course, we lie systematically about our age….

If you are keen on interpreting these coefficients, the confidence intervals for samples of this size are about + or – .13. Happy inferencing.
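If you want to check the plus-or-minus .13 figure, here is a minimal sketch of a Fisher-z confidence interval for a correlation with roughly 215 people per condition (the r of .20 is just an example value):

```python
import numpy as np

def r_confidence_interval(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via the Fisher z transformation."""
    z = np.arctanh(r)                  # Fisher z of the observed correlation
    se = 1 / np.sqrt(n - 3)            # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)    # back-transform to the r metric

print(np.round(r_confidence_interval(0.20, 215), 2))   # about (.07, .33), i.e., roughly +/- .13
```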

BWR: I find these results really interesting. Despite the apparent issues with the true-false version of agreeableness, it has the largest correlation with age–higher, in fact, than most prior reports, which admittedly are based on 5-point rating scale measures of the Big Five. I’m tempted to interpret the 3-point scales as problematic, but I’m going to go with sampling error again. It was probably just a funky sample.

MK: OK then. I agree, I think the 3-point option is behaving the strangest for agreeableness.

MBD: I have a second replication sample where I used 2-, 3-, 4-, 5-, 6-, and 7-point response formats.  The cell sizes are a bit smaller, but I will look at those correlations in that one as well.

 

General Thoughts?

MBD: This was super fun, and I appreciate that you three let me join the discussion. I admit that when I originally read the first exchange, I thought something was off about BWR’s thinking [BWR–you are not alone in that thought]. I was in a state of cognitive dissonance, as it went against a “5 to 7 scale points are better than alternatives” heuristic. Reading the MMPI paper was the next step toward disabusing myself of my bias. Now, after collecting these data, hearing a talk by Len Simms about his paper, and so forth, I am not as opposed to using fewer scale points as I was in the past. This is especially true if it allows one to collect additional items. That said, I think more work on content by scale point interactions is needed for the reasons brought up in this post. However, I am a lot more positive about 2-point scales than I was in the past.  Thanks!

MF: Agreed – this was an impressive demonstration of Angry Brent’s ideas. Even though the 7-pt sometimes still performs better, overall the lack of problems with the 2-pt is really food for thought. Even I have to admit that sometimes the 2-pt can be simpler and easier. On the other hand, I will still point to our parenting questionnaire – which is much more tentative and early stage in terms of the constructs it measures than the B5! In that case, using a 2-pt scale essentially destroyed the instrument because there was so much consensus (or social desirability)! So while I agree with the theoretical point from the previous post – consider 2-pt scales! – I also want to sound a cautious note here because not every domain is as well understood.

MK: I agree with the caution that MF alludes to, but wow, the 2-point scale performed far better than I anticipated. Thanks for doing this all!

BWR: I love data. It never conforms perfectly to your expectations. And, as usual, it raises as many questions as it answers. For me, the overriding question that emerges from these data is whether 2-point scales are problematic with less coherent and skewed domains, or whether 2-point scales are excellent indicators that you have a potentially problematic set of items that you are papering over by using a 5-point scale.  It may be that the 2-point scale approach is like the canary in the measurement coal mine–it will alert us to problems with our measures that need tending to.

These data also teach the lesson Clark and Watson (1995) provide: validity should be paramount. My sense is that those of us in the psychometric trenches can get rather opinionated about measurement issues (use omega rather than Cronbach’s alpha; use IRT rather than classical test theory, etc.) that translate into nothing of significance when you condition your thinking on validity.  Our reality may be that when we ask questions, people are capable of telling us a crude “yeah, that’s like me” or “no, not really like me,” and that’s about the best we can do regardless of how fine-grained our apparent measurement scales are.

MBD: Here’s a relevant quote from Dan Ozer: “It seems that it is relatively easy to develop a measure of personality of middling quality (Ashton & Goldberg, 1973), and then it is terribly difficult to improve it.” (p. 685).

Thanks MK, MF, and MBD for the nerdfest.  As usual, it was fun.

 

References

 

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309.

Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The problem of units and the circumstance for POMP. Multivariate Behavioral Research, 34(3), 315-346.

Finn, J. A., Ben-Porath, Y. S., & Tellegen, A. (2015). Dichotomous versus polytomous response options in psychopathology assessment: Method or meaningful variance? Psychological Assessment, 27(1), 184.

Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9(2), 202-220.

Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (in press). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment.

P.S. George Richardson pointed out that we did not compare even-numbered response options (e.g., 4-point) vs odd-numbered response options (e.g., 5-point) and therefore did not confront the timeless debate of “should I include a middle option?”  First, Len Simms’ paper does exactly that–it is a great paper and shows that it makes very little difference. Second, we did a deep dive into that issue for a project funded by the OECD. Like the story above, it made no difference for Big Five reliability or validity whether you used 4- or 5-point scales. If you used an IRT model (the GGUM), in some cases you got a little more information out of the middle option that was of value (e.g., for neuroticism). It never did psychometric damage to have a middle option, as many fear. So, you may want to lay to rest the argument that everyone will bunch to the middle when you include a middle option.


Eyes wide shut or eyes wide open?

There has been a slew of systematic replication efforts and meta-analyses with rather provocative findings of late. The ego depletion saga is one of those stories. It is an important story because it demonstrates the clarity that comes with focusing on effect sizes rather than statistical significance.

I should confess that I’ve always liked the idea of ego depletion and even tried my hand at running a few ego depletion experiments.* And, I study conscientiousness, which is pretty much the same thing as self-control—at least as self-control is assessed using the Tangney et al. (2004) self-control scale, which was meant, in part, to be an individual difference complement to the ego depletion experimental paradigms.

So, I was more than a disinterested observer as the “effect size drama” surrounding ego depletion played out over the last few years. First, you had the seemingly straightforward meta-analysis by Hagger et al. (2010), showing that the average effect size of the sequential task paradigm of ego-depletion studies was a d of .62. Impressively large by most metrics that we use to judge effect sizes. That’s the same as a correlation of .3 according to the magical effect size converters. Despite prior mischaracterizations of correlations of that magnitude as small**, that’s nothing to sneeze at.
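The “magical converter” here is just the standard formula r = d / sqrt(d^2 + 4), which assumes equal group sizes; a quick check:

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a point-biserial r, assuming equal group sizes."""
    return d / math.sqrt(d**2 + 4)

print(round(d_to_r(0.62), 2))   # ~0.30, the correlation of .3 quoted above
print(round(d_to_r(0.08), 2))   # ~0.04, relevant to the effect sizes discussed below
```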

Quickly on the heels of that meta-analysis came new meta-analyses and re-analyses of the meta-analytic data (e.g., Carter et al., 2015). These new meta-analyses and re-analyses concluded that there wasn’t any “there” there. Right after the Hagger et al. paper was published, the quant jocks came up with a slew of new ways of estimating bias in meta-analyses. What happens when you apply these bias estimators to ego depletion data? There seemed to be a lot of bias in the research synthesized in these meta-analyses. So much so that the bias-corrected estimates included a zero effect size as a possibility (Carter et al., 2015). These re-analyses were then re-analyzed, because the field of bias correction was moving faster than basic science and the initial corrections were called into question because, apparently, bias corrections are, well, biased… (Friese et al., 2018).

Undeterred by the inability to estimate truth from the prior publication record, another, overlapping group of researchers conducted their own registered replication report—the most defensible and unbiased method of estimating an effect size (Hagger et al., 2016). Much to everyone’s surprise, the effect across 23 labs was something close to zero (d = .04). Once again, this effort was criticized for being a non-optimal test of the ego depletion effect (Friese et al., 2018).

To address the prior limitations of all of these incredibly thorough analyses of ego depletion, yet a third team took it upon themselves to run a pre-registered replication project testing two additional approaches to ego depletion using optimal designs (Vohs, Schmeichel, & others, 2018). Like a broken record, the estimates across 40 labs ranged from 0 (if you assumed zero was the prior) to about a d of .08 if you assumed otherwise***. If you bothered to compile the data across the labs and run a traditional frequentist analysis, this effect size, despite being minuscule, was statistically significant (trumpets sound in the distance).

So, it appears the best estimate of the effect of ego depletion is around a d of .08, if we are being generous.

Eyes wide shut

So, there were a fair number of folks who expressed some curiosity about the meaning of the results. They asked questions on social media, like, “The effect was statistically significant, right? That means there’s evidence for ego depletion.”

Setting aside effect sizes for a moment, there are many reasons to see the data as being consistent with the theory. Many of us were rooting for ego depletion theory. Countless researchers were invested in the idea either directly or indirectly. Many wanted a pillar of their theoretical and empirical foundational knowledge to hold up, even if the aggregate effect was more modest than originally depicted. For those individuals, a statistically significant finding seems like good news, even if it is really cold comfort.

Another reason for the prioritization of significant findings over the magnitude of the effect is, well, ignorance of effect sizes and their meaning. It was not too long ago that we tried in vain to convince colleagues that a Neyman-Pearson system was useful (balance power, alpha, effect size, and N). A number of my esteemed colleagues pushed back on the notion that they should pay heed to effect sizes. They argued that, as experimental theoreticians, their work was, at best, testing directional hypotheses of no practical import. Since effect sizes were for “applied” psychologists (read: lower status), the theoretical experimentalists had no need to sully themselves with the tools of applied researchers. They also argued that their work was “proof of concept” and the designs were not intended to reflect real world settings (see ego depletion), and therefore the effect sizes were uninterpretable. Setting aside the unnerving circularity of this thinking****, what it implies is that many people have not been trained on, or forced to think much about, effect sizes. Yes, they’ve often been forced to report them, but not to really think about them. I’ll go out on a limb and propose that the majority of our peers in the social sciences think about their results, and make inferences from them, based solely on p-values and some implicit attributes of the study design (e.g., experiment vs observational study).

The reality, of course, is that every study of every stripe comes with an effect size, whether or not it is explicitly presented or interpreted. More importantly, a body of research in which the same study or paradigm is systematically investigated, as has been done with ego depletion, provides an excellent estimate of the true effect size for that paradigm. The reality of a true effect size in the range of d = .04 to d = .08 is a harsh reality, but one that brings great clarity.

Eyes wide open

So, let’s make an assumption. The evidence is pretty good that the effect size of sequential ego depletion tasks is, at best, d = .08.

With that assumption, the inevitable conclusion is that the traditional study of ego depletion using experimental approaches is dead in the water.

Why?

First, because studying a phenomenon with a true effect size of d = .08 is beyond the resources of almost all labs in psychology. To have 80% power to detect an effect size of d = .08 you would need to run more than 2500 participants through your lab. If you go with the d = .04 estimate, you’d need more than 9000 participants. More poignantly, none of the original studies used to support the existence of ego depletion were designed to detect the true effect size.

These types of sample size demands violate most of our norms in psychological science. The average sample size in prior experimental ego depletion research appears to be about 50 to 60. With that kind of sample size, you have 6% power to detect the true effect.
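If you want to reproduce these kinds of numbers yourself, here is a rough sketch using statsmodels, assuming a two-group between-subjects design and a two-sided alpha of .05. The exact figures shift with different design assumptions, but they land in the same territory as the ones quoted above:

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# Participants needed PER GROUP for 80% power in a two-sample comparison
print(round(power_calc.solve_power(effect_size=0.08, power=0.80, alpha=0.05)))  # ~2450 per group
print(round(power_calc.solve_power(effect_size=0.04, power=0.80, alpha=0.05)))  # ~9800 per group

# Power of a typical earlier ego depletion study (roughly 56 participants total,
# i.e., about 28 per group) to detect a true d of .08
print(round(power_calc.power(effect_size=0.08, nobs1=28, alpha=0.05), 2))       # ~0.06
```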

What about our new rules of thumb, like do your best to reach an N of 50 per cell, or use 2.5 times the N of the original study, or crank the N up above 500 to test an interaction effect? Power is 8%, 11%, and 25% in each of those situations, respectively. If you ran your studies using these rules of thumb, you would be all thumbs.

But, you say, I can get 2500 participants on MTurk. That’s not a bad option. But, you have to ask yourself: To what end? The import of ego depletion research, and much experimental work like it, is predicated on the notion that the situation is “powerful,” as in, it has a large effect. How important is ego depletion to our understanding of human nature if the effect is minuscule? Before you embark on the mega study of thousands of MTurkers, it might be prudent to answer this question.

But, you say, some have argued that small effects can cumulate and therefore be meaningful if studied with enough fidelity and across time. Great. Now all you need to do is run a massive longitudinal intervention study where you test how the minuscule effect of the manipulation cumulates over time and place. The power issue doesn’t disappear with this potential insight. You still have to deal with the true effect size of the manipulation being a d of .08. So, one option is to use a massive study. Good luck funding that study. The only way you could get the money necessary to conduct it would be to promise to do an fMRI of every participant. Wait. Oh, never mind.

The other option would be to do something radical like create a continuous intervention that builds on itself over time—something currently not part of ego depletion theory or traditional experimental approaches in psychology.

But, you say, there are hundreds of studies that have been published on ego depletion. Exactly. Hundreds of studies have been published with an average d-value of .62. Hundreds of studies have been published showing effect sizes that cannot, by definition, be true given that the true effect size is d = .08. That is the clarity that comes with the use of accurate effect sizes. It is incredibly difficult to get d-values of .62 when the true d is .08. Look at the sampling distribution of d-values around a true effect of .08 with sample sizes of 50. The likelihood of landing a d of .62 or higher is about 3%. This fact invites some uncomfortable questions. How did all of these people find this many large effects? If we assume they found these relatively huge, highly unlikely effects by chance alone, this would mean that there are thousands of studies lying about in file drawers somewhere. Or it means people used other means to dig these effects out of the data….
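A quick simulation backs up that 3% figure, assuming a true d of .08 and two groups of 25 (a sketch under our own assumptions about the design, not a reanalysis of any particular study):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_per_group, true_d = 100_000, 25, 0.08

# Simulate many two-group studies and compute Cohen's d for each one
control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
treated = rng.normal(true_d, 1.0, size=(n_sims, n_per_group))
pooled_sd = np.sqrt((control.var(axis=1, ddof=1) + treated.var(axis=1, ddof=1)) / 2)
d_hat = (treated.mean(axis=1) - control.mean(axis=1)) / pooled_sd

print((d_hat >= 0.62).mean())   # around .03: only a few percent of studies this size land a d of .62 or more
```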

Setting aside the motivations, strategies, and incentives that would net this many findings that are significantly unlikely to be correct (p < .03), the import of this discrepancy is huge. The fact that hundreds of studies with such unlikely results were published using the standard paradigms should be troubling to the scientific community. It shows that psychologists, as a group using the standard incentive systems and review processes of the day, can produce grossly inflated findings that lend themselves to the appearance of an accumulated body of evidence for an idea when, by definition, it shouldn’t exist. That should be more than troubling. It should be a wakeup call. Our system is more than broken. It is spewing pollution into the scientific environment at an alarming rate.

This is why effect sizes are important. Knowing that the true effect size of sequential ego depletion studies is a d of .08 leads you to conclude that:

1. Most prior research on the sequential task approach to ego depletion is so problematic that it cannot and should not be used to inform future research. Are you interested in those moderators or boundary mechanisms of ego depletion? Great, you are now proposing to see whether your new condition moves a d of .08 to something smaller. Good luck with that.

2. New research on ego depletion is out of reach for most psychological scientists unless they participate in huge multi-lab projects like the Psychological Science Accelerator.

3. Our field is capable of producing huge numbers of published reports in support of an idea that are grossly inaccurate.

4. If someone fails to replicate one of my studies, I can no longer point to dozens, if not hundreds of supporting studies and confidently state that there is a lot of backing for my work.

5. As has been noted by others, meta-analysis is fucked.

And don’t take this situation as anything particular to ego depletion. We now have reams of studies that either through registered replication reports or meta-analyses have shown that the original effect sizes are inflated and that the “truer” effect sizes are much smaller. In numerous cases, ranging from GxE studies to ovulatory cycle effects, the meta-analytic estimates, while statistically significant, are conspicuously smaller than most if not all of the original studies were capable of detecting. These updated effect sizes need to be weighed heavily in research going forward.

In closing, let me point out that I say these things with no prejudice against the idea of ego depletion. I still like the idea and still hold out a sliver of hope that the idea may be viable. It is possible that the idea is sound and the way prior research was executed is the problem.

But, extrapolating from the cumulative meta-analytic work and the registered replication projects, I can’t avoid the conclusion that the effect size for the standard sequential paradigms is small. Really, really small. So small that it would be almost impossible to realistically study the idea in almost any traditional lab.

Maybe the fact that these paradigms no longer work will spur some creative individuals on to come up with newer, more viable, and more reliable ways of testing the idea. Until then, the implication of the effect size is clear: Steer clear of the classic experimental approaches to ego depletion. And, if you nonetheless continue to find value in the basic idea, come up with new ways to study it; the old ways are not robust.

Brent W. Roberts

 

* p < .05: They failed.  At the time, I chalked it up to my lack of expertise.  And that was before it was popular to argue that people who failed to replicate a study lacked expertise.

** p < .01: See “personality coefficient” Mischel, W. (2013). Personality and assessment. Psychology Press.

*** p < .005: that’s a correlation of .04, but who’s comparing effect sizes??

**** p < .001: “I’m special, so I can ignore effect sizes—look, small effect sizes—I can ignore these because I’m a theoretician. I’m still special”

 


Making good on a promise

At the end of my previous blog “Because, change is hard“, I said, and I quote: “So, send me your huddled, tired essays repeating the same messages about improving our approach to science that we’ve been making for years and I’ll post, repost, and blog about them every time.”

Well, someone asked me to repost theirs.  So here it is: http://www.nature.com/news/no-researcher-is-too-junior-to-fix-science-1.21928.  It is a nice piece by John Tregoning.

Speaking of which, there were two related blogs posted right after the change is hard piece that are both worth reading.  The first by Dorothy Bishop is brilliant and counters my pessimism so effectively I’m almost tempted to call her Simine Vazire: http://deevybee.blogspot.co.uk/2017/05/reproducible-practices-are-future-for.html

And if you missed it, James Heathers has a spot-on post about the New Bad People: https://medium.com/@jamesheathers/meet-the-new-bad-people-4922137949a1

 


Because, change is hard

I reposted a quote on Twitter this morning from a paper entitled “The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research.” The quote, which is worth repeating, was “reliable conclusions on replicability…of a finding can only be drawn using cumulative evidence from multiple independent studies.”

An esteemed colleague (Daniël Lakens @lakens) responded “I just reviewed this paper for PeerJ. I didn’t think it was publishable. Lacks structure, nothing new.”

Setting aside the typical bromide that I mostly curate information on Twitter so that I can file and read things later, the last clause, “nothing new,” struck a nerve. It reminded me of some unappealing conclusions that I’ve arrived at about the reproducibility movement, conclusions that point to a different takeaway—that it is very, very important that we post and repost papers like this if we hope to move psychological science towards a more robust future.

From my current vantage, producing new and innovative insights about reproducibility is not the point. There has been almost nothing new in the entire reproducibility discussion. And, that is okay. I mean, the methodologists (whether terroristic or not) have been telling us for decades that our typical approach to evaluating our research findings is problematic. Almost all of our blogs or papers have simply reiterated what those methodologists told us decades ago. Most of the papers and activities emerging from the reproducibility movement are not coming up with “novel, innovative” techniques for doing good science. Doing good science necessitates no novelty. It does not take deep thought or creativity to pre-register a study, do a power analysis, or replicate your research.

What is different this time is that we have more people’s attention than the earlier discussions. That means, we have a chance to make things better instead of letting psychology fester in a morass of ambiguous findings meant more for personal gain than for discovering and confirming facts about human nature.

The point is that we need to create an environment in which doing science well—producing cumulative evidence from multiple independent studies—is the norm. To make this the norm, we need to convince a critical mass of psychological scientists to change their behavior (I wonder what branch of psychology specializes in that?). We know from our initial efforts that many of our colleagues want nothing to do with this effort (the skeptics). And, these skeptical colleagues count in their ranks a disproportionate number of well-established, high status researchers who have lopsided sway in the ongoing reproducibility discussion. We also know that another critical mass is trying to avoid the issue, but seem to be grudgingly okay with taking small steps like increasing their N or capitulating to new journal requirements (the agnostics). I would even guess that the majority of psychological scientists remain blithely unaware of the machinations of scientists concerned with reproducibility (the naïve) and think that it is only an issue for subgroups like social psychology (which we all know is not true). We know that many young people are entirely sympathetic to the effort to reform methods in psychological science (the sympathizers). But, these early career researchers face withering winds of contempt from their advisors or senior colleagues and problematic incentives for success that dictate they continue to pursue poorly designed research (e.g., the prototypical underpowered series of conceptual replication studies, in which one roots around for p < .05 interaction effects).

So why post papers that reiterate these points? Even if those papers are derivative or maybe not as scintillating as we would like? Why write blogs that repeat what others have said for decades before?

Because, change is hard.

We are not going to change the minds of the skeptics. They are lost to us. That so many of our most highly esteemed colleagues are in this group simply makes things more challenging. The agnostics are like political independents. Their position can be changed, but it takes a lot of lobbying and they often have to be motivated through self-interest. I’ve seen an amazingly small number of agnostics come around after six years of blog posts, papers, presentations, and conversations. These folks come around one talk, one blog, or one paper at a time. And really, it takes multiple messages to get them to change. The naïve are not paying attention, so we need to repeat the same message over and over and over again in hopes that they might actually read the latest reiteration of Jacob Cohen. The early career researchers often see clearly what is going on but then must somehow negotiate the landmines that the skeptics and the reproducibility methodologists throw in their way. In this context, re-messaging, re-posting, and re-iterating serve to create the perception that doing things well is supported by a critical mass of colleagues.

Here’s my working hypothesis. In the absence of wholesale changes to incentive structures (grants, tenure, publication requirements at journals), one of the few ways we will succeed in making it the norm to “produce cumulative evidence from multiple independent studies” is by repeating the reproducibility message. Loudly. By repeating these messages we can drown out the skeptics, move a few agnostics, enlighten the naïve, and create an environment in which it is safe for early career researchers to do the right thing. Then, in a generation or two, psychological science might actually produce useful, cumulative knowledge.

So, send me your huddled, tired essays repeating the same messages about improving our approach to science that we’ve been making for years and I’ll post, repost, and blog about them every time.

Brent W. Roberts


A Most Courageous Act

The most courageous act a modern academic can make is to say they were wrong.  After all, we deal in ideas, not things.  When we say we were wrong, we are saying our ideas, our products so to speak, were faulty.  It is a supremely unsettling thing to do.

Of course, in the Platonic ideal, and in reality, being a scientist necessitates being wrong a lot. Unfortunately, our incentive system militates against being honest about our work. Thus, countless researchers choose not to admit or even acknowledge the possibility that they might have been mistaken.

In a bracingly honest post in response to a blog by Uli Schimmack, the Nobel Prize-winning psychologist Daniel Kahneman has done the unthinkable.  He has admitted that he was mistaken.  Here’s a quote:

“I knew, of course, that the results of priming studies were based on small samples, that the effect sizes were perhaps implausibly large, and that no single study was conclusive on its own. What impressed me was the unanimity and coherence of the results reported by many laboratories. I concluded that priming effects are easy for skilled experimenters to induce, and that they are robust. However, I now understand that my reasoning was flawed and that I should have known better. Unanimity of underpowered studies provides compelling evidence for the existence of a severe file-drawer problem (and/or p-hacking). The argument is inescapable: Studies that are underpowered for the detection of plausible effects must occasionally return non-significant results even when the research hypothesis is true – the absence of these results is evidence that something is amiss in the published record. Furthermore, the existence of a substantial file-drawer effect undermines the two main tools that psychologists use to accumulate evidence for a broad hypotheses: meta-analysis and conceptual replication. Clearly, the experimental evidence for the ideas I presented in that chapter was significantly weaker than I believed when I wrote it. This was simply an error: I knew all I needed to know to moderate my enthusiasm for the surprising and elegant findings that I cited, but I did not think it through. When questions were later raised about the robustness of priming results I hoped that the authors of this research would rally to bolster their case by stronger evidence, but this did not happen.”

My respect and gratitude for this statement by Professor Kahneman knows no bounds.

Brent W. Roberts


A Commitment to Better Research Practices (BRPs) in Psychological Science

Scientific research is an attempt to identify a working truth about the world that is as independent of ideology as possible.  As we appear to be entering a time of heightened skepticism about the value of scientific information, we feel it is important to emphasize and foster research practices that enhance the integrity of scientific data and thus scientific information. We have therefore created a list of better research practices that we believe, if followed, would enhance the reproducibility and reliability of psychological science. The proposed methodological practices are applicable for exploratory or confirmatory research, and for observational or experimental methods.

  1. If testing a specific hypothesis, pre-register your research[1], so others can know that the forthcoming tests are informative. Report the planned analyses as confirmatory, and report any other analyses or any deviations from the planned analyses as exploratory.
  2. If conducting exploratory research, present it as exploratory. Then, document the research by posting materials, such as measures, procedures, and analytical code so future researchers can benefit from them. Also, make research expectations and plans in advance of analyses—little, if any, research is truly exploratory. State the goals and parameters of your study as clearly as possible before beginning data analysis.
  3. Consider data sharing options prior to data collection (e.g., complete a data management plan; include necessary language in the consent form), and make data and associated meta-data needed to reproduce results available to others, preferably in a trusted and stable repository. Note that this does not imply full public disclosure of all data. If there are reasons why data can’t be made available (e.g., containing clinically sensitive information), clarify that up-front and delineate the path available for others to acquire your data in order to reproduce your analyses.
  4. If some form of hypothesis testing is being used or an attempt is being made to accurately estimate an effect size, use power analysis to plan research before conducting it so that it is maximally informative.
  5. To the best of your ability, power your research to detect the smallest effect size you are interested in testing (e.g., increase sample size, use within-subjects designs, use better, more precise measures, use stronger manipulations, etc.). Also, in order to increase the power of your research, consider collaborating with other labs, for example via StudySwap (https://osf.io/view/studyswap/). Be open to sharing existing data with other labs in order to pool data for a more robust study.
  6. If you find a result that you believe to be informative, make sure the result is robust. For smaller lab studies this means directly replicating your own work or, even better, having another lab replicate your finding, again via something like StudySwap.  For larger studies, this may mean finding highly similar data, archival or otherwise, to replicate results. When other large studies are known in advance, seek to pool data before analysis. If the samples are large enough, consider employing cross-validation techniques, such as splitting samples into random halves, to confirm results. For unique studies, checking robustness may mean testing multiple alternative models and/or statistical controls to see if the effect is robust to multiple alternative hypotheses, confounds, and analytical approaches.
  7. Avoid performing conceptual replications of your own research in the absence of evidence that the original result is robust and/or without pre-registering the study. A pre-registered direct replication is the best evidence that an original result is robust.
  8. Once some level of evidence has been achieved that the effect is robust (e.g., a successful direct replication), by all means do conceptual replications, as conceptual replications can provide important evidence for the generalizability of a finding and the robustness of a theory.
  9. To the extent possible, report null findings. In science, null news from reasonably powered studies is informative news.
  10. To the extent possible, report small effects. Given the uncertainty about the robustness of results across psychological science, we do not have a clear understanding of when effect sizes are “too small” to matter. As many effects previously thought to be large are small, be open to finding evidence of effects of many sizes, particularly under conditions of large N and sound measurement.
  11. When others are interested in replicating your work be cooperative if they ask for input. Of course, one of the benefits of pre-registration is that there may be less of a need to interact with those interested in replicating your work.
  12. If researchers fail to replicate your work continue to be cooperative. Even in an ideal world where all studies are appropriately powered, there will still be failures to replicate because of sampling variance alone. If the failed replication was done well and had high power to detect the effect, at least consider the possibility that your original result could be a false positive. Given this inevitability, and the possibility of true moderators of an effect, aspire to work with researchers who fail to find your effect so as to provide more data and information to the larger scientific community that is heavily invested in knowing what is true or not about your findings.

We should note that these proposed practices are complementary to other statements of commitment, such as the commitment to research transparency (http://www.researchtransparency.org/). We would also note that the proposed practices are aspirational.  Ideally, our field will adopt many, if not all, of these practices.  But, we also understand that change is difficult and takes time.  In the interim, it would be ideal to reward any movement toward better research practices.

Brent W. Roberts

Rolf A. Zwaan

Lorne Campbell

[1] van ’t Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12. doi:10.1016/j.jesp.2016.03.004


Andrew Gelman’s blog about the Fiske fiasco

Some of you might have missed the kerfuffle that erupted in the last few days over a pre-print of an editorial written by Susan Fiske for the APS Observer about us “methodological terrorists”.  Andrew Gelman’s blog reposts Fiske’s piece, puts it in historical context, and does a fairly good job of articulating why it is problematic beyond the terminological hyperbole that Fiske employs.  We are reposting it for your edification.

What has happened down here is the winds have changed
