A while back, Michael Kraus (MK), Michael Frank (MF), and I (Brent W. Roberts, or BWR; M. Brent Donnellan–MBD–is on board for this discussion, so we’ll have to keep our Michaels and Brents straight) got into a Twitter-inspired conversation about the niceties of using polytomous rating scales vs. yes/no rating scales for items. You can read that exchange here.
The exchange was loads of fun and edifying for all parties. An oversimplified summary would be that, despite passionate statements made by psychometricians, there is no yes-or-no answer to the question of whether Likert-type scales are superior for survey items.
We recently were reminded of our prior effort when a similar exchange on Twitter pretty much replicated our earlier conversation–I’m not sure whether it was a conceptual or direct replication….
In part of the exchange, Michael Frank (MF) mentioned that he had tried the 2-point option with items his group commonly uses and found the scale statistics to be so bad that they gave up on the effort and went back to a 5-point option. To which I replied, pithily, that he was using the Likert scale, and the systematic errors contained therein, to bolster the scale reliability. Joking aside, it reminded us that we had collected similar data that could be used to add more information to the discussion.
But, before we do the big reveal, let’s see what others think. We polled the Twitterati about their perspective on the debate, and here are the consensus opinions, which correspond nicely to the Michaels’ position:
Most folks thought moving to a 2-point rating scale would decrease reliability.
Most folks thought it would not make a difference when examining gender differences on the Big Five, but clearly there was less consensus on this question.
And, most folks thought moving to a 2-point rating scale would decrease the validity of the scales.
Before I could dig into my data vault, M. Brent Donnellan (MBD) popped up on the Twitter thread and forwarded amazingly ideal data for putting the scale option question to the test. He’d collected a lot of data varying the number of scale options from 7 points all the way down to 2 points using the BFI2. He also asked a few questions that could be used as interesting criterion-related validity tests, including gender, self-esteem, life satisfaction, and age. The sample consisted of folks from a Qualtrics panel with approximately 215 people per group.
So, does moving to a dichotomous rating scale affect internal consistency?
Here are the average internal consistencies (i.e., coefficient alphas) for 2-point (Agree/Disagree), 3-point, 5-point, and 7-point scales:
Just so you know, here are the plots for the same analysis from a forthcoming paper by Len Simms and company (Simms, Zelazny, Williams, & Bernstein, in press):
This one is oriented differently and has more response options, but it tells pretty much the same story. Agreeableness and Openness have the lowest reliabilities when using the 2-point option, but the remaining BFI domain scales are just fine–as in, well above the textbook thresholds for acceptable internal consistency.
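(A quick aside for readers who want to compute these numbers on their own data: coefficient alpha takes only a few lines. Below is a minimal sketch in Python with simulated stand-in data–the 215 respondents and 12 dichotomous items are made-up numbers for illustration, not the BFI2 dataset above.)

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a respondents-by-items matrix of scores."""
    k = items.shape[1]                           # number of items
    item_vars = items.var(axis=0, ddof=1).sum()  # summed item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Illustrative data: 215 respondents answering 12 dichotomous (0/1) items,
# all driven by a single latent trait plus noise.
rng = np.random.default_rng(42)
theta = rng.normal(size=(215, 1))                               # latent trait
responses = (theta + rng.normal(size=(215, 12)) > 0).astype(float)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```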
What’s going on here?
BWR: Well, agreeableness is one of the most skewed domains–everyone thinks they are nice (news flash: you’re not). It could be that finer-grained response options allow people to respond in less extreme ways. Or, the Likert scales are “fixing” a problematic domain. Openness is classically the most heterogeneous domain and typically does not hold together as well as the other Big Five. So, once again, the Likert scaling might be putting lipstick on a pig.
MK: Seeing this mostly through the lens of a scale user rather than a scale developer, I would not be worried if my reliability coefficients dipped to .70. When running descriptive stats on my data I wouldn’t even give that scale a second thought.
Also I think we can refer to BWR as “Angry Brent” from this point forward?
BWR: I prefer mildly exasperated Brent (MEB). And what are we to do with the Mikes? Refer to one of you as “Nice Mike” and the other as “Nicer Mike”? Which one of you is nicer? It’s hard to tell from my angry vantage point.
MBD: I agree with BWR. I also think the alphas reported with 2-point options are still more or less acceptable for research purposes. The often-cited rules of thumb about alpha get close to urban legends (Lance, Butts, & Michels, 2006). Clark and Watson (1995) have a nice line in a paper (or at least I remember it fondly) about how the goal of scale construction is to maximize validity, not internal consistency. I also suspect that fewer scale points might prove useful when conducting research with non-college student samples (e.g., younger, less educated). And I like the simplicity of the 2PL IRT model, so the 2-point options hold some appeal. (The ideal point folks can spare me the hate mail.) This might be controversial, but I think it would be better (although probably not dramatically so) to use fewer response options and use the saved survey space/ink to increase the number of items, even if just by a few. Content validity will increase, and the alpha coefficient will increase, assuming that the additional items don’t reduce the average inter-item correlation.
BWR: BTW, we have indirect evidence for this thought–we ran an online experiment where people were randomly assigned to conditions to rate items using a 2-point scale vs a 5-point scale. We lost about 300 people (out of 5000) in the 5-point condition due to people quitting before the end of the survey–they got tuckered out sooner when forced to think a bit more about the ratings.
MF: Since MK hasn’t chosen “nice Mike,” I’ll claim that label. I also agree that BWR lays out some good options for why the Likerts are performing somewhat better. But I think we might be able to narrow things down more. In the initial post, I cited the conventional cognitive-psych wisdom that more options = more information. But the actual information gain depends on how the options interact with the particular distribution of responses in the population. In IRT terms, harder questions are more informative when everyone in your sample has high ability, but not when ability varies more widely. I think the same thing is going on here for these scales–when attitudes vary more, the Likerts perform better (are more reliable, because they yield more information).
In the dataset above, I think that Agreeableness is likely to have responses bunched up at the top of the scale. Moving to the two-point scale then loses a bunch of information because everyone is choosing the same response. This is the same as putting a bunch of questions that are too easy on your test.
I went back and looked at the dataset that I was tweeting about, and found that exactly the same thing was happening. Our questions were about parenting attitudes, and they are all basically “gimmes” – everyone agrees with nearly all of them. (E.g., “It’s important for parents to provide a safe and loving environment for their child.”) The question is how they weight these. Our 7-point scale version pulls out some useful signal from these weightings (preprint here, whole-scale alpha was .90, subscales in the low .8s). But when we moved to a two-point scale, reliability plummeted to .20! The problem was that literally everyone agreed with everything.
I think our case is a very extreme example of a general pattern: when attitudes vary widely in a population, a 2-point scale is fine. When they are very homogeneous, you need more scale points.
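(A formal aside that connects MBD’s 2PL comment to MF’s argument: in the two-parameter logistic model, the probability of endorsing item $i$ and the information the item carries at trait level $\theta$ are

$$P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad I_i(\theta) = a_i^2\,P_i(\theta)\left[1 - P_i(\theta)\right],$$

where $a_i$ is the discrimination and $b_i$ the difficulty. Information peaks at $\theta = b_i$, where $P_i = .5$. When nearly everyone in the sample sits far above an item’s difficulty–MF’s “gimme” items–$P_i \approx 1$ and the item carries almost no information, which is exactly the pattern behind that .20 alpha.)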
What about validity?
Our first validity test is convergent validity–how well does the BFI2 correlate with the Mini-IPIP set of B5 scales?
BWR: From my vantage point we once again see the conspicuous nature of agreeableness. Something about this domain does not work as well with the dichotomous rating. On the other hand, the remaining domains look like there is little or no issue with moving from a 7-point to a 2-point scale.
MK: Since all of you are speculating about why agreeableness doesn’t work as a two-point scale, I’d be interested in your thoughts. What dimensions of a scale might lead to this kind of reduced convergent validity? I can see how people would be unwilling to answer FALSE to statements like “I see myself as caring, compassionate” because, wow, harsh. Another domain might be social dominance orientation: most people have largely egalitarian views of themselves (possibly willful ignorance), so saying TRUE to something like “some groups of people are inherently inferior to other groups” might be a big ask for the normal range of respondents.
BWR: I would assume that in highly evaluative domains you might run into distributional troubles with dichotomously rated items. With really skewed distributions you would get attenuated correlations among the items and lower reliability. On the other hand, you really want to know who those people are who say “no” to “I’m kind.”
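(A formal footnote to BWR’s point: for dichotomous items the attenuation is baked into the arithmetic. Two binary items with endorsement proportions $p_1 \le p_2$ can correlate at most

$$\phi_{\max} = \sqrt{\frac{p_1(1 - p_2)}{p_2(1 - p_1)}},$$

so if 95% of respondents endorse one “kind” item and 70% endorse another, the two items can correlate no higher than about .35, however strongly the underlying traits overlap.)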
MBD: I agree with BWR’s opening points. When I first read your original blog post, I was skeptical. But then I dug around and found a recent MMPI paper (Finn, Ben-Porath, & Tellegen, 2015) that was consistent with BWR’s points. I was more convinced, but I still like seeing things for myself. Thus, I conducted a subject pool study when I was at TAMU and pre-registered my predictions. Sure enough, the convergent validity coefficients were not dramatically better for a 5-point response option versus T/F for the BFI2 items. I then collected additional data to push on that idea, but this is a consistent pattern I have seen with the BFI2–more response options aren’t dramatically better. I have no clue whether this extends beyond the MMPI/BFI/BFI-2 items or not. But my money is on these patterns generalizing.
As for Agreeableness, there is an interesting pattern that supports the idea that the items get more difficult to endorse/reject (depending on their polarity) when you constrain the response options to 2. If we convert all of the observed scores to Percentage of Maximum Possible (POMP) scores (see Cohen, Cohen, Aiken, & West, 1999), we can loosely compare across the formats. The average score for A in the 2-point version was 82.78 (SD = 17.40), and it dropped to 70.86 (SD = 14.26) in the 7-point condition. So this might be a case where giving more response options allows people to admit to less desirable characteristics (the results for the other composites were less dramatic). So, I think MK has a good point above that might qualify some of my enthusiasm for the 2-point format for some kinds of content.
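(For readers who haven’t run into POMP scoring, the Cohen et al. conversion re-expresses any score as a percentage of the distance from the scale minimum to the scale maximum:

$$\text{POMP} = \frac{\text{observed} - \text{minimum}}{\text{maximum} - \text{minimum}} \times 100.$$

Working backwards, MBD’s 82.78 corresponds to a mean of roughly 1.83 on the 1–2 response scale, and 70.86 to a mean of roughly 5.25 on the 1–7 scale.)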
MF: OK, so this discussion above totally lines up with my theory that agreeableness is less variable, especially the idea that range on some of these variables might be restricted due to social desirability. MBD, BWR, is it generally true that agreeableness has low variance? (A histogram of responses for each variable in the 7-point case would be useful to see this by eye.)
More generally, just to restate the theory: 2-point is good when there is a lot of variance in the population. But when variance is compressed – whether due to social desirability or true homogeneity – more scale points are increasingly important.
BWR: I don’t see any evidence for variance issues, but I am aware of people reporting skewness problems with agreeableness. Most of us believe we are nice. But there are a few folks who are more than willing to admit to being not nice–thus, variances look good, but skewness may be the real culprit.
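To make the skew story concrete, here is a small simulation sketch (all parameters invented for illustration, not fit to the BFI2 data). The same latent item responses are scored two ways: dichotomized at a threshold nearly everyone clears, mimicking a domain where almost everyone says “yes, I’m nice,” versus graded on a 7-point scale.

```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (
        1 - items.var(axis=0, ddof=1).sum() / items.sum(axis=1).var(ddof=1)
    )

rng = np.random.default_rng(7)
n, k = 500, 10
theta = rng.normal(size=(n, 1))            # latent "agreeableness"
latent = theta + rng.normal(size=(n, k))   # continuous item responses

# 2-point scoring with a "gimme" threshold: roughly 96% of responses land
# on "agree," so item variances shrink and inter-item correlations attenuate.
two_point = (latent > -2.5).astype(float)

# 7-point scoring over the same latent responses keeps the gradations.
seven_point = np.digitize(latent, np.linspace(-3, 3, 6)).astype(float)

print(f"2-point alpha: {cronbach_alpha(two_point):.2f}")    # attenuated
print(f"7-point alpha: {cronbach_alpha(seven_point):.2f}")  # much higher
```

With these made-up numbers the 2-point alpha lands well below the 7-point alpha even though the latent trait variance is identical in both conditions–skewed endorsement, not low trait variance, does the damage.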
How about gender differences?
BWR: I see one thing in this table: sampling error. There is no rhyme or reason to the way these numbers bounce around, to my read, but I’m willing to be convinced.
MBD: I should give credit to Les Morey (creator of the PAI) for suggesting this exploratory question. I am still puzzled why the effect sizes bounce around (and have seen this in another dataset). I think a deeper dive testing invariance would prove interesting. But who has the time?
At the very least, there does not seem to be a simple story here. And it shows that we need a bigger N to get those CIs narrower. The size of those intervals makes me kind of ill.
MF: I love that you guys are upset about CIs this wide. Have you ever read an experimental developmental psychology study? On another note, I do think it’s interesting that you’re seeing overall larger effects for the larger numbers of scale points. If you look at the mean effect, it’s .20 for the 7-pt, .10 for the 2-pt, .15 for the 3-pt, and .20 for the 5-pt. So sure, lots of sampling error, but still some kind of consistency…
MK: Despite all the bouncing around, there doesn’t seem to be anything unusual about the two-option scale confidence intervals.
And now the validity coefficients for self-esteem (I took the liberty of reversing the Neuroticism (N) scores into Emotional Stability (ES) scores so everything was positive).
BWR: On this one the True-False scales actually do better than the Likert scales in some cases. No strong message here.
MK: This is shocking to me! Wow! One question, though–could the two-point scale items just be reflecting this overall positivity bias and not the underlying trait construct? That is, if the two-point scales were just measures of self-esteem, would this look just like it does here? I guess I’m hoping for some discriminant validity… or maybe I’d just like to see how intercorrelated the true-false version is across the five factors and compare that correlation to the longer Likerts.
BWR: Excellent point MK. To address the overall positivity bias inherent in a bunch of evaluative scales, we correlated the different B5 scales with age down below. Check it out.
MK: That is so… nice of you! Thanks!
BWR: I wish you would stop being so nice.
MF: I agree that it’s a bit surprising to me that we see the flip, but, going with my theory above, I predict that extraversion is the scale with the most variance in the longer Likert ratings. That’s why the 2-pt is performing so well–people really do vary dramatically on this characteristic AND there’s less social desirability coming out in the ratings, so the 2-point is actually useful.
And finally, the coefficients for life satisfaction:
MK: I’m a believer now, thanks Brent and Angry Brent!
MBD: Wait, which Brent is Angry! 😉
MF: Ok, so if I squint I can still say some stuff about variance etc. But overall it is true that the validity for the 2-point scale is surprisingly reasonable, especially for these lower-correlation measures. In particular, maybe the only things that really matter for life-satisfaction correlations are the big differences; so you accentuate these characteristics in the 2-pt and get rid of minor variance due to other sources.
How about age?
As was noted above, self-esteem and life satisfaction are rather evaluative, as are the Big Five, and that might create too much convergent validity and not enough discriminant validity. What about a non-evaluative outcome like age? Each of the samples was on average in their 50s, with ages ranging from young adulthood through old age. So, while the sample sizes were a little small for stable estimates (we like 250 minimum), age is not a bad outcome to correlate with because it is clearly not biased by social desirability. Unless, of course, we lie systematically about our age….
If you are keen on interpreting these coefficients, the confidence intervals for samples of this size are about ±.13. Happy inferencing.
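(For the curious, the ±.13 comes from the standard Fisher $r$-to-$z$ approximation: for a correlation near zero, the half-width of a 95% interval is

$$1.96 \times \frac{1}{\sqrt{n - 3}} = \frac{1.96}{\sqrt{215 - 3}} \approx .13,$$

and the interval only tightens slightly, after back-transforming, for correlations further from zero.)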
BWR: I find these results really interesting. Despite the apparent issues with the true-false version of agreeableness, it actually has the largest correlation with age–higher, in fact, than most prior reports, which admittedly are based on 5-point rating scale measures of the Big Five. I’m tempted to interpret the 3-point scales as problematic, but I’m going to go with sampling error again. It was probably just a funky sample.
MK: OK then. I agree–I think the 3-point option behaves most strangely for agreeableness.
MBD: I have a second replication sample where I used 2-, 3-, 4-, 5-, 6-, and 7-point response formats. The cell sizes are a bit smaller, but I will look at those correlations in that one as well.
General Thoughts?
MBD: This was super fun, and I appreciate that you three let me join the discussion. I admit that when I originally read the first exchange, I thought something was off about BWR’s thinking [BWR–you are not alone in that thought]. I was in a state of cognitive dissonance because it went against a “5 to 7 scale points are better than the alternatives” heuristic. Reading the MMPI paper was the next step toward disabusing myself of my bias. Now, after collecting these data, hearing a talk by Len Simms about his paper, and so forth, I am not as opposed to using fewer scale points as I was in the past. This is especially true if it allows one to collect additional items. That said, I think more work on content-by-scale-point interactions is needed, for the reasons brought up in this post. Still, I am a lot more positive toward 2-point scales than I was in the past. Thanks!
MF: Agreed – this was an impressive demonstration of Angry Brent’s ideas. Even though the 7-pt sometimes still performs better, overall the lack of problems with the 2-pt is real food for thought. Even I have to admit that sometimes the 2-pt can be simpler and easier. On the other hand, I will still point to our parenting questionnaire–which is much more tentative and early-stage in terms of the constructs it measures than the B5! In that case, using a 2-pt scale essentially destroyed the instrument because there was so much consensus (or social desirability)! So while I agree with the theoretical point from the previous post–consider 2-pt scales!–I also want to sound a cautious note here, because not every domain is as well understood.
MK: I agree with the caution MF alludes to, but wow, the 2-point scale performed far better than I anticipated. Thanks for doing all this!
BWR: I love data. It never conforms perfectly to your expectations. And, as usual, it raises as many questions as it answers. For me, the overriding question that emerges from these data is this: are 2-point scales problematic with less coherent and more skewed domains, or are 2-point scales excellent indicators that you have a potentially problematic set of items that you are papering over by using a 5-point scale? It may be that the 2-point scale approach is like the canary in the measurement coal mine–it will alert us to problems with our measures that need tending to.
These data also teach the lesson Clark and Watson (1995) offer: that validity should be paramount. My sense is that those of us in the psychometric trenches can get rather opinionated about measurement issues (use omega rather than Cronbach’s alpha; use IRT rather than classical test theory; etc.) that translate into nothing of significance once you condition your thinking on validity. Our reality may be that when we ask questions, people are capable of telling us a crude “yeah, that’s like me” or “no, not really like me,” and that’s about the best we can do regardless of how fine-grained our apparent measurement scales are.
MBD: Here’s a relevant quote from Dan Ozer: “It seems that it is relatively easy to develop a measure of personality of middling quality (Ashton & Goldberg, 1973), and then it is terribly difficult to improve it.” (p. 685).
Thanks MK, MF, and MBD for the nerdfest. As usual, it was fun.
References

Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319.

Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The problem of units and the circumstance for POMP. Multivariate Behavioral Research, 34, 315–346.

Finn, J. A., Ben-Porath, Y. S., & Tellegen, A. (2015). Dichotomous versus polytomous response options in psychopathology assessment: Method or meaningful variance? Psychological Assessment, 27, 184–193.

Hofstee, W. K. B., de Raad, B., & Goldberg, L. R. (1992). Integration of the Big Five and circumplex approaches to trait structure. Journal of Personality and Social Psychology, 63, 146–163.

Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.

Simms, L. J., Zelazny, K., Williams, T. F., & Bernstein, L. (in press). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment.
P.S. George Richardson pointed out that we did not compare even-numbered response options (e.g., 4-point) vs. odd-numbered response options (e.g., 5-point) and therefore did not confront the timeless debate of “should I include a middle option.” First, Len Simms’ paper does exactly that–it is a great paper and shows that it makes very little difference. Second, we did a deep dive into that issue for a project funded by the OECD. Like the story above, it made no difference for Big Five reliability or validity whether you used 4- or 5-point scales. If you used an IRT model (the GGUM), in some cases you got a little more information out of the middle option that was of value (e.g., for neuroticism). It never did psychometric damage to have a middle option, as many fear. So, you may want to lay to rest the argument that everyone will bunch to the middle when you include a middle option.
This is a really great blog piece–thanks for writing this, Brent.
My question is: what happens when you continue to use EVEN scales, e.g., 2-, 4-, and 6-point scales?
For example, the four point would become:
Highly true for me
Slightly true for me
Slightly false for me
Highly false for me
and the six-point would become:
Highly true for me
Moderately true for me
Slightly true for me
Slightly false for me
Moderately false for me
Highly false for me
It’s fascinating to me as to what happens to personality scales when the middle ground is removed… I wonder if it makes the variance more extreme and thus makes the even scales more sensitive (though, perhaps less specific) than the odd-number scales.
Hi Luke,
We have poked around with that permutation using personality items a bit. In one investigation we compared 4- vs. 5-point scales under the assumption that the middle point allowed people to escape making a commitment and therefore added noise to the system. In that study (a tech report for the OECD, not published per se) we found that we lost a little bit of information for some of the Big Five when we dropped the middle option. Based on my faulty memory, it did not do much to anything else–it did not affect the basic item stats much, or the validities at all.
We have in the past opted for even-numbered rating scales, if only to make the transition to IRT 2PL models easier, but that’s not really a justifiable reason for the practice–unless you’re the grad student tasked with learning the GGUM IRT program….
Brent
That’s helpful to know. I wonder if the ‘loss of info’ was in your latent model? If so, this is quite interesting, because it suggests some kind of continuity in the true underlying construct, such that all components of the Big Five are continuous from positive to negative valence and NOT dichotomous, categorical variables… This, of course, is interesting for two reasons. First, it explains why your dichotomous scale does not map as readily onto the Big Five as your ordinal scales do. Second, it raises the question of the independence of the Big Five in general, which as far as I understand have traditionally been mapped onto an orthogonal component space and are assumed to be somewhat uncorrelated/independent (e.g., your Big Five component model, NOT a factor model). This is in spite of the fact that some of the Big Five tend to be correlated (see Scott Kaufman’s work on the dark triad or Ulrich Schimmack’s work on the clustering of A, O, and C). So the question that comes to my mind is: if we assume independence and orthogonality between the Big Five, then why do we see evidence of continuity within them (per your models)? You’ve certainly got me thinking in all kinds of weird and wonderful factor space in relation to IRT and the Big Five.
Thanks for taking the time to write the piece and getting back to me.
If my memory serves, there was not a tremendous amount of information lost according to the information curves. Some traits did not benefit from including the middle option–extraversion, for example–while traits like neuroticism did. In terms of the Big Five and their overlap or lack thereof, I love the classic article by Hofstee et al. (1992). It shows why we get the Big Five (the holes in the lexicon) and that the presumption of orthogonality is not borne out in the language. Most terms are double-barreled, meaning we should see intercorrelations among variants of the Big Five depending on how people operationalize the scales. So, in my opinion, the Hofstee et al. article shows quite elegantly that there never were 5 orthogonal dimensions. Of course, this does not stop people from assuming that applying a 5-factor orthogonal solution is the same thing as discovering a perfectly orthogonal measurement space. Here’s the article: https://www.dropbox.com/s/ou4wx0tbzmss39i/Hofstee_deRaad_Goldberg_1992.pdf?dl=0