Andrew Gelman’s blog about the Fiske fiasco

Some of you might have missed the kerfuffle that erupted in the last few days over a pre-print of an editorial written by Susan Fiske for the APS Observer about us “methodological terrorists”.  Andrew Gelman’s blog reposts Fiske’s piece, puts it in historical context, and does a fairly good job of articulating why it is problematic beyond the terminological hyperbole that Fiske employs.  We are reposting it for your edification.

What has happened down here is the winds have changed


The Power Dialogues

The following is a hypothetical exchange between a graduate student and Professor Belfry-Roaster.  The names have been changed to protect the innocent….

Budlie Bond: Professor Belfry-Roaster, I was confused today in journal club when everyone started discussing power.  I’ve taken my grad stats courses, but they didn’t teach us anything about power.  It seemed really important, but it also seemed controversial.  Can you tell me a bit more about power and why people care so much about it?

Prof. Belfry-Roaster: Sure, power is a very important factor in planning and evaluating research. Technically, power is defined as the long-run probability of rejecting the null hypothesis when it is, in fact, false. Power is typically considered to be a Good Thing because, if the null is false, then you want your research to be capable of rejecting it. The higher the power of your study, the better the chances are that this will happen.

The concept of power comes out of a very specific approach to significance testing pioneered by Neyman and Pearson. In this system, a researcher considers 4 factors when planning and evaluating research: the alpha level (typically the threshold you use to decide whether a finding is statistically significant), the effect size of your focal test of your hypothesis, sample size, and power.  The cool thing about this system is that if you know 3 of the factors you can compute the last one.  What makes it even easier is that we almost always use an alpha value of .05, so that is fixed. That leaves two things: the effect size (which you don’t control) and your sample size (which you can control). Thus, if you know the effect size of interest, you can use power analysis to determine the sample size needed to reject the null, say, 80% of the time, if the null is false in the population. Similarly, if you know the sample size of a study, you can calculate the power it has to reject the null under a variety of possible effect sizes in the population.
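To see the “know three, solve for the fourth” idea in action, here is a minimal sketch in Python using the statsmodels power module (purely illustrative; G*Power or any other power calculator gives the same answers for a simple two-group design):

```python
# The "know three, solve for the fourth" logic for a two-group design,
# using statsmodels (any power calculator implements the same relationships).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Known: effect size, alpha, desired power -> solve for sample size per group.
n_per_group = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(f"n per group for d = .4, alpha = .05, 80% power: {n_per_group:.0f}")  # ~99

# Known: effect size, alpha, sample size -> solve for power.
power = analysis.solve_power(effect_size=0.4, nobs1=25, alpha=0.05)
print(f"power with 25 per group and d = .4: {power:.2f}")  # roughly 0.28
```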

Here’s a classic paper on the topic for some quick reading:

Cohen J. (1992). Statistical power analysis. Current Directions in Psychological Science, 1, 98-101.

Budlie Bond:  Okay, that is a little clearer.  It seems that effect sizes are critical to understanding power. How do I figure out what my effect size is? It seems like that would involve a lot of guess work. I thought part of the reason we did research was because we didn’t know what the effect sizes were.

Prof. Belfry-Roaster: Effect sizes refer to the magnitude of the relationship between variables and can be indexed in far too many ways to describe. The two easiest and most useful for the majority of work in our field are the d-score and the correlation coefficient.  The d-score is the standardized difference between two means—simply the difference divided by the pooled standard deviation. The correlation coefficient is, well, the correlation coefficient. 

The cool thing about these two effect sizes is that they are really easy to compute from the statistics that all papers should report.  They can also be derived from basic information in a study, like the sample size and the p-value associated with a focal significance test.  So, even if an author has not reported an effect size you can derive one easily from their test statistics. Here are some cool resources that help you understand and calculate effect sizes from basic information like means and standard deviations, p-values, and other test statistics:

https://sites.google.com/site/lakens2/effect-sizes

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191.
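To make that concrete, here is a small illustration in Python using plain math and scipy. All of the input numbers below are hypothetical, made up for the example rather than taken from any particular paper:

```python
# Illustrative only: recovering d and r from commonly reported statistics.
# All input values below are hypothetical, not taken from any particular study.
import math
from scipy import stats

# d from two group means and SDs (the difference divided by the pooled SD).
m1, m2, sd1, sd2, n1, n2 = 5.2, 4.6, 1.5, 1.4, 50, 50
sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d_from_means = (m1 - m2) / sd_pooled

# d and r from an independent-samples t statistic and the group sizes.
t_value, df = 2.10, n1 + n2 - 2
d_from_t = t_value * math.sqrt(1 / n1 + 1 / n2)
r_from_t = math.sqrt(t_value**2 / (t_value**2 + df))

# d from a two-sided p value (work back to the t statistic first).
p_value = 0.04
t_from_p = stats.t.ppf(1 - p_value / 2, df)
d_from_p = t_from_p * math.sqrt(1 / n1 + 1 / n2)

print(round(d_from_means, 2), round(d_from_t, 2),
      round(r_from_t, 2), round(d_from_p, 2))
```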

Budlie Bond:  You said I can use effect size information to plan a study.  How does that work?

Prof. Belfry-Roaster: If you have some sense of what the effect size may be based on previous research, you can always use that as a best guess for selecting the appropriate sample size. But, many times that information isn’t available because you are trying something new.  If that is the case, you can still draw upon what we generally know about effect sizes in our field.  There are now five reviews that show that the average effect sizes in social, personality, and organizational psychology correspond roughly to a d-score of .4 or a correlation of .2.

Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431.

Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. Handbook of research methods in personality psychology, 149-169.

Paterson, T. A., Harms, P. D., Steel, P., & Credé, M. (2016). An assessment of the magnitude of effect sizes: Evidence from 30 years of meta-analysis in management. Journal of Leadership & Organizational Studies, 23(1), 66-81.

Hemphill, J. F. (2003). Interpreting the magnitudes of correlation coefficients. American Psychologist, 58, 78-80.

Richard, F. D., Bond Jr, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331-363.

There are lots of criticisms of these estimates, but they are not a bad starting point for planning purposes.  If you plug those numbers into a power calculator, you find that you need about 200 subjects to have 80% power for an average simple main effect (e.g., d = .4).  If you want to be safe and either have higher power (e.g., 90%) or plan for a smaller effect size (e.g., d of .3), you’ll need more like 250 to 350 participants.  This is pretty close to the sample size at which effect size estimates become “stable”.

Schoenbrodt & Perugini, 2013; http://www.nicebread.de/at-what-sample-size-do-correlations-stabilize/

However, for other types of analyses, like interaction effects, some smart people have estimated that you’ll need more than twice as many participants—in the range of 500.  For example, Uri Simonsohn has shown that if you want to demonstrate that a manipulation can make a previously demonstrated effect go away, you need twice as many participants as you would need to demonstrate the original main effect (http://datacolada.org/17).

Whatever you do, be cautious about these numbers.  Your job is to think about these issues, not to use rules of thumb blindly. For example, the folks who study genetic effects found out that the effect sizes for single nucleotide polymorphisms were so tiny that they needed hundreds of thousands of people to have enough power to reliably detect their effects.  On the flip side, when your effects are big, you don’t need many people.  We know that the Stroop effect is both reliable and huge. You only need a handful of people to figure out whether the Stroop main effect will replicate. Your job is to use some estimated effect size to make an informed decision about what your sample size should be.  It is not hard to do and there are no good excuses to avoid it.

Here are some additional links and tables that you can use to estimate the sample size you will need in order to achieve 80% or 90% power once you’ve got an estimate of your effect size:

For correlations:

https://power.phs.wakehealth.edu/index.cfm?calc=cor

For mean differences:

https://power.phs.wakehealth.edu/index.cfm?calc=tsm

Here are two quick and easy tables showing the relation between power, effect size, and required sample size for reference:

[Two tables relating effect size and power to the required sample size]
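If you would rather generate that kind of lookup table yourself, here is a rough sketch in Python (statsmodels again; the effect sizes in the loop are just illustrative values, and N is the total for a two-group design):

```python
# Rebuild the kind of lookup table shown above: total N for a two-group design
# at 80% and 90% power with alpha = .05, across a range of illustrative d values.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
print(f"{'d':>4} {'N for 80% power':>16} {'N for 90% power':>16}")
for d in (0.2, 0.3, 0.4, 0.5, 0.8):
    ns = [2 * math.ceil(analysis.solve_power(effect_size=d, alpha=0.05, power=p))
          for p in (0.80, 0.90)]
    print(f"{d:>4} {ns[0]:>16} {ns[1]:>16}")
```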

Budlie Bond:  My office-mate Chip Harkel says that there is a new rule of thumb that you should simply get 50 people per cell in an experiment.  Is that a sensible strategy to use when I don’t know what the effect size might be?

Prof. Belfry-Roaster:  The 50-people-per-cell rule is better than our previous rules of thumb (e.g., 15 to 20 people per cell), but, with a bit more thought, you can calibrate your sample size better. If you have reasons to think the effect size might be large (like the Stroop effect), you will waste a lot of resources if you collect 50 cases per cell. Conversely, if you are interested in testing a typical interaction effect, your power is going to be too low using this rule of thumb.

Budlie Bond: Why is low power such a bad thing?

Prof. Belfry-Roaster:  You can think about the answer several ways.  Here’s a concrete and personal way to think about it. Let’s say that you are ready to propose your dissertation.  You’ve come up with a great idea and we meet to plan out how you are going to test it.  Instead of running any subjects, I tell you there’s no need.  I’m simply going to flip a coin to determine your results.  Heads, your findings are statistically significant; tails, nonsignificant.  Would you agree to that plan?  If you find that to be an objectionable plan, then you shouldn’t care for the way we typically design our research because the average power is close to 50% (a coin flip).  That’s what you do every time you run a low-powered study—you flip a coin.  I’d rather that you have a good chance of rejecting the null if it is false than be subject to the whims of random noise.  That’s what having a high-powered study can do for you.

At a broader level low power is a problem because the findings from low powered studies are too noisy to rely on. Low powered studies are uninformative. They are also quite possibly the largest reason behind the replication crisis.  A lot of people point to p-hacking and fraud as the culprits behind our current crisis, but a much simpler explanation of the problems is that the original studies were so small that they were not capable of revealing anything reliable. Sampling variance is a cruel master. Because of sampling variance, effects in small studies bounce around a lot. If we continue to publish low powered studies, we are contributing to the myth that underpowered studies are capable of producing robust knowledge. They are not.

Here are some additional readings that should help you understand how power is related to increasing the informational value of your research:

Lakens, D., & Evers, E. R. K. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278–292. http://doi.org/10.1177/1745691614528520

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59(1), 537–563. http://doi.org/10.1146/annurev.psych.59.103006.093735

Budlie Bond: Is low power a good reason to dismiss a study after the fact?

Prof. Belfry-Roaster:  Many people assume that statistical power is not necessary “after the fact.” That is, once we’ve done a study and found a significant result, it would appear that the study must have been capable of detecting said effect. This is based on a misunderstanding of p-values and significance tests (see Fraley & Marks, 2007 for a review).

Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. Handbook of research methods in personality psychology, 149-169.

What many researchers fail to appreciate is that a literature based on underpowered studies is more likely to be full of false positives than a literature that is based on highly powered studies. This sometimes seems counterintuitive to researchers, but it boils down to the fact that, when studies are underpowered, the relative ratio of true to false positives in the literature shifts (see Ioannidis 2008). The consequence is that a literature based on underpowered studies is quite capable of containing an overwhelming number of false positives—much more than the nominal 5% that we’ve been conditioned to expect. If you want to maximize the number of true positives in the literature relative to false leads, you would be wise to not allow underpowered studies into the literature.

Ioannidis, J. P. A. (2008). Why most discovered true associations are inflated. Epidemiology, 19, 640-648.
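A quick back-of-the-envelope calculation shows why. Suppose, purely for illustration, that 10% of the hypotheses we test are true (that prior is an arbitrary assumption, not an estimate for any field):

```python
# Share of 'significant' findings that are true positives (PPV), as a function
# of power, assuming 10% of tested hypotheses are true and alpha = .05.
# The 10% prior is an illustrative assumption, not an estimate for any field.
alpha = 0.05
prior_true = 0.10

for power in (0.20, 0.35, 0.50, 0.80, 0.95):
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    ppv = true_positives / (true_positives + false_positives)
    print(f"power = {power:.2f}: {ppv:.0%} of significant results are true")
```

The lower the power, the smaller the share of published “significant” effects that are real, even though every single test nominally held its Type 1 error rate at 5%.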

In fact, I’d go one step further and say that low power is an excellent reason for a study to be desk rejected by an editor.  An editor has many jobs, but one of those is to elevate or maintain the quality of the work that the journal publishes. Given how poorly our research is holding up, you really need a good excuse to publish underpowered research because doing so will detract from the reputation of the journal in our evolving climate.  For example, if you are studying a rare group or your resources are limited, you may have some justification for using low-powered designs.  But if that is the case, you need to be careful about using inferential statistics.  The study may have to be justified as being descriptive or suggestive, at best.  On the other hand, if you are a researcher at a major university with loads of resources like grant monies, a big subject pool, and an army of undergraduate RAs, there is little or no justification for producing low-powered research.  Low-powered studies simply increase the noise in the system, making it harder and harder to figure out whether an effect exists or not and whether a theory has any merit.  Given how many effects are failing to replicate, we have to start taking power seriously unless we want to see our entire field go down in replicatory flames.

Another reason to be skeptical of low powered studies is that, if researchers are using significance testing as a way of screening the veracity of their results, they can only detect medium to large effects.  Given the fact that on average most of our effects are small, using low powered research makes you a victim of the “streetlight effect”—you know, where the drunk person only looks for their keys under the streetlight because that is the only place they can see? That is not an ideal way of doing science.

Budlie Bond: Ok, I can see some of your points. And, thanks to some of those online power calculators, I can see how I can plan my studies to ensure a high degree of power. But how do power calculations work in more complex designs, like those that use structural equation modeling or multi-level models?

Prof. Belfry-Roaster:  There is less consensus on how to think about power in these situations. But it is still possible to make educated decisions, even without technical expertise. For example, even in a design that involves both repeated measures and between-person factors, the between-person effects still involve comparisons across people and should be powered accordingly. And in SEM applications, if the pairwise covariances are not estimated with precision, there are lots of ways for those errors to propagate and create estimation problems for the model.

Thankfully, there are some very smart people out there and they have done their best to provide some benchmarks and power calculation programs for more complex designs.  You can find some of them here.

Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175-191.

Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1(2), 130-149.

Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling. Journal of Applied Psychology, 97(5), 951-966.

Muthén, B. O., & Curran, P. J. (1997). General longitudinal modeling of individual differences in experimental designs: A latent variable framework for analysis and power estimation. Psychological Methods, 2(4), 371-402.
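To give you a flavor of one of these approaches, here is my rough sketch of the RMSEA-based logic in MacCallum, Browne, and Sugawara (1996), coded with scipy. Treat the model degrees of freedom, the sample sizes, and the null/alternative RMSEA values below as illustrative assumptions, not recommendations:

```python
# Rough sketch of RMSEA-based power for a covariance structure model,
# following the logic of MacCallum, Browne, & Sugawara (1996).
# df, N, and the null/alternative RMSEA values below are illustrative.
from scipy.stats import ncx2

def rmsea_power(n, df, rmsea_null=0.05, rmsea_alt=0.08, alpha=0.05):
    """Power to reject 'close fit' (RMSEA = .05) when the true RMSEA is .08."""
    nc_null = (n - 1) * df * rmsea_null**2   # noncentrality under H0
    nc_alt = (n - 1) * df * rmsea_alt**2     # noncentrality under the alternative
    crit = ncx2.ppf(1 - alpha, df, nc_null)  # critical chi-square value
    return 1 - ncx2.cdf(crit, df, nc_alt)

for n in (100, 200, 400, 800):
    print(f"N = {n}: power to reject close fit is about {rmsea_power(n, df=50):.2f}")
```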

Budlie Bond: I wasn’t sure how to think about power when I conducted my first study. But, in looking back at my data, I see that, given the sample size I used, my power to detect the effect size I found (d = .50) was over 90%. Does that mean my study was highly powered?

Prof. Belfry-Roaster: When power is computed based on the effect size observed, the calculation is sometimes referred to as post hoc power or observed power. Although there can be value in computing post hoc power, it is not a good way to estimate the power of your design for a number of reasons. We have touched on some of those already. For example, if the design is based on a small sample, only large effects (true large effects and overestimates of smaller or null effects) will cross the p < .05 barrier. As a result, the effects you see will tend to be larger than the population effects of interest, leading to inflated power estimates.

More importantly, however, power is a design issue, not a data-analytic issue. Ideally, you want to design your studies to be capable of detecting the effects that matter for the theory of interest. Thus, when designing a study, you should always ask “How many subjects do I need to have 80% power to detect an effect if the population effect size is X or higher,” where X is the minimum effect size of interest. This value is likely to vary from one investigator to another, but given that small effects matter for most directional theories, it is prudent to set this value fairly low.

You can also ask about the power of the design in a post hoc way, but it is best to ask not what the power was to detect the effect that was observed, but what the power was to detect effects of various sizes. For example, if you conducted a two-condition study with 50 people per cell, you had 17% power to detect a d of .20, 51% to detect a d of .40, and 98% to detect a d of .80. In short, you can evaluate the power of a study to detect population effects of various sizes after the fact. But you don’t want to compute post hoc power by asking what the power of the design was for detecting the effect observed. For more about these issues, please see Daniel Lakens’ great blog post on post-hoc power: http://daniellakens.blogspot.com/2014/12/observed-power-and-what-to-do-if-your.html
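Those numbers are easy to check yourself; here is a quick sketch (statsmodels again, purely for illustration):

```python
# Power of a two-condition design with 50 people per cell to detect a range of
# hypothetical population effect sizes (alpha = .05, two-sided).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.4, 0.8):
    power = analysis.solve_power(effect_size=d, nobs1=50, alpha=0.05)
    print(f"power to detect d = {d}: {power:.0%}")
# Prints roughly 17%, 51%, and 98%, matching the numbers above.
```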

Budlie Bond:  Thanks for all of the background on power. I got in a heated discussion with Chip again and he said a few things that made me think you are emphasizing power too much.  First he said that effect sizes are for applied researchers and that his work is theoretical. The observed effect sizes are not important because they depend on a number of factors that can vary from one context to the next (e.g., the strength of the manipulation, the specific DV measured). Are effect sizes and power less useful in basic research than they are in applied research?

Prof. Belfry-Roaster:  With the exception of qualitative research, all studies have effect sizes, even if they are based on contrived or artificial conditions (think of Harlow’s wire monkeys, for example).  If researchers want a strong test of their theory in highly controlled laboratory settings, they gain enormously by considering power and thus effect sizes.  They need that information to design the study to test their idea well.

Moreover, if other people want to replicate your technique or build on your design, then it is really helpful if they know the effect size that you found so they can plan accordingly.

In short, even if the effect size doesn’t naturally translate into something of real world significance given the nature of the experimental or lab task, there is an effect size associated with the task. Knowing it is important not only for properly testing the theory and seeing what kinds of factors can modulate the effect, but for helping others plan their research accordingly. You are not only designing better research by using effect sizes, you are helping other researchers too.

Another reason to always estimate your effect sizes is that they are a great reality check on the likelihood and believability of your results. For example, when we do individual difference research, we start thinking that we are probably measuring the same thing when the correlation between our independent and dependent variable gets north of .50.  Well, a correlation of .5 is, in Cohen’s benchmark terms, comparable to a d-score of .8.  So, if you are getting correlations above .5 or d-scores above .8, your findings warrant a few skeptical questions.  First, you should ask whether you measured the same thing twice.  In an experiment d’s around 1 should really be the exclusive domain of manipulation checks, not an outcome of some subtle manipulation.  Second, you have to ask yourself how you are the special one who found the “low hanging fruit” that is implicit in a huge effect size.  We’ve been at the study of psychology for many decades.  How is it that you are the genius who finally hit on a relationship that is so large that it should be visible to the naked eye (Jacob Cohen’s description of a medium effect size) and all of the other psychologists missed it? Maybe you are that observant, but it is a good question to ask yourself nonetheless.

And this circles back to our discussion of low power.  Small N studies only have enough power to detect medium to large effect sizes with any reliability.  If you insist on running small N studies and ignore your effect sizes, you are more likely to produce inaccurate results simply because you can’t detect anything but large effects, which we know are rare. If you then report those exaggerated effect sizes, other people who attempt to build on your research will plan their designs around an effect that is too large. This will lead them to underpower their studies and fail to replicate your results. The underpowered study thus sets in motion a series of unfortunate events that lead to confusion and despair rather than progress.

Choosing to ignore your effect sizes in the context of running small N studies is like sticking your head in the sand.  Don’t do it.

Budlie Bond: Chip’s advisor also says we should not be so concerned with Type 1 errors.  What do you think?

Prof. Belfry-Roaster: To state out loud that you are not worried about Type 1 errors at this point in time is inconceivable.  Our studies are going down in flames one by one.  The primary reason is that we didn’t design the original studies well—typically they were underpowered and never directly replicated.  If we continue to turn a blind eye to powering our research well, we are committing to a future where our research will repeatedly fail to replicate.  Personally, I don’t want you to experience that fate.

Budlie Bond:  Chip also said that some people argue against using large samples because doing so is cheating.  You are more likely to get a statistically significant finding that is really tiny.  By only running small studies they say they protect themselves from promoting tiny effects.

Prof. Belfry-Roaster: While it is true that small studies can’t detect small effects, the logic of this argument does not add up. The only way this argument would hold is if you didn’t identify the effect size in your study, which, unfortunately, used to be quite common.  Researchers used to and still do obsess over p-values.  In a world where you only use p-values to decide whether a theory or hypothesis is true, it is the case that large samples will allow you to claim that an effect holds when it is actually quite small. On the other hand, if you estimate your effect sizes in all of your studies then there is nothing deceptive about using a large sample.  Once you identify an effect as small, then other researchers can decide for themselves whether they think it warrants investment.  Moreover, the size of the sample is independent of the effect size (or should be).  You can find a big effect size with a big sample too.

Ultimately, the benefits of a larger sample outweigh its costs.  You get less sampling variance and a more stable estimate of the effect size.  In turn, the test of your idea should hold up better in future research than the results from a small N study.  That’s nothing to sneeze at.

You can also see how this attitude toward power and effect sizes creates a vicious cycle.  If you use small N studies evaluated solely by p-values rather than power and effect sizes, you are destined to lead a chaotic research existence where findings come and go, seemingly nonsensically.  If you then argue that 1) no theory holds under all conditions, 2) the manipulation is delicate, or 3) there are loads of hidden moderators, you can quickly get into a situation where your claims cannot be refuted.  Using high-powered studies with effect size estimates can keep you a little more honest about the viability of your ideas.

Budlie Bond: Chip’s advisor says all of this obsession with power is hindering our ability to be creative.  What do you think?

Prof. Belfry-Roaster:  Personally, I believe the only thing standing between you and a creative idea is gray matter and some training. How you go about testing that idea is not a hindrance to coming up with the idea in the first place.  At the moment we don’t suffer from a deficit of creativity.  Rather we have an excess of creativity combined with the deafening roar of noise pollution.  The problem with low powered studies is they simply add to the noise. But how valuable are creative ideas in science if they are not true?

Many researchers believe that the best way to test creative ideas is to do so quickly with few people.  Actually, it is the opposite.  If you really want to know whether your new, creative idea is a good one, you want to overpower your study. One reason is that low power leads to Type II errors—not detecting an effect when the null is false.  That’s a tragedy.  And, it is an easy tragedy to avoid—just power your study adequately.

Creative ideas are a dime a dozen. But creative ideas based on robust statistical evidence are rare indeed. Be creative, but be powerfully creative.

Budlie Bond:  Some of the other grad students were saying that the sample sizes you are proposing are crazy large.  They don’t want to run studies that large because they won’t be able to keep up with grad students who can crank out a bunch of small studies and publish at a faster rate.

Prof. Belfry-Roaster:  I’m sympathetic to this problem as it does seem to indicate that research done well will inevitably take more time, but I think that might be misleading.  If your fellow students are running low powered studies, they are likely finding mixed results, which given our publication norms won’t get a positive reception.  Therefore, to get a set of studies all with p-values below .05 they will probably end up running multiple small studies.  In the end, they will probably test as many subjects as you’ll test in your one study.  The kicker is that their work will also be less likely to hold up because it is probably riddled with Type 1 errors.

Will Gervais has conducted some interesting simulations comparing research strategies that focus on slower but more powerful studies against those that focus on faster, less powerful ones. His analysis suggests that you’re not necessarily better off running a string of quick, underpowered studies. His post is worth a read.

http://willgervais.com/blog/2016/2/10/casting-a-wide-net

Budlie Bond:  Chip says that your push for large sample sizes also discriminates against people who work at small colleges and universities because they don’t have access to the numbers of people you need to run adequately-powered research.

Prof. Belfry-Roaster:  He’s right.  Running high-powered studies will require potentially painful changes to the way we conduct research.  This, as you know, is one reason why we often offer up our lab to friends at small universities to help conduct their studies.  But they also should not be too distraught.  There are creative and innovative solutions to the necessity of running well-designed studies (e.g., high-powered research).  First, we can take inspiration from the GWAS researchers.  When faced with the reality that they couldn’t go it alone, they combined efforts into a consortium in order to do their science properly. There is nothing stopping researchers at both smaller and larger colleges and universities from creating their own consortia. It might mean we have to change our culture of worshiping the “hero” researcher, but that’s okay.  Two or more heads are always better than one (at least according to most research on groups.  I wonder how reliable that work is…).  Second, we are on the verge of technological advances that can make access to large numbers of people much easier—MTurk being just one example.  Third, some of our societies, like SPSP, APS, and APA, are rich.  Well, rich enough to consider doing something creative with their money.  They could, if they had the will and the leadership, start thinking about doing proactive things like creating subject pool panels that we can all access and run our studies on and thus conduct better-powered research.

Basically, Bud, we are at a critical juncture.  We can continue doing things the old way, which means we will continue to produce noisy, unreplicable research, or we can change for the better.  The simplest and most productive thing we can do is to increase the power of our research.  In most cases, this can be achieved simply by increasing the average sample size of our studies.  That’s why we obsess about the power of the research we read and evaluate.  Any other questions?


Please Stop the Bleating


It has been unsettling to witness the seemingly endless stream of null effects emerging from numerous pre-registered direct replications over the past few months. Some of the outcomes were unsurprising given the low power of the original studies. But the truly painful part has come from watching and reading the responses from all sides.  Countless words have been written discussing every nuanced aspect of definitions, motivations, and aspersions. Only one thing is missing:

Direct, pre-registered replications by the authors of studies that have been the target of replications.

While I am sympathetic to the fact that those who are targeted might be upset, defensive, and highly motivated to defend their ideas, the absence of any data from the originating authors is a more profound indictment of the original finding than any commentary.  To my knowledge, and please correct me if I’m wrong, none of the researchers who’ve been the target of a pre-registered replication have produced a pre-registered study from their own lab showing that they are capable of getting the effect, even if others are not. For those of us standing on the sidelines watching things play out, it is constantly surprising that the one piece of information that might help—evidence that the original authors are capable of reproducing their own effects (in a pre-registered study)—is never offered up.

So, get on with it. Seriously. Everyone. Please stop the bleating. Stop discussing whether someone p-hacked, or what p-hacking really is, or whether someone is competent to do a replication, what a replication is, or whether a replication was done poorly or well.  Stop reanalyzing the damn Reproducibility Project or finding any of the thousands of other ways of re-examining the past.  Just get on with doing direct replications of your own work. It is a critical, albeit missing, piece of the reproducibility puzzle.

Science is supposed to be a give and take. If it is true that replicators lack some special sauce necessary to get an effect, then it is incumbent on those of us who’ve published original findings to show others that we can get the effect—in a pre-registered design.

Brent W. Roberts


We Need Federally Funded Daisy Chains

One of the most provocative requests in the reproducibility crisis was Daniel Kahneman’s call for psychological scientists to collaborate on a “daisy chain” of research replication. He admonished proponents of priming research to step up and work together to replicate the classic priming studies that had, up to that point, been called into question.

What happened? Nothing. Total crickets. There were no grand collaborations among the strongest and most capable labs to reproduce each other’s work. Why not? With 20/20 hindsight, it is clear that the incentive structure in psychological science militated against the daisy chain idea.

The scientific system in 2012 (and the one still in place) rewarded people who were the first to discover a new, counterintuitive feature of human nature, preferably using an experimental method. Since we did not practice direct replications, the veracity of our findings wasn’t really the point. The point was to be the discoverer, the radical innovator, the colorful, clever genius who apparently had a lot of flair.

If this was and remains the reward structure, what incentive was there or is there to conduct direct replications of your own or others’ work? Absolutely none. In fact, the act of replicating your work would be punitive. Taking the most charitable position possible, most everyone knew that our work was “fragile.” Even an informed researcher would know that the average power of our work (e.g., 50%) would naturally lead to an untenable rate of failures to replicate findings, even if they were true. And, failures to replicate our work would lead to innumerable negative consequences, ranging from diminished reputations and undermined grant prospects to lower odds of our students publishing their papers and painful embarrassment.

In fact, the act of replication was so aversive that, then and now, the proponents of most of the studies that have been called into question continue to argue passionately against the value of direct replication in science. Indeed, it seems the enterprise of replication is left to anyone but the original authors. The replications are left to the young, the noble, or the disgruntled. The latter are particularly problematic because they are angry. Why are they angry? They are angry because they are morally outraged. They perceive the originating researchers as people who have consciously, willingly manipulated the scientific system to publish outlandish but popular findings in an effort to enhance or maintain their careers. The anger can have unintended consequences. The disgruntled replicators can and do behave boorishly at times. Angry people do that. Then, they are called bullies or they are boycotted.

All of this sets up a perfectly horrible, internally consistent, self-fulfilling system where replication is punished. In this situation, the victims of replication can rail against the young (and by default less powerful) as having nefarious motivations to get ahead by tearing down their elders. And, they can often accurately point to the disgruntled replicators as mean-spirited. And, of course, you can conflate the two and call them shameless, little bullies. All in all, it creates a nice little self-justifying system for avoiding daisy chaining anything.

My point is not to criticize the current efforts at replication, so much as to argue that these efforts face a formidable set of disincentives. The system is currently rigged against systematic replications. To counter the prevailing anti-replication winds, we need robust incentives (i.e., money). Some journals have made valiant efforts to reward good practices and this is a great start. But, badges are not enough. We need incentives with teeth. We need Federally Funded Daisy Chains.

The idea of a Federally Funded Daisy Chain is simple. Any research that the federal government deems valuable enough to fund should be replicated. And, the feds should pay for it. How? NIH and NSF should set up research daisy chains. These would be very similar to the replication efforts at Perspectives on Psychological Science currently being carried out by Dan Simons and colleagues. Research teams from multiple sites would take the research protocols developed in federally funded research and replicate them directly.

And, the kicker is that the funding agencies would pay for this as part of the default grant proposal. Some portion of every grant would go toward funding a consortium of research teams—there could be multiple consortia across the country, for example. The PIs of the grants would be obliged to post their materials in such a way that others could quickly and easily reproduce their work. The replication teams would be reimbursed (i.e., incentivized) to do the replications. This would not only spread the grant-related wealth, but it would reward good practices across the board. PIs would be motivated to do things right from the get-go if they knew someone was going to come behind them and replicate their efforts. The pool of replicators would expand as more researchers could get involved and would be motivated by the wealth provided by the feds. Generally speaking, providing concrete resources would help make doing replications the default option rather than the exception.

Making replications the default would go a long way to addressing the reproducibility crisis in psychology and other fields. To do more replications we need concrete positive incentives to do the right thing. The right thing is showing the world that our work satisfies the basic tenet of science—that an independent lab can reproduce our research. The act of independently reproducing the work of others should not be left to charity. The federal government, which spends an inordinate amount of taxpayer dollars to fund our original research, should care enough about doing the right thing that they should fund efforts to replicate the findings they are so interested in us discovering.


Yes or no? Are Likert scales always preferable to dichotomous rating scales?

What follows below is the result of an online discussion I had with psychologists Michael Kraus (MK) and Michael Frank (MF). We discussed scale construction, and particularly, whether items with two response options (i.e., Yes v. No) are good or bad for the reliability and validity of the scale. We had a fun discussion that we thought we would share with you.

MK: Twitter recently rolled out a polling feature that allows its users to ask and answer questions of each other. The poll feature allows polling with two possible response options (e.g., Is it Fall? Yes/No). Armed with snark and some basic training in psychometrics and scale construction, I thought it would be fun to pose the following as my first poll:

[Screenshot of MK’s Twitter poll]

Said training suggests that, all things being equal, some people are more “Yes” or more “No” than others, so having response options that include more variety will capture more of the real variance in participant responses. To put that into an example, if I ask you whether you agree with the statement “I have high self-esteem,” a yes/no response format won’t capture all the true variance in people’s responses that might otherwise be captured by six response options ranging from strongly disagree to strongly agree. MF/BR, is that how you would characterize your own understanding of psychometrics?

MF: Well, when I’m thinking about dependent variable selection, I tend to start from the idea that the more response options for the participant, the more bits of information are transferred. In a standard two-alternative forced-choice (2AFC) experiment with balanced probabilities, each response provides 1 bit of information. In contrast, a 4AFC provides 2 bits, an 8AFC provides 3, etc. So on this kind of reasoning, the more choices the better, as illustrated by this table from Rosenthal & Rosnow’s classic text:

[Table from Rosenthal & Rosnow: bits of information transmitted as a function of the number of response alternatives]
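Just to spell out the arithmetic behind that table, the bits for a balanced k-alternative choice are simply log2(k); a quick check in Python:

```python
# Bits of information per response for a balanced k-alternative forced choice.
import math
for k in (2, 4, 8, 15):
    print(f"{k}AFC: about {math.log2(k):.1f} bits per response")
```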

For example, in one literature I am involved in, people are interested in the ability of adults and kids to associate words and objects in the presence of systematic ambiguity. In these experiments, you see several objects and hear several words, and over time the idea is that you build up some kind of links between objects and words that are consistently associated. In these experiments, people initially used 2AFC and 4AFC paradigms. But as the hypotheses about mechanism got more sophisticated, people shifted to using more stringent measures, like a 15AFC, which was argued to provide more information about the underlying representations.

On the other hand, getting more information out of such a measure presumes that there is some underlying signal. In the example above, the presence of this information was relatively likely because participants had been trained on specific associations. In contrast, in the kinds of polls or judgment studies that you’re talking about, it’s less clear whether participants have the kind of detailed representations that allow for fine-grained judgments. So if you’re asking for a judgment in general (like in #TwitterPolls or classic Likert scales), how many alternatives should you use?

MK: Right, most or all of my work (and I imagine a large portion of survey research) involves subjective judgments where it isn’t known exactly how people are making their judgments and what they’d likely be basing those judgments on. So, to reiterate your own question: How many response alternatives should you use?

MF: Turns out there is some research on this question. There’s a very well-cited paper by Preston & Colman (2000), who ask about service rating scales for restaurants. Not the most psychological example, but it’ll do. They present different participants with different numbers of response categories, ranging from 2 to 101. Here is their primary finding:

[Figure from Preston & Colman (2000): scale reliability as a function of the number of response categories]

In a nutshell, the reliability is pretty good for two categories, but it gets somewhat better up to about 7-9 options, then goes down somewhat. In addition, scales with more than 7 options are rated as slower and harder to use. Now this doesn’t mean that all psychological constructs have enough resolution to support 7 or 9 different gradations, but at least simple ratings or preference judgements seem like they might.

MK: This is great stuff! But if I’m being completely honest here, I’d say the reliabilities for just two response categories, even though they aren’t as good as they are at 7-9 options, are good enough to use. BR, I’m guessing you agree with this because of your response to my Twitter Poll:

[Screenshot of BR’s reply to the Twitter poll]

BR: Admittedly, I used to believe that when it came to response formats, more was always better.  I mean, we know that dichotomizing continuous variables is bad, so how could it be that a dichotomous rating scale (e.g., yes/no) would be as good if not superior to a 5-point rating scale?  Right?

Two things changed my perspective.  The first was precipitated by being forced to teach psychometrics, which is minimally on the 5th level of Dante’s Hell teaching-wise.  For some odd reason at some point I did a deep dive into the psychometrics of scale response formats and found, much to my surprise, a long and robust history going all the way back to the 1920s.  I’ll give two examples.  Like the Preston & Colman (2000) study that Michael cites, some very old literature had done the same thing (god forbid, replication!!!).  Here’s a figure showing the test-retest reliability from Matell & Jacoby (1971), where they varied the response options from 2 to 19 on measures of values:

[Figure from Matell & Jacoby (1971): test-retest reliability as a function of the number of response options]

The picture is a little different from the internal consistencies shown in Preston & Colman (2000), but the message is similar.  There is not a lot of difference between 2 and 19.  What I really liked about the old-school researchers is that they cared as much about validity as they did about reliability–here’s their figure showing simple concurrent validity of the scales:


[Figure from Matell & Jacoby (1971): concurrent validity as a function of the number of response options]

The numbers bounce a bit because of the small samples in each group, but the obvious takeaway is that there is no linear relation between scale points and validity.

The second example is from Komorita & Graham (1965).  These authors studied two scales, the evaluative dimension from the Semantic Differential and the Sociability scale from the California Psychological Inventory.  The former is really homogeneous, the latter quite heterogeneous in terms of content.  The authors administered 2-point and 6-point response formats for both measures.  Here is what they found vis-à-vis internal consistency reliability:


[Table from Komorita & Graham (1965): internal consistency for 2-point versus 6-point response formats]

This set of findings is much more interesting.  When the measure is homogeneous, the rating format does not matter.  When it is heterogeneous, having 6 options leads to better internal consistency.  The authors’ discussion is insightful and worth reading, but I’ll just quote them for brevity: “A more plausible explanation, therefore, is that some type of response set such as an “extreme response set” (Cronbach, 1946; 1950) may be operating to increase the reliability of heterogeneous scales. If the reliability of the response set component is greater than the reliability of the content component of the scale, the reliability of the scale will be increased by increasing the number of scale points.”

Thus, the old-school psychometricians argued that increasing the number of scale point options does not affect test-retest reliability or validity. It does marginally increase internal consistency, but most likely because of “systematic error” such as response sets (e.g., consistently using extreme options or not) that add some additional internal consistency to complex constructs.

One interpretation of our modern love of multi-option rating scales is that it leads to better internal consistencies, which we all believe to be a good thing.  Maybe it isn’t.

MK: I have three reactions to this: First, I’m sorry that you had to teach psychometrics. Second, it’s amazing to me that all this work on scale construction and the optimal number of response options isn’t more widely known. Third, how come, knowing all this as you do, this is the first time I have heard you favor two-option response formats?

BR: You might think that I would have become quite the zealot for yes/no formats after coming across this literature, but you would be wrong. I continued pursuing my research efforts using 4- and 5-point rating scales ad nauseam. Old dogs and new tricks and all of that.

The second experience that has turned me toward using yes/no more often, if not by default, came as a result of working with non-WEIRD [WEIRD = Western, Educated, Industrialized, Rich, and Democratic] samples and being exposed to some of the newer, more sophisticated approaches to modeling response information in Item Response Theory. For a variety of reasons, our research of late has been in samples not typically employed in most of psychology, like children, adolescents, and populations less literate than elite college students. In many of these samples, the standard 5-point Likert ratings of personality traits tend to blow up (psychometrically speaking).  We’ve considered a number of options for simplifying the assessment to make it less problematic for these populations to rate themselves, one of which is to simplify the rating scale to yes/no.

It just so happens that we have been doing some IRT work on an assessment experiment we ran online, where we randomly assigned people to fill out the NPI in one of three conditions–the traditional paired-comparison format, 5-point Likert ratings of all of the stems, and yes/no ratings of all of the NPI item stems (here’s one paper from that effort). I assumed that if we were going to turn to a yes/no format, we would need more items to net the same amount of information as a Likert-style rating.  So, I asked my colleague and collaborator, Eunike Wetzel, how many items you would need using a yes/no option to get the same amount of test information as from a set of Likert ratings of the NPI.  IRT techniques allow you to estimate how much of the underlying construct a set of items captures via a test information function.  What she reported back was surprising and fascinating.  You get the same amount of information out of 10 yes/no ratings as you do out of 10 5-point Likert ratings of the NPI.

So, Professor Kraus, this is the source of the pithy comeback to your tweet.  It seems to me that there is no dramatic loss of information, reliability, or validity when using 2-point rating scales.  If you consider the benefits gained (quicker responses, fewer response set problems, and the potential to be usable in a wider range of populations), there may be many situations in which a yes/no format is just fine.  Conversely, we may want to be cautious about the gain in internal consistency reliability we find in highly verbal populations, like college students, because it may arise through response sets and have no relation to validity.

MK: I appreciate this really helpful response (and that you address me so formally). Using a yes/no format has some clear advantages: it forces people to fall on one side of a scale or the other, it is quicker to answer than questions that rely on 4- to 7-point Likert scales, and it sounds (from your work, BR) like it allows scales to hold up better for non-WEIRD populations. MF, what is your reaction to this work?

MF: This is totally fascinating. I definitely see the value of using yes/no in cases where you’re working with non-WEIRD populations. We are just in the middle of constructing an instrument dealing with values and attitudes about parenting and child development and the goal is to be able to survey broader populations than the university-town parents we often talk to. So I am certainly convinced that yes/no is a valuable option for that purpose and will do a pilot comparison shortly.

On the other hand, I do want to push back on the idea that there are never cases where you would want a more graded scale. My collaborators and I have done a bunch of work now using continuous dependent variables to get graded probabilistic judgments. Two examples of this work are Kao et al. (2014) – I’m not an author on that one but I really like it – and Frank & Goodman (2012). To take an example, in the second of those papers we showed people displays with a bunch of shapes (say a blue square, blue circle, and green square) and asked them, if someone used the word “blue,” which shape do you think they would be talking about?

In those cases, using sliders or “betting” measures (asking participants to assign dollar values between 0 and 100) really did seem to provide more information per judgement than other measures. I’ve also experimented with using binary dependent variables in these tasks, and my impression is that they both converge to the same mean, but that the confidence intervals on the binary DV are much larger. In other words, if we hypothesize in these cases that participants really are encoding some sort of continuous probability, then querying it in a continuous way should yield more information.

So Brent, I guess I’m asking you whether you think there is some wiggle room in the results we discussed above – for constructs and participants where scale calibration is a problem and psychological uncertainty is large, we’d want yes/no. But for constructs that are more cognitive in nature, tasks that are more well-specified, and populations that are more used to the experimental format, isn’t it still possible that there’s an information gain for using more fine-grained scales?

BR:  Of course there is wiggle room.  There are probably vast expanses of space where alternatives are more appropriate.  My intention is not to create a new “rule of thumb” where we only use yes/no responses throughout.  My intention was simply to point out that our confidence in certain rules of thumb is misplaced.  In this case, the assumption that Likert scales are always preferable is clearly wrong.  On the other hand, there are great examples where a single, graded dimension is preferable–we just had a speaker discussing political orientation, which was rated from conservative to moderate to liberal on a 9-point scale.  This seems entirely appropriate.  And, mind you, I have a nerdly fantasy of someday creating single-item personality Behaviorally Anchored Rating Scales (BARS).  These are entirely cool rating scales where the items themselves become anchors on a single dimension.  So instead of asking 20 questions about how clean your room is, I would anchor the rating points from “my room is messier than a suitcase packed by a spider monkey on crack” to “my room is so clean they make silicon memory chips there when I’m not in”.  Then you could assess the Big Five or the facets of the Big Five with one item each.  We can dream, can’t we?

MF: Seems like a great dream to me. So – it sounds like if there’s one take-home from this discussion, it’s “don’t always default to the seven-point likert scale.” Sometimes such scales are appropriate and useful, but sometimes you want fewer – and maybe sometimes you’d even want more.


The New Rules of Research

by Brent W. Roberts

A paper on one of the most important research projects in our generation came out a few weeks ago. I’m speaking, of course, of the Reproducibility Project conducted by several hundred psychologists. It is a tour de force of good science. Most importantly, it provided definitive evidence for the state of the field. Despite the fact that 97% of the original studies reported statistically significant effects, only 36% hit the magical p < .05 mark when closely replicated.

Two defenses have been raised against the effort. The first, described by some as the “move along folks, there’s nothing to see here” defense, proposes that a 36% replication rate is no big deal. It is to be expected given how tough it is to do psychological science. At one level I’m sympathetic to the argument that science is hard to do, especially psychological science. It is the case that very few psychologists have 36% of their ideas work. And, by work, I mean in the traditional sense of the word, which is to net a p value less than .05 in whatever type of study you run. On the other hand, to make this claim about published work is disingenuous. When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up. If we really believed that our published work was so ephemeral, then much of our behavior in response to the reproducibility crisis has been nonsensical. If we all knew and expected our work not to replicate most of the time, then we wouldn’t get upset when it didn’t. We have disproven that point many times over. If we thought our effects that passed the p < .05 threshold were so flimsy, we would all write caveats at the end of our papers saying other researchers should be wary of our results as they were unlikely to replicate. We never do that. If we really thought so little of our results, we would not write such confident columns for the New York Times espousing our findings, stand up on the TED stage and claim such profound conclusions, or speak to the press in such glowing terms about the implications of our unreliable findings. But we do. I won’t get into the debate over whether this is a crisis or not, but please don’t pass off a 36% reproducibility rate as if it is either the norm, expected, or a good thing. It is not.

The second argument, which is somewhat related, is to restate the subtle moderator idea. It is disturbingly common to hear people argue that the reason a study does not replicate is subtle differences in the setting, sample, or demeanor of the experimenter across labs. To invoke this is problematic for several reasons. First, it is an acknowledgment that you haven’t been keeping up with the scholarship surrounding reproducibility issues. The Many Labs 3 report addressed this hypothesis directly and found no evidence for it.  Second, it means you are walking back almost every finding ever covered in an introductory psychology textbook. It makes me cringe when I hear what used to be a brazen scientist who had no qualms generalizing his or her findings based on psychology undergraduates to all humans, claiming that their once robust effects are fragile, tender shoots that only grow on the West Coast and not in the Midwest. I’m not sure if the folks invoking this argument realize that this is worse than having 64% of our findings not replicate. At least 36% did work. The subtle moderator take on things basically says we can ignore the remaining 36% too because yet unknown subtle moderators will render them ungeneralizable if tested a third time. While I am no fan of the over-generalization of findings based on undergraduate samples, I’m not yet willing to give up the aspiration of finding things out about humans. Yes, humans. Third, if this was such a widely accepted fact, and not something solely invoked after our work fails to replicate, then again, our reactions to the failures to replicate would be different. If we never expected our work to replicate in the first place, our reactions to failures to replicate wouldn’t be as extreme as they’ve been.

One response that has been largely missing from the reaction to the Reproducibility Report is recommending some changes to the way we do things. With that in mind, and in homage to Bill Maher, I offer a list of the “New Rules of Research[1]” that follow, at least in my estimation, from taking the results of the Reproducibility Report seriously.

  1. Direct replication is yooge (huge). Just do it. Feed the science. Feed it! Good science needs reliable findings, and direct replication is the quickest route to them. Don’t listen to the apologists for conducting only conceptual replications. Don’t pay attention to the purists who argue that all you need is a large sample. Build direct replications into your work so that you know for yourself whether your effects hold up. At the very least, doing your own direct replications will save you from the evils of sampling error. At the very most, you may catch errors in your protocol that could affect results in unforeseen ways. Then share the results with us however you can. When you are done with that, do some service to the field and replicate someone else’s work.
  2. If your finding fails to replicate, the field will doubt your finding—for now. Don’t take it personally. We’re just going by base rates. After all, less than half of our studies replicate on average. If your study fails to replicate, you are in good company—the majority. The same thing goes if your study replicates. Two studies do not make a critical mass of evidence. Keep at it.
  3. Published research in top journals should have high informational value. In the parlance of the NHSTers, this means high power. For the Bayesian folks, it means compelling evidence that is robust across a range of reasonable priors. Either way, we know from some nice simulations that for the typical between-subjects study this means a minimum of 165 participants for average-sized main effects and more than 400 participants for 2×2 between-subjects interaction tests (see the power sketch after this list for one way to check numbers like these yourself). You need even more observations if you want to get fancy or to reliably detect infinitesimal effect sizes (e.g., birth order and personality, genetic polymorphisms and any phenotype). We now have hundreds of studies that have failed to replicate, and the most likely reason is the lack of informational value in the design of the original research. Many protest that the burden of collecting all of those extra participants will cost too much time, effort, and money. While it is true that increasing our average sample size will make our research more difficult to do, consider the current situation, in which 64% of our studies fail to replicate and are therefore a potential waste of time to read and review because they are poorly designed to begin with (e.g., small-N studies with no evidence of direct replication). We waste countless dollars and hours of our time processing, reviewing, and following up on poorly designed research. The time spent collecting more data in the first place will be well worth it if the consequence is increasing the amount of reproducible and replicable research. And the journals will love it, because we will publish less and their impact factors will inevitably go up—making us even more famous.
  4. The gold standard for our science is a pre-registered direct replication by an independent lab. A finding is not worth touting or inserting in the textbooks until a well-powered, pre-registered, direct replication is published. Well, to be honest, it isn’t worth touting until a good number of well-powered, pre-registered, direct replications have been published.
  5. The peer-reviewed paper is no longer the gold standard. We need to de-reify the publication as the unit of exaltation. We shouldn’t be winning awards, or tenure, or TED talks for single papers. Conversely, we shouldn’t be slinking away in shame if one of our studies fails to replicate. We are scientists. Our job is, in part, to figure out how the world works. Our tools are inherently flawed and will sometimes give us the wrong answer. Other times we will ask the wrong question. Often we will do things incorrectly even when our question is good. That is okay. What is not okay is to act as if our work is true just because it got published. Updating your priors should be an integral part of doing science.
  6. Don’t leave the replications to the young. Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow.
  7. Don’t trust anyone over 50[2]. You might have noticed that the people most likely to protest the importance of direct replications, or who seem willing to accept a 36% replication rate as “not a crisis,” are all chronologically advanced and eminent. And why wouldn’t they want to keep the status quo? They built their careers on the one-off, counter-intuitive, amazeballs research model. You can’t expect them to abandon it overnight, can you? That said, if you are young, you might want to look elsewhere for inspiration and guidance. At this juncture, defending the status quo is like arguing to stay on board the Titanic.
  8. Stop writing rejoinders. Especially stop writing rejoinders that say (1) there were hidden, subtle moderators (that we didn’t identify in the first place), and (2) a load of my friends and their graduate students conceptually replicated my initial findings, so it must be kind of real. Just show us more data. If you can reliably reproduce your own effect, show it. The more time you spend on a rejoinder rather than on producing a replication of your own work, the less the field will believe your original finding.
  9. Beware of meta-analyses. As Daniël Lakens put it: bad data + good data does not equal good data. As much as it pains me to say it, since I like meta-analyses, they are no panacea. Meta-analyses are especially problematic when a bunch of data that has been p-hacked into submission is pooled with some high-quality data. The most common result of this combination is an effect that is different from zero, and thus statistically significant, but strikingly small compared to the original finding. Then you see the folks who published the original finding (usually with a d of .8 or 1) trumpeting the meta-analytic result as proof that their idea holds, without facing the fact that the flawed meta-analytic effect size is so small that they would never have detected it with the sample sizes they used to detect it in the first place (a toy simulation of this pattern follows the list, after the power sketch).
  10. If you want anyone to really believe your direct or conceptual replication, pre-register it. Yes, we know, there will be folks who will collect the data, then analyze it, then “pre-register” it after the fact. There will always be cheaters in every field. Nonetheless, most of us are motivated to find the truth, and eventually, if the gold standard is applied (see rule #4), we will get better estimates of the true effect. In the meantime, pre-register your own replication attempts and the field will be better for your efforts.
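
For rules 3 and 9, it helps to be able to run this sort of power arithmetic yourself. Here is a minimal sketch in Python using statsmodels; the d of 0.43 is my own assumption for illustration (a commonly cited ballpark for the average published effect in social and personality psychology), not a figure from the Reproducibility Report or from the simulations the rule alludes to.

```python
# A rough power sketch, not the "nice simulations" rule 3 mentions.
# Assumption for illustration: d = 0.43, a commonly cited ballpark for the
# average published effect size in social/personality psychology.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.43

# Per-group n needed for 80% power in a two-tailed, alpha = .05,
# two-group between-subjects comparison.
n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"~{n_per_group:.0f} per group, ~{2 * n_per_group:.0f} total")

# Power of a 'typical' small study (n = 20 per group) against the same effect.
power_small = analysis.power(effect_size=d, nobs1=20, alpha=0.05)
print(f"power at n = 20 per group: {power_small:.2f}")
```

Because the required n scales with roughly 1/d², halving the effect size you are chasing roughly quadruples the sample you need, which is why interaction tests, whose effects are typically attenuated, demand so many more participants than simple main effects.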

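And here is a toy version of the meta-analysis problem in rule 9. Everything in it is a made-up assumption for illustration: a small true effect (d = 0.1), ten small studies whose results were selected to be significant (a crude stand-in for p-hacking), ten honest large studies, and a simple fixed-effect pooling of d.

```python
# Toy illustration of "bad data + good data != good data".
# All numbers are assumptions for illustration, not estimates from any real literature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
TRUE_D = 0.1  # assumed small true effect

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # equal group sizes
    return (a.mean() - b.mean()) / pooled_sd

def honest_study(n):
    a, b = rng.normal(TRUE_D, 1, n), rng.normal(0, 1, n)
    return cohens_d(a, b), n

def selected_study(n):
    # Crude stand-in for p-hacking/selection: only "publish" a run once it hits
    # p < .05 in the predicted direction, which inflates the reported effect.
    while True:
        a, b = rng.normal(TRUE_D, 1, n), rng.normal(0, 1, n)
        t, p = stats.ttest_ind(a, b)
        if p < .05 and t > 0:
            return cohens_d(a, b), n

studies = [selected_study(20) for _ in range(10)] + [honest_study(200) for _ in range(10)]

# Fixed-effect (inverse-variance weighted) pooling of d.
d = np.array([s[0] for s in studies])
n = np.array([s[1] for s in studies], dtype=float)
var_d = 2 / n + d ** 2 / (4 * n)   # approximate sampling variance of d, equal group sizes n
w = 1 / var_d
d_pooled = (w * d).sum() / w.sum()
se_pooled = np.sqrt(1 / w.sum())
print(f"mean d in the selected small studies: {d[:10].mean():.2f}")
print(f"pooled d: {d_pooled:.2f} (z = {d_pooled / se_pooled:.2f})")
```

In runs like this, the pooled effect typically comes out reliably different from zero but only a fraction of the d the small studies reported, and far too small to detect with the n = 20 designs that produced the original splash, which is exactly the pattern rule 9 warns about.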
[1] Of course, many of these are not at all new. But, given the reactions to the Reproducibility Report and the continued invocation of any reason possible to avoid doing things differently, it is clear that these rules are new to some.

[2] Yes, that includes me. And, yes, I know that there are some chronologically challenged individuals on the pro-reproducibility side of the coin. That said, among the outspoken critics of the effort I count a disproportionate number of eminent scientists without even scratching the surface.

Posted in Uncategorized | 8 Comments

What we are reading in PIG-IE 9-14-15

Last week, we read Chabris et al. (2015), “The fourth law of behavior genetics,” another in a series of lucid papers from the GWAS consortium.

This week, with Etienne LeBel in town, we are reading the OSF’s Reproducibility Report.

Posted in Uncategorized | Leave a comment