We Need Federally Funded Daisy Chains

One of the most provocative requests in the reproducibility crisis was Daniel Kahneman’s call for psychological scientists to collaborate on a “daisy chain” of research replication. He admonished proponents of priming research to step up and work together to replicate the classic priming studies that had, up to that point, been called into question.

What happened? Nothing. Total crickets. There were no grand collaborations among the strongest and most capable labs to reproduce each other’s work. Why not? With 20/20 hindsight, it is clear that the incentive structure in psychological science militated against the daisy chain idea.

The scientific system in 2012 (and the one still in place) rewarded people who were the first to discover a new, counterintuitive feature of human nature, preferably using an experimental method. Since we did not practice direct replications, the veracity of our findings wasn’t really the point. The point was to be the discoverer, the radical innovator, the colorful, clever genius who apparently had a lot of flair.

If this was and remains the reward structure, what incentive was there or is there to conduct direct replications of your own or others’ work? Absolutely none. In fact, the act of replicating your work would be punitive. Taking the most charitable position possible, most everyone knew that our work was “fragile.” Even an informed researcher would know that the average power of our work (e.g., 50%) would naturally lead to an untenable rate of failures to replicate findings, even if they were true. And, failures to replicate our work would lead to innumerable negative consequences ranging from diminishment of our reputations, undermining our ability to get grants, decreasing the probability of our students publishing their papers, to painful embarrassment.

In fact, the act of replication was so aversive that then, and now, the proponents of most of the studies that have been called into question continue to argue passionately against the value of direct replication in science. In fact, it seems the enterprise of replication is left to anyone but the original authors. The replications are left to the young, the noble, or the disgruntled. The latter are particularly problematic because they are angry. Why are they angry? They are angry because they are morally outraged. They perceive the originating researchers as people who have consciously, willingly manipulated the scientific system to publish outlandish, but popular findings in an effort to enhance or maintain their careers. The anger can have unintended consequences. The disgruntled replicators can and do behave boorishly at times. Angry people do that. Then, they are called bullies or they are boycotted.

All of this sets up a perfectly horrible, internally consistent, self-fulfilling system where replication is punished. In this situation, the victims of replication can rail against the young (and by default less powerful) as having nefarious motivations to get ahead by tearing down their elders. And, they can often accurately point to the disgruntled replicators as mean-spirited. And, of course, you can conflate the two and call them shameless, little bullies. All in all, it creates a nice little self-justifying system for avoiding daisy chaining anything.

My point is not to criticize the current efforts at replication, so much as to argue that these efforts face a formidable set of disincentives. The system is currently rigged against systematic replications. To counter the prevailing anti-replication winds, we need robust incentives (i.e., money). Some journals have made valiant efforts to reward good practices and this is a great start. But, badges are not enough. We need incentives with teeth. We need Federally Funded Daisy Chains.

The idea of a Federally Funded Daisy Chain is simple. Any research that the federal government deems valuable enough to fund should be replicated. And, the feds should pay for it. How? NIH and NSF should set up research daisy chains. These would be very similar to the efforts currently being carried out at Perspectives on Psychological Science by Dan Simons and colleagues. Research teams from multiple sites would take the research protocols developed in federally funded research and replicate them directly.

And, the kicker is that the funding agencies would pay for this as part of the default grant proposal. Some portion of every grant would go toward funding a consortium of research teams (there could be multiple consortia across the country, for example). The PIs of the grants would be obliged to post their materials in such a way that others could quickly and easily reproduce their work. The replication teams would be reimbursed (i.e., incentivized) to do the replications. This would not only spread the grant-related wealth, but it would reward good practices across the board. PIs would be motivated to do things right from the get-go if they knew someone was going to come behind them and replicate their efforts. The pool of replicators would expand as more researchers could get involved and would be motivated by the wealth provided by the feds. Generally speaking, providing concrete resources would help make doing replications the default option rather than the exception.

Making replications the default would go a long way to addressing the reproducibility crisis in psychology and other fields. To do more replications we need concrete positive incentives to do the right thing. The right thing is showing the world that our work satisfies the basic tenet of science—that an independent lab can reproduce our research. The act of independently reproducing the work of others should not be left to charity. The federal government, which spends an inordinate amount of taxpayer dollars to fund our original research, should care enough about doing the right thing that they should fund efforts to replicate the findings they are so interested in us discovering.

Posted in Uncategorized | 3 Comments

Yes or no? Are Likert scales always preferable to dichotomous rating scales?

What follows below is the result of an online discussion I had with psychologists Michael Kraus (MK) and Michael Frank (MF). We discussed scale construction, and particularly, whether items with two response options (i.e., Yes v. No) are good or bad for the reliability and validity of the scale. We had a fun discussion that we thought we would share with you.

MK: Twitter recently rolled out a polling feature that allows its users to ask and answer questions of each other. The poll feature allows polling with two possible response options (e.g., Is it Fall? Yes/No). Armed with snark and some basic training in psychometrics and scale construction, I thought it would be fun to pose the following as my first poll:


Said training suggests that, all things being equal, some people are more “Yes” or more “No” than others, so having response options that include more variety will capture more of the real variance in participant responses. To put that into an example, if I ask you whether you agree with the statement “I have high self-esteem,” a yes/no two-option response won’t capture all the true variance in people’s responses that might otherwise be captured by six options ranging from strongly disagree to strongly agree. MF/BR, is that how you would characterize your own understanding of psychometrics?

MF: Well, when I’m thinking about dependent variable selection, I tend to start from the idea that the more response options for the participant, the more bits of information are transferred. In a standard two-alternative forced-choice (2AFC) experiment with balanced probabilities, each response provides 1 bit of information. In contrast, a 4AFC provides 2 bits, an 8AFC provides 3, etc. So on this kind of reasoning, the more choices the better, as illustrated by this table from Rosenthal & Rosnow’s classic text:

[Image: table from Rosenthal & Rosnow]
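The bit counts MF quotes are the base-2 logarithm of the number of alternatives, assuming every alternative is equally likely. Here is a minimal sketch of that arithmetic; the function name `bits_per_response` is ours for illustration and does not come from the discussion or from Rosenthal & Rosnow.

```python
import math


def bits_per_response(n_alternatives: int) -> float:
    """Upper bound on information (in bits) carried by one forced-choice
    response, assuming all alternatives are equally likely."""
    if n_alternatives < 1:
        raise ValueError("need at least one alternative")
    return math.log2(n_alternatives)


if __name__ == "__main__":
    # The 2, 4 and 8 cases are the ones mentioned in the text; 15 is the
    # more stringent paradigm mentioned in the word-learning example.
    for n in (2, 4, 8, 15):
        print(f"{n:>2}AFC: {bits_per_response(n):.2f} bits")
```

This is only a ceiling: it says how much a response could carry, not how much signal participants actually have to give.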

For example, in one literature I am involved in, people are interested in the ability of adults and kids to associate words and objects in the presence of systematic ambiguity. In these experiments, you see several objects and hear several words, and over time the idea is that you build up some kind of links between objects and words that are consistently associated. In these experiments, initially people used 2 and 4AFC paradigms. But as the hypotheses about mechanism got more sophisticated, people shifted to using more stringent measures, like a 15AFC, which was argued to provide more information about the underlying representations.

On the other hand, getting more information out of such a measure presumes that there is some underlying signal. In the example above, the presence of this information was relatively likely because participants had been trained on specific associations. In contrast, in the kinds of polls or judgment studies that you’re talking about, it’s more unknown whether participants have the kind of detailed representations that allow for fine-grained judgments. So if you’re asking for a judgment in general (like in #TwitterPolls or classic Likert scales), how many alternatives should you use?

MK: Right, most or all of my work (and I imagine a large portion of survey research) involves subjective judgments where it isn’t known exactly how people are making their judgments and what they’d likely be basing those judgments on. So, to reiterate your own question: How many response alternatives should you use?

MF: Turns out there is some research on this question. There’s a very well-cited paper by Preston & Colman (2000), who ask about service rating scales for restaurants. Not the most psychological example, but it’ll do. They present different participants with different numbers of response categories, ranging from 2 to 101. Here is their primary finding:

[Image: primary finding from Preston & Colman (2000)]

In a nutshell, the reliability is pretty good for two categories, but it gets somewhat better up to about 7-9 options, then goes down somewhat. In addition, scales with more than 7 options are rated as slower and harder to use. Now this doesn’t mean that all psychological constructs have enough resolution to support 7 or 9 different gradations, but at least simple ratings or preference judgements seem like they might.

MK: This is great stuff! But if I’m being completely honest here, I’d say the reliabilities for just two response categories, even though they aren’t as good as they are at 7-9 options, are good enough to use. BR, I’m guessing you agree with this because of your response to my Twitter Poll:


BR: Admittedly, I used to believe that when it came to response formats, more was always better. I mean, we know that dichotomizing continuous variables is bad, so how could it be that a dichotomous rating scale (e.g., yes/no) would be as good as, if not superior to, a 5-point rating scale? Right?

Two things changed my perspective. The first was precipitated by being forced to teach psychometrics, which is minimally on the 5th level of Dante’s Hell teaching-wise. For some odd reason at some point I did a deep dive into the psychometrics of scale response formats and found, much to my surprise, a long and robust history going all the way back to the 1920s. I’ll give two examples. Like the Preston & Colman (2000) study that Michael cites, some old, old literature had done the same thing (god forbid, replication!!!). Here’s a figure showing the test-retest reliability from Matell & Jacoby (1971), where they varied the response options from 2 to 19 on measures of values:

[Image: test-retest reliability by number of response options, Matell & Jacoby (1971)]

The picture is a little different from the internal consistencies shown in Preston & Colman (2000), but the message is similar. There is not a lot of difference between 2 and 19. What I really liked about the old school researchers is they cared as much about validity as they did reliability–here’s their figure showing simple concurrent validity of the scales:

[Image: concurrent validity by number of response options, Matell & Jacoby (1971)]

The numbers bounce a bit because of the small samples in each group, but the obvious takeaway is that there is no linear relation between scale points and validity.

The second example is from Komorita & Graham (1965). These authors studied two scales, the evaluative dimension from the Semantic Differential and the Sociability scale from the California Psychological Inventory. The former is really homogeneous, the latter quite heterogeneous in terms of content. The authors administered 2- and 6-point response formats for both measures. Here is what they found vis-à-vis internal consistency reliability:

[Image: internal consistency reliability for 2- and 6-point formats, Komorita & Graham (1965)]

This set of findings is much more interesting.  When the measure is homogeneous, the rating format does not matter.  When it is heterogeneous, having 6 options leads to better internal consistency.  The authors’ discussion is insightful and worth reading, but I’ll just quote them for brevity: “A more plausible explanation, therefore, is that some type of response set such as an “extreme response set” (Cronbach, 1946; 1950) may be operating to increase the reliability of heterogeneous scales. If the reliability of the response set component is greater than the reliability of the content component of the scale, the reliability of the scale will be increased by increasing the number of scale points.”

Thus, the old-school psychometricians argued that increasing the number of scale point options does not affect test-retest reliability or validity. It does marginally increase internal consistency, but most likely because of “systematic error” such as response sets (e.g., consistently using extreme options or not) that add some additional internal consistency to complex constructs.

One interpretation of our modern love of multi-option rating scales is that it leads to better internal consistencies which we all believe to be a good thing.  Maybe it isn’t.

MK: I have three reactions to this: First, I’m sorry that you had to teach psychometrics. Second, it’s amazing to me that all this work on scale construction and the optimal number of response options isn’t more widely known. Third, how come, knowing all this as you do, this is the first time I have heard you favor two-option response formats?

BR: You might think that I would have become quite the zealot for yes/no formats after coming across this literature, but you would be wrong. I continued pursuing my research efforts using 4 and 5 point rating scales ad nauseam. Old dogs and new tricks and all of that.

The second experience that has turned me toward using yes/no more often, if not by default, came as a result of working with non-WEIRD [WEIRD = Western, Educated, Industrialized, Rich, and Democratic] samples and being exposed to some of the newer, more sophisticated approaches to modeling response information in Item Response Theory. For a variety of reasons our research of late has been in samples not typically employed in most of psychology, like children, adolescents, and less literate populations than elite college students. In many of these samples, the standard 5-point Likert ratings of personality traits tend to blow up (psychometrically speaking). We’ve considered a number of options for simplifying the assessment to make it less problematic for these populations to rate themselves, one of which is to simplify the rating scale to yes/no.

It just so happens that we have been doing some IRT work on an assessment experiment we ran online where we randomly assigned people to fill out the NPI in one of three conditions–the traditional paired-comparison, 5-point Likert ratings of all of the stems, and a yes/no rating of all of the NPI item stems (here’s one paper from that effort). I assumed that if we were going to turn to a yes/no format that we would need more items to net the same amount of information as a Likert-style rating. So, I asked my colleague and collaborator, Eunike Wetzel, how many items you would need using a yes/no option to get the same amount of test information from a set of Likert ratings of the NPI. IRT techniques allow you to estimate how much of the underlying construct a set of items captures via a test information function. What she reported back was surprising and fascinating. You get the same amount of information out of 10 yes/no ratings as you do out of 10 five-point Likert ratings of the NPI.

So Professor Kraus, this is the source of the pithy comeback to your tweet.  It seems to me that there is no dramatic loss of information, reliability, or validity when using 2-point rating scales.  If you consider the benefits gained–responses will be a little quicker, fewer response set problems, and the potential to be usable in a wider population, there may be many situations in which a yes/no is just fine.  Conversely, we may want to be cautious about the gain in internal consistency reliability we find in highly verbal populations, like college students, because it may arise through response sets and have no relation to validity.  

MK: I appreciate this really helpful response (and that you address me so formally). Using a yes/no format has some clear advantages, as it forces people to fall on one side of a scale or the other, is quicker to answer than questions that rely on 4-7 Likert response options, and it sounds (from your work, BR) like it allows scales to hold up better for non-WEIRD populations. MF, what is your reaction to this work?

MF: This is totally fascinating. I definitely see the value of using yes/no in cases where you’re working with non-WEIRD populations. We are just in the middle of constructing an instrument dealing with values and attitudes about parenting and child development and the goal is to be able to survey broader populations than the university-town parents we often talk to. So I am certainly convinced that yes/no is a valuable option for that purpose and will do a pilot comparison shortly.

On the other hand, I do want to push back on the idea that there are never cases where you would want a more graded scale. My collaborators and I have done a bunch of work now using continuous dependent variables to get graded probabilistic judgments. Two examples of this work are Kao et al. (2014) – I’m not an author on that one but I really like it – and Frank & Goodman (2012). To take an example, in the second of those papers we showed people displays with a bunch of shapes (say a blue square, blue circle, and green square) and asked them, if someone used the word “blue,” which shape do you think they would be talking about?

In those cases, using sliders or “betting” measures (asking participants to assign dollar values between 0 and 100) really did seem to provide more information per judgement than other measures. I’ve also experimented with using binary dependent variables in these tasks, and my impression is that they both converge to the same mean, but that the confidence intervals on the binary DV are much larger. In other words, if we hypothesize in these cases that participants really are encoding some sort of continuous probability, then querying it in a continuous way should yield more information.
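MF's impression about wider confidence intervals on the binary measure can be illustrated with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not in the discussion: each binary response is treated as a coin flip whose probability equals the participant's underlying belief, each slider response is treated as that belief plus independent noise, and the specific values (belief 0.7, slider noise SD 0.15, 50 participants) are hypothetical.

```python
import math


def ci_halfwidth_binary(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for the mean of n yes/no responses,
    assuming each response is a Bernoulli draw with probability p."""
    return z * math.sqrt(p * (1 - p) / n)


def ci_halfwidth_continuous(sd: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for the mean of n slider responses,
    assuming responses scatter around the belief with standard deviation sd."""
    return z * sd / math.sqrt(n)


if __name__ == "__main__":
    belief = 0.7      # hypothetical underlying probability judgment
    slider_sd = 0.15  # hypothetical response noise on a 0-1 slider
    n = 50            # hypothetical number of participants

    print(f"binary DV:     +/- {ci_halfwidth_binary(belief, n):.3f}")
    print(f"continuous DV: +/- {ci_halfwidth_continuous(slider_sd, n):.3f}")
```

Under these assumptions both measures estimate the same mean, but the binary interval is wider whenever the slider noise SD is smaller than sqrt(p(1 - p)). If participants do not hold a graded representation at all, the comparison does not apply.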

So Brent, I guess I’m asking you whether you think there is some wiggle room in the results we discussed above – for constructs and participants where scale calibration is a problem and psychological uncertainty is large, we’d want yes/no. But for constructs that are more cognitive in nature, tasks that are more well-specified, and populations that are more used to the experimental format, isn’t it still possible that there’s an information gain for using more fine-grained scales?

BR: Of course there is wiggle room. There are probably vast expanses of space where alternatives are more appropriate. My intention is not to create a new “rule of thumb” where we only use yes/no responses throughout. My intention was simply to point out that our confidence in certain rules of thumb is misplaced. In this case, the assumption that Likert scales are always preferable is clearly wrong. On the other hand, there are great examples where a single, graded dimension is preferable–we just had a speaker discussing political orientation which was rated from conservative to moderate to liberal on a 9-point scale. This seems entirely appropriate. And, mind you, I have a nerdly fantasy of someday creating single-item personality Behaviorally Anchored Rating Scales (BARS). These are entirely cool rating scales where the items themselves become anchors on a single dimension. So instead of asking 20 questions about how clean your room is, I would anchor the rating points from “my room is messier than a suitcase packed by a spider monkey on crack” to “my room is so clean they make silicon memory chips there when I’m not in”. Then you could assess the Big Five or the facets of the Big Five with one item each. We can dream, can’t we?

MF: Seems like a great dream to me. So – it sounds like if there’s one take-home from this discussion, it’s “don’t always default to the seven-point Likert scale.” Sometimes such scales are appropriate and useful, but sometimes you want fewer – and maybe sometimes you’d even want more.


The New Rules of Research

by Brent W. Roberts

A paper on one of the most important research projects in our generation came out a few weeks ago. I’m speaking, of course, of the Reproducibility Project conducted by several hundred psychologists. It is a tour de force of good science. Most importantly, it provided definitive evidence for the state of the field. Despite the fact that 97% of the original studies reported statistically significant effects, only 36% hit the magical p < .05 mark when closely replicated.

Two defenses have been raised against the effort. The first, described by some as the “move along folks, there’s nothing to see here” defense, proposes that a 36% replication rate is no big deal. It is to be expected given how tough it is to do psychological science. At one level I’m sympathetic to the argument that science is hard to do, especially psychological science. It is the case that very few psychologists have 36% of their ideas work. And, by work, I mean in the traditional sense of the word, which is to net a p value less than .05 in whatever type of study you run. On the other hand, to make this claim about published work is disingenuous. When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up. If we really believed that our published work was so ephemeral, then much of our behavior in response to the reproducibility crisis has been nonsensical. If we all knew and expected our work not to replicate most of the time, then we wouldn’t get upset when it didn’t. We have disproven that point many times over. If we thought our effects that passed the p < .05 threshold were so flimsy, we would all write caveats at the end of our papers saying other researchers should be wary of our results as they were unlikely to replicate. We never do that. If we really thought so little of our results we would not write such confident columns for the New York Times espousing our findings, stand up on the TED stage and claim such profound conclusions, or speak to the press in such glowing terms about the implications of our unreliable findings. But we do. I won’t get into the debate over whether this is a crisis or not, but please don’t pass off a 36% reproducibility rate as if it is either the norm, expected, or a good thing. It is not.

The second argument, which is somewhat related, is to restate the subtle moderator idea. It is disturbingly common to hear people argue that the reason a study does not replicate is because of subtle differences in the setting, sample, or demeanor of the experimenter across labs. To invoke this is problematic for several reasons. First, it is an acknowledgment that you haven’t been keeping up with the scholarship surrounding reproducibility issues. The Many Labs 3 report addressed this hypothesis directly and showed that the null hypothesis could not be rejected. Second, it means you are walking back almost every finding ever covered in an introductory psychology textbook. It makes me cringe when I hear what used to be a brazen scientist who had no qualms generalizing his or her findings based on psychology undergraduates to all humans, claiming that their once robust effects are fragile, tender shoots, that only grow on the West coast and not in the Midwest. I’m not sure if the folks invoking this argument realize that this is worse than having 64% of our findings not replicate. At least 36% did work. The subtle moderator take on things basically says we can ignore the remaining 36% too because yet unknown subtle moderators will render them ungeneralizable if tested a third time. While I am no fan of the over-generalization of findings based on undergraduate samples, I’m not yet willing to give up the aspiration of finding things out about humans. Yes, humans. Third, if this were such a widely accepted fact, and not something solely invoked after our work fails to replicate, then again, our reactions to the failures to replicate would be different. If we never expected our work to replicate in the first place, our reactions to failures to replicate wouldn’t be as extreme as they’ve been.

One thing that has not really occurred much in response to the Reproducibility Report is to recommend some changes to the way we do things. With that in mind, and in homage to Bill Maher, I offer a list of the “New Rules of Research[1]” that follow, at least in my estimate, from taking the results of the Reproducibility Report seriously.

  1. Direct replication is yooge (huge). Just do it. Feed the science. Feed it! Good science needs reliable findings and direct replication is the quickest way to good science. Don’t listen to the apologists for conducting only conceptual replications. Don’t pay attention to the purists who argue that all you need is a large sample. Build direct replications into your work so that you know yourself whether your effects hold up. At the very least, doing your own direct replications will save you from evils of sampling error. At the very most, you may catch errors in your protocol that could affect results in unforeseen ways. Then share it with us however you can. When you are done with that do some service to the field and replicate someone else’s work.
  1. If your finding fails to replicate, the field will doubt your finding—for now. Don’t take it personally. We’re just going by base rates. After all, less than half of our studies replicate on average. If your study fails to replicate, you are in good company—the majority. The same thing goes if your study replicates. Two studies do not make a critical mass of evidence. Keep at it.
  1. Published research in top journals should have high informational value. In the parlance of the NHSTers this means high power. For the Bayesian folks, compelling evidence that is robust across a range of reasonable priors. Either way, we know from some nice simulations that for the typical between subjects study this means that we need a minimum of 165 participants for average main effects and more than 400 participants for 2×2 between-subjects interaction tests. You need even more observations if you want to get fancy or reliably detect infinitesimal effect sizes (e.g., birth order and personality, genetic polymorphisms and any phenotype). We now have hundreds of studies that have failed to replicate and the most powerful reason is the lack of informational value in the design of the original research. Many protest that the burden of collecting all of those extra participants will cost too much time, effort, and money. While it is true that increasing our average sample size will make doing our research more difficult, consider the current situation in which 64% of our studies fail to replicate and are therefore are a potential waste of time to read and review because they are poorly designed to start (e.g., small N studies with no evidence of direct replication). We waste countless dollars and hours of our time processing, reviewing, and following up on poorly designed research. The time spent collecting more data in the first place will be well worth it if the consequence is increasing the amount of reproducible and replicable research. And, the journals will love it because we will publish less and their impact factors will inevitably go up—making us even more famous.
  1. The gold standard for our science is a pre-registered direct replication by an independent lab. A finding is not worth touting or inserting in the textbooks until a well-powered, pre-registered, direct replication is published. Well, to be honest, it isn’t a worth touting until a good number of well-powered, pre-registered, direct replications have been published.
  1. The peer-reviewed paper is no longer the gold standard. We need to de-reify the publication as the unit of exaltation. We shouldn’t be winning awards, or tenure, or TED talks for single papers. Conversely, we shouldn’t be slinking away in shame if one of our studies fails to replicate. We are scientists. Our job is, in part, to figure out how the world works. Our tools are inherently flawed and will sometimes give us the wrong answer. Other times we will ask the wrong question. Often we will do things incorrectly even when our question is good. That is okay. What is not okay is to act as if our work is true just because it got published. Updating your priors should be an integral part of doing science.
  1. Don’t leave the replications to the young. Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow.
  1. Don’t trust anyone over 50[2]. You might have noticed that the persons most likely to protest the importance of direct replications or who seem willing to accept a 36% replication rate as “not a crisis” are all chronologically advanced and eminent. And why wouldn’t they want to keep the status quo? They built their careers on the one-off, counter-intuitive, amazeballs research model. You can’t expect them to abandon it overnight can you? That said if you are young, you might want to look elsewhere for inspiration and guidance. At this juncture, defending the status quo is like arguing to stay on board the Titanic.
  1. Stop writing rejoinders. Especially stop writing rejoinders that say 1) there were hidden, subtle moderators (that we didn’t identify in the first place), and 2) a load of my friends and their graduate students conceptually replicated my initial findings so it must be kind of real. Just show us more data. If you can reliably reproduce your own effect, show it. The more time you spend on a rejoinder and not producing a replication of your own work, the less the field will believe your original finding.
  1. Beware of meta-analyses. As Daniël Lakens put it: bad data + good data does not equal good data. As much as it pains me to say it, since I like meta-analyses, they are no panacea. Meta-analyses are especially problematic when a bunch of data has been p-hacked into submission and it is included with some high quality data. The most common result of this combination is to find an effect that is different from zero and thus statistically significant but strikingly small compared to the original finding. Then, you see the folks who published the original finding (usually with a d of .8 or 1) trumpeting the meta-analytic findings as proof that their idea holds, without facing the fact that the flawed meta-analytic effect size is so small that they would have never detected it using the methods they used to detect it in the first place.
  1. If you want anyone to really believe your direct or conceptual replication, then pre-register it. Yes, we know, there will be folks who will collect the data, then analyze it, then “pre-register” it after the fact. There will always be cheaters in every field. Nonetheless, most of us are motivated to find the truth, and eventually, if the gold standard is applied (see rule #4), we will get better estimates of the true effect. In the meantime, pre-register your own replication attempts and the field will be better for your efforts.
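
To put a rough number on the meta-analysis rule above, here is a minimal sketch of how many participants per group a simple two-group study would need to detect an effect with 80% power. The d of .8 comes from the rule itself; the d of .1 is only an assumed stand-in for a “strikingly small” meta-analytic estimate, and the function name `n_per_group` is made up for illustration. It relies on a normal approximation, so treat the results as ballpark figures.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    An exact t-test calculation would ask for slightly more participants.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# d = 0.8 is the original-study effect size mentioned in the rule;
# d = 0.1 is a hypothetical small meta-analytic estimate.
for d in (0.8, 0.1):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, the large effect needs a couple dozen people per group while the small one needs well over a thousand, which is the gap the rule is pointing at.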

[1] Of course, many of these are not at all new. But, given the reactions to the Reproducibility Report and the continued invocation of any reason possible to avoid doing things differently, it is clear that these rules are new to some.

[2] Yes, that includes me. And, yes, I know that there are some chronologically challenged individuals on the pro-reproducibility side of the coin. That said, among the outspoken critics of the effort I count a disproportionate number of eminent scientists without even scratching the surface.


What we are reading in PIG-IE 9-14-15

Last week, we read Chabris et al. (2015), “The fourth law of behavior genetics,” another in a series of lucid papers from the GWAS consortium.

This week, with Etienne LeBel in town, we are reading the OSF’s Reproducibility Report.


Be your own replicator

by Brent W. Roberts

One of the conspicuous features of the ongoing reproducibility crisis stewing in psychology is that we have a lot of fear, loathing, defensiveness, and theorizing being expressed about direct replications. But, if the pages of our journals are any indication, we have very few direct replications being conducted.

Reacting with fear is not surprising. It is not fun to have your hard-earned scientific contribution challenged by some random researcher. Even if the replicator is trustworthy, it is scary to have your work be the target of a replication attempt. For example, one colleague was especially concerned that graduate students were now afraid to publish papers given the seeming inevitability of someone trying to replicate and tear down their work. Seeing the replication police in your rearview mirror would make anyone nervous, but especially new drivers.

Another prototypical reaction appears to be various forms of loathing. We don’t need to repeat the monikers used to describe researchers who conduct and attempt to publish direct replications. It is clear that they are not held in high esteem. Other scholars may not demean the replicators but hold equally negative attitudes towards the direct replication enterprise and deem the entire effort a waste of time. They are, in a word, too busy making discoveries to fuss with conducting direct replications.

Other researchers who are the target of failed replications have turned to writing long rejoinders. Often reflecting a surprising amount of work, these papers typically argue that while the effect of interest failed to replicate, there are dozens of conceptual replications of the phenomenon of interest.

Finally, there appears to be an emerging domain of scholarship focused on the theoretical definition and function of replications. While fascinating, and often compelling, these essays are typically not written by people conducting direct replications themselves—a seemingly conspicuous fact.

While each of these reactions is sensible, they are entirely ineffectual, especially in light of the steady stream of papers failing to replicate major and minor findings in psychology. Looking across the various efforts at replication, it is not too much of an exaggeration to say that less than 50% of our work is reproducible. Acting fearful, loathing replicators, being defensive and arguing for the status quo, or writing voluminous discourses on the theoretical nature of replication are fundamentally ineffective responses to this situation. We dither while a remarkable proportion of our work fails to be reproduced.


There is, of course, a deceptively simple solution to this situation. Be your own replicator.


It is that simple. And, I don’t mean conceptual replicator; I mean direct replicator. Don’t wait for someone to take your study down. Don’t dedicate more time writing a rejoinder than it would take to conduct a study. Replicate your work yourself.

Now, this is not much different from the position that Joe Cesario espoused, which is surprising because, as Joe can attest, I did not care for his argument when it came out. But it is clear at this juncture that there was much wisdom in his position. It is also clear that people haven’t paid it much heed. Thus, I think it merits restating.

Consider for a moment how conducting your own direct replication of your own research might change some of the interactions that have emerged over the last few years. In the current paradigm we get incredibly uncomfortable exchanges that go something like this:

Researcher B: “Dear eminent, highly popular Researcher A, I failed to replicate your study published in that high impact factor journal.”

Researcher A: “Researcher B, you are either incompetent or malicious. Also, I’d like to note that I don’t care for direct replications. I prefer conceptual replications, especially because I can identify dozens of conceptual replications of my work.”


Imagine an alternative universe in which Researcher A had a file of direct replications of the original findings. Then the conversation would go from a spitting match to something like this:

Researcher B: “Dear eminent, highly popular Researcher A, I failed to replicate your study published in that high impact factor journal.”

Researcher A: “Interesting. You didn’t get the same effect? I wonder why. What did you do?”

Researcher B: “We replicated your study as directly as we could and failed to find the same effect” (whether judged by p-values, effect sizes, confidence intervals, Bayesian priors or whatever).

Researcher A: “We’ve reproduced the effect several times in the past. You can find the replication data on the OSF site linked to the original paper. Let’s look at how you did things and maybe we can figure this discrepancy out.”


That is a much different exchange than the ones we’ve seen so far, which have been dominated by conspicuous failures to replicate and, well, little more than vitriolic arguments over details with little or no old or new data.

Of course, there will be protests. Some continue to argue for conceptual replications. This perspective is fine. And, let me be clear. No one to date has argued against conceptual replications per se. What has been said is that in the absence of strong proof that the original finding is robust (as in directly replicable), conceptual replications provide little evidence for the reliability and validity of an idea. That is to say, conceptual replications rock, if and when you have shown that the original finding can be reproduced.

And that is where being your own replicator is such an ingenious strategy. Not only do you inoculate yourself against the replicators, but you also bolster the validity of your conceptual replications in the process. That is a win-win situation.

And, of course, being your own direct replicator also addresses the argument that the replicators may be screw-ups. If you feel this way, fine. Be your own replicator. Show us you can get the goods. Twice. Three times. Maybe more. But, of course, make sure to pre-register your replication attempts; otherwise, some may accuse you of p-hacking your way to a direct replication.

It is also common, as noted, to see a response to a failure to replicate that lists out sometimes dozens of small sample, conceptual replications of original work as some kind of response. Why waste your time? The time spent crafting arguments about tenuous evidence could easily be spent conducting your own direct replication of your own work. Now that would be a convincing response. A direct replication is worth a thousand words—or a thousand conceptual replications.

Conversely, replication failures spur some to craft nuanced arguments about just what a replication is, whether there is anything that is really a “direct” replication, and so on. These are nice essays to read. But we’ll have time for these discussions later, after we show that some of our work actually merits discussion. Proceeding to intellectual discussions is nothing more than a waste of time when more than 50% of our research fails to replicate.

Some might want to argue that conducting our own direct replications would be an added burden to already inconvenienced researchers. But, let’s be honest. The JPSP publication arms race has gotten way out of hand. Researchers seemingly have to produce at least 8 different studies to even have a chance of getting into the first two sections of JPSP. What real harm would there be if you still did the same number of studies but just included 4 conceptually distinct studies each replicated once? That’s still 8 studies, but now the package would include information that would dissipate the fear of being replicated.

Another argument would be that it is almost impossible to get direct replications published. And that is correct. Our only bias more foolish than the bias against null findings is the bias against the value of direct replications. As a result, it would be hard to get direct replications published in mainstream outlets. I have utopian dreams sometimes where I imagine our entire field moving past this bias. One can dream, right?

But this is no longer a real barrier. Some journals or sections of journals are actively fostering the publication of direct replications. Additionally, we have numerous outlets for direct replication research, whether formal ones, such as PLoS ONE or Frontiers, or less formal ones, such as Psychfiledrawer or the Open Science Framework. If you have replication data, it can find a home, and interested parties can see it. Of course, it would help even more if the data were pre-registered.

So there you have it. Be your own replicator. It is a quick, easy, entirely reasonable way of dispelling the inherent tension in the current replication crisis we are enduring.





Sample Sizes in Personality and Social Psychology

R. Chris Fraley

Imagine that you’re a young graduate student who has just completed a research project. You think the results are exciting and that they have the potential to advance the field in a number of ways. You would like to submit your research to a journal that has a reputation for publishing the highest caliber research in your field.

How would you know which journals are regarded for publishing high-quality research?

Traditionally, scholars and promotion committees have answered this question by referencing the citation Impact Factor (IF) of journals. But as critics of the IF have noted, citation rates per se may not reflect anything informative about the quality of empirical research. A paper can receive a large number of citations in the short run because it reports surprising, debatable, or counter-intuitive findings regardless of whether the research was conducted in a rigorous manner. In other words, the citation rate of a journal may not be particularly informative concerning the quality of the research it reports.

What would be useful is a way of indexing journal quality that is based upon the strength of the research designs used in published articles rather than the citation rate of those articles alone.

In an article recently published in PLoS ONE, Simine Vazire and I attempted to do this by ranking major journals in social-personality psychology with respect to what we call their N-pact Factors (NF)–the statistical power of the studies they publish. Statistical power is defined as the probability of detecting an effect of interest when that effect actually exists. Statistical power is relevant for judging the quality of empirical research literatures because, compared to lower powered studies, studies that are highly powered are more likely to (a) detect valid effects, (b) buffer the literature against false positives, and (c) produce findings that other researchers can replicate. Although power is certainly not the only way to evaluate the quality of empirical research, the more power a study has, the better positioned it is to provide useful information and to make robust contributions to the empirical literature.

Our analyses demonstrate that, overall, the statistical power of studies published by major journals in our field tends to be inadequate, ranging from 40% to 77% for detecting the typical kinds of effect sizes reported in social-personality psychology. Moreover, we show that there is considerable variation among journals; some journals tend to consistently publish higher power studies and have lower estimated false positive rates than others. And, importantly, we show that some journals, despite their comparatively high impact factors, publish studies that are greatly underpowered for scientific research in psychology.
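
To give a feel for how power depends on sample size, here is a minimal sketch that approximates the power of a two-sided test of a correlation at several sample sizes. The effect size of r = .20 is only an assumed stand-in for a “typical” social-personality effect, the sample sizes are arbitrary, and the helper name `power_for_r` is hypothetical. It uses the Fisher z approximation and is not the procedure from the PLoS ONE article.

```python
import math
from statistics import NormalDist


def power_for_r(r, n, alpha=0.05):
    """Approximate two-sided power to detect a population correlation r.

    Fisher z approximation: atanh(sample r) is roughly normal with
    mean atanh(r) and standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)           # two-sided critical value
    shift = math.atanh(r) * math.sqrt(n - 3)  # expected test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


# r = 0.20 is an assumed "typical" effect; the Ns are illustrative only.
for n in (50, 100, 200, 400):
    print(f"N = {n}: power = {power_for_r(0.20, n):.2f}")
```

Under these assumptions, a study with N = 100 detects the effect only about half the time, and it takes samples in the hundreds before power becomes comfortable.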

We hope these rankings will help researchers and promotion committees better evaluate various journals, allow the public and the press (i.e., consumers of scientific knowledge in psychology) to have a better appreciation of the credibility of published research, and perhaps even facilitate competition among journals in a way that would improve the net quality of published research. We realize that sample size and power are not and should not be the gold standard in evaluating research. But we hope that this effort will be viewed as a constructive, if incomplete, contribution to improving psychological science.

Simine wrote a nice blog post about some of the issues relevant to this work. Please check it out.



Is It Offensive To Declare A Social Psychological Claim Or Conclusion Wrong?

By Lee Jussim

Science is about “getting it right” – this is so obvious that it should go without saying. However, there are many obstacles to doing so, some relatively benign (an honestly conducted study produces a quirky result), others less so (p-hacking). Over the last few years, discussion of the practices that lead us astray has focused primarily on issues of statistics, methods, and replication.

These are all justifiably important, but here I raise the possibility that other, more subjective factors distort social and personality psychology in ways at least as problematic. Elsewhere, I have reviewed what I now call questionable interpretive practices – how cherrypicking, double standards, blind spots, and embedding political values in research all lead to distorted conclusions (Duarte et al, 2014; Jussim et al, in press a,b).

But there are other interpretation problems. Ever notice how very few social psychological theories are refuted or overturned? Disconfirming theories and hypotheses (including failures to replicate, which are a subset of disconfirmation) should be a normal part of the advance of scientific knowledge. It is ok for you (or me, or Dr. I. V. Famous) to have reached or promoted a wrong conclusion.

In social psychology, this rarely happens. Why not? Many social psychologists seem to balk at declaring some claims “wrong.” This seems to occur primarily for three reasons. The first is that junior scholars, especially pre-tenure, may justifiably feel that potentially angering senior colleagues (who may later be called on to write letters for promotion) is not a wise move. That is the nature of the tenure beast, but it only explains the behavior of, at most, a minority. What about the rest of us?

The second reason is essentially social (i.e., not scientific). Declaring some scientific claim to be “wrong” is, I suspect, often perceived as a personal attack on the claimant. This probably occurs because it is impossible to declare some claim wrong without citing some article making the claim. Articles have authors, so that declaring a claim wrong is tantamount to saying “Dr. Earnest’s claims are wrong.” This problem is further exacerbated by the fact that theories, hypotheses, and phenomena often become identified with either the originators or apostles (prestigious researchers who popularize them). Priming social behavior? Fundamental attribution error? Bystander effect? System justification? Implicit racism? There are individual social psychologists associated with each of these ideas. To challenge the validity, or even the power or generality, of such ideas/effects/theories/hypotheses risks being interpreted as something more than a mere scientific endeavor – it risks being seen as a personal insult to the person identified with them. Thus, declaring a claim “wrong” risks being seen, not as a scientific act of theory or hypothesis disconfirmation, but as a personal attack, and no one supports personal attacks.

The third reason is grounded in a distinctive philosophy of science perspective – namely, that almost every claim is true under some conditions (for explicitly articulated versions of this, see Greenwald, Pratkanis, Leippe, & Baumgardner, 1986; McGuire, 1973, 1983). As such, we have a great deal of research on “conditions under which” some theory or hypothesis holds, but very little research providing wholesale refutation of a theory or hypothesis. I have heard apocryphal stories of prestigious researchers declaring (behind closed doors) that they only run studies to prove what they already know and that they can craft a study to confirm any hypothesis they choose. These apocrypha are not evidence – but the evidence of p-hacking in social psychology and elsewhere (e.g., Ioannidis, 2005; Simmons et al, 2011; Vul et al, 2009) raises the possibility that some unknown number of social psychologists conduct their research in a manner consistent both with these apocrypha and with the notion that everything is true under some conditions. If every claim is true under some conditions, then massive flexibility in methods and data analysis in the service of demonstrating almost any notion becomes, not a flaw to be rooted out of science, but evidence of the “skill” and “craftsmanship” of researchers, and of the “quality” of their research. In this context, declaring any scientific claim, conclusion, hypothesis or theory “wrong” becomes unjustified. It reflects little more than ignorance of this “sophisticated” view of science, and arrogance in the sense that no one, according to this view, can declare anything “wrong” because it is true under some conditions. As such, declaring some claim wrong can again be viewed as an offensive act.

The idea that claims cannot be “wrong” because “every claim is true under some circumstances” goes too far for two reasons. First, some claims are outright false, such as “the Sun revolves around the Earth.” Second, even if two competing claims are both correct under some conditions, this does not mean they are equally true. Knowing that something is true 90% of the time is quite different than knowing it is true 10% of the time. Claiming that some phenomenon is “powerful” or “pervasive,” when the data show it is only rarely true, is wrong. Let’s say that, on average, stereotype biases in person perception are not very powerful or pervasive – which they are not (Jussim, 2012 – multiple meta-analyses yield an average estimate of r = .10 for such biases). Isn’t it better to point out that the field’s long history of declaring them to be powerful and pervasive is wrong (at least when the criterion is the field’s own data), than to just report the data without acknowledging its bearing on longstanding conclusions?

This reluctance to declare certain theories or hypotheses wrong risks leading social psychology to become populated with a plethora of “… undead theories that are ideologically popular but have little basis in fact” (Ferguson & Heene, 2012, p. 555). This amusing phrasing cannot be easily dismissed – ask yourself, “Which theories in social psychology have ever been disconfirmed?” Indeed, Dr. Bruce Alberts, a former President of the National Research Council and editor of Science, put it this way (quoted in The Economist, 2013):

“And scientists themselves need to develop a value system where simply moving on from one’s mistakes without publicly acknowledging them severely damages, rather than protects, a scientific reputation.”

I agree. It is ok to be wrong. In fact, if one engages in enough scientific research for a long enough period of time, one is almost guaranteed to be wrong about something. Good research at its best can be viewed as systematic, creative, and informed trial and error. But that includes … error! Both being wrong sometimes and correcting wrong claims are integral parts of healthy scientific processes.

Furthermore, from a prescriptive standpoint of how science should proceed, I concur with Popper’s (1959/1968) notion that we should seek to disconfirm theories and hypotheses. Ideas left standing in the face of strong attempts at disconfirmation are those most likely to be robust and valid. Thus, rather than being something we social psychologists should shrink away from, bluntly identifying which theories and hypotheses do not (and do!) hold up to tests of logic and existing data should be a core component of how we conduct our science.



Duarte, J. L., Crawford, J. T., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. E. (2014). Political diversity will improve social psychological science. Manuscript that I hope is on the verge of being accepted for publication.

The Economist (October 19, 2013). Trouble at the lab. Retrieved on 7/8/14 from:


Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science’s aversion to the null. Perspectives on Psychological Science, 7, 555-561.

Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct research progress? Psychological Review, 93, 216-229.

Ioannidis, J. P. A. (2005).  Why most published research findings are false. PLOS Medicine, 2, 696-701.

Jussim, L. (2012). Social perception and social reality: Why accuracy dominates bias and self-fulfilling prophecy. NY: Oxford University Press.

Jussim, L., Crawford, J. T., Anglin, S. M., & Stevens, S. T. (In press a). The politics of social psychological science II: Distortions in the social psychology of liberalism and conservatism. To appear in J. Forgas, K. Fiedler, & W. Crano (Eds.), Sydney Symposium on Social Psychology and Politics.

Jussim, L., Crawford, J. T., Stevens, S. T., & Anglin, S. M. (In press b). The politics of social psychological science I: Distortions in the social psychology of intergroup relations. To appear in P. Valdesolo & J. Graham (Eds.), Bridging Ideological Divides: Claremont Symposium on Applied Social Psychology.

McGuire, W. J. (1973). The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 26, 446-456.

McGuire, W. J. (1983). A contextualist theory of knowledge: Its implications for innovation and reform in psychological research. Advances in Experimental Social Psychology, 16, 1-47.

Popper, K. R. (1959/1968). The logic of scientific discovery. New York: Harper & Row.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011).  False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290.
