by Brent W. Roberts
A paper on one of the most important research projects in our generation came out a few weeks ago. I’m speaking, of course, of the Reproducibility Project conducted by several hundred psychologists. It is a tour de force of good science. Most importantly, it provided definitive evidence for the state of the field. Despite the fact that 97% of the original studies reported statistically significant effects, only 36% hit the magical p < .05 mark when closely replicated.
Two defenses have been raised against the effort. The first, described by some as the “move along folks, there’s nothing to see here” defense, proposes that a 36% replication rate is no big deal. It is to be expected given how tough it is to do psychological science. At one level I’m sympathetic to the argument that science is hard to do, especially psychological science. It is the case that very few psychologists have 36% of their ideas work. And, by work, I mean in the traditional sense of the word, which is to net a p value less than .05 in whatever type of study you run. On the other hand, to make this claim about published work is disingenuous. When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up. If we really believed that our published work was so ephemeral, then much of our behavior in response to the reproducibility crisis has been nonsensical. If we all knew and expected our work not to replicate most of the time, then we wouldn’t get upset when it didn’t. We have disproven that point many times over. If we thought our effects that passed the p < .05 threshold were so flimsy, we would all write caveats at the end of our papers saying other researchers should be wary of our results as they were unlikely to replicate. We never do that. If we really thought so little of our results we would not write such confident columns to the New York Times espousing our findings, stand up on the TED stage and claim such profound conclusions, or speak to the press in such glowing terms about the implications of our unreliable findings. But we do. I won’t get into the debate over whether this is a crisis or not, but please don’t pass off a 36% reproducibility rate as if it is the norm, expected, or a good thing. It is not.
The second argument, which is somewhat related, is to restate the subtle moderator idea. It is disturbingly common to hear people argue that a study does not replicate because of subtle differences in the setting, sample, or demeanor of the experimenter across labs. To invoke this is problematic for several reasons. First, it is an acknowledgment that you haven’t been keeping up with the scholarship surrounding reproducibility issues. The Many Labs 3 report addressed this hypothesis directly and showed that the null hypothesis could not be rejected. Second, it means you are walking back almost every finding ever covered in an introductory psychology textbook. It makes me cringe when I hear what used to be a brazen scientist who had no qualms generalizing his or her findings based on psychology undergraduates to all humans, claiming that their once robust effects are fragile, tender shoots, that only grow on the West Coast and not in the Midwest. I’m not sure if the folks invoking this argument realize that this is worse than having 64% of our findings not replicate. At least 36% did work. The subtle moderator take on things basically says we can ignore the remaining 36% too because yet unknown subtle moderators will render them ungeneralizable if tested a third time. While I am no fan of the over-generalization of findings based on undergraduate samples, I’m not yet willing to give up the aspiration of finding things out about humans. Yes, humans. Third, if this were such a widely accepted fact, and not something solely invoked after our work fails to replicate, then again, our reactions to the failures to replicate would be different. If we never expected our work to replicate in the first place, our reactions to failures to replicate wouldn’t be as extreme as they’ve been.
One thing that has not really occurred much in response to the Reproducibility Report is to recommend some changes to the way we do things. With that in mind, and in homage to Bill Maher, I offer a list of the “New Rules of Research” that follow, at least in my estimate, from taking the results of the Reproducibility Report seriously.
- Direct replication is yooge (huge). Just do it. Feed the science. Feed it! Good science needs reliable findings and direct replication is the quickest way to good science. Don’t listen to the apologists for conducting only conceptual replications. Don’t pay attention to the purists who argue that all you need is a large sample. Build direct replications into your work so that you know yourself whether your effects hold up. At the very least, doing your own direct replications will save you from the evils of sampling error. At the very most, you may catch errors in your protocol that could affect results in unforeseen ways. Then share it with us however you can. When you are done with that, do some service to the field and replicate someone else’s work.
- If your finding fails to replicate, the field will doubt your finding—for now. Don’t take it personally. We’re just going by base rates. After all, less than half of our studies replicate on average. If your study fails to replicate, you are in good company—the majority. The same thing goes if your study replicates. Two studies do not make a critical mass of evidence. Keep at it.
- Published research in top journals should have high informational value. In the parlance of the NHSTers this means high power. For the Bayesian folks, compelling evidence that is robust across a range of reasonable priors. Either way, we know from some nice simulations that for the typical between-subjects study this means that we need a minimum of 165 participants for average main effects and more than 400 participants for 2×2 between-subjects interaction tests. You need even more observations if you want to get fancy or reliably detect infinitesimal effect sizes (e.g., birth order and personality, genetic polymorphisms and any phenotype). We now have hundreds of studies that have failed to replicate and the most powerful reason is the lack of informational value in the design of the original research. Many protest that the burden of collecting all of those extra participants will cost too much time, effort, and money. While it is true that increasing our average sample size will make doing our research more difficult, consider the current situation in which 64% of our studies fail to replicate and are therefore a potential waste of time to read and review because they are poorly designed to start with (e.g., small N studies with no evidence of direct replication). We waste countless dollars and hours of our time processing, reviewing, and following up on poorly designed research. The time spent collecting more data in the first place will be well worth it if the consequence is increasing the amount of reproducible and replicable research. And, the journals will love it because we will publish less and their impact factors will inevitably go up, making us even more famous.
- The gold standard for our science is a pre-registered direct replication by an independent lab. A finding is not worth touting or inserting in the textbooks until a well-powered, pre-registered, direct replication is published. Well, to be honest, it isn’t worth touting until a good number of well-powered, pre-registered, direct replications have been published.
- The peer-reviewed paper is no longer the gold standard. We need to de-reify the publication as the unit of exaltation. We shouldn’t be winning awards, or tenure, or TED talks for single papers. Conversely, we shouldn’t be slinking away in shame if one of our studies fails to replicate. We are scientists. Our job is, in part, to figure out how the world works. Our tools are inherently flawed and will sometimes give us the wrong answer. Other times we will ask the wrong question. Often we will do things incorrectly even when our question is good. That is okay. What is not okay is to act as if our work is true just because it got published. Updating your priors should be an integral part of doing science.
- Don’t leave the replications to the young. Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow.
- Don’t trust anyone over 50. You might have noticed that the persons most likely to protest the importance of direct replications or who seem willing to accept a 36% replication rate as “not a crisis” are all chronologically advanced and eminent. And why wouldn’t they want to keep the status quo? They built their careers on the one-off, counter-intuitive, amazeballs research model. You can’t expect them to abandon it overnight, can you? That said, if you are young, you might want to look elsewhere for inspiration and guidance. At this juncture, defending the status quo is like arguing to stay on board the Titanic.
- Stop writing rejoinders. Especially stop writing rejoinders that say 1) there were hidden, subtle moderators (that we didn’t identify in the first place), and 2) a load of my friends and their graduate students conceptually replicated my initial findings so it must be kind of real. Just show us more data. If you can reliably reproduce your own effect, show it. The more time you spend on a rejoinder and not producing a replication of your own work, the less the field will believe your original finding.
- Beware of meta-analyses. As Daniël Lakens put it: bad data + good data does not equal good data. As much as it pains me to say it, since I like meta-analyses, they are no panacea. Meta-analyses are especially problematic when a bunch of data has been p-hacked into submission and it is included with some high quality data. The most common result of this combination is to find an effect that is different from zero and thus statistically significant but strikingly small compared to the original finding. Then, you see the folks who published the original finding (usually with a d of .8 or 1) trumpeting the meta-analytic findings as proof that their idea holds, without facing the fact that the flawed meta-analytic effect size is so small that they would have never detected it using the methods they used to detect it in the first place.
- If you want anyone to really believe your direct or conceptual replication, then pre-register it. Yes, we know, there will be folks who will collect the data, then analyze it, then “pre-register” it after the fact. There will always be cheaters in every field. Nonetheless, most of us are motivated to find the truth and eventually if the gold standard is applied (see rule #4), we will get better estimates of the true effect. In the meantime, pre-register your own replication attempts and the field will be better for your efforts.
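For readers who like to check the arithmetic, here is a rough Python sketch behind two of the rules above: the one about sample size and the one about meta-analyses. It is not the simulation work the sample-size rule refers to. It uses a textbook normal approximation for a two-group comparison, so it will not reproduce the 165 figure exactly. The function names and every number in the example (d = 0.8, 20 per group, 500 per group, ten small studies, two large ones) are hypothetical values chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for all approximations below


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate participants per group needed to detect a standardized
    mean difference d in a two-group design (normal approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def power_two_group(d, n, alpha=0.05):
    """Approximate two-sided power for effect size d with n per group."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n / 2)  # expected z statistic under the alternative
    return STD_NORMAL.cdf(ncp - z_alpha) + STD_NORMAL.cdf(-ncp - z_alpha)


def pooled_effect(studies):
    """Fixed-effect, inverse-variance pooled d for (d, n_per_group) pairs.
    Returns the pooled estimate and its standard error."""
    total_weight = 0.0
    weighted_sum = 0.0
    for d, n in studies:
        variance = 2 / n + d ** 2 / (4 * n)  # large-sample variance of d, equal groups
        weight = 1 / variance
        total_weight += weight
        weighted_sum += weight * d
    return weighted_sum / total_weight, 1 / sqrt(total_weight)


# Sample size: smaller assumed effects need far more people per group.
for assumed_d in (0.8, 0.4, 0.2):
    print(f"d = {assumed_d}: about {n_per_group(assumed_d)} per group")

# Meta-analysis: ten small studies reporting d = 0.8 (hypothetical, think
# p-hacked) pooled with two large studies that found nothing.
small_hyped = [(0.8, 20)] * 10
large_null = [(0.0, 500)] * 2
pooled_d, se = pooled_effect(small_hyped + large_null)
z = pooled_d / se
power_original_design = power_two_group(pooled_d, 20)

print(f"pooled d = {pooled_d:.3f}, z = {z:.2f}")
print(f"power of a 20-per-group study to detect it: {power_original_design:.2f}")
```

Under these made-up inputs the pooled effect comes out statistically significant but is a small fraction of the original d of 0.8, and a 20-per-group study would almost never detect an effect that small. Swap in your own numbers before drawing conclusions about any particular literature.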
Of course, many of these are not at all new. But, given the reactions to the Reproducibility Report and the continued invocation of any reason possible to avoid doing things differently, it is clear that these rules are new to some.
Yes, that includes me. And, yes, I know that there are some chronologically challenged individuals on the pro-reproducibility side of the coin. That said, among the outspoken critics of the effort I count a disproportionate number of eminent scientists without even scratching the surface.