The New Rules of Research

by Brent W. Roberts

A paper on one of the most important research projects of our generation came out a few weeks ago. I’m speaking, of course, of the Reproducibility Project conducted by several hundred psychologists. It is a tour de force of good science. Most importantly, it provided definitive evidence about the state of the field. Despite the fact that 97% of the original studies reported statistically significant effects, only 36% hit the magical p < .05 mark when closely replicated.

Two defenses have been raised against the effort. The first, described by some as the “move along folks, there’s nothing to see here” defense, proposes that a 36% replication rate is no big deal. It is to be expected given how tough it is to do psychological science. At one level I’m sympathetic to the argument that science is hard to do, especially psychological science. It is the case that very few psychologists have even 36% of their ideas work out. And by “work,” I mean in the traditional sense of the word, which is to net a p value less than .05 in whatever type of study you run. On the other hand, to make this claim about published work is disingenuous. When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up. If we really believed that our published work was so ephemeral, then much of our behavior in response to the reproducibility crisis would be nonsensical. If we all knew and expected our work not to replicate most of the time, we wouldn’t get upset when it didn’t. We have disproven that point many times over. If we thought our effects that passed the p < .05 threshold were so flimsy, we would all write caveats at the end of our papers warning other researchers to be wary of our results because they were unlikely to replicate. We never do that. If we really thought so little of our results, we would not write such confident columns for the New York Times espousing our findings, stand up on the TED stage and claim such profound conclusions, or speak to the press in such glowing terms about the implications of our unreliable findings. But we do. I won’t get into the debate over whether this is a crisis or not, but please don’t pass off a 36% reproducibility rate as if it is the norm, expected, or a good thing. It is not.

The second argument, which is somewhat related, is to restate the subtle moderator idea. It is disturbingly common to hear people argue that the reason a study does not replicate is because of subtle differences in the setting, sample, or demeanor of the experimenter across labs. To invoke this is problematic for several reasons. First, it is an acknowledgment that you haven’t been keeping up with the scholarship surrounding reproducibility issues. The Many Labs 3 report addressed this hypothesis directly and found little evidence that sample or setting moderated the effects. Second, it means you are walking back almost every finding ever covered in an introductory psychology textbook. It makes me cringe when I hear what used to be a brazen scientist who had no qualms generalizing his or her findings based on psychology undergraduates to all humans now claiming that their once robust effects are fragile, tender shoots that only grow on the West Coast and not in the Midwest. I’m not sure the folks invoking this argument realize that this is worse than having 64% of our findings not replicate. At least 36% did work. The subtle moderator take on things basically says we can ignore that 36% too, because yet unknown subtle moderators will render them ungeneralizable if tested a third time. While I am no fan of the over-generalization of findings based on undergraduate samples, I’m not yet willing to give up the aspiration of finding things out about humans. Yes, humans. Third, if this were such a widely accepted fact, and not something solely invoked after our work fails to replicate, then again, our reactions to the failures to replicate would be different. If we never expected our work to replicate in the first place, our reactions to failures to replicate wouldn’t be as extreme as they’ve been.

One thing that has not happened much in response to the Reproducibility Report is a set of recommendations for changing the way we do things. With that in mind, and in homage to Bill Maher, I offer a list of the “New Rules of Research[1]” that follow, at least in my estimation, from taking the results of the Reproducibility Report seriously.

  1. Direct replication is yooge (huge). Just do it. Feed the science. Feed it! Good science needs reliable findings, and direct replication is the quickest way to good science. Don’t listen to the apologists for conducting only conceptual replications. Don’t pay attention to the purists who argue that all you need is a large sample. Build direct replications into your work so that you know yourself whether your effects hold up. At the very least, doing your own direct replications will save you from the evils of sampling error. At the very most, you may catch errors in your protocol that could affect results in unforeseen ways. Then share it with us however you can. When you are done with that, do some service to the field and replicate someone else’s work.
  2. If your finding fails to replicate, the field will doubt your finding—for now. Don’t take it personally. We’re just going by base rates. After all, less than half of our studies replicate on average. If your study fails to replicate, you are in good company—the majority. The same thing goes if your study replicates. Two studies do not make a critical mass of evidence. Keep at it.
  3. Published research in top journals should have high informational value. In the parlance of the NHSTers, this means high power. For the Bayesian folks, it means compelling evidence that is robust across a range of reasonable priors. Either way, we know from some nice simulations that for the typical between-subjects study this means a minimum of 165 participants for average main effects and more than 400 participants for 2×2 between-subjects interaction tests (a minimal power-analysis sketch follows this list). You need even more observations if you want to get fancy or reliably detect infinitesimal effect sizes (e.g., birth order and personality, genetic polymorphisms and any phenotype). We now have hundreds of studies that have failed to replicate, and the most powerful reason is the lack of informational value in the design of the original research. Many protest that the burden of collecting all of those extra participants will cost too much time, effort, and money. While it is true that increasing our average sample size will make doing our research more difficult, consider the current situation, in which 64% of our studies fail to replicate and are therefore a potential waste of time to read and review because they were poorly designed from the start (e.g., small-N studies with no evidence of direct replication). We waste countless dollars and hours of our time processing, reviewing, and following up on poorly designed research. The time spent collecting more data in the first place will be well worth it if the consequence is increasing the amount of reproducible and replicable research. And the journals will love it because we will publish less and their impact factors will inevitably go up—making us even more famous.
  4. The gold standard for our science is a pre-registered direct replication by an independent lab. A finding is not worth touting or inserting in the textbooks until a well-powered, pre-registered, direct replication is published. Well, to be honest, it isn’t worth touting until a good number of well-powered, pre-registered, direct replications have been published.
  5. The peer-reviewed paper is no longer the gold standard. We need to de-reify the publication as the unit of exaltation. We shouldn’t be winning awards, or tenure, or TED talks for single papers. Conversely, we shouldn’t be slinking away in shame if one of our studies fails to replicate. We are scientists. Our job is, in part, to figure out how the world works. Our tools are inherently flawed and will sometimes give us the wrong answer. Other times we will ask the wrong question. Often we will do things incorrectly even when our question is good. That is okay. What is not okay is to act as if our work is true just because it got published. Updating your priors should be an integral part of doing science.
  6. Don’t leave the replications to the young. Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow.
  7. Don’t trust anyone over 50[2]. You might have noticed that the persons most likely to protest the importance of direct replications, or who seem willing to accept a 36% replication rate as “not a crisis,” are all chronologically advanced and eminent. And why wouldn’t they want to keep the status quo? They built their careers on the one-off, counter-intuitive, amazeballs research model. You can’t expect them to abandon it overnight, can you? That said, if you are young, you might want to look elsewhere for inspiration and guidance. At this juncture, defending the status quo is like arguing to stay on board the Titanic.
  8. Stop writing rejoinders. Especially stop writing rejoinders that say 1) there were hidden, subtle moderators (that we didn’t identify in the first place), and 2) a load of my friends and their graduate students conceptually replicated my initial findings so it must be kind of real. Just show us more data. If you can reliably reproduce your own effect, show it. The more time you spend on a rejoinder and not producing a replication of your own work, the less the field will believe your original finding.
  9. Beware of meta-analyses. As Daniël Lakens put it: bad data + good data does not equal good data. As much as it pains me to say it, since I like meta-analyses, they are no panacea. Meta-analyses are especially problematic when a bunch of data has been p-hacked into submission and is then included alongside some high-quality data (a toy simulation of this problem follows this list). The most common result of this combination is an effect that is different from zero, and thus statistically significant, but strikingly small compared to the original finding. Then you see the folks who published the original finding (usually with a d of .8 or 1) trumpeting the meta-analytic findings as proof that their idea holds, without facing the fact that the flawed meta-analytic effect size is so small that they could never have detected it using the methods they used to detect it in the first place.
  10. If you want anyone to really believe your direct or conceptual replication, then pre-register it. Yes, we know, there will be folks who will collect the data, then analyze it, then “pre-register” it after the fact. There will always be cheaters in every field. Nonetheless, most of us are motivated to find the truth, and eventually, if the gold standard is applied (see rule #4), we will get better estimates of the true effect. In the meantime, pre-register your own replication attempts and the field will be better for your efforts.
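
To make rule 3 concrete, here is a minimal power-analysis sketch in Python using statsmodels. It is not the set of simulations the rule cites; the effect sizes (d = 0.43 for a “typical” main effect, and half that for an attenuated effect such as an interaction contrast) are illustrative assumptions chosen only because they land near the numbers quoted above. What it shows is the arithmetic: required sample size scales with 1/d², so halving the effect size roughly quadruples the participants you need.

```python
# A hedged sketch of the sample-size arithmetic behind rule 3.
# The effect sizes below are illustrative assumptions, not values taken from
# the simulations the post cites.
from statsmodels.stats.power import TTestIndPower

ttest_power = TTestIndPower()

# Per-group n for a two-sample t-test at 80% power, alpha = .05, assumed d = 0.43
n_main = ttest_power.solve_power(effect_size=0.43, alpha=0.05, power=0.80)
print(f"Assumed main effect (d = 0.43): ~{n_main:.0f} per group, ~{2 * n_main:.0f} total")

# Halving the effect size (as with an attenuated interaction contrast)
# roughly quadruples the required per-group n.
n_half = ttest_power.solve_power(effect_size=0.43 / 2, alpha=0.05, power=0.80)
print(f"Half-sized effect (d = 0.215): ~{n_half:.0f} per group")
```

And to make rule 9 concrete, here is a toy simulation of the “bad data + good data” problem. Every number in it (a true effect of d = 0.10, ten small studies of n = 20 per group kept only if significant, one large unselected replication with n = 400 per group) is an assumption for illustration, not a reanalysis of any real literature. The pooled fixed-effect estimate typically comes out statistically significant, yet far smaller than the ds the “published” small studies report, and far too small for a 20-per-group design to have detected.

```python
# Toy illustration of how significance-selected small studies plus one good
# study can yield a significant but misleading pooled effect.
# All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
TRUE_D, N_SMALL, N_BIG = 0.10, 20, 400

def one_study(n):
    """Simulate a two-group study; return Cohen's d and its sampling variance."""
    treat = rng.normal(TRUE_D, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    pooled_sd = np.sqrt((treat.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treat.mean() - control.mean()) / pooled_sd
    var_d = 2.0 / n + d ** 2 / (4.0 * n)   # large-sample variance of d
    return d, var_d

# "Published" small studies: kept only if significant, a crude stand-in
# for p-hacking and publication bias.
published = []
while len(published) < 10:
    d, v = one_study(N_SMALL)
    if d / np.sqrt(v) > 1.96:
        published.append((d, v))

published.append(one_study(N_BIG))          # one large, unselected replication

# Fixed-effect (inverse-variance) pooling
weights = np.array([1.0 / v for _, v in published])
effects = np.array([d for d, _ in published])
d_pooled = np.sum(weights * effects) / np.sum(weights)
se_pooled = 1.0 / np.sqrt(np.sum(weights))

print(f"mean d of the selected small studies: {effects[:-1].mean():.2f}")
print(f"pooled d: {d_pooled:.2f}  (z = {d_pooled / se_pooled:.1f})")
```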

[1] Of course, many of these are not at all new. But, given the reactions to the Reproducibility Report and the continued invocation of any reason possible to avoid doing things differently, it is clear that these rules are new to some.

[2] Yes, that includes me. And, yes, I know that there are some chronologically challenged individuals on the pro-reproducibility side of the coin. That said, among the outspoken critics of the effort I count a disproportionate number of eminent scientists without even scratching the surface.


Responses to The New Rules of Research

  1. Hi!

    Really thought it was great.

    I would say, though, that the hidden moderator argument is not all bad, so far as I can see. It seems like we wouldn’t expect all 100 studies to be totally replicable across time and culture (or I wouldn’t, at least). Also, I would be open to a moderator being discovered for the other 36%, or really for most findings (a way to turn the effect off), though maybe not in a direct replication (though even then culture, lab conditions, and sampling/advertising style can differ).

    Really liked the point about the reactions to the work, though honestly I like to keep my own counsel about what effects I think are real or not. I am agnostic about most psych, I would say, and only very confident about the findings I work closely with.

    I would like more reasons to do direct replication; I’ve now done three experiments which are very close replications but not direct-direct.

    Also, I would like to see more discussion about what preregistration means. There is no good first link for a Google search of ‘how to preregister a hypothesis’, which is a blog post I might write over the weekend. But whoever does that is really setting the standard. I would like to stress how effortless and valuable it is.

    Rules 5–9 I thought were super 😀 stop writing rejoinders and produce data. 😀

    If everyone added a caveat that their results might not replicate across all times and places, that would be many pages of just that, it seems. But really, I thought it was nice.

    Best,
    Brett

  2. Pingback: What does it mean to preregister a study? | The Psycholar

  3. David Colquhoun says:

    Very good. But I don’t much like the “don’t trust anyone over 50” rule. I’ve spent much of the last 10 years pointing out these problems, and my paper on the misinterpretation of P values is proving to be quite popular: http://rsos.royalsocietypublishing.org/content/1/3/140216. I’m 79 now, dammit.

    That being said, I do tend to get cheers when I say to young audiences, “never trust your elders, but not necessarily betters”. It’s the absurd hyper-competitiveness of senior academics, and the obsession of some of them with statistically-illiterate metrics, which has caused some of the problems.

  4. D. says:

    “When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up”

    I have thought about this a lot, and still don’t know what to think. One of the definitions the dictionary provides is that “science” is “a systematically organized body of knowledge on a particular subject”. To me, this means that indeed published effects are “real” and will hold up.

    However, a lot of people state something like “science is a process, which brings us a little closer to the truth”. I think people who think that there is no current crisis actually believe that their low-powered, p-hacked study is worthy of being published because other people can replicate it and see if it holds up, or it could “inspire” people to think about the topic/results. To me, that’s like throwing all your garbage over the fence into the neighbors’ yard. You take no responsibility for what you put out there, and don’t think about all the possible wasted resources that other researchers will face when trying to replicate, or build on, your work.

    “Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow”

    It’s really a shame when you think of the fact that perhaps a lot of senior people have actively contributed to the current problems (e.g., low-powered, p-hacked studies) and still do nothing about it when they are in the best position to do so. If that were me, I wouldn’t be able to sleep at night, but perhaps that is why they say to themselves, and others, that nothing is wrong…

  5. Like the other commenters, I am very much sympathetic to what you (and many others) are saying here. Still, I think there’s a structural issue that is fundamental to the discussion, but which you don’t mention.

    It’s this: even scientists (young, old, whatever) who agree with the agenda set out here have little incentive to actually pursue it. On the contrary, they have many structural incentives to continue with the old approach. Publish or perish, after all. Basically, I suspect and fear that many/most young scientists will read this, or encounter other sentiments of a similar flavour, and think “Well, yes, that all sounds good, but as things currently stand, actually doing this will probably reduce my chances of getting a job. I don’t want to be a martyr to replication”. Not all will think this, but many will. And I think it’s a reasonable response. It saddens me that they face this choice between noble science and career-advancing science, but they do.

    Let me put the point another way. I think the target of your article is wrong. It’s aimed at practising scientists, but it should actually be aimed at hiring committees, grant boards, tenure boards, and the like. (These are often the same people, I know, but they wear different hats at different times.) Tell those panels to value replication. Because if they do, and if word gets out about that, scientists will follow suit. And quickly too. In short, the solution is to make the career choice the same as the noble choice. Your article should be entitled “The new rules of hiring”, not “The new rules of research”.

  6. Pingback: Better Incentives, Better Science – Ions

  7. Pingback: The Reproducibility Project and Textbook Reporting of Psychological Science | MENTAL TRAPS
