To see the evil and the good without hiding
You must help me if you can
Doctor, my eyes
Tell me what is wrong
Was I unwise to leave them open for so long

Jackson Browne, "Doctor My Eyes"
I’m having a hard time reading scientific journal articles lately. No, not because I’m getting old, or because my sight is failing, though both are true. No, I’m having trouble reading journals like JPSP and Psychological Science because I don’t believe, can’t believe the research results that I find there.
Mind you, nothing has changed in the journals. You find tightly tuned articles that portray a series of statistically significant findings testing subtle ideas using sample sizes that are barely capable of detecting whether men weigh more than women (Simmons, Nelson, & Simonsohn, 2013). Or, in our new and improved publication motif, you find single, underpowered studies with huge effects that are presented without replication (e.g., short reports). What's more, if you bother to delve into our history and examine any given "phenomenon" that we are famous for in social and personality psychology, you will find a history littered with similar stories: publication after publication with troublingly small sample sizes and commensurate, unbelievably large effect sizes. As we now know, in order to have a statistically significant finding when you employ the typical sample sizes found in our research (n = 50), the observed effect size must not only be large, it must in most cases be an overestimate of the true effect. Couple that with the fact that the average power to detect even the unbelievably large effect sizes that we do report is 50%, and you arrive at the inevitable conclusion that our current and past research simply does not hold up to scrutiny. Thus, much of the history of our field is unbelievable. Or, to be a bit less hyperbolic, some unknown proportion of our findings can't be trusted. That is to say, we have no history, or at least no history we can trust.
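To make that arithmetic concrete, here is a minimal power-analysis sketch in Python (using scipy and statsmodels); the "true" effect of d = 0.40 is an illustrative assumption on my part, not a figure from any particular literature:

```python
# A minimal sketch of the power problem with the "typical" n = 50 study
# (25 per group). The true effect of d = 0.40 below is an illustrative
# assumption, not an estimate from any particular literature.
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n_per_group = 25
alpha = 0.05
df = 2 * n_per_group - 2

# Smallest Cohen's d that can reach p < .05 (two-tailed) at this n:
t_crit = stats.t.ppf(1 - alpha / 2, df)
d_min = t_crit * (2 / n_per_group) ** 0.5
print(f"Minimum significant d with 25 per group: {d_min:.2f}")  # ~0.57

# Power to detect a plausible true effect of d = 0.40:
power = TTestIndPower().power(effect_size=0.40, nobs1=n_per_group,
                              ratio=1.0, alpha=alpha)
print(f"Power for a true d = 0.40: {power:.2f}")  # ~0.28
```

In other words, no effect smaller than roughly d = 0.57 can even reach significance at that sample size, so when the true effect is modest, a "significant" n = 50 study can only report it by overestimating it.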
This was brought home for me recently when a reporter asked me to weigh in on a well-known psychological phenomenon that he was writing about. I poked around the literature and found a disconcertingly large number of "supportive" studies using remarkably small sample sizes and netting (without telling, of course) amazingly large effect sizes, despite the fact that the effect was supposed to be delicate. I mentioned this in passing to a colleague who was more of an expert on the topic, and he said, "Well, the real effect for that phenomenon is much smaller." His comment reflected the fact that he, unlike the reporter, or the textbook writer, or the graduate student, or the scholar from another field, or me, knew about all the failed studies that had never been published. However, if you took the history lodged in our most elite journals at face value, you would have to come to a different conclusion: the effect size was huge in the published literature. If you bother to look at many of our most prized ideas, you will find a similar pattern.
The Beginning of History Effect is, of course, a play on the End of History idea put forward by Fukuyama: that the end of the overt and subtle battles of the Cold War, and the transition to almost universal liberal democracy, would essentially end the tension requisite for the narrative of history to continue. The Beginning of History Effect (no, unfortunately, it is not an illusion) is an attempt to put a positive spin on the fact that we can't rely on our own scientific history. The most positive take on this situation is that we have the chance to make history from here on out by conducting more reliable research. I guess the most telling question is whether there is any reason to be optimistic that we will begin our history anew, or whether we will continue to fight for ideas and questionable methods that have left us little empirical edifice on which to rest our weary bones.
To bring the point home, and to illustrate just how difficult it will be to begin our history over again, I thought it would be instructive to highlight a set of personality findings that are evidently untrue but still get published in our top journals. Specifically, any study that has been published showing a statistically significant link between a candidate gene and any personality phenotype is demonstrably wrong. How do I know? If you spend a little time examining these studies, you will find a very consistent profile. The original study will have what we think is a relatively large sample (hundreds of people) and no replication. Ever. If you go to the supporting literature to find replications, you find none, or the typical "inconsistent" pattern. More tellingly, if you go to the genome-wide association studies (GWAS), you will find that they have never, ever replicated any of the candidate gene studies that litter the history of personality psychology, despite the fact that they contain tens of thousands of participants.
What this means, in the terminology of the current replication crisis in social and personality psychology, is that the effect sizes associated with any given candidate gene polymorphism are so small that they cannot be detected without a sample size in the tens of thousands (if not hundreds of thousands). It is the same low-power issue plaguing experimental psychology, just playing out on a different scale. This should caution against any blanket prescriptions for a priori acceptable sample sizes for any kind of research. The sample size you need is dictated by your effect size, and that can't always be known beforehand. Who would have known that the correct sample size for candidate gene research was 50K? Many people still don't know, including reviewers at our top journals.
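To see where numbers like 50K come from, here is a back-of-the-envelope sketch using the standard Fisher z approximation for the power of a correlation test; the candidate-gene-sized r values (.02, .01) are illustrative assumptions on my part, not estimates from any specific study:

```python
# A back-of-envelope sketch: approximate N needed to detect a correlation r
# at 80% power (two-tailed, alpha = .05) via the Fisher z approximation.
# The tiny r values below are illustrative guesses at candidate-gene-sized
# effects, not estimates taken from any particular study.
from math import atanh, ceil
from scipy.stats import norm

def n_required(r, alpha=0.05, power=0.80):
    """Approximate sample size to detect a correlation of r."""
    z_effect = atanh(r)  # Fisher z transform of r
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil((z_sum / z_effect) ** 2 + 3)

for r in (0.30, 0.10, 0.02, 0.01):
    print(f"r = {r:.2f} -> N ~ {n_required(r):,}")
# r = 0.30 -> N ~ 85
# r = 0.10 -> N ~ 783
# r = 0.02 -> N ~ 19,620   (tens of thousands)
# r = 0.01 -> N ~ 78,487   (approaching hundreds of thousands)
```

The point of the sketch is simply that the required N blows up with the inverse square of the effect size, which is why a sample that looks heroic by personality psychology standards (N = 500) is hopeless for candidate-gene-sized effects.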
The interesting, and appalling, thing about the genetics research in personality psychology is that the geneticists knew all along that the modal approach used in our research linking candidate genes (or even GWAS data) to phenotypes was wrong from the beginning (Terwilliger & Goring, 2000). In fact, the current arguments in genetics revolve around whether the right genetic model is a "rare variant" or an "infinitesimal" model (Gibson, 2012). Either model accepts the fact that there are almost no common genetic variants that have a notable relation to any phenotype of interest, in personality, in psychology, or otherwise. And by notable, I mean an effect size that is detectable using standard operating procedures in personality psychology (e.g., an N of 100 to 500).
What this means in practical terms is that a bunch of research, some done by close friends and colleagues, is patently wrong. And by close friends, I mean really close friends, award-winning close friends. What are we going to do about that? What am I supposed to do about that? Simply ignore it? Talk about it in the hallways at conferences and over drinks at the bar? Tell people quietly that they shouldn't really do that type of research?
Multiply this dilemma across our subfields and you see the problem we face. So, maybe we should hit the reboot button and start our history over again. At the very least, we need to confront our history. Our current approach to the replication crisis is either to deny it or to recommend changes to the way we do our current research. Given our history of conducting unreliable research, we need to do more. In other essays I've called for a USADA of psychology to monitor our ongoing use of improper research methods. I think we need something more like a Consumer Reports for Psychology. We need a group of researchers to go back and redo our key findings to show whether they are reliable, to evaluate the sturdiness and value of our various concepts year in and year out. Brian Nosek's Reproducibility Project has started in this direction, but we need more. We need to vet our legacy; otherwise our research findings are of unknown value, at best.
Brent W. Roberts