By Brent W. Roberts
As of late, psychological science has arguably done more to address the ongoing believability crisis than most other areas of science. Many notable efforts have been put forward to improve our methods. From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are doing more than researchers in any other science to rectify practices that allow far too many unbelievable findings to populate our journal pages.
The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes. We need to replace/supplement the typical reporting practices and statistical approaches by:
- Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
- Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
- Providing more information about effect sizes in each report, such as what the effect sizes are for each analysis and their respective confidence intervals.
- Valuing direct replication.
It seems pretty simple. Actually, the proposed changes are simple, even mundane.
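To make "mundane" concrete, here is a minimal sketch of the second and third recommendations using only the Python standard library. It is an illustration under stated assumptions, not a prescribed workflow: the function names and the two groups of scores are made up for the example, the confidence interval uses the common large-sample approximation to the standard error of Cohen's d, and the sample-size calculation uses a normal approximation for a two-sided, two-group comparison (an exact t-based calculation gives slightly larger numbers).

```python
import math
from statistics import NormalDist, mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def d_confidence_interval(d, n_a, n_b, level=0.95):
    """Approximate CI for d based on its large-sample standard error."""
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group (two-sided, two groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical scores for two small groups (illustrative data only).
treatment = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5, 6.8, 6.4]
control = [4.9, 5.2, 5.6, 4.4, 5.9, 5.0, 5.3, 4.7]

d = cohens_d(treatment, control)
low, high = d_confidence_interval(d, len(treatment), len(control))
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
print("n per group for d = 0.5 at 80% power:", n_per_group(0.5))
```

With only eight observations per group the interval is necessarily wide, which is the point of asking for both the effect size and its precision. Under the same approximation, a "medium" effect of d = 0.5 calls for roughly 63 participants per group, and a small effect of d = 0.2 for several hundred.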
What has been most surprising is the consistent pushback and protests against these seemingly innocuous recommendations. When confronted with these recommendations, many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF. A striking number of individuals argue against, and are quite disdainful of, reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, prototypical protests are:
- Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
- Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so that documenting effect sizes is meaningless.
- Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
- Direct replications are uninteresting and uninformative.
- Conceptual replications are to be preferred because we are testing theories, not confirming techniques.
While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended. After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60 odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests. If smart people balk at seemingly benign changes, then there must be something psychologically big lurking behind those protests. What might that big thing be? I believe the reason for the passion behind the protests lies in the fact that, though mundane, the changes that are being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.
I think this confrontation needs to be more closely examined because we need to consider the challenges and consequences of deconstructing our incentive system and status structure. This, then, raises the question: what is our incentive system, and just what are we proposing to do to it? For this, I believe a good analogy is the dilemma faced by Harry Potter in the last book of the eponymously titled book series.
The Deathly Hallows of Psychological Science
In the last book of the Harry Potter series “The Deathly Hallows,” Harry Potter faces a dilemma. Should he pursue the destruction of the Horcruxes or gather together the Deathly Hallows? The Horcruxes are pieces of Voldemort’s soul encapsulated in small trinkets, jewelry, and such. If they were destroyed, then it would be possible to destroy Voldemort. The Deathly Hallows are three powerful magical objects, which are alluring because by possessing all three, one becomes the “master of death.” The Deathly Hallows are the Cloak of Invisibility, the Elder Wand, and the Resurrection Stone. The dilemma Harry faced was whether to pursue and destroy the Horcruxes, which was a painful and difficult path, or to pursue the Deathly Hallows, with which he could quite possibly conquer Voldemort and, if not conquer him, live on despite him. He chose to destroy the Horcruxes.
Like Harry Potter, the field of psychological science (and many other sciences) faces a similar dilemma. Pursue changes in our approach to science that eliminate problematic practices that lead to unreliable science—a “destroy the Horcrux” approach. Or, continue down the path of least resistance, which is nicely captured in the pursuit of the Deathly Hallows.
What are the Deathly Hallows of psychological science? I would argue that the Deathly Hallows of psychological science, which I will enumerate below, are 1) p values less than .05, 2) experimental studies, and 3) counter-intuitive findings.
Why am I highlighting this dilemma at this time? I believe we are at a critical juncture. The nascent efforts at reform may either succeed or fade away as they have done so many times before. For it is a fact that we’ve confronted this dilemma many times before and have failed to overcome the allure of the Deathly Hallows of psychological science. Eminent methodologists such as Cohen, Meehl, Lykken, Gigerenzer, Schmidt, Fraley, and lately Cumming, have told us how to do things better since the 1960s to no avail. Revising our approach to science has never been a question of knowing the right thing to do, but rather it has been whether we were willing to do the thing we knew was right.
The Deathly Hallows of Psychological Science: p-values, experiments, and counter-intuitive/surprising findings
The cloak of invisibility: p<.05. The first Deathly Hallow of psychological science is the infamous p-value. You must attain a p-value less than .05 to be a success in psychological science. Period. If your p-value is greater than .05, you have no finding and nothing to say. Without anything to say, you cannot attain status in our field. Find a p-value below .05 and you can wrap it around yourself and hide from the contempt aimed at those who fail to cross that magical threshold.
Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find it. We root around in our data, digging up p-values by cherry-picking studies, selectively reporting outcomes, or performing some arcane statistical wizardry. One only has to read Bem’s classic article on how to write an article in psychological science to see how we approach p-values as a field:
“…the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indices. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something–anything–interesting.”
What makes it worse is that when authors try to report null effects they are beaten down because we as reviewers and editors do everything in our power to hide the null effects. Null effects make for a messy narrative. Our most prestigious journals almost never publish null effects because reviewers and editors act as gatekeepers and mistakenly recommend against publishing null effects. Consider the following personal example. In one study, reviewer 2 argued that our study was not up for publication in JPSP because one of our effects was null (there were other reasons too). Consider the fact that the null effect in question was a test of a hypothesis drawn from my own theory. I was trying to show that my theory did not work all of the time and the reviewer was criticizing me for showing that my own ideas might need revision. This captures quite nicely the tyranny of the p-value. The reviewer was so wedded to my ideas that he or she wouldn’t even let me, the author of said ideas, offer up some data that would argue for revising them.
When null effects are not grounds for outright rejection, we often recommend cutting them. I have seen countless recommendations in reviews of my papers and the papers of colleagues to simply drop studies or results that show null effects. It is not surprising, then, that psychology confirms 95% of its hypotheses.
Even worse, we often commit the fundamental attribution error by thinking that the person trying to publish null effects is an incompetent researcher—especially if they fail to replicate an already published effect that has crossed the magical p< .05 threshold. Not to be too cynical, but the reviewers may have a point. If you are too naïve to understand “the game”, which is to produce something with p < .05, then maybe you shouldn’t succeed in our field. Setting sarcasm aside, what the gatekeepers don’t understand is that they are sending a clear message to graduate students and assistant professors that they must compromise their own integrity in order to succeed in our field. Of course, this leads to the winnowing of the field of researchers who don’t want to play the game.
The Elder Wand: Running Experiments
Everyone wants to draw a causal conclusion, even observational scientists. And, of course, the best way to draw a causal conclusion, if you are not an economist, is to run an experiment. The second Deathly Hallow for psychological science is doing experimental research at all costs. As one of my past colleagues told a first year graduate student, “if you have a choice between a correlational or an experimental study, run an experiment.”
Where things go awry, I suspect, is when you value experiments so much that you do anything in your power to avoid any other method. This leads to levels of artificiality that can get perverse. Rather than studying the effect of racism, we study the idea of racism. Where we go wrong is that, as Cialdini has noted before, we seldom work back and forth between the fake world of our labs and the real world where the phenomenon of interest exists. We become methodologists, rather than scientists. We prioritize lab-based experimental methods not because they necessarily help us illuminate or understand our phenomenon, but because they are most valued by our guild and putatively lead to causal inferences. One consequence of valuing experiments so highly is that we get caught up in a world of hypothetical findings that have unknown relationships to the real world because we seldom if ever touch base with applied or field research. As Cialdini so poignantly pointed out, we simply don’t value field research enough to pursue it with equal vigor to lab experiments.
And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good. To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.
Of course, now that we know many researchers run as many experiments as they need to in order to construct whatever findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand-wielding wizard of an experimentalist simply because that is the best route to fame and fortune in our guild.
Most of my work has been in the observational domain, and admittedly, we have the same motive, but lack the opportunity to implement our desires for causal inference. So far, the IRB has not agreed to let us randomly assign participants to the “divorced or not divorced” or the “employed or unemployed” conditions. In the absence of being able to run a good, clean experiment, observational researchers, like myself, bulk up on the statistics as a proxy for running an experiment. The fancier, more complex, and indecipherable the statistics, the closer one gets to the status of an experimenter. We even go so far as to mistake our statistical methods, such as cross-lagged panel models and longitudinal designs, for ones that would afford us the opportunity to make causal inferences (hint: they don’t). Reviewers are often so befuddled by our fancy statistics that they fail to notice the inappropriateness of that inferential leap.
I’ve always held my colleague Ed Diener in high esteem. One reason I think he is great is that as a rule he works back and forth between experiments and observational studies, all in the service of creating greater understanding of well-being. He prioritizes his construct over his method. I have to assume that this is a much better value system than our long standing obsession with lab experiments.
The Resurrection Stone: Counter-intuitive findings
The final Deathly Hallow of psychological science is to be the creative destroyer of widely held assumptions. In fact, the foundational writings about the field of social psychology lay it out quite clearly. One of the primary routes to success in social psychology, for example, is to be surprising. The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which humans are irrational, unpredictable, or downright surprising (Ross, Lepper, & Ward, 2010).
It is hard to argue with this motive. We hold those scientists who bring unique discoveries to their field in the highest esteem. And, every once in a while, someone actually does do something truly innovative. In the meantime, the rest of us make up little theories about trivial effects that we market with cute names, such as the “End of History Effect”, or the “Macbeth Effect”, or whatever. We get caught up in the pursuit of cutesy counter-intuitiveness, all in the hope that our little innovation will become a big innovation. To the extent that our cleverness survives the test of time, we will, like the resurrection stone, live on in our timeless ideas even if they are incorrect.
What makes the pursuit of innovation so formidable an obstacle to reform is that it sometimes works. Every once in a while someone does revolutionize a field. The aspiration to be the wizard of one’s research world is not misplaced. Thus, we have an incentive system that produces a variable-ratio schedule of reinforcement—one of the toughest to break according to those long forgotten behaviorists (We need not mention behavioral psychologists, since their ideas are no longer new, innovative, or interesting–even if they were right).
Reasons for Pessimism
The problem with the current push for methodological reform is that, like pursuing the Horcruxes, it is hard and unrewarding in comparison to using the Deathly Hallows of psychological science. As one of our esteemed colleagues has noted, no one will win the APA Distinguished Scientific Award by failing to replicate another researcher’s work. Will a person who simply conducts replications of other researchers’ work get tenure? It is hard to imagine. Will researchers do well to replicate their own research? Why? It will simply slow them down and handicap their ability to compete with the other aspiring wizards who are producing the conceptually replicated, small-N, lab-based experimental studies at a frightening rate. No, it is still best to produce new ideas, even if it comes at the cost of believability. And, everyone is in on the deal. We all disparage null findings in reviews because we want errors of commission rather than omission.
Another reason why the current system may be difficult to fix is that it provides a weird p-value driven utopia. With the infinite flexibility of the Deathly Hallows of psychological science we can pretty much prove any idea is a good one. When combined with our antipathy toward directly replicating our own work or the work of others, everyone can be a winner in the current system. All it takes is a clever idea applied to enough analyses and every researcher can be the new hot wizard. Without any push to replicate, everyone can co-exist in his or her own happy p-value driven world.
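The arithmetic behind "a clever idea applied to enough analyses" is easy to sketch. The example below is a simplification under a loud assumption: it treats each analysis as an independent test of a true null hypothesis, whereas real analyses of the same data set are correlated, so the numbers are illustrative rather than exact. The function names are hypothetical, chosen for this example.

```python
import random


def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def simulate(n_tests, alpha=0.05, n_studies=20_000, seed=1):
    """Monte Carlo check: under a true null, each p-value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies


# Columns: number of analyses, analytic probability, simulated proportion.
for k in (1, 5, 10, 20):
    print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

Under these assumptions, ten independent looks at null data produce at least one "significant" result about 40% of the time, and twenty looks about 64% of the time. No idea needs to be true for someone to find support for it.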
So, there you have it: my Depth Psychology analysis of why I fear that the seemingly benign recommendations for methodological change are falling on deaf ears. The proposed changes contradict the entire status structure that has served our field for decades. I have to imagine that the ease with which the Deathly Hallows can be used is one reason why reform efforts have failed in the past. As many have indicated, the same recommendations to revise our methods have been made for over 50 years, and each time the effort has failed.
In sum, while there have been many proposed solutions to our problems, I believe we have not yet faced our real issue, which is how we are going to restructure our incentive system. Many of us have stated, as loudly and persistently as we can, that there are Horcruxes all around us that need to be destroyed. The move to improve our methods and to conduct direct replications can be seen as an effort to eliminate our believability Horcruxes. But, I believe the success of that effort rides on how clearly we see the task ahead of us. Our task is to convince a skeptical majority of scientists to dismantle an incentive structure that has worked for them for many decades. This will be a formidable task.