By Brent W. Roberts
In a recent blog post, I argued that the Deathly Hallows of Psychological Science—p values < .05, experiments, and counter-intuitive findings—represent the combination of factors that are most highly valued by our field and are the explicit criteria for high impact publications. Some commenters mistook my identification of the Deathly Hallows of Psychological Science as a criticism of experimental methods and an endorsement of correlational methods. They even went so far as to say my vision for science was “scary.”
Of course, these critics reacted negatively to the post because I was being less than charitable to some hallowed institutions in psychological science. Regardless, I stand by the original argument. Counter-intuitive findings from experiments “verified” with p values less than .05 are the most valued commodities of our scientific efforts. And, the slavish worshiping of these criteria is at the root of many of our replicability and believability problems.
I will admit that I could have been clearer in my original blog post on the Deathly Hallows. I could have explained in simple language that it is not the ingredients of the Deathly Hallows of Psychological Science, per se, that are the problem, but the blind obedience that too many scholars pay to these criteria. I hope most readers got that point.
Of course, the comment saying my vision was scary did make me think. Just what is my vision for the ideal scientific process in psychology? Actually, that’s an easy question to answer. My vision of good scientific work in psychological science has two basic features. First, ask good questions. Second, answer those questions with informative methods that are well suited for answering those questions. See that? No p-values. No statistics. No experiments. No counter-intuitiveness. We just need good questions and appropriate methods. That’s all.
Good questions, of course, are not so easy to come by. By “good” I mean questions that when answered will provide valuable information. A good question often emerges from the foundation of knowledge in one’s field. It is a question that needs to be answered given the knowledge that has accrued to date. Of course, given the fact that our false positive rate in psychological science ranges from 20% to 80% depending on who you ask, it is genuinely difficult to know what a good question is nowadays. I take that as an arbitrage opportunity—every question is back on the table.
How do you know your question is good? Easy. Your research question is good if the answer is interesting regardless of the result. It should be just as interesting whether the effect is null or not provided the design was appropriate and high-powered. There is an abundance of examples of good scientific questions that have been answered over the years, such as Milgram and Asch’s question of whether humans are conforming. The significance of their work does not ride on whether their effects were p < .05. The significance of their work rests on figuring out that people behave in a very conforming fashion, at least in western populations. It would have been fascinating to find the opposite too. It was a good question and the importance of their results has stood the test of time.
Similarly, the question of whether human phenotypes are heritable, and to what extent environmental influences are shared or unique, was and remains a good question. The answer would have been informative regardless of the proportion of genetic, shared, and unique environmental variance behavior geneticists found in outcomes like personality or psychopathology. The findings were, and still are, fascinating given the relatively modest variance attributable to shared environmental influences.
Appropriate methods are, in part, dictated by the question that needs to be answered. Sometimes that leads to a correlational design, sometimes an experiment, sometimes something in between. God forbid, sometimes it might even call for a case study or a qualitative design. Regardless, a good method is one that provides reliable information on the original research question that was asked. When behavior geneticists were criticized for the equal environments assumption, they went out and found samples of twins who were raised apart. What did they find? They found that phenotypes were just as heritable in twins who shared no environment. You can complain as much as you want about identical twins being treated more alike than fraternal twins, but the studies in which twins raised apart showed the same levels of correspondence as twins raised together used a design that answered that question perfectly.
Likewise, when people and researchers questioned the efficacy of psychotherapy, it was the true experimental designs that brought clinical psychology back from the abyss. Decades of diligently run field experiments have now shown that therapy works, at least in those populations that stay in clinical trials. Correlational designs could not have answered the question of whether clinical interventions worked. Only good experimental evidence could answer that question.
My criticism of the Deathly Hallows of Psychological Science rides on the fact that the blind pursuit of this Holy Grail incentivizes bad methods. It is much easier to get your desired finding if you run a series of underpowered studies and then either p-hack by dropping null findings or fish around for significant effects by testing moderators to death. That means that the prototypical package of underpowered, conceptually replicated experiments is uninformative about the actual question that motivated the studies in the first place. These practices represent bad methodology and they waste limited and valuable resources. Most, if not all, of the changes proposed by the “skeptics” of unreplicable research simply improve the informational value of the methods by increasing sample sizes, directly replicating findings, and avoiding p-hacking. Please, someone, tell me why these are bad recommendations.
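To make the point about underpowered studies and dropped null findings concrete, here is a minimal simulation sketch. Everything numeric in it is an illustrative assumption rather than something taken from the post: a true standardized effect of d = 0.3, 20 versus 200 participants per group, a two-group comparison with known unit variance (a normal approximation, not a full t-test), and the hypothetical function names `power_two_group` and `simulate_selective_reporting`.

```python
import random
from statistics import NormalDist, mean


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    Assumes a known unit variance in both groups, so the standard error
    of the estimated standardized difference is sqrt(2 / n_per_group).
    """
    std = NormalDist()
    se = (2 / n_per_group) ** 0.5
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = d / se
    # Probability of landing in either rejection region.
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)


def simulate_selective_reporting(d, n_per_group, n_studies=2000,
                                 alpha=0.05, seed=1):
    """Simulate many studies and keep only the 'significant' ones.

    Returns (mean of all estimates, mean of significant estimates,
    share of studies that reached significance).
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Each study's estimated effect is drawn around the true effect d.
    estimates = [rng.gauss(d, se) for _ in range(n_studies)]
    # Dropping null findings: only |z| > z_crit survives.
    significant = [e for e in estimates if abs(e) / se > z_crit]
    sig_mean = mean(significant) if significant else float("nan")
    return mean(estimates), sig_mean, len(significant) / n_studies


if __name__ == "__main__":
    for n in (20, 200):
        print(n, round(power_two_group(0.3, n), 2))
    all_mean, sig_mean, rate = simulate_selective_reporting(0.3, 20)
    print(round(all_mean, 2), round(sig_mean, 2), round(rate, 2))
```

Under these assumptions, power is about 0.16 with 20 per group and about 0.85 with 200 per group. In the small-sample case, the studies that survive the significance filter overstate the true effect, because only unusually large estimates can clear the threshold. That is the sense in which a package of selectively reported, underpowered studies says little about the question that motivated them.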
I’d add one more ingredient to my “vision” and remind the reader of the late Carl Sagan’s first maxim of his Baloney Detection Kit. The best scientific information comes not only from a study that is directly replicated, but one that is directly replicated by an independent source. That means a researcher who is indifferent, if not hostile, to your finding should be able to reproduce it. That’s good information. That’s a finding that can be trusted. For example, I would put money on the fact that any researcher who has a distaste for the idea of personality traits would, if given the responsibility of tracking personality traits over time, find that they show robust rank-order consistency.
So that is my grand, scary vision for conducting good science. Ask good questions. Answer the question with methods that are informative. An underpowered study is not informative. A properly powered study that can be replicated by a hostile audience is very, very informative. Good science doesn’t have to be an experiment. It doesn’t have to produce a statistically significant finding. Nor does the topic have to be counter-intuitive. It just has to be a trustworthy set of data that attempts to answer a good scientific question.
If that vision scares you, I can recommend a good, if cheap, bottle of red wine or an anxiolytic. Of course, I wouldn’t recommend mixing alcohol and medications as that can be detrimental to your health, p < .05.
Sometimes, it feels weird we spend quite some time on teaching students statistics, while we spend so very little time on asking good questions. I think one of the major differences between psychology and more mature sciences is that other disciplines can agree on what important questions are.
I think we know what the important questions are in psychology. But it is hard to get people to address them when the carrot we’re chasing is novelty.
Agreed! Sometimes it feels like the toothbrush problem is unsolvable. I’d love to have a good discussion of the ideas people have on how to overcome it. Funding seems like it has potential, but there are a lot of ways that can go wrong, since it implies top-down prioritization of research agendas. A turn back towards dealing with real-world issues, with lab research being only a part of the research cycle, would probably focus attention on clarifying important phenomena rather than uncovering novel ones. And then collaboration, with more people working on fewer problems, seems like it might have potential as well.
There really aren’t basic Statistics 101 books that are based on, centrally, an effect size understanding of stats, are there? It has seemed to me that intro stats should be taught with a textbook which is essentially an intro version of Cohen, Cohen, Aiken, and West.
If such a textbook doesn’t exist, then just with Cohen, Cohen, Aiken, and West.
I guess my point is that I think it goes deeper in a lot of ways than the reward structure at the journals. Psychology students basically learn from Day 1 – their very first introduction to how to do research – that what they need to do is get t > 2, p < .05. This way of understanding how to do research is in their bones. 100 years of NHST is hard to shake.