The Deathly Hallows of Psychological Science

By Brent W. Roberts

Of late, psychological science has arguably done more to address its ongoing believability crisis than most other areas of science.  Many notable efforts have been put forward to improve our methods.  From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are working to rectify practices that have allowed far too many unbelievable findings to populate our journal pages.

The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes.  We need to replace/supplement the typical reporting practices and statistical approaches by:

  1. Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
  2. Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
  3. Providing more information about effect sizes in each report, such as the effect size for each analysis and its confidence interval.
  4. Valuing direct replication.

It seems pretty simple.  Actually, the proposed changes are simple, even mundane.
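
To see just how mundane these changes are, point 3 in particular, here is a minimal sketch, using made-up data, of what it takes to report an effect size and an approximate confidence interval alongside a standard two-sample t-test. The data, sample sizes, and the large-sample CI formula are illustrative choices of mine, not anything mandated by the recommendations above:

```python
# A minimal sketch (hypothetical data) of how cheap it is to report an effect
# size and its confidence interval alongside a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=0.30, scale=1.0, size=60)   # made-up "treatment" scores
group2 = rng.normal(loc=0.00, scale=1.0, size=60)   # made-up "control" scores

t, p = stats.ttest_ind(group1, group2)

# Cohen's d from the same numbers the t-test already uses.
n1, n2 = len(group1), len(group2)
pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
d = (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

# Approximate 95% CI for d, using a common large-sample standard error.
se_d = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
lo, hi = d - 1.96 * se_d, d + 1.96 * se_d

print(f"t({n1 + n2 - 2}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A dozen lines of extra arithmetic, most of it already computed by the t-test itself.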

What has been most surprising is the consistent pushback and protest against these seemingly innocuous recommendations.  When confronted with these recommendations, many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF.  A striking number of individuals argue against, and are quite disdainful of, reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, the prototypical protests are:

  1. Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
  2. Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so that documenting effect sizes is meaningless.
  3. Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
  4. Direct replications are uninteresting and uninformative.
  5. Conceptual replications are to be preferred because we are testing theories, not confirming techniques.

While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended.  After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60-odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests.  If smart people balk at seemingly benign changes, then something psychologically big must be lurking behind those protests.  What might that big thing be?  I believe the passion behind the protests stems from the fact that, though mundane, the changes being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.

I think this confrontation needs to be examined more closely, because we need to consider the challenges and consequences of deconstructing our incentive system and status structure.  This raises the question: what is our incentive system, and just what are we proposing to do to it?  For this, I believe a good analogy is the dilemma faced by Harry Potter in the final book of his eponymous series.


The Deathly Hallows of Psychological Science

In the last book of the Harry Potter series, “The Deathly Hallows,” Harry Potter faces a dilemma: should he pursue the destruction of the Horcruxes or gather together the Deathly Hallows?  The Horcruxes are pieces of Voldemort’s soul encapsulated in small trinkets, jewelry, and the like.  If they were destroyed, it would be possible to destroy Voldemort.  The Deathly Hallows are three powerful magical objects: the Cloak of Invisibility, the Elder Wand, and the Resurrection Stone. They are alluring because whoever possesses all three becomes the “master of death.”  Harry’s dilemma was whether to pursue and destroy the Horcruxes, a painful and difficult path, or to pursue the Deathly Hallows, with which he could quite possibly conquer Voldemort or, failing that, live on despite him.  He chose to destroy the Horcruxes.

Like Harry Potter, the field of psychological science (and many other sciences) faces a similar dilemma. Pursue changes in our approach to science that eliminate problematic practices that lead to unreliable science—a “destroy the Horcrux” approach. Or, continue down the path of least resistance, which is nicely captured in the pursuit of the Deathly Hallows.

What are the Deathly Hallows of psychological science? I would argue that the Deathly Hallows of psychological science, which I will enumerate below, are 1) p values less than .05, 2) experimental studies, and 3) counter-intuitive findings.

Why am I highlighting this dilemma at this time? I believe we are at a critical juncture.  The nascent efforts at reform may either succeed or fade away as they have so many times before.  For it is a fact that we have confronted this dilemma many times before and have failed to overcome the allure of the Deathly Hallows of psychological science. Eminent methodologists such as Cohen, Meehl, Lykken, Gigerenzer, Schmidt, Fraley, and lately Cumming, have been telling us how to do things better since the 1960s, to no avail. Revising our approach to science has never been a question of knowing the right thing to do, but rather of whether we were willing to do the thing we knew was right.

The Deathly Hallows of Psychological Science: p-values, experiments, and counter-intuitive/surprising findings

The cloak of invisibility: p<.05. The first Deathly Hallow of psychological science is the infamous p-value. You must attain a p-value less than .05 to be a success in psychological science.  Period.  If your p-value is greater than .05, you have no finding and nothing to say. Without anything to say, you cannot attain status in our field. Find a p-value below .05 and you can wrap it around yourself and hide from the contempt aimed at those who fail to cross that magical threshold.

Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find it.  We root around in our data, digging up p-values by cherry-picking studies, selectively reporting outcomes, or performing some arcane statistical wizardry.  One only has to read Bem’s classic article on how to write an article in psychological science to see how we approach p-values as a field:

“…the data.  Examine them from every angle. Analyze the sexes separately.  Make up new composite indices.  If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data.  If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief.  If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily).  Go on a fishing expedition for something–anything–interesting.”
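
None of this is in the original passage, but a minimal simulation makes the cost of that advice concrete. Assume a two-condition study with no true effect anywhere, and a researcher who tests the overall difference, tests the sexes separately, and then drops the most “anomalous” participant from each group and retests. Counting how often at least one of those tests crosses p < .05 shows how quickly the nominal 5% false-positive rate is left behind:

```python
# Hypothetical simulation (not from the original post): how often does
# Bem-style fishing on pure noise yield at least one p < .05 "finding"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_cell = 5000, 30
hits = 0

for _ in range(n_sims):
    # Two conditions, no true effect anywhere.
    y1 = rng.normal(size=n_per_cell)
    y2 = rng.normal(size=n_per_cell)
    sex1 = rng.integers(0, 2, size=n_per_cell)   # arbitrary subgroup labels
    sex2 = rng.integers(0, 2, size=n_per_cell)

    pvals = [stats.ttest_ind(y1, y2).pvalue]     # the overall test
    for s in (0, 1):                             # "analyze the sexes separately"
        if (sex1 == s).sum() > 1 and (sex2 == s).sum() > 1:
            pvals.append(stats.ttest_ind(y1[sex1 == s], y2[sex2 == s]).pvalue)
    # Drop the largest observation in each group (the "participants you
    # don't like") and test again.
    pvals.append(stats.ttest_ind(np.sort(y1)[:-1], np.sort(y2)[:-1]).pvalue)

    hits += any(p < .05 for p in pvals)

print(f"At least one 'significant' result in {hits / n_sims:.0%} of null experiments")
```

Even this tame version of the fishing expedition pushes the false-positive rate well above the advertised 5%, and the full Bem menu of composites, dropped trials, and reorganized data only makes it worse.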

What makes it worse is that when authors try to report null effects, they are beaten down, because we as reviewers and editors do everything in our power to keep null effects out of the literature.  Null effects make for a messy narrative.  Our most prestigious journals almost never publish them because reviewers and editors act as gatekeepers and mistakenly recommend against publication.  Consider the following personal example. In one study, reviewer 2 argued that our paper was not fit for publication in JPSP because one of our effects was null (there were other reasons too). The null effect in question was a test of a hypothesis drawn from my own theory. I was trying to show that my theory did not work all of the time, and the reviewer was criticizing me for showing that my own ideas might need revision. This captures quite nicely the tyranny of the p-value: the reviewer was so wedded to my ideas that he or she would not even let me, the author of said ideas, offer up some data that argued for revising them.

Short of rejecting null effects outright, we often recommend cutting them. I have seen countless recommendations in reviews of my papers and the papers of colleagues to simply drop studies or results that show null effects.  It is not surprising, then, that psychology confirms 95% of its hypotheses.

Even worse, we often commit the fundamental attribution error by assuming that the person trying to publish null effects is an incompetent researcher—especially if they fail to replicate an already published effect that has crossed the magical p < .05 threshold. Not to be too cynical, but the reviewers may have a point.  If you are too naïve to understand “the game,” which is to produce something with p < .05, then maybe you shouldn’t succeed in our field.  Sarcasm aside, what the gatekeepers don’t understand is that they are sending a clear message to graduate students and assistant professors that they must compromise their own integrity in order to succeed in our field. Of course, this winnows from the field the researchers who don’t want to play the game.

The Elder Wand: Running Experiments

Everyone wants to draw a causal conclusion, even observational scientists. And, of course, the best way to draw a causal conclusion, if you are not an economist, is to run an experiment.  The second Deathly Hallow for psychological science is doing experimental research at all costs.  As one of my past colleagues told a first year graduate student, “if you have a choice between a correlational or an experimental study, run an experiment.”

Where things go awry, I suspect, is when you value experiments so much that you do anything in your power to avoid any other method. This leads to levels of artificiality that can get perverse. Rather than studying the effect of racism, we study the idea of racism.  Where we go wrong is that, as Cialdini has noted, we seldom work back and forth between the fake world of our labs and the real world where the phenomenon of interest exists. We become methodologists rather than scientists.  We prioritize lab-based experimental methods not because they necessarily help us illuminate or understand our phenomena, but because they putatively lead to causal inferences and because they are what our guild values most. One consequence of valuing experiments so highly is that we get caught up in a world of hypothetical findings that have unknown relationships to the real world, because we seldom if ever touch base with applied or field research.  As Cialdini so poignantly pointed out, we simply don’t value field research enough to pursue it with the same vigor as lab experiments.

And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good.  To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.

Of course, now that we know many researchers run as many experiments as they need to in order to construct what findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand wielding wizard of an experimentalist simply because that is the best route to fame and fortune in our guild.

Most of my work has been in the observational domain, and admittedly, we have the same motive but lack the opportunity to implement our desires for causal inference.  So far, the IRB has not agreed to let us randomly assign participants to “divorced versus not divorced” or “employed versus unemployed” conditions. In the absence of being able to run a good, clean experiment, observational researchers like myself bulk up on the statistics as a proxy for running an experiment. The fancier, more complex, and more indecipherable the statistics, the closer one gets to the status of an experimenter. We even go so far as to mistake our statistical methods, such as cross-lagged panel longitudinal designs, for ones that would afford us the opportunity to make causal inferences (hint: they don’t). Reviewers are often so befuddled by our fancy statistics that they fail to notice the inappropriateness of that inferential leap.

I’ve always held my colleague Ed Diener in high esteem.  One reason I think he is great is that, as a rule, he works back and forth between experiments and observational studies, all in the service of creating greater understanding of well-being.  He prioritizes his construct over his method. I have to assume that this is a much better value system than our long-standing obsession with lab experiments.

The Resurrection Stone: Counter-intuitive findings

The final Deathly Hallow of psychological science is to be the creative destroyer of widely held assumptions. In fact, the foundational writings about the field of social psychology lay it out quite clearly. One of the primary routes to success in social psychology, for example, is to be surprising.  The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which humans are irrational, unpredictable, or downright surprising (Ross, Lepper, & Ward, 2010).

It is hard to argue with this motive.  We hold those scientists who bring unique discoveries to their field in the highest esteem.  And, every once in a while, someone actually does do something truly innovative. In the meantime, the rest of us make up little theories about trivial effects that we market with cute names, such as the “End of History Effect” or the “Macbeth Effect” or whatever.  We get caught up in the pursuit of cutesy counter-intuitiveness, all in the hope that our little innovation will become a big innovation.  To the extent that our cleverness survives the test of time, we will, like the resurrection stone, live on in our timeless ideas even if they are incorrect.

What makes the pursuit of innovation so formidable an obstacle to reform is that it sometimes works. Every once in a while someone does revolutionize a field. The aspiration to be the wizard of one’s research world is not misplaced.  Thus, we have an incentive system that produces a variable-ratio schedule of reinforcement—one of the toughest to break according to those long forgotten behaviorists (We need not mention behavioral psychologists, since their ideas are no longer new, innovative, or interesting–even if they were right).

Reasons for Pessimism

The problem with the current push for methodological reform is that, like pursuing the Horcruxes, it is hard and unrewarding in comparison to using the Deathly Hallows of psychological science. As one of our esteemed colleagues has noted, no one will win the APA Distinguished Scientific Award by failing to replicate another researcher’s work. Will a person who simply conducts replications of other researchers’ work get tenure?  It is hard to imagine. Will researchers do well to replicate their own research? Why? It will simply slow them down and handicap their ability to compete with the other aspiring wizards who are producing conceptually replicated, small-N, lab-based experimental studies at a frightening rate. No, it is still best to produce new ideas, even if it comes at the cost of believability. And everyone is in on the deal. We all disparage null findings in reviews because we would rather risk errors of commission than errors of omission.

Another reason why the current system may be difficult to fix is that it provides a weird, p-value-driven utopia. With the infinite flexibility of the Deathly Hallows of psychological science, we can pretty much prove that any idea is a good one. When combined with our antipathy toward directly replicating our own work or the work of others, everyone can be a winner in the current system. All it takes is a clever idea applied to enough analyses, and every researcher can be the new hot wizard. Without any push to replicate, everyone can co-exist in his or her own happy p-value-driven world.

So, there you have it.  My Depth Psychology analysis of why I fear that the seemingly benign recommendations for methodological change are falling on deaf ears.  The proposed changes contradict the entire status structure that has served our field for decades.  I have to imagine that the ease with which the Deathly Hallows can be used is one reason why reform efforts have failed in the past. As many have indicated, the same recommendations to revise our methods have been made for over 50 years, and each time the effort has failed.

In sum, while there have been many proposed solutions to our problems, I believe we have not yet faced our real issue: how are we going to restructure our incentive system?  Many of us have stated, as loudly and persistently as we can, that there are Horcruxes all around us that need to be destroyed. The move to improve our methods and to conduct direct replications can be seen as an effort to eliminate our believability Horcruxes. But I believe the success of that effort rides on how clearly we see the task ahead of us. Our task is to convince a skeptical majority of scientists to dismantle an incentive structure that has worked for them for many decades. This will be a formidable task.


For the love of p-values

We recently read Karg et al. (2011) for a local reading group.  It is one of the many attempts to meta-analytically examine the idea that the 5-HTTLPR serotonin transporter polymorphism moderates the effect of stress on depression.

It drove me batty.  No, it drove me to apoplectia–a small country in my mind I occupy far too often.

Let’s focus on the worst part.  Here’s the write up in the first paragraph of the results:

“We found strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002).  The significance of the result was robust to sensitivity analysis, with the overall P values remaining significant when each study was individually removed from the analysis (1.0 × 10⁻⁶ < P < .00016).”

Wow.  Isn’t that cool?  Isn’t that impressive?  Throw out all of the confused literature and meta-analyses that came before this one.  They found “strong evidence” for this now infamous moderator effect.  Line up the spit vials. I’m getting back in the candidate gene GxE game.

Just what did the authors mean by “strong”?  Well, that’s an interesting question.  There is nary an effect size in the review; the authors chose not to examine effect sizes and synthesized p-values instead.  Of course, if you have any experience with meta-analytic types, you know how they feel about meta-analyzing p-values.  Their stance is Nancy Reagan’s stance on drugs: just say no.  If you are interested in why, read Lipsey and Wilson or any other meta-analysis guru.  They are unsympathetic, to say the least.

But, all is not lost. All you, the reader, have to do is transform the p-value into an effect size using any of the numerous on-line transformation programs that are available.  It takes about 15 seconds to do it yourself.  Or, if you want to be thorough, you can take the data from Table 1 in Karg et al (2011) and transform the p-values into effect sizes for your own meta-analytic pleasure. That takes about 15 minutes.

So what happens when you take their really, really significant p-value of p = .00002 and transform it into an effect size estimate?  Like good meta-analytic types, the authors provide the overall N, which is 40,749.  What does that really impressive p-value look like in the r metric?

.0199 or .02 if you round up.

It is even smaller than Rosenthal’s famous .03 correlation between aspirin consumption and protection from heart disease.  You get the same thing when you plug all of the numbers from Table 1 into Comprehensive Meta-Analysis, by the way.
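
If you would rather not trust an on-line calculator, the conversion is short enough to sketch yourself. The exact value depends on details such as whether the p-value is treated as one- or two-tailed, but under the standard z-based conversion it lands at roughly .02 either way:

```python
# Converting a meta-analytic p-value to an r-metric effect size:
# r is approximately z / sqrt(N), where z is the normal deviate for the p-value.
from math import sqrt

from scipy.stats import norm

p, N = 0.00002, 40749          # values reported by Karg et al. (2011)
z = norm.isf(p / 2)            # two-tailed p -> z
r = z / sqrt(N)
print(round(r, 4))             # roughly .02
```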

So the average interaction between the serotonin transporter promoter and stress on depression is “strong,” “robust,” yet infinitesimal.  It sounds like a Monty Python review of Australian wine (“Bold, yet naïve.” “Flaccid, yet robust”).

Back to our original question, what did the authors mean when they described their results as “strong?”  One can only assume that they mean to say that their p-value of .00002 looks a lot better than our usual suspect, the p < .05.  Yippee.

Why should we care?  Well, this is a nice example of what you get when you ignore effect size estimates and just use p-values–misguided conclusions. The Karg et al (2011) paper has been cited 454 times so far.  Here’s a quote from one of the papers that cites their work “This finding, initially refuted by smaller meta-analyses, has now been supported by a more comprehensive meta-analysis” (Palazidou, 2012). Wrong.

Mind you, there is no real inconsistency across the meta-analyses.  If the average effect is really equal to an r of .02, and I doubt it is even that big, it is really, really unlikely to be consistently detected by any single study, let alone a meta-analysis. The meta-analyses appear to disagree only because the target effect size is so small that even dozens of studies and thousands of participants might fail to detect it.

Another reason to care about misguided findings is the potential mistaken conclusion either individuals or granting agencies will make if they take these findings at face value.  They might conclude that the GxE game is back on and start funding candidate gene research (doubtful, but possible).  Researchers themselves might come to the mistaken conclusion that they too can investigate GxE designs.  Heck, the average sample size in the meta-analysis is 755.  With a little money and diligence, one could come by that kind of sample, right?

Of course, that leads to an interesting question.  How many people do you need to detect a correlation of .02? Those pesky granting agencies might ask you to do a power analysis, right?  Well, to achieve 80% power to detect a correlation of .02 you would need roughly 19,600 participants.  That means the average sample in the meta-analysis was woefully underpowered to detect the average effect size.  For that matter, none of the studies in the meta-analysis were adequately powered to detect it, because the largest study, which reported a null effect, had an N of 3,243.
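
For those who want to check that figure, here is a minimal sketch of the power calculation using the standard Fisher r-to-z approximation. Other software, such as an exact bivariate-normal routine, will give slightly different numbers, but the order of magnitude does not change:

```python
# Sample size needed to detect a correlation r with two-tailed alpha = .05
# and 80% power, using the standard Fisher r-to-z approximation.
from math import atanh

from scipy.stats import norm

def n_for_r(r, alpha=0.05, power=0.80):
    z_alpha = norm.isf(alpha / 2)   # ~1.96
    z_beta = norm.isf(1 - power)    # ~0.84
    return ((z_alpha + z_beta) / atanh(r)) ** 2 + 3

print(round(n_for_r(0.02)))   # ~19,600 participants for r = .02
print(round(n_for_r(0.10)))   # ~783 for the more familiar r = .10
```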

So, this paper proves a point: if you accumulate enough participants in your research, almost anything is statistically significant.  And this warrants publication in the Archives of General Psychiatry?  Fascinating.

Brent W. Roberts


Are conceptual replications part of the solution to the crisis currently facing psychological science?

by R. Chris Fraley

Stroebe and Strack (2014) recently argued that the current crisis regarding replication in psychological science has been greatly exaggerated. They observed that there are multiple replications of classic social/behavioral priming findings in social psychology. Moreover, they suggested that the current call for replications of classic findings is not especially useful. If a researcher conducts an exact replication study and finds what was originally reported, no new knowledge has been generated. If the replication study does not find what was originally reported, this mismatch could be due to a number of factors and may speak more to the replication study than the original study per se.

As an alternative, Stroebe and Strack (2014) argue that, if researchers choose to pursue replication, the most constructive way to do so is through conceptual replications. Conceptual replications are potentially more valuable because they serve to probe the validity of the theoretical hypotheses rather than a specific protocol.

Are conceptual replications part of the solution to the crisis currently facing psychological science?

The purpose of this post is to argue that we can only learn anything of value—whether it is from an original study, an exact replication, or a conceptual replication—if we can trust the data. And, ultimately, a lack of trust is what lies at the heart of current debates. There is no “replicability crisis” per se, but there is an enormous “crisis of confidence.”

To better appreciate the distinction, consider the following scenarios.

A. At University of A researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y. And that X3 leads to Y2. In other words, there are several studies that suggest that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.

B. At University of B researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y. And that X3 leads to Y2. In other words, there are several studies that suggest that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.

Is one set of research findings more credible than the other? What’s the difference?

At the University of A, researchers conducted 8 studies in total. Some of these were pilot studies that didn’t pan out but led to some ideas about how to tweak the measure of Y. A few of the studies involved exact replications with extensions; the so-called exact replication part didn’t quite work, but one of the other variables did reveal a difference that made sense in light of the theory, so that finding was submitted (and accepted) for publication. In each case, the data from on-going studies were analyzed each week for lab meetings, and studies were considered “completed” when a statistically significant effect was found. The sample sizes were typically small (20 per cell) because a few other labs studying a similar issue had successfully obtained significant results with small samples.

In contrast, at the University of B a total of 3 studies were conducted. The researchers used large sample sizes to estimate the parameters/effects well. Moreover, the third study had been preregistered such that the stopping rules for data collection and the primary analyses were summarized briefly (3 sentences) on a time-stamped site.

Both research literatures contain conceptual replications. But, once one has full knowledge of how these literatures were produced via a Simmons et al. (2011) sleight of hand, one may doubt whether the findings and theories being studied by the researchers at the University of A are as solid as those being studied at the University of B. This example is designed to help separate two key issues that are often conflated in debates concerning the current crisis.

Specifically, as a field, we need to draw a sharper distinction between (a) replications (exact vs. conceptual) and (b) the integrity of the research process (see Figure) when considering the credibility of knowledge generated in psychological science. We sometimes conflate these two things, but they are clearly separable.

Figure: The difference between methodological integrity and replication, and their relation to the credibility of research.

Speaking for myself, I don’t care whether a replication is exact or conceptual. Both kinds of studies serve different purposes and both are valuable under different circumstances. But what matters critically for the current crisis is the integrity of the methods used to populate the empirical literature. If the studies are not planned, conducted, and published in a manner that has integrity, then—regardless of whether those findings have been conceptually replicated—they offer little in the way of genuine scientific value. The University of A example above illustrates a research field that has multiple conceptual replications. But those replications do little to boost the credibility of the theoretical model because the process that generated the findings was too flexible and not transparent (Simmons, Nelson, & Simonsohn, 2011).
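
To make “too flexible” concrete, here is a minimal simulation under assumptions of my own (a two-group comparison with no true effect, 20 participants per cell to start, another batch of 10 added and the data re-tested each “week,” stopping at p < .05 or at 300 per cell). It is a sketch of the Simmons et al. (2011) point about flexible stopping rules, not a model of any particular lab:

```python
# Hypothetical simulation of the University of A workflow: peek at a null
# two-group comparison after every new batch and stop as soon as p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, start_n, batch, max_n = 2000, 20, 10, 300
false_positives = 0

for _ in range(n_sims):
    a = list(rng.normal(size=start_n))    # no true difference between groups
    b = list(rng.normal(size=start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < .05:                        # "study completed" at the weekly meeting
            false_positives += 1
            break
        if len(a) >= max_n:                # give up without a finding
            break
        a.extend(rng.normal(size=batch))   # collect another batch and peek again
        b.extend(rng.normal(size=batch))

print(f"Nominal alpha = .05; actual false-positive rate = {false_positives / n_sims:.0%}")
```

The code reports whatever rate it obtains, but with this much peeking it will sit far above the nominal 5%, which is exactly why the transparency of the process, and not the number of conceptual replications, is what carries the evidential weight.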

When skeptics call for “exact replications,” what they really mean is that “we don’t trust the integrity of the process that led to the publication of the findings in the first place.” An exact replication provides the most obvious way to address that matter; that is why skeptics, such as my colleague, Brent Roberts, are increasingly demanding them. But improving the integrity of the research process is the most direct way to improve the credibility of published work. This can be accomplished, in part, by using established and validated measures, taking statistical power or precision seriously, using larger sample sizes, preregistering analyses and designs when viable, and, of course, conducting replications along the way.

I agree with Stroebe and Strack (2014) that conceptual replication is something for which we should be striving. But, if we don’t practice methodological integrity, no number of replications will solve the crisis of confidence.

-

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71. http://pps.sagepub.com/content/9/1/59


The Pre-Publication Transparency Checklist

The Pre-Publication Transparency Checklist: A Small Step Toward Increasing the Believability of Psychological Science

We now know that some of the well-accepted practices of psychological science do not produce reliable knowledge. For example, widely accepted but questionable research practices contribute to the fact that many of our research findings are unbelievable (that is, one is ill-advised to revise one’s beliefs based on the reported findings). Post-hoc analyses of seemingly convincing studies have shown that some findings are too good to be true.  And a string of seminal studies has failed to replicate.  These factors have come together to create a believability crisis in psychological science.

Many solutions have been proffered to address the believability crisis.  These solutions have come in four general forms.  First, many individuals and organizations have listed recommendations about how to make things better.  Second, other organizations have set up infrastructures so that individual researchers can pre-register their studies, to document hypotheses, methods, analyses, research materials, and data so that others can reproduce published research results (Open Science Framework).  Third, specific journals, such as Psychological Science, have set up pre-review confessionals of sorts to indicate the conditions under which the data were collected and analyzed.  Fourth, others have created vehicles so that researchers can confess to their methodological sins after their work has been published (psychdisclosure.org).  In fact, psychology should be lauded for the reform efforts it has put forward to address the believability crisis, as it is only one of many scientific fields in which the crisis is currently raging, and it is arguably doing more than many other fields.

While we fully support many of these efforts at reform, it has become clear that they leave a gaping hole through which researchers can blithely walk.  People can and do ignore recommendations.  Researchers can avoid pre-registering their work.  Researchers can also avoid publishing in journals that require confessing one’s QRPs before review.  And published authors can avoid admitting to their questionable research practices post hoc.  What this means is that research continues to be published every month in our most prestigious journals that, in design and method, looks indistinguishable from the research that led to the believability crisis in the first place.

In searching for solutions to this problem, we thought that instead of relying on the good graces of individual researchers to change their own behavior, or waiting for the slow pace of institutional change (e.g., journals following Psychological Science’s lead), it might be productive to provide a tool that could be used by everyone, right now.  So what are we proposing?  We propose a set of questions that all researchers should be able to answer pre-publication, in the review process—the Pre-Publication Transparency Checklist (PPTC).  Who should use these questions?  Reviewers.  Reviewers are free to ask any question they want, as many of us can attest.  There is nothing stopping researchers from holding other researchers accountable. The goal of these questions is to get even those unwilling to pre- or post-register their research process to cough up background information on how they conducted their research and the extent to which their results are “fragile” or “robust.”  The questions are inspired by the changes recommended by many different groups and would hopefully help to improve the believability of the research by making authors describe the conditions under which the research was conducted before their paper is accepted for publication[1].

The Pre-Publication Transparency Checklist

  1. How many studies were run in conceptualizing and preparing the set of studies reported in this paper?
    • How many studies were run under the original IRB proposal?
    • How many “pilot” studies were run in support of the reported studies?
  2. If an experiment was run, how many conditions were included in the original study and were all of these conditions included in the current manuscript? (a)
  3. Was any attempt made to directly replicate any of the studies reported in this paper?
    • Would you be willing to report the direct replications as an on-line appendix?
    • Note: In some fields it is common to replicate research but not report the efforts.
    • Note: Some studies are difficult to replicate (e.g., longitudinal, costly, technologically intense).
  4. Approximately how many outcome measures were assessed in each study? (a)
    • How many of these outcome measures were intended for this study?
    • How many outcome measures were analyzed for each study?
    • Do all of the conceptually and empirically related DVs show the same pattern of effects?
  5. In any of the studies presented, were the data analyzed as they were being collected (i.e., peeked at)?
    • If it was “peeked” at, was there an effort to address the potential increase in the Type I error rate that results from peeking, such as conducting direct replications or using Bayesian estimation approaches?
    • Note: The goal is not necessarily to eliminate p-hacking but to make sure our findings are replicable despite p-hacking (see Asendorpf, et al, 2012 for a discussion).
  6. What was the logic behind the sample sizes used for each study? (a)
    • Was a power analysis performed in the absence of pilot data?
    • Was an effect size estimate made on the initial work and used for power estimates of subsequent studies whether they were direct or conceptual replications?
  7. Were any participants assessed and their data not included in the final report of each study? (a)
    • What was the rationale for not including these participants?
  8. Do all of the co-authors of the study have access to the data?
  9. Can all of the co-authors of the study reproduce the analyses?
    • If not, why and who can?
    • Note: It is common for statistical experts to lend a helping hand so it is not necessarily bad that all the authors cannot reproduce the analyses.  But, it is important to know who can and cannot reproduce the analyses for future efforts to reproduce the results.

  10. Were there any attempts to test whether the results were robust across analytical models and different sets of control variables?
    • If the results do not replicate across models, was this factored into the alpha level (multiple tests of significance)?
  11. Approximately how many different analytical approaches were used?
    • Were alternative ways of analyzing the data considered and tested?
    • Note: It is common to try different variants of the general linear model (ANOVA, ANCOVA, regression, HLM, SEM).  It would be important to know whether the results replicate across the various ways of analyzing the data.

So, as we noted above, these questions could be asked of researchers when they present their work for review.  Ideally, the answers to these questions would become part of the reported methods of every paper submitted, possibly as an on-line appendix. If reviewers asked these types of questions of every paper that was submitted, that in itself would change the publication incentive structure quite dramatically.

A second way that the Pre-Publication Transparency Checklist could be used is by editors of journals other than Psychological Science.  Like reviewers, editors could ask authors to simply answer each of these questions along with their submission.  There is no reason why Psychological Science should go it alone with this type of questioning.  The effort to answer these questions is minor—far less, for example, than the time taken to complete the typical IRB form.  Again, if editors used the PPTC, which they should be free to do today, we could be on our way to a better more substantial body of research on which to base our future scientific efforts.

Given the heterogeneity of reactions to the believability crisis in psychology, we do not foresee the answers to these questions being “right” or “wrong” so much as providing information that other researchers can use to determine whether they personally would want further follow-up before concluding that the research was reliable. But, of course, like the traditional methods we use in psychological science, which rely on transparency, accuracy, and honesty, the answers will only be as good as they are truthful.

We are also sympathetic to the point that many of the questions will be difficult to answer and that many questions will not apply to different types of research.  That is okay.  The goal is not an exacting account of every behavior and decision made on the way to a finished publication.  The goal is to provide background information to help determine how robust or delicate the findings may be.  For example, if dozens of studies were run looking at scores of outcomes and only a few of the studies and outcomes were reported, then other researchers may not want to attempt to build on the findings before directly replicating the results themselves.  Similarly, if multiple analyses were conducted and only the statistically significant ones reported, then other researchers would likewise be cautious when following up on the findings.

As noted above, the PPTC would not be necessary if researchers pre-registered their studies, posted their materials and data on-line, and were transparent with the description of their methods.  But, given the obvious fact that not every researcher is going to pre-register their materials, the Pre-Publication Transparency Checklist provides a means through which reviewers and editors can get these individuals to provide desperately needed information on which to judge the robustness or fragility of the reported findings in any submitted manuscript.

Brent W. Roberts, University of Illinois, Urbana-Champaign

Daniel Hart, Rutgers University


[1] These efforts include those proposed for pre-review at Psychological Science, post-publication disclosure at psychdisclosure.org, and the 21 Word Solution.  Those questions marked with an “(a)” are similar in content to the questions found on existing systems.


Science or law: Choose your career

I recently saw an article by an astute reporter that described one of our colleagues as a researcher who “…has made a career out of finding data….”

Finding data.

What a lush expression.  In this case, as it seems always to be the case, the researcher had a knack for finding data that supported his or her theory. On the positive side of the ledger, “finding data” denotes the intrepid explorer who discovers a hidden oasis or the wonder that comes with a NASA probe that unlocks long lost secrets on Mars.

On the negative side of the ledger, “finding data” alludes to researchers who will hunt down findings that confirm their theories and ignore data that do not. I remember coming across this phenomenon for the first time as a graduate student, when a faculty member asked whether any of us could “find some data to support X”.  I thought it was an odd request.  I thought in science one tested ideas rather than hunted down confirming data and ignored disconfirming data.

Of course, “finding data” is an all too common practice in psychology.  Given the fact that 92% of our published findings are statistically significant and that it is common practice to suppress null findings, it strikes me that the enterprise of psychological science has defaulted to the task of finding data.  One needs only to have an idea, say that ESP is real, and, given enough time and effort, the supporting data can be found.  Given our typical power (50%) and the Type 1 error rate (at least 50% according to some), the job is not too tough.  One only has to run a few underpowered studies, with a few questionable research practices thrown in and the data will be found.  Of course, you will have to ignore the null findings.  But, that apparently is easy to do because as one of our esteemed colleagues wrote recently “everyone does it”—“it” meaning throw null effects away.

There are other careers and jobs that call for a similar approach—pundits and lawyers.  The job of Fox or MSNBC pundits is not to report the data as it is, but to find the data that supports their preconceived notion of how the world works.  Similarly, good lawyers don’t necessarily seek the truth, but rather the data that benefits their client the most.  It appears that we have become a field of lawyers who diligently defend our clients, which happen to be our ideas.

To the extent that this portrait is true, it leads to some painful implications.  Are psychological researchers just poorly paid lawyers?  I mean, most of us didn’t get into this career for the money, but if we are going to do soulless lawyer-like work, why not make the same dough as the corporate lawyers do?  Of course, given our value system psychologists would most likely be public defenders so maybe asking for more money would be wrong.  But consider the fact that law school only lasts three years.  The current timeline for a psychology Ph.D. seems to be five years minimum, sometimes six, with post doc years to boot.  Do you mean to tell me that I could have simply gone to law school instead of a Ph.D. program and been done in half the time and compensated far better? Maybe it is not too late to switch.

What’s so bad about being a lawyer?

Nothing. Really. I have no prejudice against lawyers.  Practicing law can be noble and rewarding.  And, like many careers it can be a complete drag.  It is work after all.

And, there are similarities between science and law.  Both professions and the professionals therein pursue certain ideas, often relentlessly.  Many defendants are grateful for the relentless pursuit of justice practiced by their lawyers.  Similarly, many ideas in science would not have been discovered without herculean, single-minded focus, combined with dogmatic persistence.

Then again, there are the lawyers who defend mob bosses, tobacco firms, or Big Oil.  None of us would want to be like them, right?

In an ideal world, there is one very large difference between practicing law and science.  At some point, scientists are supposed to use data as the arbiter of truth.  That is to say, at some point we must not only entertain the possibility that our all-consuming idea is wrong, but also be willing to firmly conclude that it is incorrect.  I had an economist friend who pursued the idea that affirmative action programs were economically detrimental to the beneficiaries of those programs.  He eventually determined that his idea was wrong.  Admittedly, it took him ten years to come to that position, but he at least admitted it.  Changing one’s mind like this would be akin to a tobacco lawyer suddenly admitting in the middle of a trial that smoking cigarettes really is bad for you.  That doesn’t happen, precisely because those lawyers are paid big money to ignore such truths and defend their clients despite them.

This means that the difference between being an advocate and being a scientist lies almost solely in the integrity of our data and our response to those data.  If our data are flawed, then we can go through the motions of science and still be no better than a pundit or propagandist.  If we hide our “bad” data (e.g., non-significant findings), we are likewise practicing a less than noble form of law; we are ambulance chasers or tobacco lawyers.  If we don’t change our minds in response to data that disconfirm our most closely held ideas, we are, again, advocates, not scientists.

The bottom line is that many of us are being lawyers/pundits with our research.  We drop studies, ignore problematic data, squeeze numbers out of analyses, and use a variety of techniques in order to present the best possible case for our idea.  This is the fundamental problem with the p-hacking craze going on in many sciences, including psychology.  We are not truly testing ideas but advocating for them, and often we are really advocating only for our careers when we do this.  Just because we defend seemingly noble ideas, such as social justice, doesn’t make the work any different.  If we only pay attention to the data that supports our client, then we aren’t doing science.

What should we do?

Many, many earnest recommendations have been made to date and I will not reiterate or contradict any of the missives describing optimal publishing practices and the like.  What I think has been missing from the dialogue is a clear case made for us to change our attitudes, not only our publishing practices and research behavior.  So, the recommendations below go to that effect.

First, and most ironically, I believe we need to be legalistic in our approach to our research.  That is, we need to be judge, jury, prosecutor, and defense counsel of our own ideas.  As noted elsewhere, psychology is a field that only confirms ideas (and only with data that reveal a statistically significant finding).  We need to do more to prosecute our own research before we foist it on the world.  The economists call this testing the robustness of a finding.  Instead of just portraying the optimal finding (e.g., the statistically significant one), we need to present what happens when we put our own finding to the test. What happens to our finding when statistical assumptions are relaxed or restricted, when different control variables are included, or when different dependent variables are predicted? If your finding falls apart when you use a slightly different statistical approach, use a new DV that correlates .75 with your preferred DV, or run the study in a sample of Maine versus Massachusetts undergraduates, do we really want to endorse that finding as true? No.
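
As a sketch of what that prosecution might look like in practice, consider the following, which uses entirely hypothetical names (study_data.csv, predictor, outcome_a, outcome_b, and the control variables are all placeholders): estimate the same focal effect across every combination of control variables and across two closely related DVs, and look at the spread of estimates rather than a single favorable p-value.

```python
# Hypothetical sketch of a basic robustness check: estimate the focal effect
# across several control sets and two closely related DVs, then inspect the
# spread of estimates instead of reporting only the most favorable model.
import itertools

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")        # hypothetical dataset
controls = ["age", "gender", "ses"]       # hypothetical control variables
dvs = ["outcome_a", "outcome_b"]          # two DVs that correlate ~.75

results = []
for dv in dvs:
    for k in range(len(controls) + 1):
        for subset in itertools.combinations(controls, k):
            formula = f"{dv} ~ predictor" + "".join(f" + {c}" for c in subset)
            fit = smf.ols(formula, data=df).fit()
            results.append({
                "dv": dv,
                "controls": ", ".join(subset) or "none",
                "b": fit.params["predictor"],
                "p": fit.pvalues["predictor"],
            })

print(pd.DataFrame(results).sort_values("b").to_string(index=False))
```

If the focal coefficient survives that table of specifications, it deserves the word robust; if it appears only in one cell of the table, it deserves a replication before it deserves a press release.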

Second, we need to value direct replication.  I get a lot of pushback on this argument, and that pushback deserves its own essay (later, Chris).  But, given how prevalent p-hacking is in our field, we need an outbreak of direct replications and healthy skepticism of “conceptual” replications.  For example, those who argue that they value and would prefer a 4-study paper with 3 conceptual replications have to assume that p-hacking is not prevalent.  Unfortunately, p-hacking is widespread (see the quote about “everyone does it”).  At this juncture, a 4-study paper with 3 conceptual replications using some perversely nonsensical range of sample sizes for each study (from 30 to 300) screams out “I P-HACKED!”  Combining conceptual replications with simple, direct replications is not difficult and is really hard to argue against in light of how difficult it is to replicate our findings.

Third, we need to walk back our endorsement and valuing of brief journal formats found in journals like Science, Psychological Science, and Social Psychological and Personality Science.  This is not because short reports are evil per se, but because they promote a lax attitude toward research that exacerbates our problematic methodological peccadillos.  I must admit that I used to believe that we needed more outlets for our science and I loved the short report format.  I was wrong.  We made a huge mistake—and I was part of that mistake—in promoting quick reports with formats so brief and review processes so quick that we end up promoting bad research practices. At JPSP after all, you have to “find” 3 or 4 statistically significant effects to have a chance at publication.  At short report journal outlets, you only have to “find” one such study to get published, especially if the finding is “breathtaking.”  Thus, we promote even less reliable research in top journals in an effort to garner better press. In some ideal world, these formats would not be a problem.  In the context of pervasive p-hacking, short, often single-shot studies are a problem. We have inadvertently promoted a “quick-and-dirty” attitude toward our research efforts, making it even easier to infuse our field with unreliable findings.  Until we have our methodological house in order, we should reconsider our love of the short report and the short report outlet.

Fourth, we need to be less enamored with numbers and more impressed with quality.  Building a lengthy CV is not that difficult.  All one needs to do is put together a highly motivated team of graduate and undergraduate assistants to churn through dozens of studies per year.  Then, combine that type of cartel with a willingness to ignore the null effects or practice some basic QRPs and you will have at least 4 JPSP/Psych Science-like multiple study papers completed per year.  If you are willing to work the “messier” studies in lower-tier journals you are well on your way to an award.  Even better, publish unreplicable, provocative findings and get into a nasty argument with colleagues about your findings. Then, your CV explodes with the profusion of tit-for-tat publications that come with the controversy.  In contrast, if we evaluate researchers based on the ideas they have and how they go about testing them, rather than their ability to churn the system to discover statistical significance, we might actually do more to clean up our methodological mess than any pre-registration registry could ever achieve.

The obvious ramification of adopting a more skeptical attitude toward our own research would be to slow things down.  As Michael Kraus has argued, why not adopt a “slow research” movement akin to the slow food movement?  If rumors are true, over half of our research findings cannot be directly replicated.  That means we are wasting a lot of time and energy writing, reviewing, reading, and believing arguments that are, well, just that, arguments—arguments that look like they have supporting data, but are really fiction.  While I appreciate a good argument and impassioned punditry, science is not supposed to be an opinion echo chamber.  It is supposed to be a field dedicated to creating knowledge.  Unlike a baseless argument, knowledge stands up to cross-examination and direct replication.

Brent W. Roberts


Owning it

What happens when the authors of studies linking candidate gene polymorphisms to response to drug consumption tried to replicate their own research?

As many of you know, the saga of replication problems continues unabated in social and personality psychology, the most recent dust-up being over the ability of some researchers to replicate Dijksterhuis’s professor-prime studies and the ensuing arguments over those attempts.

While social and personality psychologists “discuss” the adequacies of the replication attempts in our field, a truly remarkable paper was published in Neuropsychopharmacology (Hart, de Wit, & Palmer, 2013).  The second and third authors have a long collaborative history working on the genetics of drug addiction.  In fact, they have published 12 studies linking variations in candidate genes, such as BDNF, DRD2, and COMT, to intermediary phenotypes related to drug addiction.  As they note in the introduction to their paper, these studies have been cited hundreds of times and would lead one to believe that single SNPs or variations in specific genes are strongly linked to the way people react to amphetamines.

The 12 original studies all relied on a really nice experimental paradigm.  The participants received placebos and varying doses of amphetamines across several sessions, and the experimenters and participants were blind to what dose they received.  The order of drug administration was counterbalanced.  After taking the drugs, the participants rated their drug-related experience over the few hours that they stayed in the lab.  The authors, their post docs, and graduate students published 12 studies linking the genetic polymorphisms to outcomes like feelings of anxiety, elation, vigor, positive mood, and even concrete outcomes such as heart rate and blood pressure.

While the experimental paradigm had rather robust experimental fidelity and validity, the studies themselves were modestly powered (Ns = 84 to 162).  Sound familiar?  It is exactly the same situation we face in many areas of psychological research now—a history of statistically significant effects discovered using modestly powered studies.

As these 12 studies were going to press (a 5-year period), the science of genetics was making strides in identifying the appropriate underlying model of genotype-phenotype associations.  The prevailing model moved from the common-variant model to the rare-variant or infinitesimal model.  The import of the latter two models was that it would be highly unlikely to find any candidate gene effect linked to any phenotype, whether an endophenotype, an intermediate phenotype, or a subjective or objective phenotype, because the effect of any single polymorphism would be so small.  The conclusion would be that the findings published by this team would be called into question, with the remote possibility that they had been lucky enough to find one of the few polymorphisms that have a big effect, like APOE.

So what did the authors do?  They kept on assessing people using their exemplary methods and also kept on collecting DNA.  When they reached a much larger sample size (N = 398), they decided to stop and try to replicate their previously published work.  So, at least in terms of our ongoing conversations about how to conduct a replication, the authors did what we all want replicators to do—they used the exact same method and gathered a replication sample that had more power than the original study.

What did they find?  None of the 12 studies replicated.  Zip, zero, zilch.

What did they do?  Did they bury the results?  No, they published them.  And, in their report they go through each and every previous study in painful, sordid detail and show how the findings failed to replicate—every one of them.

Wow.

Think about it.  Publishing your own paper showing that your previous papers were wrong.  What a remarkably noble and honorable thing to do–putting the truth ahead of your own career.

Sanjay Srivastava proposed the Pottery Barn Rule for journals–if a journal publishes a paper that other researchers fail to replicate, then the journal is obliged to publish the failure to replicate.  The Hart et al. (2013) paper seems to go one step further.  Call it the “clean up your own mess” rule or the “own it” rule—if you bothered to publish the original finding, then you should be the first to try to directly replicate the finding and publish the results regardless of their statistical significance.

We are several years into our p-hacking, replication-lacking nadir in social and personality psychology and have yet to see a similar paper.  Wouldn’t it be remarkable if we owned our own findings well enough to try to directly replicate them ourselves without being prodded by others?  One can only hope.


Schadenfreude

This week in PIG-IE we discussed the just published paper by an all-star team of “skeptical” researchers that examined the reliability of neuroscience research.  It was a chance to take a break from our self-flagellation to see whether some of our colleagues suffer from similar problematic research practices.

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376.

If you’d like to skip the particulars and go directly to an excellent overview of the paper, head over to Ed Yong’s blog.

There are too many gems in this little paper to ignore, so I’m going to highlight a few features that we thought were invaluable.  First, the opening paragraph is an almost poetic introduction to all of the integrity issues facing science, not only psychological science.  So, I quote verbatim:

“It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly ‘clean’ results is more likely to be published. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals.”

The authors go on to show that the average power of neuroscience research is an abysmal 21%.  Of course, “neuroscience” includes animal and human studies.  When broken out separately, the human fMRI studies had an average statistical power of 8%.  That’s right, 8%.  Might we suggest that the new Brain Initiative money be spent by going back and replicating the last ten years of fMRI research so we know which findings are reliable?  Heck, we gripe about our “coin flip” powered studies in social and personality psychology (50% power).  Compared to 8% power, we rock.

Here are some additional concepts, thoughts, and conclusions from their study worth noting:

1.  Excess Significance: “The phenomenon whereby the published literature has an excess of statistically significant results that are due to biases in reporting.”

2. Positive predictive value:  What the p-rep was supposed to be; “the probability that a positive research finding reflects a true effect (as in a replicable effect).”  They even provide a sensible formula for computing it (see the sketch after this list).

3.  Proteus phenomenon: “The first published study is often the most biased towards an extreme result.”  This seems to be our legacy: unreliable but “breathtaking” findings that are untrue yet can’t be discarded, because we seldom if ever publish the failures to replicate.

4.  Vibration of effects:  “Low-powered studies are more likely to provide a wide range of estimates of the magnitude of an effect.”
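
Here is a minimal sketch of the positive predictive value calculation in the form Ioannidis and colleagues use: PPV = (power × R) / (power × R + α), where R is the pre-study odds that a tested effect is real. The pre-study odds of .25 below are an illustrative number of my own choosing, not one from the paper:

```python
# Positive predictive value: the probability that a "significant" finding is
# true, given power (1 - beta), alpha, and R, the pre-study odds that a tested
# effect is real. The pre-study odds below are illustrative, not from the paper.
def ppv(power, alpha=0.05, prestudy_odds=0.25):
    return (power * prestudy_odds) / (power * prestudy_odds + alpha)

for power in (0.80, 0.50, 0.21, 0.08):   # well-powered, "coin flip", average neuroscience, fMRI
    print(f"power = {power:.2f} -> PPV = {ppv(power):.2f}")
```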

Vibration effects are really, really important because there are some in our tribe who believe that using smaller sample sizes “protects” one from reporting spuriously small effects.  In reality, the authors describe how small samples not only increase Type II errors but also lower the odds that any given significant finding is true and inflate the apparent size of the effects that do reach significance.  Underpowered studies are simply bad news.
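
To see the vibration point in action, here is a minimal simulation under assumptions of my own (a true correlation of .20 studied with n = 30 versus n = 300): among the results that happen to cross p < .05, the small samples return far more variable and systematically inflated estimates.

```python
# Hypothetical simulation of "vibration of effects": a true r = .20 studied
# with n = 30 versus n = 300. Among studies that reach p < .05, the small
# samples yield far more variable, and systematically inflated, estimates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_r, n_sims = 0.20, 5000
cov = [[1, true_r], [true_r, 1]]

for n in (30, 300):
    sig = []
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([0, 0], cov, size=n).T
        r, p = stats.pearsonr(x, y)
        if p < .05:
            sig.append(r)
    sig = np.array(sig)
    print(f"n = {n}: {len(sig) / n_sims:.0%} of studies significant; "
          f"published r ranges from {sig.min():.2f} to {sig.max():.2f} (mean {sig.mean():.2f})")
```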

 
