An apology and proposal

Brent W. Roberts

My tweet, “Failure to replicate hurting your career? What about PhDs with no career because they were honest” was taken by some as a personal attack on Dr. Schnall.  It was not, and I apologize to Dr. Schnall if it was taken that way. The tweet was in reference to the field as a whole because our current publication and promotion system does not reward the honest design and reporting of research. And this places many young investigators at a disadvantage. Let me explain.

Our publication practices reward the reporting of optimized data—the data that looks the best or that could be dressed up to look nice through whatever means necessary. We have no choice given the way we incentivize our publication system. That system, which punishes null findings and rewards only statistically significant effects, means that our published science is not currently an honest portrait of how our science works. The current rash of failures to replicate famous and not-so-famous studies is simply a symptom of a system that is in dire need of reform. Moreover, students who are unwilling to work within this system—to be honest with their failures to replicate published work, for example—are punished disproportionately. They wash out, get counseled into other fields, or simply choose to leave our field of their own accord.

Of course, I could be wrong. It is possible that the majority of researchers publish all of their tests of all of their ideas somewhere, including their null findings. I’m open to that possibility.  But, like many hypotheses, it should be tested and I have an idea for how to test it.

Take any one of our flagship journals and for one year follow a publication practice much like the one used for the special replication issue just published. During that year, the editors agree to review and publish only manuscripts that 1) have been pre-registered, 2) describe only their introduction, methods, and planned analyses, not their results, and 3) contain at least one direct replication of each unique study in the proposed package of studies. The papers would be “accepted” based on the elegance of the theory and the adequacy of the methods alone. The results would not be considered in the review process. Of course, the pre-registered studies would be “published” in a form where readers would know that the idea was proposed even if the authors do not follow through with reporting the results.

After a year, we can examine what honest science looks like. I suspect the success rate for statistically significant findings will go down dramatically, but that is only a hypothesis. Generally speaking, think of the impact this would have on our field and science in general. The journal that takes up this challenge would have the chance to show the field and the world what honest science looks like. It would be held up as an example for all fields of science for exactly how the process works, warts and all. And, if I’m wrong, if at the end of that year the science produced in that journal looks exactly like the pages of our current journals, I’ll not only apologize to the field, I’ll stop tweeting entirely.


Additional Reflections on Ceiling Effects in Recent Replication Research

By R. Chris Fraley

In her commentary on the Johnson, Cheung, and Donnellan (2014) replication attempt, Schnall (2014) writes that the analyses reported in the Johnson et al. (2014) paper “are invalid and allow no conclusions about the reproducibility of the original findings” because of “the observed ceiling effect.”

I agree with Schnall that researchers should be concerned with ceiling effects. When there is relatively little room for scores to move around, it is more difficult to demonstrate that experimental manipulations are effective. But are the ratings so high in Johnson et al.’s (2014) Study 1 that the study is incapable of detecting an effect if one is present?


To address this question, I programmed some simulations in R. The details of the simulations are available online, but here is a summary of some of the key results:

  • Although there are a large number of scores on the high end of the scale in the Johnson et al. Study 1 (I’m focusing on the “Kitten” scenario in particular), the amount of compression that takes place is not sufficient to undermine the study’s ability to detect genuine effects.
  • If the true effect size for the manipulation is relatively large (e.g., Cohen’s d = -.60; see Table 1 of Johnson et al.), but we pass that through a squashing function that produces the distributions observed in the Johnson et al. study, the effect is still evident (see the Figure for a randomly selected example from the thousands of simulations conducted). And, given the sample size used in the Johnson et al. (2014) report, the authors had reasonable statistical power to detect it (70% to 84%, depending on exactly how things get parameterized).
  • Although it is possible to make the effect undetectable by compressing the scores, this requires (a) that we assume the actual effect size is much smaller than what was originally reported, (b) that the scores be compressed so tightly that 80% or more of participants endorsed the highest response, or (c) that the effect work in the opposite direction of what was expected (i.e., that the manipulation pushes scores toward rather than away from the ceiling).
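For readers who want a feel for the logic of these simulations, here is a rough Python sketch (Fraley's actual code was in R, and the means, SDs, squashing rule, and cell sizes below are my illustrative assumptions, not his parameters): draw latent scores, clip them at the scale ceiling, and count how often the group difference still reaches significance.

```python
# Sketch of a ceiling-effect simulation (illustrative parameters only;
# the real simulations were written in R with their own settings).
import random
import statistics
from math import sqrt

random.seed(1)

def squash(x, lo=1, hi=10):
    """Clip a latent score onto a bounded 1-10 rating scale."""
    return max(lo, min(hi, round(x)))

def one_study(n_per_cell=30, d=-0.60, mu=8.5, sd=1.5):
    """Simulate one two-cell study whose ratings pile up near the ceiling.

    Returns a normal-approximation t statistic for the group difference.
    """
    control = [squash(random.gauss(mu, sd)) for _ in range(n_per_cell)]
    treated = [squash(random.gauss(mu + d * sd, sd)) for _ in range(n_per_cell)]
    diff = statistics.mean(control) - statistics.mean(treated)
    se = sqrt(statistics.variance(control) / n_per_cell
              + statistics.variance(treated) / n_per_cell)
    return diff / se

# Crude power estimate: how often does |t| clear the ~.05 cutoff (1.96)?
sims = 2000
power = sum(abs(one_study()) > 1.96 for _ in range(sims)) / sims
print(f"approximate power despite the ceiling: {power:.2f}")
```

With these made-up parameters the squashing shrinks but does not erase the effect, which is the qualitative point of the bullets above; the real simulations plug in the actual Johnson et al. distributions and cell sizes.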

In short, although the Johnson et al. (2014) sample does differ from the original in some interesting ways (e.g., higher ratings), I don’t think it is clear at this point that those higher ratings produced a ceiling effect that precludes their conclusions.


My Scary Vision of Good Science

By Brent W. Roberts

In a recent blog post, I argued that the Deathly Hallows of Psychological Science—p values < .05, experiments, and counter-intuitive findings—represent the combination of factors that are most highly valued by our field and are the explicit criteria for high impact publications. Some commenters mistook my identification of the Deathly Hallows of Psychological Science as a criticism of experimental methods and an endorsement of correlational methods. They even went so far as to say my vision for science was “scary.”


Of course, these critics reacted negatively to the post because I was being less than charitable to some hallowed institutions in psychological science. Regardless, I stand by the original argument. Counter-intuitive findings from experiments “verified” with p values less than .05 are the most valued commodities of our scientific efforts. And, the slavish worshiping of these criteria is at the root of many of our replicability and believability problems.

I will admit that I could have been clearer in my original blog post on the Deathly Hallows. I could have explained in simple language that it is not the ingredients of the Deathly Hallows of Psychological Science, per se, that are the problem, but the blind obedience that too many scholars pay to these criteria. I hope most readers got that point.

Of course, the comment saying my vision was scary did make me think. Just what is my vision for the ideal scientific process in psychology? Actually, that’s an easy question to answer. My vision of good scientific work in psychological science has two basic features. First, ask good questions. Second, answer those questions with informative methods that are well suited for answering those questions. See that? No p-values. No statistics. No experiments. No counter-intuitiveness. We just need good questions and appropriate methods. That’s all.

Good questions, of course, are not so easy to come by. By “good” I mean questions that when answered will provide valuable information. A good question often emerges from the foundation of knowledge in one’s field. It is a question that needs to be answered given the knowledge that has accrued to date. Of course, given the fact that our false positive rate in psychological science ranges from 20% to 80% depending on who you ask, it is genuinely difficult to know what a good question is nowadays. I take that as an arbitrage opportunity—every question is back on the table.

How do you know your question is good? Easy. Your research question is good if the answer is interesting regardless of the result. It should be just as interesting whether the effect is null or not provided the design was appropriate and high-powered. There is an abundance of examples of good scientific questions that have been answered over the years, such as Milgram and Asch’s question of whether humans are conforming. The significance of their work does not ride on whether their effects were p < .05. The significance of their work rests on figuring out that people behave in a very conforming fashion, at least in western populations. It would have been fascinating to find the opposite too. It was a good question and the importance of their results has stood the test of time.

Similarly, the question of whether human phenotypes are heritable and to what extent environmental influences are shared or unique was, and remains, a good question. The answer would have been informative regardless of the proportion of genetic, shared, and unique environmental variance behavior geneticists found in outcomes like personality or psychopathology. The findings were, and still are, fascinating given the relatively modest variance attributable to shared environmental influences.

Appropriate methods are, in part, dictated by the question that needs to be answered. Sometimes that leads to a correlational design, sometimes an experiment, sometimes something in between. God forbid, sometimes it might even call for a case study or a qualitative design. Regardless, a good method is one that provides reliable information on the original research question that was asked. When behavior geneticists were criticized for the equal environments assumption, they went out and found samples of twins that were raised apart. What did they find? They found that phenotypes were just as heritable in twins who shared no environment. You can complain as much as you want about identical twins being treated more alike than fraternal twins, but the studies in which twins raised apart showed the same levels of correspondence as twins raised together answered that question perfectly.

Likewise, when people and researchers questioned the efficacy of psychotherapy, it was the true experimental designs that brought clinical psychology back from the abyss. Decades of diligently run field experiments have now shown that therapy works, at least in those populations that stay in clinical trials. Correlational designs could not have answered the question of whether clinical interventions worked. Only good experimental evidence could answer that question.

My criticism of the Deathly Hallows of Psychological Science rides on the fact that the blind pursuit of this Holy Grail incentivizes bad methods. It is much easier to get your desired finding if you run a series of underpowered studies and then either p-hack by dropping null findings or fish around for significant effects by testing moderators to death. That means that the prototypical package of underpowered conceptually replicated experiments is uninformative about the actual question that motivated the studies in the first place. These practices represent bad methodology and they waste limited and valuable resources. Most, if not all of the recommended changes that have been proposed by the “skeptics” of unreplicable research have been to simply improve the informational value of the methods by increasing sample sizes, directly replicating findings, and avoiding p-hacking. Please, someone, tell me why these are bad recommendations?

I’d add one more ingredient to my “vision” and remind the reader of the late Carl Sagan’s first maxim of his Baloney Detection Kit. The best scientific information comes not only from a study that is directly replicated, but one that is directly replicated by an independent source. That means a researcher who is indifferent, if not hostile, to your finding should be able to reproduce it. That’s good information. That’s a finding that can be trusted. For example, I would put money on the fact that any researcher who has a distaste for the idea of personality traits would, if given the responsibility of tracking personality traits over time, find that they show robust rank-order consistency.

So that is my grand, scary vision for conducting good science. Ask good questions. Answer the question with methods that are informative. An underpowered study is not informative. A properly powered study that can be replicated by a hostile audience is very, very informative. Good science doesn’t have to be an experiment. It doesn’t have to produce a statistically significant finding. Nor does the topic have to be counter-intuitive. It just has to be a trustworthy set of data that attempts to answer a good scientific question.

If that vision scares you, I can recommend a good, if cheap, bottle of red wine or an anxiolytic. Of course, I wouldn’t recommend mixing alcohol and medications, as that can be detrimental to your health, p < .05.


The Deathly Hallows of Psychological Science

By Brent W. Roberts

As of late, psychological science has arguably done more to address the ongoing believability crisis than most other areas of science.  Many notable efforts have been put forward to improve our methods.  From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are doing more than any other science to rectify practices that allow far too many unbelievable findings to populate our journal pages.

The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes.  We need to replace/supplement the typical reporting practices and statistical approaches by:

  1. Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
  2. Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
  3. Providing more information about effect sizes in each report, such as what the effect sizes are for each analysis and their respective confidence intervals.
  4. Valuing direct replication.
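To put a number on recommendation 2, a back-of-the-envelope power calculation (using the standard normal-approximation formula, not any particular author's code) shows how far typical cell sizes fall short:

```python
# Approximate n per group for a two-sample comparison at a given
# effect size, alpha, and power (standard normal-approximation formula).
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed criterion
    z_beta = z.inv_cdf(power)            # desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# An effect on the order of d = .40 needs roughly 99 participants
# per group for 80% power -- far more than the 25-30 per cell that
# is common in the literature.
n = n_per_group(0.40)
print(n)  # → 99
```

The d = .40 input is my illustrative choice of a "typical" effect; plug in whatever effect size your theory actually predicts.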

It seems pretty simple.  Actually, the proposed changes are simple, even mundane.

What has been most surprising is the consistent push back and protests against these seemingly innocuous recommendations.  When confronted with these recommendations it seems many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF.  A striking number of individuals argue against and are quite disdainful of reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, prototypical protests are:

  1. Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
  2. Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so that documenting effect sizes is meaningless.
  3. Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
  4. Direct replications are uninteresting and uninformative.
  5. Conceptual replications are to be preferred because we are testing theories, not confirming techniques.

While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended.  After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60 odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests.  If smart people balk at seemingly benign changes, then there must be something psychologically big lurking behind those protests.  What might that big thing be?  I believe the reason for the passion behind the protests lies in the fact that, though mundane, the changes that are being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.
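To make the "little trouble" concrete: converting a t-test into an effect size is one line of textbook arithmetic (the t and df values in the example are invented for illustration):

```python
# Standard conversions from an independent-groups t statistic to
# effect sizes (the numbers below are made-up examples).
from math import sqrt

def t_to_d(t, df):
    """Cohen's d from an independent-groups t-test."""
    return 2 * t / sqrt(df)

def t_to_r(t, df):
    """Point-biserial r from the same t-test."""
    return sqrt(t ** 2 / (t ** 2 + df))

# A hypothetical t(58) = 2.10 from a 30-per-cell study:
d = t_to_d(2.10, 58)
r = t_to_r(2.10, 58)
print(f"d = {d:.2f}, r = {r:.2f}")  # → d = 0.55, r = 0.27
```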

I think this confrontation needs to be more closely examined because we need to consider the challenges and consequences of deconstructing our incentive system and status structure. This then raises the question: what is our incentive system, and just what are we proposing to do to it? For this, I believe a good analogy is the dilemma faced by Harry Potter in the last book of the eponymously titled book series.


The Deathly Hallows of Psychological Science

In the last book of the Harry Potter series, “The Deathly Hallows,” Harry Potter faces a dilemma. Should he pursue the destruction of the Horcruxes or gather together the Deathly Hallows? The Horcruxes are pieces of Voldemort’s soul encapsulated in small trinkets, jewelry, and such. If they were destroyed, then it would be possible to destroy Voldemort. The Deathly Hallows are three powerful magical objects, which are alluring because by possessing all three, one becomes the “master of death.” The Deathly Hallows are the Cloak of Invisibility, the Elder Wand, and the Resurrection Stone. The dilemma Harry faced was whether to pursue and destroy the Horcruxes, which was a painful and difficult path; or Harry could choose to pursue the Deathly Hallows, with which he could quite possibly conquer Voldemort, and, if not conquer him, live on despite him. He chose to destroy the Horcruxes.

Like Harry Potter, the field of psychological science (and many other sciences) faces a similar dilemma. Pursue changes in our approach to science that eliminate problematic practices that lead to unreliable science—a “destroy the Horcrux” approach. Or, continue down the path of least resistance, which is nicely captured in the pursuit of the Deathly Hallows.

What are the Deathly Hallows of psychological science? I would argue that the Deathly Hallows of psychological science, which I will enumerate below, are 1) p values less than .05, 2) experimental studies, and 3) counter-intuitive findings.

Why am I highlighting this dilemma at this time? I believe we are at a critical juncture. The nascent efforts at reform may either succeed or fade away as they have so many times before. For it is a fact that we’ve confronted this dilemma many times before and have failed to overcome the allure of the Deathly Hallows of psychological science. Eminent methodologists such as Cohen, Meehl, Lykken, Gigerenzer, Schmidt, Fraley, and lately Cumming, have told us how to do things better since the 1960s, to no avail. Revising our approach to science has never been a question of knowing the right thing to do, but rather of whether we were willing to do the thing we knew was right.

The Deathly Hallows of Psychological Science: p-values, experiments, and counter-intuitive/surprising findings

The cloak of invisibility: p<.05. The first Deathly Hallow of psychological science is the infamous p-value. You must attain a p-value less than .05 to be a success in psychological science.  Period.  If your p-value is greater than .05, you have no finding and nothing to say. Without anything to say, you cannot attain status in our field. Find a p-value below .05 and you can wrap it around yourself and hide from the contempt aimed at those who fail to cross that magical threshold.

Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find it.  We root around in our data, digging up p-values either by cherry picking studies, selectively reporting outcomes, or through some arcane statistical wizardry.  One only has to read Bem’s classic article on how to write an article in psychological science to see how we approach p-values as a field:

“…the data.  Examine them from every angle. Analyze the sexes separately.  Make up new composite indices.  If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data.  If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief.  If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily).  Go on a fishing expedition for something–anything–interesting.”

What makes it worse is that when authors try to report null effects they are beaten down because we as reviewers and editors do everything in our power to hide the null effects. Null effects make for a messy narrative. Our most prestigious journals almost never publish null effects because reviewers and editors act as gatekeepers and mistakenly recommend against publishing null effects. Consider the following personal example. In one study, reviewer 2 argued that our study was not fit for publication in JPSP because one of our effects was null (there were other reasons too). Consider the fact that the null effect in question was a test of a hypothesis drawn from my own theory. I was trying to show that my theory did not work all of the time, and the reviewer was criticizing me for showing that my own ideas might need revision. This captures quite nicely the tyranny of the p-value. The reviewer was so wedded to my ideas that he or she wouldn’t even let me, the author of said ideas, offer up some data that would argue for revising them.

In the absence of simply rejecting null effects, we often recommend cutting the null effects. I have seen countless recommendations in reviews of my papers and the papers of colleagues to simply drop studies or results that show null effects.  It is not then surprising that psychology confirms 95% of its hypotheses.

Even worse, we often commit the fundamental attribution error by thinking that the person trying to publish null effects is an incompetent researcher—especially if they fail to replicate an already published effect that has crossed the magical p< .05 threshold. Not to be too cynical, but the reviewers may have a point.  If you are too naïve to understand “the game”, which is to produce something with p < .05, then maybe you shouldn’t succeed in our field.  Setting sarcasm aside, what the gatekeepers don’t understand is that they are sending a clear message to graduate students and assistant professors that they must compromise their own integrity in order to succeed in our field. Of course, this leads to the winnowing of the field of researchers who don’t want to play the game.

The Elder Wand: Running Experiments

Everyone wants to draw a causal conclusion, even observational scientists. And, of course, the best way to draw a causal conclusion, if you are not an economist, is to run an experiment.  The second Deathly Hallow for psychological science is doing experimental research at all costs.  As one of my past colleagues told a first year graduate student, “if you have a choice between a correlational or an experimental study, run an experiment.”

Where things go awry, I suspect, is when you value experiments so much, you do anything in your power to avoid any other method. This leads to levels of artificiality that can get perverse. Rather than studying the effect of racism, we study the idea of racism.  Where we go wrong is that, as Cialdini has noted before, we seldom work back and forth between the fake world of our labs and the real world where the phenomenon of interest exists. We become methodologists, rather than scientists.  We prioritize lab-based experimental methods because they are most valued by our guild not because they necessarily help us illuminate or understand our phenomenon but because they putatively lead to causal inferences. One consequence of valuing experiments so highly is that we get caught up in a world of hypothetical findings that have unknown relationships to the real world because we seldom if ever touch base with applied or field research.  As Cialdini so poignantly pointed out, we simply don’t value field research enough to pursue it with equal vigor to lab experiments.

And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good.  To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.

Of course, now that we know many researchers run as many experiments as they need to in order to construct what findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand wielding wizard of an experimentalist simply because that is the best route to fame and fortune in our guild.

Most of my work has been in the observational domain, and admittedly, we have the same motive, but lack the opportunity to implement our desires for causal inference.  So far, the IRB has not agreed to let us randomly assign participants to the “divorced or not divorced” or the “employed, unemployed” conditions. In the absence of being able to run a good, clean experiment, observational researchers, like myself, bulk up on the statistics as a proxy for running an experiment. The fancier, more complex, and indecipherable the statistics, the closer one gets to the status of an experimenter. We even go so far as to mistake our statistical methods, such as cross-lagged panel longitudinal designs, for ones that would afford us the opportunity to make causal inferences (hint: they don’t). Reviewers are often so befuddled by our fancy statistics that they fail to notice the inappropriateness of that inferential leap.

I’ve always held my colleague Ed Diener in high esteem.  One reason I think he is great is that as a rule he works back and forth between experiments and observational studies, all in the service of creating greater understanding of well-being.  He prioritizes his construct over his method. I have to assume that this is a much better value system than our long standing obsession with lab experiments.

The Resurrection Stone: Counter-intuitive findings

The final Deathly Hallow of psychological science is to be the creative destroyer of widely held assumptions. In fact, the foundational writings about the field of social psychology lay it out quite clearly. One of the primary routes to success in social psychology, for example, is to be surprising.  The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which humans are irrational, unpredictable, or downright surprising (Ross, Lepper, & Ward, 2010).

It is hard to argue with this motive.  We hold those scientists who bring unique discoveries to their field in the highest esteem.  And, every once in a while, someone actually does do something truly innovative. In the meantime, the rest of us make up little theories about trivial effects that we market with cute names, such as the “End of History Effect” or the “Macbeth Effect,” or whatever.  We get caught up in the pursuit of cutesy counter-intuitiveness all under the hope that our little innovation will become a big innovation.  To the extent that our cleverness survives the test of time, we will, like the resurrection stone, live on in our timeless ideas even if they are incorrect.

What makes the pursuit of innovation so formidable an obstacle to reform is that it sometimes works. Every once in a while someone does revolutionize a field. The aspiration to be the wizard of one’s research world is not misplaced.  Thus, we have an incentive system that produces a variable-ratio schedule of reinforcement—one of the toughest to break according to those long forgotten behaviorists (We need not mention behavioral psychologists, since their ideas are no longer new, innovative, or interesting–even if they were right).

Reasons for Pessimism

The problem with the current push for methodological reform is that, like pursuing the Horcruxes, it is hard and unrewarding in comparison to using the Deathly Hallows of psychological science. As one of our esteemed colleagues has noted, no one will win the APA Distinguished Scientific Award by failing to replicate another researcher’s work. Will a person who simply conducts replications of other researchers’ work get tenure?  It is hard to imagine. Will researchers do well to replicate their own research? Why? It will simply slow them down and handicap their ability to compete with the other aspiring wizards who are producing the conceptually-replicated, small N lab-based experimental studies at a frightening rate. No, it is still best to produce new ideas, even if it comes at the cost of believability. And, everyone is in on the deal. We all disparage null findings in reviews because we want errors of commission rather than omission.

Another reason why the current system may be difficult to fix is that it provides a weird p-value driven utopia. With the infinite flexibility of the Deathly Hallows of psychological science we can pretty much prove any idea is a good one. When combined with our antipathy toward directly replicating our own work or the work of others, everyone can be a winner in the current system. All it takes is a clever idea applied to enough analyses and every researcher can be the new hot wizard. Without any push to replicate, everyone can co-exist in his or her own happy p-value driven world.

So, there you have it.  My Depth Psychology analysis of why I fear that the seemingly benign recommendations for methodological change are falling on deaf ears.  The proposed changes contradict the entire status structure that has served our field for decades.  I have to imagine that the ease with which the Deathly Hallows can be used is one reason why reform efforts have failed in the past. As many have indicated, the same recommendations to revise our methods have been made for over 50 years. Each time, the effort has failed.

In sum, while there have been many proposed solutions to our problems, I believe we have not yet faced our real issue, which is how are we going to re-structure our incentive structure?  Many of us have stated, as loudly and persistently as we can that there are Horcruxes all around us that need to be destroyed. The move to improve our methods and to conduct direct replications can be seen as an effort to eliminate our believability Horcruxes. But, I believe the success of that effort rides on how clearly we see the task ahead of us. Our task is to convince a skeptical majority of scientists to dismantle an incentive structure that has worked for them for many decades. This will be a formidable task.


For the love of p-values

We recently read Karg et al. (2011) for a local reading group.  It is one of the many attempts to meta-analytically examine the idea that the 5-HTTLPR serotonin transporter polymorphism moderates the effect of stress on depression.

It drove me batty.  No, it drove me to apoplectia–a small country in my mind I occupy far too often.

Let’s focus on the worst part.  Here’s the write up in the first paragraph of the results:

“We found strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002).  The significance of the result was robust to sensitivity analysis, with the overall P values remaining significant when each study was individually removed from the analysis (1.0 × 10⁻⁶ < P < .00016).”

Wow.  Isn’t that cool?  Isn’t that impressive?  Throw out all of the confused literature and meta-analyses that came before this one.  They found “strong evidence” for this now infamous moderator effect.  Line up the spit vials. I’m getting back in the candidate gene GxE game.

Just what did the authors mean by “strong”?  Well, that’s an interesting question.  There is nary an effect size in the review, as the authors chose not to examine effect sizes but focused on synthesizing p-values instead.  Of course, if you have any experience with meta-analytic types, you know how they feel about meta-analyzing p-values.  They feel about it the way Nancy Reagan felt about drugs: just say no.  If you are interested in why, read Lipsey and Wilson or any other meta-analysis guru.  They are unsympathetic, to say the least.

But, all is not lost. All you, the reader, have to do is transform the p-value into an effect size using any of the numerous on-line transformation programs that are available.  It takes about 15 seconds to do it yourself.  Or, if you want to be thorough, you can take the data from Table 1 in Karg et al (2011) and transform the p-values into effect sizes for your own meta-analytic pleasure. That takes about 15 minutes.

So what happens when you take their really, really significant p-value of p = .00002 and transform it into an effect size estimate?  Like good meta-analytic types, the authors provide the overall N, which is 40,749.  What does that really impressive p-value translate into in the r metric?

.0199 or .02 if you round up.

It is even smaller than Rosenthal’s famous .03 correlation between aspirin consumption and protection from heart disease.  You get the same thing when you plug all of the numbers from Table 1 into Comprehensive Meta-Analysis, by the way.
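For readers who want to check the arithmetic, the conversion takes only a few lines of Python. This is a minimal sketch: it treats the reported p-value as two-tailed and uses the large-sample approximation r = z / √N, so the second decimal may differ slightly from whichever online converter you prefer.

```python
from math import sqrt
from statistics import NormalDist  # Python 3.8+ standard library

def p_to_r(p, n, two_tailed=True):
    """Convert a p-value and total N to an approximate effect size r.

    Uses the large-sample approximation r = z / sqrt(N), where z is the
    normal deviate corresponding to the p-value. Online converters use
    slight variants, so expect small differences in the second decimal.
    """
    tail = p / 2 if two_tailed else p
    z = NormalDist().inv_cdf(1 - tail)
    return z / sqrt(n)

# Karg et al. (2011): overall p = .00002 with a total N of 40,749
print(round(p_to_r(0.00002, 40_749), 3))  # lands near .02 either way
```

Whichever variant of the conversion you use, an N of 40,749 turns even a p of .00002 into an r of about .02.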

So the average interaction between the serotonin transporter promoter and stress on depression is “strong,” “robust,” yet infinitesimal.  It sounds like a Monty Python review of Australian wine (“Bold, yet naïve.” “Flaccid, yet robust”).

Back to our original question, what did the authors mean when they described their results as “strong?”  One can only assume that they mean to say that their p-value of .00002 looks a lot better than our usual suspect, the p < .05.  Yippee.

Why should we care?  Well, this is a nice example of what you get when you ignore effect size estimates and just use p-values: misguided conclusions.  The Karg et al. (2011) paper has been cited 454 times so far.  Here’s a quote from one of the papers that cites their work: “This finding, initially refuted by smaller meta-analyses, has now been supported by a more comprehensive meta-analysis” (Palazidou, 2012).  Wrong.

Mind you, there is no inconsistency across the meta-analyses.  If the average effect really is an r of .02 (and I doubt it is even that big), it is really, really unlikely to be consistently detected by any single study, much less a meta-analysis.  The meta-analyses only appear to disagree because the target effect size is so small that even dozens of studies and thousands of participants can fail to detect it.

Another reason to care about misguided findings is the mistaken conclusions that individuals or granting agencies might draw if they take these findings at face value.  They might conclude that the GxE game is back on and start funding candidate gene research (doubtful, but possible).  Researchers themselves might come to the mistaken conclusion that they too can investigate GxE designs.  Heck, the average sample size in the meta-analysis is 755.  With a little money and diligence, one could come by that kind of sample, right?

Of course, that leads to an interesting question.  How many people do you need to detect a correlation of .02?  Those pesky granting agencies might ask you to do a power analysis, right?  Well, to achieve 80% power to detect a correlation of .02 with a two-tailed test at α = .05, the standard Fisher-z approximation says you would need roughly 19,600 participants.  That means the average sample in the meta-analysis was woefully underpowered to detect the average effect size.  For that matter, it means that none of the studies in the meta-analysis was adequately powered to detect the average effect size, because the largest study, which was a null effect, had an N of 3,243.
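These power figures can be checked with the standard Fisher-z approximation for correlations. This is a sketch; dedicated tools such as G*Power use exact methods and will differ in the trailing digits.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist  # Python 3.8+ standard library

def n_for_r(r, power=0.80, alpha=0.05):
    """Approximate N needed to detect correlation r with a two-tailed test.

    Standard Fisher-z formula: n = ((z_alpha + z_power) / atanh(r))^2 + 3.
    """
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_b = nd.inv_cdf(power)          # deviate for the desired power
    return ceil(((z_a + z_b) / atanh(r)) ** 2 + 3)

def power_for_n(r, n, alpha=0.05):
    """Approximate power of a sample of size n to detect correlation r."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(atanh(r) * sqrt(n - 3) - z_a)

print(n_for_r(0.02))           # N required for 80% power at r = .02
print(power_for_n(0.02, 755))  # power of the average study (N = 755)
```

By this approximation, the average study in the meta-analysis, at N = 755, had well under 10% power to detect an r of .02.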

So, this paper proves a point: if you accumulate enough participants in your research, almost anything is statistically significant.  And this warrants publication in the Archives of General Psychiatry?  Fascinating.

Brent W. Roberts


Are conceptual replications part of the solution to the crisis currently facing psychological science?

by R. Chris Fraley

Stroebe and Strack (2014) recently argued that the current crisis regarding replication in psychological science has been greatly exaggerated. They observed that there are multiple replications of classic social/behavioral priming findings in social psychology. Moreover, they suggested that the current call for replications of classic findings is not especially useful. If a researcher conducts an exact replication study and finds what was originally reported, no new knowledge has been generated. If the replication study does not find what was originally reported, this mismatch could be due to a number of factors and may speak more to the replication study than the original study per se.

As an alternative, Stroebe and Strack (2014) argue that, if researchers choose to pursue replication, the most constructive way to do so is through conceptual replications. Conceptual replications are potentially more valuable because they serve to probe the validity of the theoretical hypotheses rather than a specific protocol.

Are conceptual replications part of the solution to the crisis currently facing psychological science?

The purpose of this post is to argue that we can only learn anything of value—whether it is from an original study, an exact replication, or a conceptual replication—if we can trust the data. And, ultimately, a lack of trust is what lies at the heart of current debates. There is no “replicability crisis” per se, but there is an enormous “crisis of confidence.”

To better appreciate the distinction, consider the following scenarios.

A. At the University of A, researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y2, and that X3 leads to Y3. In other words, there are several studies suggesting that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.

B. At the University of B, researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y2, and that X3 leads to Y3. In other words, there are several studies suggesting that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.

Is one set of research findings more credible than the other? What’s the difference?

At the University of A, researchers conducted 8 studies in total. Some of these were pilot studies that didn’t pan out but led to some ideas about how to tweak the measure of Y. A few of the studies involved exact replications with extensions; the so-called exact replication part didn’t quite work, but one of the other variables did reveal a difference that made sense in light of the theory, so that finding was submitted (and accepted) for publication. In each case, the data from on-going studies were analyzed each week for lab meetings, and studies were considered “completed” when a statistically significant effect was found. The sample sizes were typically small (20 per cell) because a few other labs studying a similar issue had successfully obtained significant results with small samples.

In contrast, at the University of B a total of 3 studies were conducted. The researchers used large sample sizes to estimate the parameters/effects well. Moreover, the third study had been preregistered such that the stopping rules for data collection and the primary analyses were summarized briefly (3 sentences) on a time-stamped site.

Both research literatures contain conceptual replications. But once one has full knowledge of how these literatures were produced, one may doubt whether the findings and theories being studied by the researchers at the University of A, generated via a Simmons et al. (2011) sleight of hand, are as solid as those being studied at the University of B. This example is designed to help separate two key issues that are often conflated in debates concerning the current crisis.

Specifically, as a field, we need to draw a sharper distinction between (a) replications (exact vs. conceptual) and (b) the integrity of the research process (see Figure) when considering the credibility of knowledge generated in psychological science. We sometimes conflate these two things, but they are clearly separable.

[Figure: Replication and Integrity. The difference between methodological integrity and replication, and their relation to the credibility of research.]

Speaking for myself, I don’t care whether a replication is exact or conceptual. Both kinds of studies serve different purposes and both are valuable under different circumstances. But what matters critically for the current crisis is the integrity of the methods used to populate the empirical literature. If the studies are not planned, conducted, and published in a manner that has integrity, then—regardless of whether those findings have been conceptually replicated—they offer little in the way of genuine scientific value. The University of A example above illustrates a research field that has multiple conceptual replications. But those replications do little to boost the credibility of the theoretical model because the process that generated the findings was too flexible and not transparent (Simmons, Nelson, & Simonsohn, 2011).

When skeptics call for “exact replications,” what they really mean is that “we don’t trust the integrity of the process that led to the publication of the findings in the first place.” An exact replication provides the most obvious way to address that matter; that is why skeptics, such as my colleague, Brent Roberts, are increasingly demanding them. But improving the integrity of the research process is the most direct way to improve the credibility of published work. This can be accomplished, in part, by using established and validated measures, taking statistical power or precision seriously, using larger sample sizes, preregistering analyses and designs when viable, and, of course, conducting replications along the way.

I agree with Stroebe and Strack (2014) that conceptual replication is something for which we should be striving. But, if we don’t practice methodological integrity, no number of replications will solve the crisis of confidence.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59–71.


The Pre-Publication Transparency Checklist

The Pre-Publication Transparency Checklist: A Small Step Toward Increasing the Believability of Psychological Science

We now know that some of the well-accepted practices of psychological science do not produce reliable knowledge. For example, widely accepted but questionable research practices contribute to the fact that many of our research findings are unbelievable (that is, one is ill-advised to revise one’s beliefs based on the reported findings).  Post-hoc analyses of seemingly convincing studies have shown that some findings are too good to be true.  And a string of seminal studies has failed to replicate.  These factors have come together to create a believability crisis in psychological science.

Many solutions have been proffered to address the believability crisis.  These solutions have come in four general forms.  First, many individuals and organizations have listed recommendations about how to make things better.  Second, other organizations have set up infrastructures so that individual researchers can pre-register their studies, documenting hypotheses, methods, analyses, research materials, and data so that others can reproduce published research results (e.g., the Open Science Framework).  Third, specific journals, such as Psychological Science, have set up pre-review confessionals of sorts to indicate the conditions under which the data were collected and analyzed.  Fourth, others have created vehicles so that researchers can confess to their methodological sins after their work has been published.  In fact, psychology should be lauded for the reform efforts it has put forward to address the believability crisis, as it is only one of many scientific fields in which the crisis is currently raging, and it is arguably doing more than many other fields.

While we fully support many of these efforts at reform, it has become clear that they leave a gaping hole through which researchers can blithely walk.  People can and do ignore recommendations.  Researchers can avoid pre-registering their work.  Researchers can also avoid publishing in journals that require confessing one’s QRPs before review.  And published authors can avoid admitting to their questionable research practices post hoc.  What this means is that research continues to be published every month in our most prestigious journals that, in design and method, looks indistinguishable from the research that led to the believability crisis in the first place.

In searching for solutions to this problem, we thought that, instead of relying on the good graces of individual researchers to change their own behavior or waiting for the slow pace of institutional change (e.g., journals following Psychological Science’s lead), it might be productive to provide a tool that could be used by everyone, right now.  So what are we proposing?  We propose a set of questions that all researchers should be able to answer pre-publication in the review process—the Pre-Publication Transparency Checklist (PPTC).  Who should use these questions?  Reviewers.  Reviewers are free to ask any question they want, as many of us can attest.  There is nothing stopping researchers from holding other researchers accountable.  The goal of these questions is to get even those unwilling to pre- or post-register their research process to cough up background information on how they conducted their research and the extent to which their results are “fragile” or “robust”.  The questions are inspired by the changes recommended by many different groups and would hopefully help to improve the believability of the research by making authors describe the conditions under which the research was conducted before their paper is accepted for publication[1].

The Pre-Publication Transparency Checklist

  1. How many studies were run in conceptualizing and preparing the set of studies reported in this paper?
    • How many studies were run under the original IRB proposal?
    • How many “pilot” studies were run in support of the reported studies?
  2. If an experiment was run, how many conditions were included in the original study and were all of these conditions included in the current manuscript?a
  3. Was any attempt made to directly replicate any of the studies reported in this paper?
    • Would you be willing to report the direct replications as an on-line appendix?
    • Note: In some fields it is common to replicate research but not report the efforts.
    • Note: Some studies are difficult to replicate (e.g., longitudinal, costly, technologically intense).
  4. Approximately how many outcome measures were assessed in each study?a
    • How many of these outcome measures were intended for this study?
    • How many outcome measures were analyzed for each study?
    • Do all of the conceptually and empirically related DVs show the same pattern of effects?
  5. In any of the studies presented, were the data analyzed as they were being collected (i.e., peeked at)?
    • If it was “peeked” at, was there an effort to address the potential increase in the Type I error rate that results from peeking, such as conducting direct replications or using Bayesian estimation approaches?
    • Note: The goal is not necessarily to eliminate p-hacking but to make sure our findings are replicable despite p-hacking (see Asendorpf et al., 2012, for a discussion).
  6. What was the logic behind the sample sizes used for each study?a
    • Was a power analysis performed in the absence of pilot data?
    • Was an effect size estimate made on the initial work and used for power estimates of subsequent studies whether they were direct or conceptual replications?
  7. Were any participants assessed and their data not included in the final report of each study?a
    • What was the rationale for not including these participants?
  8. Do all of the co-authors of the study have access to the data?
  9. Can all of the co-authors of the study reproduce the analyses?
    • If not, why and who can?
    • Note: It is common for statistical experts to lend a helping hand so it is not necessarily bad that all the authors cannot reproduce the analyses.  But, it is important to know who can and cannot reproduce the analyses for future efforts to reproduce the results.

  10. Were there any attempts to test whether the results were robust across analytical models and different sets of control variables?
    • If the results do not replicate across models, was this factored into the alpha level (multiple tests of significance)?
  11. Approximately how many different analytical approaches were used?
    • Were alternative ways of analyzing the data considered and tested?
    • Note: It is common to try different variants of the general linear model (ANOVA, ANCOVA, regression, HLM, SEM).  It would be important to know whether the results replicate across the various ways of analyzing the data.

So, as we noted above, these questions could be asked of researchers when they present their work for review.  Ideally, the answers to these questions would become part of the reported methods of every paper submitted, possibly as an on-line appendix. If reviewers asked these types of questions of every paper that was submitted, that in itself would change the publication incentive structure quite dramatically.

A second way that the Pre-Publication Transparency Checklist could be used is by editors of journals other than Psychological Science.  Like reviewers, editors could ask authors to simply answer each of these questions along with their submission.  There is no reason why Psychological Science should go it alone with this type of questioning.  The effort to answer these questions is minor—far less, for example, than the time taken to complete the typical IRB form.  Again, if editors used the PPTC, which they should be free to do today, we could be on our way to a better, more substantial body of research on which to base our future scientific efforts.

Given the heterogeneity of reactions to the believability crisis in psychology, we do not foresee the answers to these questions being “right” or “wrong” so much as providing information that other researchers can use to determine whether they personally would want further follow-up before concluding that the research was reliable.  But, of course, like the traditional methods we use in psychological science, which rely on transparency, accuracy, and honesty, the answers will only be as good as they are truthful.

We are also sympathetic to the point that many of the questions will be difficult to answer and that many questions will not apply to different types of research.  That is okay.  The goal is not an exacting account of every behavior and decision made on the way to a finished publication.  The goal is to provide background information to help determine how robust or delicate the findings may be.  For example, if dozens of studies were run looking at scores of outcomes and only a few of the studies and outcomes were reported, then other researchers may not want to attempt to build on the findings before directly replicating the results themselves.  Similarly, if multiple analyses were conducted and only the statistically significant ones reported, then other researchers would likewise be cautious when following up on the findings.

As noted above, the PPTC would not be necessary if researchers pre-registered their studies, posted their materials and data on-line, and were transparent with the description of their methods.  But, given the obvious fact that not every researcher is going to pre-register their materials, the Pre-Publication Transparency Checklist provides a means through which reviewers and editors can get these individuals to provide desperately needed information on which to judge the robustness or fragility of the reported findings in any submitted manuscript.

Brent W. Roberts, University of Illinois, Urbana-Champaign

Daniel Hart, Rutgers University

[1] These accords include those proposed for pre-review at Psychological Science, post-publication disclosure, and the 21 Word Solution.  Those questions marked with an “a” are similar in content to the questions found on existing systems.
