Sample Sizes in Personality and Social Psychology

R. Chris Fraley

Imagine that you’re a young graduate student who has just completed a research project. You think the results are exciting and that they have the potential to advance the field in a number of ways. You would like to submit your research to a journal that has a reputation for publishing the highest caliber research in your field.

How would you know which journals have a reputation for publishing high-quality research?

Traditionally, scholars and promotion committees have answered this question by referencing the citation Impact Factor (IF) of journals. But as critics of the IF have noted, citation rates per se may not reflect anything informative about the quality of empirical research. A paper can receive a large number of citations in the short run because it reports surprising, debatable, or counter-intuitive findings regardless of whether the research was conducted in a rigorous manner. In other words, the citation rate of a journal may not be particularly informative concerning the quality of the research it reports.

What would be useful is a way of indexing journal quality that is based upon the strength of the research designs used in published articles rather than the citation rate of those articles alone.

In an article recently published in PLoS ONE, Simine Vazire and I attempted to do this by ranking major journals in social-personality psychology with respect to what we call their N-pact Factors (NF)–the statistical power of the studies they publish. Statistical power is defined as the probability of detecting an effect of interest when that effect actually exists. Statistical power is relevant for judging the quality of empirical research literatures because, compared to lower powered studies, studies that are highly powered are more likely to (a) detect valid effects, (b) buffer the literature against false positives, and (c) produce findings that other researchers can replicate. Although power is certainly not the only way to evaluate the quality of empirical research, the more power a study has, the better positioned it is to provide useful information and to make robust contributions to the empirical literature.
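
Neither the definition nor the paper's numbers require specialized software to appreciate. As a rough, hypothetical illustration (not code from the paper), the normal-approximation formula for the power of a two-sided, two-sample t-test shows how quickly power changes with sample size for an effect of roughly d = .43, a value often cited as typical for social-personality research:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_sample(d, n_per_group):
    """Approximate power of a two-sided, two-sample t-test at
    alpha = .05, using the normal approximation and ignoring
    the (tiny) lower rejection tail."""
    z_crit = 1.96  # two-sided alpha = .05 critical value
    ncp = d * sqrt(n_per_group / 2)  # noncentrality of the test statistic
    return phi(ncp - z_crit)

# Power at several per-cell sample sizes for a "typical" effect
for n in (20, 50, 100, 200):
    print(n, round(power_two_sample(0.43, n), 2))
```

Under these assumptions, power is roughly .27 at 20 participants per cell, .58 at 50, and only reaches about .86 around 100 per cell, which makes the 40% to 77% range reported in the paper easy to reconcile with common sample sizes.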

Our analyses demonstrate that, overall, the statistical power of studies published by major journals in our field tends to be inadequate, ranging from 40% to 77% for detecting the typical kinds of effect sizes reported in social-personality psychology. Moreover, we show that there is considerable variation among journals; some journals tend to consistently publish higher power studies and have lower estimated false positive rates than others. And, importantly, we show that some journals, despite their comparatively high impact factors, publish studies that are greatly underpowered for scientific research in psychology.

We hope these rankings will help researchers and promotion committees better evaluate various journals, allow the public and the press (i.e., consumers of scientific knowledge in psychology) to have a better appreciation of the credibility of published research, and perhaps even facilitate competition among journals in a way that would improve the net quality of published research. We realize that sample size and power are not and should not be the gold standard in evaluating research. But we hope that this effort will be viewed as a constructive, if incomplete, contribution to improving psychological science.

Simine wrote a nice blog post about some of the issues relevant to this work. Please check it out.



Is It Offensive To Declare A Social Psychological Claim Or Conclusion Wrong?

By Lee Jussim

Science is about “getting it right” – this is so obvious that it should go without saying. However, there are many obstacles to doing so, some relatively benign (an honestly conducted study produces a quirky result), others less so (p-hacking). Over the last few years, discussion of the practices that lead us astray has focused primarily on issues of statistics, methods, and replication.

These are all justifiably important, but here I raise the possibility that other, more subjective factors distort social and personality psychology in ways that are at least as problematic. Elsewhere, I have reviewed what I now call questionable interpretive practices – how cherrypicking, double standards, blind spots, and embedding political values in research all lead to distorted conclusions (Duarte et al., 2014; Jussim et al., in press a, b).

But there are other interpretive problems. Ever notice how very few social psychological theories are refuted or overturned? Disconfirming theories and hypotheses (including the subset of disconfirmation, failures to replicate) should be a normal part of the advance of scientific knowledge. It is ok for you (or me, or Dr. I. V. Famous) to have reached or promoted a wrong conclusion.

In social psychology, this rarely happens. Why not? Many social psychologists seem to balk at declaring some claims “wrong.” This seems to occur primarily for three reasons. The first is that junior scholars, especially pre-tenure, may justifiably feel that potentially angering senior colleagues (who may later be called on to write letters for promotion) is not a wise move. That is the nature of the tenure beast, but it only explains the behavior of, at most, a minority. What about the rest of us?

The second reason is essentially social (i.e., not scientific). Declaring some scientific claim to be “wrong” is, I suspect, often perceived as a personal attack on the claimant. This probably occurs because it is impossible to declare some claim wrong without citing some article making the claim. Articles have authors, so that declaring a claim wrong is tantamount to saying “Dr. Earnest’s claims are wrong.” This problem is further exacerbated by the fact that theories, hypotheses, and phenomena often become identified with either their originators or their apostles (prestigious researchers who popularize them). Priming social behavior? Fundamental attribution error? Bystander effect? System justification? Implicit racism? There are individual social psychologists associated with each of these ideas. To challenge the validity, or even the power or generality, of such ideas/effects/theories/hypotheses risks being interpreted as something more than a mere scientific endeavor – it risks being seen as a personal insult to the person identified with them. Thus, declaring a claim “wrong” risks being seen, not as a scientific act of theory or hypothesis disconfirmation, but as a personal attack — and no one supports personal attacks.

The third reason is grounded in a particular philosophy of science perspective – namely, that almost every claim is true under some conditions (for explicitly articulated versions of this, see Greenwald, Pratkanis, Leippe, & Baumgardner, 1986; McGuire, 1973, 1983). As such, we have a great deal of research on “conditions under which” some theory or hypothesis holds, but very little research providing wholesale refutation of a theory or hypothesis. I have heard apocryphal stories of prestigious researchers declaring (behind closed doors) that they only run studies to prove what they already know and that they can craft a study to confirm any hypothesis they choose. These apocrypha are not evidence – but the evidence of p-hacking in social psychology and elsewhere (e.g., Ioannidis, 2005; Simmons et al., 2011; Vul et al., 2009) raises the possibility that some unknown number of social psychologists conduct their research in a manner consistent both with these apocrypha and with the notion that everything is true under some conditions. If every claim is true under some conditions, then massive flexibility in methods and data analysis in the service of demonstrating almost any notion becomes, not a flaw to be rooted out of science, but evidence of the “skill” and “craftsmanship” of researchers, and of the “quality” of their research. In this context, declaring any scientific claim, conclusion, hypothesis, or theory “wrong” becomes unjustified. It reflects little more than ignorance of this “sophisticated” view of science, and arrogance in the sense that no one, according to this view, can declare anything “wrong” because it is true under some conditions. As such, declaring some claim wrong can again be viewed as an offensive act.

The idea that claims cannot be “wrong” because “every claim is true under some circumstances” goes too far for two reasons. First, some claims are outright false, such as “the Sun revolves around the Earth.” Furthermore, even if two competing claims are both correct under some conditions, this does not mean they are equally true. Knowing that something is true 90% of the time is quite different from knowing it is true 10% of the time. Claiming that some phenomenon is “powerful” or “pervasive,” when the data show it is only rarely true, is wrong. Let’s say that, on average, stereotype biases in person perception are not very powerful or pervasive – which they are not (Jussim, 2012 – multiple meta-analyses yield an average estimate of r = .10 for such biases). Isn’t it better to point out that the field’s long history of declaring them to be powerful and pervasive is wrong (at least when the criterion is the field’s own data), than to just report the data without acknowledging its bearing on longstanding conclusions?

This reluctance to declare certain theories or hypotheses wrong risks leading social psychology to become populated with a plethora of “… undead theories that are ideologically popular but have little basis in fact” (Ferguson & Heene, 2012, p. 555). This amusing phrasing cannot be easily dismissed – ask yourself, “Which theories in social psychology have ever been disconfirmed?” Indeed, Dr. Bruce Alberts, a former President of the National Academy of Sciences and Editor-in-Chief of Science, put it this way (quoted in The Economist, 2013):

“And scientists themselves need to develop a value system where simply moving on from one’s mistakes without publicly acknowledging them severely damages, rather than protects, a scientific reputation.”

I agree. It is ok to be wrong. In fact, if one engages in enough scientific research for a long enough period of time, one is almost guaranteed to be wrong about something. Good research at its best can be viewed as systematic, creative, and informed trial and error. But that includes … error! Both being wrong sometimes, and correcting wrong claims are integral parts of healthy scientific processes.

Furthermore, from a prescriptive standpoint of how science should proceed, I concur with Popper’s (1959/1968) notion that we should seek to disconfirm theories and hypotheses. Ideas left standing in the face of strong attempts at disconfirmation are those most likely to be robust and valid. Thus, rather than being something we social psychologists should shrink away from, bluntly identifying which theories and hypotheses do not (and do!) hold up to tests of logic and existing data should be a core component of how we conduct our science.



Duarte, J. L., Crawford, J. T., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. E. (2014). Political diversity will improve social psychological science. Manuscript that I hope is on the verge of being accepted for publication.

The Economist (October 19, 2013). Trouble at the lab. Retrieved on 7/8/14 from:

Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychology’s aversion to the null. Perspectives on Psychological Science, 7, 555-561.

Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct research progress? Psychological Review, 93, 216-229.

Ioannidis, J. P. A. (2005).  Why most published research findings are false. PLOS Medicine, 2, 696-701.

Jussim, L. (2012). Social perception and social reality: Why accuracy dominates bias and self-fulfilling prophecy. NY: Oxford University Press.

Jussim, L., Crawford, J. T., Anglin, S. M., & Stevens, S. T. (In press a). The politics of social psychological science II: Distortions in the social psychology of liberalism and conservatism. To appear in J. Forgas, K. Fiedler, & W. Crano (Eds.), Sydney Symposium on Social Psychology and Politics.

Jussim, L., Crawford, J. T., Stevens, S. T., & Anglin, S. M. (In press b). The politics of social psychological science I: Distortions in the social psychology of intergroup relations. To appear in P. Valdesolo & J. Graham (Eds.), Bridging Ideological Divides: Claremont Symposium on Applied Social Psychology.

McGuire, W. J. (1973). The yin and yang of progress in social psychology: Seven koan. Journal of Personality and Social Psychology, 26, 446-456.

McGuire, W. J. (1983). A contextualist theory of knowledge: Its implications for innovation reform in psychological research. Advances in Experimental Social Psychology, 16, 1-47.

Popper, K. R. (1959/1968). The logic of scientific discovery. New York: Harper & Row.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011).  False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290.


An apology and proposal

Brent W. Roberts

My tweet, “Failure to replicate hurting your career? What about PhDs with no career because they were honest” was taken by some as a personal attack on Dr. Schnall. It was not, and I apologize to Dr. Schnall if it was taken that way. The tweet was in reference to the field as a whole, because our current publication and promotion system does not reward the honest design and reporting of research. And this places many young investigators at a disadvantage. Let me explain.

Our publication practices reward the reporting of optimized data—the data that look the best or that can be dressed up to look nice through whatever means necessary. We have no choice, given the way we incentivize our publication system. That system, which punishes null findings and rewards only statistically significant effects, means that our published science is not currently an honest portrait of how our science works. The current rash of failures to replicate famous and not so famous studies is simply a symptom of a system that is in dire need of reform. Moreover, students who are unwilling to work within this system—to be honest with their failures to replicate published work, for example—are punished disproportionately. They wash out, get counseled into other fields, or simply choose to leave our field of their own accord.

Of course, I could be wrong. It is possible that the majority of researchers publish all of their tests of all of their ideas somewhere, including their null findings. I’m open to that possibility.  But, like many hypotheses, it should be tested and I have an idea for how to test it.

Take any one of our flagship journals and for 1 year follow a publication practice much like the one followed for the special replication issue just published. During that year, the editors agree to review and publish only manuscripts that 1) have been pre-registered, 2) describe only their introduction, methods, and planned analyses, not their results, and 3) contain at least one direct replication of each unique study in any given proposed package of studies. The papers would be “accepted” based on the elegance of the theory and the adequacy of the methods alone. The results would not be considered in the review process. Of course, the pre-registered studies would be “published” in a form where readers would know that the idea was proposed even if the authors do not follow through with reporting the results.

After a year, we can examine what honest science looks like. I suspect the success rate for statistically significant findings will go down dramatically, but that is only a hypothesis. Generally speaking, think of the impact this would have on our field and science in general. The journal that takes up this challenge would have the chance to show the field and the world, what honest science looks like. It would be held up as an example for all fields of science for exactly how the process works, warts and all. And, if I’m wrong, if at the end that year the science produced in that journal looks exactly like the pages of our current journals I’ll not only apologize to the field, I’ll stop tweeting entirely.


Additional Reflections on Ceiling Effects in Recent Replication Research

By R. Chris Fraley

In her commentary on the Johnson, Cheung, and Donnellan (2014) replication attempt, Schnall (2014) writes that the analyses reported in the Johnson et al. (2014) paper “are invalid and allow no conclusions about the reproducibility of the original findings” because of “the observed ceiling effect.”

I agree with Schnall that researchers should be concerned with ceiling effects. When there is relatively little room for scores to move around, it is more difficult to demonstrate that experimental manipulations are effective. But are the ratings so high in Johnson et al.’s (2014) Study 1 that the study is incapable of detecting an effect if one is present?


To address this question, I programmed some simulations in R. The details of the simulations are available at, but here is a summary of some of the key results:

  • Although there are a large number of scores on the high end of the scale in the Johnson et al. Study 1 (I’m focusing on the “Kitten” scenario in particular), the amount of compression that takes place is not sufficient to undermine the study’s ability to detect genuine effects.
  • If the true effect size for the manipulation is relatively large (e.g., Cohen’s d = -.60; See Table 1 of Johnson et al.), but we pass that through a squashing function that produces the distributions observed in the Johnson et al. study, the effect is still evident (see the Figure for a randomly selected example from the thousands of simulations conducted). And, given the sample size used in the Johnson et al. (2014) report, the authors had reasonable statistical power to detect it (70% to 84%, depending on exactly how things get parameterized).
  • Although it is possible to make the effect undetectable by compressing the scores, this requires either (a) that we assume the actual effect size is much smaller than what was originally reported or (b) that the scores be compressed so tightly that 80% or more of participants endorsed the highest response or (c) that the effect work in the opposite direction of what was expected (i.e., that the manipulation pushes scores upwards towards rather than away from the ceiling).
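
The details of the original simulations were written in R; the sketch below is a hypothetical Python analogue, with latent mean, SD, and scale range assumed for illustration rather than taken from Johnson et al. It squashes a latent shift of d = .60 onto a bounded 1-9 rating scale and estimates how often a simple two-group test still detects the effect:

```python
import random
from math import sqrt

random.seed(1)

def squash(x, lo=1, hi=9):
    """Round to the response scale and clip at the endpoints --
    a crude stand-in for the compression a ceiling produces."""
    return max(lo, min(hi, round(x)))

def simulate_power(d=0.6, n=100, mu=7.8, sd=1.5, reps=2000):
    """Estimate power to detect a latent downward shift of d standard
    deviations once responses are squashed onto a bounded scale.
    mu is deliberately set near the ceiling to mimic high ratings."""
    hits = 0
    for _ in range(reps):
        control = [squash(random.gauss(mu, sd)) for _ in range(n)]
        treated = [squash(random.gauss(mu - d * sd, sd)) for _ in range(n)]
        mc, mt = sum(control) / n, sum(treated) / n
        vc = sum((x - mc) ** 2 for x in control) / (n - 1)
        vt = sum((x - mt) ** 2 for x in treated) / (n - 1)
        se = sqrt(vc / n + vt / n)
        if se > 0 and abs(mc - mt) / se > 1.96:  # normal-approx cutoff
            hits += 1
    return hits / reps

print(simulate_power())
```

Under these assumed parameters the detection rate stays high despite many responses piling up at 9, which matches the bullet points above: compression alone, at the levels observed, is not enough to hide a genuinely large effect.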

In short, although the Johnson et al. (2014) sample does differ from the original in some interesting ways (e.g., higher ratings), I don’t think it is clear at this point that those higher ratings produced a ceiling effect that precludes their conclusions.


My Scary Vision of Good Science

By Brent W. Roberts

In a recent blog post, I argued that the Deathly Hallows of Psychological Science—p values < .05, experiments, and counter-intuitive findings—represent the combination of factors that are most highly valued by our field and are the explicit criteria for high impact publications. Some commenters mistook my identification of the Deathly Hallows of Psychological Science as a criticism of experimental methods and an endorsement of correlational methods. They even went so far as to say my vision for science was “scary.”


Of course, these critics reacted negatively to the post because I was being less than charitable to some hallowed institutions in psychological science. Regardless, I stand by the original argument. Counter-intuitive findings from experiments “verified” with p-values less than .05 are the most valued commodities of our scientific efforts. And the slavish worship of these criteria is at the root of many of our replicability and believability problems.

I will admit that I could have been clearer in my original blog post on the Deathly Hallows. I could have explained in simple language that it is not the ingredients of the Deathly Hallows of Psychological Science, per se, that are the problem, but the blind obedience that too many scholars pay to these criteria. I hope most readers got that point.

Of course, the comment saying my vision was scary did make me think. Just what is my vision for the ideal scientific process in psychology? Actually, that’s an easy question to answer. My vision of good scientific work in psychological science has two basic features. First, ask good questions. Second, answer those questions with informative methods that are well suited for answering those questions. See that? No p-values. No statistics. No experiments. No counter-intuitiveness. We just need good questions and appropriate methods. That’s all.

Good questions, of course, are not so easy to come by. By “good” I mean questions that when answered will provide valuable information. A good question often emerges from the foundation of knowledge in one’s field. It is a question that needs to be answered given the knowledge that has accrued to date. Of course, given the fact that our false positive rate in psychological science ranges from 20% to 80% depending on who you ask, it is genuinely difficult to know what a good question is nowadays. I take that as an arbitrage opportunity—every question is back on the table.

How do you know your question is good? Easy. Your research question is good if the answer is interesting regardless of the result. It should be just as interesting whether the effect is null or not provided the design was appropriate and high-powered. There is an abundance of examples of good scientific questions that have been answered over the years, such as Milgram and Asch’s question of whether humans are conforming. The significance of their work does not ride on whether their effects were p < .05. The significance of their work rests on figuring out that people behave in a very conforming fashion, at least in western populations. It would have been fascinating to find the opposite too. It was a good question and the importance of their results has stood the test of time.

Similarly, the question of whether human phenotypes are heritable and to what extent environmental influences are shared or unique was, and remains a good question. The answer would have been informative regardless of the proportion of genetic, shared, and unique environmental variance behavior geneticists found in outcomes like personality or psychopathology. The findings were, and still are fascinating given the relatively modest variance attributable to shared environmental influences.

Appropriate methods are, in part, dictated by the question that needs to be answered. Sometimes that leads to a correlational design, sometimes an experiment, sometimes something in between. God forbid, sometimes it might even call for a case study or a qualitative design. Regardless, a good method is one that provides reliable information on the original research question that was asked. When behavior geneticists were criticized for the equal environments assumption, they went out and found samples of twins that were raised apart. What did they find? They found that phenotypes were just as heritable in twins who shared no environment. You can complain as much as you want about identical twins being treated more alike than fraternal twins, but the finding that twins raised apart show the same levels of correspondence as twins raised together answered that question perfectly.

Likewise, when people and researchers questioned the efficacy of psychotherapy, it was the true experimental designs that brought clinical psychology back from the abyss. Decades of diligently run field experiments have now shown that therapy works, at least in those populations that stay in clinical trials. Correlational designs could not have answered the question of whether clinical interventions worked. Only good experimental evidence could answer that question.

My criticism of the Deathly Hallows of Psychological Science rides on the fact that the blind pursuit of this Holy Grail incentivizes bad methods. It is much easier to get your desired finding if you run a series of underpowered studies and then either p-hack by dropping null findings or fish around for significant effects by testing moderators to death. That means that the prototypical package of underpowered conceptually replicated experiments is uninformative about the actual question that motivated the studies in the first place. These practices represent bad methodology and they waste limited and valuable resources. Most, if not all of the recommended changes that have been proposed by the “skeptics” of unreplicable research have been to simply improve the informational value of the methods by increasing sample sizes, directly replicating findings, and avoiding p-hacking. Please, someone, tell me why these are bad recommendations?

I’d add one more ingredient to my “vision” and remind the reader of the late Carl Sagan’s first maxim of his Baloney Detection Kit. The best scientific information comes not only from a study that is directly replicated, but one that is directly replicated by an independent source. That means a researcher who is indifferent, if not hostile to your finding should be able to reproduce it. That’s good information. That’s a finding that can be trusted. For example, I would put money on the fact that any researcher who has a distaste for the idea of personality traits, would, if given the responsibility of tracking personality traits over time, find that they show robust rank-order consistency.

So that is my grand, scary vision for conducting good science. Ask good questions. Answer the question with methods that are informative. An underpowered study is not informative. A properly powered study that can be replicated by a hostile audience is very, very informative. Good science doesn’t have to be an experiment. It doesn’t have to produce a statistically significant finding. Nor does the topic have to be counter-intuitive. It just has to be a trustworthy set of data that attempts to answer a good scientific question.

If that vision scares you I can recommend a good, if cheap, bottle of red wine or an anxiolytic. Of course, I wouldn’t recommend mixing alcohol and medications, as that can be detrimental to your health, p < .05.


The Deathly Hallows of Psychological Science

By Brent W. Roberts

As of late, psychological science has arguably done more to address the ongoing believability crisis than most other areas of science.  Many notable efforts have been put forward to improve our methods.  From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are doing more than any other science to rectify practices that allow far too many unbelievable findings to populate our journal pages.

The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes.  We need to replace/supplement the typical reporting practices and statistical approaches by:

  1. Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
  2. Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
  3. Providing more information about effect sizes in each report, such as what the effect sizes are for each analysis and their respective confidence intervals.
  4. Valuing direct replication.
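
Recommendation 3, in particular, is inexpensive to follow. As a minimal sketch with made-up numbers (the means, SDs, and cell sizes below are hypothetical), Cohen's d and an approximate normal-theory confidence interval can be computed from the same summary statistics a t-test already requires:

```python
from math import sqrt

def cohens_d_ci(mean1, mean2, sd1, sd2, n1, n2, z=1.96):
    """Cohen's d from group summary statistics, with an
    approximate 95% CI via the normal approximation."""
    # Pooled standard deviation
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / sp
    # Approximate standard error of d
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

d, (lo, hi) = cohens_d_ci(5.2, 4.6, 1.5, 1.4, 30, 30)
print(round(d, 2), round(lo, 2), round(hi, 2))
```

For a hypothetical 30-per-cell study like this one, the interval is wide, spanning roughly -.10 to .93 around d of about .41, which is exactly the kind of information a bare p-value conceals.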

It seems pretty simple.  Actually, the proposed changes are simple, even mundane.

What has been most surprising is the consistent push back and protests against these seemingly innocuous recommendations.  When confronted with these recommendations it seems many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF.  A striking number of individuals argue against and are quite disdainful of reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, prototypical protests are:

  1. Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
  2. Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so that documenting effect sizes is meaningless.
  3. Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
  4. Direct replications are uninteresting and uninformative.
  5. Conceptual replications are to be preferred because we are testing theories, not confirming techniques.

While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended.  After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60 odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests.  If smart people balk at seemingly benign changes, then there must be something psychologically big lurking behind those protests.  What might that big thing be?  I believe the reason for the passion behind the protests lies in the fact that, though mundane, the changes that are being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.

I think this confrontation needs to be more closely examined, because we need to consider the challenges and consequences of deconstructing our incentive system and status structure. This then raises the question: what is our incentive system, and just what are we proposing to do to it? For this, I believe a good analogy is the dilemma faced by Harry Potter in the last book of the eponymously titled book series.


The Deathly Hallows of Psychological Science

In the last book of the Harry Potter series “The Deathly Hallows,” Harry Potter faces a dilemma.  Should he pursue the destruction of the Horcruxes or gather together the Deathly Hallows. The Horcruxes are pieces of Voldemort’s soul encapsulated in small trinkets, jewelry, and such.  If they were destroyed, then it would be possible to destroy Voldemort.  The Deathly Hallows are three powerful magical objects, which are alluring because by possessing all three, one becomes the “master of death.”  The Deathly Hallows are the Cloak of Invisibility, the Elder Wand, and the Resurrection Stone. The dilemma Harry faced was whether to pursue and destroy the Horcruxes, which was a painful and difficult path; or Harry could choose to pursue the Deathly Hallows, with which he could quite possibly conquer Voldemort, and, if not conquer him, live on despite him.  He chose to destroy the Horcruxes.

Like Harry Potter, the field of psychological science (and many other sciences) faces a similar dilemma. Pursue changes in our approach to science that eliminate problematic practices that lead to unreliable science—a “destroy the Horcrux” approach. Or, continue down the path of least resistance, which is nicely captured in the pursuit of the Deathly Hallows.

What are the Deathly Hallows of psychological science? I would argue that the Deathly Hallows of psychological science, which I will enumerate below, are 1) p values less than .05, 2) experimental studies, and 3) counter-intuitive findings.

Why am I highlighting this dilemma at this time? I believe we are at a critical juncture.  The nascent efforts at reform may either succeed or fade away as they have so many times before.  For it is a fact that we have confronted this dilemma many times and have failed to overcome the allure of the Deathly Hallows of psychological science. Eminent methodologists such as Cohen, Meehl, Lykken, Gigerenzer, Schmidt, Fraley, and lately Cumming have been telling us how to do things better since the 1960s, to no avail. Revising our approach to science has never been a question of knowing the right thing to do, but rather of whether we were willing to do the thing we knew was right.

The Deathly Hallows of Psychological Science: p-values, experiments, and counter-intuitive/surprising findings

The cloak of invisibility: p<.05. The first Deathly Hallow of psychological science is the infamous p-value. You must attain a p-value less than .05 to be a success in psychological science.  Period.  If your p-value is greater than .05, you have no finding and nothing to say. Without anything to say, you cannot attain status in our field. Find a p-value below .05 and you can wrap it around yourself and hide from the contempt aimed at those who fail to cross that magical threshold.

Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find it.  We root around in our data, digging up p-values by cherry-picking studies, selectively reporting outcomes, or performing some arcane statistical wizardry.  One only has to read Bem’s classic article on how to write an article in psychological science to see how we approach p-values as a field:

“…the data.  Examine them from every angle. Analyze the sexes separately.  Make up new composite indices.  If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data.  If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief.  If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily).  Go on a fishing expedition for something–anything–interesting.”

What makes it worse is that when authors try to report null effects they are beaten down, because we as reviewers and editors do everything in our power to keep null effects out of the literature. Null effects make for a messy narrative. Our most prestigious journals almost never publish null effects because reviewers and editors act as gatekeepers and mistakenly recommend against publishing them.  Consider a personal example. In one study, Reviewer 2 argued that our study was not suitable for publication in JPSP because one of our effects was null (there were other reasons too). Consider the fact that the null effect in question was a test of a hypothesis drawn from my own theory. I was trying to show that my theory did not work all of the time, and the reviewer was criticizing me for showing that my own ideas might need revision. This captures quite nicely the tyranny of the p-value. The reviewer was so wedded to my ideas that he or she would not even let me, the author of said ideas, offer up some data that argued for revising them.

Short of rejecting papers with null effects outright, we often recommend cutting them. I have seen countless recommendations, in reviews of my papers and the papers of colleagues, to simply drop studies or results that show null effects.  It is not surprising, then, that psychology confirms 95% of its hypotheses.

Even worse, we often commit the fundamental attribution error by assuming that the person trying to publish null effects is an incompetent researcher, especially if they fail to replicate an already published effect that has crossed the magical p < .05 threshold. Not to be too cynical, but the reviewers may have a point.  If you are too naïve to understand “the game,” which is to produce something with p < .05, then maybe you shouldn’t succeed in our field.  Sarcasm aside, what the gatekeepers don’t understand is that they are sending a clear message to graduate students and assistant professors that they must compromise their own integrity in order to succeed in our field. Of course, this winnows out of the field the researchers who don’t want to play the game.

The Elder Wand: Running Experiments

Everyone wants to draw a causal conclusion, even observational scientists. And, of course, the best way to draw a causal conclusion, if you are not an economist, is to run an experiment.  The second Deathly Hallow for psychological science is doing experimental research at all costs.  As one of my past colleagues told a first year graduate student, “if you have a choice between a correlational or an experimental study, run an experiment.”

Where things go awry, I suspect, is when you value experiments so much that you do anything in your power to avoid any other method. This leads to levels of artificiality that can become perverse. Rather than studying the effect of racism, we study the idea of racism.  Where we go wrong is that, as Cialdini has noted, we seldom work back and forth between the artificial world of our labs and the real world where the phenomenon of interest exists. We become methodologists rather than scientists.  We prioritize lab-based experimental methods not because they necessarily help us illuminate or understand our phenomena, but because they are most valued by our guild for putatively licensing causal inferences. One consequence of valuing experiments so highly is that we get caught up in a world of hypothetical findings that have unknown relationships to the real world, because we seldom if ever touch base with applied or field research.  As Cialdini so poignantly pointed out, we simply don’t value field research enough to pursue it with vigor equal to that of lab experiments.

And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good.  To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.

Of course, now that we know many researchers run as many experiments as they need in order to construct whatever findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand-wielding wizard of an experimentalist, simply because that is the best route to fame and fortune in our guild.

Most of my work has been in the observational domain, and admittedly we have the same motive, but we lack the opportunity to implement our desires for causal inference.  So far, the IRB has not agreed to let us randomly assign participants to “divorced or not divorced” or “employed or unemployed” conditions. In the absence of being able to run a good, clean experiment, observational researchers like myself bulk up on the statistics as a proxy for running an experiment. The fancier, more complex, and more indecipherable the statistics, the closer one gets to the status of an experimenter. We even go so far as to mistake our statistical methods, such as cross-lagged panel designs in longitudinal studies, for ones that afford causal inferences (hint: they don’t). Reviewers are often so befuddled by our fancy statistics that they fail to notice the inappropriateness of that inferential leap.

I’ve always held my colleague Ed Diener in high esteem.  One reason I think he is great is that as a rule he works back and forth between experiments and observational studies, all in the service of creating greater understanding of well-being.  He prioritizes his construct over his method. I have to assume that this is a much better value system than our long standing obsession with lab experiments.

The Resurrection Stone: Counter-intuitive findings

The final Deathly Hallow of psychological science is to be the creative destroyer of widely held assumptions. In fact, the foundational writings about the field of social psychology lay it out quite clearly. One of the primary routes to success in social psychology, for example, is to be surprising.  The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which humans are irrational, unpredictable, or downright surprising (Ross, Lepper, & Ward, 2010).

It is hard to argue with this motive.  We hold in the highest esteem those scientists who bring unique discoveries to their field.  And every once in a while, someone actually does do something truly innovative. In the meantime, the rest of us make up little theories about trivial effects that we market with cute names, such as the “End of History Effect” or the “Macbeth Effect,” or whatever.  We get caught up in the pursuit of cutesy counter-intuitiveness in the hope that our little innovation will become a big one.  To the extent that our cleverness survives the test of time, we will, like the Resurrection Stone, live on in our timeless ideas even if they are incorrect.

What makes the pursuit of innovation so formidable an obstacle to reform is that it sometimes works. Every once in a while someone does revolutionize a field. The aspiration to be the wizard of one’s research world is not misplaced.  Thus, we have an incentive system that produces a variable-ratio schedule of reinforcement—one of the toughest to break according to those long forgotten behaviorists (We need not mention behavioral psychologists, since their ideas are no longer new, innovative, or interesting–even if they were right).

Reasons for Pessimism

The problem with the current push for methodological reform is that, like pursuing the Horcruxes, it is hard and unrewarding compared with using the Deathly Hallows of psychological science. As one of our esteemed colleagues has noted, no one will win the APA Distinguished Scientific Award for failing to replicate another researcher’s work. Will a person who simply conducts replications of other researchers’ work get tenure?  It is hard to imagine. Will researchers do well to replicate their own research? Why? It would simply slow them down and handicap their ability to compete with the other aspiring wizards who are producing conceptually replicated, small-N, lab-based experimental studies at a frightening rate. No, it is still best to produce new ideas, even if that comes at the cost of believability. And everyone is in on the deal. We all disparage null findings in reviews because we prefer errors of commission to errors of omission.

Another reason the current system may be difficult to fix is that it provides a weird p-value-driven utopia. With the infinite flexibility of the Deathly Hallows of psychological science, we can pretty much prove that any idea is a good one. When combined with our antipathy toward directly replicating our own work or the work of others, everyone can be a winner in the current system. All it takes is a clever idea applied to enough analyses, and every researcher can be the new hot wizard. Without any push to replicate, everyone can co-exist in his or her own happy p-value-driven world.

So, there you have it.  My Depth Psychology analysis of why I fear that the seemingly benign recommendations for methodological change are falling on deaf ears.  The proposed changes contradict the entire status structure that has served our field for decades.  I have to imagine that the ease with which the Deathly Hallows can be used is one reason why reform efforts have failed in the past. As many have indicated, the same recommendations to revise our methods have been made for over 50 years, and each time the effort has failed.

In sum, while there have been many proposed solutions to our problems, I believe we have not yet faced our real issue: how are we going to restructure our incentives?  Many of us have stated, as loudly and persistently as we can, that there are Horcruxes all around us that need to be destroyed. The move to improve our methods and to conduct direct replications can be seen as an effort to eliminate our believability Horcruxes. But I believe the success of that effort rides on how clearly we see the task ahead of us. Our task is to convince a skeptical majority of scientists to dismantle an incentive structure that has worked for them for many decades. That will be a formidable task.


For the love of p-values

We recently read Karg et al. (2011) for a local reading group.  It is one of many attempts to examine meta-analytically the idea that the 5-HTTLPR serotonin transporter polymorphism moderates the effect of stress on depression.

It drove me batty.  No, it drove me to apoplectia–a small country in my mind I occupy far too often.

Let’s focus on the worst part.  Here’s the write up in the first paragraph of the results:

“We found strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002).  The significance of the result was robust to sensitivity analysis, with the overall P values remaining significant when each study was individually removed from the analysis (1.0×10^−6 < P < .00016).”

Wow.  Isn’t that cool?  Isn’t that impressive?  Throw out all of the confused literature and meta-analyses that came before this one.  They found “strong evidence” for this now infamous moderator effect.  Line up the spit vials. I’m getting back in the candidate gene GxE game.

Just what did the authors mean by “strong”?  That is an interesting question.  There is nary an effect size in the review, as the authors chose not to examine effect sizes but to synthesize p-values instead.  Of course, if you have any experience with meta-analytic types, you know how they feel about meta-analyzing p-values.  They feel about it the way Nancy Reagan felt about drugs: just say no.  If you are interested in why, read Lipsey and Wilson or any other meta-analysis guru.  They are unsympathetic, to say the least.

But, all is not lost. All you, the reader, have to do is transform the p-value into an effect size using any of the numerous on-line transformation programs that are available.  It takes about 15 seconds to do it yourself.  Or, if you want to be thorough, you can take the data from Table 1 in Karg et al (2011) and transform the p-values into effect sizes for your own meta-analytic pleasure. That takes about 15 minutes.

So what happens when you take their really, really significant p-value of p = .00002 and transform it into an effect size estimate?  Like good meta-analytic types, the authors provide the overall N, which is 40,749.  What does that really impressive p-value translate into in an r metric?

.0199 or .02 if you round up.

It is even smaller than Rosenthal’s famous .03 correlation between aspirin consumption and protection from heart disease.  You get the same thing when you plug all of the numbers from Table 1 into Comprehensive Meta-Analysis, by the way.
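If you would rather not trust an online calculator, the conversion is short enough to script yourself. Here is a minimal sketch in Python (standard library only) using the common approximation r = z/√N, where z is the standard-normal deviate corresponding to the reported p-value. The exact figure depends on assumptions the paper does not spell out (e.g., one- vs. two-tailed), so treat the result as a ballpark estimate:

```python
from math import sqrt
from statistics import NormalDist

def p_to_r(p: float, n: int, two_tailed: bool = True) -> float:
    """Convert a p-value and total N to an approximate effect size r via r = z / sqrt(N)."""
    tail = p / 2 if two_tailed else p
    z = NormalDist().inv_cdf(1 - tail)  # standard-normal deviate for the p-value
    return z / sqrt(n)

# Karg et al. (2011): overall P = .00002 with total N = 40,749
r = p_to_r(0.00002, 40_749)
print(round(r, 3))  # lands at roughly .02, whichever tail assumption you make
```

Either way you slice it, the “strong evidence” corresponds to a correlation of about .02.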

So the average interaction between the serotonin transporter promoter and stress on depression is “strong,” “robust,” yet infinitesimal.  It sounds like a Monty Python review of Australian wine (“Bold, yet naïve.” “Flaccid, yet robust”).

Back to our original question, what did the authors mean when they described their results as “strong?”  One can only assume that they mean to say that their p-value of .00002 looks a lot better than our usual suspect, the p < .05.  Yippee.

Why should we care?  Well, this is a nice example of what you get when you ignore effect size estimates and just use p-values–misguided conclusions. The Karg et al (2011) paper has been cited 454 times so far.  Here’s a quote from one of the papers that cites their work “This finding, initially refuted by smaller meta-analyses, has now been supported by a more comprehensive meta-analysis” (Palazidou, 2012). Wrong.

Mind you, there is no real inconsistency across the meta-analyses.  If the average effect is truly an r of .02, and I doubt it is even that big, it is really, really unlikely to be detected consistently by any single study, let alone to agree across meta-analyses.  The fact that the meta-analyses appear to disagree is only because the target effect size is so small that even dozens of studies and thousands of participants might fail to detect it.

Another reason to care about misguided findings is the potential mistaken conclusion either individuals or granting agencies will make if they take these findings at face value.  They might conclude that the GxE game is back on and start funding candidate gene research (doubtful, but possible).  Researchers themselves might come to the mistaken conclusion that they too can investigate GxE designs.  Heck, the average sample size in the meta-analysis is 755.  With a little money and diligence, one could come by that kind of sample, right?

Of course, that leads to an interesting question.  How many people do you need to detect a correlation of .02? Those pesky granting agencies might ask you to do a power analysis, right?  Well, to achieve 80% power to detect a correlation of .02 with a two-tailed test at the usual .05 level, you would need roughly 19,600 participants.  That means the average sample in the meta-analysis was woefully underpowered to detect the average effect size.  For that matter, it means that none of the studies in the meta-analysis were adequately powered to detect the average effect size, because the largest study, which reported a null effect, had an N of 3,243.
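For the skeptical, the sample-size requirement can be checked with the standard Fisher-z approximation for the power of a correlation test (n = ((z_α + z_β)/atanh(r))² + 3). This is an approximation; dedicated tools such as G*Power or R’s pwr package use this or an exact variant, so their figures may differ slightly. A stdlib-only Python sketch:

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate total N needed to detect correlation r with a two-tailed test,
    via the Fisher z transformation: n = ((z_alpha + z_beta) / atanh(r))**2 + 3."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = .05, two-tailed
    z_beta = nd.inv_cdf(power)           # ~0.84 for 80% power
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2) + 3

print(n_for_correlation(0.02))  # roughly 19,600 participants for r = .02
```

Note how quickly the requirement explodes as effect sizes shrink: even Rosenthal’s aspirin-sized r of .03 would demand on the order of 8,700 participants.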

So, this paper proves a point: if you accumulate enough participants in your research, almost anything is statistically significant.  And this warrants publication in the Archives of General Psychiatry?  Fascinating.

Brent W. Roberts
