Please Stop the Bleating

 

It has been unsettling to witness the seemingly endless stream of null effects emerging from numerous pre-registered direct replications over the past few months. Some of the outcomes were unsurprising given the low power of the original studies. But the truly painful part has come from watching and reading the responses from all sides.  Countless words have been written discussing every nuanced aspect of definitions, motivations, and aspersions. Only one thing is missing:

Direct, pre-registered replications by the authors of studies that have been the target of replications.

While I am sympathetic to the fact that those who are targeted might be upset, defensive, and highly motivated to defend their ideas, the absence of any data from the originating authors is a more profound indictment of the original finding than any commentary. To my knowledge, and please correct me if I’m wrong, none of the researchers who’ve been the target of a pre-registered replication have produced a pre-registered study from their own lab showing that they are capable of getting the effect, even if others are not. Those of us standing on the sidelines watching things play out are constantly surprised that the one piece of information that might help—evidence that the original authors can reproduce their own effects in a pre-registered study—is never offered up.

So, get on with it. Seriously. Everyone. Please stop the bleating. Stop discussing whether someone p-hacked, or what p-hacking really is, whether someone is competent to do a replication, what a replication really is, or whether a replication was done poorly or well. Stop reanalyzing the damn Reproducibility Project or finding a thousand other ways of re-examining the past. Just get on with doing direct replications of your own work. It is a critical, albeit missing, piece of the reproducibility puzzle.

Science is supposed to be a give and take. If it is true that replicators lack some special sauce necessary to get an effect, then it is incumbent on those of us who’ve published original findings to show others that we can get the effect—in a pre-registered design.

Brent W. Roberts


We Need Federally Funded Daisy Chains

One of the most provocative requests in the reproducibility crisis was Daniel Kahneman’s call for psychological scientists to collaborate on a “daisy chain” of research replication. He admonished proponents of priming research to step up and work together to replicate the classic priming studies that had, up to that point, been called into question.

What happened? Nothing. Total crickets. There were no grand collaborations among the strongest and most capable labs to reproduce each other’s work. Why not? With 20/20 hindsight, it is clear that the incentive structure in psychological science militated against the daisy chain idea.

The scientific system in 2012 (and the one still in place) rewarded people who were the first to discover a new, counterintuitive feature of human nature, preferably using an experimental method. Since we did not practice direct replication, the veracity of our findings wasn’t really the point. The point was to be the discoverer, the radical innovator, the colorful, clever genius who apparently had a lot of flair.

If this was and remains the reward structure, what incentive was there, or is there, to conduct direct replications of your own or others’ work? Absolutely none. In fact, the act of replicating your work would be punitive. Taking the most charitable position possible, most everyone knew that our work was “fragile.” Even an informed researcher would know that the average power of our work (e.g., 50%) would naturally lead to an untenable rate of failures to replicate findings, even if they were true. And failures to replicate our work would lead to innumerable negative consequences, ranging from damage to our reputations, undermined grant prospects, and lower odds of our students publishing their papers, to plain painful embarrassment.

In fact, the act of replication was so aversive that, then and now, the proponents of most of the studies that have been called into question continue to argue passionately against the value of direct replication in science. Indeed, the enterprise of replication seems to be left to anyone but the original authors. The replications are left to the young, the noble, or the disgruntled. The latter are particularly problematic because they are angry. Why are they angry? They are angry because they are morally outraged. They perceive the originating researchers as people who have consciously, willingly manipulated the scientific system to publish outlandish but popular findings in an effort to enhance or maintain their careers. The anger can have unintended consequences. The disgruntled replicators can and do behave boorishly at times. Angry people do that. Then they are called bullies, or they are boycotted.

All of this sets up a perfectly horrible, internally consistent, self-fulfilling system in which replication is punished. In this situation, the victims of replication can rail against the young (and, by default, less powerful) as having nefarious motivations to get ahead by tearing down their elders. And they can often accurately point to the disgruntled replicators as mean-spirited. And, of course, you can conflate the two and call them shameless little bullies. All in all, it creates a nice little self-justifying system for avoiding daisy chaining anything.

My point is not to criticize the current efforts at replication, so much as to argue that these efforts face a formidable set of disincentives. The system is currently rigged against systematic replications. To counter the prevailing anti-replication winds, we need robust incentives (i.e., money). Some journals have made valiant efforts to reward good practices and this is a great start. But, badges are not enough. We need incentives with teeth. We need Federally Funded Daisy Chains.

The idea of a Federally Funded Daisy Chain is simple. Any research that the federal government deems valuable enough to fund should be replicated. And the feds should pay for it. How? NIH and NSF should set up research daisy chains. These would be very similar to the replication efforts currently being carried out at Perspectives on Psychological Science by Dan Simons and colleagues. Research teams from multiple sites would take the research protocols developed in federally funded research and replicate them directly.

And, the kicker is that the funding agencies would pay for this as part of the default grant proposal. Some portion of every grant would go toward funding a consortium of research teams—there could be multiple consortia across the country, for example. The PIs of the grants would be obliged to post their materials in such a way that others could quickly and easily reproduce their work. The replication teams would be reimbursed (i.e., incentivized) to do the replications. This would not only spread the grant-related wealth, but it would also reward good practices across the board. PIs would be motivated to do things right from the get-go if they knew someone was going to come behind them and replicate their efforts. The pool of replicators would expand as more researchers got involved, motivated by the resources provided by the feds. Generally speaking, providing concrete resources would help make doing replications the default option rather than the exception.

Making replications the default would go a long way toward addressing the reproducibility crisis in psychology and other fields. To do more replications we need concrete positive incentives to do the right thing. The right thing is showing the world that our work satisfies the basic tenet of science—that an independent lab can reproduce our research. The act of independently reproducing the work of others should not be left to charity. The federal government, which spends an inordinate amount of taxpayer dollars to fund our original research, should care enough about doing the right thing to fund efforts to replicate the findings it is so interested in us discovering.


Yes or no? Are Likert scales always preferable to dichotomous rating scales?

What follows is the result of an online discussion I had with psychologists Michael Kraus (MK) and Michael Frank (MF). We discussed scale construction and, particularly, whether items with two response options (i.e., yes vs. no) are good or bad for the reliability and validity of a scale. We had a fun discussion that we thought we would share with you.

MK: Twitter recently rolled out a polling feature that allows its users to ask and answer questions of each other. The polls allow two possible response options (e.g., Is it Fall? Yes/No). Armed with snark and some basic training in psychometrics and scale construction, I thought it would be fun to pose the following as my first poll:

[Screenshot of MK’s Twitter poll]

Said training suggests that, all things being equal, some people are more “Yes” or more “No” than others, so response options that include more variety will capture more of the real variance in participant responses. To put that into an example: if I ask you whether you agree with the statement “I have high self-esteem,” a yes/no response format won’t capture all the true variance in people’s responses that might otherwise be captured by six response options ranging from strongly disagree to strongly agree. MF/BR, is that how you would characterize your own understanding of psychometrics?

MF: Well, when I’m thinking about dependent variable selection, I tend to start from the idea that the more response options for the participant, the more bits of information are transferred. In a standard two-alternative forced-choice (2AFC) experiment with balanced probabilities, each response provides 1 bit of information. In contrast, a 4AFC provides 2 bits, an 8AFC provides 3, etc. So on this kind of reasoning, the more choices the better, as illustrated by this table from Rosenthal & Rosnow’s classic text:

[Table from Rosenthal & Rosnow: information transmitted as a function of the number of response alternatives]
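
To make the information-theoretic reasoning concrete, here is a minimal sketch (not part of the original exchange) of the bits-per-response calculation for a balanced k-alternative forced choice, plus the entropy version for options that are not equally likely.

```python
import math

def bits_per_response(k):
    """Bits carried by one response on a balanced k-alternative forced choice."""
    return math.log2(k)

def entropy_bits(probs):
    """Bits per response when the options are not equally likely (Shannon entropy)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for k in (2, 4, 8, 15):
    print(f"{k}-AFC: {bits_per_response(k):.2f} bits per response")

# An unbalanced yes/no item (e.g., 90% of people say yes) carries less than 1 bit:
print(f"90/10 yes-no item: {entropy_bits([0.9, 0.1]):.2f} bits per response")
```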

For example, in one literature I am involved in, people are interested in the ability of adults and kids to associate words and objects in the presence of systematic ambiguity. In these experiments, you see several objects and hear several words, and over time the idea is that you build up some kind of link between the objects and words that are consistently associated. In these experiments, people initially used 2AFC and 4AFC paradigms. But as the hypotheses about mechanism got more sophisticated, people shifted to using more stringent measures, like a 15AFC, which was argued to provide more information about the underlying representations.

On the other hand, getting more information out of such a measure presumes that there is some underlying signal. In the example above, the presence of this information was relatively likely because participants had been trained on specific associations. In contrast, in the kinds of polls or judgment studies that you’re talking about, it’s less clear whether participants have the kind of detailed representations that allow for fine-grained judgments. So if you’re asking for a judgment in general (like in #TwitterPolls or classic Likert scales), how many alternatives should you use?

MK: Right, most or all of my work (and I imagine a large portion of survey research) involves subjective judgments where it isn’t known exactly how people are making their judgments and what they’d likely be basing those judgments on. So, to reiterate your own question: How many response alternatives should you use?

MF: Turns out there is some research on this question. There’s a very well-cited paper by Preston & Coleman (2000), who asked about service rating scales for restaurants. Not the most psychological example, but it’ll do. They presented different participants with different numbers of response categories, ranging from 2 to 101. Here is their primary finding:

[Figure from Preston & Coleman (2000): scale reliability as a function of the number of response categories]

In a nutshell, the reliability is pretty good for two categories, it gets somewhat better up to about 7-9 options, and then it declines slightly. In addition, scales with more than 7 options are rated as slower and harder to use. Now this doesn’t mean that all psychological constructs have enough resolution to support 7 or 9 different gradations, but at least simple ratings or preference judgments seem like they might.
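
Here is a toy simulation (not Preston & Coleman’s analysis) of where the textbook intuition comes from: two noisy ratings of the same true score are coarsened onto a k-point scale, and their correlation stands in for retest reliability. It shows the modest loss of reliability at coarser scales, though not the downturn or usability costs beyond 7-9 options, which come from respondent behavior this toy model ignores.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_retest_reliability(k, n=100_000, error_sd=0.8):
    """Correlation between two noisy ratings of the same true score
    after each rating is coarsened onto a k-point scale."""
    true_score = rng.standard_normal(n)
    rating1 = true_score + error_sd * rng.standard_normal(n)
    rating2 = true_score + error_sd * rng.standard_normal(n)
    cut_points = np.linspace(-3, 3, k + 1)[1:-1]   # k - 1 interior category boundaries
    coarse1 = np.digitize(rating1, cut_points)
    coarse2 = np.digitize(rating2, cut_points)
    return np.corrcoef(coarse1, coarse2)[0, 1]

for k in (2, 3, 5, 7, 9, 19, 101):
    print(f"{k:>3} response options: simulated reliability = {simulated_retest_reliability(k):.3f}")
```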

MK: This is great stuff! But if I’m being completely honest here, I’d say the reliabilities for just two response categories, even though they aren’t as good as they are at 7-9 options, are good enough to use. BR, I’m guessing you agree with this because of your response to my Twitter Poll:

[Screenshot of BR’s reply to the Twitter poll]

BR: Admittedly, I used to believe that when it came to response formats, more was always better. I mean, we know that dichotomizing continuous variables is bad, so how could it be that a dichotomous rating scale (e.g., yes/no) would be as good as, if not superior to, a 5-point rating scale? Right?

Two things changed my perspective. The first was precipitated by being forced to teach psychometrics, which is minimally on the 5th level of Dante’s Hell, teaching-wise. For some odd reason, at some point I did a deep dive into the psychometrics of scale response formats and found, much to my surprise, a long and robust history going all the way back to the 1920s. I’ll give two examples. Like the Preston & Coleman (2000) study that Michael cites, some of the old literature had done the same thing (god forbid, replication!!!). Here’s a figure showing the test-retest reliability from Matell & Jacoby (1971), where they varied the response options from 2 to 19 on measures of values:

[Figure from Matell & Jacoby (1971): test-retest reliability as a function of the number of response options (2 to 19)]

The picture is a little different from the internal consistencies shown in Preston & Coleman (2000), but the message is similar. There is not a lot of difference between 2 and 19. What I really liked about the old-school researchers is that they cared as much about validity as they did about reliability–here’s their figure showing simple concurrent validity of the scales:


[Figure from Matell & Jacoby (1971): concurrent validity as a function of the number of response options]

The numbers bounce around a bit because of the small samples in each group, but the obvious takeaway is that there is no linear relation between the number of scale points and validity.

The second example is from Komorita & Graham (1965). These authors studied two scales: the evaluative dimension from the Semantic Differential and the Sociability scale from the California Psychological Inventory. The former is really homogeneous, the latter quite heterogeneous in terms of content. The authors administered 2-point and 6-point response formats for both measures. Here is what they found vis-à-vis internal consistency reliability:


[Figure from Komorita & Graham (1965): internal consistency for 2-point and 6-point formats on the homogeneous and heterogeneous scales]

This set of findings is much more interesting.  When the measure is homogeneous, the rating format does not matter.  When it is heterogeneous, having 6 options leads to better internal consistency.  The authors’ discussion is insightful and worth reading, but I’ll just quote them for brevity: “A more plausible explanation, therefore, is that some type of response set such as an “extreme response set” (Cronbach, 1946; 1950) may be operating to increase the reliability of heterogeneous scales. If the reliability of the response set component is greater than the reliability of the content component of the scale, the reliability of the scale will be increased by increasing the number of scale points.”

Thus, the old-school psychometricians argued that increasing the number of scale points does not affect test-retest reliability or validity. It does marginally increase internal consistency, but most likely because of “systematic error” such as response sets (e.g., consistently using extreme options or not) that add some additional internal consistency to complex constructs.

One interpretation of our modern love of multi-option rating scales is that it leads to better internal consistencies, which we all believe to be a good thing. Maybe it isn’t.

MK: I have three reactions to this: First, I’m sorry that you had to teach psychometrics. Second, it’s amazing to me that all this work on scale construction and the optimal number of response options isn’t more widely known. Third, how come, knowing all this as you do, this is the first time I have heard you favor two-option response formats?

BR: You might think that I would have become quite the zealot for yes/no formats after coming across this literature, but you would be wrong. I continued pursuing my research efforts using 4- and 5-point rating scales ad nauseam. Old dogs and new tricks and all of that.

The second experience that has turned me toward using yes/no more often, if not by default, came as a result of working with non-WEIRD [WEIRD = Western, Educated, Industrialized, Rich, and Democratic] samples and being exposed to some of the newer, more sophisticated approaches to modeling response information in Item Response Theory. For a variety of reasons, our research of late has been in samples not typically employed in most of psychology, like children, adolescents, and populations less literate than elite college students. In many of these samples, the standard 5-point Likert rating of personality traits tends to blow up (psychometrically speaking). We’ve considered a number of options for simplifying the assessment to make it less problematic for these populations to rate themselves, one of which is to simplify the rating scale to yes/no.

It just so happens that we have been doing some IRT work on an assessment experiment we ran online, where we randomly assigned people to fill out the NPI in one of three conditions–the traditional paired-comparison format, a 5-point Likert rating of all of the item stems, and a yes/no rating of all of the NPI item stems (here’s one paper from that effort). I assumed that if we were going to turn to a yes/no format, we would need more items to net the same amount of information as a Likert-style rating. So, I asked my colleague and collaborator, Eunike Wetzel, how many items you would need using a yes/no option to get the same amount of test information as from a set of Likert ratings of the NPI. IRT techniques allow you to estimate how much of the underlying construct a set of items captures via a test information function. What she reported back was surprising and fascinating. You get the same amount of information out of 10 yes/no ratings as you do out of 10 5-point Likert ratings of the NPI.
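
For readers unfamiliar with test information, the sketch below illustrates the kind of comparison described here, using made-up item parameters rather than the actual NPI estimates: item information is summed across ten dichotomous (2PL) items and ten 5-point graded-response items. Whether the two curves end up comparable, as they did in the NPI data, depends entirely on the estimated discriminations and thresholds.

```python
import numpy as np

theta = np.linspace(-3, 3, 121)  # grid of latent trait values

def info_2pl(a, b, theta):
    """Item information for a dichotomous 2PL item: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def info_graded(a, thresholds, theta):
    """Item information for a graded-response (Samejima) item with
    ordered thresholds b_1 < ... < b_{m-1} (m response categories)."""
    # cumulative probabilities P*_0 = 1 > P*_1 > ... > P*_m = 0
    pstar = [np.ones_like(theta)]
    pstar += [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in thresholds]
    pstar.append(np.zeros_like(theta))
    info = np.zeros_like(theta)
    for k in range(len(pstar) - 1):
        p_cat = pstar[k] - pstar[k + 1]                      # P(respond in category k)
        dp_cat = a * (pstar[k] * (1 - pstar[k])
                      - pstar[k + 1] * (1 - pstar[k + 1]))   # d P(category k) / d theta
        info += dp_cat ** 2 / np.clip(p_cat, 1e-12, None)
    return info

# Hypothetical parameters for ten items (NOT the NPI estimates):
rng = np.random.default_rng(0)
discriminations = rng.uniform(1.0, 2.0, size=10)
difficulties = np.linspace(-1.5, 1.5, 10)

info_yes_no = sum(info_2pl(a, b, theta)
                  for a, b in zip(discriminations, difficulties))
info_likert = sum(info_graded(a, b + np.array([-1.0, -0.3, 0.3, 1.0]), theta)
                  for a, b in zip(discriminations, difficulties))

mid = np.argmin(np.abs(theta))  # grid point closest to theta = 0
print(f"10 yes/no items : info at theta=0 = {info_yes_no[mid]:.2f}, peak = {info_yes_no.max():.2f}")
print(f"10 5-point items: info at theta=0 = {info_likert[mid]:.2f}, peak = {info_likert.max():.2f}")
```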

So Professor Kraus, this is the source of the pithy comeback to your tweet. It seems to me that there is no dramatic loss of information, reliability, or validity when using 2-point rating scales. If you consider the benefits gained–quicker responses, fewer response-set problems, and the potential to be usable in a wider range of populations–there may be many situations in which a yes/no format is just fine. Conversely, we may want to be cautious about the gain in internal consistency reliability we find in highly verbal populations, like college students, because it may arise through response sets and have no relation to validity.

MK: I appreciate this really helpful response (and that you address me so formally). Using a yes/no format has some clear advantages: it forces people to fall on one side of a scale or the other, it is quicker to answer than questions that rely on 4- to 7-point Likert scales, and it sounds (from your work, BR) like it allows scales to hold up better in non-WEIRD populations. MF, what is your reaction to this work?

MF: This is totally fascinating. I definitely see the value of using yes/no in cases where you’re working with non-WEIRD populations. We are just in the middle of constructing an instrument dealing with values and attitudes about parenting and child development, and the goal is to be able to survey broader populations than the university-town parents we often talk to. So I am certainly convinced that yes/no is a valuable option for that purpose and will do a pilot comparison shortly.

On the other hand, I do want to push back on the idea that there are never cases where you would want a more graded scale. My collaborators and I have done a bunch of work now using continuous dependent variables to get graded probabilistic judgments. Two examples of this work are Kao et al. (2014) – I’m not an author on that one but I really like it – and Frank & Goodman (2012). To take an example, in the second of those papers we showed people displays with a bunch of shapes (say a blue square, a blue circle, and a green square) and asked them: if someone used the word “blue,” which shape do you think they would be talking about?

In those cases, using sliders or “betting” measures (asking participants to assign dollar values between 0 and 100) really did seem to provide more information per judgment than other measures. I’ve also experimented with using binary dependent variables in these tasks, and my impression is that they both converge to the same mean, but that the confidence intervals on the binary DV are much larger. In other words, if we hypothesize in these cases that participants really are encoding some sort of continuous probability, then querying it in a continuous way should yield more information.
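
A quick simulation sketch of that intuition, with assumed numbers rather than values from the papers above: a binary judgment and a lightly noisy slider report are both centered on the same underlying probability, but at the same sample size the binary measure yields a much wider confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)
true_p, n, n_sims = 0.7, 50, 2000   # assumed probability, sample size, number of simulations

def ci_width(samples):
    """Width of a normal-approximation 95% CI for the mean."""
    return 2 * 1.96 * samples.std(ddof=1) / np.sqrt(len(samples))

binary_widths, slider_widths = [], []
for _ in range(n_sims):
    binary = rng.binomial(1, true_p, size=n)                       # one yes/no judgment per person
    slider = np.clip(true_p + rng.normal(0, 0.15, size=n), 0, 1)   # graded report with noise sd 0.15
    binary_widths.append(ci_width(binary))
    slider_widths.append(ci_width(slider))

print(f"binary DV: mean 95% CI width = {np.mean(binary_widths):.3f}")
print(f"slider DV: mean 95% CI width = {np.mean(slider_widths):.3f}")
# Under these assumptions the binary CI is roughly three times wider;
# the gap shrinks as the graded reports get noisier.
```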

So Brent, I guess I’m asking you whether you think there is some wiggle room in the results we discussed above – for constructs and participants where scale calibration is a problem and psychological uncertainty is large, we’d want yes/no. But for constructs that are more cognitive in nature, tasks that are more well-specified, and populations that are more used to the experimental format, isn’t it still possible that there’s an information gain from using more fine-grained scales?

BR: Of course there is wiggle room. There are probably vast expanses of space where alternatives are more appropriate. My intention is not to create a new “rule of thumb” where we only use yes/no responses throughout. My intention was simply to point out that our confidence in certain rules of thumb is misplaced. In this case, the assumption that Likert scales are always preferable is clearly not warranted. On the other hand, there are great examples where a single, graded dimension is preferable–we just had a speaker discussing political orientation, which was rated from conservative to moderate to liberal on a 9-point scale. This seems entirely appropriate. And, mind you, I have a nerdly fantasy of someday creating single-item personality Behaviorally Anchored Rating Scales (BARS). These are entirely cool rating scales where the items themselves become anchors on a single dimension. So instead of asking 20 questions about how clean your room is, I would anchor the rating points from “my room is messier than a suitcase packed by a spider monkey on crack” to “my room is so clean they make silicon memory chips there when I’m not in.” Then you could assess the Big Five or the facets of the Big Five with one item each. We can dream, can’t we?

MF: Seems like a great dream to me. So – it sounds like if there’s one take-home from this discussion, it’s “don’t always default to the seven-point Likert scale.” Sometimes such scales are appropriate and useful, but sometimes you want fewer options – and maybe sometimes you’d even want more.


The New Rules of Research

by Brent W. Roberts

A paper on one of the most important research projects in our generation came out a few weeks ago. I’m speaking, of course, of the Reproducibility Project conducted by several hundred psychologists. It is a tour de force of good science. Most importantly, it provided definitive evidence for the state of the field. Despite the fact that 97% of the original studies reported statistically significant effects, only 36% hit the magical p < .05 mark when closely replicated.

Two defenses have been raised against the effort. The first, described by some as the “move along folks, there’s nothing to see here” defense, proposes that a 36% replication rate is no big deal. It is to be expected given how tough it is to do psychological science. At one level I’m sympathetic to the argument that science is hard to do, especially psychological science. It is the case that very few psychologists have 36% of their ideas work out. And, by work, I mean in the traditional sense of the word, which is to net a p value less than .05 in whatever type of study you run. On the other hand, to make this claim about published work is disingenuous. When we publish a peer-reviewed journal article, we are saying explicitly that we think the effect is real and that it will hold up. If we really believed that our published work was so ephemeral, then much of our behavior in response to the reproducibility crisis would be nonsensical. If we all knew and expected our work not to replicate most of the time, then we wouldn’t get upset when it didn’t. We have disproven that point many times over. If we thought our effects that passed the p < .05 threshold were so flimsy, we would all write caveats at the end of our papers saying other researchers should be wary of our results as they were unlikely to replicate. We never do that. If we really thought so little of our results, we would not write such confident columns for the New York Times espousing our findings, stand up on the TED stage and claim such profound conclusions, or speak to the press in such glowing terms about the implications of our unreliable findings. But we do. I won’t get into the debate over whether this is a crisis or not, but please don’t pass off a 36% reproducibility rate as if it is either the norm, expected, or a good thing. It is not.

The second argument, which is somewhat related, is to restate the subtle moderator idea. It is disturbingly common to hear people argue that the reason a study does not replicate is because of subtle differences in the setting, sample, or demeanor of the experimenter across labs. To invoke this is problematic for several reasons. First, it is an acknowledgment that you haven’t been keeping up with the scholarship surrounding reproducibility issues. The Many Labs 3 report addressed this hypothesis directly and showed that the null hypothesis could not be rejected. Second, it means you are walking back almost every finding ever covered in an introductory psychology textbook. It makes me cringe when I hear what used to be a brazen scientist who had no qualms generalizing his or her findings based on psychology undergraduates to all humans, claiming that their once robust effects are fragile, tender shoots that only grow on the West Coast and not in the Midwest. I’m not sure if the folks invoking this argument realize that this is worse than having 64% of our findings not replicate. At least 36% did work. The subtle moderator take on things basically says we can ignore the remaining 36% too, because yet-unknown subtle moderators will render them ungeneralizable if tested a third time. While I am no fan of the over-generalization of findings based on undergraduate samples, I’m not yet willing to give up the aspiration of finding things out about humans. Yes, humans. Third, if this were such a widely accepted fact, and not something solely invoked after our work fails to replicate, then again, our reactions to the failures to replicate would be different. If we never expected our work to replicate in the first place, our reactions to failures to replicate wouldn’t be as extreme as they’ve been.

One thing that has not really occurred much in response to the Reproducibility Report is the recommendation of changes to the way we do things. With that in mind, and in homage to Bill Maher, I offer a list of the “New Rules of Research[1]” that follow, at least in my estimation, from taking the results of the Reproducibility Report seriously.

  1. Direct replication is yooge (huge). Just do it. Feed the science. Feed it! Good science needs reliable findings, and direct replication is the quickest way to good science. Don’t listen to the apologists for conducting only conceptual replications. Don’t pay attention to the purists who argue that all you need is a large sample. Build direct replications into your work so that you know yourself whether your effects hold up. At the very least, doing your own direct replications will save you from the evils of sampling error. At the very most, you may catch errors in your protocol that could affect results in unforeseen ways. Then share it with us however you can. When you are done with that, do some service to the field and replicate someone else’s work.
  2. If your finding fails to replicate, the field will doubt your finding—for now. Don’t take it personally. We’re just going by base rates. After all, less than half of our studies replicate on average. If your study fails to replicate, you are in good company—the majority. The same thing goes if your study replicates. Two studies do not make a critical mass of evidence. Keep at it.
  3. Published research in top journals should have high informational value. In the parlance of the NHSTers, this means high power. For the Bayesian folks, it means compelling evidence that is robust across a range of reasonable priors. Either way, we know from some nice simulations that for the typical between-subjects study this means we need a minimum of 165 participants for average main effects and more than 400 participants for 2×2 between-subjects interaction tests (a back-of-the-envelope sketch of where numbers like these come from appears after this list). You need even more observations if you want to get fancy or reliably detect infinitesimal effect sizes (e.g., birth order and personality, genetic polymorphisms and any phenotype). We now have hundreds of studies that have failed to replicate, and the most powerful reason is the lack of informational value in the design of the original research. Many protest that the burden of collecting all of those extra participants will cost too much time, effort, and money. While it is true that increasing our average sample size will make doing our research more difficult, consider the current situation, in which 64% of our studies fail to replicate and are therefore a potential waste of time to read and review because they were poorly designed to start with (e.g., small-N studies with no evidence of direct replication). We waste countless dollars and hours of our time processing, reviewing, and following up on poorly designed research. The time spent collecting more data in the first place will be well worth it if the consequence is increasing the amount of reproducible and replicable research. And, the journals will love it because we will publish less and their impact factors will inevitably go up—making us even more famous.
  4. The gold standard for our science is a pre-registered direct replication by an independent lab. A finding is not worth touting or inserting in the textbooks until a well-powered, pre-registered, direct replication is published. Well, to be honest, it isn’t worth touting until a good number of well-powered, pre-registered, direct replications have been published.
  5. The peer-reviewed paper is no longer the gold standard. We need to de-reify the publication as the unit of exaltation. We shouldn’t be winning awards, or tenure, or TED talks for single papers. Conversely, we shouldn’t be slinking away in shame if one of our studies fails to replicate. We are scientists. Our job is, in part, to figure out how the world works. Our tools are inherently flawed and will sometimes give us the wrong answer. Other times we will ask the wrong question. Often we will do things incorrectly even when our question is good. That is okay. What is not okay is to act as if our work is true just because it got published. Updating your priors should be an integral part of doing science.
  6. Don’t leave the replications to the young. Senior researchers, the ones with tenure, should be the front line of replication research—especially if it is their research that is not replicating. They are the ones who can suffer the reputational hits and not lose their paychecks. If we want the field to change quickly and effectively, the senior researchers must lead, not follow.
  7. Don’t trust anyone over 50[2]. You might have noticed that the persons most likely to protest the importance of direct replications, or who seem willing to accept a 36% replication rate as “not a crisis,” are all chronologically advanced and eminent. And why wouldn’t they want to keep the status quo? They built their careers on the one-off, counter-intuitive, amazeballs research model. You can’t expect them to abandon it overnight, can you? That said, if you are young, you might want to look elsewhere for inspiration and guidance. At this juncture, defending the status quo is like arguing to stay on board the Titanic.
  8. Stop writing rejoinders. Especially stop writing rejoinders that say 1) there were hidden, subtle moderators (that we didn’t identify in the first place), and 2) a load of my friends and their graduate students conceptually replicated my initial findings so it must be kind of real. Just show us more data. If you can reliably reproduce your own effect, show it. The more time you spend on a rejoinder and not producing a replication of your own work, the less the field will believe your original finding.
  9. Beware of meta-analyses. As Daniël Lakens put it: bad data + good data does not equal good data. As much as it pains me to say it, since I like meta-analyses, they are no panacea. Meta-analyses are especially problematic when a bunch of data that has been p-hacked into submission is included with some high-quality data. The most common result of this combination is to find an effect that is different from zero, and thus statistically significant, but strikingly small compared to the original finding. Then you see the folks who published the original finding (usually with a d of .8 or 1) trumpeting the meta-analytic findings as proof that their idea holds, without facing the fact that the flawed meta-analytic effect size is so small that they would never have detected it using the methods they used to detect it in the first place.
  10. If you want anyone to really believe your direct or conceptual replication, then pre-register it. Yes, we know, there will be folks who will collect the data, then analyze it, then “pre-register” it after the fact. There will always be cheaters in every field. Nonetheless, most of us are motivated to find the truth, and eventually, if the gold standard is applied (see rule #4), we will get better estimates of the true effect. In the meantime, pre-register your own replication attempts and the field will be better for your efforts.
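
As a back-of-the-envelope sketch of where sample-size targets like those in rule 3 come from (the assumed average effect of d = 0.43 is illustrative, not the exact value behind the simulations cited there):

```python
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# Two-group comparison at an assumed "average" published effect size.
d_main = 0.43
n_per_group = solver.solve_power(effect_size=d_main, alpha=0.05, power=0.80)
print(f"d = {d_main:.2f}: {n_per_group:.0f} per group, ~{2 * n_per_group:.0f} participants total")

# Halve the effect size and the required N roughly quadruples (N scales with
# 1/d^2) -- one reason 2x2 between-subjects interaction tests quickly climb
# past the several-hundred-participant mark.
d_small = d_main / 2
n_per_group_small = solver.solve_power(effect_size=d_small, alpha=0.05, power=0.80)
print(f"d = {d_small:.2f}: {n_per_group_small:.0f} per group, ~{2 * n_per_group_small:.0f} participants total")
```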

[1] Of course, many of these are not at all new. But, given the reactions to the Reproducibility Report and the continued invocation of any reason possible to avoid doing things differently, it is clear that these rules are new to some.

[2] Yes, that includes me. And, yes, I know that there are some chronologically challenged individuals on the pro-reproducibility side of the coin. That said, among the outspoken critics of the effort I count a disproportionate number of eminent scientists without even scratching the surface.


What we are reading in PIG-IE 9-14-15

Last week, we read Chabris et al. (2015), “The fourth law of behavior genetics,” another in a series of lucid papers from the GWAS consortium.

This week, with Etienne LeBel in town, we are reading the OSF’s Reproducibility Report.


Be your own replicator

by Brent W. Roberts

One of the conspicuous features of the ongoing reproducibility crisis stewing in psychology is that we have a lot of fear, loathing, defensiveness, and theorizing being expressed about direct replications. But, if the pages of our journals are any indication, we have very few direct replications being conducted.

Reacting with fear is not surprising. It is not fun to have your hard-earned scientific contribution challenged by some random researcher. Even if the replicator is trustworthy, it is scary to have your work be the target of a replication attempt. For example, one colleague was especially concerned that graduate students were now afraid to publish papers given the seeming inevitability of someone trying to replicate and tear down their work. Seeing the replication police in your rearview mirror would make anyone nervous, but especially new drivers.

Another prototypical reaction appears to be various forms of loathing. We don’t need to repeat the monikers used to describe researchers who conduct and attempt to publish direct replications. It is clear that they are not held in high esteem. Other scholars may not demean the replicators, but they hold equally negative attitudes toward the direct replication enterprise and deem the entire effort a waste of time. They are, simply put, too busy making discoveries to fuss with conducting direct replications.

Other researchers who are the target of failed replications have turned to writing long rejoinders. Often reflecting a surprising amount of work, these papers typically argue that while the effect of interest failed to replicate, there are dozens of conceptual replications of the phenomenon of interest.

Finally, there appears to be an emerging domain of scholarship focused on the theoretical definition and function of replications. While fascinating, and often compelling, these essays are typically not written by people conducting direct replications themselves—a seemingly conspicuous fact.

While each of these reactions is sensible, they are all entirely ineffectual, especially in light of the steady stream of papers failing to replicate major and minor findings in psychology. Looking across the various efforts at replication, it is not too much of an exaggeration to say that less than 50% of our work is reproducible. Acting fearful, loathing replicators, being defensive and arguing for the status quo, or writing voluminous discourses on the theoretical nature of replication are fundamentally ineffective responses to this situation. We dither while a remarkable proportion of our work fails to be reproduced.

 

There is, of course, a deceptively simple solution to this situation. Be your own replicator.

 

It is that simple. And, I don’t mean conceptual replicator; I mean direct replicator. Don’t wait for someone to take your study down. Don’t dedicate more time writing a rejoinder than it would take to conduct a study. Replicate your work yourself.

Now, this is not much different from the position that Joe Cesario espoused, which is surprising because, as Joe can attest, I did not care for his argument when it came out. But it is clear at this juncture that there was much wisdom in his position. It is also clear that people haven’t paid it much heed. Thus, I think it merits restating.

Consider for a moment how conducting your own direct replication of your own research might change some of the interactions that have emerged over the last few years. In the current paradigm we get incredibly uncomfortable exchanges that go something like this:

Researcher R: “Dear eminent, highly popular Researcher A, I failed to replicate your study published in that high impact factor journal.”

Researcher A: “Researcher R, you are either incompetent or malicious. Also, I’d like to note that I don’t care for direct replications. I prefer conceptual replications, especially because I can identify dozens of conceptual replications of my work.”

 

Imagine an alternative universe in which Researcher A had a file of direct replications of the original findings. Then the conversation would go from a spitting match to something like this:

Researcher R: “Dear eminent, highly popular Researcher A, I failed to replicate your study published in that high impact factor journal.”

Researcher A: “Interesting. You didn’t get the same effect? I wonder why. What did you do?”

Researcher R: “We replicated your study as directly as we could and failed to find the same effect” (whether judged by p-values, effect sizes, confidence intervals, Bayesian priors, or whatever).

Researcher A: “We’ve reproduced the effect several times in the past. You can find the replication data on the OSF site linked to the original paper. Let’s look at how you did things and maybe we can figure this discrepancy out.”

 

That is a much different exchange than the ones we’ve seen so far, which have been dominated by conspicuous failures to replicate and, well, little more than vitriolic arguments over details, with little or no old or new data.

Of course, there will be protests. Some continue to argue for conceptual replications. This perspective is fine. And, let me be clear. No one to date has argued against conceptual replications per se. What has been said is that in the absence of strong proof that the original finding is robust (as in directly replicable), conceptual replications provide little evidence for the reliability and validity of an idea. That is to say, conceptual replications rock, if and when you have shown that the original finding can be reproduced.

And that is where being your own replicator is such an ingenious strategy. Not only do you inoculate yourself against the replicators, but you also bolster the validity of your conceptual replications in the process. That is a win-win situation.

And, of course, being your own direct replicator also addresses the argument that the replicators may be screw-ups. If you feel this way, fine. Be your own replicator. Show us you can get the goods. Twice. Three times. Maybe more. But, of course, make sure to pre-register your replication attempts; otherwise, some may accuse you of p-hacking your way to a direct replication.

It is also common, as noted, to see responses to a failure to replicate that list out sometimes dozens of small-sample conceptual replications of the original work as some kind of rebuttal. Why waste your time? The time spent crafting arguments about tenuous evidence could easily be spent conducting your own direct replication of your own work. Now that would be a convincing response. A direct replication is worth a thousand words—or a thousand conceptual replications.

Conversely, replication failures spur some to craft nuanced arguments about just what a replication is, and whether there is anything that really counts as a “direct” replication, and such. These are nice essays to read. But we’ll have time for these discussions later, after we show that some of our work actually merits discussion. Proceeding to intellectual discussions is nothing more than a waste of time when more than 50% of our research fails to replicate.

Some might want to argue that conducting our own direct replications would be an added burden to already inconvenienced researchers. But, let’s be honest. The JPSP publication arms race has gotten way out of hand. Researchers seemingly have to produce at least 8 different studies to even have a chance of getting into the first two sections of JPSP. What real harm would there be if you still did the same number of studies but just included 4 conceptually distinct studies each replicated once? That’s still 8 studies, but now the package would include information that would dissipate the fear of being replicated.

Another argument would be that it is almost impossible to get direct replications published. And that is correct. The only bias more foolish than the bias against null findings is the bias against the value of direct replications. As a result, it would be hard to get direct replications published in mainstream outlets. I have utopian dreams sometimes where I imagine our entire field moving past this bias. One can dream, right?

But this is no longer a real barrier. Some journals, or sections of journals, are actively fostering the publication of direct replications. Additionally, we have numerous outlets for direct replication research, whether formal ones, such as PLoS ONE or Frontiers, or less formal ones, such as PsychFileDrawer or the Open Science Framework. If you have replication data, it can find a home, and interested parties can see it. Of course, it would help even more if the data were pre-registered.

So there you have it. Be your own replicator. It is a quick, easy, entirely reasonable way of dispelling the inherent tension in the current replication crisis we are enduring.

 

 

 


Sample Sizes in Personality and Social Psychology

R. Chris Fraley

Imagine that you’re a young graduate student who has just completed a research project. You think the results are exciting and that they have the potential to advance the field in a number of ways. You would like to submit your research to a journal that has a reputation for publishing the highest caliber research in your field.

How would you know which journals have a reputation for publishing high-quality research?

Traditionally, scholars and promotion committees have answered this question by referencing the citation Impact Factor (IF) of journals. But as critics of the IF have noted, citation rates per se may not reflect anything informative about the quality of empirical research. A paper can receive a large number of citations in the short run because it reports surprising, debatable, or counter-intuitive findings regardless of whether the research was conducted in a rigorous manner. In other words, the citation rate of a journal may not be particularly informative concerning the quality of the research it reports.

What would be useful is a way of indexing journal quality that is based upon the strength of the research designs used in published articles rather than the citation rate of those articles alone.

In an article recently published in PLoS ONE, Simine Vazire and I attempted to do this by ranking major journals in social-personality psychology with respect to what we call their N-pact Factors (NF)–the statistical power of the studies they publish. Statistical power is defined as the probability of detecting an effect of interest when that effect actually exists. Statistical power is relevant for judging the quality of empirical research literatures because, compared to lower powered studies, studies that are highly powered are more likely to (a) detect valid effects, (b) buffer the literature against false positives, and (c) produce findings that other researchers can replicate. Although power is certainly not the only way to evaluate the quality of empirical research, the more power a study has, the better positioned it is to provide useful information and to make robust contributions to the empirical literature.

Our analyses demonstrate that, overall, the statistical power of studies published by major journals in our field tends to be inadequate, ranging from 40% to 77% for detecting the typical kinds of effect sizes reported in social-personality psychology. Moreover, we show that there is considerable variation among journals; some journals tend to consistently publish higher power studies and have lower estimated false positive rates than others. And, importantly, we show that some journals, despite their comparatively high impact factors, publish studies that are greatly underpowered for scientific research in psychology.
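
To see what numbers in that range imply, here is a small sketch using a Fisher-z approximation and hypothetical sample sizes (not the journal-specific figures from the paper) to compute the power to detect a typical effect of r = .20:

```python
import numpy as np
from scipy import stats

def power_for_correlation(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0,
    based on the Fisher z transformation of the sample correlation."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    noncentrality = np.arctanh(r) * np.sqrt(n - 3)
    return stats.norm.sf(z_crit - noncentrality) + stats.norm.cdf(-z_crit - noncentrality)

for n in (50, 100, 150, 250, 400):
    print(f"N = {n:>3}: power to detect r = .20 is about {power_for_correlation(0.20, n):.2f}")
```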

We hope these rankings will help researchers and promotion committees better evaluate various journals, allow the public and the press (i.e., consumers of scientific knowledge in psychology) to have a better appreciation of the credibility of published research, and perhaps even facilitate competition among journals in a way that would improve the net quality of published research. We realize that sample size and power are not and should not be the gold standard in evaluating research. But we hope that this effort will be viewed as a constructive, if incomplete, contribution to improving psychological science.

Simine wrote a nice blog post about some of the issues relevant to this work. Please check it out.

 
