by R. Chris Fraley
Stroebe and Strack (2014) recently argued that the current crisis regarding replication in psychological science has been greatly exaggerated. They observed that there are multiple replications of classic social/behavioral priming findings in social psychology. Moreover, they suggested that the current call for replications of classic findings is not especially useful. If a researcher conducts an exact replication study and finds what was originally reported, no new knowledge has been generated. If the replication study does not find what was originally reported, the mismatch could be due to a number of factors and may speak more to the replication study than to the original study itself.
As an alternative, Stroebe and Strack (2014) argued that, if researchers choose to pursue replication, the most constructive way to do so is through conceptual replications. Conceptual replications are potentially more valuable because they probe the validity of the theoretical hypotheses rather than a specific protocol.
Are conceptual replications part of the solution to the crisis currently facing psychological science?
The purpose of this post is to argue that we can only learn anything of value—whether it is from an original study, an exact replication, or a conceptual replication—if we can trust the data. And, ultimately, a lack of trust is what lies at the heart of current debates. There is no “replicability crisis” per se, but there is an enormous “crisis of confidence.”
To better appreciate the distinction, consider the following scenarios.
A. At the University of A, researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y2 and that X3 leads to Y3. In other words, there are several studies suggesting that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.
B. At the University of B, researchers have found that X1 leads to Y1. They go on to show that X2 leads to Y2 and that X3 leads to Y3. In other words, there are several studies suggesting that X, operationalized in multiple ways, leads to Y in ways anticipated by their theoretical model.
Is one set of research findings more credible than the other? What’s the difference?
At the University of A, researchers conducted 8 studies total. Some of these were pilot studies that didn’t pan out but led to some ideas about how to tweak the measure of Y. A few of the studies involved exact replications with extensions; the so-called exact replication part didn’t quite work, but one of the other variables did reveal a difference that made sense in light of the theory, so that finding was submitted (and accepted) for publication. In each case, the data from ongoing studies were analyzed each week for lab meetings, and studies were considered “completed” when a statistically significant effect was found. The sample sizes were typically small (20 per cell) because a few other labs studying a similar issue had successfully obtained significant results with small samples.
In contrast, at the University of B, a total of 3 studies were conducted. The researchers used large sample sizes so that the parameters/effects would be estimated well. Moreover, the third study had been preregistered: the stopping rules for data collection and the primary analyses were summarized briefly (3 sentences) on a time-stamped site.
Both research literatures contain conceptual replications. But once one has full knowledge of how these literatures were produced, one may doubt whether the findings and theories being studied by the researchers at the University of A, generated via the kind of Simmons et al. (2011) sleight of hand described above, are as solid as those being studied at the University of B. This example is designed to help separate two key issues that are often conflated in debates concerning the current crisis.
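To see how much damage the University of A’s “analyze weekly, stop when significant” routine can do on its own, consider a minimal simulation sketch. The specific numbers (no true effect, 20 participants per cell to start, batches of 10 added up to a cap of 60) are illustrative assumptions, not features of any real lab:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_study_with_peeking(start_n=20, batch=10, max_n=60, alpha=0.05):
    """Simulate one study with NO true effect, testing after each new batch."""
    control = rng.normal(size=start_n)
    treatment = rng.normal(size=start_n)
    while True:
        _, p = stats.ttest_ind(control, treatment)
        if p < alpha:
            return True                      # study declared "completed"
        if len(control) >= max_n:
            return False                     # never reached significance
        control = np.concatenate([control, rng.normal(size=batch)])
        treatment = np.concatenate([treatment, rng.normal(size=batch)])

hits = sum(run_study_with_peeking() for _ in range(5000))
print(f"False-positive rate with optional stopping: {hits / 5000:.1%}")
# Well above the nominal 5%, even though there is nothing to find.
```

The exact number is beside the point; what matters is that “run until significant” converts noise into publishable effects.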
Specifically, as a field, we need to draw a sharper distinction between (a) replications (exact vs. conceptual) and (b) the integrity of the research process (see Figure) when considering the credibility of knowledge generated in psychological science. We sometimes conflate these two things, but they are clearly separable.

The difference between methodological integrity and replication and their relation to the credibility of research
Speaking for myself, I don’t care whether a replication is exact or conceptual. Both kinds of studies serve different purposes and both are valuable under different circumstances. But what matters critically for the current crisis is the integrity of the methods used to populate the empirical literature. If the studies are not planned, conducted, and published in a manner that has integrity, then—regardless of whether those findings have been conceptually replicated—they offer little in the way of genuine scientific value. The University of A example above illustrates a research field that has multiple conceptual replications. But those replications do little to boost the credibility of the theoretical model because the process that generated the findings was too flexible and not transparent (Simmons, Nelson, & Simonsohn, 2011).
When skeptics call for “exact replications,” what they really mean is “we don’t trust the integrity of the process that led to the publication of the findings in the first place.” An exact replication provides the most obvious way to address that concern; that is why skeptics, such as my colleague Brent Roberts, are increasingly demanding them. But improving the integrity of the research process is the most direct way to improve the credibility of published work. This can be accomplished, in part, by using established and validated measures, taking statistical power or precision seriously, using larger sample sizes, preregistering analyses and designs when viable, and, of course, conducting replications along the way.
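To make “taking statistical power seriously” concrete, here is a small sketch using the statsmodels library; the assumed effect size of d = 0.4 is an illustrative value, not an estimate drawn from any particular literature:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of the University-of-A design: 20 participants per cell,
# assuming a true standardized effect of d = 0.4
power_small = analysis.power(effect_size=0.4, nobs1=20, alpha=0.05)
print(f"Power with 20 per cell: {power_small:.2f}")        # roughly .23

# Per-cell sample size needed to reach 80% power for the same effect
n_needed = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(f"Per-cell n for 80% power: {n_needed:.0f}")         # roughly 100
```

A researcher who runs a calculation like this before collecting data has far less temptation to “run until significant” after the fact.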
I agree with Stroebe and Strack (2014) that conceptual replication is something for which we should be striving. But, if we don’t practice methodological integrity, no number of replications will solve the crisis of confidence.
–
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Stroebe, W., & Strack, F. (2014). The alleged crisis and the illusion of exact replication. Perspectives on Psychological Science, 9(1), 59-71. http://pps.sagepub.com/content/9/1/59
Great post. I’ve seen the scenario you describe in practice. The first time I realized it, I was reading a 4-study paper that tested the same theoretical relationship in all 4 studies; but across the 4 studies, it had 3 operationalizations of the IV and 3 operationalizations of the DV. Once I put 2 and 2 together (okay, actually 3 and 3), I started wondering: out of 9 possible studies, why am I seeing those 4?
The integrity concern, as you discuss, is that maybe they actually ran all or most of the 9 possible studies and just reported the 4 that “worked.” Full disclosure and preregistration would address that.
Here’s another possibility though: Maybe they only ran those 4. But they chose to run those 4 and not the other 5 because they had some prior knowledge or intuition that those 4 would work better than the other 5, perhaps for reasons grounded in artifact.
This, I think, is a second and under-recognized problem with conceptual replications. “Conceptual replication” means testing the same theoretical relationship across different empirical realizations of the variables and procedures. It’s supposed to be a Good Thing insofar as it shows that the theorized relationship is not dependent on method — it’s actually supposed to rule out artifacts as explanations. But those empirical realizations are almost never a random sample of the possible empirical realizations. Instead they’re handpicked by the researcher, which leaves open room for bias.
It’s not easy to know what the solution should be. Sometimes you can run multiple methods in the same study — like when you measure a variable with self and informant reports — and then model effects of the shared variance. But that doesn’t work for experimental manipulations, it doesn’t work for more reactive measures, and it can become onerous for methods that are expensive or time-consuming.
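Still, for the cases where the multi-method approach does work, here is a toy simulation sketch of the self/informant idea; all of the numbers (the noisiness of each report, the size of the trait’s effect on the outcome) are arbitrary assumptions chosen only to illustrate the logic:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

trait = rng.normal(size=n)                         # latent construct
self_report = trait + rng.normal(0, 1.0, size=n)   # method 1: self-report
informant = trait + rng.normal(0, 1.0, size=n)     # method 2: informant report
outcome = 0.5 * trait + rng.normal(0, 1.0, size=n) # criterion driven by the trait

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

# Crude proxy for the shared variance: sum of the standardized scores
composite = (self_report - self_report.mean()) / self_report.std() + \
            (informant - informant.mean()) / informant.std()

print(f"r(self-report, outcome): {r(self_report, outcome):.2f}")
print(f"r(informant, outcome):   {r(informant, outcome):.2f}")
print(f"r(composite, outcome):   {r(composite, outcome):.2f}")
# The composite is less attenuated by method-specific noise; a latent-variable
# model would make the same point more formally.
```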
“It’s supposed to be a Good Thing insofar as it shows that the theorized relationship is not dependent on method — it’s actually supposed to rule out artifacts as explanations. But those empirical realizations are almost never a random sample of the possible empirical realizations. Instead they’re handpicked by the researcher, which leaves open room for bias.”
I think you’re right on the money here, Sanjay.
I think part of the solution (while acknowledging that no solution is perfect) is starting with transparent methods, large samples, and validated instruments. With all of this in place, I believe the findings. With all of this in place–and with conceptual replications too–I believe the theory.
This is a great post. You briefly mentioned that scientific integrity entails using established and validated measures, and I think that point deserves extra emphasis.
As LeBel & Peters (2011) observed, in practice there probably is a (negative) link between conceptual replication and scientific integrity. Specifically, conceptual replications usually involve at least one unvalidated manipulation, measure, or methodology. As such, when a conceptual replication fails, it could be due to poor construct validity, an ineffective manipulation, or a variety of other methodological flaws. This creates a situation where it’s not clear whether the study failed because (1) the effect is smaller than the researchers hoped to detect, or (2) the methods were suboptimal. In trying to perfect their methods, researchers can unintentionally end up testing their hypotheses several times.
One of LeBel and Peters’ conclusions was that we should validate our manipulations and measures *before* using them to test hypotheses. In terms of conceptual replications, I think the ideal would be to run a study showing that X2 taps the same construct as X1 (and taps it well) before running a separate study showing that X2 leads to Y.
Great post, Chris! I think increasing transparency and sample size will do a lot to cure the underlying issues that have prompted the crisis of confidence (and count me as one who thinks this is a real crisis). I also thought the sections on integrity in the upcoming Cumming PS paper, with its two notions of research integrity, were important. I think those ideas match well with the ideas you are discussing.
A few random thoughts:
1. I think Nathan makes a good point that worries me as well. I worry that many conceptual replications are not very risky (in the Paul Meehl sense) in practice. Why? If the conceptual replication fails, then researchers can dismiss the result as a failure to correctly operationalize the IV or DV(s). The miss can be chalked up to a “bad” pilot study. How many times has a failed conceptual replication attempt prompted researchers to go back to the original finding and test it again, versus how many times has it simply motivated a different conceptual replication attempt? The publication bias indices from Francis and Schimmack (and others before them) suggest that at least a few big conceptual replication packages have been subject to selective reporting. If we had full disclosure of all of the hits and misses, it would be good for the field. The evidence for ESP was a bunch of conceptual replications, no?
2. I worry that at least a few people in the field just do not believe that Type M and Type S errors (and even the conceptually maligned Type I error) can happen to them. We don’t have to be moralistic about research integrity; we can endorse those practices simply because they help keep these kinds of errors from happening.
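For readers unfamiliar with Type M and Type S errors, here is a rough, self-contained simulation sketch; the true effect of d = 0.2 and n = 20 per cell are assumptions chosen to represent a typical underpowered study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n, sims = 0.2, 20, 20000

estimates, pvalues = [], []
for _ in range(sims):
    control = rng.normal(0.0, 1.0, size=n)
    treatment = rng.normal(true_d, 1.0, size=n)
    _, p = stats.ttest_ind(treatment, control)
    estimates.append(treatment.mean() - control.mean())
    pvalues.append(p)

estimates, pvalues = np.array(estimates), np.array(pvalues)
sig = pvalues < 0.05

print(f"Power: {sig.mean():.1%}")
print(f"Type M: significant estimates exaggerate the true effect by "
      f"{np.abs(estimates[sig]).mean() / true_d:.1f}x on average")
print(f"Type S: {(estimates[sig] < 0).mean():.1%} of significant estimates "
      f"have the wrong sign")
```

With designs like this, the statistically significant results systematically overestimate the true effect, which is exactly the worry about selective reporting raised above.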
Very nice post. Perhaps you’ll find my reply interesting as well: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2381936.
Hello. Where did you find the figure with replications (exact vs. conceptual) and the integrity of the research process?
I think Chris Fraley created that himself. I would recommend reaching out to him directly (he won’t see this).
Thank you