By R. Chris Fraley
In her commentary on the Johnson, Cheung, and Donnellan (2014) replication attempt, Schnall (2014) writes that the analyses reported in the Johnson et al. (2014) paper “are invalid and allow no conclusions about the reproducibility of the original findings” because of “the observed ceiling effect.”
I agree with Schnall that researchers should be concerned with ceiling effects. When there is relatively little room for scores to move around, it is more difficult to demonstrate that experimental manipulations are effective. But are the ratings so high in Johnson et al.’s (2014) Study 1 that the study is incapable of detecting an effect if one is present?
To address this question, I programmed some simulations in R. The details of the simulations are available at http://osf.io/svbtw, but here is a summary of some of the key results:
- Although there are a large number of scores on the high end of the scale in the Johnson et al. Study 1 (I’m focusing on the “Kitten” scenario in particular), the amount of compression that takes place is not sufficient to undermine the study’s ability to detect genuine effects.
- If the true effect size for the manipulation is relatively large (e.g., Cohen’s d = -.60; see Table 1 of Johnson et al.), but we pass that effect through a squashing function that produces the distributions observed in the Johnson et al. study, the effect is still evident (see the Figure for a randomly selected example from the thousands of simulations conducted). And, given the sample size used in the Johnson et al. (2014) report, the authors had reasonable statistical power to detect it (70% to 84%, depending on exactly how the simulation is parameterized).
- Although it is possible to make the effect undetectable by compressing the scores, doing so requires either (a) assuming the actual effect size is much smaller than what was originally reported, (b) compressing the scores so tightly that 80% or more of participants endorse the highest response, or (c) assuming the effect works in the opposite direction of what was expected (i.e., that the manipulation pushes scores toward rather than away from the ceiling).
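To make the logic of those simulations concrete, here is a minimal sketch in Python rather than R. The scale range (0–10), the round-and-censor squashing rule, and the group mean and SD are my illustrative assumptions, not the exact parameterization used in the OSF script:

```python
import numpy as np

rng = np.random.default_rng(1234)

def squash(latent, ceiling=10):
    # One plausible "squashing function": round latent scores to the
    # nearest scale point and censor them at the ceiling, producing
    # the pile-up of high scores seen in the observed data.
    return np.clip(np.round(latent), 0, ceiling)

def simulate_power(n_per_group=100, d=0.60, mu=8.5, sd=2.0, n_sims=2000):
    """Proportion of simulated experiments in which a Welch t-test
    (normal-approximation cutoff of 1.96, i.e., alpha = .05) detects
    the manipulation after scores pass through squash()."""
    hits = 0
    for _ in range(n_sims):
        control = squash(rng.normal(mu, sd, n_per_group))
        treated = squash(rng.normal(mu - d * sd, sd, n_per_group))
        se = np.sqrt(control.var(ddof=1) / n_per_group +
                     treated.var(ddof=1) / n_per_group)
        t = (control.mean() - treated.mean()) / se
        hits += abs(t) > 1.96
    return hits / n_sims

print(simulate_power())  # estimated power under these assumptions
```

The point of the exercise is simply that censoring scores at the top of the scale shrinks the observed effect but, under parameters like these, leaves power far from zero; the OSF script explores the parameterizations more systematically.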
In short, although the Johnson et al. (2014) sample does differ from the original in some interesting ways (e.g., higher ratings), I don’t think it is clear at this point that those higher ratings produced a ceiling effect that precludes their conclusions.
While I greatly appreciate the time and effort put into this analysis, I do believe the energy is misplaced for one very important reason. The nature of the data in the replication is not relevant to the quality of that replication but to the quality of the measure that was used. That measure is the exact same measure used in Schnall’s original study. If that measure is so problematic that it nets skewed data in a direct replication, then it is the measure and the method used by Schnall that are the problem, not the replication. Thus, if the data were problematically skewed in Johnson et al. (which, according to your analysis, they are not), that would be evidence that the method used in the original article was flawed.
Your analysis gives too much credit to a criticism that is itself misplaced.
While I can see why Schnall would bring something like this up, what is most distressing is the unthinking endorsement of this criticism by those seeking to support her. Their criticism and subsequent comments about the apparent lack of reviewing of the replication (the method of which was reviewed and approved by Schnall) imply that the replication paper should have been rejected. This is the beauty-pageant mentality that has gotten us into the crisis we are in: research is evaluated on its results, not its method. It bodes poorly for our future if this simple distinction is not appreciated by central figures in Schnall’s field.
The data analysis in psych research is one big tragedy. I just want to make clear what is happening here with an example. Researcher A (somehow) measures 6.5 inches with a stick that is one foot long. Researcher B, however, measures 2.1 inches with the same one-foot-long stick. Researcher A says that a one-foot-long stick is not appropriate for measuring distances smaller than one foot and that the results of Researcher B are therefore invalid. Researcher C brings a ruler with sub-inch precision. He runs a simulation study: he first determines a distance with the ruler, then measures the same distance with the stick, and finally compares the values obtained with the stick and the ruler. Researcher C argues that the stick is OK for measuring a distance of 2.1 inches.
For heaven’s sake, why can’t we just throw away the stick and use the ruler instead!!!
To be sure, the one-foot stick is ANOVA and the ruler is regression (where we are interested in the regression coefficients and their CIs).
In the current case the outcome can be considered continuous, and regression with a logit-normal link function could be fruitfully used to analyse the data. (If the logit-normal model turns out not to be appropriate, one can still fall back on IRT.) Notably, with such an analysis, the ceiling effects would translate into low precision of the estimates and would NOT affect the magnitude of the estimate, as they do when ANOVA is used. As a direct consequence, the question of ceiling effects would be quantitatively resolved by the analysis, and we would avoid the inferential limbo toward which the current discussion of the replication is heading.
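A minimal sketch of the kind of analysis proposed above, in Python: OLS on logit-transformed ratings as a simplified stand-in for full logit-normal maximum likelihood (which would normally be done with a dedicated package). The 0–10 scale, the Smithson-and-Verkuilen-style endpoint squeeze, and the simulated data are my assumptions for illustration:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def logit_normal_fit(y, x, scale_max=10.0):
    """OLS on logit-transformed ratings. Ratings are rescaled to (0, 1)
    and squeezed off the endpoints (a la Smithson & Verkuilen, 2006),
    since logit(0) and logit(1) are undefined. Returns the slope for
    the condition predictor and its approximate 95% CI."""
    n = len(y)
    p = (np.asarray(y) / scale_max * (n - 1) + 0.5) / n  # squeeze into (0, 1)
    z = logit(p)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
    return beta[1], (beta[1] - 1.96 * se[1], beta[1] + 1.96 * se[1])

# Illustrative data: a latent group difference pushed against a 0-10 ceiling.
rng = np.random.default_rng(7)
cond = rng.integers(0, 2, 200)
ratings = np.clip(np.round(8.5 - 1.2 * cond + rng.normal(0, 2, 200)), 0, 10)
slope, ci = logit_normal_fit(ratings, cond)
print(slope, ci)
```

On this transformed scale, heavy compression at the ceiling shows up as a wider CI around the slope rather than as an artifactual change in its point estimate, which is the property that makes the approach attractive here.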
PS: I agree with Brent’s comment that once we have determined that the outcome measure provides imprecise estimates (even for high N), we should ditch the measure. Checking the validity and reliability of the instruments used in an experiment should be an important part of any replication effort.
Pingback: Felix Schönbrodt's website
Pingback: Replication studies, ceiling effects, and the psychology of science | Is Nerd
Important to keep in mind that a measure may have different properties when administered to different populations… e.g., in an extreme case, would we “ditch” a questionnaire written in English if we got back nonsense results when administered to non-English speakers?
> If that measure is so problematic that it nets skewed data in a direct replication, then it is the measure and the method used by Schnall that is the problem, not the replication. [Brent R]
I prefer to conceptualize scores as being a function of the properties of the people being measured AND the properties of the instruments being used to measure them. A math test that focuses on differential equations might be useful for assessing individual differences in math skills/knowledge in a college population, but less useful for assessing that variation in math skills/knowledge in third graders.
In short, the measure in question is probably fine; it certainly has face validity. The real question, as I understand it, is whether students at MSU are “different” from those where the original study was conducted–so different that they are more like third graders in the math example above. I doubt it, but I think it is difficult to know for sure given the large CI around the means in the original study.
> would we “ditch” a questionnaire written in English if we got back nonsense results when administered to non-English speakers? [FairAndBalanced]
Nonsense results seem like a reasonable basis for investigating the measure further–if “nonsense” means something like “not internally consistent” rather than “not compatible with my hypothesis.”
Pingback: Psychology News Round-Up (May 30th) | Character and Context
Pingback: There is no ceiling effect in Johnson, Cheung, & Donnellan (2014)
Pingback: Reanalyzing the Schnall/Johnson “cleanliness” data sets: New insights from Bayesian and robust approaches ← Patient 2 Earn
Pingback: Notes on Replication from an Un-Tenured Social Psychologist - The Berkeley Science Review
Pingback: Replication studies, ceiling effects, and the psychology of science – Panicking
Pingback: Replication studies, ceiling effects, and the psychology of science – Excitations