The Pre-Publication Transparency Checklist

The Pre-Publication Transparency Checklist: A Small Step Toward Increasing the Believability of Psychological Science

We now know that some of the well-accepted practices of psychological science do not produce reliable knowledge. For example, widely accepted but questionable research practices contribute to the fact that many of our research findings are unbelievable (that is, that one is ill-advised to revise one’s beliefs based on the reported findings). Post-hoc analyses of seemingly convincing studies have shown that some findings are too good to be true.  And a string of seminal studies has failed to replicate.  These factors have come together to create a believability crisis in psychological science.

Many solutions have been proffered to address the believability crisis.  These solutions have come in four general forms.  First, many individuals and organizations have listed recommendations about how to make things better.  Second, other organizations have set up infrastructures so that individual researchers can pre-register their studies, to document hypotheses, methods, analyses, research materials, and data so that others can reproduce published research results (Open Science Framework).  Third, specific journals, such as Psychological Science, have set up pre-review confessionals of sorts to indicate the conditions under which the data were collected and analyzed.  Fourth, others have created vehicles so that researchers can confess to their methodological sins after their work has been published (psychdisclosure.org).  In fact, psychology should be lauded for the reform efforts it has put forward to address the believability crisis, as it is only one of many scientific fields in which the crisis is currently raging, and it is arguably doing more than many other fields.

While we fully support many of these efforts at reform, it has become clear that they leave a gaping hole through which researchers can blithely walk.  People can and do ignore recommendations.  Researchers can avoid pre-registering their work.  Researchers can also avoid publishing in journals that require confessing one’s questionable research practices (QRPs) before review.  And published authors can avoid admitting to their QRPs post hoc.  What this means is that research continues to be published every month in our most prestigious journals that in design and method looks indistinguishable from the research that led to the believability crisis in the first place.

In searching for solutions to this problem, we thought that instead of relying on the good graces of individual researchers to change their own behavior, or waiting for the slow pace of institutional change (e.g., journals to follow Psychological Science’s lead), it might be productive to provide a tool that could be used by everyone, right now.  So what are we proposing?  We propose a set of questions that all researchers should be able to answer pre-publication in the review process: the Pre-Publication Transparency Checklist (PPTC).  Who should use these questions?  Reviewers.  Reviewers are free to ask any question they want, as many of us can attest.  There is nothing stopping researchers from holding other researchers accountable. The goal of these questions is to get even those unwilling to pre- or post-register their research process to cough up background information on how they conducted their research and the extent to which their results are “fragile” or “robust”.  The questions are inspired by the changes recommended by many different groups and would hopefully help to improve the believability of the research by making authors describe the conditions under which the research was conducted before their paper is accepted for publication[1].

The Pre-Publication Transparency Checklist

  1. How many studies were run in conceptualizing and preparing the set of studies reported in this paper?
    • How many studies were run under the original IRB proposal?
    • How many “pilot” studies were run in support of the reported studies?
  2. If an experiment was run, how many conditions were included in the original study and were all of these conditions included in the current manuscript?a
  3. Was any attempt made to directly replicate any of the studies reported in this paper?
    • Would you be willing to report the direct replications as an on-line appendix?
    • Note: In some fields it is common to replicate research but not report the efforts.
    • Note: Some studies are difficult to replicate (e.g., longitudinal, costly, technologically intense).
  4. Approximately how many outcome measures were assessed in each study?a
    • How many of these outcome measures were intended for this study?
    • How many outcome measures were analyzed for each study?
    • Do all of the conceptually and empirically related DVs show the same pattern of effects?
  5. In any of the studies presented, were the data analyzed as they were being collected (i.e., peeked at)?
    • If it was “peeked” at, was there an effort to address the potential increase in the Type I error rate that results from peeking, such as conducting direct replications or using Bayesian estimation approaches?
    • Note: The goal is not necessarily to eliminate p-hacking but to make sure our findings are replicable despite p-hacking (see Asendorpf et al., 2012, for a discussion).
  6. What was the logic behind the sample sizes used for each study?a
    • Was a power analysis performed in the absence of pilot data?
    • Was an effect size estimate made on the initial work and used for power estimates of subsequent studies whether they were direct or conceptual replications?
  7. Were any participants assessed and their data not included in the final report of each study?a
    • What was the rationale for not including these participants?
  8. Do all of the co-authors of the study have access to the data?
  9. Can all of the co-authors of the study reproduce the analyses?
    • If not, why and who can?
    • Note: It is common for statistical experts to lend a helping hand so it is not necessarily bad that all the authors cannot reproduce the analyses.  But, it is important to know who can and cannot reproduce the analyses for future efforts to reproduce the results.

  10. Were there any attempts to test whether the results were robust across analytical models and different sets of control variables?
    • If the results do not replicate across models, was this factored into the alpha level (i.e., treated as multiple tests of significance)?
  11. Approximately how many different analytical approaches were used?
    • Were alternative ways of analyzing the data considered and tested?
    • Note: It is common to try different variants of the general linear model (ANOVA, ANCOVA, regression, HLM, SEM).  It would be important to know whether the results replicate across the various ways of analyzing the data.
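
Item 5 turns on how much peeking can inflate the Type I error rate. The following is a minimal simulation sketch rather than part of the checklist itself. It assumes two groups drawn from the same normal population with a known SD of 1, a two-sided z-test at alpha = .05, and a hypothetical schedule of five looks at 10, 20, 30, 40, and 50 participants per group. All function names are illustrative.

```python
import random
from statistics import NormalDist, mean


def z_statistic(group_a, group_b):
    """Two-sample z statistic for equal-sized groups, assuming a known SD of 1."""
    n = len(group_a)
    standard_error = (2 / n) ** 0.5
    return (mean(group_a) - mean(group_b)) / standard_error


def false_positive_rates(n_sims=2000, looks=(10, 20, 30, 40, 50), alpha=0.05, seed=1):
    """Compare one test at the final sample size with testing at every look.

    The null hypothesis is true in every simulated study, so each
    rejection counted here is a false positive.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    final_n = max(looks)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(final_n)]
        b = [rng.gauss(0, 1) for _ in range(final_n)]
        # Fixed design: a single test once all data are in.
        if abs(z_statistic(a, b)) > critical:
            fixed_hits += 1
        # Peeking: declare success at the first look that is significant.
        if any(abs(z_statistic(a[:n], b[:n])) > critical for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"single test at the end: {fixed_rate:.3f}")
print(f"stop at first significant look: {peeking_rate:.3f}")
```

Under these assumptions the single-test rate should sit near the nominal .05, while the stop-when-significant rate should come out noticeably higher. That gap is the inflation the follow-up question under item 5 asks authors to address.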

So, as we noted above, these questions could be asked of researchers when they present their work for review.  Ideally, the answers to these questions would become part of the reported methods of every paper submitted, possibly as an on-line appendix. If reviewers asked these types of questions of every paper that was submitted, that in itself would change the publication incentive structure quite dramatically.

A second way that the Pre-Publication Transparency Checklist could be used is by editors of journals other than Psychological Science.  Like reviewers, editors could ask authors to simply answer each of these questions along with their submission.  There is no reason why Psychological Science should go it alone with this type of questioning.  The effort to answer these questions is minor, far less, for example, than the time taken to complete the typical IRB form.  Again, if editors used the PPTC, which they should be free to do today, we could be on our way to a better, more substantial body of research on which to base our future scientific efforts.

Given the heterogeneity of reactions to the believability crisis in psychology, we do not foresee the answers to these questions being “right” or “wrong” so much as providing information that other researchers can use to determine whether they personally would want further follow up before concluding that the research was reliable. But, of course, like the traditional methods we use in psychological science which rely on transparency, accuracy, and honesty, the answers will only be as good as they are truthful.

We are also sympathetic to the point that many of the questions will be difficult to answer and that many questions will not apply to different types of research.  That is okay.  The goal is not an exacting account of every behavior and decision made on the way to a finished publication.  The goal is to provide background information to help determine how robust or delicate the findings may be.  For example, if dozens of studies were run looking at scores of outcomes and only a few of the studies and outcomes were reported, then other researchers may not want to attempt to build on the findings before directly replicating the results themselves.  Similarly, if multiple analyses were conducted and only the statistically significant ones reported, then other researchers would likewise be cautious when following up on the findings.

As noted above, the PPTC would not be necessary if researchers pre-registered their studies, posted their materials and data on-line, and were transparent with the description of their methods.  But, given the obvious fact that not every researcher is going to pre-register their materials, the Pre-Publication Transparency Checklist provides a means through which reviewers and editors can get these individuals to provide desperately needed information on which to judge the robustness or fragility of the reported findings in any submitted manuscript.

Brent W. Roberts, University of Illinois, Urbana-Champaign

Daniel Hart, Rutgers University


[1] These accords include those proposed for pre-review at Psychological Science, post-publication disclosure at psychdisclosure.org, and as part of the 21 Word Solution.  Those questions marked with an “a” are similar in content to the questions found on existing systems.


7 Responses to The Pre-Publication Transparency Checklist

  1. Greg Francis says:

    The checklist is an interesting idea for raising awareness about these kinds of issues. Ignorance is still rampant, so I believe there is some merit in reminding authors and reviewers to think about these topics when submitting or evaluating a manuscript.

    However, something has always bothered me about these kinds of requirements, and I thought it might be useful to describe my concerns as a counterpoint to the checklist. It seems to me that in a good scientific study a checklist like this should not be necessary, and that the imposed checks actively hinder scientific exploration.

    1. How many studies were run in conceptualizing and preparing the set of studies reported in this paper?

    Doesn’t a good set of experimental results stand on its own, regardless of how many pilot studies were attempted? I understand the concern here; perhaps there are hundreds of failed studies and the authors are presenting just the few that happened to reject the null. But such a situation typically leads to lousy studies (small sample sizes, p values just below .05, unbelievably large effect sizes). Let’s just call them lousy studies and be done with it.

    2. If an experiment was run, how many conditions were included in the original study and were all of these conditions included in the current manuscript?

    I see this as being similar to item 1. It is a bit different in that withholding information about a condition misrepresents the reported experiment (perhaps even the procedure, for a within-subjects design); I see these as a gross misrepresentation of the study. In general, though, cases involving selective reporting of a handful of conditions are going to produce pretty lousy results. Let’s just call them lousy results.

    3. Was any attempt made to directly replicate any of the studies reported in this paper?

    I am always a bit baffled by questions of this type. If the purpose of the question is to imply that a replication is a good thing to avoid Type I errors, then the same point can be achieved by just requiring one study to satisfy p<.0025. If the purpose of the question is for authors to demonstrate generalizability of the methodology, then a lot more effort is required.
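
The p<.0025 figure can be read as the chance that an original study and its direct replication are both Type I errors. This is a short worked line, assuming the two studies are independent, each is tested at alpha = .05, and the null is true in both:

```latex
\[
P(\text{both studies reject} \mid H_0) = \alpha \times \alpha = 0.05 \times 0.05 = 0.0025
\]
```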

    4. Approximately how many outcome measures were assessed in each study?

    This question seems similar to items 1 and 2. I understand the concern that authors may craft measures to ensure statistical significance, but such efforts tend to produce lousy results. Let's just call them lousy measures.

    Alternatively, and this point somewhat applies to other items as well, we could just recognise that identifying outcome measures is part of the scientific process of exploration. Such activity is very common in psychology (I would argue it is the norm). I think part of the problem the field faces right now is that they engage in exploratory work but describe it as confirmatory work. If someone has a large data set and searches through it for interesting patterns, I think that is good scientific work. I would not, however, suggest that the authors should use hypothesis tests to claim statistical significance nor that they should use such measures to justify a theoretical claim. Exploratory work can be used to build theoretical ideas, but not to test them. Testing requires new data sets.

    5. In any of the studies presented, were the data analyzed as they were being collected (i.e., peeked at)?

    I'm not sure what to make of this item. On the one hand, someone who admits to data-peeking should not be allowed to report results from (uncorrected) hypothesis testing. On the other hand, it is very difficult to make these kinds of judgments. Suppose someone runs the planned 20 subjects per group and gets the desired outcome (reject the null). The inference is not necessarily valid. If the outcome had been to not reject the null, maybe the researcher would have added another ten subjects to each group, so stopping after the "planned" sample was actually data-peeking, even though it did not feel like it to the researcher. For that matter, what is a replication study but a decision to gather more data? If a researcher plans a replication study, then they are explicitly planning to data-peek at some intermediate state of data collection. I don't see easy solutions to these problems. The difficult solutions jettison hypothesis testing.

    6. What was the logic behind the sample sizes used for each study?

    The answers at psychdisclosure.org make it pretty clear that this is a very difficult question to answer. Usually there is very little logic behind the sample sizes.

    7. Were any participants assessed and their data not included in the final report of each study?

    No problems with this one. It's easy to answer and should be standard. I think most people know and follow this kind of rule already (at least within my specialised subfield).

    8. Do all of the co-authors of the study have access to the data?

    I don't see this as being so serious regardless of the answer.

    9. Can all of the co-authors of the study reproduce the analyses?

    Similar to item 8.

    10. Were there any attempts to test whether the results were robust across analytical models and different sets of control variables?

    The spirit of this question seems to run counter to some of the other items (1, 2, 3, 4, 11). It seems to encourage trying out various analyses. I understand the desire for robustness, but this kind of approach often undermines rather than supports theoretical claims. If I require statistical significance for measures of performance on Raven's tests and IQ, the probability of a successful experimental outcome (both measures rejecting the null) must be lower than the probability of any one measure being successful. This property undermines replicability.
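
The point about requiring two significant measures can be written out briefly. The general bound below always holds; the numerical case is only an illustration, assuming (hypothetically) that the two tests are independent and each has power 0.8:

```latex
\[
P(A \cap B) \le \min\{P(A),\, P(B)\},
\qquad
\text{e.g. } P(A \cap B) = 0.8 \times 0.8 = 0.64 \text{ under independence.}
\]
```

Here A and B stand for "the first measure rejects the null" and "the second measure rejects the null".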

    11. Approximately how many different analytical approaches were used?

    I appreciate the concern here (researchers might try out different analyses to find the few that happen to show a significant outcome). My impression is similar as for items 1, 2, and 4; such approaches often produce lousy results and we can consider them on that basis. Alternatively, such approaches can be considered exploratory work (it's technically a kind of model development), and I think it is wrong to discourage such work.

    Ultimately, it seems to me that a properly functioning scientific field would not ask authors to complete a checklist of this type because it would not be necessary. The concerns behind the items can largely be addressed by considering two types of scientific work:

    1) Confirmatory studies: Authors have a clearly articulated theory that is fully described so that readers can see how the theoretical claims follow from the theory ideas. The theory must quantitatively predict effect size magnitudes, so justifiable sample sizes then easily follow. Likewise, the analyses and measures all flow from the theoretical ideas. Checklists are not necessary because every (appropriately trained) reader can follow the argument from theory to the experimental design and analysis. A reader can disagree with various stages of the process, but the connection is clear.

    2) Exploratory studies: Authors do not have a clearly articulated theory for the area of investigation, so they are trying out different possibilities and seeing if anything interesting pops up. In such a situation the checklist is unnecessary, and such a study should probably not be using hypothesis testing at all. If the reported results are based on hypothesis testing, they should be interpreted as being tentative. Since the authors are just publicly trying out different ideas, there is not necessarily a need to discuss other studies, measures, analyses and so forth (it might be a good idea in some circumstances, but it should be up to the authors).

    I have a similar attitude toward pre-registration procedures. If an author is doing confirmatory studies from a clearly defined theory, then a pre-registration is just writing down what anyone could derive from the theory. It will be the same derivation before the study and after the study, so the pre-registration is superfluous. On the other hand if an author is doing exploratory studies, then pre-registration seems rather silly: why should the author be tied to one particular outcome when the point of the study is to explore possibilities? Is the data not valid just because the author did not anticipate some outcome?

    It seems to me that one of the fundamental problems in the field of psychology is that people think they are running confirmatory studies but they are actually running exploratory studies. I like both types of studies, but they convey different types of scientific information. We need to better understand which kind of work we are actually performing.

    • Genobollocks says:

      If Roberts and Hart are right and we have a believability crisis (i.e. not just a reproducibility crisis, but gullible psychologists still believe), then they gave a reviewer, who wants to be able to believe again, a checklist to identify some of these lousy (exploratory posing as confirmatory) studies.
      You seem to operate on the assumption that everybody can already identify those, which is definitely not the case.
      You are right about exploratory and confirmatory research, but of course there is currently an incentive to hide the fact that you did not do confirmatory research (especially when in excess of 99% of studies do not fulfil your strict criteria such as “quantitatively predict effect size magnitudes”).

      I very much agree with you on #5. While I appreciate the effort to get into the nitty-gritty of what counts as acceptable analysis practice, I think good statisticians cannot wholeheartedly stand behind simple FDR correction or MC corrections (e.g., http://arxiv.org/pdf/0907.2478.pdf).
      But, while I really wouldn’t want to spend a week discussing with you how using both a FIML-SEM and a multilevel model with shrinkage on multiply imputed data should affect the alpha level, I think these questions, if honestly answered (sure…), are suited to identify people who are “gaming the system” (knowingly or unknowingly) by simply searching for the strongest pattern and then a story to go with it.

      • Greg Francis says:

        I agree that there is value to the questions, especially as a mechanism for making people think about the properties of their studies and their conclusions. Still, I think several questions have no good (or even honest) answers. Some of the problems are with the methods (especially hypothesis testing) rather than with the researchers. A lot of the things people intuitively feel to be correct (such as add more data points) are inappropriate for confirmatory work with hypothesis testing, but such activities are valid for exploratory work. A scientific field needs confirmatory studies, but it also needs exploratory studies. We currently have a lot of the latter, but they are described as being the former. This mislabeling is confusing almost everyone, and is, I think, why we have a believability crisis. Papers tell us that they have “proven” something, but the methods used are only appropriate to suggest the possibility of something (maybe with a story to go along, but that is often premature).

    • Q. says:

      “Confirmatory studies: Authors have a clearly articulated theory that is fully described so that readers can see how the theoretical claims follow from the theory ideas. The theory must quantitatively predict effect size magnitudes, so justifiable sample sizes then easily follow. Likewise, the analyses and measures all flow from the theoretical ideas. Checklists are not necessary because every (appropriately trained) reader can follow the argument from theory to the experimental design and analysis”

      I don’t understand this. I reason that exploratory research, when it is written down, mimics this (previous results are used to build a theoretical framework and to come up with plausible hypotheses in the introduction-section), so how does one discern confirmatory and exploratory research without, for instance, pre-registration? I don’t think it can be distinguished by looking at the theoretical framework used to come up with the hypotheses.

      • Greg Francis says:

        I am not sure I know what Q. did not understand, so I will just try to clarify a few points.

        Non-quantitative theoretical ideas can motivate exploratory work, but they cannot predict the outcome of a hypothesis test. Consider the theoretical idea that women find men wearing a red shirt to be more attractive than men wearing a blue shirt. This statement is a theoretical idea, but it does not indicate the magnitude of the effect. Without an estimated effect size, it is not possible to identify sample sizes that provide sufficient power (the probability of rejecting the null for the given effect size). At best a researcher can only predict the probability of rejecting the null, but even that is not possible without a predicted effect size. I would suggest that there is no reason to perform a hypothesis test in such a situation. The researcher wants to estimate the effect size (it might be zero), and this is best done with something like a confidence interval. I would call this exploratory work.

        On the other hand, suppose I have in mind some quantitative theory (such as: the judgments of attractiveness are determined by judgments of status, and previous work has shown that shirt color produces a d=0.5 effect on status judgments so I predict the same effect size for attractiveness ratings). If I run a power analysis, I find that I need 105 subjects in each group to reject the null hypothesis 95% of the time. If such an experiment does not reject the null, I would be skeptical about the validity of the theory. I would call this confirmatory work.
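
The sample size quoted for d = 0.5 can be roughly checked with a normal approximation. This is a minimal sketch, assuming a two-sided, two-group comparison at alpha = .05 with equal group sizes. The function name is illustrative, and an exact calculation based on the noncentral t distribution typically asks for a participant or so more per group than this approximation does.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.95, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2,
    where d is the standardized mean difference (Cohen's d).
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


approx_n = n_per_group(0.5)
print(f"approximate n per group for d = 0.5 at 95% power: {approx_n}")
```

The result should land close to the 105 per group mentioned above. Halving the predicted effect size roughly quadruples the required n, which is why a justified quantitative prediction matters so much for planning.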

        Having a pre-registered prediction does not necessarily make a study confirmatory, at least not in an interesting way. I can pre-register a prediction that the effect size is 0.6 for a difference in some preference-for-order measure between subjects in a control group and subjects who see a line of waterfowl (ducks in a row). Even if the prediction happens to be close to the effect size estimate from the experiment, this result would not validate the prediction in any meaningful way, unless I can explain the reasons for the prediction. If I just made up a number that happened to be correct, I was just lucky.

        So, confirmatory work is defined by predictions that are both quantitative and justified. Exploratory work can be motivated by non-quantitative or non-justified ideas. That’s how the theoretical framework distinguishes the different types of research. I should emphasize that both types of research are useful and important.

    • pigee says:

      Excellent comments. The most important is the point that we package our exploratory work as confirmatory. That is, in a nutshell, the motivation behind the checklist: we have gotten so good at dressing up our sow’s ears as silk purses that I think we often convince ourselves that our explorations are confirmations. In reality, we don’t have a “believability crisis”, but an “I want to believe” crisis.

      While I agree with the import of comments 1, 2, & 4, that these practices represent “lousy research”, it is impossible in the current publication system to discern whether these strategies were employed and therefore to draw the appropriate conclusion that the research is lousy. Actually, I prefer to infer that if these practices are employed, the research is simply exploratory. Nonetheless, if the answers to the question confirm any of these behaviors, then readers can infer that the results are not confirmatory, which as you acknowledge is just dandy. Though, it might take a few years of therapy for people to get used to admitting to conducting exploratory work when the obvious rewards in the system are for confirmatory, hypothesis testing. Have you read I/O journals lately? They are a hilarious litany of pedantic “hypotheses”.

      Item 3 is, in part, motivated by the comments made by some colleagues that “in my area we always replicate” despite the fact that there are no published replications. If it is the case that these multiple replications are lying about somewhere, then it would be very easy to include them as an appendix, thus enhancing the inference that the work is reliable. As long as replications lie hidden, if conducted at all, they will do us no good. While I am sympathetic to your argument that direct replications can be misleading, especially if they were arrived at through QRPs, they are far better than conceptual replications for reasons that I’ll take up in a future post.

      On question 4, surfing through the matrix of multiple measures (that are typically not reported in the study) does not mean the measures were lousy, just that the tests of significance run against them are not confessed to. The authors may have used 15 gold standard measures and only report the one that hit a valued p value.

      Item 5 (peeking) is simply an attempt to know whether there was any logic behind the data collection plan (like question 6). At the moment, we simply don’t fess up to any logic.

      Question 6 is very easy to answer. It may be difficult to admit to, but not hard to answer. Of course, the answer, I suspect, will often be “There was none.” So, let me provide an easy one for everyone in social, personality, and organizational psychology, where we now have multiple meta-analyses providing estimates of the modal effect sizes in each field. In correlational terms, the modal effect size in the average study is equal to an r of .20. So, one very simple strategy going forward is to power your study to detect the average effect size. To attain 80% power you would need to run about 200 participants. Let the complaining begin. Oh, and if you want to get even more irritated, the effect sizes for interactions are typically smaller, which would mean you need closer to 400 participants, on average, to have enough power to detect the effect.
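
The figure of roughly 200 participants for r = .20 can be checked with the Fisher z approximation. This is a minimal sketch, assuming a two-sided test of a single correlation at alpha = .05. The function name is illustrative, and the smaller r used for comparison is a hypothetical value, not one taken from the meta-analyses mentioned above.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate total n needed to detect a correlation of size r.

    Fisher z approximation: n = ((z_alpha + z_power) / atanh(r)) ** 2 + 3.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


n_modal = n_for_correlation(0.20)
n_smaller = n_for_correlation(0.14)  # hypothetical smaller effect, e.g. an interaction
print(f"n for r = .20 at 80% power: {n_modal}")
print(f"n for r = .14 at 80% power: {n_smaller}")
```

Under these assumptions the first value should come out just under 200, and the required n climbs steeply as the expected correlation shrinks.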

      Questions 8 & 9 I call the “Fredrickson” questions as they allude to the case where one author has no clue what the other author has done. Knowing this in advance, could help to clear up problems in the future when people ask for the data and the like. There are also cases where one author, like Stapel, cooks the books and the other authors don’t know it (there was another case at Wash U). While it is not necessary that everyone do everything on a paper, knowing who did what can help when future studies fail to find the same effect and you want to know why. Again, tossing the data and the syntax on-line with the pub would obviate much of this.

      Questions 10 and 11 are not running contrary to the early questions. Rather, they are providing more and better information about the explorations that were presumably made. If it is the case that much of our work is exploratory, conducting tests of robustness by testing alternative models and controls is actually good practice. Of course, if the research was tight and no exploratory work was done, the answers to these questions are easy and short.

      • Greg Francis says:

        The checklist motivated me to explicitly write down ideas I had been thinking about for some time. Indeed, the best argument for a checklist may be that it makes people stop and think about what they are doing.

        Just a few specific comments.

        Lousy research: I had in mind something like the Test for Excess Significance when characterising the research as lousy. To be more specific, if someone gets data on 12 measures of creativity and reports only those that satisfy p<.05, this is going to demonstrably be "lousy research". This approach almost invariably leads to a situation where some of the tests just barely satisfy the significance criterion, which means the post hoc power is never going to be much above one-half. Data peeking, dropping of experiments, and other QRPs tend to leave similar traces that are markers of lousy research. These traces can exist for both confirmatory and exploratory research.

        For question 6 (describing sample sizes), I think it is often genuinely difficult to answer the question because researchers do not know what they would do in various situations. It is clear that something caused data collection to end, and maybe it was related to the fact that the outcome became significant. I am not sure anyone (even the authors) can say whether data collection would have continued if the outcome was not significant.
