Over the past few weeks, I’ve had several seemingly benign conversations with students about research, only to realize in retrospect that we had blithely engaged in, or proposed to engage in, several of the questionable research practices (QRPs) that lead to inflated Type I error rates. Having spent so much time stewing on these issues, I found the experience rather deflating. That our students could escape our graduate program without the clear message that these practices are problematic seemed a clear indication of pedagogical failure. That I could fail to teach these issues effectively was an indication that my standard operating procedures were more than lacking. So, I thought it would be constructive to post some readings that, if consumed, would mean that anyone still engaging in QRPs would be doing so with eyes wide shut.
A Little Background
In a perfect world, most of these issues are covered in a basic methods course that all grad students take. A typical methods course will cover issues such as internal validity, external validity (generalizability), construct validity, and statistical conclusion validity, along with topics such as ethics and the various techniques one uses in research, such as within-subjects designs, experience sampling, or growth modeling. Given the demands of a typical methods course, many issues cannot be covered in the detail they deserve. Also, we leave a lot of methods teaching to On the Job Training (OJT)–we are supposed to teach our students how to do things properly as they conduct their research. Something in this combination has not gone as well as could be expected, thus the need for some supplementation.
It is clear from the blow-up of methodological imbroglios in social and personality psychology that most of our problems arise from either willful ignorance or Machiavellian abuse of null hypothesis significance testing, combined with the use of QRPs. So, the list below emphasizes the niceties of statistical conclusion validity and how to avoid QRPs, at the expense of topics like external and construct validity.
Null Hypothesis Significance Testing (NHST), or as Cohen describes it, Statistical Hypothesis Inference Testing (SHIT)
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
What not to do (kind of like What Not to Wear)
As I’ve noted elsewhere, I’m not entirely optimistic that reading this material is enough to protect us from conducting problematic research. The fact that we have known for five decades (yes, 50 years since Cohen’s 1962 paper) that our studies are underpowered, and that we still conduct underpowered research, could be taken as evidence that we are impervious to influence. That said, if you do read and understand these papers and continue to blithely run underpowered studies and p-hack your way to fame and fortune, at least you do it with the knowledge that your papers may make you famous in the short run, but fade away in the long run.
In closing, this list is clearly idiosyncratic and incomplete. Feel free to plug your favorite classic methods papers. It can’t hurt, can it?
nice list. thanks for this – every bit of awareness helps!
This looks like a great list, Brent!
I’m going to take a moment to summarize some of the key points we’ve discussed in the past year during our research methods discussions in PIG-IE as a mechanism for supplementing your reading list and/or reinforcing the potential value of some of the papers you list.
1. One of the limitations of psychological science as we practice it is that we behave more like lawyers than scientists. Namely, we seek evidence that is compatible with our favored hypotheses and use that evidence to build our case. A number of problems emerge when we behave in this way.
A. We continue to run studies and/or collect data until we find the evidence we’re seeking (e.g., a statistically significant mean difference in a predicted direction) without reporting the studies/data that were run that failed to produce the expected effect.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.
B. We justify this practice as standard operating procedure by referring to these false starts as “pilot studies,” “working out the kinks in the procedures,” and “developing our measures” without acknowledging that (a) this process is equivalent to testing a hypothesis multiple times and driving up the Type I error rate accordingly and (b) this process necessarily speaks to the lack of generalizability of our effects.
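To put rough numbers on point (a), here is a minimal simulation sketch (not drawn from any of the papers above), assuming a two-group t-test design in which the true effect is exactly zero; the five attempts and n = 20 per group are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def chase_significance(max_attempts=5, n_per_group=20, alpha=0.05):
    """Run up to max_attempts independent 'pilot' studies of a true null
    effect and stop as soon as one reaches p < alpha."""
    for _ in range(max_attempts):
        a = rng.normal(0, 1, n_per_group)  # true group difference is exactly zero
        b = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True   # "it worked" -- this is the attempt that gets written up
    return False          # the file drawer

runs = 10_000
rate = sum(chase_significance() for _ in range(runs)) / runs
print(f"Nominal alpha: 0.05; realized Type I error rate: {rate:.2f}")
# With five attempts per hypothesis the realized rate is roughly
# 1 - 0.95**5, i.e., about 0.23 rather than 0.05.
```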
2. Beyond these issues, there are some widespread flaws in research designs that characterize the field that limit our ability to understand the world accurately. I would summarize the most important problem as the following: We rely on NHST excessively without fully understanding what NHST is. The key consequence of this is the following:
A. We design studies that lack sufficient statistical power to appropriately reveal the effects or associations of interest. If the typical study in psychology has the statistical power to detect an effect of interest 50% of the time, the implication is that, if the null hypothesis is false, our ability to uncover the truth is no better than a coin toss. The tragic implication of this is that, as a field, our hit rate for correctly rejecting the null hypothesis would be the same as it is now if we stopped funding research and, instead, gave every researcher a quarter to flip to test hypotheses.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
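To see why the coin-toss comparison is not just rhetoric, here is a rough Monte Carlo sketch (the "medium" true effect of d = 0.5 and n = 32 per group are illustrative assumptions chosen to yield roughly 50% power):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(d=0.5, n_per_group=32, alpha=0.05, sims=20_000):
    """Monte Carlo power of a two-sample t-test when the true effect is d."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        if stats.ttest_ind(control, treatment).pvalue < alpha:
            hits += 1
    return hits / sims

print(simulated_power())
# ~0.51: a "medium" true effect studied with 32 participants per group is
# detected about as often as a fair coin comes up heads.
```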
B. Because of the low power of our studies, the effects we estimate in our research are substantially larger than they really are. In short, because only large effects can emerge as statistically significant in a low-powered design, researchers who obtain significant results are systematically overestimating the effects of interest. In my opinion, Frank Schmidt has explained this point beautifully:
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
And one consequence of this is that we see "puzzlingly high correlations" in research areas that use small sample sizes–correlations that probably excite a lot of people (e.g., the press, young investigators entering the field) and concern others.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290.
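A short simulation sketch shows how this inflation works (the true correlation of .20 and n = 20 are illustrative assumptions, not figures from Schmidt or Vul et al.):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def significant_estimates(true_r=0.20, n=20, sims=20_000, alpha=0.05):
    """Collect only the correlation estimates that reach p < alpha in
    repeated small-sample studies of a modest true effect."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    kept = []
    for _ in range(sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r, p = stats.pearsonr(x, y)
        if p < alpha:
            kept.append(r)
    return np.array(kept)

sig = significant_estimates()
print(f"True r = .20; mean statistically significant r = {sig.mean():.2f}")
# With n = 20, only correlations of about .44 or larger can reach p < .05,
# so the estimates that survive the filter average roughly .5 -- more than
# double the true effect.
```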
C. Because of our emphasis on NHST, many researchers have lost interest in estimating effects as useful parameters that can inform practice, theory, and research design. This naturally limits what we can contribute as a science. (We cannot, for example, provide an educated guess about the proportion of people who will develop anxiety disorders or experience depressive episodes if exposed to traumatic events, because a typical study doesn't attempt to do more than reject the null hypothesis. Moreover, we cannot even begin to confront seriously questions about generalizability without estimates that could potentially vary from one traumatic context to the next or that could vary as a function of pre-existing individual differences.)
Another paper by Schmidt expresses this point nicely; it is a point most scholars with experience in meta-analysis intuitively appreciate.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.
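As a minimal, hypothetical illustration of the estimation attitude Schmidt advocates (the counts below are invented for illustration), the goal is to report the parameter and its precision rather than a bare reject/do-not-reject decision:

```python
import math

# Invented counts for illustration: 37 of 200 people exposed to a traumatic
# event later meet criteria for an anxiety disorder.
k, n = 37, 200
p_hat = k / n

# 95% Wilson score interval for the proportion.
z = 1.96
centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
half_width = (z / (1 + z**2 / n)) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
print(f"Estimated rate: {p_hat:.3f}, 95% CI [{centre - half_width:.3f}, {centre + half_width:.3f}]")
# Roughly 0.185 [0.137, 0.245]: a parameter one can actually use for
# practice and theory, rather than a bare "p < .05".
```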
3. We do not value direct replications as much as we should. Many psychologists (correctly, in my view) value conceptual replications–replications of findings that involve modifications to design, assessment, etc.–partly because such replications (a) help to establish the robustness of the effect under similar, but different, circumstances and (b) add something new to the literature. However, I'm increasingly convinced that direct replications serve an important purpose in our field too, as long as the problems articulated in Point 1 persist. Specifically, given the degrees of freedom a typical investigator gives himself or herself with respect to writing off non-significant results as "pilot studies", assessing multiple dependent variables and potentially reporting only the ones that work, etc., we would be wise never to believe a finding without seeing it replicated directly. In the current culture, a "conceptual replication" may just represent one finding from a collection of studies/variables that were expected to work.
4. If you believe what I wrote in Point 1, then it necessarily follows that the scientific literature in psychological science is not scientific.
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review.
Until the leaders in the field (i.e., journal editors, executive boards of research-based organizations) and in our departments change the incentive structures, nothing will change. If we reward scholars for the number of publications they produce, for example, they will invest in conducting multiple small-n studies with multiple measures that will easily produce some large effects and that can be written up for publication swiftly. (I'm not suggesting that scholars will try to game the system explicitly. I'm suggesting that people will gravitate towards research strategies that "work" and, unfortunately, the strategies that work for producing significant effects are not necessarily the same strategies that work for producing a cumulative base of knowledge.)
One way to change this is to give greater weight to papers based on large samples and to value/evaluate the quality of the questions and methods more than the results themselves (Fraley & Marks, 2007). Moreover, if we better incentivize direct replications (or, equally valuable, in my opinion, narrow confidence intervals), the research that is published will be better positioned to inform our knowledge in psychology. Ultimately, we need a cumulative science in which we can believe.
Fraley, R. C., & Marks, M. J. (2007). The null hypothesis significance testing debate and its implications for personality research. In R. W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 149-169). New York: Guilford.
Nailed it.
There is a great deal of wisdom in Chris’ commentary above. But I think in one respect he is being overly optimistic in both his assumptions and in terms of whether fixing the problem would in fact make psychological science scientific. Specifically, regarding:
“We design studies that lack sufficient statistical power to appropriately reveal the effects or associations of interest. If the typical study in psychology has the statistical power to detect an effect of interest 50% of the time, the implication is that, if the null hypothesis is false, our ability to uncover the truth is no better than a coin toss. The tragic implication of this is that, as a field, our hit rate for correctly rejecting the null hypothesis would be the same as it is now if we stopped funding research and, instead, gave every researcher a quarter to flip to test hypotheses.”
In contrast to the assumption implicit in this quote, many individual difference publications emerge from studies that were not specifically designed to detect the specific association(s) reported (except in the most limited sense that the relevant IVs and DVs were assessed). If–as seems to be near consensus in the field–the goal is to “tell the story of a given dataset” (i.e., mine it for statistically significant results), the Type I error rate can easily go to near unity over multiple comparisons. In other words, our scientific literature–worst case scenario–can be a considerably less valuable indicator of the truth than the coin flip analogy implies. And unless we know what process led to the published result, it is hard to know (although p-curves and the like may provide some insight into the worst abuses).
But considerably worse yet is what is revealed when one takes power seriously. Once we can sidestep all of the NHST pitfalls implied above (i.e., our parameter estimates are for all intents and purposes the population values because of representative sampling coupled with very large N studies and high quality measurement), what becomes clear is that many and indeed perhaps most of our individual difference “theories” are so flimsy in terms of their predictions (i.e., any non-nil value in the predicted direction) that they hardly merit serious attention. As Meehl noted, with enough N, our data conform to directional predictions 50% of the time–irrespective of the true model that generated the data. Importantly, this is far worse than the situation described above, where one at least has to make the effort to search around in an underpowered dataset to find a statistically significant ‘hit’, because Meehl’s critique applies to each and every analysis in a highly powered study. The situation is even worse when one cannot even eke out a directional prediction (for examples, see individual difference studies examining molecular genetic and neuroscience correlates of psychological phenotypes). Here, with sufficient N, under our current system, one’s “theory” gains support each and every time one runs an analysis!
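A rough simulation sketch of Meehl's point (the "crud" correlation of .03 and the N of 20,000 are illustrative assumptions): with enough N, essentially every test comes out significant, so a directional prediction made without any real theory behind it is "confirmed" about half the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def directional_confirmations(n=20_000, crud=0.03, sims=2_000, alpha=0.05):
    """With a huge N, every 'crud' correlation is statistically significant,
    so a directional prediction with no real theory behind it still comes
    out 'confirmed' about half the time."""
    confirmed = 0
    for _ in range(sims):
        true_r = crud * rng.choice([-1, 1])      # tiny correlation, arbitrary sign
        cov = [[1.0, true_r], [true_r, 1.0]]
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r, p = stats.pearsonr(x, y)
        if p < alpha and r > 0:                  # the 'theory' predicted a positive sign
            confirmed += 1
    return confirmed / sims

print(directional_confirmations())   # ≈ 0.5, whatever process generated the data
```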
In short, in my view the real menace lies below the surface of this debate over the abuse of NHST. Indeed, once we solve that problem (e.g., by paying attention to basic methods advice about power) we’ll have a much bigger fish to fry–why many individual difference “theories” are quasi-unfalsifiable given enough sample.
Glenn wrote: “In other words, our scientific literature–worst case scenario–can be a considerably less valuable indicator of the truth than the coin flip analogy implies. And unless we know what process led to the published result, it is hard to know”
This is a good point. I’m “optimistically” assuming that investigators formulate hypotheses and test them via underpowered studies–a state of affairs that is clearly not an optimal use of time or money and which has the potential to lead to erroneous conclusions.
In contrast, if investigators are doing something that is analogous to flipping 20 coins and being satisfied with any of the 20 tosses coming up heads, then we have a different kind of problem. In this case, we have many successful studies (from the investigator’s point of view and the point of view of his or her CV) because just about any data set will provide something that can be published. But, from the point of view of the field or the consumers of that empirical knowledge, we have a mess.
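The arithmetic behind the 20-coin analogy, as a two-line sketch (assuming 20 independent tests at the conventional threshold):

```python
# Chance of at least one "significant" result among 20 independent tests of
# true null effects -- the 20-coin-flip version of a study with many DVs and
# no correction for multiple comparisons.
alpha, k = 0.05, 20
print(1 - (1 - alpha) ** k)   # ≈ 0.64: most purely null datasets will still
                              # yield something publishable
```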
Glenn wrote: “In short, in my view the real menace lies below the surface of this debate over the abuse of NHST. Indeed, once we solve that problem (e.g., by paying attention to basic methods advice about power) we’ll have a much bigger fish to fry–why many individual difference “theories” are quasi-unfalsifiable given enough sample.”
I agree with you that theory testing in psychological science is not particularly rigorous. This isn’t a big problem on my radar screen at the moment because I’m still struggling with trying to understand how much of what I read is “real” given that our field does not value replication, doesn’t appreciate statistical behavior in small-n samples, emphasizes the novelty of findings over the rigor of methodology, and is reluctant to publish results from well-designed studies that “don’t work.”
But, in the spirit of being lawyer-like (thanks, Sanjay), I will make two claims. First, the kinds of problems you articulate are not specific to psychological science. That doesn’t mean that psychology cannot take the lead on improving things, of course. But I’d hate to see theories concerning individual differences thrown under the bus unless the problem is specific to these theories in particular. Second, most theories in psychological science make more than one prediction. Thus, to borrow Meehl’s language, the verisimilitude of the theory doesn’t hinge on a single directional prediction. It hinges on multiple claims and predictions. As a result, although any single directional hypothesis has a 50-50 chance of being supported in an error-free scenario (regardless of the verisimilitude of the theory), I don’t think this necessarily implies that the theory is a weak one or that it is “quasi-unfalsifiable.”
Of course, thinking about things in this way raises a number of questions about “theory testing” and falsifiability that have been debated among philosophers of science: How many “misses” should we tolerate before we consider a theory inadequate? Should we differentially weight predictions that are unique to a theory vs. predictions that are common to alternative theories? Are quantitative predictions easier to make in the study of simpler systems and, if so, should we find them more impressive? To what extent should we calibrate the parameters of theoretical models with empirical data to enable them to make more precise predictions? I’m not well versed in the philosophy of science, but a paper that had an enormous impact on my thinking is Meehl’s (1990) paper in Psychological Inquiry. I think this paper is worth adding to your list, Brent.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.
On Chris’ reactions below…
“I agree with you that theory testing in psychological science is not particularly rigorous. This isn’t a big problem on my radar screen at the moment because I’m still struggling with trying to understand how much of what I read is “real” given that our field does not value replication, doesn’t appreciate statistical behavior in small-n samples, emphasizes the novelty of findings over the rigor of methodology, and is reluctant to publish results from well-designed studies that “don’t work.””
How do you know what is “real”? First, ignore all findings from small-sample individual difference studies entirely. Such studies may well be valuable in identifying a new construct, method, etc., but they are so unlikely to produce generalizable truths (especially given flexible investigator degrees of freedom) that the actual findings of such studies should simply be set aside until the big sample work is completed. Indeed, I think it can be argued that it is the equivalent of malpractice to fail to power a study adequately to identify the full range of effects consistent with one's theory as statistically significant (typically, any value > or < 0, although I'd be comfortable ignoring effects smaller than .10 for theory-testing purposes). Second, only trust large sample studies when the data are publicly available for re-analysis, so as to determine whether the published analysis is representative of the dataset from which it was drawn. The short-cut to Part 2 is for the original work to have been the product of a collective of scientists with opposing theoretical perspectives who are forced to reach consensus with a large dataset. A good example of this process is (ahem) the Early Child Care Research Network, which managed the NICHD Study of Early Child Care and Youth Development. In such cases, perhaps it is not necessary to re-analyze the data oneself.
Second, regarding:
"But, in the spirit of being lawyer-like (thanks, Sanjay), I will make two claims. First, the kinds of problems you articulate are not specific to psychological science. That doesn’t mean that psychology cannot take the lead on improving things, of course. But I’d hate to see theories concerning individual differences thrown under the bus unless the problem is specific to these theories in particular. Second, most theories in psychological science make more than one prediction. Thus, to borrow Meehl’s language, the verisimilitude of the theory doesn’t hinge on a single directional prediction. It hinges on multiple claims and predictions. As a result, although any single directional hypothesis has a 50-50 chance of being supported in an error-free scenario (regardless of the verisimilitude of the theory), I don’t think this necessarily implies that the theory is a weak one or that it is “quasi-unfalsifiable.”
I agree with you that the problems I noted are not specific to psychology. But I would remind you that we are psychologists and it is our store we are responsible for minding 🙂
On the second point, I continue to think you are being overly optimistic. Under the current regime, theories gain currency (unfortunately, in a manner potentially orthogonal to their verisimilitude) on the basis of the reporting of findings not inconsistent with the theory (plus publication bias that excludes much negative evidence from entering the scientific literature). The problem is that individual difference theories tend to make such weak predictions (typically, X is associated with good outcomes) that it seems to me highly unlikely that, even within the context of an adequately powered study, counter-theoretical findings will emerge, due to the positive manifold issue (crud factor) Meehl warned against. And as I noted, absent directional predictions, any effect (regardless of its valence) is going to be theory consistent. This is just the situation in studies attempting to establish the molecular-genetic or neuroscientific correlates of psychological phenotypes. Indeed, because the null hypothesis is quasi-always false, the only thing stopping such researchers (and I count myself among them) from concluding that every analysis they conduct is theory consistent (or at the very least publishable under the current standards) is that their designs lack adequate statistical power.
One description of the current situation is that we are constantly running pilot experiments and nothing more. Consider a psychologist trying to decide whether or not to expend (considerable) resources of time and money on a project. A hypothesis test analysis of a pilot study fits perfectly with what the researcher wants to know: is there an effect (should I run a serious study?) or is there not (should I explore something else?).
The problem is that this answer is only a prelude to the real scientific work: measuring the size of the effect so that you can develop a theory. Instead, we either build a theory on very noisy estimates (modeling noise), or move on to run a pilot study on some other topic.
Really interesting summary, Chris.
On the issue of behaving like lawyers and not scientists (and your final point springing from that), I view it a little differently. I don’t think the “dispassionate truth-seekers” ideal of scientists exists anywhere. Scientists everywhere have rivalries and coalitions and factions, big egos, preferred theories, money and prestige riding on the outcome of studies, etc. And I don’t think that any field of human endeavor is going to get rid of that. But in many other areas of science, you have multiple labs working on the same problem; and because they are working on cumulative sets of questions, making the next discovery depends on whether the people who reported the last discovery got it right. That results in a system that catches unreliable results, and stemming from that is a set of incentives to get it right the first time — because you’ll actually get caught, and suffer consequences, if you don’t. In other words, in some ways they act more like lawyers, in that they challenge each other’s evidence and cross-examine each other’s conclusions.
Many of the proposals you summarized will help, if they incentivize us to check each other’s work and if they make psychological-scientific knowledge more cumulative. But I don’t think they’ll take the politics and biases out of science — they’ll just harness them for progress.
Sanjay, I agree with you that scientists are rarely dispassionate truth-seekers. But, in my experience, there are at least two types of passionate scientists: Those who are motivated to answer questions about which they are passionate and those who are passionately motivated to defend specific answers.
If someone were to put a gun to my head and demand that I tell them the Truth about how personality develops, I’d prefer to draw upon the research of the former kind of scientist than the latter. Even if the latter kind of scientist were involved in some fascinating and heated debates with other scientists in which assumptions were being questioned, the strengths and limitations of various methods were being revealed, etc., I wouldn’t necessarily have faith in the research if it was produced in a lab that was motivated to defend a specific point of view.
I guess what you’re describing as between-person variance (“two types of passionate scientists”) I’d say varies quite a bit within persons. Nearly every scientist is both of those things at least some of the time, and it’s in the nature of biases and motivations that we probably underestimate when and how much we are being biased. That’s why I’m more optimistic about institutional solutions that energize the potential good scientists in us and put reasonable checks on the potential bad ones than I am about prescriptions for individual behavior. In that sense I think I’m agreeing with your conclusions about the importance of leadership and incentive structures, but maybe for slightly different reasons.
First off, I’m not a psychologist but just someone who follows the fall-out of the Bem paper.
To me, as an outsider it seems that there is a curious omission in the discussion of methodology.
It’s been suggested that papers should contain at least 2 experiments, one being a replication of the first. Bem’s paper actually presents 9 experiments, 4 of them being replications. That raises doubts on whether this suggestion will achieve much.
In Bem’s case, it is clear that the experiments did not take place as described.
US witnesses swear to tell the whole truth and nothing but the truth. That is a very intuitive idea of honesty.
Going by what John et al. say about QRPs and their prevalence, it seems that social psychologists have a different idea. They condemn telling anything but the truth (i.e., falsifying data) but have no qualms about withholding information to mislead the reader.
This isn’t so much about methodology as about honesty. And that’s the omission. There’s a problem with dishonesty, not just with bad methodology.
Students are trained to report experiments in a dispassionate, impersonal manner that leaves out any personal anecdotes about their troubles with the experiment. Perhaps that trains them to make up an idealized version of reality. To report the ideal experiment run by the perfect experimenter that will forever remain fiction, rather than the messy one that actually happened.
Maybe it’s not so much that students don’t have enough training to realize the problems with data dredging but enough training to realize they shouldn’t mention it?
I don’t think anyone is proposing that direct replication is the solution, per se. After all, an intrepid researcher could run the study over and over until he or she gets two hits and publish them as a direct replication. Nonetheless, current practice is to devalue direct replication and prefer “conceptual” replication. So, most studies in our major journals don’t even show direct replications from the same author–thus the recommendation.
The flaws inherent in even direct replication that you point out are one of the reasons we’ve recommended that replications from other labs be automatically published and connected to the original paper–indexed so the authors get credit for the publication–because the true gold standard is direct replication by another lab. The platinum standard which this would eventually support would be meta-analytic evidence for an effect. This also reflects the natural ecology of science. We see a cool finding and often attempt to replicate and extend the initial study adding our own twist. More often than not, those attempts at replication fail and both the failure to replicate and the new twist end up in the file drawer. By rewarding and valuing direct replication, these failures can see the light of day and be better integrated into the evaluation of any given topic.
I’m a little late, but do any of you have responses to Karl Friston’s defense of small sample sizes in, “Ten ironic rules for non-statistical reviewers” (a 2012 “comments and controversies” paper in NeuroImage)? If you haven’t seen it yet (and PIG-IE might have already talked about it; I missed a few, sorry!), it’s a how-to guide for reviewers who would like to prevent a paper from being published. It’s directed at neuroscience research obviously, but I think his main points are general enough for any area. Here’s an excerpt:
“Rule number four: the under-sampled study
If you are lucky, the authors will have based their inference on less than 16 subjects. All that is now required is a statement along the following lines:
‘Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.’
Notice your clever use of the word ‘unsafe’, which means you are not actually saying the results are invalid. This sort of critique is usually sufficient to discourage an editor from accepting the paper; however – in the unhappy event the authors are allowed to respond – be prepared for something like:
‘Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.’
Unfortunately, the authors are correct (see Appendix 1). On the bright side, the authors did not resort to the usual anecdotes that beguile handling editors. Responses that one is in danger of eliciting include things like:
‘Response: We suspect the reviewer is one of those scientists who would reject our report of a talking dog because our sample size equals one!’
Or, a slightly more considered rebuttal:
‘Response: Clearly, the reviewer has never heard of the fallacy of classical inference. Large sample sizes are not a substitute for good hypothesis testing. Indeed, the probability of rejecting the null hypothesis under trivial treatment effects increases with sample size.’
Thankfully, you have heard of the fallacy of classical inference (see Appendix 1) and will call upon it when needed (see next rule). When faced with the above response, it is often worthwhile trying a slightly different angle of attack; for example:
‘Reviewer: I think the authors misunderstood my point here: The point that a significant result with a small sample size is more compelling than one with a large sample size ignores the increased influence of outliers and lack-of-robustness for small samples.’
Unfortunately, this is not actually the case and the authors may respond with:
‘Response: The reviewers concern now pertains to the robustness of parametric tests with small sample sizes. Happily, we can dismiss this concern because outliers decrease the type I error of parametric tests (Zimmerman, 1994). This means our significant result is even less likely to be a false positive in the presence of outliers. The intuitive reason for this is that an outlier increases sample error variance more than the sample mean; thereby reducing the t or F statistic (on average).’
At this point, it is probably best to proceed to rule six.
Rule number five: the over-sampled study…”
Hopefully that quote isn’t so long that NeuroImage decides to press charges. Any thoughts?
Hi Molly,
There is no “defense” of using small samples–see Fraley’s point #2 above. What is so maddening about Friston’s “defense” is that it relies on the continued belief that a boldly ignorant use of NHST is sufficient for our scientific efforts. It is not. And Friston’s logic is used by many, many people to justify continued methodological stupidity. They say things like “I’m not interested in an effect or effect size. I’m interested in testing a theory” or some similar nut-job statement like that. In reality, in every study we run, there is an effect. That effect has a magnitude that ranges from nil to huge somewhere out there in the population. And, given our typical study design, we have some probability of detecting that effect size. When we use small samples, the only effect sizes we can discover are huge (e.g., as in the typical fMRI study).
Now, a sensible defense of small sample sizes would be based on this logic. Small sample sizes are only defensible if you are trying to find huge effects. Don’t get me wrong, entire branches of psychology used to go this route–behaviorists often eschewed NHST because they wanted to create effects that were obvious with one or two pigeons–you could see the effect of the change in reinforcement schedule with a simple inter-ocular examination–it was obvious.
In contrast, I see today’s researchers using Friston’s type of logic to ignore the natural consequences of small sample sizes. They want to believe that they are not subject to effect sizes and power and want to rely solely on NHST. This leads inevitably to what Francis is showing across the board in our field–people will surf correlation matrices, churn multiple studies, and p-hack as much as they can in order to find “statistically significant effects” because they still believe that statistical significance is some divining rod of truth. This is so ingrained that some researchers knowingly avoid doing properly powered studies in favor of p-hacking their way to oblivion–to paraphrase a colleague, “have you seen the sample size requirements for interaction effects?!? That’s out of the question.” The irony is that instead of running a well-powered study (N of 300 anyone?), the typical researcher will churn through 6 samples of 50 undergrads, mTurkers, or any other sample in order to find an unreplicable interaction effect. That’s more than aggressively ignorant. Moreover, the rationale for the continued use of small samples is typically based on “rules of thumb”–i.e., “my advisor used that many, why should I do anything different?” Because you are smarter than your advisor. Or, at least, you should be.
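To put rough numbers on the claim that small samples can only discover huge effects, here is a sketch using the standard normal-approximation power formula for a two-group comparison (the per-group sample sizes are illustrative anchors):

```python
from scipy.stats import norm

def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized mean difference (Cohen's d) a two-group design
    can detect with the stated power, using the normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) * (2 / n_per_group) ** 0.5

for n in (16, 25, 50, 150):
    print(f"n = {n:>3} per group -> smallest reliably detectable d ≈ {minimum_detectable_d(n):.2f}")
# n = 16 -> d ≈ 0.99; n = 25 -> d ≈ 0.79; n = 50 -> d ≈ 0.56; n = 150 -> d ≈ 0.32
```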
Finally, Friston misses the easiest, simplest way of justifying the use of a small sample. If it is the case that your effect is truly huge, then replicating the damn study–directly mind you–should be a good way of showing the effect is reliable without using contorted logic about robustness, outliers, or the compelling nature of your finding. Given the nature of the way we go about our science, effects from small studies that are not informed by effect size measurement should be distrusted and therefore replicated. Directly. This is another way of saying that the publication policy at SPPS is frightening.
Sigh. Thanks for letting me vent.
Brent
Thanks for bringing our attention to the Friston NeuroImage paper. Having not read it yet (I plan to), I can only address the fragment above.
Two thoughts:
1. With some caveats expressed below, I agree with Friston that necessarily large, statistically significant effects based on small N studies can prove replicable, under the assumption that these significant results were not generated by practices that increase Type I error rates. However, the main thrust of the debate around small N studies is that common practices dramatically increase the Type I error rate. So what is parenthetical to Friston (“properly controlled for false positives”) is what is in dispute in the field at large. Many such practices involve some sort of multiple testing, and one can look to literatures with small Ns and multiple testing opportunities (e.g., molecular genetic association studies), where the result is quite consistent–a pressing problem with the failure to replicate findings.
2. As Friston correctly notes, large sample studies will produce many more significant results than small sample ones (indeed, the larger the sample, the smaller the effect that will be detected as significant). But he takes the wrong lesson from this, in part because he imagines that studies have the same “result” when they have similar p-values. Studies have the same result when they generate effect sizes that are comparable in magnitude, period. It is true that the field rewards investigators on the basis of the p-values generated by their studies (i.e., effects below .05 are publishable), but one should not confuse that reward system with some sort of scientifically defensible standard.
All of this said, I think it needs to be clearly stated that the primary advantage of larger samples is that the investigator has increasing confidence in his/her estimate of the magnitude of the focal effect/association/group difference with increasing N. The problem with small sample research is that, even if the false positive rate (i.e., Type I error) has been properly controlled, there is nonetheless a very large error band around the estimate of the focal effect. Friston does not dispute this, and I can only assume this is because he takes the attitude (as some experimentalists do) that precise estimates of effect size don’t really matter (except, oddly, to compare the necessarily large effects that emerge as significant in small N studies with the magnitude of statistically significant results generated in large N studies). That is, the goal is to demonstrate that X and Y are correlated or that groups A and B differ on attribute C, which is taken to be the case when a statistically significant p-value emerges from one’s analysis (i.e., incorrectly equating a significant p-value with hypothesis testing).
Unfortunately, X and Y are almost always correlated to some extent (even if the magnitude of the effect is trivial; i.e., r or d = .01). The fact that the null hypothesis is quasi-always false (the true effect is unlikely to be exactly 0) means that even the experimentalist needs to be clear up front about how large an association she will consider consistent with her theory and design her study accordingly (i.e., with sufficient power to detect the lower bound of that range 80 or 90 percent of the time). A widespread alternative is that researchers design small-scale studies or experiments where the lower bound for what effect can be expected to be statistically significant is implicit in the study design–that is, the smallest effect that could be identified as statistically significant has been completely uninfluenced by any sort of effect size standard imposed by the theory being tested. In such a situation, one is essentially engaging in the magical thinking that effects that are significant are theory consistent and those that are not significant are not theory consistent. The scientific alternative is: (a) to clearly describe the range of effects consistent with the hypothesis being tested, (b) to design one’s study with adequate power to detect that range of effects, and (c) to only conclude in favor of one’s theory if (i) the result cannot be attributed to sampling error (p < .05, with Type I error protections) and (ii) the estimated magnitude of the effect is within the range specified as theory-consistent a priori. The standard practice is to ignore (a), design one’s study in light of rules of thumb or pragmatic considerations, and then interpret results with p < .05 (oftentimes even if they can be dismissed as false discovery/Type I error due to multiple testing) as consistent with one’s hypothesis. Beating chance (especially with a little ‘help’ that boosts Type I error rates) is not the same as testing one’s hypothesis/theory.
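As a closing sketch of step (b)–powering a study for the lower bound of the theory-consistent range–here is a rough calculation for correlations based on the Fisher z approximation (the candidate lower bounds are illustrative):

```python
import math
from scipy.stats import norm

def n_for_correlation(r_lower_bound, alpha=0.05, power=0.80):
    """Approximate total N needed to detect, with the stated power, a
    correlation at the smallest magnitude the theory treats as meaningful
    (Fisher z approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    z_r = math.atanh(r_lower_bound)
    return math.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

for r in (0.10, 0.20, 0.30):
    print(f"theory-consistent lower bound r = {r:.2f} -> N ≈ {n_for_correlation(r)}")
# r = .10 -> N ≈ 783; r = .20 -> N ≈ 194; r = .30 -> N ≈ 85
```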