By Brent W. Roberts
Of late, psychological science has arguably done more to address the ongoing believability crisis than most other areas of science. Many notable efforts have been put forward to improve our methods. From the Open Science Framework (OSF), to changes in journal reporting practices, to new statistics, psychologists are doing more than any other science to rectify practices that allow far too many unbelievable findings to populate our journal pages.
The efforts in psychology to improve the believability of our science can be boiled down to some relatively simple changes. We need to replace/supplement the typical reporting practices and statistical approaches by:
- Providing more information with each paper so others can double-check our work, such as the study materials, hypotheses, data, and syntax (through the OSF or journal reporting practices).
- Designing our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes).
- Providing more information about effect sizes in each report, such as what the effect sizes are for each analysis and their respective confidence intervals.
- Valuing direct replication.
It seems pretty simple. Actually, the proposed changes are simple, even mundane.
What has been most surprising is the consistent pushback and protest against these seemingly innocuous recommendations. When confronted with these recommendations, it seems many psychological researchers balk. Despite calls for transparency, most researchers avoid platforms like the OSF. A striking number of individuals argue against, and are quite disdainful of, reporting effect sizes. Direct replications are disparaged. In response to the various recommendations outlined above, the prototypical protests are:
- Effect sizes are unimportant because we are “testing theory” and effect sizes are only for “applied research.”
- Reporting effect sizes is nonsensical because our research is on constructs and ideas that have no natural metric, so documenting effect sizes is meaningless.
- Having highly powered studies is cheating because it allows you to lay claim to effects that are so small as to be uninteresting.
- Direct replications are uninteresting and uninformative.
- Conceptual replications are to be preferred because we are testing theories, not confirming techniques.
While these protestations seem reasonable, the passion with which they are provided is disproportionate to the changes being recommended. After all, if you’ve run a t-test, it is little trouble to estimate an effect size too. Furthermore, running a direct replication is hardly a serious burden, especially when the typical study only examines 50 to 60 odd subjects in a 2×2 design. Writing entire treatises arguing against direct replication when direct replication is so easy to do falls into the category of “the lady doth protest too much, methinks.” Maybe it is a reflection of my repressed Freudian predilections, but it is hard not to take a Depth Psychology stance on these protests. If smart people balk at seemingly benign changes, then there must be something psychologically big lurking behind those protests. What might that big thing be? I believe the reason for the passion behind the protests lies in the fact that, though mundane, the changes that are being recommended to improve the believability of psychological science undermine the incentive structure on which the field is built.
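To make concrete just how small this ask is, here is a minimal sketch (my illustration, not part of the original post) of computing Cohen's d from the same two-group data a t-test already uses. It assumes Python with NumPy and SciPy, and the simulated 60-per-cell groups are purely hypothetical.

```python
# Minimal sketch: the effect size is a one-line follow-up to a t-test you
# have already run. The data below are simulated, hypothetical stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(loc=0.0, scale=1.0, size=60)    # ~60 subjects per cell,
treatment = rng.normal(loc=0.4, scale=1.0, size=60)  # as in the "typical study"

t, p = stats.ttest_ind(treatment, control)

# Cohen's d from the same summary statistics the t-test already uses
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
d = (treatment.mean() - control.mean()) / pooled_sd

print(f"t = {t:.2f}, p = {p:.3f}, Cohen's d = {d:.2f}")
```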
I think this confrontation needs to be examined more closely because we need to consider the challenges and consequences of deconstructing our incentive system and status structure. This, then, raises the question: what is our incentive system, and just what are we proposing to do to it? For this, I believe a good analogy is the dilemma Harry Potter faces in the final book of his eponymous series.
The Deathly Hallows of Psychological Science
In the last book of the Harry Potter series, “The Deathly Hallows,” Harry Potter faces a dilemma: should he pursue the destruction of the Horcruxes or gather together the Deathly Hallows? The Horcruxes are pieces of Voldemort’s soul encapsulated in small trinkets, jewelry, and the like. If they were destroyed, then it would be possible to destroy Voldemort. The Deathly Hallows are three powerful magical objects, alluring because whoever possesses all three becomes the “master of death.” The Deathly Hallows are the Cloak of Invisibility, the Elder Wand, and the Resurrection Stone. The dilemma Harry faced was whether to pursue and destroy the Horcruxes, a painful and difficult path, or to pursue the Deathly Hallows, with which he could quite possibly conquer Voldemort or, failing that, live on despite him. He chose to destroy the Horcruxes.
Like Harry Potter, the field of psychological science (and many other sciences) faces a similar dilemma. It can pursue changes in our approach to science that eliminate the problematic practices that lead to unreliable findings—a “destroy the Horcruxes” approach. Or it can continue down the path of least resistance, which is nicely captured in the pursuit of the Deathly Hallows.
What are the Deathly Hallows of psychological science? I would argue that the Deathly Hallows of psychological science, which I will enumerate below, are 1) p values less than .05, 2) experimental studies, and 3) counter-intuitive findings.
Why am I highlighting this dilemma at this time? I believe we are at a critical juncture. The nascent efforts at reform may either succeed or fade away as they have so many times before. For it is a fact that we’ve confronted this dilemma many times before and have failed to overcome the allure of the Deathly Hallows of psychological science. Eminent methodologists such as Cohen, Meehl, Lykken, Gigerenzer, Schmidt, Fraley, and lately Cumming, have been telling us how to do things better since the 1960s, to no avail. Revising our approach to science has never been a question of knowing the right thing to do, but rather of whether we were willing to do the thing we knew was right.
The Deathly Hallows of Psychological Science: p-values, experiments, and counter-intuitive/surprising findings
The cloak of invisibility: p<.05. The first Deathly Hallow of psychological science is the infamous p-value. You must attain a p-value less than .05 to be a success in psychological science. Period. If your p-value is greater than .05, you have no finding and nothing to say. Without anything to say, you cannot attain status in our field. Find a p-value below .05 and you can wrap it around yourself and hide from the contempt aimed at those who fail to cross that magical threshold.
Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find it. We root around in our data, digging up p-values either by cherry picking studies, selectively reporting outcomes, or through some arcane statistical wizardry. One only has to read Bem’s classic article on how to write an article in psychological science to see how we approach p-values as a field:
“…the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indices. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, drop them (temporarily). Go on a fishing expedition for something–anything–interesting.”
What makes it worse is that when authors try to report null effects, they are beaten down, because as reviewers and editors we do everything in our power to hide those effects. Null effects make for a messy narrative. Our most prestigious journals almost never publish null effects because reviewers and editors act as gatekeepers and mistakenly recommend against publishing them. Consider the following personal example. In one study, reviewer 2 argued that our paper was not fit for publication in JPSP because one of our effects was null (there were other reasons too). Consider the fact that the null effect in question was a test of a hypothesis drawn from my own theory. I was trying to show that my theory did not work all of the time, and the reviewer was criticizing me for showing that my own ideas might need revision. This captures quite nicely the tyranny of the p-value. The reviewer was so wedded to my ideas that he or she wouldn’t even let me, the author of said ideas, offer up some data that would argue for revising them.
Short of rejecting papers with null effects outright, we often recommend cutting them. I have seen countless recommendations in reviews of my papers and the papers of colleagues to simply drop studies or results that show null effects. It is not surprising, then, that psychology confirms 95% of its hypotheses.
Even worse, we often commit the fundamental attribution error by thinking that the person trying to publish null effects is an incompetent researcher—especially if they fail to replicate an already published effect that has crossed the magical p < .05 threshold. Not to be too cynical, but the reviewers may have a point. If you are too naïve to understand “the game”, which is to produce something with p < .05, then maybe you shouldn’t succeed in our field. Setting sarcasm aside, what the gatekeepers don’t understand is that they are sending a clear message to graduate students and assistant professors that they must compromise their own integrity in order to succeed in our field. Of course, this winnows from the field those researchers who don’t want to play the game.
The Elder Wand: Running Experiments
Everyone wants to draw a causal conclusion, even observational scientists. And, of course, the best way to draw a causal conclusion, if you are not an economist, is to run an experiment. The second Deathly Hallow for psychological science is doing experimental research at all costs. As one of my past colleagues told a first year graduate student, “if you have a choice between a correlational or an experimental study, run an experiment.”
Where things go awry, I suspect, is when you value experiments so much that you do anything in your power to avoid any other method. This leads to levels of artificiality that can get perverse. Rather than studying the effect of racism, we study the idea of racism. Where we go wrong is that, as Cialdini has noted before, we seldom work back and forth between the fake world of our labs and the real world where the phenomenon of interest exists. We become methodologists rather than scientists. We prioritize lab-based experimental methods because they are most valued by our guild, not because they necessarily help us illuminate or understand our phenomena but because they putatively lead to causal inferences. One consequence of valuing experiments so highly is that we get caught up in a world of hypothetical findings that have unknown relationships to the real world, because we seldom if ever touch base with applied or field research. As Cialdini so poignantly pointed out, we simply don’t value field research enough to pursue it with equal vigor to lab experiments.
And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good. To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.
Of course, now that we know many researchers run as many experiments as they need to in order to construct what findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand wielding wizard of an experimentalist simply because that is the best route to fame and fortune in our guild.
Most of my work has been in the observational domain, and admittedly, we have the same motive, but lack the opportunity to implement our desires for causal inference. So far, the IRB has not agreed to let us randomly assign participants to the “divorced or not divorced” or the “employed, unemployed” conditions. In the absence of being able to run a good, clean experiment, observational researchers like myself bulk up on the statistics as a proxy for running an experiment. The fancier, more complex, and more indecipherable the statistics, the closer one gets to the status of an experimenter. We even go so far as to mistake our statistical methods, such as cross-lagged panel longitudinal designs, for ones that would afford us the opportunity to make causal inferences (hint: they don’t). Reviewers are often so befuddled by our fancy statistics that they fail to notice the inappropriateness of that inferential leap.
I’ve always held my colleague Ed Diener in high esteem. One reason I think he is great is that, as a rule, he works back and forth between experiments and observational studies, all in the service of creating greater understanding of well-being. He prioritizes his construct over his method. I have to assume that this is a much better value system than our long-standing obsession with lab experiments.
The Resurrection Stone: Counter-intuitive findings
The final Deathly Hallow of psychological science is to be the creative destroyer of widely held assumptions. In fact, the foundational writings about the field of social psychology lay it out quite clearly. One of the primary routes to success in social psychology, for example, is to be surprising. The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which humans are irrational, unpredictable, or downright surprising (Ross, Lepper, & Ward, 2010).
It is hard to argue with this motive. We hold those scientists who bring unique discoveries to their field in the highest esteem. And, every once in a while, someone actually does do something truly innovative. In the meantime, the rest of us make up little theories about trivial effects that we market with cute names, such as the “End of History Effect” or the “Macbeth Effect” or whatever. We get caught up in the pursuit of cutesy counter-intuitiveness, all in the hope that our little innovation will become a big innovation. To the extent that our cleverness survives the test of time, we will, like the resurrection stone, live on in our timeless ideas even if they are incorrect.
What makes the pursuit of innovation so formidable an obstacle to reform is that it sometimes works. Every once in a while someone does revolutionize a field. The aspiration to be the wizard of one’s research world is not misplaced. Thus, we have an incentive system that produces a variable-ratio schedule of reinforcement—one of the toughest to break, according to those long-forgotten behaviorists (we need not mention behavioral psychologists, since their ideas are no longer new, innovative, or interesting–even if they were right).
Reasons for Pessimism
The problem with the current push for methodological reform is that, like pursuing the Horcruxes, it is hard and unrewarding in comparison to using the Deathly Hallows of psychological science. As one of our esteemed colleagues has noted, no one will win the APA Distinguished Scientific Award by failing to replicate another researcher’s work. Will a person who simply conducts replications of other researchers’ work get tenure? It is hard to imagine. Will researchers do well to replicate their own research? Why would they? It will simply slow them down and handicap their ability to compete with the other aspiring wizards who are producing conceptually replicated, small-N, lab-based experimental studies at a frightening rate. No, it is still best to produce new ideas, even if it comes at the cost of believability. And everyone is in on the deal. We all disparage null findings in reviews because we prefer errors of commission to errors of omission.
Another reason why the current system may be difficult to fix is that it provides a weird p-value-driven utopia. With the infinite flexibility of the Deathly Hallows of psychological science, we can pretty much prove that any idea is a good one. When combined with our antipathy toward directly replicating our own work or the work of others, everyone can be a winner in the current system. All it takes is a clever idea applied to enough analyses, and every researcher can be the new hot wizard. Without any push to replicate, everyone can co-exist in his or her own happy p-value-driven world.
So, there you have it: my Depth Psychology analysis of why I fear that the seemingly benign recommendations for methodological change are falling on deaf ears. The proposed changes contradict the entire status structure that has served our field for decades. I have to imagine that the ease with which the Deathly Hallows can be used is one reason why reform efforts have failed in the past. As many have indicated, the same recommendations to revise our methods have been made for over 50 years. Each time, the effort has failed.
In sum, while there have been many proposed solutions to our problems, I believe we have not yet faced our real issue: how are we going to restructure our incentives? Many of us have stated, as loudly and persistently as we can, that there are Horcruxes all around us that need to be destroyed. The move to improve our methods and to conduct direct replications can be seen as an effort to eliminate our believability Horcruxes. But I believe the success of that effort rides on how clearly we see the task ahead of us. Our task is to convince a skeptical majority of scientists to dismantle an incentive structure that has worked for them for many decades. This will be a formidable task.
Now you’re just trolling. 🙂 But, ok, I’ll bite.
Your points 2 and 3 are value judgments that have nothing to do with good scientific practice, per se. One may conduct poor, haphazard correlational or experimental research. Whether or not one is particularly concerned about extra-lab outcomes is orthogonal to this issue. Same goes for your preference for testing intuitive vs. counter-intuitive findings. Your values have crept into what had been, more or less, an ongoing straight methodological rant.
You should make UIUC the center for people conducting direct replications of intuitive, correlational research.
Finally, why is it that the people who seem most incensed by NHST and .05 are the same people who are most incensed by p-hacking?
Hi Jeff! Long time no speaky. If you want trolling, I’ll send you the first draft. It was indelicate.
I don’t disagree with you at all about the value judgments–this is not a methods rant. It is an attempt to point out our value system, not to justify or critique it so much as to make clear what it is. You can’t fix something without knowing what it is you are trying to fix. And I’ll be the first to admit that I was wrong to assume that a purely rational argument would work in the face of a powerful incentive structure.
UIUC, well my lab, is the place for conducting direct replications of intuitive correlational research. We usually think meta-analysis first, collect new data second, funky counter-intuitive third. We usually don’t get to three.
I like your question about NHST and p-hacking. In lieu of an answer, I’ll reflect the question. Why do so many people believe that a p-value of .05 negates all of the p-hacking they do?
But the value system also is completely orthogonal to whether the research is correlational/experimental or applied/theoretical or intuitive/counter-intuitive. In all 8 of those cells, one can do good research or bad research that should be recognized accordingly.
I have to say that your vision of science as scary and dangerous if it tries to tell us something we don’t already know is deeply depressing.
I don’t think anyone believes that .05 negates p-hacking. More likely, they believe that .05 is silly and arbitrary (as do you and most people reading this blog), so what’s the harm in tinkering to bring .07 down to .05? Surely, that is where the vast majority of p-hacking can be found. Seriously doubt that anyone is p-hacking .60 to .05, even if it is technically possible. It is a silly game that is mostly inconsequential, unless you believe that there really is something magical about .05 and Truth.
Also worth noting that protests 1 and 2 in your list up top are incompatible with protest 3. So, the people arguing that effect sizes are unimportant are not the people arguing against highly powered studies. To be honest, I can’t say that I’ve come across anyone arguing that more power is Bad. Moreover, believing that effect sizes can be arbitrary in theoretical research is not the same thing as arguing against reporting those effect sizes.
It’s disappointing that you don’t attempt to honestly address those positions (though you may have already done so elsewhere). Here, at least, you are satisfied to conclude that the people who disagree with you on those matters (and direct vs. conceptual replication) must have Bad Motives. Talk about doth protesting too much! The invocation of Depth Psychology is ironic, given that the number of agitated and self-righteous proclamations from the self-appointed sheriffs of the science world far outnumber the strenuous protests you decry.
Ok, maybe that was a little agitated.
I’ll chime in at the risk of getting my head chopped off mostly because I think many of BWR’s points are valid. Perhaps I am biased because we share the same name.
1. One point that BWR did not raise is the paper by Mitchell (2012) in Perspectives on Psychological Science about how well findings from the lab generalize to the field. Lab results in I/O tended to translate well to the field. Lab results in Social did less well. What was somewhat worrisome was that 21 of 80 effects in Social changed *sign* from the lab to the field (see p. 112 of the report). I am surprised that more people have not talked about this piece in the recent discussions about methodological reform. I think it provides some empirical evidence for concerns about generalizability.
2. The problem with p-hacking can be expressed in terms of over-fitting a model to a particular dataset and then ignoring the possibility that the model won’t cross-validate. In other words, many p-hacked findings do not replicate (see e.g., the ESP stuff). Researchers should feel free to “p-hack” if they acknowledge that they are in exploratory mode and then attempt to replicate their findings on a fresh dataset in a strictly confirmatory mode. The problem is confusing exploratory and confirmatory research.
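To illustrate the exploratory-versus-confirmatory point above, here is a minimal simulation sketch (an editorial illustration, not part of the comment), assuming Python with NumPy and SciPy. It "p-hacks" pure-noise data by keeping the best of several arbitrary analysis variants and then checks whether that "finding" survives a single pre-specified test on fresh data.

```python
# Minimal sketch: p-hacking as overfitting. Five arbitrary analysis variants
# are tried on null data; the best p-value is kept; the "finding" is then
# subjected to one strictly confirmatory test on a fresh dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_cell, n_looks = 2000, 30, 5

hacked_hits, confirmed_hits = 0, 0
for _ in range(n_sims):
    # Exploratory phase: no true effect anywhere, but take the best of 5 looks
    best_p = min(
        stats.ttest_ind(rng.normal(size=n_per_cell),
                        rng.normal(size=n_per_cell)).pvalue
        for _ in range(n_looks)
    )
    if best_p < .05:
        hacked_hits += 1
        # Confirmatory phase: one pre-specified test on fresh data
        rep_p = stats.ttest_ind(rng.normal(size=n_per_cell),
                                rng.normal(size=n_per_cell)).pvalue
        confirmed_hits += rep_p < .05

print(f"exploratory 'significant' rate: {hacked_hits / n_sims:.2f}")                      # ~.23
print(f"of those, replicated on fresh data: {confirmed_hits / max(hacked_hits, 1):.2f}")  # ~.05
```

With five looks at pure noise, roughly a quarter of exploratory datasets yield p < .05, yet only about 5% of those "findings" survive the confirmatory test, which is the overfitting problem in a nutshell.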
I appreciate Brent’s thesis, which I take to be that we need to build a reward structure in psychological science that rewards Truth-telling/discovery, rather than work that uses various proxies orthogonal or weakly correlated with same.
That being said, a few observations.
On point 2, I have sympathy for Jeff Sherman’s position. At the core of Brent’s argument is the helpful observation that we should not give up external validity for good internal validity (giving up the former takes the form of not addressing the substantive research questions we have; the latter takes the form of good causal inference). But I think Brent may be, ahem, overgeneralizing here a bit to consider this a Deathly Hallow of all of Psychological Science, writ large (good causal inference at all costs). In social development (my research domain), for example, non-experimentalists hold sway (at least for the past few decades). In our case, we probably overdo external validity at the cost of internal validity. In short, for me, the goal should be maximizing all aspects of good design (including maximizing causal inference) within the context of a zero-sum game (i.e., pragmatic constraints).
On point 1, I wanted to respond to Jeff’s comment that “To be honest, I can’t say that I’ve come across anyone arguing that more power is Bad.” I had not either until recently. I think the most common argument against power (really, larger-N studies within a domain) is extra-scientific and pragmatic, i.e., “It costs too much money to run that many people through the scanner.” More recently, however, I’ve noticed a (to me) disturbing trend among scholars in small-N domains referring to “over-powered” studies. For example, an fMRI person observes that when N is large, “everything lights up.” There are two issues here. One is that such a person is wrongly using p-values for theory testing instead of merely NHST. The second is that the use of a term like “over-powered” suggests that there is not merely a pragmatic argument at play here, but a methods critique of large-N work. Now, to be clear, I do think on pragmatic grounds one could argue that a researcher can reach an N threshold where he has enough precision and more N buys nothing of substance–e.g., the effect is really .43 versus .44. However, I think the claim for “over-powered” studies actually has its roots in the idea that, if an effect is meaningful (i.e., Big), it should be easily detected using a small N. The problem is that this logic is simply wrong. Small-N studies always lie about small and often moderate effects (treating them as 0) and only sometimes tell the truth about big effects (due to p-hacking and the like). Big-N studies always tell the truth about effect sizes, large or small, at least for the population being sampled and the specific measures being used.
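To put rough numbers on that last claim, here is a minimal simulation sketch (my illustration, not part of the comment above), assuming Python with NumPy and SciPy and a modest true effect of d = 0.3: it compares how often a small-N and a large-N two-group study would detect that effect at p < .05.

```python
# Minimal sketch: power of a two-group design to detect a true effect of
# d = 0.3 (an assumed, illustrative value) at small versus large N.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n_sims = 0.3, 2000

def hit_rate(n_per_cell):
    """Proportion of simulated studies reaching p < .05 (i.e., empirical power)."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(loc=true_d, size=n_per_cell)  # treatment-like group
        b = rng.normal(loc=0.0, size=n_per_cell)     # control-like group
        hits += stats.ttest_ind(a, b).pvalue < .05
    return hits / n_sims

print(f"N = 30 per cell:  power ~ {hit_rate(30):.2f}")   # roughly .20
print(f"N = 250 per cell: power ~ {hit_rate(250):.2f}")  # roughly .92
```

Under these assumptions the small-N study misses the real effect about four times out of five, while the large-N study detects it more than 90% of the time.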
Finally, I wanted to echo Brent D.’s observations about p-hacking. Jeff’s example understandably assumes a single focal hypothesis that is close to .05 getting hacked down below the threshold (using covariates, a little more N, etc.). However, I doubt that is how a literature amasses p-hacked findings. I think much more of it is due to “telling the story of one’s data,” which amounts to overfitting across a series of potential focal hypotheses and censoring (from the literature) what does not work.
This is a bit tangential to the other comments, but the jury is still out on whether experiments are the only robust way to prove causation. Both Antonakis (in his article “On making causal claims”) and Pearl (see, for instance, Bollen and Pearl’s “Eight Myths about causality and structural equation models”) argue otherwise. Here’s the relevant section of the Bollen and Pearl paper:
Consider first the idea that “no causation without manipulation” is necessary for analyzing causation. In the extreme case of viewing manipulation as something done by humans only, we would reach absurd conclusions such as there was no causation before humans evolved on earth. Or we would conclude that the “moon does not cause the tides, tornadoes and hurricanes do not cause destruction to property, and so on” (Bollen 1989:41). Numerous researchers have questioned whether such a restrictive view of causality is necessary. For instance, Glymour (1986), a philosopher, commenting on Holland’s (1986) paper finds this an unnecessary restriction. Goldthorpe (2001:15) states: “The more fundamental difficulty is that, under the – highly anthropocentric – principle of ‘no causation without manipulation’, the recognition that can be given to the action of individuals as having causal force is in fact peculiarly limited.”
Bhrolchain and Dyson (2007:3) critique this view from a demographic perspective:
“Hence, in the main, the factors of leading interest to demographers cannot be shown to be causes through experimentation or intervention. To claim that this means they cannot be causes, however, is to imply that most social and demographic phenomena do not have causes—an indefensible position. Manipulability as an exclusive criterion is defective in the natural sciences also.” Economists Angrist & Pischke (2009:113) also cast doubt on this restrictive definition of cause.
A softer view of the “no causation without manipulation” motto is that actual physical manipulation is not required. Rather, it requires that we be able to imagine such manipulation. In sociology, Morgan and Winship (2007:279) represent this view: “What matters is not the ability for humans to manipulate the cause through some form of actual physical intervention but rather that we be able, as observational analysts, to conceive of the conditions that would follow from a hypothetical (but perhaps physically impossible) intervention.” A difficulty with this position is that the possibility of causation then depends on the imagination of researchers who might well differ in their ability to envision manipulation of putative causes.
Pearl (2011) further shows that this restriction has led to harmful consequences by forcing investigators to compromise their research questions only to avoid the manipulability restriction. The essential ingredient of causation, as argued in Pearl (2009:361), is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about.
What is the Pearl (2011) reference?
I have been trying to convince Brent that there is cause for optimism out there. This exchange, like the inexorable gravitational pull of a giant planet, is bringing me back to his dark pessimistic view.
I cannot believe that, at this time, there is still ANY resistance to Brent’s three main ideas here:
1. P<.05 sucks, except possibly as a very preliminary gatekeeper for exploratory data that basically means, "Well, maybe I have something here, I need to do SOMETHING MORE to see if there really is a there there." That p<.05 has been used, intentionally or not, to cloak bad methods and analyses is controversial? Really??
Here are two great articles on the general issue:
http://www.nature.com/news/scientific-method-statistical-errors-1.14700
P<.05 was never intended to be a dichotomous believe/not believe the result statistic.
http://retractionwatch.com/2013/11/12/just-significant-results-have-been-around-for-decades-in-psychology-but-have-gotten-worse-study/
(shoot me now).
2. The difference between an excellent experiment and an excellent naturalistic study, with respect to their ability to draw causal inferences, is very small. My judgment is that social psychology has a scientific arrogance problem, and part of that problem comes from the deeply entrenched and widespread belief in the supremacy of experiments.
Chris M's comment on causal inference from nonexperimental phenomena nails it.
That arrogance extends to the widespread willingness TO generalize on the basis of extremely technical and arcane experiments. You will rarely see social psychological conclusions that look like this:
"In my 10 x 15 ft lab, lit with 8 4 foot flourescent bulbs, in three weeks in March, 2014, where the outdoor temps ranged from 40-60, and the indoor temps from 65-75, when run by college seniors, experimental method M implemented exactly as A/B/C, will produce a change in
responses to DVs 1, 2, and 3 when assessed in exactly as follows (long description of specifics of the dvs), in exactly this order, when surrounded by exactly these other measures."
Instead, what you get are conclusions such as:
80% of Americans are implicit racists.
Self-fulfilling prophecies are a powerful and pervasive source of social problems.
Stereotypes are the default basis of person perception.
Conservatives are more rigid than liberals.
Most of life is driven by automatic nonconscious mental processes.
(the latter is quoted from Bargh & Chartrand's "Unbearable" paper; the others are paraphrases that you can find in almost any review of the IAT; almost any review of self-fulfilling prophecies; Fiske & Neuberg, 1991; and almost any review of the psychology of libs&cons, respectively).
So, I am not making crap up here. This is how we conduct business, and, as Brent points out, what we provide rewards for doing.
3. Jeff is right that preference for intuitive versus counterintuitive findings is a subjective value judgment. However, Brent's argument nails the problem: the pervasive preference for counterintuitive findings skews the field's conventional wisdom, in two ways. When we engage in a highly skewed investigation of phenomena (e.g., when most of us go after counterintuitive effects), we get a literature that does not capture a wide range of human nature or experience; it captures a very narrow range. Further, many people seem to conflate "I love this counterintuitive finding" with "This is a powerful and socially important counterintuitive finding."
I cannot believe any of this is controversial. Or, rather, I do believe it is controversial, and that is what is exerting its dark pull on my view of social psychology's future.
Lee
Lee and many others seem to be confused about the primary goal of much of the research reported in our journals. The goal is not to generalize to any particular group or context. The goal is to provide a test of a theoretically-driven hypothesis within the context of our labs. So, the logic is: If this hypothesis is true, then the following result should obtain in the context of our labs (limited though it may be). The results do not reflect the generalizability or applicability of the finding. Rather, the results either provide or fail to provide support for a theory that makes predictions about what ought to happen in the context of our labs. Direct applicability to the “outside world” (whatever that means) is a value, not a criterion. Consider the artificiality of atom smashers, genetically identical lab rats, or Harlow’s work (as but a few examples). This work does not and was never meant to “generalize.” This work was/is meant to test specific hypotheses derived from theories. Rather than belabor the point, I’ll just issue a standard plea to go read (carefully) Mook (1983), who described these issues far more eloquently than I ever will.
Intentional or not, I also would say that Lee’s response appears to be a plea for conceptual versus direct replication.
And, the description of your ideal, the hypothetico-deductive model, is in no way contrary to any of the recommended changes to our methods that have been proposed. If you want to test a hypothesis what harm can come from testing it, the same way, twice? If you want to refute the null, why object to cataloguing the degree to which you have refuted it (e.g., effect sizes)? If you really have discovered a replicable effect, why not confess to all of the failed attempts you made to get there? The latter would be extremely helpful to people conducting follow up research on your topic. None of these changes undermines the hypothetico-deductive model.
What the proposed changes to our methods do is protect you from the inference that your research can be disregarded because it is unbelievable. We now know that the multi-study, small N, NHST package can be willfully manipulated by less idealistic researchers than yourself. Unfortunately, this creates a reputational problem for everyone wielding the Deathly Hallows. If you use the same methods to evaluate research as Bem, then whatever you study may be as real as ESP.
This is a response to Brent’s post.
I said nothing about being opposed to replication, reporting effect sizes, openly reporting failed attempts, larger Ns, or multi-study packages. And none of those things was the focus of your Hallows post.
What I object to is your assertion that valuing those things apparently now also requires that our research must be correlational and focus on intuitive results. It is a non-sequitur.
I am definitely confused. These claims are about what happens in labs?
80% of Americans are implicit racists.
Self-fulfilling prophecies are a powerful and pervasive source of social problems.
Stereotypes are the default basis of person perception.
Conservatives are more rigid than liberals.
Most of life is driven by automatic nonconscious mental processes.
If so, we have an epidemic of horrendous writing on our hands. I doubt it though. I think our colleagues usually write exactly what they mean and mean exactly what they write.
However, perhaps we can meet constructively part way. IF AND WHEN experimental researchers clearly restrict their claims to what happened in their labs, it would go quite far to improving social psychology.
Jeff, nowhere in the blog did I propose that we should value correlational and intuitive results over experiments. I identified what we value, which is experimental lab studies with small Ns, no effect sizes, poor power, and counter-intuitive findings. I stand by that description and I’m befuddled that you would disagree. It is an accurate description of the most cited research in Psych Science, JPSP, PSPB, JESP, and SPPS. It is the research that almost all award winning psychologists from our guild do, with pride I might add.
In fact, I’m fine with our field continuing to value experiments over correlational designs. What I’m saying is that our valuing of this package in particular is getting in the way of sensible reforms to our methods that would make experiments, correlational studies, intuitive, or counterintuitive research more reliable and believable.
In the meantime, Rome is burning, as NSF is proposing to cut the social and behavioral sciences by $150 million, in no uncertain terms because of our ability to create proof for things like ESP. Basic science at its best.
Response to Lee:
No, they are not claims about what happens in labs or anywhere else. They are claims about the extent to which the data from a study support or fail to support a hypothesis.
At the risk of violating copyright law, I’ve cut-and-pasted Mook’s discussion of the proper way to describe results:
“The study is a test of the tension-reduction view of alcohol consumption, conducted by Higgins and Marlatt (1973). Briefly, the subjects were made either highly anxious or not so anxious by the threat of electric shock, and were permitted access to alcohol as desired. If alcohol reduces tension and if people drink it because it does so (Cappell & Herman, 1972), then the anxious subjects should have drunk more. They did not.
Writing about this experiment, one of my better students gave it short shrift: “Surely not many alcoholics are presented with such a threat under normal conditions.”
Indeed. The threat of electric shock can hardly be “representative” of the dangers faced by anyone except electricians, hi-fi builders, and Psychology 101 students. What then? It depends! It depends on what kind of conclusion one draws and what one’s purpose is in doing the study.
Higgins and Marlatt could have drawn this conclusion: “Threat of shock did not cause our subjects to drink in these circumstances. Therefore, it probably would not cause similar subjects to drink in similar circumstances either.” A properly cautious conclusion, and manifestly trivial.
Or they could have drawn this conclusion: “Threat of shock did not cause our subjects to drink in these circumstances. Therefore, tension or anxiety probably does not cause people to drink in normal, real-world situations.” That conclusion would be manifestly risky, not to say foolish; and it is that kind of conclusion which raises the issue of external validity. Such a conclusion does assume that we can generalize from the simple and protected lab setting to the complex and dangerous real-life one and that the fear of shock can represent the general case of tension and anxiety. And let me admit again that we have been guilty of just this kind of foolishness on more than one occasion.
But that is not the conclusion Higgins and Marlatt drew. Their argument had an entirely different shape, one that changes everything. Paraphrased, it went thus: “Threat of shock did not cause our subjects to drink in these circumstances. Therefore, the tension-reduction hypothesis, which predicts that it should have done so, either is false or is in need of qualification.” This is our old friend, the hypothetico-deductive method, in action. The important point to see is that the generalizability of the results, from lab to real life, is not claimed. It plays no part in the argument at all.
Of course, these findings may not require much modification of the tension-reduction hypothesis. It is possible—indeed it is highly likely—that there are tensions and tensions; and perhaps the nagging fears and self-doubts of the everyday have a quite different status from the acute fear of electric shock. Maybe alcohol does reduce these chronic fears and is taken, sometimes abusively, because it does so. If these possibilities can be shown to be true, then we could sharpen the tension-reduction hypothesis, restricting it (as it is not restricted now) to certain kinds of tension and, perhaps, to certain settings. In short, we could advance our understanding. And the “artificial” laboratory findings would have contributed to that advance. Surely we cannot reasonably ask for more.”
Response to Brent:
Here are some quotes from your original post:
“And, though some great arguments have been made that we should all relax a bit about our use of exploratory techniques and dig around in our data, what these individuals don’t realize is that half the reason we do experiments is not to do good science but to look good. To be a “good scientist” means being a confirmer of hypotheses and there is no better way to be a definitive tester of hypotheses than to run a true experiment.
Of course, now that we know many researchers run as many experiments as they need to in order to construct what findings they want, we need to be even more skeptical of the motive to look good by running experiments. Many of us publishing experimental work are really doing exploratory work under the guise of being a wand wielding wizard of an experimentalist simply because that is the best route to fame and fortune in our guild.
Most of my work has been in the observational domain, and admittedly, we have the same motive, but lack the opportunity to implement our desires for causal inference.”
It appears that you are arguing that we should do fewer experiments because experiments make it easier to do bad science and make false claims of causality. If that is not what you intended to convey, perhaps you can clarify.
You also state that our guild values “small Ns, no effect sizes, poor power, and counter-intuitive findings….It is an accurate description of the most cited research in Psych Science, JPSP, PSPB, JESP, and SPPS. It is the research that almost all award winning psychologists from our guild do, with pride I might add.”
I agree that those features characterize some of our most valued research, but you’ve included a bunch of variables that are correlated with but do not cause that research to have value. That research is valued because the findings are counter-intuitive and tell us something interesting about psychology that we don’t already know. That research is not valued *because* it has small Ns, no effects sizes, or poor power. The suggestion that this research would be valued less if it had more power and reported effect sizes is absurd.
Just above you write that “valuing of this package in particular is getting in the way of sensible reforms to our methods that would make experiments…”
Here, again, you are suggesting that there is something inherent about experiments that makes these reforms less feasible. I still see no relationship between these reforms and the kinds of research (experimental vs. correlational) that one conducts.
Finally, you suggest that the NSF cuts are “in no uncertain terms because of our ability to create proof for things like ESP.” I call bullshit, and would like to see the source for this claim.
I’ve already expressed sympathy for Jeff’s argument that the problem is not with valuing good causal inference (e.g., via experimentation) per se.
However, I think he is parsing Brent’s argument in a manner that is unfair to its spirit. I take the value system of a field to be reflected in its practices and product, even if that value system is expressed in implicit rather than fully explicit ways.
The reality is that when we are operating under conditions of limited resources, valuing experiments within a given sub-field means smaller N studies with less precision as regards effect size estimates. This is minimally an implicit endorsement of good causal inference over trustworthy effect sizes.
That being said, it has been my experience that some experimentalists (including social psychologists) are willing to take things a step beyond this and explicitly devalue trustworthy effect sizes. When a scholar talks about so-called over-powered studies (see above) or claims that an effect is only of interest if it can be detected with a small N (i.e., is large in magnitude), s/he is devaluing effect sizes and in particular the precise estimates of same.
Wonderful post as usual. I, like many other researchers and methodologists, have been running these same issues around in my head for the past couple of years. I’ve really enjoyed reading your blog and seeing how others think about statistical and methodological reform.
I was trained as a social psychologist, but during graduate school (I started in 2003) I was never exposed to the weaknesses of NHST, the experimental method, or any of the larger epistemological issues that have recently (once again) received popular attention. I was surprised when I started reading about these problems about 2 years ago that the critiques in psychology go back at least 50 years.
I’ve been through the gauntlet of the review process a lot, and based on my experiences and the writings of others (the Asendorpf paper you cite above gives a great overview of structural barriers and incentives), I’ve come to think that the field simply doesn’t care about whether the research we publish is “true” (this is a slippery concept, but perhaps most easily operationalized by “replicable”).
Uri Simonsohn made an incredible (and seemingly spontaneous) comment at the replicability symposium at SPSP last month. He essentially said that, even among the supposedly “best” JPSP papers (e.g. those that made no mention of deleting outliers or including lots of covariates in their models), the average statistical power to detect the effects reported in the articles was around 33%. I realized the current situation was bad, but I will admit this was surprising. What’s most amazing is that these papers, reporting statistically improbable results, have been subjected to the most strict scrutiny and selection procedures that our field provides.
You mention incentives above, and this is a critical issue. I guess I’m coming at it from the other side- the selection procedures that the guild employs to pass judgment on what the “best” research is (that which appears in the top journals). Uri’s analysis of power in JPSP, along with the many failed replications that have received attention in recent years, make it pretty clear that we’re not selecting papers for publication in prestigious journals based on the robustness or accuracy of the results. So what the heck are we selecting on?
My intuitive sense of what is rewarded or incentivized is storytelling (which we tell ourselves is “theory”), preferably with a sexy counter-intuitive finding (Deathly Hallows 3). How do we tell our stories? Well, with experiments (Deathly Hallows 2) which further their causal epistemological argument via NHST (p < .05, Deathly Hallows 1).
Then we further select individuals based on their publications. Those people who are typically selected for entry into the guild (e.g. professorships) are those who have amassed a sufficient number of publications (e.g. compelling stories justified with the 3 Deathly Hallows) in the current system.
It’s interesting, because nowhere in these two selection procedures do we see any inherent place for the topics that make statisticians and methodologists apoplectic: statistical power, replicability, transparency regarding study materials, hypotheses, data, and syntax.
It’s almost like there are two communities who speak different languages. The small minority of methodologists and statisticians would like our research to be robust and replicable. The rest of the field, rightly so given the selection procedures in place, talks around or past these issues. Because, at the heart of it, that’s not really what we do. We tell ourselves stories which we decorate with numbers and pretend to one another that we’re doing science (or as Sherman puts it above: "The goal is not to generalize to any particular group or context. The goal is to provide a test of a theoretically-driven hypothesis within the context of our labs"). I can understand where this argument is coming from. But at least let’s be honest about it.
Is what we do real? The only way to get any sense of this is to follow the methodologists, to care about statistical power, replicability, and transparency. But it seems to me that this concern has always been tangential to our research enterprise. To integrate this perspective requires not a return to any fundamental values, but a complete re-imagining of the way the process works.
This MM comment deserves a star.
This rcfraley comment deserves a firm Second.
Hey Jeff,
1. I am not looking for a food fight here. It really troubled me, not that the tone of our comments seems to conflict so much — I think intellectual/scientific conflict is healthy — but that we seemed to be talking past each other. “We” here includes Brent’s comments as well as mine.
2. Then it hit me. You are defending one way of doing good science (testing hypotheses derived from theory in the lab), and interpreting them accordingly, including NOT over-interpreting them.
I am not worried about people doing good science, and I do not think Brent is, either. I am worried about people doing bad science and dressing it up as good science. I am worried that some people are p-hacking and making up theories to explain data after the fact and know what they are doing, but do it anyway. I am worried that lots of people may be doing these things and are not even aware of how problematic these once-normal procedures are. I am worried that it has been nearly impossible to distinguish between good research and bad research, short of catching someone engaging in fraud red-handed.
3. In fact, I have to thank you. Sincerely, not sarcastically. I have long been struggling to figure out what the solutions might be to (what I see, anyway, as) a crisis in validity and credibility. Your posts offer one answer to at least some of the problems. Stick with Mook. When your results support or fail to support some hypothesis in some lab study, feel free to say so. And be just as clear that one is making no claim whatsoever about generalizability or external validity, absent lots of data about the same phenomena under real-world conditions. That would go a long way toward addressing both the overclaiming/overstatement interpretations that pervade social psychology, and toward reining in some of (what I see as) the arrogance of (many) experimenters. Had Bargh merely declared, “theoretical bases for predicting that unconscious processes can be involved in many social phenomena have consistently been confirmed,” he would have been on much more solid ground.
4. However, your (completely unnecessary, I think) defense of how science should be conducted does not address my (or Brent’s, I think) concerns about how it IS conducted. I will just give one concrete example. You know the “classic” and counterintuitive finding that people like fewer rather than more choices? Meta-analysis shows no effect. Great argument for replication, in my view (both exact and conceptual, depending on what claim — in your/Mook’s terms, what theoretical claim or derived hypothesis — is being tested).
It gets worse, though. Leif Nelson has a great paper (maybe not yet published) showing that the studies in the meta-analysis supporting the counterintuitive “fewer choices are better” hypothesis show massive evidence of p-hacking. The studies showing no difference or “more choices are better” show no evidence of p-hacking.
Until Simonsohn/Nelson/Simmons developed means for testing for p-hacking, this would have been impossible to determine. And for me at least, it raises the following questions, which are all variations on the same theme:
What other areas of social psychology are similarly non-credible?
How many other counterintuitive findings are either irreplicable and/or have been obtained through (intentionally or otherwise) fishy methods and stats?
How much of the product of our field can we really believe?
And how can we tell what to believe and what not to believe?
Even when the phenomenon uncovered is real, how often are effect sizes dramatically overestimated in initial publications?
Descriptions of ideal scientific practices do not help answer these questions.
To be sure, despite my deep skepticism expressed here, I do think much of social psychology can be believed — especially research that has consistently produced very large effects across lots of labs.
But sometimes, that research is very inconsistent with social psychology’s received wisdom.
For example, stereotype accuracy is one of the largest and most replicable effects in all of social psychology. And yet, you won’t find that conclusion anywhere in the repositories of received wisdom in social psychology (handbook/annual review chapters, advances chapters, psych review, psych bull, etc.).
Power of the situation? How much of social psychology has ever compared the power of situations to individual differences? Funder and Fleeson have both done great work on this — and in the literature that has actually compared individual differences to situations, there is no greater overall power of situations.
The following are all different but related issues — exact (or near-exact) replications, conceptual replications, p-hacking, underpowered designs, unjustified claims/interpretations, overstatements, overgeneralization. Flaws and problems at each step compound the next. Bad study with hidden flaws finds some cute counterintuitive finding that captures people’s imaginations? It becomes received wisdom, overgeneralized to the real world, and then creates obstacles to failures to replicate seeing the light of day. And even if the failures do see the light of day, no one attends to them anyway, because that original finding has so captured people’s imaginations. And so the field risks ending up with many broad conclusions that are not justified, because either the original work cannot be replicated or, even if the results replicate, they mean what you/Mook have argued they mean but not what many social psychologists interpret them as meaning.
Descriptions of ideal behavior on the part of scientists are helpful, though — they remind us of the standards we should aspire to when conducting our science.
You can have the last word. Gotta get back to “real work.”
Lee
My arguments in these posts have been in response to Brent identifying experiments and non-intuitive effects as 2 of the Deathly Hallows of psychology.
I am not arguing that we do not face serious problems and challenges. I enthusiastically support almost all of the many good solutions people have suggested and are pursuing.
I do, however, strongly disagree that these problems are inherent to some types of research and not others. And, I think if we start to conflate discussions of proper science with discussions of what kinds of research we personally value, we are doing a disservice to the reform movement and unnecessarily introducing conflicts that are beside the point.
Well, I clearly did not write the post well, since I did not communicate effectively to Jeff that it is not any one of these Hallows that is flawed. None of them is necessarily bad on its own (though some might argue that p < .05 has no redeeming value). What I argued, or so I thought, was that because we value that particular package of qualities–experiments with counter-intuitive findings and p-values less than .05–researchers have been willing to do whatever it takes to achieve it, including p-hacking. And, like Bem's work on ESP, the resulting package looks good and is still showing up in our journals to this day.
The inevitable consequence of conducting better research by using larger sample sizes, reporting effect sizes, documenting null effects, pre-registration, and the like will be to make it much harder to gain that hallowed ground occupied so easily now by researchers employing a few researcher degrees of freedom. It is significantly easier to find a counter-intuitive experimental effect by running multiple small-N studies and only reporting the statistically significant findings. Take that approach away, and you take away a seriously large chunk of our research to date. And, if we make it difficult for people to employ the Deathly Hallows, that will change our entire status structure, which is largely driven by the number of publications in top journals.
Pingback: Psychology News Round-Up (March 14th) | Character and Context
Pingback: First post at Cohen’s b: What am I doing here? | Cohen's b(rain)
Hey Jeff,
I wish I had the same non-experience that you have (not?) had (“To be honest, I can’t say that I’ve come across anyone arguing that more power is Bad.”). Here’s a direct quote from a recent editorial letter from what many consider to be the top journal in our field:
“Given the unusually large sample sizes in your set of studies, the reviewers and I wonder what the magnitude of the effect sizes is because the larger the sample size is the smaller the effect size can be detected. It is highly possible that the correlation coefficients reached statistically significance due to the large number of participants. Statistical power is important but having too much can be problematic too.”
Obviously in subsequent versions we were careful to highlight the effect sizes!
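As a minimal sketch of the arithmetic at issue here (made-up numbers and a simple Pearson correlation, purely for illustration): with a large enough sample, even a trivially small correlation comes out "statistically significant," which is exactly why the effect size and its confidence interval, not the p-value, are the informative quantities.

```python
# Illustrative only: simulated data, not from any actual study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000                 # an "unusually large" sample
true_r = 0.05              # a tiny effect by most standards

# Simulate two variables whose true correlation is about 0.05
x = rng.standard_normal(n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.2g}")          # p is typically far below .05

# Rough 95% confidence interval for r via the Fisher z-transformation
z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"95% CI for r: [{lo:.3f}, {hi:.3f}]")
```

The large sample does not manufacture a big effect; it simply estimates a small one precisely, and reporting r with its interval makes that plain.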
Thanks all for the fascinating and spirited discussion.
Sam
Wow. That’s seriously fucked up. It is deeply disturbing that an editor at one of our top journals would write that.
I find this exchange illuminating and frustrating at the same time. I read the exact same value judgments in this blog that Jeff did. Perhaps the value judgments are more obvious when the reader doesn’t share them.
To anyone who wants to give over the conduct of science to statisticians: RESIST!!! The normal values of statisticians are to fear Type I error, and to devalue Type II error. Why not just turn social psychology over to the Republican Party? Both impulses share the same, deeply conservative values.
But everyone who has responded so far makes the same error–science is a social endeavor, made stronger by difference, by plurality, by error and experiment. One p-hacker is hardly a major problem–not even a small handful of them. Individual mistakes and bad acts (or worse) become trivial over time.
I am troubled by anyone who wants to make the argument that we’ll all move faster if we all move slower.
Here’s a great example of how science as a collective knows its business. Consider the “most important” data faker in social psychology in the last several decades–Diederik Stapel. He published a lot of articles, and was pretty famous. BUT–did people use his research? He published in some of the fastest-moving hottest, most highly populated areas of social psychology. But was he actually important?
To test this, I compared the citation rate of Mr. Stapel with the citation rate of the six people responding to this blog that I could identify: Brent Roberts, Jeff Sherman, Lee Jussim, Glenn Roisman, Sam Gosling, and me (Chris Crandall). I used Google Scholar (thank you all for making your page public!), and looked at the top five publications for each of us.
Mr Stapel’s top five = 1,037 citations.
Average of six commenters’ top five = 3,005 citations.
In case you think I’ve Trojan-Horsed in one of us with massive citations, Stapel is lower than every one of these commenters, and lower by at least 80%.
(As an aside, if you think that scientific psychology devalues correlational, naturalistic, or observational research, then compare the citation rate of the correlational researchers on this list to the citation rate of the experimental researchers. Even better, compare those of us who do both, and consider the citation rate of the correlational research we do with the experimental research we do. That’ll sink a straw man or two.)
Science is a collective, cumulative enterprise. There is a certain wisdom in crowds, particularly scientific crowds, and when we take the long view, we can all quit hyperventilating.
I do need to add that, for my arguments to hold value, we must share evidence of failure to replicate. I cannot gainsay journals’ policies to avoid publishing failures-to-replicate.
However, I can criticize the notion that the ONLY place to publish such replications is in the original journal (or in the journals at all). We have a rich, dense, diverse set of paths to communicate. This includes the “original” journals, other journals, blogs, conferences, social media, personal communications, meta-analyses, and other sources. And we use them–you know we do.
When Karen Ruggiero was outed as a faker (another tip of the hat to Nalani Ambady for her role in the outing), I had known of suspicions about her data, and the failure to replicate some of the effects, for a handful of years. At that time I was working in that area. I cannot find a single citation in my publications to her empirical work (I did cite an introduction to an issue of Journal of Social Issues).
Not that I am far-seeing when it comes to fraud. It was out and about in the informal channels that are critical to the conduct of science. Knowledge doesn’t have to appear in JPSP before the community recognizes it–it’s not the stone tablets of social-personality psychology.
Chris: I don’t get the sense that many of the people who are concerned with the state of psychological science are hung up on Stapel in particular. I think part of the problem Brent is highlighting is that the incentive structure in our field doesn’t reward work that is designed to produce robust knowledge.
– Do we value replication? Not really.
– Do we care about getting things right? Not as much as we should.
– Do we take Type II error and precision seriously? Nope.
– Is the process self-correcting? Maybe it is in the long-run. But, in the meantime, those who are trying to make corrections are being treated as trolls.
I think Brent does a nice job at highlighting some of the things we do value as a field. (And I don’t think his point was that they are all Bad Things per se; they are what they are. Experiments, for example, can be done well or poorly.) But we should make sure we’re creating incentives for other things too if we want a cumulative science.
Chris F: I’ll agree that “many are not,” if you’ll agree that quite a lot of the articles, blogs, and discussion put his case in the first page or two. But it wasn’t my point. My point was that the *system* is substantially more robust than individual actions, choices, and practices. Editors and reviewers let Stapel’s bogus work into the journals, but not many people were fooled into thinking his work was important (compared to his “productivity”), based on its low citation rate.
And I’ll disagree with this one: “Do we value replication?”
Of course we do. I checked my JPSP publications (few as they may be), and every one of them has three replications or more built in. (They are not exact replications in every case, but you’ll have to agree that not everyone expects *exact* replications.) Reviewers ask for them all the time, and good scientists put them into their work. My next submission has 11 independent samples testing a nearly identical hypothesis almost 125 times (it’s not as tedious as it sounds, I promise!), with over 3,000 participants (collected by hand in the field). It’s what we do. Perhaps we might agree that replication is “under”-valued.
And it’s good that we agree that our science is self-correcting in the long run (or at least, it’s partially self-correcting). It seems to me that people are proposing fixes for the “short run” without significant attention to the long run. Despite being a one-shot study with a comparatively underpowered sample, Festinger & Carlsmith’s (1959) experiment has stood the test of time (does anyone need to talk about the effect size in that experiment?).
There are many ways to do good science, and some of those good ways are being attacked because they aren’t conservative enough. My point is that the system of science is sufficiently conservative without having to convert every single practitioner for every single publication. We don’t have to emulate the Tea Party–there isn’t One Good Way.
Pingback: (Sample) Size Matters |
Pingback: On the Research University, and on Good Science – two articles/blogs you should read. | Åse Fixes Science
Pingback: Personality Pedagogy Newsletter Volume 8, Number 8, April 2014 | Personality Pedagogy Newsletters
Pingback: (Sample) Size Matters - The Berkeley Science Review
Sorry I’m coming to this late; someone sent me the link only today. I understand perfectly if this is all ‘water under the bridge’ now.
@Jeff says: “The goal is not to generalize to any particular group or context. The goal is to provide a test of a theoretically-driven hypothesis within the context of our labs”.
And what exactly then is the ‘population’ from which you have ‘sampled’, in order to derive a statistical test of ‘significance’ of your estimated ‘population’ parameter? A hypothetical population of ‘lab participants’ in one or more laboratories?
I have to admit it’s a clever wheeze (“our aim is not to generalize”), one that I’ve never come across before.
Now, if you keep the wheeze-claim but bootstrap ‘significance’ from within your own sample, or from an aggregate sample constructed from ‘the lab participant population’ (something akin to James Grice’s OOM ‘randomization’ test implemented on the same sample data; a rough sketch appears below), then the wheeze does indeed gain some legs!
But then of course the question arises: what kind of theory are you generating/testing that speaks (by design) only to phenomena detectable in a laboratory (or only in a laboratory in which you can control conditions to the extent you/your theory requires), and that possesses no generalizability or meaning beyond the laboratory except by ‘chance’ alone? That claim bears no relation to what created the drive for explanatory theory and experimentation within physics or the natural sciences.
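For concreteness, the within-sample randomization idea mentioned above might look something like the following rough sketch: a generic permutation test on fabricated data, not James Grice’s actual OOM implementation. The question it answers is strictly internal to the sample in hand: reshuffle one variable many times and ask how often the reshuffled data produce an association at least as strong as the one actually observed, with no appeal to a sampled ‘population.’

```python
# Rough sketch only: a generic within-sample permutation test on made-up data,
# not Grice's OOM procedure and not anyone's actual study.
import numpy as np

def permutation_test_correlation(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a Pearson correlation within a single sample."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)   # break any x-y pairing by reshuffling y
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            count += 1
    # Proportion of reshuffles yielding an association at least as strong as observed
    return observed, (count + 1) / (n_permutations + 1)

# Illustration with fabricated "lab" data (60 participants, modest true effect)
rng = np.random.default_rng(42)
x = rng.standard_normal(60)
y = 0.3 * x + rng.standard_normal(60)
r, p_perm = permutation_test_correlation(x, y)
print(f"observed r = {r:.3f}, randomization p = {p_perm:.4f}")
```

The resulting p-value is a statement about this particular sample under reshuffling, not about an imagined population of ‘lab participants,’ which is the only way the no-generalization claim and the significance test sit comfortably together.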
Pingback: Descriptive ulceritive counterintuitiveness | pigee