We recently read Karg et al. (2011) for a local reading group. It is one of the many attempts to meta-analytically examine the idea that the 5-HTTLPR serotonin transporter polymorphism moderates the effect of stress on depression.

It drove me batty. No, it drove me to apoplectia–a small country in my mind I occupy far too often.

Let’s focus on the worst part. Here’s the write up in the first paragraph of the results:

“We found strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002). The significance of the result was robust to sensitivity analysis, with the overall P values remaining significant when each study was individually removed from the analysis (1.0×10⁻⁶ < P < .00016).”

Wow. Isn’t that cool? Isn’t that impressive? Throw out all of the confused literature and meta-analyses that came before this one. They found “strong evidence” for this now infamous moderator effect. Line up the spit vials. I’m getting back in the candidate gene GxE game.

Just what did the authors mean by “strong?” Well, that’s an interesting question. There is nary an effect size in the review as the authors chose not to examine effect sizes, but focused on synthesizing p-values instead. Of course, if you have any experience with meta-analytic types, you know how they feel about meta-analyzing p-values. It’s like Nancy Reagan to drugs. Just say no. If you are interested in why, read Lipsey and Wilson or any other meta-analysis guru. They are unsympathetic, to say the least.

But all is not lost. All you, the reader, have to do is transform the p-value into an effect size using any of the numerous online transformation programs that are available. It takes about 15 seconds to do it yourself. Or, if you want to be thorough, you can take the data from Table 1 in Karg et al. (2011) and transform the p-values into effect sizes for your own meta-analytic pleasure. That takes about 15 minutes.

So what happens when you take their really, really significant p-value of p = .00002 and transform it into an effect size estimate? Like good meta-analytic types, the authors provide the overall N, which is 40,749. What does that really impressive p-value become when you translate it into an r metric?

*.0199 or .02* if you round up.

It is even smaller than Rosenthal’s famous .03 correlation between aspirin consumption and protection from heart disease. You get the same thing when you plug all of the numbers from Table 1 into Comprehensive Meta-Analysis, by the way.
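For anyone who wants to check the 15-second version, here is a minimal sketch of the conversion. It assumes the reported p-value is two-tailed and uses the common approximation r = z / sqrt(N); the function name `p_to_r` is made up for this illustration. Under those assumptions the result lands within rounding distance of the figure above. The exact third decimal depends on the tail convention and on which conversion formula you pick.

```python
from math import sqrt
from statistics import NormalDist


def p_to_r(p_two_tailed: float, n: int) -> float:
    """Approximate the correlation implied by a two-tailed p-value and total N."""
    # Standard normal deviate corresponding to the two-tailed p-value.
    z = NormalDist().inv_cdf(1 - p_two_tailed / 2)
    # Common meta-analytic approximation: r = z / sqrt(N).
    return z / sqrt(n)


# Values taken from the text: overall P = .00002, overall N = 40,749.
r = p_to_r(0.00002, 40_749)
print(f"r = {r:.4f}")
print(f"variance explained = {r ** 2:.4%}")
```

Squaring r shows the share of variance in depression that the interaction accounts for, which is a small fraction of one percent.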

So the average interaction between the serotonin transporter promoter and stress on depression is “strong,” “robust,” yet infinitesimal. It sounds like a Monty Python review of Australian wine (“Bold, yet naïve.” “Flaccid, yet robust”).

Back to our original question, what did the authors mean when they described their results as “strong?” One can only assume that they mean to say that their p-value of .00002 looks a lot better than our usual suspect, the p < .05. Yippee.

Why should we care? Well, this is a nice example of what you get when you ignore effect size estimates and just use p-values: misguided conclusions. The Karg et al. (2011) paper has been cited 454 times so far. Here’s a quote from one of the papers that cites their work: “This finding, initially refuted by smaller meta-analyses, has now been supported by a more comprehensive meta-analysis” (Palazidou, 2012). Wrong.

Mind you, there is no inconsistency across the meta-analyses. If the average effect is really equal to an r of .02, and I doubt it is this big, it is really, really unlikely to be consistently detected by any study, let alone a meta-analysis. The meta-analyses appear to disagree only because the target effect size is so small that even dozens of studies and thousands of participants might fail to detect it.

Another reason to care about misguided findings is the potential mistaken conclusion either individuals or granting agencies will make if they take these findings at face value. They might conclude that the GxE game is back on and start funding candidate gene research (doubtful, but possible). Researchers themselves might come to the mistaken conclusion that they too can investigate GxE designs. Heck, the average sample size in the meta-analysis is 755. With a little money and diligence, one could come by that kind of sample, right?

Of course, that leads to an interesting question. How many people do you need to detect a correlation of .02? Those pesky granting agencies might ask you to do a power analysis, right? Well, to achieve 80% power to detect a correlation of .02 you would need *8,699* participants. That means the average sample in the meta-analysis was woefully underpowered to detect the average effect size. For that matter, it means that *none* of the studies in the meta-analysis were adequately powered to detect the average effect size because the largest study, which was a null effect, had an N of 3,243.
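Here is a rough sketch of that power analysis for readers who want to rerun it. It assumes a two-sided test at alpha = .05 and the Fisher z approximation for a correlation; the helper names `n_for_power` and `power_at_n` are invented for this example. Under these particular assumptions the required N comes out considerably larger than the figure quoted above, which presumably reflects different settings. The conclusion about underpowered studies holds either way.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_for_power(r: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Sample size needed to detect r (two-sided test, Fisher z approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    # Fisher's z has standard error 1 / sqrt(n - 3), hence the "+ 3".
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)


def power_at_n(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test of r = 0 at sample size n."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(r) * sqrt(n - 3)
    # Probability of landing in either rejection region.
    return STD_NORMAL.cdf(shift - z_alpha) + STD_NORMAL.cdf(-shift - z_alpha)


required_n = n_for_power(0.02)
power_average_study = power_at_n(0.02, 755)    # average N in the meta-analysis
power_largest_study = power_at_n(0.02, 3_243)  # largest N in the meta-analysis

print(f"N for 80% power: {required_n:,}")
print(f"Power at N = 755: {power_average_study:.1%}")
print(f"Power at N = 3,243: {power_largest_study:.1%}")
```

The second function is the more sobering one: it gives the chance that a single study of the average or the largest size would detect an effect of .02 if it were real.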

So, this paper proves a point: if you cumulate enough participants in your research, almost anything is statistically significant. And this warrants publication in the *Archives of General Psychiatry*? Fascinating.

Brent W. Roberts

Unfortunately, the situation is far worse than Brent describes it above, for at least two reasons that I am sure Brent already appreciates.

First, the GxE literature is hopelessly unrepresentative of the data.

Anecdote 1: Last year I was called on to review a “replication” of the SLE x 5HTT = Depression finding (i.e., originating with Caspi et al in Science). But as these authors had access to longitudinal data on depression, they were able to examine mean levels of depression (intercepts), change in depression (slopes), and acceleration (nonlinear change from assessment to assessment). They also genotyped both the so-called biallelic and triallelic variants of 5HTT (Caspi focused on biallelic). What did they find? In the original version of the ms, nothing for the biallelic variant and one effect for the triallelic–that acceleration (but not intercepts or slopes) was predicted by the SLE x triallelic 5HTT interaction. I pointed out that this pattern of results could be easily attributed to chance. The response: The biallelic runs were omitted from the ms and it was published as a successful replication.

Anecdote 2: One of my students recently reported evidence that rank order stability in attachment security, which is notably weak from infancy to young adulthood, is stronger for A carriers of one of the OXTR SNPs. The paper, though based on a modest N (150 or so), showed similar patterns across different DVs, and was published in the Journal of Child Psychology and Psychiatry. Although not a part of the initial effort, I asked the student if he wanted to attempt replication of his finding in the (N=600 sub-sample) Study of Early Child Care. He did and could not replicate the finding. We sent the paper to JCPP and they turned it down before it even went out for review because, after all, null results are “meaningless.”

Second, let us assume that the .02 standardized effect is close to the true effect. All this means is that the association between SLEs and depression varies (apparently trivially) as a function of genotype. The problem is that there are MANY different kinds of interactions (differential susceptibility, dual risk, contrastive) that could have contributed to that effect, all of which carry substantively different interpretations. Moreover, differentiating among them is not straightforward on the basis of the kinds of data reported in the literature. Minimally, all of the interactions would need to have been probed over the same range of interest on X, and the Regions of Significance on X within that range would need to be reported.

All of this said, the bigger problem exposed here is that adequate power (big N) is necessary but not sufficient to cure what ails us (and many other disciplines in social and biological science). For a start, we need to overturn quasi-apocrypha like the aspirin standard that allow us to confuse trivial population effects with effect sizes that are treated as theory-confirming. Recall that the point of the aspirin example is that the population level effect is so small because the BASE RATES of the sorts of conditions that might be ameliorated by a daily aspirin regimen are so low. Aspirin works quite well for those who actually have the underlying diathesis. In short, P. E. Meehl was right–we need to upend the standard of “anything besides exactly zero gets our theories money-in-the-bank.” One simple way of attempting this is beginning with the premise, as Chris Ferguson has elegantly argued, that there really are unstandardized effects that are too small to care about. If we can’t reason through what that value might be before collecting our data, we are wasting our time and others’ money.

Great comment, Glenn! Could you provide references to the independent replications of Caspi et al. (2003) and the attachment security finding (@ Journal of Child Psychology and Psychiatry) that you mention?

Painful anecdotes Glenn. The continued bias against the null makes me visit apoplectia just as much as the love of p-values. I will quibble with only one thing. I simply hate Ferguson’s argument about effect sizes because he had no good rationale for picking the number he picked (an r of .20)–or, at least, this is how he is cited now.

We now know that the average effect size in social psychology, personality psychology, and organizational psychology is equal to a correlation of .2. So, according to Ferguson, we need to throw out half of our studies as trivial without any attempt to understand their meaning? Take, for example, the correlation between conscientiousness and college persistence. In one study we are conducting, the meager correlation, which is below .2, nets you an extra 6 months of college persistence if you are one standard deviation above the mean. That’s real value. We need to stop bloviating about what we think is a big or small effect size and get down to doing the dirty work of figuring out what our correlations/mean differences mean.

Exactly right. Amidst all the fancy statistics that fill our methods journals, a simple question remains insufficiently addressed: What effect size is big enough to matter, or (conversely) so small it doesn’t matter? Arbitrary thresholds are worse than useless. Conversely, the social psychologist I quoted in a blog post (http://funderstorms.wordpress.com/2013/02/01/does-effect-size-matter/) seems to think that if you are doing theoretical work, then ANY non-zero effect size is sufficient. The size beyond that only matters for “applied” work, he says (with a barely detectable sneer). I strongly disagree with that sentiment, but I have to admit that effect sizes are easier to evaluate in applied than in theoretical contexts. As a field, we just aren’t in the habit, and don’t have enough experience.

I don’t think we actually disagree on this point, pigee Brent :)

I mentioned unstandardized (rather than standardized) effects because I agree that arbitrary thresholding of standardized effects built on measures that use arbitrary metrics is a pretty useless basis for judging the substantive significance of a given association (i.e., is it important, actionable, etc.). In short, I am sympathetic to Ferguson’s description of the illness even if not his preferred cure.

Frankly, I am surprised we don’t spend more time as a field developing measures based on less or non-arbitrary metrics and asking the hard question of how much of a change or difference in unstandardized terms is meaningful in relation to how difficult, expensive, etc. it is to change 1 unit on the (non-arbitrary, unstandardized) IV. In short, I expect that the standardized, mostly arbitrary metric-based .20 (on average) effects you cite in social and personality psych are likely heterogeneous with respect to their substantive significance. Some are likely to be of trivial value, others are potentially valuable effects. I think your persistence example is a good one where the “small” effect is potentially useful, though also a rare one in psychology in the sense that the DV is clearly a non-arbitrary metric. We can have an opinion on the value of 6 months more persistence, which is good.

As well as the points you raise, reducing a two-dimensional interaction to a one-dimensional p-value clearly loses information. The *shape* of the interaction matters as much as the statistical evidence for the *presence* of an interaction (as reflected in Glenn’s anecdote #1). Unfortunately GxE is particularly vulnerable to researcher degrees of freedom, given the number of ways the data can be carved up. I give an example (related to this literature) here:

http://www.ncbi.nlm.nih.gov/pubmed/19476681

The point about effect size is interesting. Now that we’re in the GWAS era, we’ve learned that the effect sizes associated with common genetic variants in relation to complex phenotypes are *tiny*. This doesn’t necessarily mean that they’re not interesting or important though – the aggregate effect is large, and if that is what nature has given us we have to accept that. Obviously the *clinical* importance of these effects may be negligible, but that’s a slightly different question. What we have learned about is the genetic architecture of those traits, and that should guide future research. This may also be of interest in relation to the 5-HTTLPR in particular:

http://www.ncbi.nlm.nih.gov/pubmed/23108923

I have struggled with the interpretation of effect sizes. The discussion above reminds me of a provocative passage from Tukey’s (1969) essay, Analyzing Data: Sanctification or Detective Work? [To be clear - I love correlations but I do think there is a lesson about the underlying metrics that is easy to overlook.]

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”

Great passage, I love that paper!

Interesting point and quote, Brent. I definitely think all effect sizes are not created equal, and while correlations are imperfect and are not a magic “cure-all” to understanding our data, let’s at least see them, when appropriate (perhaps along with a BESD to give a rough idea of practical significance, or with a scatterplot to see what’s driving the relationship) and think a little more carefully about what effect sizes to report (e.g., not provide what our Stat software gives us or mindlessly report what others in our field used in their publications). I get more tripped up with the Eta Squares and r squares (Dan Ozer always uses the example about how we COULD discuss the miles we travel in squared space, but how does this added complexity improve our understanding of the distance we traveled?).
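For anyone who has not run into the BESD (binomial effect size display), here is a minimal sketch. It assumes the standard Rosenthal and Rubin layout, in which a correlation r is shown as two "success rates" of .50 + r/2 and .50 - r/2. The function name `besd` and the labels are made up for illustration, and applying it to the .02 from the post is my own choice, not something the meta-analysis reports.

```python
def besd(r: float) -> tuple[float, float]:
    """Return the two success rates of a binomial effect size display for r."""
    # The display splits both variables at the median, so the rates
    # sit symmetrically around 50% and differ by exactly r.
    return 0.5 + r / 2, 0.5 - r / 2


examples = [
    ("r = .02 (the overall effect discussed in the post)", 0.02),
    ("r = .20 (the field-average effect mentioned above)", 0.20),
]

for label, r in examples:
    high, low = besd(r)
    print(f"{label}: {high:.0%} vs. {low:.0%}")
```

Seen this way, an r of .02 is a 51% versus 49% split, while an r of .20 is 60% versus 40%.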

The original study showed an r² effect size of 0.75% (I computed that from the t-statistic and d.f.), which Caspi characterized as “trivial.” I guess one sometimes elevates the trivial to the crucial for self-promotional reasons. Matt Keller has done some good work on the issue of interaction pattern. Studies that find entirely different things will be claimed as successful replications if p<.05 for the interaction effect. Here's one of Matt's papers:

http://journals.psychiatryonline.org/article.aspx?articleid=178272
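For readers who want to do the same kind of back-calculation of r² from a t-statistic and its degrees of freedom, here is a minimal sketch using the standard identity r² = t² / (t² + df). The function name and the t and df values below are hypothetical placeholders chosen only to show the arithmetic. They are not the numbers from the original study.

```python
def r_squared_from_t(t: float, df: int) -> float:
    """Proportion of variance explained implied by a t-statistic and its d.f."""
    return t ** 2 / (t ** 2 + df)


# Hypothetical inputs for illustration only (not taken from any paper).
hypothetical_t = 2.0
hypothetical_df = 500

r2 = r_squared_from_t(hypothetical_t, hypothetical_df)
print(f"r^2 = {r2:.4f} ({r2:.2%} of variance)")
```

With a large df, even a t-statistic that clears the usual p < .05 bar corresponds to well under 1% of variance explained.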