We recently read Karg et al. (2011) for a local reading group. It is one of many attempts to meta-analytically examine the idea that the 5-HTTLPR serotonin transporter polymorphism moderates the effect of stress on depression.
It drove me batty. No, it drove me to Apoplectia, a small country in my mind that I occupy far too often.
Let’s focus on the worst part. Here’s the write-up from the first paragraph of the results:
“We found strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002). The significance of the result was robust to sensitivity analysis, with the overall P values remaining significant when each study was individually removed from the analysis (1.0×10⁻⁶ < P < .00016).”
Wow. Isn’t that cool? Isn’t that impressive? Throw out all of the confused literature and meta-analyses that came before this one. They found “strong evidence” for this now infamous moderator effect. Line up the spit vials. I’m getting back in the candidate gene GxE game.
Just what did the authors mean by “strong?” Well, that’s an interesting question. There is nary an effect size in the review; the authors chose not to examine effect sizes at all and focused on synthesizing p-values instead. Of course, if you have any experience with meta-analytic types, you know how they feel about meta-analyzing p-values. They treat it the way Nancy Reagan treated drugs: just say no. If you are interested in why, read Lipsey and Wilson or any other meta-analysis guru. They are unsympathetic, to say the least.
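To see why in concrete terms, here is a toy illustration (the numbers are entirely hypothetical, chosen only to make the point): a p-value confounds effect size with sample size, so the same level of “significance” can come from a large effect in a small sample or a trivial effect in an enormous one.

```python
# Hypothetical numbers, just to illustrate: a p-value alone tells you
# nothing about the magnitude of the effect behind it.
from math import sqrt, atanh
from scipy.stats import norm

def p_from_r(r, n):
    # One-tailed p for testing r = 0, via the Fisher z approximation
    return norm.sf(atanh(r) * sqrt(n - 3))

print(p_from_r(0.40, 103))    # large effect, small sample:  p ~ 1e-5
print(p_from_r(0.02, 45003))  # trivial effect, huge sample: p ~ 1e-5
```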
But, all is not lost. All you, the reader, have to do is transform the p-value into an effect size using any of the numerous online transformation programs that are available. It takes about 15 seconds to do it yourself. Or, if you want to be thorough, you can take the data from Table 1 in Karg et al. (2011) and transform the p-values into effect sizes for your own meta-analytic pleasure. That takes about 15 minutes.
So what happens when you take their really, really significant p-value of p = .00002 and transform it into an effect size estimate? Like good meta-analytic types, the authors provide the overall N, which is 40,749. What does that really impressive p-value look like in an r metric?
.0199 or .02 if you round up.
It is even smaller than Rosenthal’s famous .03 correlation between aspirin consumption and protection from heart disease. You get the same thing when you plug all of the numbers from Table 1 into Comprehensive Meta-Analysis, by the way.
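If you would rather not trust an online calculator, here is the 15-second version as a minimal Python sketch (scipy assumed), using the standard approximation r = z/√N (Rosenthal, 1991). Whether you land nearer .020 or .021 depends on how you treat the tails; either way it rounds to .02.

```python
# Convert a p-value and total N into an r-metric effect size: r = z / sqrt(N)
from math import sqrt
from scipy.stats import norm

p, n = 0.00002, 40749              # Karg et al.'s headline p-value and overall N

print(norm.isf(p) / sqrt(n))       # one-tailed:  r ~ .020
print(norm.isf(p / 2) / sqrt(n))   # two-tailed:  r ~ .021
```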
So the average interaction between the serotonin transporter promoter and stress on depression is “strong,” “robust,” yet infinitesimal. It sounds like a Monty Python review of Australian wine (“Bold, yet naïve.” “Flaccid, yet robust”).
Back to our original question: what did the authors mean when they described their results as “strong?” One can only assume they meant that their p-value of .00002 looks a lot better than our usual suspect, p < .05. Yippee.
Why should we care? Well, this is a nice example of what you get when you ignore effect size estimates and rely on p-values alone: misguided conclusions. The Karg et al. (2011) paper has been cited 454 times so far. Here’s a quote from one of the papers that cites their work: “This finding, initially refuted by smaller meta-analyses, has now been supported by a more comprehensive meta-analysis” (Palazidou, 2012). Wrong.
Mind you, there is no real inconsistency across the meta-analyses. If the average effect is truly an r of .02, and I doubt it is even that big, it is really, really unlikely to be consistently detected by any single study, let alone a meta-analysis. The meta-analyses only appear to disagree because the target effect size is so small that even dozens of studies and thousands of participants can fail to detect it.
Another reason to care about misguided findings is the mistaken conclusions that individuals or granting agencies might draw if they take these findings at face value. Agencies might conclude that the GxE game is back on and start funding candidate gene research (doubtful, but possible). Researchers themselves might come to the mistaken conclusion that they too can run GxE designs. Heck, the average sample size in the meta-analysis is 755. With a little money and diligence, one could come by that kind of sample, right?
Of course, that leads to an interesting question: how many people do you need to detect a correlation of .02? Those pesky granting agencies might ask you to do a power analysis, right? Well, to achieve 80% power to detect a correlation of .02 with a two-tailed test at α = .05, you would need roughly 19,600 participants. That means the average sample in the meta-analysis was woefully underpowered to detect the average effect size. For that matter, none of the studies in the meta-analysis were adequately powered to detect it, because the largest study, which reported a null effect, had an N of 3,243.
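For the skeptical, here is a rough sketch of that power arithmetic in Python (scipy assumed), using the standard Fisher z approximation, the same formula behind tools like R’s pwr.r.test:

```python
# Required N for 80% power, and the power the actual studies had, for r = .02
from math import sqrt, atanh, ceil
from scipy.stats import norm

r, alpha, power = 0.02, 0.05, 0.80
z_a, z_b = norm.isf(alpha / 2), norm.isf(1 - power)

# Sample size needed for 80% power to detect r = .02 (two-tailed, alpha = .05)
print(ceil(((z_a + z_b) / atanh(r)) ** 2 + 3))  # ~19,620

def achieved_power(r, n, alpha=0.05):
    # Approximate power of a two-tailed test of r = 0 at sample size n
    return norm.sf(norm.isf(alpha / 2) - atanh(r) * sqrt(n - 3))

print(achieved_power(0.02, 755))   # ~.08 at the meta-analysis's average N
print(achieved_power(0.02, 3243))  # ~.21 even at the largest study's N
```

At the average sample size of 755, a study would have had roughly an 8% chance of detecting the average effect; even the largest study tops out around 21%.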
So, this paper proves a point: if you accumulate enough participants in your research, almost anything is statistically significant. And this warrants publication in the Archives of General Psychiatry? Fascinating.
Brent W. Roberts