This is a public peer review of Scott Alexander’s essay on ivermectin, of which this is the fifteenth part. You can find an index containing all the articles in this series here.
I.
Scott Alexander’s ivermectin essay can be read as an exposition of an epistemic framework, by means of its application to ivermectin. In it, he shows us how he approaches a topic as fraught and thorny as that one. For instance, when reviewing Krolewiecki et al., he explains that subgroup slicing which is not pre-registered should be discarded, even when the proposed causal pathway makes perfect sense and the strength of the observed correlation yields p-values smaller than 0.01:
Krolewiecki et al: […] The trend favored ivermectin but it was not statistically significant, although they were able to make it statistically significant if they looked at a subset of higher-IVM-plasma-concentration patients. They did not find any difference in clinical outcomes.
A pro-ivermectin person could point out that in the subgroup with the highest ivermectin concentrations, the drug seemed to work. A skeptic could point out that this is exactly the kind of subgroup slicing that you are not supposed to do without pre-registering it, which I don’t think this team did. I agree with the skeptic.
I might have my quibbles, but I can understand. Medicine is one of those fields where extreme care is warranted—what with people’s lives being at stake—so we should demand the highest degree of epistemic hygiene. This is especially true in a pandemic, where, given the stakes and the limited bandwidth for communicating with the public at large, additional care may be warranted before giving advice that will be hard to retract later. What’s worse, public health advice perceived as erroneous can lead the population at large to lose faith in those in charge, creating further risk down the line.
With all this in mind, a reader might be shocked to see the same Scott Alexander relying on non-pre-registered subgroup slicing of exactly the type he warned against in order to pitch the Strongyloides hypothesis:
Dr. Avi Bitterman carries the hypothesis to the finish line:
The good ivermectin trials in areas with low Strongyloides prevalence, like Vallejos in Argentina, are mostly negative. The good ivermectin trials in areas with high Strongyloides prevalence, like Mahmud in Bangladesh, are mostly positive.
Scott refused to even consider such analyses when parsing the evidence in favor of ivermectin, even when they demonstrated strong correlations. Yet here, as commenter Saloni Dattani noted, the last line in the diagram—the statistical test for subgroup differences—does not even come close to statistically significant territory, with a p-value of 0.27. This is a bit of a problem when the argument is that the subgroups are meaningfully different:
The issue is compounded in the context of an essay where the concept of statistical significance has been utilized as the be-all and end-all of statistical analysis, dismissing any effect that did not show strong statistical significance as “no difference.”
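For readers who want a concrete sense of what that last row of a forest plot is doing, the sketch below shows one standard way to test whether two pooled subgroup estimates differ by more than chance. The estimates and standard errors are hypothetical placeholders, not the figures from Bitterman’s analysis; the point is only to show what a p-value on that row means.

```python
# Minimal sketch of a test for subgroup differences, of the kind reported in
# the last row of a forest plot. The estimates and standard errors below are
# hypothetical placeholders, not the numbers from the actual analysis.
from math import sqrt
from scipy.stats import norm

def subgroup_difference_p(est_a, se_a, est_b, se_b):
    """Two-sided p-value for the difference between two pooled subgroup
    effect estimates (e.g. log odds ratios), assuming independent subgroups."""
    z = (est_a - est_b) / sqrt(se_a**2 + se_b**2)
    return 2 * (1 - norm.cdf(abs(z)))

# Hypothetical pooled log odds ratios for a low- and a high-prevalence subgroup:
p = subgroup_difference_p(est_a=-0.10, se_a=0.25, est_b=-0.55, se_b=0.30)
print(f"p-value for the subgroup difference: {p:.2f}")
```

A value like 0.27 on that row simply means that a gap of the observed size between the two subgroups would be unsurprising even if the subgroups did not differ at all.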
At this point—in Scott’s own terms—Scott’s analysis has two serious problems:
Relying for conclusions on a subgroup analysis that was not pre-registered.
Relying for conclusions on a subgroup analysis that was not statistically significant.
II.
That was the state of play when Dr. Avi Bitterman, the person Scott quoted, responded by saying that Scott had actually used an old version of the analysis and should update it:
And so it happened. If you visit the article today, you’ll see that Scott has updated that section to contain the new subgroup analysis, and an additional sensitivity analysis, with no note of the update:
Dr. Avi Bitterman carries the hypothesis to the finish line:
The good ivermectin trials in areas with low Strongyloides prevalence, like Vallejos in Argentina, are mostly negative. The good ivermectin trials in areas with high Strongyloides prevalence, like Mahmud in Bangladesh, are mostly positive.
The new analysis makes important changes:
It adds two new studies (Fonseca, Okumus).
It collapses the three prevalence categories (Low/Medium/High) into two (Low/High).
It uses a different data source for Strongyloides prevalence, but only for the Brazilian studies (TOGETHER, Fonseca).
As a result of these changes, the case does appear to strengthen:
The effect in high-prevalence regions appears much stronger.
The difference between subgroups is now statistically significant at p=0.03.
This tells us a little bit about the worms hypothesis itself, but nothing conclusive. The next article in this series will focus on that hypothesis on its own terms.
This sequence of events does say something important about Scott’s approach to evidence, though. After all, this is the same person who, in the same essay, when reviewing the Biber et al. study, felt at liberty to accuse the authors of deliberate statistical manipulation, writing:
So probably they did the study, found no positive results, re-ran it with various subsets of patients until they did get a positive result, and then claimed to have “excluded” patients who weren’t in the subset that worked.
I’m going to toss this one.
In that case, Scott was accusing the authors falsely, based on his own misunderstanding of the facts. Reading the published paper, we can see that the protocol alteration that got Scott exercised was submitted and approved by the relevant ethics review board long before the randomization key was unsealed, making the kind of p-hacking Scott alleges impossible.
The quote does describe Scott’s own actions much more closely, though. Even if inadvertently, the events played out in such a way as to lay the error bare.
Despite looking down upon non-pre-registered subgroup analyses when the results were positive for ivermectin, despite sneering at non-statistically-significant results, despite even going so far as to accuse other researchers with careers spanning decades of straight-up academic fraud when he thought something similar was going on with another study—when it came to his own pet hypothesis, he had no qualms doing, in public view, exactly what he insinuated others were doing in secret: making a hypothesis look stronger by iterating on a statistical analysis, to find one that passes the traditional hurdles. Except, of course, that those hurdles were not calibrated to withstand that kind of brute forcing.
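To see why those hurdles are not calibrated for this, consider a toy simulation. It is a sketch under made-up parameters, not a model of the actual trials: we generate study effects in a world with no true subgroup difference, let the analyst try several arbitrary ways of splitting the studies into subgroups, and keep the best p-value.

```python
# Toy simulation of how iterating over analysis choices inflates the
# false-positive rate. All parameters are made up for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_studies = 5_000, 30
false_positives = 0

for _ in range(n_sims):
    effects = rng.normal(0.0, 1.0, n_studies)      # null world: no real effect anywhere
    covariate = rng.uniform(0.0, 1.0, n_studies)   # e.g. an estimated prevalence per study
    best_p = 1.0
    for cutoff in (0.25, 0.33, 0.50, 0.66, 0.75):  # analyst-chosen ways to split the studies
        low, high = effects[covariate < cutoff], effects[covariate >= cutoff]
        if len(low) > 1 and len(high) > 1:
            best_p = min(best_p, ttest_ind(low, high).pvalue)
    false_positives += best_p < 0.05

print(f"share of null simulations with a 'significant' split: {false_positives / n_sims:.1%}")
```

When the split that counts as “the” analysis is chosen after seeing the results, a nominal 5% threshold no longer delivers a 5% false-positive rate, and the published p-value no longer means what it claims to mean.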
And while the second version does appear a lot more compelling, Scott cannot claim to have been convinced by that version of the hypothesis, seeing as he went public before it existed. After all the to-ing and fro-ing about the importance of pre-set thresholds for statistical significance, he chose to ignore studies that showed strong evidence, in favor of a hypothesis that, in the form in which he endorsed it, did not.
Scott’s article is presented as a meta-analysis, and one whose methods and criteria were not pre-registered as is commonly expected of meta-analyses. As such, we must rely on our faith in Scott’s even-handed treatment of the evidence to credit his conclusions. Sadly, the essay itself, read closely, betrays the bias we’d hoped to avoid.
After all, Scott pitched his essay not as a way to understand how he approaches hard problems, but as a way to see how science itself works. From the introduction:
What a great opportunity to exercise our study-analyzing muscles! To learn stuff about how science works which we can then apply to less well-traveled terrain! Sure, you read the articles saying that experts had concluded the studies were wrong. But did you really develop a gears-level understanding of what was going on? That’s what we have a chance to get here!
And while I can grant Scott the benefit of the doubt that much of this was an honest mistake, his subsequent refusal to engage in dialogue about these sorts of issues in his essay makes me question his other output, where I have not spent the time to analyze the quality of the evidence and reasoning.
III.
I remember, many years ago, reading a young, up-and-coming philosopher of science who taught me to beware isolated demands for rigor:
Suppose there are scientists on both sides of a controversial issue – for example, economists studying the minimum wage. One team that supports a minimum wage comes up with a pretty good study showing with p < 0.05 that minimum wages help the economy in some relevant way. The Science Czar (of course we have a science czar! We're not monsters!) notes that p < 0.05 is really a shoddy criterion that can prove anything and they should come back when they have p < 0.01. I have a huge amount of sympathy with the Science Czar on this one, by the way.
Soooo the team of economists spends another five years doing another study and finds with p < 0.01 that the minimum wage helps the economy in some important way. The Science Czar notes that their study was correlational only, and that correlational studies suck. We really can't show that minimum wages are any good without a randomized controlled trial.

Luckily, the governments of every country in the world are totally game for splitting their countries in half and instituting different economic regimes in each part for ten years, so after a decade it comes out that in the randomized controlled trial the minimum wage helped the economy with p < 0.01. The Science Czar worries about publication bias. What if there were a lot of other teams who got all the countries in the world to split in half and institute different wage policies in each of the two territories for one decade, but they weren't published because their results weren't interesting enough?

Everything the Science Czar has said so far makes perfect sense and he is to be commended for his rigor and commitment to the job. Science is really hard and even tiny methodological mistakes can in principle invalidate an entire field.
But now suppose that a team shows that, in a sample of six restaurants in Podunk Ohio, there was a nonsignificant trend towards the minimum wage making things a little worse.
And the Science Czar says: awesome! That solves that debate, minimum wage is bad, let’s move on to investigating nominal GDP targeting.
Now it looks like the Science Czar is just a jerk who’s really against minimum wage. All his knowledge of the standards of scientific rigor are going not towards bettering science, but toward worsering science. He’s not trying to create a revolutionary new scientific regime, he’s taking pot shots.
That philosopher railing against correlational studies and corrupt Science Czars was none other than Scott Alexander, as some of you already guessed.
IV.
So what is left to do at this juncture? Had his analysis been published in a journal, I could try to get an expression of concern posted. Had it been published in a journalistic outlet with standards, I could perhaps try to speak to an editor. Given that he has posted this on his own Substack, the means of correcting the record are limited. Unfortunately, my attempts at starting a dialogue—in private or in public—have gotten us precisely nowhere. And while I’m sure my approach could be improved in numerous ways, it should not require the world’s smoothest diplomat to carry a message this straightforward. Or if it did, such capable envoys appear to have been busy with other projects.
At this point, I’m starting to accept that the article will not be corrected in the near future. The only way to mitigate the damage is to inform. If you want to help, and you believe the case made here and in this series is compelling, consider sharing it with communities that may be interested in this information, perhaps the kinds of communities that Scott’s audience frequents. Not to harass, and definitely not with any hint of aggression. If you share, do it to inform, as you would want others to inform you if a trusted sensemaker ignored glaring internal contradictions of this magnitude.
Perhaps irrationally, I still retain some hope that Scott has the kind of epistemic honor that will convince him to retract his article, which continues to be shared on social media every day, if he can be made aware of the depth and breadth of its flaws. If not for the dozens of indisputable errors, if not for committing libel against many honest scientists, if not for the bizarre execution of a meta-analysis with a t-test which produced the opposite result from the one he should have gotten, then for this:
He failed to live up to the standards he so stridently demands of others, especially after teaching so many of us how bad an idea this sort of creative accounting is when it comes to scientific evidence.