I know not everyone wants to read a whole series on Scott Alexander’s article on ivermectin, so this is my distilled overview, written for someone with little or no prior exposure to my critique so far.
It’s hard to deny that the Astral Codex Ten article titled “Ivermectin: Much More Than You Wanted To Know,” written by Scott Alexander, shifted the conversation on ivermectin in the public sphere. While the mainstream audience had made up their minds one way or another by the time he wrote it, there was a large contingent of heterodox public intellectuals who still held a somewhat-neutral stance. It is exactly those opinion leaders that were swayed by Scott’s essay:
In fact, to this very day, whenever an ivermectin study is posted on Twitter, you can be sure someone will post a link to Scott’s article in the replies—as the definitive resolution to the question—especially if the posted study is positive.
As someone who has been active in the rationalist community since the early days of lesswrong.com—and even a little earlier than that—and who has done the deepest of deep dives on the topic of ivermectin, I found the essay something of a shock. I immediately spotted a number of problems and tried to start a dialogue with Scott, hoping to get to the bottom of the issues. Not getting very far, I ended up writing an extremely thorough review of the essay, as a many, many-part series. Obviously, I don’t expect most people to read that much material—but it was essential for me to actually understand, in depth, what the core issues were with Scott’s essay. This included reading up on many of the relevant fields of statistics and medicine.
To any members of the extended rationalist community who might be reading this, I want to address a few additional thoughts: the community started in an effort to “raise the sanity waterline.” While none of us is without fault, a community that aspires to collectively produce high-quality thought and analysis that will make the world better must be responsive to reasonable claims of serious errors in its most high-profile outputs. This is not an easy article to read, both technically and emotionally, but if the “rationality” part of “rationality community” stands for anything, it must stand for updating our views as new evidence comes in—which is pointless unless we allow ourselves to be exposed to new and contradictory evidence.
On that note, I am aware that my style of writing may not be for everyone. I have done my best to keep this essay as dispassionate as possible. At this point, it’s about as good as it’s going to get. I sure wish someone more diplomatic than me could take up the mantle and articulate all this much better, but everyone seems to be busy with other projects at the moment.
1. Steelmanning Scott
As is tradition among rationalists everywhere, the best way to start a productive counter-argument is by giving the most neutral reconstruction of the position you’re arguing against as you can muster. What follows is my attempt at doing exactly this.
Scott Alexander’s argument on ivermectin, in terms of logical structure, went something like this:
Of the ~80 studies presented on ivmmeta.com in Nov ‘21, he zoomed in on the 29 “early treatment” studies.
After reviewing them, he rejected 13 for various reasons, leaving him with 16.
He also rejected an additional five studies based on the opinion of epidemiologist Gideon Meyerowitz-Katz, leaving him with 11 studies.
He ran a t-test on the event counts of the remaining studies, finding a “debatable” benefit, though he later admitted the results were stronger than that, based on a correction I provided.
He then explained that the prevalence of Strongyloides worms is correlated with how well ivermectin did in those studies, based on Dr. Avi Bitterman’s analysis.
This doesn’t explain the non-mortality results, especially viral load reductions and PCR negativity, but a funnel plot analysis—also by Dr. Avi Bitterman—indicated substantial asymmetry in the funnel.
Scott interpreted that asymmetry as publication bias, and in effect attributed any improvement seen to an artifact of the file-drawer effect.
Scott’s conclusion was that there is little if any signal left to explain once we take this principled approach through the evidence—considering all the reasons for concern—and as a result he considers it most likely that if ivermectin works at all, it would be only weakly so, and only in countries with high Strongyloides prevalence.
Here is my incredibly implausible thesis, that I never would have believed in a million years, had I not done the work to verify it myself:
Not just one, but each of the steps from 2 to 7 was made in error.
What’s more, when we correct the steps, the picture of ivermectin’s efficacy emerges much stronger than Scott represented.
Once again, if this sounds like the least plausible thing you’ve heard this month, I completely understand where you’re coming from.
To explain why I think what I think, we have to dive into the depth and breadth of what constitutes evidence-based medicine. In Sections 2-7, I will explain the issue with each of the corresponding links in this chain of reasoning, and link to longer explanations you can read until you’re fully satisfied. If anything at all is unclear or seems wrong, I am actively asking—nay, begging—you to leave a comment, explaining the issue as clearly as possible. Getting this right matters greatly to me.
2. Scott’s Study Selection
The first thing you might expect me to quibble with is the selection of studies. While there is a lot to say about that portion of the article, I do realize that much about that section is subjective; no meta-analysis matches my preferences precisely. Thus, I will keep my comments here to the methodological level.
Continuous Baseline Variable Checks
Scott’s reliance on the team of “fraud hunters” carries with it a trust in their methods—in particular John Carlisle’s method of detecting irregularities in manuscripts of clinical trials. Dr. John Carlisle has single-handedly applied his method to thousands of papers, finding very suspicious patterns in more than 100 publications. His work has led to several retractions of published randomized trials. His method essentially consists of finding differences in the baseline continuous variables of randomized trials that are so large, they would be incredibly unlikely to occur in a properly randomized trial.
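To make this concrete, here is a minimal sketch (in Python, with made-up summary statistics rather than any real trial) of the kind of check involved: asking how surprising a single reported baseline difference would be under proper randomization. The actual method goes further, combining many such p-values across variables and across trials.

```python
from scipy import stats

# Hypothetical baseline variable (say, age) as it might appear in a trial's
# Table 1: mean, standard deviation, and group size per arm. Made-up numbers.
treatment = dict(mean=48.2, std=9.1, nobs=150)
control   = dict(mean=55.9, std=8.7, nobs=150)

# Under proper randomization, baseline differences arise by chance alone, so
# the p-value of a two-sample t-test on a baseline variable should be roughly
# uniform between 0 and 1 across trials.
t_stat, p_value = stats.ttest_ind_from_stats(
    treatment["mean"], treatment["std"], treatment["nobs"],
    control["mean"], control["std"], control["nobs"],
)

# Values extremely close to 0 (or, across many variables, clustering near 1)
# are what Carlisle-style screening flags for further scrutiny.
print(f"baseline p-value: {p_value:.2e}")
```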
Carlisle himself, however, is extremely careful to note that his methods catch all sorts of errors, and should not be immediately considered evidence of academic fraud or fabrication:
Some trials with extreme p values probably contained unintentional typographical errors, such as the description of standard error as standard deviation and vice versa. The more extreme the p value the more likely there is to be an error (mine or the authors’), either unintentional or fabrication
Several other statisticians have chimed in, noting that Carlisle makes some assumptions that are not actually true. To summarize Stephen Senn:
baseline variables should not be assumed to be independent
the distribution of p-values in published randomized trials is not uniform
randomization is not always “pure” (e.g. many trials use block randomization)
These concerns don’t invalidate Carlisle’s work, but they do create a context within which its findings must be carefully considered on a case-by-case basis. F. Perry Wilson, in his excellent analysis of the Carlisle method summarizes it thusly:
With that in mind, what Carlisle has here is a screening test, not a diagnostic test. And it’s not a great screening test at that. His criteria would catch 15% of papers that were retracted, but that means that 85% slipped through the cracks. Meanwhile, this dragnet is sweeping up a bunch of papers where sleep-deprived medical residents made a transcription error when copying values into excel.
I want to stress that I’m not saying the method Carlisle pioneered is useless. It has led to genuine fraud being discovered—for example in the case of Italian surgeon Mario Schietroma—whose trials consistently threw up red flags. It is, however, extremely important to understand what claims it enables one to make, and what shaky assumptions it requires one to accept. Carlisle’s tests are a starting point for further investigation, not a one-step conviction for a bad study, never mind academic fraud, as Scott seems to be interpreting it.
Derivatives of Carlisle’s Method
While the above is true of the application of Carlisle’s method when followed strictly, people Scott cites favorably have extended those methods and unwittingly fallen into statistical traps. In particular, both Kyle Sheldrick and Nick Brown have improperly extended Carlisle’s method to dichotomous variables, making errors that can technically be described as “catastrophic.”
The error is explained formally by Dr. Daniel Victor Tausk, in this paper, with the cases of Sheldrick and Brown used as examples to show that, had the method been correctly extended, their investigations’ results would be unremarkable. Unfortunately, their mathematical error led them to make unfounded accusations instead, and it appears that Kyle Sheldrick is currently being sued in relation to those accusations. He has deleted the relevant pages from his blog.
In some cases (Elalfy et al., Ghauri et al.), Scott makes intuition-based observations that invoke Carlisle-style heuristics or their derivatives, but with no actual application of the method that can be checked.
3. Erring Towards Severity
Next, Scott excluded five studies on the say-so of Gideon Meyerowitz-Katz:
I asked him about his decision-making, and he listed a combination of serious statistical errors and small red flags adding up. I was pretty uncomfortable with most of these studies myself, so I will err on the side of severity, and remove all studies that either I or Meyerowitz-Katz disliked.
The fact that Scott accepted five additional exclusions by Gideon Meyerowitz-Katz eats into the credibility of the analysis, given that Meyerowitz-Katz has been very forthright about his point of view—when participating in this debate—since 2020, before most of these studies were published.
I’m not saying that someone’s biases being articulated in public discredits everything they’ve brought to the debate, and I’ll be the first to say that Meyerowitz-Katz has made genuine individual contributions. However, if Scott wanted to get to the truth of the matter, someone with known and declared biases should have been counterbalanced by someone able to push back on potential bad arguments. Instead, it is clear that, beyond the five excluded studies, Scott’s trust in Meyerowitz-Katz led him to endorse critiques of Biber et al. and Babalola et al. that turned out to be false.
4. Meta-analysis Paralysis
Even if we accept Scott’s study selections as given—including the additional exclusions by Gideon Meyerowitz-Katz—the argument still doesn’t hold.
Scott chose to run a t-test over the event counts of the 11 studies he ended up keeping. On that basis, he characterized the result this way:
[..] the trends really are in ivermectin’s favor, but once you eliminate all the questionable studies there are too few studies left to have enough statistical power to reach significance.
When I let him know that nobody would ever consider a t-test of unadjusted event counts to be a meta-analysis, he amended the article to say that the t-test is “overly simplistic” and that, using the standard tools researchers tend to use for a meta-analysis, the benefit he found goes from “debatable” to “clear”—but that his conclusion didn’t change.
However, the correction fails to recognize the true extent of the issue. The very method Scott uses—the paired-sample t-test, especially the way he applies it—is not just “overly simplistic.” It does not even reliably aggregate evidence in the right direction.
Here’s my reconstruction of Scott’s original analyses showing the exact same results:
And here’s what happens if we add an extra study that is large, but with a similar event ratio to the other studies:
Notice that the effect now looks less significant (i.e. the p-values go up) while a meta-analysis with a seemingly large, positive, non-outlier study added should end up with the effect looking more significant (i.e. smaller p-value). In fact, the second analysis (on the right) goes from being statistically significant to being statistically not significant. There is much debate in the meta-analysis world about what approach is best to use, but I think it’s a fairly unanimously held view that when strong, non-outlier evidence in favor of rejecting the null hypothesis is added, the p-value of the result should shift in the direction of rejecting the null hypothesis.
A t-test on event counts is “overly simplistic” in the same way as tossing a coin to see if you have lung cancer is. Scott’s improvised method fails at the basic task of reliably accumulating evidence in the expected direction.
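To see this failure mode without the original spreadsheets, here is a minimal sketch using made-up event counts (not Scott’s actual numbers). Eleven small studies all favor treatment; adding a twelfth, much larger study with a similar event ratio, which is more evidence in the same direction, makes the paired t-test less significant rather than more.

```python
from scipy import stats

# Made-up event counts for 11 small studies: events (e.g. deaths) in the
# control arm vs. the treatment arm, with treatment consistently lower.
control   = [5, 8, 4, 7, 6, 9, 5, 7, 6, 8, 4]
treatment = [3, 6, 3, 4, 5, 6, 4, 5, 4, 5, 3]

print(stats.ttest_rel(control, treatment))  # small p-value: looks "significant"

# Append a large study with a similar event ratio, i.e. more evidence in the
# same direction. A sound aggregation method should become MORE significant.
control.append(400)
treatment.append(250)

# The p-value instead goes UP: the test treats the big study's large absolute
# difference as extra variance rather than extra evidence.
print(stats.ttest_rel(control, treatment))
```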
What should we use instead? The standard approach most people use is the DerSimonian and Laird method: from Dr. Tess Lawrie’s team, to Gideon Meyerowitz-Katz (and again), to Cochrane, to Andrew Hill (and of course to Ivmmeta, if that matters). As the Wikipedia article on meta-analysis puts it:
The most widely used method to estimate between studies variance (REVC) is the DerSimonian-Laird (DL) approach.[43] Several advanced iterative (and computationally expensive) techniques for computing the between studies variance exist (such as maximum likelihood, profile likelihood and restricted maximum likelihood methods)
However, a comparison between these advanced methods and the DL method of computing the between studies variance demonstrated that there is little to gain and DL is quite adequate in most scenarios.[47][48]
There are other approaches, and there are also questions about what the limits of applicability of DerSimonian-Laird are. Obviously, if I had my choice of methods, I’d go with a Bayesian approach. But instead of having to defend a fairly niche meta-analysis method, let’s see what the standard approach shows us:
As you can see, the straightforward combination of studies—without needing to modify the selected endpoints—gives us a statistically significant result at p=0.03, with a 55% improvement estimate. The value Scott originally got from the same endpoints via the t-test was p=0.15, which was not statistically significant. When we add back the studies that were excluded solely on the recommendation of Gideon Meyerowitz-Katz, the result strengthens considerably:
We now see a 64% improvement with a p-value that is simply represented as p<0.0001. This is after we exclude all the studies that Scott intended to exclude, almost half the original set.
To be clear, I’m not saying I endorse this meta-analysis as a valid result, though I’d be lying if I said it means nothing. What I’m saying is that this is the expression of Scott’s logic when implemented in a principled, or at least “best practices” way.
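For readers who want to see what the “standard approach” involves mechanically, here is a minimal sketch of DerSimonian-Laird random-effects pooling of odds ratios, written in Python over made-up 2x2 counts rather than the actual trial data.

```python
import numpy as np
from scipy import stats

# Made-up per-study 2x2 data: (events_tx, n_tx, events_ctrl, n_ctrl)
studies = [
    (2, 100, 6, 100),
    (1,  50, 4,  50),
    (3, 150, 7, 150),
    (0,  60, 3,  60),
    (5, 200, 9, 200),
]

def log_odds_ratio(a, n1, c, n2):
    """Per-study log odds ratio and its variance (0.5 correction for zero cells)."""
    b, d = n1 - a, n2 - c
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return np.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

y, v = np.array([log_odds_ratio(*s) for s in studies]).T

# DerSimonian-Laird estimate of the between-study variance tau^2
w = 1 / v                                    # fixed-effect (inverse-variance) weights
fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - fixed) ** 2)             # Cochran's Q
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)

# Random-effects pooling: tau^2 is added to each study's variance before weighting
w_re = 1 / (v + tau2)
pooled = np.sum(w_re * y) / np.sum(w_re)
se = 1 / np.sqrt(np.sum(w_re))
p = 2 * stats.norm.sf(abs(pooled / se))

print(f"pooled OR {np.exp(pooled):.2f} "
      f"[{np.exp(pooled - 1.96 * se):.2f}, {np.exp(pooled + 1.96 * se):.2f}], p = {p:.3f}")
```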
5. Enter Mind Worms
Even if we accept Scott’s original meta-analysis conclusion of a borderline result, the prevalence of the parasite Strongyloides stercoralis does not appear to explain it.
For one, the version of the hypothesis that was initially published with Scott’s article fails to clear the hurdle of statistical significance by a substantial amount (p=0.27):
Scott updated that analysis with a later iteration—that indeed cleared that hurdle—though the update is not noted in the article itself.
The version in the article today is substantially the same as the one that was eventually published in JAMA. Quoting from Scott’s article:
Dr. Avi Bitterman carries the hypothesis to the finish line:
The good ivermectin trials in areas with low Strongyloides prevalence, like Vallejos in Argentina, are mostly negative. The good ivermectin trials in areas with high Strongyloides prevalence, like Mahmud in Bangladesh, are mostly positive.
The new analysis makes important changes:
It adds two new studies (Fonseca, Okumus).
It collapses the three prevalence categories (Low/Medium/High) into two (Low/High).
It uses a different data source for Strongyloides prevalence, but only for the Brazilian studies (TOGETHER, Fonseca).
As a result of these changes, the case does appear to strengthen:
The effect in high-prevalence regions appears much stronger.
The difference between subgroups is now statistically significant at p=0.03.
As discussed in depth in my long-form article, the second version contains a serious flaw which negates its statistical significance. While the paper says it is using only prevalence estimates from sources using parasitological methods (basically, stool sample examination), one of its two sources—used to obtain prevalence estimates for 10 of the 12 studies—in fact uses an adjusted blend of parasitological and serological methods (blood sample analysis). To demonstrate this fairly intuitively: the estimate for Brazil from one source (Paula et al.) falls outside the 95% confidence interval of the other source’s (Buonfrate et al.) estimate for the same country. Therefore, the two datasets cannot properly be used together without some adjustment.
When I did a fairly simple adjustment to line up the two data sources, the correlation in the dichotomous analysis weakened to the point of being even weaker than in the original version (p=0.35), or as a frequentist might say, “there’s no difference:”
In the meta-regression, similarly—while visually there seems to be some hint of a correlation—by any formal standard, it’s not something that would merit additional attention:
As Dr. Daniel Victor Tausk, who ran the meta-regression wrote (emphasis mine):
In the JAMA paper they estimated a = −0.0983 with p-value = 0.046 for the test of a=0. I was able to reproduce these numbers exactly (as well as the entire forest plot in the JAMA paper and the recalculated forest plot from your article).
If I rerun the metaregression replacing the JAMA prevalence with your adjusted numbers I get:
a = -0.0517, with p-value = 0.3413 for the test of a=0.
So, the original effect size was reduced and it went from barely significant to not significant at all.
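To make it clearer what the slope “a” is measuring, here is a crude sketch: a weighted least-squares fit of each study’s log effect against the Strongyloides prevalence assigned to its location, with made-up numbers. The published analysis and Dr. Tausk’s rerun used proper meta-regression software; this fixed-weight version is only meant to illustrate the idea.

```python
import numpy as np
import statsmodels.api as sm

# Made-up illustration: per-study log risk ratios, their variances, and the
# Strongyloides prevalence (%) assigned to each study's location. These are
# NOT the JAMA numbers; they only show what the slope "a" is measuring.
log_rr     = np.array([-0.9, -0.6, -0.2, -0.1, -0.7, 0.05, -0.4, -0.05])
variance   = np.array([0.20, 0.15, 0.05, 0.04, 0.25, 0.03, 0.10, 0.06])
prevalence = np.array([22.0, 18.0, 3.0, 1.5, 25.0, 2.0, 12.0, 4.0])

# Crude fixed-weight meta-regression: WLS of effect on prevalence with
# inverse-variance weights. (A random-effects meta-regression would also fold
# a between-study variance term, tau^2, into the weights.)
X = sm.add_constant(prevalence)
fit = sm.WLS(log_rr, X, weights=1 / variance).fit()

a, p_a = fit.params[1], fit.pvalues[1]
print(f"slope a = {a:.4f} per prevalence point, p = {p_a:.3f}")
```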
Let’s Play Hypothesis Roulette!
Since I had the setup of the meta-analysis done already, I decided to try some alternative hypotheses. The first one I tried was whether the presence of doxycycline—which is known to be synergistic with ivermectin in vitro and in vivo—might explain some of the results.
I reclassified the same studies as the JAMA Strongyloides paper by whether the study administered doxycycline in addition to ivermectin or not:
Unfortunately, with only two studies administering doxycycline in the set, it’s hard to draw firm conclusions. The p-value for subgroup differences is 0.14 though—still smaller than the corrected value for the Strongyloides hypothesis at 0.35.
In other words, by frequentist standards, the “doxycycline synergy” hypothesis must take precedence over the “Strongyloides co-infection” hypothesis. Besides, we are a lot more certain that the patients took doxycycline than we are that they suffered from hyperinfection that their doctors, despite being in endemic areas, didn’t recognize and treat on time. That additional layer of uncertainty must count for something also.
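For anyone unfamiliar with the term, a “p-value for subgroup differences” typically comes from a chi-squared (Q) test comparing the subgroup summaries against their precision-weighted average. Here is a minimal sketch on made-up subgroup summaries, not the actual doxycycline numbers.

```python
import numpy as np
from scipy import stats

# Made-up subgroup summaries: pooled log risk ratio and its standard error
# for each subgroup (e.g., "with doxycycline" vs. "without").
mu = np.array([-1.0, -0.2])    # pooled log effects per subgroup
se = np.array([0.45, 0.15])    # their standard errors

# Standard test for subgroup differences: a chi-squared (Q) test of the
# subgroup estimates against their precision-weighted average.
w = 1 / se ** 2
mu_bar = np.sum(w * mu) / np.sum(w)
Q_between = np.sum(w * (mu - mu_bar) ** 2)
p = stats.chi2.sf(Q_between, df=len(mu) - 1)

print(f"Q_between = {Q_between:.2f}, p for subgroup differences = {p:.3f}")
```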
RCTs Are Hard, Especially in Latin America
I did try one more hypothesis: ivermectin use in the community was highlighted by Nature magazine in late 2020 as an invalidating factor for any RCT that might take place in Latin America:
The core point is that the widespread use of ivermectin made local studies very hard to execute:
Some early studies in cells and humans hinted that the drug has antiviral properties, but since then, clinical trials in Latin America have struggled to recruit participants because so many are already taking it.
“Of about 10 people who come, I’d say 8 have taken ivermectin and cannot participate in the study,” says Patricia García, a global-health researcher at Cayetano Heredia University in Lima and a former health minister for Peru who is running one of the 40 clinical trials worldwide that are currently testing the drug. “This has been an odyssey.”
Still, researchers might never have sufficient data to justify ivermectin’s use if its widespread administration continues in Latin America. The drug’s popularity “practically cancels” the possibility of carrying out phase III clinical trials, which require thousands of participants — some of whom would be part of a control group and therefore couldn’t receive the drug — to firmly establish safety and efficacy, says Krolewiecki.
I wondered if perhaps the Latin American use in the community would explain the difference seen between various studies of ivermectin across the world. So I repeated the analysis I did for doxycycline—using the exact studies and results that the Strongyloides hypothesis was based on—but classifying for whether the study was in Latin America or not:
We have a big difference between the two groups: in Latin American studies, we see a very small effect estimate (5% mortality reduction) with no indication of becoming statistically significant. In the trials from every other part of the world, we see a strong reduction in mortality (55% mortality reduction), with statistical significance. The difference between the two groups is also statistically significant, at p=0.03.
In my detailed essay on this hypothesis, I also reviewed Google Trends data, investigator statements, exclusion criteria employed by the trials, adverse events profiles, potential lack of statistical power, as well as other evidence such as sales data, which seems to confirm the hypothesis.
While all signs point to Latin American community use being a significant factor, this is still an observational comparison between studies in different locations, so I don’t think I’m comfortable enough to declare this “the answer.” What must be noted, however, is that privileging the “Strongyloides co-infection” hypothesis over the “doxycycline synergy” and the “Latin American community use” hypotheses is not a defensible position, given the state of the evidence.
6. Viral Funnel Blues
Even if Strongyloides co-infection explained the mortality results, as Scott notes, it doesn’t explain results that are earlier in the causal chain, especially ones around PCR negativity and viral load reduction. To this end, he quotes a different analysis by Dr. Avi Bitterman that he interprets to mean that those results are actually the result of publication bias:
Worms can’t explain the viral positivity outcomes (ie PCR), but Dr. Bitterman suggests that once you remove low quality trials and worm-related results, the rest looks like simple publication bias:
There’s a lot to say here, but let’s start with Scott’s description of the analysis:
“…once you remove low quality trials and worm-related results…”
Long story short (and the long part is available here): of the 11 studies in this analysis, two are “worm-related results,” three are studies Scott rejected for being low quality, and two had not been examined at all (one of which looks terribly low quality). So while Scott says this analysis includes no “low quality trials and worm-related results,” it actually consists mostly (6 of 11) of “low quality trials and worm-related results,” according to his own standards.
Fixed vs. Random
Even if we accept the study selection here as valid, there is one more oversight in Dr. Bitterman’s analysis that results in downplaying the effect. As Dr. Daniel Victor Tausk observed, even though the heterogeneity index I² is 80%, indicating a high degree of variation between studies, Dr. Bitterman runs the analysis using a “fixed-effect model” (aka “common-effect model”), which is inappropriate. A fixed-effect model assumes that all studies are running the exact same experiment—for instance, it would be appropriate when combining different study centers within the same clinical trial, all running the same protocol over the same period of time.
Given the extreme diversity of times, locations, and protocols used in these studies, the random-effects model is the obviously correct choice in this case.
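To make the fixed-versus-random distinction concrete, here is a minimal sketch on made-up data: a few large, near-null studies mixed with smaller, strongly positive ones. When heterogeneity is high, the common-effect estimate is dominated by the largest studies, while the random-effects estimate spreads the weights out, and the two can diverge substantially.

```python
import numpy as np

# Made-up heterogeneous studies: two large, near-null trials alongside
# smaller, strongly positive ones (log risk ratios and their variances).
y = np.array([-0.05, -0.10, -0.90, -1.10, -0.80, -1.00])
v = np.array([0.010, 0.012, 0.150, 0.200, 0.180, 0.160])

w = 1 / v
fixed = np.sum(w * y) / np.sum(w)             # common-effect estimate
Q = np.sum(w * (y - fixed) ** 2)
I2 = max(0.0, (Q - (len(y) - 1)) / Q) * 100   # heterogeneity as a percentage

# Reuse the DerSimonian-Laird tau^2 from the earlier sketch for the
# random-effects weights.
tau2 = max(0.0, (Q - (len(y) - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
w_re = 1 / (v + tau2)
random = np.sum(w_re * y) / np.sum(w_re)      # random-effects estimate

print(f"I^2 = {I2:.0f}%")
print(f"fixed-effect RR   ~ {np.exp(fixed):.2f}")   # dominated by the big trials
print(f"random-effects RR ~ {np.exp(random):.2f}")  # weights spread more evenly
```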
The two models result in very different effect estimates:
In particular, the fixed (or common) effect model estimates a 22% reduction in PCR positivity, while the random effects model estimates a 40% reduction in PCR positivity. This matters, because in the presence of publication bias, any adjustments will take away from this estimate, which makes it a type of safety margin. As you can see, the different models also shift the results of the funnel plot significantly:
Asymmetric Warfare
Even if we accept the funnel as given, the asymmetry noted is not actually valid. For one, there are many different ways to generate and evaluate a funnel plot, so people tend to pick-and-choose:
Using 198 published meta-analyses, we demonstrate that the shape of a funnel plot is largely determined by the arbitrary choice of the method to construct the plot. When a different definition of precision and/or effect measure were used, the conclusion about the shape of the plot was altered in 37 (86%) of the 43 meta-analyses with an asymmetrical plot suggesting selection bias. In the absence of a consensus on how the plot should be constructed, asymmetrical funnel plots should be interpreted cautiously.
The meta-analysis linked to the funnel plot shows a heterogeneity metric I² of 80% with p < 0.00001. What does this mean? According to Cochrane, that means there’s somewhere between “substantial” and “considerable” heterogeneity. In other words, lots and lots.
John Ioannidis and Thomas Trikalinos did a study of meta-analyses to see where funnel plots were used inappropriately. They set a “not very demanding” threshold of 50% heterogeneity, above which they did not consider the use of the various statistical tests to be meaningful. They also tried an alternative analysis, with “even more lenient” criteria, one of which was heterogeneity no greater than 79%. The asymmetry result in the funnel plot Scott references does not qualify as “meaningful or appropriate” under either set of criteria. As the authors note:
Some debate about the extent to which criteria need be fulfilled for asymmetry tests to be meaningful or appropriate is unavoidable. The thresholds listed above are not very demanding, based on the properties of the tests.
Now, a few years later, Ioannidis was part of a 19-author consensus paper subtly titled “Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials.” That paper does give us a way forward if we indeed insist on going down the route of evaluating asymmetry:
If there is substantial between study heterogeneity (the estimated heterogeneity variance of log odds ratios, τ2, is >0.1) only the arcsine test including random effects, proposed by Rücker et al, has been shown to work reasonably well.
I asked Dr. Daniel Victor Tausk for help running that analysis, and he sent me this back:
The p-value for this analysis—if we use Egger’s test—is 0.09, but that is exactly what the consensus paper warns against, since Egger’s test does not take heterogeneity into account. If we use AS-Thompson-REML as implemented in the relevant R package, we get p=0.18. If we use a different variant of the same test, AS-Thompson-moments, which better conforms to what was recommended in Thompson’s paper, we get p=0.17. I cite all these tests because their results might differ, but the one thing they do not seem to be in conflict about is whether there is “statistically significant” asymmetry: none of them finds any. You can see here what a funnel that takes between-study heterogeneity into account would look like:
As you can see, there’s more width at the top, to account for the fact that there is very large heterogeneity between the studies. But even visually, it’s hard for me to see any compelling asymmetry, and the tests—when run appropriately—seem to agree.
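For readers unfamiliar with these asymmetry tests, here is a minimal sketch of the simplest one, Egger’s regression, on made-up effects and standard errors. It is only meant to show mechanically what such a test does, and that it has no term for between-study heterogeneity; the arcsine-Thompson variants quoted above address exactly that gap and were run in R.

```python
import numpy as np
import statsmodels.api as sm

# Made-up per-study effects (log odds ratios) and standard errors,
# with the smaller studies showing the larger effects.
y  = np.array([-1.20, -0.90, -0.70, -0.50, -0.45, -0.30, -0.15, -0.10])
se = np.array([ 0.60,  0.50,  0.40,  0.30,  0.28,  0.20,  0.12,  0.08])

# Egger's regression test: regress the standardized effect (y / se) on
# precision (1 / se). An intercept far from zero suggests funnel asymmetry.
# Note there is no tau^2 term anywhere: between-study heterogeneity is
# simply not modeled, which is the objection raised above.
X = sm.add_constant(1 / se)
fit = sm.OLS(y / se, X).fit()

print(f"Egger intercept = {fit.params[0]:.2f}, p = {fit.pvalues[0]:.3f}")
```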
7. Reading the Tea Leaves
Even if we accept that there is valid asymmetry in the results of the funnel plot, the conclusion of publication bias still doesn’t hold.
The heterogeneity of the studies being plotted can make the results appear as if a bias exists, when in fact, what is visualized is between-study heterogeneity:
As Choi and Lam explain in “Funnels for publication bias – have we lost the plot?”:
Despite Egger's careful explanation of the possible causes of funnel plot asymmetry, asymmetry is more often than not solely attributed to publication bias. But in addition to publication bias and pure chance, what other factors contribute to funnel plot asymmetry?
True heterogeneity in intervention effects is one such factor. For example, a significant effect may only be seen in high-risk patients, and these patients were more likely to be recruited into the smaller, early trials. Larger, multi-centre interventions involving hundreds of patients may be more difficult to implement when compared with smaller trials, and, in addition, there may be methodological differences between centres. In such a case, the data from the smaller, better-controlled study may be more precise than the larger, and perhaps less vigorously implemented study involving a more heterogeneous group of participants.
As I’ve tried to do throughout this essay—instead of simply giving a statistical argument and moving on—I will also demonstrate my thoughts in practical terms.
I looked into the specifics of the PCR positivity endpoint used. I also added the recruitment mid-point for each study. Here’s the amended table:
The endpoints on this meta-analysis are measuring the PCR positivity of patients. So what happens if we separate them by when that PCR test was taken?
Strangely enough, the strongest (leftmost) results tend to be those which waited the most days to take the test, allowing time for the treatment to work (blue, three days). Given what we’ve already discussed about PCR tests and dead nucleotides, these are the least surprising results the world has ever seen, and they point to very different explanations for the asymmetry observed.
What if we—instead—color the results by the midpoint of the recruitment period, as a way to distinguish earlier from later studies?
It seems the May 2020 studies fall to the left, the July and August 2020 studies cluster towards the middle of the group, and the Fall 2020 studies are to the right. As time passes, we see a move towards the right (smaller effect), except that the one study with a midpoint in March 2021 (Aref) seems to be an outlier, falling closer to the May 2020 studies. You heard it here first, folks:
FOR IMMEDIATE PRESS RELEASE
NEW JOURNAL OF THE AMERICAN MEDICAL JOURNAL OF MEDICINE (NJAMJM):
IVERMECTIN ONLY WORKS IN THE SPRING AND A LITTLE IN THE SUMMER
Or… with so few data points and so much heterogeneity, you can prove almost anything. You don’t have to believe me that these are the first two classifications I tried, but it’s true.
8. Reining It In
As we saw by exploring the logical backbone of Scott’s argument, it’s not that the chain of inference has a weak link, but more that each link in the chain is weak. Even if you’re not convinced by each of the arguments I make (and I do think they’re all valid), being convinced by one or two of these arguments makes the whole structure collapse. In brief:
Scott’s starting point of the early treatment studies from ivmmeta is somewhat constrained, but given the number of studies, it should be sufficient to make the case.
If we accept the starting point, we must note that Scott’s filtering of the studies leans over-eagerly on methods, such as John Carlisle’s, that are simply not able to support his definitive conclusions. Worse, some of his sources modify Carlisle’s methods in ways that compromise any usefulness they might originally have had.
Even if we accept Scott’s filtering of studies, though, throwing out even more studies based on trust in Gideon Meyerowitz-Katz, without any opposing argument, is all but certain to shift the results in the direction of finding no effect.
Even if we accept the final study selection, the analysis methodology is invalid.
Even if it wasn’t, the Strongyloides co-infection correlation is not the best explanation for the effect we see.
Even if it was, it can’t explain the viral load and PCR positivity results, but Scott offers us a funnel plot that he claims demonstrates publication bias. However, that plot should have been computed with a random-effects model, not a fixed-effect one. Also, if we use the only available test that is appropriate for studies with such high heterogeneity, there is no asymmetry to speak of.
Even if there was, funnel plot asymmetry doesn’t necessarily imply publication bias, especially in the presence of heterogeneity, so Scott’s interpretation is unjustified.
When we look at the evidence, sliced and diced in different ways, as Scott did, we consistently see a signal of substantial efficacy. And even though the original meta-analysis from ivmmeta.com Scott started from can be criticized for pooling different endpoints, the viral load results do not suffer from such concerns, and still show a similar degree of efficacy.
Each of bullet points 2-7 is detailed in the section with the same number.
As I mentioned in my original response, Scott’s argument requires each of these logical steps to be correct. All of them have to work to explain away the signal. It’s not enough for a couple of them to be right, because there’s just too much signal to explain away.
In short, I like support for my positions to be linked mostly with OR operators, and am suspicious of arguments held together by multiple AND operators.
The Moral Argument
While the above cover the logical flaws in Scott’s argument, before closing, I need to highlight what I see as moral flaws. In particular, I found his flippant dismissal of various researchers to be in contradiction to the values he claims to hold. I will only highlight some of the most egregious examples, because it is deeply meaningful to set the record straight on these. (Click on the study name for more in-depth analysis about what my exact issue with each accusation is):
Biber et al.—Accused of performing statistical manipulations of their results when, in fact, the change in question was extremely reasonable, had been done with the blessing of their IRB, and the randomization key of the trial had not been unsealed until after recruitment had ended.
Cadegiani et al.—The most substantial accusation Scott levels at Cadegiani is that a different paper from the one examined shows signs of randomization failure. For this, he is dragged throughout the essay as a fraudster. While Scott has partially corrected the essay, it still tars him as a fraudster accused of “crimes against humanity.” If some terms should not be thrown around for effect, I put to you that this is one of them.
Babalola et al.—Lambasted for presenting impossible numbers despite the fact that Kyle Sheldrick had already reviewed and cleared the raw data of the study. A commenter on Scott’s post demonstrated very clearly how the numbers were not impossible at all, but instead a result of practices that pretty much all clinical trials follow.
Carvallo et al.—Accused of running a trial that the hosting hospital didn’t know anything about. As it turns out, Carvallo presented paperwork demonstrating the approval by the ethics board of that hospital, which Buzzfeed confirmed. The accusation is that a different hospital—from which healthcare professionals had been recruited—did not have any record of ethics approval for that trial, though the person who spoke to Buzzfeed admitted that it may not have been needed. After all, the exact same pattern is visible in the Okumus trial, where four hospitals participated but the IRB/ethics approval is from the main hospital only. The issue with Carvallo—that most recognize—is that he didn’t record full patient-level data, only summaries. That could have been OK had he been upfront about it, but instead he made a number of questionable statements that he was called out on. Given this history, it is sensible to disregard the trial. But this is very different from the accusations of fraud that Scott makes.
Elalfy et al.—Accused, multiple times in Scott’s piece, of incompetence for failing to randomize their groups. The paper states in six separate places that it is not reporting on a randomized trial, among them on a diagram that Scott included in his own essay. Hard to imagine how else they could have made it clear.
Ghauri et al.—Scott accuses the authors of having suspicious baseline differences but without actually running the Carlisle tests to substantiate his claims. Reviewing the same data, I am entirely unconvinced.
Borody et al.—“this is not how you control group, %$#^ you.” This is what Scott has to say to the man who invented the standard therapy for H. pylori, saving a million lives—by conservative estimates. To top things off, what was done in that paper was an entirely valid way to construct a control group. Maybe they should have recruited more patients to make their data even more compelling—and they would have, had the Australian regulator not banned the use of ivermectin for COVID-19, even in the context of a trial.
To be extremely clear, I’m not saying that Scott should necessarily have kept one or more of these trials in his analysis, only that he failed to treat others as he would like them to treat him.
Let’s Pull It All TOGETHER
With all this behind us, how can we make some forward progress on answering the question at hand?
One of Scott’s main recommendations is that a social norm be created that promotes the sharing of raw data. On this we agree. In particular, I’m very concerned that the three big-name studies that were heavily publicized (TOGETHER, ACTIV-6, COVID-OUT) have not—to my knowledge—provided any raw data to anyone outside the study.
Perhaps there’s something positive that can come of all this, instead of a zero-sum stare-down. Scott has a loud enough voice to put real pressure on the TOGETHER trial to release its raw data—as it has promised to do on multiple occasions. It’s been almost 15 months since the initial slide-set was released—setting the world’s headlines on fire—and seven months since the ivermectin paper was released, driving yet another news cycle.
This is what the master protocol released at the same time as those original results said about data sharing:
Unfortunately, multiple requests to the authors have been redirected to ICODA. However, when that organization responded, it seemed to have no data and to want nothing to do with this:
The authors are now redirecting requests to a different platform called Vivli, which may or may not eventually release the data. What I’ve seen of their terms of use is so restrictive as to make any effort pointless: any proposal for using the data must be approved by the study authors, and must follow the pre-submitted protocol. In other words, for the trial’s integrity to be checked, the TOGETHER authors would have to approve a proposal that clearly articulates an intention to check their trial for data manipulation.
To this day, nobody outside the trial has seen raw data, including the organizations that funded the trial (I say this having direct knowledge). This, despite repeated promises to do so, and in the face of serious concerns. Try for just one minute to think what would have happened if a large-scale pro-ivermectin study had appeared to systematically be avoiding sharing its data with auditors—or with anyone at all—including the people who funded it.
According to Scott’s own stated values, he should want this data to be shared. To put some of my own skin in the game, if Scott helps, by public advocacy or otherwise, to get the raw data for the ivermectin, metformin, and fluvoxamine studies available in a way that I—or someone I trust—can access it without undue limitation on sharing any findings, I commit to making a $25,000 donation to Scott’s ACX grants. Happy to discuss reasonable alterations of this offer.
I will stick my neck out further and register a prediction: should that happen, I will be able to demonstrate, fairly quickly and beyond reasonable doubt, that the ivermectin results—as published—were substantially and improperly distorted.
And if I’m wrong, I wish to know it.