The Potemkin Argument, Part 19: My Scientific Takeaway—Or—Why You Still Can't Box Intelligence
This is a public peer review of Scott Alexander’s essay on ivermectin. You can find an index containing all the articles in this series here.
If I’ve learned one thing in this whole ordeal, it’s that we’re looking at an arms race.
Peer review was instituted as a way to guarantee quality, but peer reviewers don’t have time to actually reconstruct papers, so their comments are mostly superficial. A well-written paper (perhaps by a dedicated medical writer) will pass those hurdles with flying colors.
Scott cites a laundry list of countermeasures:
Check for selection bias. Distrust “adjusting for confounders”. Check for p-hacking and forking paths. Make teams preregister their analyses. Do forest plots to find publication bias. Stop accepting p-values of 0.049. Wait for replications. Trust reviews and meta-analyses, instead of individual small studies.
Here’s the problem: I’ve seen these followed or not followed in Scott’s analysis, depending on the subject or the outcome of the study in question. And whatever it is you’re supposed to do turns out, when convenient, to be exactly what you’re not supposed to do. So these weapons, too, have crumbled in our hands; if they ever were weapons, that is.
He continues with a list of other measures, including Carlisle’s method and the GRIM test. He concludes by saying that researchers really must start releasing raw data as a norm. Let’s sidestep for a minute the fact that, with Scott nowhere near the forefront of the fight to get the TOGETHER trial to release its raw data, all of this rings hollow.
If you were a researcher determined to defraud the medical establishment for fun or profit, what would you do in such circumstances? Would you, perhaps, use a clinical trial simulator such as this one by Cytel to generate data that looks just imperfect enough to be perfect? (Yes, that’s the same Cytel that “designed and led the TOGETHER Trial”). Now, I’m no Cytel, but I’ll bet you I can write a generator that makes data that pass all the relevant checks. Not because I’m particularly clever, but because any half-decent coder with a bit of stats knowledge can write one. If the checks can be operationalized in a test suite, you can literally keep trying until you make it. Hell, you can probably write a test suite and connect it to a fuzzer, and you’ll be most of the way there. Every time another test becomes popular, simply plug it in and become New Test Compliant (TM) in a few hours.
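To make that concrete, here is a minimal sketch of what such a loop could look like. To be clear, this is not Cytel’s simulator or anyone’s actual pipeline; the distributions, the checks, and the thresholds are all illustrative assumptions of mine. The only thing it shows is the shape of the attack: once the screens are operationalized as code, a generate-and-retry loop passes them by construction.

```python
# Illustrative sketch only: fabricate two-arm "trial" data and keep retrying
# until it clears a small suite of automated screens. All numbers and checks
# are assumptions for demonstration, not any real trial's parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng()

def fake_trial(n=200, effect=0.5):
    """Generate individual-level outcomes for a control and a treatment arm."""
    control = np.round(rng.normal(loc=10.0, scale=3.0, size=n), 1)
    treated = np.round(rng.normal(loc=10.0 - effect, scale=3.0, size=n), 1)
    return control, treated

def passes_screens(control, treated):
    """Operationalized versions of a few popular checks (again, illustrative)."""
    # 1. Headline p-value: significant, but not a suspicious 0.049.
    p = stats.ttest_ind(control, treated).pvalue
    if not (0.001 < p < 0.04):
        return False
    # 2. Carlisle-style plausibility: arm variances neither "too perfect" nor too far apart.
    var_ratio = np.var(control) / np.var(treated)
    if not (0.8 < var_ratio < 1.25):
        return False
    # 3. Terminal-digit uniformity, a common data-fabrication screen.
    digits = np.rint(np.concatenate([control, treated]) * 10).astype(int) % 10
    return stats.chisquare(np.bincount(digits, minlength=10)).pvalue > 0.10

# The "fuzzer": roll new datasets until every screen comes back green.
attempts = 0
while True:
    attempts += 1
    control, treated = fake_trial()
    if passes_screens(control, treated):
        break
print(f"New Test Compliant (TM) after {attempts} attempts")
```

Add a new screen to the field’s toolkit and the fabricator adds a clause to `passes_screens`; the loop does the rest.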
So what, then, is left to test for?
Here’s the problem: science, like the internet, used to run on the honor code. But as the fields became more esoteric, the experiments harder to replicate, and the money and fame hanging in the balance of various studies ever greater, fabrication became more and more attractive.
Yes, it’s enticing to think that if we can just find that one extra test, we can finally make a dent in scientific fraud. How do we feel we’re doing on that front? We live in a world where superstimuli routinely become even better than the real thing. So all this accomplishes is to weed out the small-time criminals, while concentrating all the gains in the hands of the truly competent sociopaths. Much like the war on drugs, this is not the kind of tournament you want to run.
My point is that this is an incentives game. As long as we treat it as a data-checking game, we’ll improve the quality of the fakes we get, while shutting out honest, well-meaning researchers who don’t have the funds available to keep up with the demands to prove they’re not elephants.
Goodhart’s law says it best: "When a measure becomes a target, it ceases to be a good measure." There’s a book that is not for the faint of heart but that everyone who wants to pronounce on policy should read: The Tyranny of Metrics, by Jerry Z. Muller.
This book goes through the myriad ways in which metric-driven thinking gives us the sense that we’re working to improve things while the ground recedes from under our feet, like running up a descending escalator. Your “steps climbed” metric is going through the roof, but you’re not going anywhere.
Every university student who picks an “easy A” class over one that will challenge them, because they want to graduate with a better average, is participating in the dissolution of the value of the grade as a metric. It’s not their fault. A measure that becomes a target ceases to be a good measure. Every surgeon who declines a challenging surgery to keep their success rate high, every police officer who records a crime under a lower classification than it deserves to make their department’s statistics look better, every salesperson who offers a deep discount to hit this quarter’s targets, and every academic who cites papers from the journal they are hoping to publish in are all participating in the same dynamic. Manipulation of medical studies is but a tiny facet of this civilization-scale problem.
Wouldn’t you know it, it’s exactly this problem that I wrote about on lesswrong.com, the online community where I first came across Scott’s work, more than a decade ago.
The first thing a newly hatched herring gull does after breaking out of its shell is to peck at its mother’s beak, which prompts her to give it its first feeding. Puzzled by this apparently automatic recognition of its mother, the Dutch ethologist and ornithologist Nikolaas Tinbergen conducted a sequence of experiments designed to determine what precisely the newborn herring gull was attracted to. After experimenting with facsimiles of adult female herring gulls, he realized that the beak alone, without the bird, would elicit the response. Through multiple further iterations he found that the characteristics the newborns were attracted to were thinness, elongation, redness, and an area of high contrast. Thus, the birds would react much more intensely to a long red stick-like beak with painted stripes on the tip than they would to a real female herring gull.
This is where we are: trying to tell apart long red beak-like sticks with painted stripes on their tips from the real deal. You won’t be surprised to hear I have been thinking about this problem a little bit over the last decade. And I have two bags of solutions to offer.
The Read-Only Answer
It comes with many names. Robotics calls it “sensor fusion,” hedge funds call it using “alternative data,” evolutionary psychology calls it “nomological networks of cumulative evidence,” social sciences call it “methodological triangulation,” astronomers call it “multi-messenger astronomy,” in biology it’s called “multisensory integration,” and in vision it’s called “parallax.”
The idea itself is very simple, though it took me a decade to absorb it. If you’re worried about implementation issues in your sensors, use many different sensors. If you’re seeing something, your eyes may be playing tricks on you. But if you’re seeing, smelling, touching, and tasting something, well, at that point, it’s not an artifact of your senses. It would be extraordinarily unlikely (read: impossible) for all your senses to misfire in the same way at the same time without some unifying cause, either in your brain, or in the world. I’ve written more about this idea here:
Smart people like Daniel Schmachtenberger and Tom Beakbane call it Consilience.
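To put a rough number on that intuition, here is a back-of-the-envelope calculation. The 10% per-channel error rate is purely an assumption of mine for illustration; the point is only how fast the chance of every independent channel misleading you at once collapses as channels are added.

```python
# Toy arithmetic: probability that k independent evidence channels all mislead
# you at the same time, assuming (purely for illustration) a 10% error rate each.
error_rate = 0.10
for k in range(1, 6):
    print(f"{k} independent channel(s): P(all misleading at once) = {error_rate ** k:.0e}")
```

And that is before asking them to misfire in the same direction, which is rarer still unless something upstream is unifying them.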
The Read-Write Answer
Standard approaches to combining signals get really tough to swing when determined opponents are flooding every channel with different narratives. So here’s another approach I’ve taken to figuring things out during the confusing times of COVID: correcting people.
Consider these two groups of people:
Group one: Bret Weinstein, Norman Fenton, Pierre Kory, Harvey Risch, and ivmmeta.com.
Group two: David Fuller, Claire Lehmann, Yuri Deigin, Sam Harris, Nature Magazine.
You might think you know what separates these two groups, but it’s not what I’m actually getting at. Throughout the pandemic, I’ve read and found errors in all sorts of materials. A simple test of intellectual honesty is to bring these errors to people’s attention to see what they do—especially when the errors cut against their preferred narrative direction. For people in the first group, I’ve brought errors I happened to find to their attention, and they’ve changed course. Often they’ve made edits to materials they had published, sometimes made loud, public corrections when the mistake warranted it. For the second group, when I’ve brought errors to their attention, they’ve retreated to denial and (often) attacks. The kind of errors I’m talking about are not subtle subjective things, either.
Let me give you a particularly public example: I picked three straightforwardly black-and-white errors in a deeply disturbing piece by Quillette and brought them to the attention of its editor:
You might think Claire Lehmann ignored me, or perhaps that she said these are not important enough to address, or possibly that she doesn’t have time for any of this. If you thought that, you’d be wrong:
The first response is as self-explanatory as it is revealing. The second response—as you can see in the thumbnail—is a podcast with different guests than the ones mentioned in the article in question, and the third response does not address the issue I raised.
You might think this was a fruitless exchange, but you’d be wrong. Here’s the thing: by seeing the response to a request for a correction, I learned something important. I learned the answer to the question "will they incorporate feedback when it comes to them?" If the answer is no, I can cast doubt on the entire content of someone’s feed or publication, since, if there were a flaw and someone somewhere found it, I would not expect them to admit it.
On the other hand, when I address corrections to others and they incorporate them, this gives me confidence that they are conscientious about the epistemic quality of their work, and I can let my guard down somewhat (but only somewhat!), knowing that they probably do the same when others bring corrections to them.
This fairly simple rule has been invaluable to me in figuring out who I should pay attention to during the pandemic. As you can see, the errors don’t have to be important at all to reveal valuable information. In this way, one can bootstrap a little bit of knowledge into more and more knowledge—though the work is never done. I’m not claiming this is an easy path, only that it’s possible.
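For what it’s worth, you can dress this rule up in toy Bayesian terms. This is my framing, not anything formal, and the prior and likelihoods are invented for illustration; the point is just that each observed response to a correction request moves the needle, and a refusal moves it the other way.

```python
# Toy Bayesian version of the "correction test". The prior and likelihoods are
# made-up illustrative values, not measurements of anyone in particular.
P_CORRECTS_IF_CONSCIENTIOUS = 0.8   # assumed: conscientious sources usually correct
P_CORRECTS_IF_NOT = 0.1             # assumed: unconscientious sources rarely do

def update(prior, corrected):
    """Bayes update of P(conscientious) after one correction request."""
    like_c = P_CORRECTS_IF_CONSCIENTIOUS if corrected else 1 - P_CORRECTS_IF_CONSCIENTIOUS
    like_n = P_CORRECTS_IF_NOT if corrected else 1 - P_CORRECTS_IF_NOT
    return like_c * prior / (like_c * prior + like_n * (1 - prior))

p = 0.5  # start agnostic about the source
for corrected in [True, True, False]:  # two corrections issued, one stonewalled
    p = update(p, corrected)
    print(f"corrected={corrected}: P(conscientious) = {p:.2f}")
```

The exact numbers don’t matter; what matters is that the information compounds, which is the bootstrapping I’m describing.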
How Does This Apply to Academia?
Modern academic publishing has many mechanisms to protect people’s egos, and very few mechanisms to encourage, or at least theoretically allow, corrections to materials that are already out there. Moreover, it treats making corrections as something of a black mark. Even ignoring that, in an era of publish-or-perish, yet another facet of the tyranny of metrics, academics are incentivized to do new work, not to make sure their old work holds up. If I had a set of recommendations for academic publishing, they would focus on this part of the chain: making public peer review a real thing, and one that can easily lead to changes. We’re no longer running out of paper, and we no longer need to gatekeep the way we have been.
The other lesson is to embrace integrated evidence synthesis, rather than aiming for perfection with a finely tuned monoculture:
After all, reality is singular, and all valid evidence must ultimately point in the same direction. I don’t mean “same direction” as in a binary yes/no answer to the question of whether a drug works, but a nuanced, multi-dimensional answer about why, when, and for whom an intervention might be beneficial. What I’ve seen of modern medicine is an attempt to throw away as much evidence as possible so that what’s left can be declared “the answer,” when in fact much more can be done to integrate all the available signals into a coherent picture.
Part of the reason I’ve dug into the TOGETHER trial as much as I have is that it’s not enough for me to just say “Yeah well, it was fraud and the investigators were on the take.” That’s easy to do, but that’s something anyone can do with any piece of evidence. Even if there was some type of corruption involved, it’s important to understand what exactly went wrong, and how the data coming out of that trial might be reconciled with what we know from other sources.
The two approaches work together by surfacing more inconsistencies for correction. We can then examine how well various scientists react to criticism of their work, especially when that criticism is valid. Sadly, right now, the picture is abysmal. As things stand, the default response is to stonewall any inquiry and try to win the argument by playing for sympathy. It’s no wonder the predictions of the official “voices of science” have worked out far worse than those of the “conspiracy theorists” (don’t @ me) during the pandemic, seeing as all the incentives nudge toward whatever will help actual scientific progress the least.
So these are my main scientific takeaways:
Encourage signal diversity and synthesis, not monocultures, and…
…make it easy for scientists to work in the open and improve or evolve their positions as feedback comes in. The wall between scientists and the real world isn’t coming back, and that doesn’t have to be a bad thing.
Encourage reasoned dissent, and value the cost borne by those who practice it. And whatever you do, do not ostracize people for disagreeing with the orthodoxy.
Oh, and stop abusing statistical screens as a way of jumping to conclusions. That was never going to end well.
This will not solve everything, but it sure as hell would be a huge step forward.