Welcome to Science with Shrike! Today we continue our discussion of modeling. Last week we looked at some of the good that models can do, and a few pitfalls to avoid. Today, we cover some of the more problematic aspects of models.
Models for hypothesis generation
Many of the controversial uses of models come when we try to use them to predict the future. Weather forecasting is probably the most familiar example of models getting it wrong, though climate change and infectious disease spread are other models with notable prediction failures. In contrast, structural models built from NMR and X-ray crystallography data do a very good job. More recently, AlphaFold does a good job of predicting structures for proteins where no X-ray structure exists, though it still gets some parts wrong.
So models have a lot of power to predict new things. The key problem is that these predictions must be tested, which places them in the realm of hypothesis generation, not hypothesis testing. If the hypothesis is falsified upon testing, it is time for a new model.
This is not a huge problem for something like AlphaFold. In many cases, we can just solve the protein structure by X-ray crystallography and see whether AlphaFold got it right or wrong. However, for higher-stakes predictions, this becomes a challenge when we do not want to test the predictions.
Let’s increase the stakes by looking at human health. Animal models are exactly that: models. Yes, many processes are conserved between humans and mice, or even nematodes and yeast. But we’re also very different. What works in mice may not apply in humans.
Leptin is one good example of this. Knock leptin out of a mouse, and it overeats in a big way. Leptin was not as successful a target in humans. The same holds true for cancer and other disease models. If something worked in a mouse, that is a rationale supporting the hypothesis that it will work in humans. However, that hypothesis needs to be tested. This is why the FDA often requires multiple animal models prior to human trials. Even then, there are many treatment failures, and sometimes human deaths. These failures make developing new health-promoting treatments very expensive, but we can raise the modeling stakes even higher.
Consider climate models. If a model predicts 10,000,000 deaths under Condition A or 1,000,000 deaths under Condition B, we don’t want to test Condition A in the real world if we have any confidence in that model. The question becomes how much confidence we have in the model, and what our risk framework looks like.
The combination of models working beyond their assumptions, our confidence assessments, our risk frameworks, and the nature of policy ends up causing many problems. First, models have a limited ability to predict the future: there could be missing variables, some data points could be out of range, and so on. These limitations are hopefully factored into the confidence assessment. However, our confidence does not always match the model’s limitations, so that assessment may be off.
Then our risk framework may consider models that are unlikely to be correct. For example, a model that has a 10% chance of being right will be wrong 9 times out of 10. But if the 1 time it is right means catastrophe, it may still need to be considered in the risk framework. If the cost of acting is considered minor compared to the potential benefit of listening to the model, low-confidence models may play a role in decision making despite being low confidence. But enough low-confidence models may add up until the net cost outweighs the benefit.
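To make that concrete, here is a minimal sketch of the arithmetic, with entirely made-up numbers chosen only for illustration: acting on any single low-confidence model can look like a bargain, while the costs of acting on many of them quietly stack up.

```python
# Hypothetical numbers, purely to illustrate the expected-value reasoning above.
credence = 0.10        # we give each model a 10% chance of being right
harm_if_right = 1000   # harm (arbitrary units) if a model is right and we ignore it
cost_to_act = 50       # cost of acting on one model's recommendation

# Per model, acting looks like a bargain:
expected_harm_avoided = credence * harm_if_right   # 0.10 * 1000 = 100
print(expected_harm_avoided > cost_to_act)         # True (100 > 50)

# But repeat that logic across many low-confidence models and the costs stack:
n_models = 10
total_cost = n_models * cost_to_act                # 500

# If the models largely overlap (they predict flavors of the same catastrophe,
# or one mitigation already covers several of them), the expected benefit does
# NOT scale with the number of models, and the net can flip negative:
overlapping_benefit = expected_harm_avoided        # still ~100
print(total_cost > overlapping_benefit)            # True (500 > 100)
```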
If one takes a conservative risk framework, models that predict catastrophe, even when the catastrophe is low probability and the model is unlikely to be correct, get undue weighting in the final framework. Generate enough of these low-quality models, and they can influence the risk framework in ways that are not congruent with reality. The simple thinking ends up being ‘if you have 10 models, each with a 10% chance of being right, you expect 1 of those might be right’ (the more careful thinking is to calculate the odds that none of them are right, and conclude that at least one model being right is ~65% likely).
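For the record, here is the arithmetic behind that parenthetical, assuming (as the simple thinking does) ten independent models, each 10% likely to be right:

```python
# Probability that at least one of 10 independent models, each 10% likely
# to be right, turns out to be right: 1 minus the chance that all ten are wrong.
p_right = 0.10
n = 10
p_at_least_one = 1 - (1 - p_right) ** n
print(round(p_at_least_one, 3))   # 0.651, i.e. ~65%
```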
There are additional problems with this framework. First, the models might not be independent, in which case combining their probabilities as if they were is not appropriate. Second, a model is NOT evidence that something is going to happen, even though it may get used as such. Third, people get the probabilities mixed up. If a model that is only 10% likely to be true predicts a global catastrophe with a 1% chance of occurring, people will quickly conflate the model being right with the catastrophe actually happening.
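A quick sketch of that third point, using the same made-up numbers: the catastrophe has to pass through both filters (the model being right, and the catastrophe occurring given a right model), and the combined probability is far smaller than either headline number.

```python
# The model must be correct AND, given a correct model, the catastrophe must occur.
p_model_correct = 0.10             # our credence that the model itself is right
p_catastrophe_given_model = 0.01   # the model's own predicted chance of catastrophe
p_catastrophe = p_model_correct * p_catastrophe_given_model
print(round(p_catastrophe, 4))     # 0.001, i.e. 0.1% -- not the 1% headline
```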
It gets worse from here because global catastrophes require a national response, which requires policy. Policy does not deal with uncertainty well at all. ‘The world might end under Condition A, but we’re only 15% sure’ comes across as weak leadership.
Another challenge is the goals of policy: to advance the policy maker’s agenda, and to secure compliance with the policy. If there’s an 85% chance everyone will be fine, it gets harder to enforce compliance. So now the probabilities need to be wiped out in the name of driving policy compliance. At scale, compliance is easier with simpler rules, which makes nuanced approaches harder to execute. It also changes the estimation of model confidence: if one is forced to defend a model that is only 10% likely to be correct often enough, the emotional assessment becomes 40%, then 51%, then progressively closer to 100%. This is one cost of policy making.
So how do we adjust for these challenges? Shrike favors a skeptical approach to trusting models. Unlikely models should not be considered or included in decision making at all. Model independence and cumulative costs of models should be considered as well. When models fail, they need to be adjusted or abandoned. Finally, it needs to be emphasized that models generate hypotheses, not evidence.
This is all hard to do at a policy level. We observed this in real time with the COVID models. The models were used as a benchmark when they were just lousy (but scary!) predictions. Even after it was clear the models were wrong, the lockdowns and masking persisted. Some would argue that there were additional policy considerations favoring these measures (e.g., using COVID as an excuse to consolidate power), which is possible. However, it’s not clear to what extent this was nefarious versus the inertia inherent in large policy decisions.
Models are powerful tools when they are used appropriately. However, the temptation to abuse models grows as they tackle successively higher-stakes problems. Our skepticism of models should grow as well. We should demand clearer disclosure of limitations, set up benchmarks against which we judge each model, and require policy change when the model fails those benchmarks.