Welcome to Science with Shrike! Today we’ll talk about modeling. Models can be incredibly useful in science and in other arenas, but they are often abused and/or misinterpreted. We’ll cover the good and the bad with models, without using scary math. Today focuses more on the good.
Why do we want to model anything?
Models solve communication and calculation problems. When used correctly in science, they let us describe large groups of data very succinctly. Which is easier to say:
· At 100 ng/µL the absorbance was 0.4, at 200 ng/µL the absorbance was 0.46, at 300 ng/µL the absorbance was 0.55
· The data were linear, the equation is <this>
Clearly the second option. Even better, once we have the equation, we can figure out an unknown concentration if we measure the absorbance.
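To make that concrete, here is a minimal sketch in Python (numpy assumed; the data are the three example points from the bullet above, and the helper function name is ours):

```python
import numpy as np

# The three example points from above (concentration in ng/uL, absorbance)
conc = np.array([100.0, 200.0, 300.0])
absorb = np.array([0.40, 0.46, 0.55])

# Fit the straight-line model: absorbance = slope * concentration + intercept
slope, intercept = np.polyfit(conc, absorb, 1)

# Invert the model to estimate an unknown sample's concentration
# from its measured absorbance
def concentration_from_absorbance(a):
    return (a - intercept) / slope

print(concentration_from_absorbance(0.50))  # an unknown with absorbance 0.50
```

The equation Excel prints on a chart trendline is exactly this slope and intercept.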
Another example of modeling is visualizing three-dimensional datasets like this one.
This is a picture of what the pro-inflammatory protein called Interleukin-1β (IL-1β) looks like. This picture was drawn from a model created using X-ray crystallography data. We can use this model to see how IL-1β might interact with other proteins, identify which parts of the protein contact other proteins, re-engineer the protein to change how it interacts, or figure out where we can add new things to the protein without interfering with its function.
Other uses for modeling include hypothesis generation: making predictions so that we can test how well those predictions hold up to reality by doing experiments. Half of the problems with modeling occur when people assume the model is evidence supporting the prediction, instead of rationale for testing the prediction.
The other half of the problems come from using a bad model.
How do I know if the model is good?
Modeling is just a fancy way of saying that you are fitting your data to an equation (sometimes more than one). This raises two questions: how do we know which equation to use, and how well does the model fit the data?
Both of these questions can be answered by measuring how well the model fits the data, and checking if it makes any weird predictions. We’ll illustrate by modeling some data plotted on the graph below. These data show how much light is absorbed (absorbance) by known concentrations of a protein solution (Concentration). Absorbance is the variable that we measure (which is why it goes on the y axis), while concentration is the variable we can easily vary (which is why it goes on the x axis). Plotting these together lets us model the relationship between the concentration of a protein solution and its absorbance. Using that model, we can determine the concentration of an unknown protein solution if we measure its absorbance. When this method uses a reagent called Coomassie Blue to color the protein, it is called a Bradford assay.
To model the data from a Bradford assay, you might choose to use a straight line:
You don’t need any math to tell you that a line is a poor model for these data.
However, we can use math to put a number on it. The measure of how well the data fit the model is called R2 (formally, the coefficient of determination), and it goes from 0 to 1. The closer the number is to 1, the better the fit. Excel can magically calculate this for us if we check the box ‘display R2 on chart’.
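If you want to see what Excel is doing under the hood, R2 can be computed by hand. Here is a sketch in Python (numpy assumed, reusing the three example points from earlier):

```python
import numpy as np

def r_squared(y_observed, y_predicted):
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_observed - y_predicted) ** 2)
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Straight-line fit to the three example points from earlier
conc = np.array([100.0, 200.0, 300.0])
absorb = np.array([0.40, 0.46, 0.55])
slope, intercept = np.polyfit(conc, absorb, 1)
r2 = r_squared(absorb, slope * conc + intercept)
print(r2)
```

(Strictly speaking, R2 can dip below 0 for a fit worse than a flat line, but for the kinds of trendlines Excel draws here, 0 to 1 is the range you will see.)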
If a line doesn’t work, we can try a different equation. As a rough rule of thumb, for modeling two variables like we are doing here, we use the following types of equations:
· If it looks like a line, try a line
· If it looks like an S, try a logistic curve
· If it curves very sharply, try a log or exponential curve
· If none of those fit well, use a weird non-linear curve
With our example above, we could try a log curve. And when Shrike says ‘we try a log curve’, that means clicking a couple of buttons in Excel and making the computer do the math. The graph looks like this:
We can tell that the R2 is much better than the last one. This model fits the data better than a straight line. But can we make this model even better? Excel lets us choose a whole bunch of trendlines, especially weird polynomials. If we use a 6th order polynomial, we get a terrific R2! With a high enough order polynomial we can hit all the data points. You know from above that R2 is math-speak for how well the line hits the existing data points.
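You can reproduce that overfitting effect yourself. This sketch (Python with numpy; the seven calibration points are made-up numbers, not the actual data from the graph) pushes R2 to essentially 1 by brute force:

```python
import numpy as np

# Seven made-up points with a saturating, non-linear shape
conc = np.array([0.0, 250.0, 500.0, 750.0, 1000.0, 1250.0, 1500.0])
absorb = np.array([0.05, 0.45, 0.68, 0.80, 0.88, 0.93, 0.96])

# A 6th-order polynomial has 7 coefficients, so it can pass through
# all 7 points exactly -- a perfect R2 with zero predictive value
x = conc / 1500.0  # rescale x to keep the high powers numerically stable
coeffs = np.polyfit(x, absorb, 6)
pred = np.polyval(coeffs, x)

ss_res = np.sum((absorb - pred) ** 2)
ss_tot = np.sum((absorb - absorb.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2)  # essentially 1.0
```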
While the R2 is good, this is a terrible model because it makes some weird predictions. This model suggests that absorbance values at concentrations around 1200 should be negative. A negative absorbance would mean the protein is either emitting light or making the solution more transparent than the original solution. While these predictions could be dismissed out of hand as absurd, you could also test the model by measuring the absorbance at a concentration of 1200. If the absorbance is positive, you know this model was wrong.
This illustrates the concern for what Shrike calls ‘weird’ predictions. Any time there is a hump in the middle of the data with no data points on the hump, you have a weird prediction. In the graph above, we have two of these: one between 500 and 1000, and the second one between 1000 and 1500. The overwhelming majority of the time, weird predictions mean you chose a bad model. If you insist on believing your model’s weird predictions, you must test them experimentally.
Thus, getting an R2 of 1 is not the be-all and end-all of modeling.
In the example above, there are three options for modeling the protein concentration. The first is to use the log curve above, the second is to use a different non-linear model (fitting the data to the four-parameter logistic curve y = (A-D)/(1+(X/C)^B)+D can work), and the third is to use only the subset of the data where the data are linear. In the last case, the 200-500 region might be ok.
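For the second option, here is a hedged sketch of fitting that four-parameter curve with scipy’s curve_fit (the calibration data here are synthetic, generated from known parameters so the fit has a known right answer):

```python
import numpy as np
from scipy.optimize import curve_fit

# The four-parameter logistic from the text: y = (A-D)/(1+(x/C)^B)+D
def four_pl(x, A, B, C, D):
    return (A - D) / (1.0 + (x / C) ** B) + D

# Synthetic, noise-free calibration data from known parameters,
# so we can check that the fit recovers them
conc = np.array([100.0, 200.0, 300.0, 500.0, 750.0, 1000.0, 1500.0])
absorb = four_pl(conc, A=0.05, B=1.2, C=350.0, D=1.05)

# Reasonable starting guesses (and bounds that keep C positive) help it converge
params, _ = curve_fit(
    four_pl, conc, absorb,
    p0=[0.0, 1.0, 300.0, 1.0],
    bounds=([-1.0, 0.1, 50.0, 0.0], [1.0, 5.0, 2000.0, 2.0]),
)
print(params)  # approximately [0.05, 1.2, 350.0, 1.05]
```

With real, noisy data the recovered parameters will not be exact, but the same call works; Excel cannot fit this curve with a built-in trendline, which is why scipy (or similar software) comes in.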
Notice the R2 is better than the log curve’s. That means the data are linear within this region. The equation is also now on the graph, which Shrike will use for the next caveats.
However, there is a catch. If we’ve chosen a good model, it is only good for interpolation, not for extrapolation. Interpolation means we can only analyze data that fall between the upper and lower bounds of the graph (in this case absorbances 0.6115 and 0.906). If we measure an unknown protein with an absorbance of 0.75, we have good confidence the concentration is 341.
But we don’t know how the data behave outside of this range. If we use the equation to calculate the concentration for an absorbance of 1.084, we would get a concentration of 674. Instead, we know from the full graph that this absorbance actually corresponds to a concentration of 1500.
The response outside of our range was not linear, so our linear model got it wrong.
This is why we cannot extrapolate from models accurately. The model might not work outside of the tested bounds. We need to test the model under our predicted conditions before we know how useful the model is.
We need to use models only within their predictive range.
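One practical way to honor that rule in analysis code is to make the model refuse to extrapolate. A minimal sketch (plain Python; the slope and intercept are illustrative values chosen to be consistent with the example numbers above, not the real fitted equation from the graph):

```python
# Calibrated absorbance range of the linear model (bounds from the text)
ABS_MIN, ABS_MAX = 0.6115, 0.906

# Illustrative slope/intercept, chosen to match the worked numbers above;
# in practice these come from your own standard curve
SLOPE, INTERCEPT = 0.001003, 0.408

def concentration(absorbance):
    # Refuse to extrapolate: the model is only trusted inside the
    # range it was calibrated on
    if not (ABS_MIN <= absorbance <= ABS_MAX):
        raise ValueError(
            "absorbance outside calibrated range; dilute the sample "
            "or extend the standard curve instead of extrapolating"
        )
    return (absorbance - INTERCEPT) / SLOPE

print(round(concentration(0.75)))  # 341, as in the text
# concentration(1.084) would raise ValueError rather than report 674
```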
Part 2 will consider the use of models for hypothesis generation.