Calibration plots - how confident is your model about itself?
Calibration plots can be used to analyze probabilistic models. In short, the plot sheds light on how confident or uncertain a model is about its own predictions, and whether we, as users of the model, can trust this uncertainty. Let us explain how.
- Author: Solution Seeker
Imagine that you have a model that predicts the flow rate of oil from a petroleum well given its surrounding conditions. It is crucial that you know how much oil will be produced when adjusting the control valve opening. Why? There are numerous restrictions, both economic and political, on how much oil you can produce. Further, producing too much from the well may damage the structure of the reservoir, for instance by inducing water breakthrough or excessive sand production.
Alright, so you adjust your valve opening to 50% and ask your model how much oil will be produced. The model gives you a number, say 800 bbl/d. But how certain is this number? If the well actually produces 900 bbl/d at this valve opening, you might eventually face a problem with OPEC because you produce more than you planned for. If the well produces 700 bbl/d, you lose money. Wouldn't it be marvelous if you knew how certain your model was about that number? A probabilistic model will at least give you an estimate of the uncertainty. You will not only get the value 800 bbl/d; the model might also tell you that there is a 90% chance that you will produce as much as 900 bbl/d. Will you risk this? Maybe not. What if the model said that there is only a 5% chance that you will produce as much as 900 bbl/d? A much better risk to take, no? But then again, how certain is this uncertainty? Can we trust the model uncertainty? To answer that, you need to analyze the model with a calibration plot.
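To make the idea of "a chance of producing as much as 900 bbl/d" concrete, here is a minimal sketch. It assumes the predictive distribution is normal, and the means and standard deviations are made-up numbers for illustration, not output from any real model. The helper `exceedance_probability` is a hypothetical name.

```python
from statistics import NormalDist


def exceedance_probability(mean, std, limit):
    """Probability that a normally distributed prediction exceeds `limit`."""
    return 1.0 - NormalDist(mu=mean, sigma=std).cdf(limit)


# Hypothetical numbers: two models both predict 800 bbl/d as the most likely
# rate, but they report different spreads around that value.
narrow = exceedance_probability(mean=800.0, std=60.0, limit=900.0)
wide = exceedance_probability(mean=800.0, std=150.0, limit=900.0)

print(f"std  60 bbl/d: P(rate > 900 bbl/d) = {narrow:.3f}")
print(f"std 150 bbl/d: P(rate > 900 bbl/d) = {wide:.3f}")
```

The same point estimate can thus come with very different risks of overproduction, which is why the reported spread has to be trustworthy.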
![Figure 1: Calibration plots of two models (green and black)](https://solutionseeker2021.imgix.net/images/old/Figure1-with-number.png?auto=compress%2Cformat&crop=focalpoint&cs=srgb&fit=crop&fp-x=0.5&fp-y=0.5&h=581&q=90&w=600&s=c5633c61abeab80c01fd31f93540e530)
Let us jump straight into the calibration plot by looking at some figures. Figure 1 shows the calibration plots of two different models (green and black). The models are evaluated on a test data set. The graphs show the frequency of residuals lying within varying posterior intervals of the predictive distribution. Another name for these intervals is credible intervals. The curve of a perfectly calibrated model lies along the gray diagonal line. Wait, what? Frequency of residuals, what does that even mean? Let's take it step by step.
![Figure 2: Normal predictive distribution with the median (P50) and the 20% posterior interval between P40 and P60](https://solutionseeker2021.imgix.net/images/old/Figure2-with-number.png?auto=compress%2Cformat&crop=focalpoint&cs=srgb&fit=crop&fp-x=0.5&fp-y=0.5&h=434&q=90&w=600&s=e8f5f8dc780aa7e8f023bd561ad20c6e)
First, imagine a new measurement point for which you use your model to predict the output. Since your model is probabilistic, it will output not only a point estimate but a whole distribution. For instance, if your model's predictions follow a normal distribution, the predictive distribution will look like the bell-shaped curve illustrated in Figure 2. The black dotted line in the middle is your model's median predictive value for the measurement point (for the normal distribution the median coincides with the mean), also called the 50th percentile (P50). A percentile is the value below which a given percentage of the data points is expected to fall. For instance, if the 40th percentile has a value of 5, it is expected that 40% of all data points will have a value lower than 5. We have also indicated the 20% posterior interval centered on the median, which is the area between the percentiles P40 and P60. This means that we expect 20% of the data points to fall within this interval.
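The percentiles and the centered interval can be computed directly from the predictive distribution. The sketch below assumes a normal predictive distribution with made-up parameters (800 and 60), and `central_interval` is a hypothetical helper name, not something from the article.

```python
from statistics import NormalDist


def central_interval(dist, level):
    """Bounds of the interval centered on the median that holds `level` probability mass."""
    lower_q = 0.5 - level / 2.0  # e.g. 0.40 for the 20% interval
    upper_q = 0.5 + level / 2.0  # e.g. 0.60 for the 20% interval
    return dist.inv_cdf(lower_q), dist.inv_cdf(upper_q)


# Hypothetical predictive distribution for one measurement point.
predictive = NormalDist(mu=800.0, sigma=60.0)

median = predictive.inv_cdf(0.5)               # P50
p40, p60 = central_interval(predictive, 0.20)  # P40 and P60
mass = predictive.cdf(p60) - predictive.cdf(p40)

print(f"P40 = {p40:.1f}, P50 = {median:.1f}, P60 = {p60:.1f}")
print(f"Probability mass between P40 and P60: {mass:.2f}")
```

Wider intervals, such as the 80% interval between P10 and P90, follow from the same function by changing `level`.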
Alright, we have our model's posterior predictive distribution for the new measurement point. Let us now indicate where the true measurement value actually lies in relation to this distribution. This is indicated by the yellow star. From this we can determine which posterior interval the true value lies in. If we do this exercise for all the test points in our test data set, we can calculate the frequency of residuals lying within varying posterior intervals centered on the median. A perfectly calibrated model will have 20% of the test points lying in the 20% posterior interval, 30% in the 30% interval, and so forth. In other words, its curve follows the gray diagonal line in the calibration plot in Figure 1. If so, we can trust the uncertainty that the model reports for its predictions.
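The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the predictive distributions are taken to be normal, the test set is synthetic, and `calibration_curve` is a hypothetical helper name. The true values are drawn from the very distributions the model reports, so the model is well calibrated by construction.

```python
import random
from statistics import NormalDist


def calibration_curve(y_true, means, stds, levels):
    """Observed fraction of test points inside each centered predictive interval."""
    # Distance of each true value from the predictive median, measured in
    # cumulative probability (a number between 0 and 0.5).
    offsets = [
        abs(NormalDist(mu=m, sigma=s).cdf(y) - 0.5)
        for y, m, s in zip(y_true, means, stds)
    ]
    # A point lies inside the centered `level` interval if its offset is at most level / 2.
    return [sum(o <= level / 2.0 for o in offsets) / len(offsets) for level in levels]


# Synthetic test set (assumption): predicted means and a fixed predictive spread.
rng = random.Random(0)
n_points = 5000
means = [rng.uniform(600.0, 1000.0) for _ in range(n_points)]
stds = [50.0] * n_points
y_true = [rng.gauss(m, s) for m, s in zip(means, stds)]

levels = [i / 10 for i in range(1, 10)]  # 10%, 20%, ..., 90% intervals
observed = calibration_curve(y_true, means, stds, levels)

for level, freq in zip(levels, observed):
    print(f"{level:.0%} interval: {freq:.1%} of test points inside")
```

Plotting `observed` against `levels` gives the calibration curve. For this synthetic model it should stay close to the diagonal, up to sampling noise.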
Let's get back to the calibration plots in Figure 1. From the figure we see that neither of the models is well calibrated. The curve above the diagonal (green) belongs to an underconfident model: a greater percentage of test points lies within a given interval than the model expected. For instance, more than 50% of the test points lie within the 20% posterior interval. Ergo, the model lacks confidence in its own predictions. The curve below the diagonal (black) belongs to an overconfident model. Here only about 10% of all test points lie within the 20% posterior interval. The model is therefore too cocky about its own predictions.
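Both behaviors are easy to reproduce on synthetic data. In the sketch below, the true values scatter around the predicted mean with a standard deviation of 50, while two hypothetical models report a spread that is three times too large or two times too small. These factors are assumptions picked for illustration and are not taken from the models in Figure 1.

```python
import random
from statistics import NormalDist


def coverage(y_true, means, reported_std, level):
    """Fraction of true values inside the centered `level` interval of N(mean, reported_std)."""
    # Half-width of the interval, in units of the reported standard deviation.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    inside = sum(abs(y - m) <= z * reported_std for y, m in zip(y_true, means))
    return inside / len(y_true)


rng = random.Random(1)
true_std = 50.0
means = [rng.uniform(600.0, 1000.0) for _ in range(5000)]
y_true = [rng.gauss(m, true_std) for m in means]

# Underconfident: the model reports far more spread than the data show.
under = coverage(y_true, means, reported_std=150.0, level=0.20)
# Overconfident: the model reports far less spread than the data show.
over = coverage(y_true, means, reported_std=25.0, level=0.20)

print(f"Underconfident model: {under:.1%} of points in the 20% interval")
print(f"Overconfident model:  {over:.1%} of points in the 20% interval")
```

With these made-up factors, the expected coverage of the 20% interval is roughly 55% for the underconfident model and roughly 10% for the overconfident one, which is the same pattern as the curves above and below the diagonal.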
Stay tuned for our next article in this series, where we will explore the uncertainty of the results from our large-scale study of data-driven VFMs using Bayesian neural networks. Read our previous articles on data-driven VFMs here.