Let us jump straight in by looking at some figures. Figure 1 shows calibration plots for two different models (green and black), both evaluated on a test data set. The curves show the frequency of residuals lying within varying posterior intervals of the predictive distribution; these intervals are also called credible intervals. A perfectly calibrated model lies along the gray diagonal line. Wait, what? Frequency of residuals, what does that even mean? Let's take it step by step.
First, imagine a new measurement point for which you use your model to predict the output. As your model is probabilistic, it will not give out just a point estimate but a full distribution. For instance, if your model follows a normal distribution, the predictive distribution will look like the bell-shaped one illustrated in Figure 2. The black dotted line in the middle is your model's median predictive value for the measurement point, also called the 50th percentile (P50); for the normal distribution the median coincides with the mean. A percentile is simply the value below which a given percentage of the data points is expected to fall. For instance, if the 40th percentile has a value of 5, we expect 40% of all data points to have a value lower than 5. We have also indicated the 20% posterior interval centered about the median, which is the area between the percentiles P40 and P60. This means that we expect 20% of data points to fall within this interval.
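To make percentiles concrete, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The predictive mean and standard deviation are made-up numbers for illustration:

```python
from statistics import NormalDist

# Hypothetical Gaussian predictive distribution for one measurement point.
mu, sigma = 10.0, 2.0  # assumed predictive mean and standard deviation
dist = NormalDist(mu, sigma)

# P50 is the median; for a Gaussian it coincides with the mean.
p50 = dist.inv_cdf(0.50)

# The 20% posterior interval centered about the median runs from P40 to P60.
p40 = dist.inv_cdf(0.40)
p60 = dist.inv_cdf(0.60)

print(f"P50 = {p50:.2f}")
print(f"20% central interval: [{p40:.2f}, {p60:.2f}]")
```

By construction, exactly 20% of the predictive probability mass sits between P40 and P60, which is what makes it a 20% posterior interval.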
Alright, we have our model's posterior predictive distribution for the new measurement point; let us now indicate where the true measurement value actually lies in relation to this distribution. This is indicated by the yellow star. From this we can determine which posterior interval the true value falls in. If we repeat this exercise for every point in our test data set, we can calculate the frequency of residuals lying within varying posterior intervals centered around the median. A perfectly calibrated model will have 20% of the test points lying within the 20% posterior interval, and so forth for every other interval. In other words, its curve will lie on the gray diagonal line in the calibration plot in Figure 1. If so, we can trust the uncertainty the model believes it has in its predictions.
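The procedure above can be sketched in a few lines of Python (standard library only). Assuming each test point has a Gaussian predictive distribution with predicted mean `mu[i]` and standard deviation `sigma[i]` (names chosen here for illustration), we count how often the true value falls inside each central posterior interval:

```python
from statistics import NormalDist

def calibration_curve(y_true, mu, sigma, levels):
    """For each probability level p in `levels`, return the observed fraction
    of test points whose true value lies inside the central p-interval of the
    Gaussian predictive distribution N(mu[i], sigma[i]**2)."""
    nd = NormalDist()
    observed = []
    for p in levels:
        # Half-width of the central p-interval, in predicted standard deviations.
        z = nd.inv_cdf(0.5 + p / 2.0)
        inside = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sigma))
        observed.append(inside / len(y_true))
    return observed
```

Plotting `levels` against the returned frequencies gives exactly the kind of curve shown in Figure 1: a well-calibrated model traces the diagonal.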
Let's get back to the calibration plots in Figure 1. From the figure we see that neither model is well calibrated. The model whose curve lies above the diagonal (green) is under-confident: a greater percentage of test points lies within a given interval than the model expected. For instance, more than 50% of the test points lie within the 20% posterior interval. Ergo, the model lacks confidence in its own predictions. The model whose curve lies below the diagonal (black) is overconfident: here only about 10% of all test points lie within the 20% posterior interval. The model is therefore too cocky about its own predictions.
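We can reproduce both failure modes with a small synthetic experiment (a sketch with made-up numbers): the true residuals have standard deviation 1, but the model reports a predictive standard deviation that is either too large (under-confident) or too small (overconfident):

```python
import random
from statistics import NormalDist

random.seed(42)
residuals = [random.gauss(0.0, 1.0) for _ in range(5000)]  # true noise std = 1

def frac_inside(resid, sigma_pred, p):
    # Fraction of residuals inside the central p-interval the model claims.
    z = NormalDist().inv_cdf(0.5 + p / 2.0)
    return sum(abs(r) <= z * sigma_pred for r in resid) / len(resid)

p = 0.20
under = frac_inside(residuals, 2.0, p)  # sigma too large: curve above diagonal
over = frac_inside(residuals, 0.5, p)  # sigma too small: curve below diagonal
print(under)  # noticeably more than 0.20 -> under-confident
print(over)   # noticeably less than 0.20 -> overconfident
```

Overstating the predictive spread sweeps too many residuals into each interval (the green curve); understating it leaves too many outside (the black curve).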
Stay tuned for our next article in this series, where we will explore the uncertainty of the results from our large-scale study of data-driven VFMs using Bayesian neural networks.
Read our previous articles on data-driven VFMs here.