Blog post by: Professor Mark Levene, Principal Scientist at NPL

Calibration is an important concept in metrology, the science of measurement. It involves the comparison of measured values delivered by a device with those of a measurement standard of generally greater accuracy. To account for variability in the measurements, for example, caused by random effects, calibration often involves taking repeated measurements and constructing a confidence interval to express the uncertainty in the calibration result. A device is regarded as calibrated when this comparison has been carried out and a document containing the expected values from the standard, the measured values from the device, and the associated uncertainty (a “calibration certificate”) has been issued.

It is advantageous to describe the variability in a measurement by a probability distribution, which is a mathematical function that gives the probability of different possible values of a random variable. Probability distributions are often visualised using the probability density function, a curve showing how often a given value of a variable is likely to occur relative to other values. These functions are normalised so that the area under the curve is equal to 1. The most common distribution is the normal distribution, also known as the bell curve due to its shape. Examples of approximately normally distributed random variables are the height, weight, and heart rate of a human.
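The normalisation property above can be checked numerically. As a minimal sketch, the snippet below integrates the standard normal density over a wide interval with the trapezoidal rule and confirms the area is (very close to) 1; the function names are my own, not from any particular library.

```python
import math

# Probability density function of a normal distribution.
def normal_pdf(x, mean=0.0, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Trapezoidal-rule integration of f over [lo, hi].
def area_under_curve(f, lo, hi, steps=10_000):
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# The area under the whole density curve should be 1; integrating over
# [-10, 10] captures essentially all of a standard normal's mass.
area = area_under_curve(normal_pdf, -10, 10)
print(round(area, 6))  # -> 1.0
```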

Why am I going to all the trouble of telling you this? Well, the parameters associated with a measurement, such as its variability, are derived from a probability distribution, and more generally a probability distribution provides us with the most complete description of a measurement result.

Now to our main topic of machine learning (ML). To solve a problem, an ML algorithm produces a representation of the input data set in the form of a statistical model, which can be used for either classification or regression tasks to aid decision making. As an example, let's make use of our old friend, the weather. Two common questions we would like answers to, within a given time frame, are: (i) will it rain or not (a classification problem with two classes), and (ii) what will the temperature be (a regression problem with a range of possible temperature values).

Calibration in ML thus comes in two flavours depending on whether the task at hand is one of classification or regression. First, consider a classification problem involving prediction of whether it will rain or not. Here the output of the ML algorithm is more refined than simply telling us whether it will rain or not, which is too difficult a task to answer accurately due to the uncertainty involved. Rather, the ML algorithm outputs a probability that it will rain, often presented in the more user-friendly format of a percentage (the probability multiplied by 100); this output is called the predicted probability. It is then up to us to interpret this probability when deciding whether we should grab an umbrella when going out.

Recall that a measurement result is incomplete without an associated uncertainty. As a simplification of how the Meteorological Office may arrive at the probability that it will rain, we assume that a complex model of the weather is repeatedly run with slightly different initial conditions. Each simulation result will output yes for “it will rain” or no for “it will not rain”, and the probability reported is the proportion of yes outputs; this probability is called the empirical probability. The data set used for this calculation is called the validation data set. The predicted probability can then be calculated from the inputs of the validation set using the statistical model constructed by the ML algorithm. So, an ML classifier outputting a probability of rain as the prediction is said to be calibrated if, whenever the ML model outputs a predicted probability, this equals the empirical probability as described above. That is, if we plot the empirical probability values versus the predicted probability values, then calibration implies that they will fall on the diagonal line or close enough to it. Such a plot is called a reliability diagram, which will help us ascertain how calibrated the probabilities really are.
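The points of a reliability diagram can be computed by binning the predicted probabilities and comparing each bin's mean prediction with its empirical frequency. The sketch below is one minimal way to do this; the function name and the toy rain data are hypothetical, purely for illustration.

```python
from collections import defaultdict

def reliability_diagram(predicted, observed, n_bins=10):
    """For each equal-width probability bin, pair the mean predicted
    probability with the empirical probability (the fraction of
    positive outcomes among examples falling in that bin)."""
    bins = defaultdict(list)
    for p, y in zip(predicted, observed):
        idx = min(int(p * n_bins), n_bins - 1)  # bin index 0..n_bins-1
        bins[idx].append((p, y))
    points = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_predicted = sum(p for p, _ in pairs) / len(pairs)
        empirical = sum(y for _, y in pairs) / len(pairs)
        points.append((mean_predicted, empirical))
    return points  # a calibrated model gives points near the diagonal

# Hypothetical validation data: predicted rain probabilities and
# observed outcomes (1 = it rained, 0 = it did not).
predicted = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
observed = [0, 0, 1, 1, 1, 1]
for mean_p, emp in reliability_diagram(predicted, observed, n_bins=5):
    print(f"predicted {mean_p:.2f} -> empirical {emp:.2f}")
```

Plotting these (mean predicted, empirical) pairs against the diagonal line gives the reliability diagram described above.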

Second, we have a regression problem, where the output is a temperature value. Producing an accurate single value, say T, for the temperature is too difficult in practice. So, instead let us consider the output to be the probability, say P, that the temperature will be less than or equal to T; this is the predicted probability for a regression problem. Now, to get the empirical probability P we can evaluate the weather model, as before, where each simulation will output a temperature, and count the proportion of times the output was less than or equal to a temperature T. As before, the data used for this calculation is the validation set.

To complete the explanation for a regression problem we still need to know how, given a probability, say P, we can obtain the temperature value, T, such that the probability that the temperature is less than or equal to T is P. To do this we need to introduce one more term, the quantile of a probability distribution expressed as a percentage, say P %, which is our notation for P times 100. The quantile is the value T of the temperature, such that exactly P % of the distribution is less than or equal to T. Now, as with classification, we can plot a reliability diagram of the empirical probabilities against the predicted probabilities and expect the plotted values to fall close to the diagonal line when the probabilities are calibrated.
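This quantile-based check can be sketched in a few lines. In the toy example below I assume a hypothetical predictive distribution (temperature normally distributed with mean 15 °C and standard deviation 3 °C) and use random draws from the same distribution as a stand-in for repeated weather-model simulations, so the empirical probabilities should land close to the predicted ones; none of these numbers come from real weather data.

```python
import random
from statistics import NormalDist

random.seed(0)

# Hypothetical predictive distribution for tomorrow's temperature.
predictive = NormalDist(mu=15.0, sigma=3.0)

# Stand-in for repeated weather-model simulations: samples drawn from
# the same distribution, so the model is calibrated by construction.
simulated_temps = [random.gauss(15.0, 3.0) for _ in range(10_000)]

print("P      quantile T   empirical")
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    # The quantile: temperature T such that P(temperature <= T) = p.
    t = predictive.inv_cdf(p)
    # Empirical probability: fraction of simulations at or below T.
    empirical = sum(temp <= t for temp in simulated_temps) / len(simulated_temps)
    print(f"{p:.2f}   {t:7.2f}     {empirical:.3f}")
```

Plotting the empirical column against the P column would give the regression reliability diagram; for a calibrated model the points hug the diagonal.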

We have covered quite a few statistical concepts that are needed to understand calibration in ML. The key takeaway is that both in classical metrology and ML, calibration needs to deal with uncertainty. In the case of classical metrology, the empirical probabilities emanate from repeated measurements, and are compared to a measurement standard. On the other hand, in ML the empirical probability is derived from the outputs of a validation data set and is compared to the predicted probability output from the ML model with the inputs coming from the validation set. In this sense the empirical probability, which we assume derives from Meteorological Office data, acts as a reference probability for the ML calibration process.
