Some years ago, I was part of a machine learning consulting team working on a project for an insurance firm. Our task was to deploy a predictive model, and when we presented our findings to the head of analytics, we were optimistic about our accomplishments. We showcased various goodness-of-fit metrics such as accuracy, recall, precision, and F1-score. Despite the favorable results, there was a hint of dissatisfaction in his expression, which made us realize that our presentation had not been entirely convincing. He explained his reservations: what the firm needed was the probability of an accident occurring in the upcoming year, not just the accuracy of our predictions. That distinction was crucial for the firm's decision-making, since it could mean the difference between a five percent and a forty percent chance of an accident.

Our initial instinct was to reach for the model's predict_proba method as an estimate of that likelihood. On closer examination, however, we realized that this score did not behave like a true probability. For instance, an event with probability p should occur roughly once every 1/p trials (as in a geometric distribution), and our data did not align with that expectation. It became apparent that the score returned by predict_proba was not an accurate reflection of probability.

To address this challenge, we turned to calibration. Calibrating a machine learning model means adjusting its output scores so that they correspond more closely to the true probabilities of the events being predicted. A well-calibrated model produces predicted probabilities that are dependable estimates of outcome likelihoods, which makes its predictions more accurate, more interpretable, and more useful for decision-making.

In data analysis, visualizations often aid comprehension, and calibration is no exception. A calibration curve, also referred to as a reliability diagram or calibration plot, is a graphical tool for evaluating the calibration of a probabilistic classification model. It depicts the relationship between the probabilities predicted by the model and the observed frequencies of the predicted outcomes.

[Image: Calibration curve]

In a calibration curve, the x-axis typically represents the predicted probabilities generated by the model, ranging from 0 to 1, while the y-axis represents the observed frequencies of the predicted outcomes. The predicted probabilities are divided into intervals or bins, and for each interval, the average predicted probability and the corresponding observed frequency of the positive outcome are calculated.
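
To make the binning concrete, here is a minimal sketch using NumPy, with made-up predictions and labels and an illustrative choice of ten equal-width bins:

import numpy as np

# Hypothetical predicted probabilities and the corresponding true 0/1 labels
y_prob = np.array([0.05, 0.12, 0.33, 0.41, 0.58, 0.64, 0.77, 0.82, 0.90, 0.95])
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

bin_edges = np.linspace(0.0, 1.0, 11)            # ten equal-width bins
bin_ids = np.digitize(y_prob, bin_edges[1:-1])   # bin index for each prediction

for b in range(10):
    in_bin = bin_ids == b
    if in_bin.any():
        mean_pred = y_prob[in_bin].mean()        # average predicted probability in the bin
        frac_pos = y_true[in_bin].mean()         # observed frequency of positives in the bin
        print(f"bin {b}: mean predicted = {mean_pred:.2f}, observed = {frac_pos:.2f}")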

Ideally, a well-calibrated model should produce predicted probabilities that closely match the observed frequencies of the positive outcomes across different probability intervals. As such, the calibration curve should ideally follow a diagonal line, indicating perfect calibration, where the predicted probabilities equal the observed frequencies.

However, deviations from this diagonal line indicate calibration errors. With the axes defined as above, a curve above the diagonal means the observed frequency exceeds the predicted probability, so the model tends to underestimate the likelihood of positive outcomes; a curve below the diagonal indicates overestimation. These deviations provide insight into the calibration performance of the model and can help identify where calibration adjustments are needed to improve predictive accuracy.

Most machine learning models suffer from calibration issues out of the box, and the causes vary with the learning algorithm. Tree-based ensembles like random forests, for instance, derive their predictions by averaging individual trees, so their probabilities rarely reach zero or one; as a result, they tend to overestimate probabilities near zero and underestimate probabilities near one. Additionally, many models are optimized and evaluated with binary metrics such as accuracy, which reward correctness rather than well-calibrated certainty. Decision trees split using criteria like Gini impurity, prioritizing class purity over probabilistic calibration, and support vector machines maximize the margin rather than optimize probabilities. Deep neural networks, meanwhile, often employ techniques such as dropout to prevent overfitting, which can also affect their calibration.
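
As a quick illustration of the averaging effect, the sketch below (on a synthetic dataset, with illustrative parameters) checks how often a random forest emits probabilities at the extremes; since each prediction is an average over many trees, a value of exactly 0 or 1 requires every tree to agree:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data; parameters are purely illustrative
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
proba = forest.predict_proba(X_test)[:, 1]

# Report how often the forest's averaged probabilities reach the extremes
print("predictions below 0.05:", np.mean(proba < 0.05))
print("predictions above 0.95:", np.mean(proba > 0.95))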

To evaluate the effectiveness of the calibration process, one commonly used metric is the Brier score, which is essentially the mean squared error between the predicted probabilities and the actual outcomes over a set of predictions. A lower Brier score indicates better calibration and accuracy, with a perfect score of 0 meaning the predicted probabilities match the outcomes exactly.

[Image: Brier score]
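
A lightweight sketch of computing this with scikit-learn's brier_score_loss, using made-up labels and probabilities:

from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0, 1]                 # actual outcomes
y_prob = [0.1, 0.8, 0.65, 0.3, 0.9]      # predicted probabilities for the positive class

# Brier score = mean of (predicted probability - actual outcome)^2
print(brier_score_loss(y_true, y_prob))  # lower is better; 0 is perfect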

Another way to gauge the quality of the calibration is the log-loss metric, also known as logarithmic loss or cross-entropy loss. It evaluates the accuracy of probabilistic predictions by quantifying the difference between the predicted probabilities and the actual outcomes, taking the full probability distribution into account. A lower log-loss indicates better calibration, with the ideal value being zero.

[Image: Log loss score]
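
Similarly, a minimal sketch with scikit-learn's log_loss on the same made-up values:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.65, 0.3, 0.9]

# Average negative log-likelihood of the true labels under the predicted probabilities
print(log_loss(y_true, y_prob))  # lower is better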

It's worth mentioning that log loss can be hard to interpret on its own; it is mainly used for comparison, with lower values indicating a better model fit.

Various methods exist for calibrating machine learning models, all aimed at refining the predicted probabilities so that they better match the actual probabilities of events. In practice, we construct the calibration curve described above, plotting the observed frequency in each bin against the mean predicted probability, and the gap between this curve and the diagonal of perfect calibration can itself serve as a measure of calibration performance.

Next, we'll delve into the classical methods of calibration. Platt scaling, also known as Platt calibration, is a parametric approach utilized to calibrate the output probabilities of a classification model. Introduced by John Platt in 1999, this method aims to convert the raw output scores or logits generated by a classifier into calibrated probabilities.

The fundamental concept of Platt scaling is to fit a logistic regression model to the output scores produced by the classifier. By maximizing the log-likelihood of the observed labels, this logistic model learns to map raw scores to calibrated probabilities. Once trained, it transforms the classifier's scores into probabilities that better reflect the true likelihood of class membership. Because the fitted mapping is a sigmoid (logistic) function, the approach is also called sigmoid calibration, which is the name scikit-learn uses.
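
As a rough sketch of the idea (not Platt's original fitting procedure), one can approximate it by fitting scikit-learn's LogisticRegression to a classifier's raw decision scores on a held-out split; in practice, CalibratedClassifierCV(method='sigmoid') handles this for you. The data and parameters below are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Illustrative data; a held-out split is reserved for fitting the calibration map
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)

svm = LinearSVC(dual=False).fit(X_train, y_train)
scores = svm.decision_function(X_cal).reshape(-1, 1)   # raw, uncalibrated scores

# Fit a sigmoid (logistic) map from raw scores to probabilities
platt = LogisticRegression().fit(scores, y_cal)
calibrated_probs = platt.predict_proba(scores)[:, 1]
print(calibrated_probs[:5])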

In contrast, isotonic regression is a non-parametric approach that fits a function to the data while respecting the ordering of the inputs: the fitted values either increase or remain constant as the input increases. This makes it suitable for scenarios where the relationship between input and output is monotonic but not necessarily linear.

In isotonic regression, the goal is to find a piecewise-constant, non-decreasing function that fits the data well. By dividing the input space into intervals and assigning a constant value to each, it guarantees that the predictions follow a non-decreasing trend. The term "isotonic" refers to exactly this property: the fitted function keeps its direction, remaining monotonic (increasing or constant) as the input changes.

Isotonic regression finds applications across various domains, including the calibration of probability estimates, where it preserves the ordering of the predicted probabilities. It is also useful in general monotonic regression problems, where a monotonic but not necessarily linear relationship between variables is expected. Because it is non-parametric, isotonic regression assumes nothing about the shape of the calibration curve beyond monotonicity, which makes it flexible; that same flexibility, however, can lead to overfitting when the calibration dataset is small.
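
A minimal sketch with scikit-learn's IsotonicRegression, mapping made-up raw scores to calibrated probabilities (CalibratedClassifierCV(method='isotonic') wraps the same idea for classifiers):

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative raw model scores and observed 0/1 outcomes, sorted by score
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1])

# Fit a non-decreasing (monotonic) map from scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, y_true)

print(iso.predict([0.25, 0.55, 0.95]))  # calibrated probability estimates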

These calibration techniques, whether Platt scaling or isotonic regression, can also be extended to multiclass problems, as sketched below. By calibrating the model's predicted probabilities, data scientists can improve the reliability of its output and make better-informed decisions based on it.
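
For the multiclass case, scikit-learn's CalibratedClassifierCV calibrates each class in a one-vs-rest fashion and renormalizes the resulting probabilities so they sum to one. A brief sketch on the full three-class iris dataset (parameters are illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each of the three classes is calibrated one-vs-rest, then the
# probabilities are renormalized to sum to one per sample
clf = CalibratedClassifierCV(RandomForestClassifier(n_estimators=100), method='isotonic', cv=5)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # one calibrated probability per class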

Scikit-learn example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Convert the multiclass problem into a binary classification problem
# Let's consider class 1 (Iris-Versicolour) as the positive class
y_binary = np.where(y == 1, 1, 0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Train a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100)
rf_classifier.fit(X_train, y_train)

# Calibrate the classifier using CalibratedClassifierCV with sigmoid (Platt) calibration;
# by default this refits clones of the estimator with 5-fold cross-validation
calibrated_classifier = CalibratedClassifierCV(rf_classifier, method='sigmoid')
calibrated_classifier.fit(X_train, y_train)

# Predict probabilities for the test set (probabilities of the positive class)
probabilities = calibrated_classifier.predict_proba(X_test)[:, 1]

# Compute calibration curve
prob_true, prob_pred = calibration_curve(y_test, probabilities, n_bins=10, pos_label=1)

# Plot calibration curve
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', linestyle='-', color='b', label='Calibration Curve')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfectly Calibrated')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.show()

[Image: The output calibration curve]

Returning to the insurance firm scenario, we opted for isotonic regression. Despite a slight decrease in accuracy, the calibrated results satisfied the firm.

It's essential to acknowledge that there are scenarios where model calibration may not be necessary. While the main objective of calibration is to make the output probabilities meaningful and interpretable, there are situations where this is not crucial. For instance, consider a model tasked with ranking the quality of news article titles, where the policy is simply to select the top-ranked title. In such cases, calibrating the model offers little benefit, since the focus is solely on identifying the best-performing title rather than interpreting probabilities. In summary, a well-calibrated model is expected to show a lower Brier score and a lower log-loss than one that is poorly calibrated.

In conclusion, this post delved into the calibration process for machine learning models. As highlighted earlier, the scores generated by a model often do not represent the true probabilities of outcomes. In contexts like insurance, where accurate probabilities are crucial, calibration becomes indispensable. However, it's important to acknowledge that calibration may affect the model's raw performance, necessitating a trade-off between accuracy and calibration.