Welcome back to the 'Courage to Learn ML' series, where we conquer machine learning fears one challenge at a time. Today, we're diving headfirst into the world of loss functions: the silent superheroes guiding our models to learn from their mistakes. In this post, we'll cover the following topics:

  • What is a loss function?
  • Difference between loss functions and metrics
  • Explaining MSE and MAE from two perspectives
  • Three basic ideas when designing loss functions
  • Using those three basic ideas to interpret MSE, log loss, and cross-entropy loss
  • Connection between log loss and cross-entropy loss
  • How to handle multiple loss functions (objectives) in practice
  • Difference between MSE and RMSE

What are loss functions, and why are they important in machine learning models?

Loss functions are crucial in evaluating a model's effectiveness during its learning process, akin to an exam or a set of criteria. They serve as indicators of how far the model's predictions deviate from the true labels (the 'correct' answers). Typically, loss functions assess performance by measuring the discrepancy between the predictions made by the model and the actual labels. This measured gap informs the model about the extent of adjustments needed in its parameters, such as weights or coefficients, to more accurately capture the underlying patterns in the data.

There are many different loss functions in machine learning, and the right choice depends on several factors. These include the nature of the predictive task at hand (regression or classification), the distribution of the target variable, as illustrated by the use of Focal Loss for handling imbalanced datasets, and the specific learning methodology of the algorithm, such as the application of hinge loss in SVMs. Understanding and selecting the appropriate loss function is quite important, since it directly influences how a model learns from the data.

To learn machine learning, one should know the most popular loss functions. For example, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are commonly used in regression problems, while cross-entropy is the most common loss function for classification tasks.

How do loss functions differ from metrics, and in what ways can a loss function also serve as a metric?

The idea that a loss function can also serve as a metric is a bit misleading. Loss functions and metrics both assess model performance, but at different stages and for different purposes:

  • Loss Functions: These are used during the model's learning process to guide its adjustments. They need to be differentiable to facilitate optimization. For instance, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are common loss functions in regression models.
  • Metrics: These evaluate the model's performance after training. Metrics should be interpretable and provide clear insights into model effectiveness. While some metrics, like accuracy, can be straightforward, others like F1 score involve threshold decisions and are non-differentiable, making them less suitable for guiding learning.

Notably, some measures, such as MSE and MAE, can serve both as loss functions and metrics due to their differentiability and interpretability. However, not all metrics are suitable as loss functions, primarily due to the need for differentiability in loss functions for optimization purposes.

In practice, one should always carefully choose the loss function and metrics together for learning, and ensure that the learning and evaluation are aligned in the same direction. This alignment ensures that the model is optimized and evaluated based on the same criteria that reflect the end goals of the application.

Author's Note: It's important to clarify that using the F1 score as a loss function in machine learning models isn't entirely infeasible. In my ongoing study, I've encountered innovative methods that address the non-differentiability issue commonly associated with the F1 score. For instance, Ashref Maiza's post introduces a differentiable approximation of the F1 score. This approach involves "softening" precision and recall using likelihood concepts, rather than setting arbitrary thresholds. Additionally, some online discussions explore similar themes.

The challenge lies in the inherent nature of the F1 score. While it's a highly informative metric, selecting an appropriate loss function to effectively optimize the model under the same criteria can be complex. Moreover, tuning thresholds adds another layer of complexity. I'm really interested in this topic. If you have insights or experiences to share, please feel free to connect with me. I'm eager to expand my understanding and engage in further discussions.

You mentioned MSE and MAE as typical metrics in regression problems. What are they, and when should we use each?

In regression problems, where the predictions are continuous values, the goal is to minimize the difference between the model's predictions and the actual values. To assess the model's effectiveness in grasping the underlying pattern, we use metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). Both these metrics quantify the gap between predictions and actual values, but they do so using different evaluation approaches.

MSE is defined as:

MSE = (1/n) * Σ_{i=1}^{n} (y_i - y_hat_i)^2
Here, y_i is the actual value, y_hat_i is the predicted value, and n is the number of observations.

MSE calculates the average of the squared differences between predictions and actual values, which corresponds to the squared Euclidean distance (l2 norm) between the predictions and the true labels.

On the other hand, MAE is defined as:

MAE = (1/n) * Σ_{i=1}^{n} |y_i - y_hat_i|

Here, the absolute differences between the actual and predicted values are averaged, corresponding to the Manhattan distance (l1 norm). In other words, MAE calculates the average distance between the estimated values and the actual values without considering the direction (positive or negative) of the errors.

We talked about the Lp norm and different distances in our discussion on l1 and l2 regularizations: https://medium.com/p/1bb171e43b35

The primary distinction between MSE and MAE is their response to outliers. MSE, by squaring the errors, amplifies and gives more weight to larger errors, making it sensitive to outliers. This is useful if larger errors are more significant in your problem context. However, MAE assigns equal weight to all errors, making it more robust to outliers and non-normal error distributions.

The choice between MSE and MAE should be based on the properties of the training data and the implications of larger errors in the model. MSE is preferable when we want to heavily penalize larger errors, while MAE is better when we want to treat all errors equally.
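To make the contrast concrete, here is a minimal NumPy sketch (the sample values are made up for illustration) comparing how the two metrics react when a single outlier is added:

```python
import numpy as np

def mse(y_true, y_pred):
    # Average of squared differences: large errors dominate
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Average of absolute differences: every error weighted equally
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print(mse(y_true, y_pred))  # 0.875
print(mae(y_true, y_pred))  # 0.75

# Adding one badly mispredicted outlier inflates MSE far more than MAE
y_true_out = np.append(y_true, 100.0)
y_pred_out = np.append(y_pred, 10.0)
print(mse(y_true_out, y_pred_out))  # 1620.7
print(mae(y_true_out, y_pred_out))  # 18.6
```

A single error of 90 multiplies the MSE by over a thousand, while the MAE grows only about 25-fold, which is exactly the sensitivity difference described above.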

I get that squaring the differences in MSE amplifies the errors, leading to a greater emphasis on outliers. Are there other perspectives or aspects that help differentiate between these two metrics?

Certainly, there's another perspective to understand the differences between Mean Squared Error (MSE) and Mean Absolute Error (MAE) beyond their handling of outliers. Imagine you're tasked with predicting a value 'y' without any additional features (no 'Xs'). In this scenario, the simplest model would predict a constant value for all inputs.

When using MSE as the loss function, the constant that minimizes the MSE is the mean of the target values. This is because the mean is the central point that minimizes the sum of squared differences from all other points. On the other hand, if you use MAE, the median of the target values is the minimizing constant. The median, unlike the mean, is less influenced by extreme values or outliers.

In the universe of Douglas Adams' 'The Hitchhiker's Guide to the Galaxy,' 42 is the ultimate answer to life, the universe, and everything. Who knows, maybe 42 is also the magic number to shrink your loss function, but hey, it all depends on what your loss function is! Image created by ChatGPT

This difference in sensitivity to outliers stems from how the mean and median are calculated. The mean takes into account the magnitude of each value, making it easily skewed by outliers. The median, however, is only concerned with the order of the values, thus maintaining its position regardless of the extremities in the dataset. This intrinsic property of the median contributes to MAE's robustness to outliers, providing an alternative interpretation of the distinct behaviors of MSE and MAE in modeling contexts.

You can find an explanation of why the mean minimizes MSE and the median minimizes MAE in Shubham Dhingra's post.
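We can also check this claim numerically with a small, illustrative brute-force search over constant predictions (the data values and grid resolution here are arbitrary):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value

# Evaluate every constant prediction on a fine grid
candidates = np.linspace(0.0, 100.0, 100001)
diffs = y[None, :] - candidates[:, None]   # shape (100001, 5) via broadcasting
mse_vals = (diffs ** 2).mean(axis=1)
mae_vals = np.abs(diffs).mean(axis=1)

best_for_mse = candidates[np.argmin(mse_vals)]
best_for_mae = candidates[np.argmin(mae_vals)]

print(best_for_mse, np.mean(y))    # both ~22.0: the mean minimizes MSE
print(best_for_mae, np.median(y))  # both ~3.0: the median minimizes MAE
```

Note how the outlier drags the MSE-optimal constant all the way to 22, while the MAE-optimal constant stays at the median, 3.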

We've talked about how MSE and MAE measure errors, but there's more to the story. Different tasks need different ways to measure how good our models are doing. This is where loss functions come in, and there are three basic ideas behind them. Understanding these ideas will help you pick the right loss function for any job. So, let's get started with the most important question:

What are the 3 basic ideas that guide the design of any loss function?

In designing loss functions, three basic ideas generally guide the process:

  1. Minimizing Residuals: The key is to reduce the residuals, which are the differences between predicted and actual values. To address both negative and positive discrepancies, we often square these residuals, as seen in the least squares method. This approach, which sums the squared residuals, is a staple in regression problems for its simplicity and effectiveness.
  2. Maximizing Likelihood (MLE): Here, the goal is to adjust the model parameters to maximize the likelihood of the observed data, making the model as representative of the underlying process as possible. This probabilistic approach is fundamental in models like logistic regression and neural networks, where fitting the model to the data distribution is crucial.
  3. Distinguishing Signal from Noise: This principle, rooted in information theory, involves separating valuable data (signal) from irrelevant data (noise). Methods based on this idea, focusing on entropy and impurity, are essential in classification tasks and form the basis for algorithms like decision trees.

Additionally, it's important to recognize that some loss functions are tailored to specific algorithms, such as the hinge loss for SVM, indicating that the nature of the algorithm also plays a role in loss function design. The nature of the data impacts the selection as well. For instance, in cases of imbalanced training data, we might adjust our loss function to a class-balanced loss or opt for focal loss.

Now, equipped with these fundamental concepts, let's apply them for interpretive analysis to enhance our comprehension. With this approach, we can attempt to address the following question:

How might we apply MLE and the least squares method to enhance our comprehension of MSE?

First, let's break down MSE with the least squares method. The least squares estimation (LSE) approach finds the best model fit by minimizing the sum of the squares of the residuals. In linear regression (which deals with continuous outputs), a residual is the difference between the predicted value and the actual label. MSE, or Mean Squared Error, is essentially the average of these squared differences. Therefore, the least squares method aims to minimize MSE (factoring in this averaging step), making MSE an appropriate loss function for this method.

Next, looking at MSE from a Maximum Likelihood Estimation (MLE) perspective, under the assumption of linear regression, we typically assume that residuals follow a normal distribution. This allows us to model the likelihood of observing our data as a product of individual probability density functions (PDFs). For simplification, we take the natural logarithm of this likelihood, transforming it into a sum of the logs of individual PDFs. It's important to note that we use density functions for continuous variables, as opposed to probability mass functions for discrete variables.

L(θ) = Π_{i=1}^{n} (1/√(2πσ^2)) * exp(-(y_i - y_hat_i)^2 / (2σ^2))

log L(θ) = -(n/2) * log(2πσ^2) - (1/(2σ^2)) * Σ_{i=1}^{n} (y_i - y_hat_i)^2
Note: Likelihood calculations differ for discrete and continuous variables. Discrete variables use a probability mass function, while continuous variables employ a probability density function. For more on MLE, refer to my previous post.

When we examine the log likelihood, it comprises two parts: a constant component and a variable component that sums the squared differences between the true labels and predictions. To maximize this log likelihood, we focus on minimizing the variable component, which is essentially the sum of squared residuals. In the context of linear regression, this minimization equates to minimizing MSE, especially when we consider the scaling factor 1/(2σ²) that arises from the normal distribution assumption.

In summary, MSE can be derived and understood from both the perspectives of the Least Squares Estimation (LSE) and MLE, with each approach providing a unique lens into the significance and application of MSE in regression analysis.
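As a quick numerical sanity check of this relationship, the sketch below (using synthetic data and an assumed known σ) verifies that the negative log likelihood is just a constant plus a rescaled MSE, so minimizing one minimizes the other:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
sigma = 0.3

y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(scale=sigma, size=n)  # residuals ~ N(0, sigma^2)

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)

# Negative log likelihood of the data under Gaussian residuals
nll = 0.5 * n * np.log(2 * np.pi * sigma ** 2) \
    + np.sum(residuals ** 2) / (2 * sigma ** 2)

# NLL = constant + n / (2 sigma^2) * MSE, term for term
constant = 0.5 * n * np.log(2 * np.pi * sigma ** 2)
print(np.isclose(nll, constant + n * mse / (2 * sigma ** 2)))  # True
```

Since the constant does not depend on the model's parameters, gradient-based optimization of the NLL and of the MSE follows the same path.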

So what is a good loss function for logistic regression, or for classification problems more generally?

Alright, diving into the world of loss functions for logistic regression, let's see how we can apply some basic design ideas to understand them better.

First off, let's look at the least squares method. The core idea here is to minimize the gap between our model's output and the true labels. A straightforward approach is setting a threshold to convert logistic regression's probability outputs into binary labels, and then comparing these with the true labels. If we choose, say, a 0.5 threshold for classifying donuts and bagels, we label predictions above 0.5 as donuts and below as bagels, then tally up the mismatches. This approach, known as the 0-1 loss, corresponds directly to accuracy but isn't used as a loss function for training due to its non-differentiability and non-convex nature, which make it impractical for optimization methods like gradient descent. It's more of a conceptual approach than a practical loss function.

When I first visited America, I couldn't tell the difference between a donut and a bagel. A classifier to distinguish between donuts and bagels could be useful. Image created by ChatGPT

Moving on, let's use the MLE (Maximum Likelihood Estimation) idea. In logistic regression, MLE tries to find the weights and bias that maximize the probability of seeing the actual observed data. Imagine our goal is to find a set of weights and bias that maximize the log likelihood, where the likelihood L is the product of individual probabilities of observing each outcome. We're assuming our data points are independent and each follows a Bernoulli distribution.

L = Π_{i=1}^{n} p_i^{y_i} * (1 - p_i)^{1 - y_i}, where p_i is the model's predicted probability that observation i is positive

So we'd have the log loss as:

Log Loss = -(1/n) * Σ_{i=1}^{n} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
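A minimal implementation of this log loss might look like the following (the clipping value `eps` is a common but arbitrary safeguard against taking log(0)):

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    # Clip probabilities so log(0) never occurs
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.65])
print(round(log_loss(y, p), 4))  # ~0.2162: confident, correct predictions give a small loss
```

Confident wrong predictions (say, p near 0 for a true label of 1) are punished very heavily, since -log(p) grows without bound as p approaches 0.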

Finally, let's bring in some information theory, treating logistic regression as a signal capture machine. In this approach, we employ concepts like entropy and cross-entropy to assess the information our model captures. Entropy measures the amount of uncertainty or surprise in an event. Cross-entropy gauges how well our model's predicted probability distribution lines up with the actual, true distribution. The goal here is to minimize cross-entropy, which is closely related to minimizing the KL divergence. Though not exactly a 'distance' in the strict sense, KL divergence represents how far off our model's predictions are from the actual labels.

Softmax is another topic on my writing list. Source: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e

So, through the application of three distinct design principles for loss functions, we've crafted various types of loss functions suitable for logistic regression and broader classification challenges.

It's particularly fascinating to observe that, despite originating from different perspectives, log loss and cross-entropy loss are essentially the same in the context of binary classification. This convergence occurs in situations where only two possible outcomes exist; under these conditions, cross-entropy effortlessly simplifies and transforms into log loss. Comprehending this shift is vital for understanding the interplay and practical application of these theoretical concepts:

Derive log loss from cross-entropy loss. Source: https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
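The equivalence can also be verified with a small sketch: writing each binary label and prediction as a two-class distribution, the general cross-entropy matches the binary log loss exactly (the sample values here are illustrative):

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    # General cross-entropy between target and predicted class distributions
    q = np.clip(q_pred, eps, 1.0)
    return float(-np.sum(p_true * np.log(q), axis=-1).mean())

def binary_log_loss(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1])
p = np.array([0.8, 0.3, 0.6])

# Rewrite the binary problem as two-class distributions
targets = np.stack([1 - y, y], axis=1)  # one-hot true labels
preds = np.stack([1 - p, p], axis=1)    # predicted distribution per sample

print(np.isclose(cross_entropy(targets, preds), binary_log_loss(y, p)))  # True
```

With only two classes, the sum inside the cross-entropy collapses to the two terms of the log loss, which is exactly the simplification described above.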

Author's Note: In the future, I'm considering delving into the fascinating world of information theory, a topic that, surprisingly, is both intuitive and practical in real-world applications. Until then, I highly recommend Kiprono Elijah Koech's post as an excellent resource on the subject. Stay tuned for more!

In practical scenarios, how should one approach the situation where multiple loss functions need to be minimized?

When managing multiple loss functions in a model, balancing them can be challenging, as they may conflict. One common approach is to create a weighted sum of these loss functions, assigning specific weights to each. However, this introduces new hyperparameters (the weights), necessitating careful tuning. Adjusting these weights means retraining the model, which can be time-consuming and may affect interpretability and performance.
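A bare-bones sketch of the weighted-sum approach might look like this (the loss choices and weights are placeholders; in practice the weights are hyperparameters you would tune):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def weighted_total_loss(y_true, y_pred, losses, weights):
    # weights are extra hyperparameters: changing them means retraining
    return sum(w * loss(y_true, y_pred) for loss, w in zip(losses, weights))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

total = weighted_total_loss(y_true, y_pred, [mse, mae], [0.7, 0.3])
print(total)
```

Every new pair of weights defines a different optimization target, which is why tuning them carefully (and retraining for each setting) is unavoidable with this approach.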

Alternatively, a constraint-based approach can be effective. For instance, in SVM, we aim to maximize the margin (reducing variance) while minimizing classification error (reducing bias). This can be achieved by treating the margin maximization as a constraint, using techniques like Lagrange multipliers, and focusing on minimizing the classification error. This method requires a strong mathematical foundation and thoughtful formulation of constraints.

A third option is to decouple the objectives, building separate models for each and then combining their results. This approach simplifies model development and maintenance, as each model can be independently monitored and retrained. It also offers flexibility in responding to changes in objectives or business goals. However, it's important to consider how the combined results of these models align with the overall objective.

As a side note, the adversarial loss in GANs isn't just a weighted combination of the discriminator's and generator's losses. The two networks are engaged in a responsive interaction, learning and adapting in response to each other, rather than optimizing their losses independently.

Before we conclude, I'd like to address a straightforward yet practical query:

Why do we sometimes prefer using RMSE (Root Mean Squared Error) instead of MSE?

RMSE (Root Mean Squared Error) is often preferred over MSE (Mean Squared Error) in certain situations due to its interpretability. By taking the square root of MSE, RMSE converts the error units back to the original units of the data. This makes RMSE more intuitive and directly comparable to the scale of the data being analyzed. For instance, if you're predicting housing prices, RMSE provides an error metric in the same unit as the prices themselves, making it easier to understand the magnitude of the errors.
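As a tiny illustration with made-up housing prices (in thousands of dollars):

```python
import numpy as np

# Hypothetical house prices, in thousands of dollars
y_true = np.array([250.0, 300.0, 400.0])
y_pred = np.array([240.0, 330.0, 370.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(mse)   # ~633.33, in (thousands of dollars) squared: hard to read
print(rmse)  # ~25.17, back in thousands of dollars, the unit of the prices
```

Reporting "we're typically off by about $25k" is far easier to communicate than an error of 633 squared-thousands-of-dollars.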

Additionally, like MSE, RMSE is more sensitive to larger errors than MAE (Mean Absolute Error), because the errors are squared before averaging, which emphasizes significant deviations. This can be particularly useful in scenarios where larger errors are more undesirable.

If you're enjoying this series, remember that your interactions (claps, comments, and follows) do more than just support; they're the driving force that keeps this series going and inspires my continued sharing.

Other posts in this series:

If you liked the article, you can find me on LinkedIn.