Have you ever found yourself packing for a trip, weighing the necessity of each item? What did you decide to leave behind?

So, suppose you are packing for a hiking trip where you need to limit the weight of your backpack.


You have several essential items, like water and food, and others that are nice-to-haves, like a pen or a book. The goal is to minimize the weight while keeping what's necessary to enjoy the hike.

In Lasso Regression (Aw, we aren't really going for a hike!), coefficients are the "items," and we try to minimize the weight of our model. If a feature doesn't add much value (like the book), Lasso can shrink its coefficient all the way to zero, leaving it out of the backpack entirely.


Lasso Regression

Lasso, which stands for Least Absolute Shrinkage and Selection Operator, adds a unique penalty to the regression model.

Like Ridge Regression, it includes a regularization term to keep the coefficients under control. Here, however, the penalty is the sum of the absolute values of the coefficients rather than the sum of their squares.

This difference allows Lasso to do something unique: it can shrink some coefficients to zero, effectively selecting only the most relevant features in the model.
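Here's a minimal sketch of that behavior using scikit-learn's Lasso. The data is synthetic, and alpha is scikit-learn's name for the penalty strength we'll call λ below:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only the first two actually drive the target
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha plays the role of the regularization strength (lambda)
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)
# Something like [ 2.9  1.9  0. -0.  0.] -- the three irrelevant features
# are driven exactly to zero, i.e. left out of the backpack
```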

Linear Regression and the Need for Lasso

In Linear Regression, we minimize the sum of squared residuals (SSR), aiming for a line that best fits all data points. However, this can lead to overfitting, especially with complex datasets.

Comparison of linear, lasso and ridge regression. Image by Author

Ridge Regression addresses this by shrinking the coefficients through a penalty, but it doesn't eliminate them.

Enter Lasso Regression, which can shrink some coefficients to exactly zero, removing irrelevant features and simplifying the model.

This approach can make the model more interpretable by focusing only on the core predictors.
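To make the contrast concrete, here's a small side-by-side sketch on made-up data (the feature count, penalty strengths, and coefficient values are illustrative, not taken from the figures):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Two informative features, four pure-noise features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))  # noise features shrink but stay non-zero
print("Lasso:", np.round(lasso.coef_, 3))  # noise features are set exactly to zero
```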

How Lasso Regression Works

In Lasso Regression, the goal is to minimize the cost function with an added penalty term:

Cost Function for Lasso:

Cost = SSR + λ ∑ |θⱼ|

Where:

  • The first term, SSR, is the sum of squared residuals: the same error term we minimize in plain Linear Regression.
  • ∑|θⱼ| is the L1 penalty: the sum of the absolute values of the coefficients.
  • λ is the regularization parameter.

Here, λ (lambda) controls the strength of the penalty. With a higher λ, Lasso applies more pressure to the coefficients, possibly shrinking some of them to zero.

This penalty term leads Lasso to automatically select features that matter most, discarding others by setting their coefficients to zero.
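As a sanity check on the formula, here's a tiny sketch that evaluates this cost for a given coefficient vector (the data and coefficients are arbitrary placeholders):

```python
import numpy as np

def lasso_cost(X, y, theta, lam):
    """SSR plus the L1 penalty: sum((y - X @ theta)^2) + lam * sum(|theta_j|)."""
    residuals = y - X @ theta
    ssr = np.sum(residuals ** 2)
    l1_penalty = lam * np.sum(np.abs(theta))
    return ssr + l1_penalty

# Made-up numbers, just to exercise the formula
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 5.0])
theta = np.array([1.0, 0.8])

print(lasso_cost(X, y, theta, lam=0.5))  # a larger lam means a larger penalty for the same theta
```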

Example: Predicting House Prices with Lasso

Let's say you have a dataset predicting house prices based on multiple features, like house size, number of rooms, and neighborhood quality. Some of these features might not contribute significantly to predicting the price.

House price dataset, image by author

Step 1: Standardize the Dataset

Before applying Lasso, we standardize each feature to have a mean of 0 and a standard deviation of 1. This ensures that the penalty λ affects all features equally.

Mean and standard deviation of each feature.

Standardized Dataset

The standardized dataset (with mean 0 and standard deviation 1) looks as follows:

Standardized dataset
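Here's what that standardization step looks like in code. The feature values below are placeholders, not the actual numbers from the table above:

```python
import numpy as np

# Hypothetical raw features: house size, number of rooms, neighborhood quality, distance to parks
X = np.array([
    [2100, 3, 8, 1.2],
    [1600, 2, 6, 3.5],
    [2500, 4, 9, 0.8],
    [1800, 3, 5, 2.0],
], dtype=float)

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.round(X_std.mean(axis=0), 6))  # ~[0, 0, 0, 0]
print(np.round(X_std.std(axis=0), 6))   # [1, 1, 1, 1]
```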

Step 2: Set Up the Lasso Cost Function

The Lasso cost function for this setup is:

Cost = SSR + λ (|θ₁| + |θ₂| + |θ₃| + |θ₄|)

Let's set a small λ value, say λ = 0.5, to see its effect on feature selection.

Step 3: Calculate Initial Coefficients Using Linear Regression

Using ordinary least squares (OLS), we first calculate initial coefficients (without regularization) as a baseline.

Initial coefficients
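In code, this baseline fit might look like the sketch below. Since the actual numbers live in the figures, the data here is a synthetic stand-in for the four standardized features:

```python
import numpy as np

# Synthetic stand-ins for the four standardized features and the house prices
rng = np.random.default_rng(7)
X_std = rng.normal(size=(50, 4))
y = 40 * X_std[:, 0] + 25 * X_std[:, 1] + 5 * X_std[:, 2] + 2 * X_std[:, 3] \
    + rng.normal(scale=3, size=50)

# Ordinary least squares: the coefficients that minimize the SSR, with no penalty yet
theta_ols, *_ = np.linalg.lstsq(X_std, y, rcond=None)
print(np.round(theta_ols, 2))  # roughly [40, 25, 5, 2] for this synthetic data
```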

Step 4: Apply Lasso Penalty

Now, we adjust each coefficient using Lasso's soft-thresholding rule to determine the new values. The soft-thresholding rule for each coefficient θⱼ is:

θⱼ(Lasso) = sign(θⱼ(OLS)) · max(|θⱼ(OLS)| - λ, 0)
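A minimal implementation of this rule, applied to placeholder OLS coefficients (the worked values appear only in the figures):

```python
import numpy as np

def soft_threshold(theta_ols, lam):
    """Shrink each coefficient's absolute value by lam; set it to zero if it crosses zero."""
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam, 0.0)

# Placeholder OLS coefficients for the four features, and the lambda chosen above
theta_ols = np.array([40.0, 25.0, 5.0, 2.0])
lam = 0.5

print(soft_threshold(theta_ols, lam))  # [39.5 24.5  4.5  1.5] -- each shrinks by 0.5
```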

Calculating the Adjusted Coefficients

1. House Size (X1):

value of theta 1

2. Number of Rooms (X2):

value of theta 2

3. Neighborhood Quality (X3):

value of theta 3

4. Distance to Parks (X4):

value of theta 4

Step 5: Interpretation of Results

After applying the Lasso penalty, we see that each coefficient is slightly reduced (its absolute value shrinks by λ = 0.5), and none has been pushed all the way to zero.


In this small-λ example, Lasso hasn't zeroed out any coefficients because λ isn't large enough. However, as we increase λ, the less impactful features, like Neighborhood Quality and Distance to Parks, may be driven to zero if they don't contribute enough to the prediction.
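One way to see this is to sweep λ and watch which coefficients survive. The data below is again a synthetic stand-in, with the last two features deliberately weak:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))  # stand-ins for size, rooms, neighborhood quality, distance
y = 40 * X[:, 0] + 25 * X[:, 1] + 2 * X[:, 2] + 1 * X[:, 3] + rng.normal(scale=5, size=200)

# scikit-learn's alpha corresponds to the lambda in the cost function
for alpha in [0.1, 1, 5, 20]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"lambda = {alpha:>4}: {np.round(coefs, 2)}")
# As lambda grows, the weakest features are the first to be pushed to exactly zero.
```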

Wrapping Up

Lasso Regression is like packing that backpack: it helps your model keep only the essentials and leave the distractions behind.

With Lasso, you can achieve a more streamlined model that's easier to interpret and less prone to overfitting. It's especially valuable when you have many predictors and want a model that highlights only the core insights in your data.