One of a data scientist's worst nightmares is data leakage. Why is it so harmful? Because it is practically designed to convince you that your model is excellent and that you should expect awesome results in the future.

As you may know, the primary goal of every machine learning model is to generalize the patterns it learns to future data. As we can't really access future data (without building a time machine first), the best we can do during model development is to set aside a validation set and a test set to assess model performance.

During development, data scientists aim to improve the model's validation and test performance. If the results are satisfactory, they assume the model is ready for production. And that's why data leakage is so dangerous: because it hides in seemingly innocent behaviors, it can lurk in the background, waiting for you to put that model into production, only for you to find out that real-life performance is far worse than it was on the test set.

Normally, data leakage involves polluting the train set with information that won't be available at prediction time, breaking the test set's role as a stand-in for future data. It's one of the major causes of unsuccessful model deployments.

In this blog post, we'll discuss some of the most common data leakage sources and some tips on how to spot them!

Let's dive in!

Feature Leakage

Probably the most famous data leakage source (and the easiest to spot, as well).

Feature leakage happens when a feature is directly related to the target, often because it is a result of the target event. Typically, this occurs because the feature value is updated at a point in time after the target event.

Let's see some examples:

  • 1) You are trying to predict whether a customer will default on a loan, and one of your features is the number of outbound calls the customer received in the past 30 days. What you don't know is that, in this fictional bank, a customer only receives outbound calls after they have already entered a default scenario.
  • 2) You are trying to predict whether a patient has a certain disease and are using the number of times the patient went through a specific diagnostic test. However, you later find out that this test is only prescribed to people who already have a high likelihood of having the disease.
  • 3) You are using a feature named days_finish_contract to predict whether a customer will churn.

These three examples also show different levels of feature leakage. In 1) and 3), there is a direct correlation between features and target, and it's very likely that close to 100% of the positive targets will have these variables as non-null values. However, 2) is harder to spot, because other patients may have gone through the diagnostic test without having the disease, making this leakage source less obvious.

One way to spot feature leakage is via the feature importances reported by different models. What works very well for me is fitting an extremely shallow decision tree and checking whether any single feature has an enormous importance compared to the others. When a feature is that over-indexed, it is usually related to a leakage problem.
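
As a rough sketch of this check (assuming a pandas DataFrame df with numeric features and a binary target column; the names and the 0.9 threshold are illustrative, not from any particular project):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def leakage_suspects(df: pd.DataFrame, target: str = "target", threshold: float = 0.9) -> pd.Series:
    """Fit a very shallow tree and return features whose importance dominates all others."""
    X, y = df.drop(columns=[target]), df[target]
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    importances = pd.Series(tree.feature_importances_, index=X.columns)
    # A feature that alone accounts for almost all of the splits is a leakage suspect.
    return importances[importances > threshold].sort_values(ascending=False)
```

If days_finish_contract from example 3) showed up here with an importance close to 1.0, that would be a strong sign it is a product of the target event rather than a legitimate predictor.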

To solve these problems, it's recommended that you look into how the leaky features were generated so that you can:

  • Remove the feature if there's no other way to generate it.
  • Change the way the feature is generated so that no information from after the target event leaks into it.

Time-Correlated Data Using a Random Train-Test Split

Funny story: I learned this one the hard way. In the past, I developed machine learning models to predict soccer game results and ended up with a betting system that was completely flawed.

You've probably felt something along these lines before: have you ever trained a time-series model and gotten spectacular results on the test set?

That's quite common when you perform a random train-test split on a dataset whose rows have a time dependence. Even if you are not training a time-series model, if your rows are time-dependent you should always perform a contiguous train-test split based on time, never a random one.
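
A minimal sketch of what a contiguous split looks like, assuming a pandas DataFrame df with an event-time column called date (the names are illustrative):

```python
# Sort chronologically and cut once: older rows train, newer rows test.
df = df.sort_values("date")
split_idx = int(len(df) * 0.8)            # first 80% of the timeline for training
train, test = df.iloc[:split_idx], df.iloc[split_idx:]
# A shuffled train_test_split would instead scatter future rows into the train set.
```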

Time-dependent data is tricky to model, particularly because covariates and the data distribution shift as time goes by. If you leak any portion of those newer distributions into the train set, your evaluation will be flawed.

The recommendation for time-dependent data is to assess your models using a continuous cross-validation (CV) scheme, similar to the one below:

Appropriate Train and Test for Time-Dependent Data — Image by Author
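
If you work with scikit-learn, TimeSeriesSplit gives you this kind of scheme out of the box. A hedged sketch, assuming a feature matrix X and target y that are already sorted chronologically and that a gradient boosting model is a reasonable baseline:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains on an earlier window and validates on the period right after it,
# so future rows never end up on the training side of a split.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv)
print(scores)
```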

So, yeah... never use a random train-test split on time-dependent data!

Hinting Something to the Test Set via Scaling and Preprocessing

This is a trickier one, as sometimes even experienced data scientists end up falling for it. When performing outlier analysis, scaling, or missing-value imputation, always make sure you run those analyses on the train set only, and only after the train-test split.

Sometimes people read this as "don't apply the same transformations to the test set". You should still apply them (your test set needs to mimic your training set's scale and not contain missing values), but you shouldn't give the train set any hint of the test set's distribution.

For example, imagine you are applying a standard scaler (which computes a z-score using the mean and standard deviation). You want this scaler to memorize the mean and standard deviation of the train set only. After that, you apply the same scaler, with those fixed training-set statistics, to the test set.
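
A minimal sketch with scikit-learn, assuming you already have numeric arrays X_train and X_test from an earlier split:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the train set only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics; never fit on test
```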

Take another common preprocessing step: removing outliers. Does it make sense to remove outliers from the test set? Well, if you know what will be an outlier in the future (remember that a test set is just a rough simulation of the future data the model will be used on), then please tell me, because you have just invented a time machine!

As it's not possible to remove outliers from the future (of course, you can still prevent obvious outliers from being scored by your model), you generally lack information about what a "future" outlier will look like. If you do remove outliers from the test set, you should expect a larger gap between test-set performance and real-world deployment.

To avoid this type of leakage, perform the train-test split before you do anything else to the data.
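
One convenient way to enforce that ordering (a sketch, not the only option, and assuming generic numeric features X and target y) is to wrap preprocessing and model in a scikit-learn Pipeline, so imputation and scaling are re-fit on the training portion of every split:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score re-fits the imputer and scaler on the training part of each fold,
# so no statistics from the held-out fold leak into the preprocessing.
print(cross_val_score(model, X_train, y_train, cv=5))

# Final check: fit on the full train set, score once on the untouched test set.
print(model.fit(X_train, y_train).score(X_test, y_test))
```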

A Data Scientist finding out their model is wrong in Production — Image by AI

Leakage from Data Generation and Sources

This one is trickier to spot. Leakage from data generation and/or data sources is normally tied to how the data is collected, and as a data scientist you probably won't have full control over it.

Let's take the customer default example from the Feature Leakage section: imagine you are building a model to predict whether a customer will default on their next installment, and you find that the number of outbound calls to the customer is a great predictor.

You speak with the people who actually make these phone calls and the process seems sound: they do call riskier customers, so it may make sense to include this feature in the model. But after sitting down with them, you discover that some calls happen after they already know the customer will default.

What happens is that your IT systems only log defaults one day after they happen, so some of these calls are really calls to ask the customer why they defaulted. If you use this feature (the call is made on day x, the default is logged on day x+1), you'll be adding leakage to your model, as the feature is leaking information about the target.
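
Once you know the relevant timestamps, a rough point-in-time check can surface this kind of issue. A sketch with hypothetical pandas columns call_timestamp and default_timestamp on a DataFrame df (the names are mine, not from any real schema):

```python
# A feature event must happen strictly before the label event,
# otherwise it could not have been known at prediction time.
leaky = df[df["call_timestamp"] >= df["default_timestamp"]]
print(f"{len(leaky)} of {len(df)} rows have calls logged at or after the default itself")
```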

This is the hardest kind of leakage to spot, as it requires deep knowledge of how your features were generated, plus good data governance. Nevertheless, the shallow-tree technique from the Feature Leakage section is a quick-and-dirty way to check whether you have obvious leakage coming from the generation and sourcing of your data.

Feature Engineering

Another common source of data leakage is the feature engineering process itself. Although this is just a subset of feature leakage, it's so common that it deserves its own mention.

One common example is creating moving averages in time series. People often mix up time intervals and include, in the average, periods that the model is supposed to predict. When overlooked, this creates an enormous distortion in the time-series model.
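
A minimal sketch of the fix, assuming a pandas Series sales indexed by date (the name is illustrative): shift before averaging, so the window only sees values strictly before the period being predicted.

```python
# Leaky: the 7-period window includes the value of the period we are trying to predict.
leaky_ma = sales.rolling(window=7).mean()

# Safe: shift by one period first, so the average uses only strictly past observations.
safe_ma = sales.shift(1).rolling(window=7).mean()
```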

Another example is using features that were themselves generated by other machine learning models. These features are always risky, as the upstream models carry their own uncertainty (and may contain leakage themselves). They can have a big impact and seemingly create a much better model, one that has actually lost its generalization power.

Conclusion

And that's it! These are the most common data leakage problems that I've found in my career as a data scientist.

In a nutshell, data leakage is the silent killer of machine learning models. It makes you feel that you have a great model when its apparent performance is really driven by the leakage itself, and the damage shows up as soon as the model reaches production. Identifying leakage as early as possible will make your model not only more stable but also more accurate in the real world.

Do you know any other data leakage causes that I didn't mention here? Point them out in the comments!

Also, feel free to check out my YouTube channel:


You can subscribe to my channel "The Data Journey" here. Or my Udemy page here.