This article is about skewness and data scaling: how to detect and handle skewness, plus a practical Python hands-on with data scaling methods such as MinMaxScaler and StandardScaler, and with power transforms for skewness handling: log transformation, square root transformation, and Box-Cox.
This article is part of a series where we walk step by step through solving fintech problems with different machine learning techniques using the "All Lending Club loan" dataset. Here you can find the complete end-to-end data science project for beginners to learn data science.
What is skewness in your data?
Skewness refers to the asymmetry of a data distribution. When your data is skewed, the mean and the median differ. Skewness violates the normality assumption behind some ML models, for example, linear regression.
When you fit a linear regression model to skewed data, the results of this model might be misleading. You can always apply other models that are robust to skewness, such as tree-based algorithms; however, that would limit your possibilities in trying other models.
- Skewness-sensitive models: linear regression and, to a lesser degree, logistic regression (it is less sensitive than linear regression but may still benefit from transformation in some cases)
- Skewness-robust models: tree-based and other more complex algorithms
What are the data scales in your data?
Imagine you have two features: 1) loan interest rate and 2) loan amount. The two features have different scales, and because of this difference, the feature with larger values or ranges may sometimes dominate the feature with smaller values or ranges. This can degrade the predictive performance of ML algorithms, or sometimes even prevent the convergence of gradient-based estimators. In addition, the difference in ranges may make your data harder to visualize.
You can see in the example below how scaling affects the data: the left side shows the unscaled data and the right side shows the scaled data.
In simple language, skewness is about the shape of your data, and scaling is about the range of your data.
How do I detect skewness and deal with it?
To detect skewness in your data, you can use the following techniques:
- Data visualization (e.g. visualize with a histogram)
- Calculate the coefficient of asymmetry (aka coefficient of skewness) and the excess kurtosis. For a normal distribution, both indicators are equal to 0 (see the symmetrical distribution curve below as an example). Alternatively, use normality tests such as Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilk.
- Calculate the mean, mode, median and percentiles and compare them (the charts below show how the mode, median and mean shift depending on the skew); a short code sketch follows the rule-of-thumb list below
As a general rule of thumb:
- Data is symmetrical: skewness is between -0.5 and 0.5
- Data is slightly skewed: skewness is between -1 and -0.5 or 0.5 and 1
- Data is highly skewed: skewness is less than -1 or greater than 1.
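To make these techniques concrete, here is a minimal detection sketch run on a synthetic right-skewed sample, since the loan dataset is only loaded in the hands-on section; the generated data and the cut-off of 500 observations for Shapiro-Wilk are purely illustrative:
#a minimal skewness-detection sketch on a synthetic right-skewed sample
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
sample = pd.Series(rng.exponential(scale=2.0, size=10000))

print(sample.skew())                    #coefficient of skewness, ~2 here, i.e. highly skewed
print(stats.kurtosis(sample))           #excess kurtosis, 0 for a normal distribution
print(sample.mean(), sample.median())   #mean > median signals a right (positive) skew
print(stats.shapiro(sample.head(500)))  #Shapiro-Wilk normality test (small p-value: not normal)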
There are two types of skewness:
- Skewness is positive when the longer tail is on the right side of the distribution.
- Skewness is negative when the longer tail is on the left side of the distribution.
To deal with skewness and reshape the skewed data toward a normal (Gaussian, bell-shaped) distribution, you may apply the following techniques: square root, logarithm, or Box-Cox transformations. These are called power transforms, and there are many others. We will review their applications and limitations in the Python hands-on section later.
How do I handle data scaling?
Scaling is typically handled through normalization or standardization techniques. You need to normalize or standardize your data if you are going to use a machine learning algorithm that is sensitive to the scale of its input features, e.g. linear or logistic regression fitted with gradient descent, KNN, and neural networks. The differences between these two methods are:
- Normalization is a scaling technique to use when you cannot assume your data follows a Gaussian (bell curve/normal) distribution. It adjusts all your features to a similar scale, typically between 0 and 1.
- Standardization assumes that your data has a Gaussian (bell curve) distribution. When you apply this method, it transforms your data so that the resulting distribution has a mean of 0 and a standard deviation of 1. A small numeric sketch of both formulas follows this list.
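To see exactly what the two formulas do, here is a tiny hand-rolled sketch on a made-up array (for illustration only; the sklearn equivalents follow in the hands-on section):
#hand-rolled min-max normalization and z-score standardization on a toy array
import numpy as np

x = np.array([10.0, 12.5, 15.0, 30.0])

normalized = (x - x.min()) / (x.max() - x.min())  #min-max: squeezes values into [0, 1]
standardized = (x - x.mean()) / x.std()           #z-score: mean 0, standard deviation 1

print(normalized)    #[0. 0.125 0.25 1.]
print(standardized)  #mean ~0, std ~1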
MinMaxScaler and StandardScaler are the most popular methods to scale your data, but there are many others too.
Practical Python Hands On
Just as a note, I use the Kaggle environment to run my code; if you have never used Kaggle before, I suggest you read this article. Also, to provide use cases for these functions, we will use the All Lending Club loan dataset. A few preparations before we begin:
#load packages and set pd formats
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
pd.set_option('display.float_format', lambda x: '%.0f' % x)
#load dataset
loan = pd.read_csv('../input/lending-club/accepted_2007_to_2018Q4.csv.gz', compression='gzip', low_memory=True)
#consider only business-critical features
loans = loan[['loan_amnt', 'term','int_rate', 'sub_grade', 'emp_length', 'home_ownership', 'annual_inc', 'loan_status', 'addr_state', 'dti', 'mths_since_recent_inq', 'revol_util', 'bc_open_to_buy', 'bc_util', 'num_op_rev_tl']]
#remove missing values
loans = loans.dropna()
1-Data Scaling
Let's take an example of how to apply our knowledge in practice to a feature from the loan dataset. We will use the 'int_rate' feature; this is what its distribution looks like before any manipulation:
loans['int_rate'].hist(bins = 10, figsize = (20,10), color = 'black')
MinMaxScaler
MinMaxScaler is part of the sklearn library and is used for normalization. This method rescales the feature to the range between 0 and 1. Keep in mind this method is sensitive to outliers, so you will need to handle those before applying it; you can read this article to get a better idea of how to tackle outliers. Also, as we discussed in the previous section, this method does not change the shape of your data, it only rescales it.
from sklearn.preprocessing import MinMaxScaler

#fit the scaler on int_rate and rescale it into [0, 1]
scaler = MinMaxScaler()
scaled_mms = scaler.fit_transform(loans[['int_rate']])
scaled_mms = pd.DataFrame(scaled_mms, columns=['int_rate'])
scaled_mms.hist(bins = 10, figsize = (20,10), color = 'black')
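As a quick, purely illustrative sanity check, you can confirm that the scaled feature now spans exactly [0, 1]:
#sanity check: min-max scaled values span [0, 1]
print(scaled_mms['int_rate'].min(), scaled_mms['int_rate'].max())  #0.0 1.0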
StandardScaler
StandardScaler is part of the sklearn library and is used for standardization; it assumes that your data is normally distributed. This method rescales the feature so that its distribution is centered around 0 with a standard deviation of 1. Keep in mind this method is also sensitive to outliers, and you will need to handle those before applying it. Also, as we discussed in the previous section, this method does not change the shape of your data, it only rescales it.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaled_sc = scaler.fit_transform(loans[['int_rate']])
scaled_sc = pd.DataFrame(scaled_sc, columns=['int_rate'])
scaled_sc.hist(bins = 10, figsize = (20,10), color = 'black')
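Similarly, a quick illustrative check confirms that the standardized feature has a mean of about 0 and a standard deviation of about 1:
#sanity check: standardized values have mean ~0 and std ~1
print(scaled_sc['int_rate'].mean(), scaled_sc['int_rate'].std())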
2-Power Transforms, Skewness Handling
Log Transformation
Logarithmic transformations are typically used to "normalize" skewed distributions, that is, to reshape your data toward a bell shape. This method stabilizes variance and reduces skewness. However, it cannot be applied to zero or negative values.
loans['int_rate_log'] = np.log(loans['int_rate'])
loans['int_rate_log'].hist(bins = 10, figsize = (20,10), color = 'black')
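If a feature does contain zeros, a common workaround, shown here as an illustrative alternative rather than part of the original notebook, is np.log1p, which computes log(1 + x); int_rate is strictly positive, so plain np.log is fine above:
#log1p handles zeros gracefully: log1p(0) = 0, while log(0) is undefined
loans['int_rate_log1p'] = np.log1p(loans['int_rate'])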
Square Root Transformation
The square root method is typically used when your data is moderately skewed, and it likewise reshapes your data toward a bell shape.
loans['int_rate_sqr'] = np.sqrt(loans['int_rate'])
loans['int_rate_sqr'].hist(bins = 10, figsize = (20,10), color = 'black')
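To verify that the transforms are actually reducing skewness, you can compare the skewness coefficient before and after (a quick illustrative check):
#compare the skewness coefficient before and after the transforms
print(loans['int_rate'].skew())      #original
print(loans['int_rate_log'].skew())  #after the log transform
print(loans['int_rate_sqr'].skew())  #after the square root transform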
BoxCox
Another way to achieve this is the Box-Cox transform. A limitation of Box-Cox is that it assumes all values in the data sample are positive.
from sklearn.preprocessing import power_transform
scaled_bc = power_transform(loans[['int_rate']], method='box-cox')
scaled_bc = pd.DataFrame(scaled_bc, columns=['int_rate'])
scaled_bc.hist(bins = 10, figsize = (20,10), color = 'black')
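If your data does contain zero or negative values, sklearn's power_transform also supports the Yeo-Johnson method, which lifts the all-positive restriction of Box-Cox; a minimal sketch:
#Yeo-Johnson accepts zero and negative values, unlike Box-Cox
scaled_yj = power_transform(loans[['int_rate']], method='yeo-johnson')
scaled_yj = pd.DataFrame(scaled_yj, columns=['int_rate'])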
Conclusion
There is much more to cover on this topic, however, the basics we have reviewed in this article will help you to get started.
One thing to keep in mind is how to use transformed and scaled data in production. Once you apply these techniques to your training and test datasets and eventually deploy your ML model into production, you have to apply the exact same processing steps to the new data that comes in for inference. Typically you save these fitted processing objects and load them in production; for more detailed steps on how to achieve this, you can read this article here.
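As a minimal sketch of that idea, here is one common way to persist the fitted StandardScaler from above with joblib; the file name and the new_data frame are illustrative placeholders:
#persist the fitted scaler at training time
import joblib
joblib.dump(scaler, 'int_rate_scaler.joblib')

#in production, load it and apply transform (not fit_transform) to incoming data
loaded_scaler = joblib.load('int_rate_scaler.joblib')
new_scaled = loaded_scaler.transform(new_data[['int_rate']])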
Kaggle notebook is here.
References for further reading
Want to learn more? Here is the complete end-to-end data science project for beginners to learn data science. By completing this project: 1) you will experience the entire data science cycle yourself, 2) you will build a project you can use to prove your experience, and 3) you will be ready to answer the most popular interview questions should you decide to pursue a career as a data scientist.
What do you struggle with in your early journey? Please share it with me here, and I am happy to help! I listen to your stories carefully and want to produce content that helps you in this journey. For more content like this, sign up for my newsletter.