Scikit-learn is a widely used machine learning library for Python. It is best known for its off-the-shelf machine learning algorithms, but it also provides many useful tools for data preprocessing and feature selection.

Data preprocessing is an important step in the machine learning pipeline. We cannot simply dump raw data into a model; we need to clean it and apply appropriate preprocessing techniques to build a robust and accurate machine learning model.

Feature selection simply means using the more valuable features, where value means information: we want the features that are more informative about the target variable. In a supervised learning task, we typically have many features (independent variables), and some of them are likely to carry little or no useful information about the target (dependent variable). On the other hand, some features are so critical that they explain most of the variance of the target. Feature selection is about finding those informative features. A closely related technique is dimensionality reduction, which reduces the number of features by deriving new features from the existing ones, whereas feature selection keeps a subset of the original features. Both are especially useful when we have high-dimensional data (i.e., lots of features).

In this post, we will cover three of the feature selection techniques offered by scikit-learn.

1. VarianceThreshold

VarianceThreshold removes features whose variance is below the specified threshold. Consider a feature that takes the same value for every observation (row) in the dataset: it adds no information to a model, only an unnecessary computational burden, so we should simply drop it. Similarly, features with a very small variance can often be omitted.

Let's create three features with different variance values.

import numpy as np
import pandas as pd

# col_a is constant (zero variance), col_b is almost constant (tiny variance),
# and col_c is random integers (much larger variance)
col_a = pd.Series(np.ones(50))
col_b = pd.Series(np.ones(50))
col_b[:5] = 0
col_c = pd.Series(np.random.randint(20, 30, size=50))
features = pd.concat([col_a, col_b, col_c], axis=1)

We can check the variances of the features with the DataFrame's var method:

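features.var()
# col_a (column 0): exactly 0, since all values are identical
# col_b (column 1): roughly 0.09, just below the 0.1 threshold we will use
# col_c (column 2): depends on the random draw, but well above 0.1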

We can create a VarianceThreshold selector and use it to keep only the features with a variance higher than 0.1.

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
selector.fit_transform(features)

It only selected col_c as expected.
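We can double-check this with the selector's get_support method, which returns a boolean mask indicating which columns were kept:

selector.get_support()
# expected: array([False, False,  True]), i.e. only the third column (col_c) is kept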

2. Recursive Feature Elimination

As the name suggests, recursive feature elimination (RFE) eliminates features recursively. The elimination is based on an estimator that assigns weights to the features; for instance, the weights can be the coefficients of a linear regression (coef_) or the feature importances of a decision tree (feature_importances_).

The process starts by training the estimator on the entire dataset. Then, the least important features are pruned. After that, the estimator is trained with the remaining features and the least important features are pruned again. This process is repeated until the desired number of features is reached.
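To make the idea concrete, here is a minimal sketch of that loop, assuming X is a pandas DataFrame of features and y is the target. Note that manual_rfe is just an illustrative helper, not scikit-learn's actual implementation; it uses the raw absolute coefficients of a linear regression as the importance weights.

import numpy as np
from sklearn.linear_model import LinearRegression

def manual_rfe(X, y, n_features_to_select):
    # Start with all columns and repeatedly drop the one with the
    # smallest absolute coefficient until the desired number remains.
    remaining = list(X.columns)
    while len(remaining) > n_features_to_select:
        estimator = LinearRegression().fit(X[remaining], y)
        weakest = remaining[int(np.argmin(np.abs(estimator.coef_)))]
        remaining.remove(weakest)
    return remaining

Scikit-learn's RFE class, used below, wraps this logic for any estimator that exposes coef_ or feature_importances_.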

Let's use a sample house price dataset, which is available on Kaggle. I will only use some of the features.

df = pd.read_csv("/content/train_houseprices.csv")
X = df[['LotArea', 'YearBuilt', 'GrLivArea', 'TotRmsAbvGrd',
        'OverallQual', 'OverallCond', 'TotalBsmtSF']]
y = df['SalePrice']

We have 7 features and a target variable. The following piece of code will use RFE to select the best 4 features.

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=4, step=1)
rfe.fit(X, y)

We used linear regression as the estimator. The desired number of features is set with the n_features_to_select parameter, and step controls how many features are removed at each iteration. RFE assigns a rank to each feature; the features with a rank of 1 are the selected ones.

rfe.ranking_
array([4, 1, 2, 1, 1, 1, 3])
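
To map these ranks back to column names, we can use the support_ attribute, which holds a boolean mask of the selected features:

X.columns[rfe.support_]
# YearBuilt, TotRmsAbvGrd, OverallQual and OverallCond, based on the ranking above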

3. SelectFromModel

Just like RFE, SelectFromModel is used with an estimator that has a coef_ or feature_importances_ attribute. Features whose weights are above a given threshold are considered important and selected.

Let's use the same subset of features as in the previous section, with ridge regression as the estimator. As the threshold for selecting features, we use the 'mean' keyword, which keeps the features whose weight (in absolute value) is above the mean of all the weights.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X, y)
# prefit=True tells SelectFromModel that the estimator is already fitted
model = SelectFromModel(ridge, prefit=True, threshold='mean')
X_transformed = model.transform(X)
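We can map the selected columns back to their names with the get_support method:

X.columns[model.get_support()]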

We have selected 2 features out of 7. The selected features are "OverallQual" and "OverallCond", which makes sense because these are key factors in determining the price of a house. They are also among the features selected by recursive feature elimination in the previous section.

In this case, we could determine the important features to some extent by intuition. However, real-life datasets are more complex and may contain many more features, and that is where feature selection techniques really come in handy.

Scikit-learn provides many more feature selection and data preprocessing tools; we have covered three of them. You can always visit the documentation for the full list.

Thank you for reading. Please let me know if you have any feedback.