I have been studying machine learning for the past few weeks and have reached the lesson on outliers. According to data science professionals, one way to improve a model's accuracy is to identify the outliers in a dataset and remove them. Identifying outliers also comes in useful when looking for fraud. Because outliers can affect the accuracy of a model's predictions, I decided to write a blog post on the subject.

Python's premier machine learning library, sklearn, provides four classes that can be used to identify outliers: IsolationForest, EllipticEnvelope, LocalOutlierFactor, and OneClassSVM.

IsolationForest is a tree-based anomaly detection algorithm. It models normal data in such a way that anomalies, which are few in number and different in the feature space, are easy to isolate.

Sklearn's EllipticEnvelope implements the minimum covariance determinant estimator. If the input variables have a Gaussian distribution, then this statistical method can be used to detect outliers.

LocalOutlierFactor harnesses the idea of nearest neighbours for outlier detection: samples whose local density is substantially lower than that of their neighbours are flagged as outliers.

The support vector machine, or SVM, can be used for one-class classification. When modelling a single class, the algorithm captures the density of the majority class and classifies the extremes of the density function as outliers. This method of outlier detection can be used for both classification and regression datasets.

Although I have covered all four of the classes that sklearn provides for outlier detection, I am only going to use one of them in the program I have written on this subject.

I wrote the program in Google Colab, which is my Jupyter Notebook of choice because it is free and relatively easy to use. The only drawback to Google Colab that I can see is that it does not have an undo function, so care needs to be taken not to delete or overwrite valuable code, which may not be retrievable.

Once I had created the program, I imported the libraries that I would need to run it. Libraries are essential to writing a program in Python because they simplify the process of performing complex computing tasks. The libraries that I initially imported were pandas, numpy, sklearn, matplotlib and seaborn. Pandas is a library that creates and manipulates dataframes, numpy performs numerical computations on arrays, sklearn houses many of the functions needed to perform machine learning, and matplotlib and seaborn are used to plot graphs of the data being predicted on:-

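(A minimal sketch of those imports, using the standard aliases; sklearn's classes are imported individually where they are used below.)

```python
# Core data science stack, imported under the conventional aliases
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```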

Once the libraries were imported, I created a multivariate regression dataset using sklearn's make_regression function. The first thing I did was to create the dataframe that holds the independent variables, features:-

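(A sketch of this step; the sample size, noise level and random seed are illustrative choices, not necessarily the values I used.)

```python
from sklearn.datasets import make_regression

# Generate a synthetic multivariate regression problem with five features
X_raw, y_raw = make_regression(n_samples=1000, n_features=5,
                               noise=10, random_state=42)

# Dataframe holding the independent variables
features = pd.DataFrame(X_raw)
features.head()
```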

I then created the dataframe that houses the dependent variable, y:-

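(A sketch reusing y_raw from the previous step; the column name is an arbitrary choice.)

```python
# Dataframe holding the dependent variable
y = pd.DataFrame(y_raw, columns=['y'])
y.head()
```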

Once features and y were established, I combined them to form the dataframe, df, which houses both the independent and dependent variables:-

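(A sketch of the concatenation step.)

```python
# Combine the independent and dependent variables into a single dataframe
df = pd.concat([features, y], axis=1)
df.head()
```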

I plotted a graph of one column of features, column 4 on this occasion:-

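(A sketch of the plot; the histogram here is an assumption, as any plot type would do.)

```python
# Visualise the distribution of feature column 4
sns.histplot(features[4])
plt.show()
```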

I then used statsmodels to print a report on the vital statistics of this multivariate dataframe, including the errors, which are used to determine a regression's accuracy:-

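(A sketch using statsmodels' ordinary least squares summary.)

```python
import statsmodels.api as sm

# Fit an ordinary least squares model and print its summary report,
# which includes R-squared and the standard errors of the coefficients
ols = sm.OLS(y, sm.add_constant(features)).fit()
print(ols.summary())
```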

I defined the X and y variables, which are independent and dependent respectively. The X variable is the first five columns of df and the y variable is the last column:-

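(A sketch of the column selection.)

```python
# X is the first five columns of df; y is the last column
X = df.iloc[:, :5]
y = df.iloc[:, -1]
```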

Once the X and y variables were established, I split the dataset up for training and validation:-

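(A sketch of the split; the 80/20 ratio and the seed are assumptions.)

```python
from sklearn.model_selection import train_test_split

# Hold back a portion of the data for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```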

I then used sklearn's LocalOutlierFactor, with its contamination set to flag 1% of the training set as outliers, and printed out the rows that contain them:-

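(A sketch of the detection step; contamination=0.01 corresponds to the 1% mentioned above.)

```python
from sklearn.neighbors import LocalOutlierFactor

# Flag roughly 1% of the training rows as outliers
lof = LocalOutlierFactor(contamination=0.01)
labels = lof.fit_predict(X_train)

# fit_predict returns -1 for outliers and 1 for inliers
mask = labels != -1
print(X_train[~mask])  # the rows identified as outliers
```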

I then reset X_train and y_train to their new shapes once the outliers were removed, and then printed them out:-

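(A sketch of the filtering step, reusing the inlier mask from above.)

```python
# Keep only the inlier rows and confirm the new shapes
X_train, y_train = X_train[mask], y_train[mask]
print(X_train.shape, y_train.shape)
```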

When the outliers had been removed from the training set, I defined the model that I would make predictions with. In this instance I chose sklearn's ARDRegression, which is a type of linear regression. I achieved an accuracy of 96.66% when I trained and fitted the training data, which was very close to the accuracy I achieved without outlier elimination:-

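(A sketch of the fit; score here returns the R-squared value, which is the accuracy figure quoted above.)

```python
from sklearn.linear_model import ARDRegression

# Fit the model and score it against the training data
model = ARDRegression()
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # ~0.9666 in my run
```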

I then made predictions on the validation set and achieved an accuracy of 98.06%, which again was very close to the accuracy I achieved without outlier elimination:-

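(A sketch of the validation step.)

```python
# Score the fitted model on the held-back validation set
print(model.score(X_test, y_test))  # ~0.9806 in my run

# Predictions used in the evaluation steps below
y_pred = model.predict(X_test)
```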

I evaluated the error rate, which can be seen below:-

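(A sketch of the evaluation; MAE, MSE and RMSE are common choices of error metric for regression, so they are assumed here.)

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Common regression error metrics on the validation set
print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MSE: ', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
```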

I also compared the actual values to the predicted values, and it can be seen there is a slight variance between the two:-

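(A sketch of the comparison; the dataframe layout is an assumption.)

```python
# Put the actual and predicted values side by side to inspect the variance
comparison = pd.DataFrame({'actual': y_test.values,
                           'predicted': y_pred})
print(comparison.head(10))
```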

In this exercise the accuracy I achieved after removing the outliers was very similar to the accuracy I achieved with them left in the data. The difference could well have been more pronounced if I had introduced more noise into the dataset or used another dataset altogether.

One thing to keep in mind is that when entering data science competitions, such as those on Kaggle, outliers cannot be removed when submitting predictions, because a prediction has to be submitted for every row of the test set. It is because I spend so much of my time entering competitions that I don't have a lot of experience with outliers. This is one area of data science that, of course, I could improve upon.