One of the most performant machine learning algorithms

XGBoost is a supervised learning algorithm that can be used for both regression & classification. Like all algorithms, it has its virtues & drawbacks, which we'll walk through.

For this post, we'll focus on XGBoost in the context of classification problems. For the regression portion, be sure to keep an eye on my blog at datasciencelessons.com.

Supervised learning catch-up

I won't dive in too deep here, but for those who need a quick refresher: supervised learning is when you have a specific thing in mind you'd like to predict. For instance, say you want to predict future home prices; you know what you want to predict, so the next step is labeling historic data as a means to predict the future. To dive deeper into this example: let's say you wanted to sell your home but weren't sure what price to ask. You could accumulate data points about comparable homes, as well as their sale prices during that same time period. From there, you would train a model and then pass it the data points about your own home to generate a prediction of its value. For a classification example, where you're predicting a class: let's say you're Gmail and want to predict spam. This requires a model trained on many emails that were labeled as spam, as well as a corresponding number that were not.

What exactly is it?

XGBoost uses what is called an ensemble method. Without going into too much detail on ensemble methods, the key idea is that XGBoost leverages the outputs of many models to generate its prediction. It combines many 'weak learners' to produce a 'strong learner' (we'll sketch the general idea in code right after the process list below).

The XGBoost process looks something like this:

  • It iteratively trains many weak models
  • Weights each prediction according to performance
  • Combines the many weighted predictions to come up with a final output.
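
To make that concrete, here is a minimal, hand-rolled sketch of the general boosting idea rather than XGBoost's actual implementation: each new weak learner (a shallow decision tree here) is fit to the errors the ensemble still makes, and the scaled predictions are summed into a final output. The toy data and the learning-rate value are made up purely for illustration.

# Minimal sketch of the boosting idea (illustrative only, not XGBoost itself).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))          # toy feature (made up)
y = (X[:, 0] > 0).astype(float)                # toy binary target

learning_rate = 0.3                            # assumed value for illustration
prediction = np.full_like(y, y.mean())         # start from a constant guess
trees = []

for _ in range(10):                            # iteratively train weak learners
    residual = y - prediction                  # what the ensemble still misses
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)   # weighted combination

print("training accuracy:", ((prediction > 0.5) == y).mean())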

What makes XGBoost so popular?

  • Accuracy
  • Speed
  • It makes good use of modern hardware & can be parallelized
  • Consistent outperformance of other algorithms

Your First XGBoost Model

Let's break down the steps!

  • You'll start by importing the XGBoost package in Python
  • Break out your dependent & independent variables as y & X respectively
  • Split your data into train & test sets
  • Instantiate your classifier
  • Train your classifier
  • Predict y for your test set
  • Assess accuracy!

For this example, we'll be classifying survival on the titanic.

import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split

# titanic is assumed to be a DataFrame with survival as its last column
X, y = titanic.iloc[:, :-1], titanic.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=44)

xgb_model = xgb.XGBClassifier(objective='binary:logistic', n_estimators=7, seed=44)
xgb_model.fit(X_train, y_train)

pred = xgb_model.predict(X_test)
accuracy = float(np.sum(pred == y_test)) / y_test.shape[0]
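
If you'd rather not compute accuracy by hand with NumPy, scikit-learn's accuracy_score gives the same number:

from sklearn.metrics import accuracy_score

# Same accuracy as the manual NumPy calculation above
accuracy = accuracy_score(y_test, pred)
print(accuracy)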

Well done! That was a great first pass! Let's learn a bit more about how to assess the quality of our model.

Assessing Performance

We've already seen accuracy in the code snippet above, but there are also the metrics you can derive from a confusion matrix: precision & recall. I won't go into those two here, but if you'd like more info, jump over to this post on the random forest algorithm: https://datasciencelessons.com/2019/08/13/random-forest-for-classification-in-r/
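
If you want a quick look at them anyway, scikit-learn will compute all three from the predictions we generated earlier (this assumes pred and y_test from the first snippet are still in scope):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, pred))

# Precision: of the passengers we predicted survived, how many actually did?
print(precision_score(y_test, pred))

# Recall: of the passengers who actually survived, how many did we catch?
print(recall_score(y_test, pred))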

Apart from those, I'd like to talk about AUC.

To put it simply: if you choose one positive datapoint & one negative datapoint at random, AUC is the probability that the model ranks the positive datapoint higher than the negative one.
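
In scikit-learn terms, that ranking probability is what roc_auc_score computes when you hand it predicted probabilities rather than hard class labels (again assuming the fitted xgb_model from the earlier snippet):

from sklearn.metrics import roc_auc_score

# Use the probability of the positive class, not the 0/1 predictions,
# since AUC measures how well positives are ranked above negatives
probs = xgb_model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))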

XGBoost allows you to run cross validation & to specify the metrics you care about in the core algorithm call itself. This is partly done by creating a data structure called a DMatrix, which bundles your X & y values.

The core difference this time is that we'll create a DMatrix, specify our model's parameters, then run cross validation with AUC as the metric.

# Bundle features & labels into XGBoost's optimized DMatrix structure
titanic_dm = xgb.DMatrix(data=X, label=y)
params = {"objective": "reg:logistic", "max_depth": 3}

# 3-fold cross validation, 5 boosting rounds, scored with AUC
output = xgb.cv(dtrain=titanic_dm, params=params, nfold=3, num_boost_round=5, metrics="auc", as_pandas=True, seed=123)

# Mean test-fold AUC after the final boosting round
print((output["test-auc-mean"]).iloc[-1])
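
Because we passed as_pandas=True, output is a DataFrame with one row per boosting round; printing the whole thing shows the mean and standard deviation of AUC on both the train and test folds, which is a quick way to spot overfitting:

# Columns: train-auc-mean, train-auc-std, test-auc-mean, test-auc-std
print(output)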

How often should you use it?

XGBoost is not to be used every time you need to predict something. It can, however, be very useful in the right scenarios:

  • You have a lot of training data
  • You don't only have categorical data; rather, you have a good mix of numeric & categorical variables, or just numeric variables

You'll definitely want to keep XGBoost away from computer vision & NLP related tasks, or from situations where you have very limited data.

As always, I hope this proves useful in your data science endeavors! Be sure to check out my other posts at datasciencelessons.com!

Happy Data Science-ing!