Prerequisites:

  • LGBM, i.e. lightgbm (Python package): Microsoft's implementation of gradient boosted machines
  • optuna (Python package): an automated hyperparameter optimization framework favoured by Kaggle grandmasters. Being algorithm-agnostic, it can help find optimal hyperparameters for any model. Beyond plain grid search, it offers tools for pruning unpromising trials for faster results

So what's the catch?

Complete model optimization includes many different operations:

  • Choosing the optimal starting hyperparameters for your algorithm (conditional on the task type and data stats)
  • Defining the hyperparameters to optimize, their grid and distribution
  • Selecting the optimal loss function for optimization
  • Configuring the validation strategy
  • Further optimization (e.g. n_estimators tuning with early_stopping for tree ensembles like LGBM)
  • Results analysis
  • and much more…

Putting it all together, even for a single task, requires a lot of code, and any subsequent task will require substantial modifications to this code.

There is a reason this article includes two important keywords:

  • LGBM: one of the fastest gradient boosting frameworks
  • optuna: one of the fastest hyperparameter optimization frameworks

Using them together wisely will help you build a well-optimized model in half the time

But one has to be prepared to deal with all the implications outlined above.

Luckily another open-source package combines the advantages of both these frameworks and provides a one-line method to create your best model with lightgbm and optuna

pip install verstack

Not only does it find optimal hyperparameters for your task, it also provides convenient methods for prediction and analytics. It also uses multiprocessing carefully, running at almost full capacity while leaving some processing power for your machine to operate without freezing

We will use the Boston housing dataset from Kaggle for this demonstration

import pandas as pd
from verstack import LGBMTuner
# import the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X = train.drop('medv', axis = 1)
y = train['medv']
# tune the hyperparameters and fit the optimized model
tuner = LGBMTuner(metric = 'mse') # <- the only required argument
tuner.fit(X, y)
# check the optimization log in the console.
pred = tuner.predict(test)

Basically that is it…

  • optimal hyperparameters have been selected by running the default 100 trials with early stopping, so the best number of estimators has also been determined
  • optimization history, parameter importances and feature importances have been saved in the plotting pipeline
  • the optimized model has been trained on the whole train data
  • prediction methods have been prepared to predict on new data (including various heuristics for predicting negatives in regression and predicting classes/probabilities in multiclass/binary)
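
The early-stopping idea mentioned in the list above can be illustrated in plain Python. This is only a minimal sketch with made-up validation losses, not LGBMTuner's internal code; the function name best_iteration_with_early_stopping and the patience value are hypothetical.

```python
def best_iteration_with_early_stopping(val_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience`
    consecutive iterations; iterations are counted from 1.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_losses)


# Made-up validation losses: improvement stalls after the 4th iteration.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59]
best_iter, stopped_at = best_iteration_with_early_stopping(losses, patience=3)
print(best_iter, stopped_at)
```

The best iteration found this way plays the role of the final number of estimators: the model is large enough to reach the lowest validation loss, but no larger.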

What else?

Wait, there is much more…

categorical_feature support

LGBM has a neat feature that gears the model to figure out the encoding of categorical features inside your data so you don't have to encode them yourself. LGBMTuner supports this integration.

According to the LGBM docs, you have to transform your unique categories into consecutive integers and then cast them to the "category" dtype, like so:

df['Sex'].unique()
encoding_dict = {val:ix for ix, val in enumerate(df['Sex'].unique())}
df['Sex'] = df['Sex'].map(encoding_dict)
df['Sex'] = df['Sex'].astype('category')
print(df['Sex'].dtype)
#--->CategoricalDtype(categories=[0, 1], ordered=False)

And then just pass this data to LGBMTuner without any additional settings

from verstack import LGBMTuner
tuner = LGBMTuner(metric = 'accuracy')
X = df.drop('target', axis = 1)
y = df['target']
tuner.fit(X, y)

Custom grids to iterate over

LGBMTuner is configured for best performance by default.

Depending on the given task (classification/regression) and the dataset length, it automatically sets the fixed starting parameters for the LGBM model.

The default grid for parameter selection consists of the entries marked as default settings in the tuner.grid output below.

These settings can be overridden, and new parameters with their respective grids can be passed to the LGBMTuner instance, like so:

tuner = LGBMTuner(metric = 'auc', trials = 300)
# SHOW SUPPORTED AND SELECTED OPTIMIZATION PARAMETERS
tuner.grid
#--->{'boosting_type': None,
#--->'num_iterations': None,
#--->'learning_rate': None,
#--->'num_leaves': {'low': 16, 'high': 255},                  <--- default setting
#--->'max_depth': None,
#--->'min_data_in_leaf': None,
#--->'min_sum_hessian_in_leaf': {'low': 0.001, 'high': 10.0}, <--- default setting
#--->'bagging_fraction': {'low': 0.5, 'high': 1.0},           <--- default setting
#--->'feature_fraction': {'low': 0.5, 'high': 1.0},           <--- default setting
#--->'max_delta_step': None,
#--->'lambda_l1': {'low': 1e-08, 'high': 10.0},               <--- default setting
#--->'lambda_l2': {'low': 1e-08, 'high': 10.0},               <--- default setting
#--->'linear_lambda': None,
#--->'min_gain_to_split': None,
#--->'drop_rate': None,
#--->'top_rate': None,
#--->'min_data_per_group': None,
#--->'max_cat_threshold': None}

# CHANGE SELECTED OPTIMIZATION PARAMETERS
# parameters can be passed in any of the following ways:
# - list (will be used for a random search)
# - tuple (will be used to define the uniform grid range between the min(tuple) and the max(tuple))
# - dict with keywords 'choice'/'low'/'high'
tuner.grid['boosting_type'] = ['gbdt', 'rf'] 
tuner.grid['min_data_in_leaf'] = {'choice' : [40, 50, 70]}
tuner.grid['learning_rate'] = (0.001, 0.1)
tuner.grid['lambda_l1'] = {'low': 0.1, 'high': 5}
tuner.fit(X, y)

Users can configure custom grids for any or all of the parameters in the dict above, which is accessible via the .grid attribute once the class instance is defined.
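
To make the three grid formats more concrete, here is a minimal sketch of how a tuner could draw one value from each kind of spec. It is an illustration only: sample_from_spec is a hypothetical helper, not part of verstack, and a real tuner may sample integers or use log-scaled ranges where this sketch uses a plain uniform draw.

```python
import random


def sample_from_spec(spec, rng):
    """Draw one value from a search-space spec in one of three forms."""
    if isinstance(spec, list):
        # list: pick one of the listed values at random
        return rng.choice(spec)
    if isinstance(spec, tuple):
        # tuple: uniform draw between min(tuple) and max(tuple)
        return rng.uniform(min(spec), max(spec))
    if isinstance(spec, dict):
        # dict: either a set of choices or a low/high range
        if "choice" in spec:
            return rng.choice(spec["choice"])
        return rng.uniform(spec["low"], spec["high"])
    raise TypeError(f"unsupported spec: {spec!r}")


rng = random.Random(0)
grid = {
    "boosting_type": ["gbdt", "rf"],
    "min_data_in_leaf": {"choice": [40, 50, 70]},
    "learning_rate": (0.001, 0.1),
    "lambda_l1": {"low": 0.1, "high": 5},
}
# One trial's worth of parameters: a single draw per grid entry.
trial_params = {name: sample_from_spec(spec, rng) for name, spec in grid.items()}
print(trial_params)
```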

Custom LGBM (fixed) params

Following many requests, LGBMTuner release 1.1.0 supports setting any parameter that LGBM supports.

If, for example, you need to configure LGBM for optimization with the is_unbalance argument or any other supported argument, use the custom_lgbm_params argument at LGBMTuner init.

from verstack import LGBMTuner

my_custom_params = {'is_unbalance': True, 'zero_as_missing': True}
tuner = LGBMTuner(metric = 'auc', custom_lgbm_params = my_custom_params)

Metrics

LGBMTuner currently supports the following evaluation metrics:

'mae', 'mse', 'rmse', 'rmsle', 'mape', 'smape', 'rmspe', 'r2', 'auc', 'gini', 'log_loss', 'accuracy', 'balanced_accuracy', 'precision', 'precision_weighted',  'precision_macro', 'recall', 'recall_weighted', 'recall_macro', 'f1', 'f1_weighted', 'f1_macro', 'lift'
# note the syntax

Evaluation metrics become optimization metrics in the case of regression, given the minimize-only strategy. The only exception for regression is 'r2': if this metric is selected when initializing LGBMTuner, 'mse' is optimized in its place during hyperparameter tuning.
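
A likely rationale for the 'r2' substitution (an assumption, not something the package states here) is that for a fixed target R² equals 1 minus the ratio of the squared-error sum to a constant, so the model with the lowest MSE is also the one with the highest R². The sketch below checks this on made-up predictions; the model names and numbers are purely illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
candidates = {  # made-up predictions from three hypothetical models
    "model_a": [2.0, 5.0, 8.0, 9.0],
    "model_b": [3.0, 5.5, 7.0, 8.5],
    "model_c": [4.0, 4.0, 6.0, 10.0],
}
# Rank by lowest MSE and by highest R2: the two orderings coincide.
by_mse = sorted(candidates, key=lambda m: mse(y_true, candidates[m]))
by_r2 = sorted(candidates, key=lambda m: r2(y_true, candidates[m]), reverse=True)
print(by_mse)
print(by_r2)
```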

For classification, regardless of the selected evaluation metric, LGBMTuner optimizes cross_entropy when searching for hyperparameters.

Number of trials

A single trial is a single iteration of training/validation of a model with randomly selected parameters from the search space. By default LGBMTuner will run 100 trials. Number of trials can be defined at tuner initialization: tuner = LGBMTuner(metric = 'mse', trials = 500)
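
The effect of the number of trials can be sketched with a toy search loop. This is a simplification under stated assumptions: the objective is a made-up function standing in for a validation loss, and plain random sampling is used, whereas a real tuner built on optuna may use a smarter sampler.

```python
import random


def toy_objective(x):
    """Stand-in for a validation loss; minimized at x = 2."""
    return (x - 2.0) ** 2


def run_trials(n_trials, seed=0):
    """Random search: return the best (lowest) loss seen after each trial."""
    rng = random.Random(seed)
    best = float("inf")
    history = []
    for _ in range(n_trials):
        x = rng.uniform(-10.0, 10.0)  # one randomly selected parameter value
        best = min(best, toy_objective(x))
        history.append(best)
    return history


history = run_trials(100)
# Best loss after 10 trials versus after 100 trials.
print(history[9], history[99])
```

The best value can only stay the same or improve as trials accumulate, which is why a larger trials setting trades run time for a potentially better result.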

Prediction

Calling tuner.fit(X, y) will eventually fit the model with best params on the X and y

Then the conventional methods: tuner.predict(test) and tuner.predict_proba(test) are available

For classification tasks an additional parameter, threshold, is available: tuner.predict(test, threshold = 0.3)
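
What a threshold does can be shown with a few lines of plain Python. This is a sketch of the general idea for binary classification, not LGBMTuner's code; classes_from_probabilities is a hypothetical helper, and whether the library treats a probability exactly equal to the threshold as positive is an assumption here.

```python
def classes_from_probabilities(probabilities, threshold=0.5):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.10, 0.35, 0.50, 0.80]  # made-up predicted probabilities
default_classes = classes_from_probabilities(probs)
lowered_classes = classes_from_probabilities(probs, threshold=0.3)
print(default_classes)
print(lowered_classes)
```

Lowering the threshold turns more samples into positives, which usually raises recall at the cost of precision.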

Tip: One may use the verstack.ThreshTuner for optimizing the threshold parameter

Visualizations

LGBMTuner ships with several built-in plotting methods that produce static png and interactive html plots of feature importances and optimization stats

When LGBMTuner is initialized with default parameters, namely visualization = True, it will create 4 static plots after optimization is complete. If you are using an interactive shell like Spyder or Jupyter, these plots will be displayed automatically at the end of tuning. This can be disabled at init with tuner = LGBMTuner(metric = 'mse', visualization = False)

These plots are also available on demand by their corresponding methods

Feature Importance

tuner.fit(X, y)
tuner.plot_importances()

figsize = (10, 6) and n_features = 15 are the default arguments but can be changed if required
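
Presumably n_features limits the plot to the most important features. A minimal sketch of such a selection follows, using the importance values listed in the "Feature importance values" section of this article, copied by hand into a plain dict (the actual tuner.feature_importances object is likely a pandas structure, which is an assumption), with three features kept for brevity.

```python
feature_importances = {
    "ID": 0.08145, "crim": 0.07421, "zn": 0.00424, "indus": 0.02870,
    "chas": 0.00547, "nox": 0.06929, "rm": 0.13872, "age": 0.11890,
    "dis": 0.13448, "rad": 0.02966, "tax": 0.04619, "ptratio": 0.03977,
    "black": 0.06027, "lstat": 0.16865,
}

# Sort features from most to least important and keep the top three.
ranked = sorted(feature_importances.items(), key=lambda kv: kv[1], reverse=True)
top3 = [name for name, _ in ranked[:3]]
# The listed importances are relative shares, so they add up to 1.
total = sum(feature_importances.values())
print(top3)
print(round(total, 5))
```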

An interactive plot is available as an html file, which is displayed automatically in the default browser:

tuner.plot_importances(interactive = True)

This html can be saved from the browser's file menu

Trials validation results plot

tuner.plot_intermediate_values()

The interactive argument is most useful in this case

tuner.plot_intermediate_values(interactive = True)

Here among all the trials you can see the pruned (terminated) trials and their evaluation results
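
To give an intuition for why a trial gets pruned, here is a sketch of a median rule, similar in spirit to the median pruner that optuna provides. Which pruning rule LGBMTuner actually applies is not stated here, so treat this as an illustration; should_prune and the loss values are hypothetical.

```python
from statistics import median


def should_prune(current_value, step, finished_trials):
    """Median rule for a loss being minimized.

    Prune when the running trial is worse at `step` than the median of
    the values that earlier trials reported at the same step.
    """
    previous = [t[step] for t in finished_trials if len(t) > step]
    if not previous:
        return False  # nothing to compare against yet
    return current_value > median(previous)


# Made-up validation losses reported by three finished trials at steps 0..2.
finished = [
    [0.90, 0.70, 0.60],
    [0.80, 0.65, 0.55],
    [0.95, 0.85, 0.80],
]
prune_bad = should_prune(0.88, 1, finished)   # worse than the median 0.70
prune_good = should_prune(0.66, 1, finished)  # better than the median 0.70
print(prune_bad, prune_good)
```

Stopping unpromising trials early frees the remaining budget for configurations that still have a chance of beating the best result so far.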

Parameters importances

This is a parameters importance histogram plot that shows which params had the highest impact on the optimization metric

tuner.plot_param_importances()
tuner.plot_param_importances(interactive = True)

Optimization history plot

tuner.plot_optimization_history()
tuner.plot_optimization_history(interactive = True)

In interactive mode you can see how the objective function (optimization metric) values change

Verbosity

This is an important part of the framework. The default verbosity level 1 will display essential optimization results in a nice structured way without cluttering your console all that much

By default the fit method will output the optimal amount of information, including every i-th trial results (omitting the trials that had been pruned), and the final (optimized) model parameters.

[Screenshots of the console optimization log: notice the improvement of the loss function between trial 0 and trial 55]

The verbosity options are 0, 1, 2, 3, 4 and 5, where 0 is completely silent except for fatal errors and built-in exceptions, and 1–5 are based on the optuna.logging options. The default verbosity level 1 is enriched with essential optimization statistics (screenshots above)

Additional LGBMTuner attributes

Feature importance values

tuner.feature_importances
>>> ID         0.08145
>>> crim       0.07421
>>> zn         0.00424
>>> indus      0.02870
>>> chas       0.00547
>>> nox        0.06929
>>> rm         0.13872
>>> age        0.11890
>>> dis        0.13448
>>> rad        0.02966
>>> tax        0.04619
>>> ptratio    0.03977
>>> black      0.06027
>>> lstat      0.16865

Initially defined params

tuner.init_params
>>> {'learning_rate': 0.01,
>>>  'num_leaves': 16,
>>>  'colsample_bytree': 0.9,
>>>  'subsample': 0.9,
>>>  'verbosity': -1,
>>>  'n_estimators': 10000,
>>>  'early_stopping_rounds': 100,
>>>  'random_state': 42,
>>>  'objective': 'regression',
>>>  'metric': 'l2',
>>>  'num_threads': 10,
>>>  'reg_alpha': 1}

Optimized params

tuner.best_params
>>> {'learning_rate': 0.01,
>>>  'num_leaves': 130,
>>>  'colsample_bytree': 0.8246563384855297,
>>>  'subsample': 0.5335500916057069,
>>>  'verbosity': -1,
>>>  'random_state': 42,
>>>  'objective': 'regression',
>>>  'metric': 'l2',
>>>  'num_threads': 10,
>>>  'reg_alpha': 0.0011166918277076062,
>>>  'min_sum_hessian_in_leaf': 0.00270990587924765,
>>>  'reg_lambda': 8.270186047772752e-06,
>>>  'n_estimators': 605}

Trained model instance

After calling tuner.fit(X, y), the LGBMTuner instance contains the tuned and fitted LGBM model, and the tuner itself provides all the necessary prediction methods, e.g. tuner.predict(test). The underlying LGBM booster model can nevertheless be extracted from the tuner object:

tuner.fitted_model
>>> <lightgbm.basic.Booster at 0x7ff3b89a5b10>

Additional methods and attributes are well described in the documentation.
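
Returning to the init_params and best_params attributes shown earlier, a quick way to see what the tuning changed is to compare the two dictionaries. The sketch below copies the values printed in this article into plain dicts; the comparison logic is ordinary Python, not a verstack feature.

```python
init_params = {
    "learning_rate": 0.01, "num_leaves": 16, "colsample_bytree": 0.9,
    "subsample": 0.9, "verbosity": -1, "n_estimators": 10000,
    "early_stopping_rounds": 100, "random_state": 42,
    "objective": "regression", "metric": "l2", "num_threads": 10,
    "reg_alpha": 1,
}
best_params = {
    "learning_rate": 0.01, "num_leaves": 130,
    "colsample_bytree": 0.8246563384855297, "subsample": 0.5335500916057069,
    "verbosity": -1, "random_state": 42, "objective": "regression",
    "metric": "l2", "num_threads": 10, "reg_alpha": 0.0011166918277076062,
    "min_sum_hessian_in_leaf": 0.00270990587924765,
    "reg_lambda": 8.270186047772752e-06, "n_estimators": 605,
}

# Parameters present in both dicts whose values differ after tuning.
changed = {
    key: (init_params[key], best_params[key])
    for key in init_params
    if key in best_params and init_params[key] != best_params[key]
}
# Parameters that only appear after tuning, and those that were dropped.
added = sorted(set(best_params) - set(init_params))
removed = sorted(set(init_params) - set(best_params))
print(sorted(changed))
print(added)
print(removed)
```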

The proposed framework encapsulates extensive research and best Data Science practices to reduce stress and to improve results on the classification/regression tasks it is used for

And be sure to check out the rest of the tools verstack has to offer

The package includes solutions to some day-to-day tasks that didn't have convenient solutions before

Current modules:

  • verstack.LGBMTuner
  • verstack.PandasOptimizer: automatic memory optimization when reading data into pandas. One-liner for 5-fold memory footprint reduction & significant training time decrease
  • verstack.Stacker: automated ensembling factory; create multilayer stacking ensembles with a few lines of code
  • verstack.FeatureSelector: automated feature selection tool based on quick recursive feature elimination by various ML models
  • verstack.DateParser: ultimate DateParser class that automatically finds and parses datetime features from all the possible datetime formats in your dataframe
  • verstack.Multicore: parallelise any function with a single line of code (by far the most popular tool)
  • verstack.NaNImputer: impute all the NaN values by machine learning with a single line of code
  • verstack.ThreshTuner: automatic threshold selection for getting the most out of the binary classification predicted probabilities
  • stratified_continuous_split: continuous data stratification
  • categorical encoders: Factorizer, OneHotEncoder, FrequencyEncoder, WeightOfEvidenceEncoder, MeanTargetEncoder
  • timer: convenient timer to measure any function execution

Links

verstack.LGBMTuner documentation

verstack documentation

Git

Pypi

author