Prerequisites:
- LGBM == lightgbm (python package): Microsoft's implementation of gradient boosted machines
- optuna (python package): an automated hyperparameter optimization framework favoured by Kaggle grandmasters. Being algorithm-agnostic, it can help find optimal hyperparameters for any model. Beyond exhaustive grid search, it features tools for pruning unpromising trials for faster results
So what's the catch?
Complete model optimization includes many different operations:
- Choosing the optimal starting hyperparameters for your algorithm (conditional on the task type and data stats)
- Defining the hyperparameters to optimize, their grid and distribution
- Selecting the optimal loss function for optimization
- Configuring the validation strategy
- Further optimization (e.g. n_estimators tuning with early_stopping for tree ensembles like LGBM)
- Results analysis
- and much more…
Putting it all together, even for a single task, requires a lot of code, and any subsequent task will require substantial modifications to that code
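To make that concrete, here is a rough sketch of what the manual route looks like with plain lightgbm + optuna (no verstack). It is only an illustration: the synthetic dataset, the chosen search space and the 5-fold CV setup are all assumptions.
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

# synthetic data stands in for a real dataset
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

def objective(trial):
    # define the hyperparameters to optimize, their grids and distributions
    params = {
        'objective': 'regression',
        'metric': 'l2',
        'verbosity': -1,
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 16, 255),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 1.0),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
    }
    # configure the validation strategy
    scores = []
    for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        train_set = lgb.Dataset(X[train_idx], y[train_idx])
        valid_set = lgb.Dataset(X[valid_idx], y[valid_idx])
        # n_estimators tuning via early stopping
        booster = lgb.train(params, train_set, num_boost_round=10000,
                            valid_sets=[valid_set],
                            callbacks=[lgb.early_stopping(100, verbose=False)])
        scores.append(booster.best_score['valid_0']['l2'])
    return np.mean(scores)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)
print(study.best_params)
And this still leaves out results analysis, refitting on the full data and prediction handling.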
There is a reason this article includes two important keywords:
- LGBM - one of the fastest gradient boosting frameworks
- optuna - one of the fastest hyperparameter optimization frameworks
Using them together wisely will help you build your best model in half the time
But one has to be prepared to deal with all the implications outlined above.
Luckily, another open-source package combines the advantages of both of these frameworks and provides a one-line method to create your best model with lightgbm and optuna:
pip install verstack
Not only does it find optimal hyperparameters for your task, it also provides convenient methods for prediction and analytics. And it uses multiprocessing carefully, at close to full capacity, while leaving behind enough processing power for your machine to operate without freezing
We will use the Boston housing dataset from Kaggle for this demonstration
import pandas as pd
from verstack import LGBMTuner
# import the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X = train.drop('medv', axis = 1)
y = train['medv']
# tune the hyperparameters and fit the optimized model
tuner = LGBMTuner(metric = 'mse') # <- the only required argument
tuner.fit(X, y)
# check the optimization log in the console.
pred = tuner.predict(test)
Basically that is it…
- optimal hyperparameters have been selected by running the default 100 trials; early stopping was used, so the best number of estimators has also been defined
- the optimization history, parameters and feature importances have been saved for the plotting pipeline
- the optimized model has been trained on the whole training data
- prediction methods have been prepared for new data (including various heuristics for handling negative predictions in regression and for predicting classes/probabilities in binary/multiclass tasks)
What else?
Wait, there is much more…
categorical_feature support
LGBM has a neat feature that lets the model figure out the encoding of categorical features in your data, so you don't have to encode them yourself. LGBMTuner supports this integration.
According to the LGBM docs, you have to transform your unique categories into consecutive integers and then cast the column to the 'category' dtype, like so:
df['Sex'].unique()  # inspect the unique categories
encoding_dict = {val:ix for ix, val in enumerate(df['Sex'].unique())}  # map categories to consecutive integers
df['Sex'] = df['Sex'].map(encoding_dict)
df['Sex'] = df['Sex'].astype('category')  # cast to the 'category' dtype
print(df['Sex'].dtype)
#--->CategoricalDtype(categories=[0, 1], ordered=False)
And then just pass this data to LGBMTuner without any additional settings
from verstack import LGBMTuner
tuner = LGBMTuner(metric = 'accuracy')
X = df.drop('target', axis = 1)
y = df['target']
tuner.fit(X, y)
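If there are several categorical columns, the same integer-encode-and-cast routine can be wrapped in a small helper before passing the frame to LGBMTuner. This is just a convenience sketch; the toy dataframe and column names are made up for illustration.
import pandas as pd

def encode_categoricals(df):
    # integer-encode every object-dtype column and cast it to 'category'
    df = df.copy()
    for col in df.select_dtypes(include='object').columns:
        mapping = {val: ix for ix, val in enumerate(df[col].unique())}
        df[col] = df[col].map(mapping).astype('category')
    return df

# toy frame for illustration
df = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                   'Embarked': ['S', 'C', 'S', 'Q'],
                   'target': [0, 1, 0, 1]})
df = encode_categoricals(df)
print(df.dtypes)  # Sex and Embarked are now 'category'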
Custom grids to iterate over
LGBMTuner is configured for best performance by default.
Depending on the given task (classification/regression) and the dataset length, it will automatically set fixed starting parameters for the LGBM model.
The default grid for parameter selection is shown in the tuner.grid printout below.
These settings can be overridden, and new parameters with their respective grids can be passed to the LGBMTuner instance like so:
tuner = LGBMTuner(metric = 'auc', trials = 300)
# SHOW SUPPORTED AND SELECTED OPTIMIZATION PARAMETERS
tuner.grid
#--->{'boosting_type': None,
#--->'num_iterations': None,
#--->'learning_rate': None,
#--->'num_leaves': {'low': 16, 'high': 255}, <--- default setting
#--->'max_depth': None,
#--->'min_data_in_leaf': None,
#--->'min_sum_hessian_in_leaf': {'low': 0.001, 'high': 10.0}, <--- default setting
#--->'bagging_fraction': {'low': 0.5, 'high': 1.0}, <--- default setting
#--->'feature_fraction': {'low': 0.5, 'high': 1.0}, <--- default setting
#--->'max_delta_step': None,
#--->'lambda_l1': {'low': 1e-08, 'high': 10.0}, <--- default setting
#--->'lambda_l2': {'low': 1e-08, 'high': 10.0}, <--- default setting
#--->'linear_lambda': None,
#--->'min_gain_to_split': None,
#--->'drop_rate': None,
#--->'top_rate': None,
#--->'min_data_per_group': None,
#--->'max_cat_threshold': None}
# CHANGE SELECTED OPTIMIZATION PARAMETERS
# parameters can be passed in any of the following ways:
# - list (will be used for a random search)
# - tuple (will be used to define the uniform grid range between the min(tuple) and the max(tuple))
# - dict with keywords 'choice'/'low'/'high'
tuner.grid['boosting_type'] = ['gbdt', 'rf']
tuner.grid['min_data_in_leaf'] = {'choice' : [40, 50, 70]}
tuner.grid['learning_rate'] = (0.001, 0.1)
tuner.grid['lambda_l1'] = {'low': 0.1, 'high': 5}
tuner.fit(X, y)
Users can configure custom grids for any/all of the parameters in the above dict, which can be accessed via the .grid attribute after creating the class instance.
Custom LGBM (fixed) params
Based on many requests, the new LGBMTuner release (1.1.0) supports setting any LGBM-supported parameters. If, for example, you need to configure LGBM for optimization with the is_unbalance argument or any other supported argument, use the custom_lgbm_params argument at LGBMTuner init.
from verstack import LGBMTuner
my_custom_params = {'is_unbalance': True, 'zero_as_missing': True}
tuner = LGBMTuner(metric = 'auc', custom_lgbm_params = my_custom_params)
Metrics
LGBMTuner currently supports the following evaluation metrics (note the exact string syntax):
'mae', 'mse', 'rmse', 'rmsle', 'mape', 'smape', 'rmspe', 'r2', 'auc', 'gini', 'log_loss', 'accuracy', 'balanced_accuracy', 'precision', 'precision_weighted', 'precision_macro', 'recall', 'recall_weighted', 'recall_macro', 'f1', 'f1_weighted', 'f1_macro', 'lift'
Evaluation metrics become optimization metrics in the case of regression, given the minimize-only strategy. The only exception for regression is 'r2': if this metric is selected when initializing LGBMTuner, 'mse' will be optimized instead during hyperparameter tuning.
For classification, regardless of the selected evaluation metric, LGBMTuner will optimize cross_entropy when searching for hyperparameters.
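A couple of initialization examples to restate the rules above (the metric names come from the supported list):
from verstack import LGBMTuner

# regression: evaluated with r2, but 'mse' is what gets minimized during the search
reg_tuner = LGBMTuner(metric = 'r2')

# classification: evaluated with f1_weighted, optimized with cross_entropy under the hood
clf_tuner = LGBMTuner(metric = 'f1_weighted')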
Number of trials
A single trial is a single iteration of training/validating a model with parameters sampled from the search space. By default LGBMTuner will run 100 trials. The number of trials can be defined at tuner initialization: tuner = LGBMTuner(metric = 'mse', trials = 500)
Prediction
Calling tuner.fit(X, y) will eventually fit the model with the best params on X and y.
Then the conventional methods tuner.predict(test) and tuner.predict_proba(test) are available.
For classification tasks an additional threshold parameter is available: tuner.predict(test, threshold = 0.3)
Tip: you may use verstack.ThreshTuner for optimizing the threshold parameter
Visualizations
LGBMTuner ships with built-in plotting methods for static (png) and interactive (html) plots of feature importances and optimization stats.
When LGBMTuner is initialized with default parameters, namely visualization = True, it will create 4 static plots after optimization is complete. If you are using an interactive shell like Spyder or Jupyter, these plots will be displayed automatically at the end of tuning. This can be disabled at init with tuner = LGBMTuner(metric = 'mse', visualization = False)
These plots are also available on demand via their corresponding methods
Feature Importance
tuner.fit(X, y)
tuner.plot_importances()
figsize = (10, 6) and n_features = 15 are the default arguments but can be changed if required.
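For example, overriding those defaults (the values here are arbitrary):
tuner.plot_importances(n_features = 10, figsize = (12, 8))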
An interactive plot is available as an html file, which is displayed automatically in the default browser:
tuner.plot_importances(interactive = True)
This html can be saved from the browser's file menu
Trials validation results plot
tuner.plot_intermediate_values()
The interactive argument is most useful in this case:
tuner.plot_intermediate_values(interactive = True)
Here among all the trials you can see the pruned (terminated) trials and their evaluation results
Parameters importances
This is a parameter importance histogram that shows which params had the highest impact on the optimization metric
tuner.plot_param_importances()
tuner.plot_param_importances(interactive = True)
Optimization history plot
tuner.plot_optimization_history()
tuner.plot_optimization_history(interactive = True)
In interactive mode you can see how the objective function (optimization metric) values changed over the trials
Verbosity
This is an important part of the framework. The default verbosity level 1 will display essential optimization results in a nicely structured way without cluttering your console too much.
By default the fit method will output the optimal amount of information, including every i-th trial's results (omitting the pruned trials) and the final (optimized) model parameters.
All the verbosity options are 0, 1, 2, 3, 4, 5, where 0 is completely silent except for fatal errors and built-in exceptions; levels 1-5 are based on the optuna.logging options. The default verbosity level 1 is enriched with essential optimization statistics.
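For example, to keep the console completely silent during tuning (assuming the init argument is named verbosity, in line with the levels described above):
from verstack import LGBMTuner

# verbosity = 0: silent except for fatal errors and built-in exceptions
tuner = LGBMTuner(metric = 'mse', verbosity = 0)
tuner.fit(X, y)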
Additional LGBMTuner attributes
Feature importance values
tuner.feature_importances
>>> ID 0.08145
>>> crim 0.07421
>>> zn 0.00424
>>> indus 0.02870
>>> chas 0.00547
>>> nox 0.06929
>>> rm 0.13872
>>> age 0.11890
>>> dis 0.13448
>>> rad 0.02966
>>> tax 0.04619
>>> ptratio 0.03977
>>> black 0.06027
>>> lstat 0.16865
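If, as the printout above suggests, .feature_importances comes back as a pandas Series, it can be sorted and filtered like any other Series (the 0.05 cutoff below is arbitrary):
# top 5 features by importance (assumes tuner.feature_importances is a pandas Series)
print(tuner.feature_importances.sort_values(ascending = False).head(5))
# keep only the features above an arbitrary importance cutoff
selected = tuner.feature_importances[tuner.feature_importances > 0.05].index.tolist()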
Initially defined params
tuner.init_params
>>> {'learning_rate': 0.01,
>>> 'num_leaves': 16,
>>> 'colsample_bytree': 0.9,
>>> 'subsample': 0.9,
>>> 'verbosity': -1,
>>> 'n_estimators': 10000,
>>> 'early_stopping_rounds': 100,
>>> 'random_state': 42,
>>> 'objective': 'regression',
>>> 'metric': 'l2',
>>> 'num_threads': 10,
>>> 'reg_alpha': 1}
Optimized params
tuner.best_params
>>> {'learning_rate': 0.01,
>>> 'num_leaves': 130,
>>> 'colsample_bytree': 0.8246563384855297,
>>> 'subsample': 0.5335500916057069,
>>> 'verbosity': -1,
>>> 'random_state': 42,
>>> 'objective': 'regression',
>>> 'metric': 'l2',
>>> 'num_threads': 10,
>>> 'reg_alpha': 0.0011166918277076062,
>>> 'min_sum_hessian_in_leaf': 0.00270990587924765,
>>> 'reg_lambda': 8.270186047772752e-06,
>>> 'n_estimators': 605}
Trained model instance
Although, after calling tuner.fit(X, y), the LGBMTuner instance contains the tuned and fitted LGBM model and provides all the necessary prediction methods (tuner.predict(test)), the actual LGBM booster model can be extracted from the tuner object:
tuner.fitted_model
>>> <lightgbm.basic.Booster at 0x7ff3b89a5b10>
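Since this is a regular lightgbm Booster, the standard lightgbm API applies to it, e.g. for persisting the model (the file name below is arbitrary):
import lightgbm as lgb

booster = tuner.fitted_model
print(booster.num_trees())              # number of fitted trees

booster.save_model('lgbm_tuned.txt')    # save the raw booster to disk
reloaded = lgb.Booster(model_file = 'lgbm_tuned.txt')
pred = reloaded.predict(test)           # predict with the raw booster directly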
Additional methods and attributes are well described in the documentation.
The proposed framework encapsulates extensive research and data science best practices, reducing the amount of manual work and delivering a significant improvement for any classification/regression task it is applied to
And be sure to check out the rest of the tools verstack has to offer.
The package includes solutions to some day-to-day tasks that didn't have convenient solutions before
Current modules:
- verstack.LGBMTuner
- verstack.PandasOptimizer - automatic memory optimization when reading data into pandas. One-liner for a 5-fold memory footprint reduction & a significant training time decrease (Medium article)
- verstack.Stacker - automated ensembling factory; create multilayer stacking ensembles with a few lines of code (Medium article)
- verstack.FeatureSelector - automated feature selection tool based on quick recursive feature elimination by various ML models (Medium article)
- verstack.DateParser - ultimate DateParser class that automatically finds and parses datetime features from all the possible datetime formats in your dataframe (Medium article)
- verstack.Multicore - parallelise any function with a single line of code (by far the most popular tool) (Medium article)
- verstack.NaNImputer - impute all the NaN values by machine learning with a single line of code (Medium article)
- verstack.ThreshTuner - automatic threshold selection for getting the most out of binary classification predicted probabilities (Medium article)
- stratified_continuous_split - continuous data stratification (Medium article)
- categoric encoders: Factorizer, OneHotEncode, FrequencyEncoder, WeightOfEvidenceEncoder, MeanTargetEncoder (Medium article)
- timer - convenient timer to measure any function's execution
Links
verstack.LGBMTuner documentation
verstack documentation