Abstract
- Background
Gradient boosting is a powerful ensemble machine learning algorithm for predictive modeling that can be applied to tabular data. Models such as xgboost are frequently used in data science projects to create predictions. However, without good knowledge of the data in combination with the model parameters, this can quickly result in a poor, overtrained model. Controlling parameters such as the "early stopping rounds" can certainly be helpful.
Parameters can be tuned, and a combination of parameters can result in more accurate predictions. Searching across combinations of parameters is often performed with a grid search. A grid search comes with high computational costs and can easily result in overtrained models, as the search space can consist of tens of thousands of combinations to evaluate.
Luckily, there are optimization methods, such as hyperopt [1], that can do the heavy lifting using Bayesian optimization. But there is more to it, because an optimized grid search approach may still result in overtrained models. It is wise to carefully split the data into an independent validation set and a train and test set, and then examine the hyperparameter space by means of k-fold cross-validation.
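To make this splitting strategy concrete, a minimal sketch using scikit-learn is given below; the dataset, split sizes, and number of folds are illustrative assumptions and not part of hgboost itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold

# Toy data standing in for a real tabular dataset (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Keep an independent validation set aside; it is never used during tuning.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Examine each hyperparameter combination with k-fold cross-validation:
# train on the train folds and score on the held-out test fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X_rest):
    X_train, X_test = X_rest[train_idx], X_rest[test_idx]
    y_train, y_test = y_rest[train_idx], y_rest[test_idx]
    # ... fit and score a candidate model here
```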
- Aim
The aim of this library is to determine the most robust gradient boosting model by evaluating it on an independent validation set. The optimal set of parameters is determined by Bayesian hyperoptimization, using a k-fold cross-validation approach on independent train/test sets.
hgboost can be applied for classification tasks, such as two-class or multi-class classification, and for regression tasks, using xgboost, catboost, or lightboost. The aim of hgboost is to determine the most robust model by efficiently searching across the parameter space using hyperoptimization, where the loss is evaluated by means of a train/test set with k-fold cross-validation. In addition, the final optimized model is evaluated on an independent validation set.
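The snippet below sketches how this could look in practice for a two-class task; the class and parameter names (such as max_eval, cv, test_size, and val_size) are based on the hgboost documentation and should be treated as assumptions rather than a definitive interface.

```python
from hgboost import hgboost
from sklearn.datasets import make_classification

# Toy two-class dataset (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Initialize with the cross-validation and split settings described above
# (parameter names are assumptions; consult the hgboost documentation).
hgb = hgboost(max_eval=250, cv=5, test_size=0.2, val_size=0.2, random_state=0)

# Two-class classification with xgboost: hyperopt drives the search, the loss is
# evaluated with k-fold cross-validation on the train/test set, and the final
# model is evaluated on the independent validation set.
results = hgb.xgboost(X, y, pos_label=1)

# The optimized model and the evaluation summary are returned in the results.
print(results.keys())
```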
- Results
hgboost is a Python package for hyperparameter optimization of xgboost, catboost, and lightboost using cross-validation, with evaluation of the results on an independent validation set. There are many implementations of gradient boosting; some use the GPU efficiently, whereas others have specific interfaces. For this library, hgboost, we incorporated eXtreme Gradient Boosting (xgboost) [2], Light Gradient Boosting Machine (LightGBM) [3], and Category Gradient Boosting (catboost) [4]. We also created the option to learn an ensemble model.
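Under the same assumptions about the API as in the sketch above, the individual learners and the ensemble option could be invoked as follows; the method names (xgboost, lightboost, catboost, ensemble) mirror the incorporated libraries but should be verified against the hgboost documentation.

```python
from hgboost import hgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
hgb = hgboost(max_eval=100, cv=5, random_state=0)

# Each call optimizes one of the incorporated gradient boosting libraries
# (method names are assumptions based on the hgboost documentation).
results_xgb = hgb.xgboost(X, y, pos_label=1)     # eXtreme Gradient Boosting [2]
results_lgb = hgb.lightboost(X, y, pos_label=1)  # Light Gradient Boosting Machine [3]
results_ctb = hgb.catboost(X, y, pos_label=1)    # Category Gradient Boosting [4]

# Alternatively, learn an ensemble model across the optimized learners.
results_ens = hgb.ensemble(X, y, pos_label=1)
```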
Schematic overview
The schematic overview of our approach is as follows: