API References
hgboost: Hyperoptimized Gradient Boosting library.
Contributors: https://github.com/erdogant/hgboost
- class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)
hgboost: Hyperoptimized Gradient Boosting.
Description
HGBoost stands for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization of XGBoost, LightBoost, and CatBoost models. It carefully splits the dataset into a train, test, and independent validation set. Within the train-test set there is an inner loop for optimizing the hyperparameters using Bayesian optimization (with hyperopt), and an outer loop that scores how well the top-performing models generalize based on k-fold cross-validation. As such, it makes the best attempt to select the most robust model with the best performance.
- param max_eval
Number of evaluations used to search the hyperparameter space.
- type max_eval
int, (default : 250)
- param threshold
Classification threshold. In case of a two-class model this is 0.5.
- type threshold
float, (default : 0.5)
- param cv
Number of folds for the k-fold cross-validation. The test size is specified by test_size.
- type cv
int, optional (default : 5)
- param top_cv_evals
Number of top-performing models that are evaluated with cross-validation. If set to None, each iteration (max_eval) is tested. If set to 0, cross-validation is not performed.
- type top_cv_evals
int, (default : 10)
- param test_size
Percentage split for the test set, relative to the total dataset.
- type test_size
float, (default : 0.2)
- param val_size
Percentage split for the validation set, relative to the total dataset. This part is kept untouched and used only once to determine the final model performance.
- type val_size
float, (default : 0.2)
- param is_unbalance
Control the balance of positive and negative weights, useful for unbalanced classes.
xgboost clf : sum(negative instances) / sum(positive instances)
catboost clf : sum(negative instances) / sum(positive instances)
lightgbm clf : balanced
False: class weights are left unbalanced.
- type is_unbalance
bool, (default : True)
- param random_state
Fix the random state for the validation set and test set. Note that it is not used for the cross-validation.
- type random_state
int, (default : None)
- param n_jobs
The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- type n_jobs
int, (default : -1)
- param gpu
Compute using either GPU or CPU. Note that GPU usage is not well supported because various optimizations are performed during training/testing/cross-validation. True: use GPU. False: use CPU.
- type gpu
bool, (default : False)
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- rtype
None.
References
Blog - classification: https://erdogant.medium.com/hands-on-guide-for-hyperparameter-tuning-with-bayesian-optimization-for-classification-models-2002224bfa3d
Github : https://github.com/erdogant/hgboost
Documentation pages: https://erdogant.github.io/hgboost/
Notebook Classification: https://colab.research.google.com/github/erdogant/hgboost/blob/master/notebooks/hgboost_classification_examples.ipynb
Notebook Regression: https://colab.research.google.com/github/erdogant/hgboost/blob/master/notebooks/hgboost_regression_examples.ipynb
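A minimal end-to-end sketch of the class (parameter values are illustrative; it assumes the titanic example dataset contains a 'Survived' column):
>>> from hgboost import hgboost
>>>
>>> # Initialize with user-defined parameters
>>> hgb = hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>>
>>> # Load the titanic example dataset
>>> df = hgb.import_example(data='titanic')
>>>
>>> # Separate the response variable from the features (assumes a 'Survived' column)
>>> y = df['Survived'].values
>>> del df['Survived']
>>>
>>> # Encode the mixed features into a numeric matrix
>>> X = hgb.preprocessing(df)
>>>
>>> # Hyperoptimize a two-class xgboost classifier
>>> results = hgb.xgboost(X, y, pos_label=1, method='xgb_clf')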
- catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Catboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
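A usage sketch on a two-class problem; the scikit-learn breast cancer dataset and the parameter values are illustrative:
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load a two-class example dataset
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the catboost classifier
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.catboost(X, y, pos_label=1, eval_metric='auc')
>>> results['best_params']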
- catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Catboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
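A regression sketch; the scikit-learn diabetes dataset is illustrative:
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load a regression example dataset
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the catboost regressor on rmse
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.catboost_reg(X, y, eval_metric='rmse')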
- ctb_clf(space)
Train catboost classification model.
- ctb_reg(space)
Train catboost regression model.
- ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')
Ensemble Classification with hyperparameter optimization.
Description
Fit the best model for each of xgboost, catboost and lightboost, and then combine the individual models into one new model.
- param X
Input dataset.
- type X
pd.DataFrame
- param y
Response variable.
- type y
array-like
- param pos_label
Fit the model on the pos_label that is in [y].
- type pos_label
string/int.
- param methods
- The models included in the ensemble classifier or regressor. The clf and reg models cannot be combined.
[‘xgb_clf’,’ctb_clf’,’lgb_clf’]
[‘xgb_reg’,’ctb_reg’,’lgb_reg’]
- type methods
list of strings, (default : [‘xgb_clf’,’ctb_clf’,’lgb_clf’]).
- param eval_metric
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (two-class classification : default)
- type eval_metric
str, (default : ‘auc’)
- param greater_is_better
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
- type greater_is_better
bool (default : True)
- param voting
- Combining classifier using a voting scheme.
‘hard’: using predicted classes.
‘soft’: using the probabilities.
- type voting
str, (default : ‘soft’)
- returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- rtype
dict
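A sketch of a soft-voting classification ensemble (illustrative two-class dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Fit the three classifiers and combine them into one ensemble model
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.ensemble(X, y, pos_label=1, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], voting='soft')
>>>
>>> # Predict with the ensemble
>>> y_pred, y_proba = hgb.predict(X)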
- import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Description
Import one of the example datasets from the github source or specify your own download URL.
- param data
Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
- type data
str
- param url
URL link to the dataset.
- type url
str
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- returns
Dataset containing mixed features.
- rtype
pd.DataFrame
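A brief sketch:
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> # Load one of the example datasets by name
>>> df = hgb.import_example(data='titanic')
>>> df.head()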
- lgb_clf(space)
Train lightboost classification model.
- lgb_reg(space)
Train lightboost regression model.
- lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Lightboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
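The call pattern mirrors catboost(); a brief sketch with an alternative metric (illustrative dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the lightboost classifier on the F1-score
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.lightboost(X, y, pos_label=1, eval_metric='f1')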
- lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Lightboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
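A brief regression sketch (illustrative dataset and metric):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the lightboost regressor on the mean absolute error
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.lightboost_reg(X, y, eval_metric='mae')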
- load(filepath='hgboost_model.pkl', verbose=3)
Load learned model.
Description
The load function restores the trained model and results. After a fresh (new) start, you first need to re-initialize the hgboost model. Loading the model also restores the user-defined parameters.
- param filepath
Pathname to stored pickle files.
- type filepath
str
- param verbose
Show message. A higher number gives more information. The default is 3.
- type verbose
int, optional
Examples
>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
>>>
>>> # Load
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> results = hgb.load(filepath='hgboost_model.pkl')
>>>
>>> # Make predictions again with:
>>> y_pred, y_proba = hgb.predict(X)
- returns
* Dictionary containing the model results.
* Object with the trained model.
- plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)
Plot the summary results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple, default (20, 15)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
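A plotting sketch, assuming hgb holds a fitted model from one of the examples above:
>>> # Summary of all evaluations, y-axis limited to the auc range
>>> ax = hgb.plot(ylim=(0.5, 1.0), return_ax=True)
>>>
>>> # Related diagnostics
>>> hgb.plot_params()
>>> hgb.plot_cv()
>>> hgb.plot_validation()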
- plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)
Plot the results on the cross-validation set.
- Parameters
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- plot_ensemble(ylim, figsize, ax1, ax2)
Plot ensemble results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple) – Figure size, (height, width)
ax1 (Object) – Axis of figure 1
ax2 (Object) – Axis of figure 2
- Returns
ax – Figure axis.
- Return type
object
- plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)
Distribution of parameters.
Description
This plot shows the density distribution of the evaluated parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.
- param top_n
Top n parameters that scored highest are plotted with a black dashed vertical line.
- type top_n
int, (default : 10)
- param shade
Fill the density plot.
- type shade
bool, (default : True)
- param figsize
Figure size, (height, width)
- type figsize
tuple, default (18, 18)
- returns
ax – Figure axis.
- rtype
object
- plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)
Plot the results on the validation set.
- Parameters
normalized (bool, (default : None)) – Normalize the confusion matrix when True.
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- predict(X, model=None)
Prediction using fitted model.
- Parameters
X (pd.DataFrame) – Input data.
- Returns
y_pred (array-like) – Prediction results.
y_proba (array-like) – Probability of the predictions.
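A brief sketch, assuming hgb holds a fitted model and X matches the training features:
>>> y_pred, y_proba = hgb.predict(X)
>>> y_pred[:5]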
- preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)
Pre-processing of the input data.
- Parameters
df (pd.DataFrame) – Input data.
y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.
perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the unique non-zero values are above this percentage. The default is 0.8.
verbose (int, (default : None)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
data – Processed data.
- Return type
pd.DataFrame
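A sketch that encodes the mixed-type titanic features before model fitting:
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> df = hgb.import_example(data='titanic')
>>> # Coerce the mixed columns into a numeric matrix
>>> X = hgb.preprocessing(df, perc_min_num=0.8)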
- save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)
Save learned model in pickle file.
- Parameters
filepath (str, (default: 'hgboost_model.pkl')) – Pathname to store pickle files.
overwrite (bool, (default=False)) – Overwrite the file if it exists.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
Examples
>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
- Returns
bool – Status of whether the file was saved.
- Return type
bool
- treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)
Tree plot.
- Parameters
num_trees (int, default None) – Best tree is shown when None. Specify the ordinal number of any other target tree.
plottype (str, (default : 'horizontal')) –
- Works only in case of an xgb model.
‘horizontal’
‘vertical’
figsize (tuple, default (20, 25)) – Figure size, (height, width)
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
ax
- Return type
object
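A brief sketch, assuming hgb holds a fitted xgb model:
>>> # Show the best tree (num_trees=None)
>>> ax = hgb.treeplot(plottype='horizontal', return_ax=True)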
- xgb_clf(space)
Train xgboost classification model.
- xgb_clf_multi(space)
Train xgboost multi-class classification model.
- xgb_reg(space)
Train Xgboost regression model.
- xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')
Xgboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
method (str, (default : 'xgb_clf')) –
‘xgb_clf’: XGboost two-class classifier
‘xgb_clf_multi’: XGboost multi-class classifier
eval_metric (str, (default : None)) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
kappa : True -> multi-class
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
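The Examples under load() show the multi-class case; a two-class sketch (illustrative dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the two-class xgboost classifier
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.xgboost(X, y, pos_label=1, method='xgb_clf', eval_metric='auc')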
- xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Xgboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
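A brief regression sketch (illustrative dataset):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the xgboost regressor on rmse
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.xgboost_reg(X, y, eval_metric='rmse')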
- hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Description
Import one of the example datasets from the github source or specify your own download URL.
- param data
Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
- type data
str, (default : “titanic”)
- param url
URL link to the dataset.
- type url
str
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- returns
Dataset containing mixed features.
- rtype
pd.DataFrame
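A brief sketch of the functional form, without instantiating the class first:
>>> from hgboost.hgboost import import_example
>>> df = import_example(data='titanic')
>>> df.head()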