API References

hgboost: Hyperoptimized Gradient Boosting library.

Contributors: https://github.com/erdogant/hgboost

class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)

hgboost: Hyperoptimized Gradient Boosting.

Description

HGBoost stands for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization of XGBoost, LightBoost, and CatBoost models. It carefully splits the dataset into a train, test, and independent validation set. Within the train-test set there is an inner loop that optimizes the hyperparameters using Bayesian optimization (with hyperopt), and an outer loop that scores how well the top-performing models generalize based on k-fold cross-validation. As such, it makes the best attempt to select the most robust model with the best performance.

param max_eval

Number of evaluations of the hyperparameter search; the search space is sampled this many times.

type max_eval

int, (default : 250)

param threshold

Classification threshold. In case of a two-class model this is typically 0.5.

type threshold

float, (default : 0.5)

param cv

Number of folds for k-fold cross-validation. The test size is specified separately with test_size.

type cv

int, optional (default : 5)

param top_cv_evals

Number of top-performing models that are evaluated with k-fold cross-validation. If set to None, each iteration (max_eval) is tested. If set to 0, cross-validation is not performed.

type top_cv_evals

int, (default : 10)

param test_size

Fraction of the total dataset that is reserved for the test set.

type test_size

float, (default : 0.2)

param val_size

Fraction of the total dataset that is reserved for the validation set. This part is kept untouched and used only once to determine the final model performance.

type val_size

float, (default : 0.2)

param is_unbalance

Control the balance of positive and negative weights, useful for unbalanced classes.
  • xgboost clf: sum(negative instances) / sum(positive instances)

  • catboost clf: sum(negative instances) / sum(positive instances)

  • lightgbm clf: balanced

  • False: grid search

type is_unbalance

Bool, (default: True)

param random_state

Fix the random state for the validation set and test set. Note that it is not used for the cross-validation.

type random_state

int, (default : None)

param n_jobs

The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

type n_jobs

int, (default : -1)

param gpu

Compute using either GPU or CPU. Note that GPU usage is not well supported because various optimizations are performed during training/testing/crossvalidation. True: Use GPU. False: Use CPU.

type gpu

bool, (default : False)

param verbose

Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

type verbose

int, (default : 3)

rtype

None.
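
Examples

A minimal initialization sketch; the values shown are the defaults and purely illustrative:

>>> from hgboost import hgboost
>>> hgb = hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None)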

catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Catboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc').) –

    Evaluation metric for the classification model.
    • ’auc’: area under ROC curve (default for two-class)

    • ’kappa’: (default for multi-class)

    • ’f1’: F1-score

    • ’logloss’

    • ’auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.

  • greater_is_better (bool (default : True).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
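
Examples

A sketch of a two-class catboost fit; it assumes the 'Survived' response column in the titanic example data (see import_example below):

>>> from hgboost import hgboost
>>> hgb = hgboost(max_eval=10, random_state=42)
>>> df = hgb.import_example(data='titanic')
>>> y = df['Survived'].values
>>> del df['Survived']
>>> X = hgb.preprocessing(df, verbose=0)
>>> results = hgb.catboost(X, y, pos_label=1)
>>> print(results['best_params'])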

catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Catboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • ’rmse’: root mean squared error.

    • ’mse’: mean squared error.

    • ’mae’: mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
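
Examples

A regression sketch; the numeric 'Age' column of the titanic example data is assumed as response, with NaN rows filtered out first:

>>> import numpy as np
>>> from hgboost import hgboost
>>> hgb = hgboost(max_eval=10, random_state=42)
>>> df = hgb.import_example(data='titanic')
>>> y = df['Age'].values
>>> del df['Age']
>>> I = ~np.isnan(y)
>>> X = hgb.preprocessing(df, verbose=0)
>>> results = hgb.catboost_reg(X.loc[I, :], y[I], eval_metric='rmse')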

ctb_clf(space)

Train catboost classification model.

ctb_reg(space)

Train catboost regression model.

ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')

Ensemble Classification with hyperparameter optimization.

Description

Fit the best model for xgboost, catboost and lightboost, and then combine the individual models into a new ensemble model.

param X

Input dataset.

type X

pd.DataFrame

param y

Response variable.

type y

array-like

param pos_label

Fit the model on the pos_label that is present in [y].

type pos_label

string/int.

param methods
The models included for the ensemble classifier or regressor. The clf and reg models cannot be combined.
  • [‘xgb_clf’,’ctb_clf’,’lgb_clf’]

  • [‘xgb_reg’,’ctb_reg’,’lgb_reg’]

type methods

list of strings, (default : [‘xgb_clf’,’ctb_clf’,’lgb_clf’]).

param eval_metric
Evaluation metric for the classification or regression model.
  • ‘auc’: area under ROC curve (two-class classification : default)

type eval_metric

str, (default : ‘auc’)

param greater_is_better
If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
  • auc : True -> two-class

type greater_is_better

bool (default : True)

param voting
Combining classifier using a voting scheme.
  • ‘hard’: using predicted classes.

  • ‘soft’: using the predicted probabilities.

type voting

str, (default : ‘soft’)

returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

rtype

dict
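
Examples

A sketch combining the three optimized classifiers with soft voting; data preparation follows the titanic pattern used in the catboost example above:

>>> from hgboost import hgboost
>>> hgb = hgboost(max_eval=10, random_state=42)
>>> df = hgb.import_example(data='titanic')
>>> y = df['Survived'].values
>>> del df['Survived']
>>> X = hgb.preprocessing(df, verbose=0)
>>> results = hgb.ensemble(X, y, pos_label=1, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], voting='soft')
>>> y_pred, y_proba = hgb.predict(X)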

import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Description

Import one of the example datasets from the github source, or specify your own download URL.

param data

Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

type data

str

param url

URL link to the dataset.

type url

str

param verbose

Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

type verbose

int, (default : 3)

returns

Dataset containing mixed features.

rtype

pd.DataFrame
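
Examples

A short sketch; any of the listed dataset names can be passed with the data parameter:

>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> df = hgb.import_example(data='titanic')
>>> df.head()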

lgb_clf(space)

Train lightboost classification model.

lgb_reg(space)

Train lightboost regression model.

lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Lightboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the classification model.
    • ’auc’: area under ROC curve (default for two-class)

    • ’kappa’: (default for multi-class)

    • ’f1’: F1-score

    • ’logloss’

    • ’auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.

  • greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
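
Examples

The call pattern mirrors the catboost classification example above; hgb, X, and y are assumed to be prepared the same way:

>>> results = hgb.lightboost(X, y, pos_label=1, eval_metric='auc')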

lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Lightboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • ’rmse’: root mean squared error.

    • ’mse’: mean squared error.

    • ’mae’: mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

load(filepath='hgboost_model.pkl', verbose=3)

Load learned model.

Description

The load function restores the trained model and results. In a fresh (new) session, you first need to re-initialize the hgboost model. By loading the model, the user-defined parameters are also restored.

param filepath

Pathname to stored pickle files.

type filepath

str

param verbose

Show message. A higher number gives more information. The default is 3.

type verbose

int, optional

Examples

>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
>>>
>>> # Load
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> results = hgb.load(filepath='hgboost_model.pkl')
>>>
>>> # Make predictions again with:
>>> y_pred, y_proba = hgb.predict(X)
returns
  • Dictionary containing model results.

  • Object with trained model.

plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)

Plot the summary results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple, default (20, 15)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object
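
Examples

A sketch, assuming a model has been fitted first (e.g. with xgboost as in the load example above):

>>> ax = hgb.plot(ylim=(0.5, 1), return_ax=True)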

plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)

Plot the results on the crossvalidation set.

Parameters

figsize (tuple, default (15, 8)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object

plot_ensemble(ylim, figsize, ax1, ax2)

Plot ensemble results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple) – Figure size, (height, width)

  • ax1 (Object) – Axis of figure 1

  • ax2 (Object) – Axis of figure 2

Returns

ax – Figure axis.

Return type

object

plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)

Distribution of parameters.

Description

This plot demonstrates the density distribution of the used parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.

param top_n

Top n parameters that scored highest are plotted with a black dashed vertical line.

type top_n

int, (default : 10)

param shade

Fill the density plot.

type shade

bool, (default : True)

param figsize

Figure size, (height, width)

type figsize

tuple, default (18, 18)

returns

ax – Figure axis.

rtype

object
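
Examples

A sketch plotting the parameter distributions of the ten best evaluations, assuming a fitted model:

>>> ax = hgb.plot_params(top_n=10, return_ax=True)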

plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)

Plot the results on the validation set.

Parameters
  • normalized (bool, (default : None)) – Normalize the confusion matrix when True.

  • figsize (tuple, default (15, 8)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object
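
Examples

A sketch with a normalized confusion matrix, assuming a fitted classification model:

>>> ax = hgb.plot_validation(normalized=True, return_ax=True)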

predict(X, model=None)

Prediction using fitted model.

Parameters

X (pd.DataFrame) – Input data.

Returns

  • y_pred (array-like) – predictions results.

  • y_proba (array-like) – Probability of the predictions.
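
Examples

A sketch, assuming a fitted model and a feature matrix X with the same columns as used for training:

>>> y_pred, y_proba = hgb.predict(X)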

preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)

Pre-processing of the input data.

Parameters
  • df (pd.DataFrame) – Input data.

  • y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.

  • perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the fraction of unique non-zero values is above this percentage. The default is 0.8.

  • verbose (int, (default: 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE

Returns

data – Processed data.

Return type

pd.DataFrame
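
Examples

A sketch on the titanic example data; the 'Survived' response column is removed before encoding:

>>> df = hgb.import_example(data='titanic')
>>> del df['Survived']
>>> X = hgb.preprocessing(df, verbose=0)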

save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)

Save learned model in pickle file.

Parameters
  • filepath (str, (default: 'hgboost_model.pkl')) – Pathname to store pickle files.

  • overwrite (bool, (default=False)) – Overwrite the file if it exists.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Examples

>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
>>>
Returns

Status whether the file is saved.

Return type

bool

treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)

Tree plot.

Parameters
  • num_trees (int, default None) – The best tree is shown when None. Specify the ordinal number to plot any other tree.

  • plottype (str, (default : 'horizontal')) –

    Works only in case of an xgb model.
    • ’horizontal’

    • ’vertical’

  • figsize (tuple, default (20, 25)) – Figure size, (height, width)

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

ax

Return type

object
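
Examples

A sketch, assuming a fitted xgb model:

>>> ax = hgb.treeplot(return_ax=True)
>>> ax = hgb.treeplot(num_trees=3, plottype='vertical', return_ax=True)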

xgb_clf(space)

Train xgboost classification model.

xgb_clf_multi(space)

Train xgboost multi-class classification model.

xgb_reg(space)

Train Xgboost regression model.

xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')

Xgboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • method (str, (default : 'xgb_clf').) –

    • ‘xgb_clf’: XGboost two-class classifier

    • ’xgb_clf_multi’: XGboost multi-class classifier

  • eval_metric (str, (default : None).) –

    Evaluation metric for the classification model.
    • ’auc’: area under ROC curve (default for two-class)

    • ’kappa’: (default for multi-class)

    • ’f1’: F1-score

    • ’logloss’

    • ’auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.

  • greater_is_better (bool.) –

    If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
    • auc : True -> two-class

    • kappa : True -> multi-class

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
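
Examples

A multi-class sketch on the iris data, following the pattern of the save/load examples above:

>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>> hgb = hgboost(max_eval=10, random_state=42)
>>> results = hgb.xgboost(X, y, method='xgb_clf_multi')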

xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Xgboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • ’rmse’: root mean squared error.

    • ’mse’: mean squared error.

    • ’mae’: mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
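
Examples

The call pattern mirrors the catboost_reg example above; X and the numeric response y are assumed to be prepared the same way:

>>> results = hgb.xgboost_reg(X, y, eval_metric='rmse')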

hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Description

Import one of the example datasets from the github source, or specify your own download URL.

param data

Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

type data

str, (default : “titanic”)

param url

URL link to the dataset.

type url

str

param verbose

Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

type verbose

int, (default : 3)

returns

Dataset containing mixed features.

rtype

pd.DataFrame
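
Examples

A sketch using the module-level function directly; the import path follows the signature above:

>>> from hgboost.hgboost import import_example
>>> df = import_example(data='titanic')
>>> df.head()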