API References
hgboost: Hyperoptimized Gradient Boosting library.
Contributors: https://github.com/erdogant/hgboost
- class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)
hgboost: Hyperoptimized Gradient Boosting.
Description
HGBoost stands for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization of XGBoost, LightBoost, and CatBoost models. It carefully splits the dataset into a train, test, and independent validation set. Within the train-test set there is an inner loop for optimizing the hyperparameters using Bayesian optimization (with hyperopt), and an outer loop that scores how well the top-performing models generalize based on k-fold cross-validation. As such, it makes the best attempt to select the most robust model with the best performance.
- param max_eval
Number of evaluations used to search the hyperparameter space.
- type max_eval
int, (default : 250)
- param threshold
Classification threshold. In case of a two-class model this is 0.5.
- type threshold
float, (default : 0.5)
- param cv
Number of folds for the k-fold cross-validation. The test size is specified by test_size.
- type cv
int, optional (default : 5)
- param top_cv_evals
Number of top-performing models that are evaluated with cross-validation. If set to None, each iteration (max_eval) is tested. If set to 0, cross-validation is not performed.
- type top_cv_evals
int, (default : 10)
- param test_size
Percentage split for the test set, relative to the total dataset.
- type test_size
float, (default : 0.2)
- param val_size
Percentage split for the validation set, relative to the total dataset. This part is kept untouched and used only once to determine the final model performance.
- type val_size
float, (default : 0.2)
- param is_unbalance
Control the balance of positive and negative weights, useful for unbalanced classes.
xgboost clf : sum(negative instances) / sum(positive instances)
catboost clf : sum(negative instances) / sum(positive instances)
lightgbm clf : balanced
False: class weights are left unbalanced.
- type is_unbalance
bool, (default : True)
- param random_state
Fix the random state for the validation set and test set. Note that it is not used for the cross-validation.
- type random_state
int, (default : None)
- param n_jobs
The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- type n_jobs
int, (default : -1)
- param gpu
Compute using either GPU or CPU. Note that GPU usage is not well supported because various optimizations are performed during training/testing/cross-validation. True: use GPU. False: use CPU.
- type gpu
bool, (default : False)
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- rtype
None.
References
Blog - classification: https://erdogant.medium.com/hands-on-guide-for-hyperparameter-tuning-with-bayesian-optimization-for-classification-models-2002224bfa3d
Github : https://github.com/erdogant/hgboost
Documentation pages: https://erdogant.github.io/hgboost/
Notebook Classification: https://colab.research.google.com/github/erdogant/hgboost/blob/master/notebooks/hgboost_classification_examples.ipynb
Notebook Regression: https://colab.research.google.com/github/erdogant/hgboost/blob/master/notebooks/hgboost_regression_examples.ipynb
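A minimal end-to-end sketch of the class (parameter values are illustrative; it assumes the titanic example dataset contains a 'Survived' column):
>>> from hgboost import hgboost
>>>
>>> # Initialize with user-defined parameters
>>> hgb = hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>>
>>> # Load the titanic example dataset
>>> df = hgb.import_example(data='titanic')
>>>
>>> # Separate the response variable from the features (assumes a 'Survived' column)
>>> y = df['Survived'].values
>>> del df['Survived']
>>>
>>> # Encode the mixed features into a numeric matrix
>>> X = hgb.preprocessing(df)
>>>
>>> # Hyperoptimize a two-class xgboost classifier
>>> results = hgb.xgboost(X, y, pos_label=1, method='xgb_clf')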
- catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Catboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
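A usage sketch on a two-class problem; the scikit-learn breast cancer dataset and the parameter values are illustrative:
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load a two-class example dataset
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the catboost classifier
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.catboost(X, y, pos_label=1, eval_metric='auc')
>>> results['best_params']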
- catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Catboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
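A regression sketch; the scikit-learn diabetes dataset is illustrative:
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load a regression example dataset
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the catboost regressor on rmse
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.catboost_reg(X, y, eval_metric='rmse')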
- ctb_clf(space)
Train catboost classification model.
- ctb_reg(space)
Train catboost regression model.
- ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')
Ensemble Classification with hyperparameter optimization.
Description
Fit the best model for each of xgboost, catboost and lightboost, and then combine the individual models into one new model.
- param X
Input dataset.
- type X
pd.DataFrame
- param y
Response variable.
- type y
array-like
- param pos_label
Fit the model on the pos_label that is in [y].
- type pos_label
string/int.
- param methods
- The models included in the ensemble classifier or regressor. The clf and reg models cannot be combined.
[‘xgb_clf’,’ctb_clf’,’lgb_clf’]
[‘xgb_reg’,’ctb_reg’,’lgb_reg’]
- type methods
list of strings, (default : [‘xgb_clf’,’ctb_clf’,’lgb_clf’]).
- param eval_metric
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (two-class classification : default)
- type eval_metric
str, (default : ‘auc’)
- param greater_is_better
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
- type greater_is_better
bool (default : True)
- param voting
- Combining classifier using a voting scheme.
‘hard’: using predicted classes.
‘soft’: using the probabilities.
- type voting
str, (default : ‘soft’)
- returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- rtype
dict
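A sketch of a soft-voting classification ensemble (illustrative two-class dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Fit the three classifiers and combine them into one ensemble model
>>> hgb = hgboost(max_eval=25, cv=5, random_state=42)
>>> results = hgb.ensemble(X, y, pos_label=1, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], voting='soft')
>>>
>>> # Predict with the ensemble
>>> y_pred, y_proba = hgb.predict(X)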
- import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Description
Import one of the example datasets from the github source or specify your own download URL.
- param data
Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
- type data
str
- param url
URL link to the dataset.
- type url
str
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- returns
Dataset containing mixed features.
- rtype
pd.DataFrame
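A brief sketch:
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> # Load one of the example datasets by name
>>> df = hgb.import_example(data='titanic')
>>> df.head()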
- lgb_clf(space)
Train lightboost classification model.
- lgb_reg(space)
Train lightboost regression model.
- lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Lightboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
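The call pattern mirrors catboost(); a brief sketch with an alternative metric (illustrative dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the lightboost classifier on the F1-score
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.lightboost(X, y, pos_label=1, eval_metric='f1')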
- lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Lightboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
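A brief regression sketch (illustrative dataset and metric):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the lightboost regressor on the mean absolute error
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.lightboost_reg(X, y, eval_metric='mae')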
- load(filepath='hgboost_model.pkl', verbose=3)
Load learned model.
Description
The load function restores the trained model and results. After a fresh (new) start, you first need to re-initialize the hgboost model. Loading the model also restores the user-defined parameters.
- param filepath
Pathname to stored pickle files.
- type filepath
str
- param verbose
Show message. A higher number gives more information. The default is 3.
- type verbose
int, optional
Examples
>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
>>>
>>> # Load
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> results = hgb.load(filepath='hgboost_model.pkl')
>>>
>>> # Make predictions again with:
>>> y_pred, y_proba = hgb.predict(X)
- returns
* Dictionary containing the model results.
* Object with the trained model.
- plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)
Plot the summary results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple, default (20, 15)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
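A plotting sketch, assuming hgb holds a fitted model from one of the examples above:
>>> # Summary of all evaluations, y-axis limited to the auc range
>>> ax = hgb.plot(ylim=(0.5, 1.0), return_ax=True)
>>>
>>> # Related diagnostics
>>> hgb.plot_params()
>>> hgb.plot_cv()
>>> hgb.plot_validation()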
- plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)
Plot the results on the cross-validation set.
- Parameters
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- plot_ensemble(ylim, figsize, ax1, ax2)
Plot ensemble results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple) – Figure size, (height, width)
ax1 (Object) – Axis of figure 1
ax2 (Object) – Axis of figure 2
- Returns
ax – Figure axis.
- Return type
object
- plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)
Distribution of parameters.
Description
This plot shows the density distribution of the evaluated parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.
- param top_n
Top n parameters that scored highest are plotted with a black dashed vertical line.
- type top_n
int, (default : 10)
- param shade
Fill the density plot.
- type shade
bool, (default : True)
- param figsize
Figure size, (height, width)
- type figsize
tuple, default (18, 18)
- returns
ax – Figure axis.
- rtype
object
- plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)
Plot the results on the validation set.
- Parameters
normalized (bool, (default : None)) – Normalize the confusion matrix when True.
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- predict(X, model=None)
Prediction using fitted model.
- Parameters
X (pd.DataFrame) – Input data.
- Returns
y_pred (array-like) – Prediction results.
y_proba (array-like) – Probability of the predictions.
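A brief sketch, assuming hgb holds a fitted model and X matches the training features:
>>> y_pred, y_proba = hgb.predict(X)
>>> y_pred[:5]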
- preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)
Pre-processing of the input data.
- Parameters
df (pd.DataFrame) – Input data.
y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.
perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the unique non-zero values are above this percentage. The default is 0.8.
verbose (int, (default : None)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
data – Processed data.
- Return type
pd.DataFrame
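A sketch that encodes the mixed-type titanic features before model fitting:
>>> from hgboost import hgboost
>>> hgb = hgboost()
>>> df = hgb.import_example(data='titanic')
>>> # Coerce the mixed columns into a numeric matrix
>>> X = hgb.preprocessing(df, perc_min_num=0.8)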
- save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)
Save learned model in pickle file.
- Parameters
filepath (str, (default: 'hgboost_model.pkl')) – Pathname to store pickle files.
overwrite (bool, (default=False)) – Overwrite the file if it exists.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
Examples
>>> # Initialize libraries
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> # Load example dataset
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris['feature_names'])
>>> y = iris.target
>>>
>>> # Train model using user-defined parameters
>>> hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
>>> results = hgb.xgboost(X, y, method="xgb_clf_multi")
>>>
>>> # Save
>>> hgb.save(filepath='hgboost_model.pkl', overwrite=True)
- Returns
bool – Status of whether the file was saved.
- Return type
bool
- treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)
Tree plot.
- Parameters
num_trees (int, default None) – Best tree is shown when None. Specify the ordinal number of any other target tree.
plottype (str, (default : 'horizontal')) –
- Works only in case of an xgb model.
‘horizontal’
‘vertical’
figsize (tuple, default (20, 25)) – Figure size, (height, width)
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
ax
- Return type
object
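A brief sketch, assuming hgb holds a fitted xgb model:
>>> # Show the best tree (num_trees=None)
>>> ax = hgb.treeplot(plottype='horizontal', return_ax=True)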
- xgb_clf(space)
Train xgboost classification model.
- xgb_clf_multi(space)
Train xgboost multi-class classification model.
- xgb_reg(space)
Train Xgboost regression model.
- xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')
Xgboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
method (str, (default : 'xgb_clf')) –
‘xgb_clf’: XGboost two-class classifier
‘xgb_clf_multi’: XGboost multi-class classifier
eval_metric (str, (default : None)) –
- Evaluation metric for the regression or classification model.
‘auc’: area under ROC curve (default for two-class)
‘kappa’: (default for multi-class)
‘f1’: F1-score
‘logloss’
‘auc_cv’: Compute the average auc per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
kappa : True -> multi-class
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
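The Examples under load() show the multi-class case; a two-class sketch (illustrative dataset and values):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_breast_cancer()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the two-class xgboost classifier
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.xgboost(X, y, pos_label=1, method='xgb_clf', eval_metric='auc')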
- xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Xgboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
‘rmse’: root mean squared error.
‘mse’: mean squared error.
‘mae’: mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyperparameters.
- Returns
results –
best_params (dict): the optimized model hyperparameters.
summary (DataFrame): the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): final optimized model based on the k-fold cross-validation, with the hyperparameters as described in “params”.
val_results (dict): results of the final model on the independent validation set.
comparison_results (dict): comparison between the hyperoptimized and the default parameters.
- Return type
dict
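A brief regression sketch (illustrative dataset):
>>> from hgboost import hgboost
>>> import pandas as pd
>>> from sklearn import datasets
>>>
>>> data = datasets.load_diabetes()
>>> X = pd.DataFrame(data.data, columns=data.feature_names)
>>> y = data.target
>>>
>>> # Hyperoptimize the xgboost regressor on rmse
>>> hgb = hgboost(max_eval=25, random_state=42)
>>> results = hgb.xgboost_reg(X, y, eval_metric='rmse')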
- hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Description
Import one of the example datasets from the github source or specify your own download URL.
- param data
Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
- type data
str, (default : “titanic”)
- param url
URL link to the dataset.
- type url
str
- param verbose
Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- type verbose
int, (default : 3)
- returns
Dataset containing mixed features.
- rtype
pd.DataFrame
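A brief sketch of the functional form, without instantiating the class first:
>>> from hgboost.hgboost import import_example
>>> df = import_example(data='titanic')
>>> df.head()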