API References

clusteval is a Python package for measuring the goodness of an unsupervised clustering.

class clusteval.clusteval.clusteval(cluster='agglomerative', evaluate='silhouette', metric='euclidean', linkage='ward', min_clust=2, max_clust=25, normalize=False, savemem=False, verbose='info', params_dbscan={'eps': None, 'epsres': 50, 'min_samples': 0.01, 'n_jobs': -1, 'norm': False})

Cluster evaluation.

clusteval is a Python package that provides several evaluation approaches for measuring the goodness of an unsupervised clustering.

Parameters:

cluster (str, (default: 'agglomerative')) –
Type of clustering.
- 'agglomerative'
- 'kmeans'
- 'dbscan'
- 'hdbscan'
- 'optics' (marked TODO in the source)
evaluate (str, (default: 'silhouette')) –
Evaluation method for cluster validation.
- 'silhouette'
- 'dbindex'
- 'derivative'
metric (str, (default: 'euclidean')) –
Distance measure. All metrics from sklearn can be used, such as:
- 'euclidean'
- 'hamming'
- 'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'
linkage (str, (default: 'ward')) –
Linkage type for the clustering.
- 'ward'
- 'single'
- 'complete'
- 'average'
- 'weighted'
- 'centroid'
- 'median'
min_clust (int, (default: 2)) – Minimum number of clusters to evaluate; only cluster counts greater than or equal to min_clust are considered.
max_clust (int, (default: 25)) – Maximum number of clusters to evaluate; only cluster counts smaller than or equal to max_clust are considered.
normalize (bool, (default: False)) – Normalize the data using Z-scores.
savemem (bool, (default: False)) – Save memory when working with large datasets. Note that this option applies only to KMeans.
jitter (float, (default: None)) – Add jitter to the data points as random normal noise. A value of 0.01 usually works well for separating one-hot data.
verbose (str or int, (default: 'info')) – Print progress to screen. The default is 'info'. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug
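The normalize option refers to Z-score scaling. As a rough illustration (not clusteval's internal code), the sketch below standardizes each feature column to zero mean and unit standard deviation using only the standard library. The helper name zscore_columns is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list-of-rows table (hypothetical helper)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation (an assumption); constant columns
    # fall back to 1.0 so that we never divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Rows are samples, columns are features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Z = zscore_columns(X)
```

After scaling, both columns carry the same values even though the raw features differ by a factor of ten, which is the reason to normalize before computing distances.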

Returns:

dict

Return type:

dictionary with keys:

Examples

>>> # Import library
>>> from clusteval import clusteval
>>> # Initialize clusteval with default parameters
>>> ce = clusteval()
>>>
>>> # Generate random data
>>> from sklearn.datasets import make_blobs
>>> X, labels_true = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)
>>>
>>> # Fit best clusters
>>> results = ce.fit(X)
>>>
>>> # Make plot
>>> ce.plot()
>>>
>>> # silhouette plot
>>> ce.plot_silhouette()
>>>
>>> # Scatter plot
>>> ce.scatter()
>>>
>>> # Dendrogram
>>> ce.dendrogram()

dendrogram(X=None, labels=None, leaf_rotation=90, leaf_font_size=12, orientation='top', show_contracted=True, max_d=None, metric=None, linkage=None, truncate_mode=None, update_results=False, figsize=(15, 10), visible=True, dpi=200, ax=None, savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'})

Plot Dendrogram.

Parameters:

X (numpy-array (default : None)) – Input data.
labels (list, (default: None)) – Plot the labels. When None: the index of the original observation is used to label the leaf nodes.
leaf_rotation (int, (default: 90)) – Rotation of the labels [0-360].
leaf_font_size (int, (default: 12)) – Font size labels.
orientation (string, (default: 'top')) – Direction of the dendrogram: ‘top’, ‘bottom’, ‘left’ or ‘right’
show_contracted (bool, (default: True)) – The heights of non-singleton nodes contracted into a leaf node are plotted as crosses along the link connecting that leaf node.
max_d (Float, (default: None)) – Height of the dendrogram to make a horizontal cut-off line.
metric (str, (default: 'euclidean')) – Distance measure for the clustering, such as 'euclidean', 'hamming', etc.
linkage (str, (default: 'ward')) – Linkage type for the clustering: 'ward', 'single', 'complete', 'average', 'weighted', 'centroid', 'median'.
truncate_mode (string, (default: None)) – Truncation is used to condense the dendrogram, and can be based on 'level', 'lastp' or None.
figsize (tuple, (default: (15, 10))) – Size of the figure (width, height).
visible (bool, (default: True)) – Make the figure visible.
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. For example: {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}

Returns:

results –

labx : int : Cluster labels in the order of the input samples.
order_rows : string : Order of the cluster labels as presented in the dendrogram (left-to-right).
max_d : float : Maximum distance used to set the horizontal threshold line.
max_d_lower : float : Lower bound of the maximum distance.
max_d_upper : float : Upper bound of the maximum distance.

Return type:

dict
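The max_d value marks the height at which the dendrogram is cut into flat clusters. The sketch below illustrates that idea for single linkage only, where cutting at height max_d is equivalent to joining all points that are connected through pairwise distances of at most max_d. It is not the library's implementation (which defaults to 'ward'), and the function name cut_single_linkage is hypothetical.

```python
from itertools import combinations
from math import dist


def cut_single_linkage(points, max_d):
    """Flat cluster labels from cutting a single-linkage tree at height max_d."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points that lies within the cut-off distance.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= max_d:
            parent[find(i)] = find(j)

    # Relabel the roots as consecutive integers in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
labx = cut_single_linkage(points, max_d=2.0)
```

Raising max_d merges more branches into one cluster, and lowering it splits the tree into more clusters, which is the effect the horizontal cut-off line visualizes.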

enrichment(X=None)

Enrichment analysis.

Parameters: X (DataFrame) – The input dataframe that is used to determine whether there is any enrichment with the detected cluster labels.
Returns: Dataframe containing significant associations with the cluster labels.
Return type: pd.DataFrame
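This reference does not state which statistical test enrichment uses. One common choice for asking whether a feature value is over-represented in a cluster is the hypergeometric test, and that is what the sketch below assumes. The function and variable names are hypothetical, and the toy data is made up for illustration.

```python
from math import comb


def hypergeom_pvalue(n_total, n_with_feature, n_in_cluster, n_overlap):
    """P(overlap >= n_overlap) when drawing n_in_cluster samples without replacement."""
    upper = min(n_with_feature, n_in_cluster)
    favourable = sum(
        comb(n_with_feature, k) * comb(n_total - n_with_feature, n_in_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_total, n_in_cluster)


# Toy data: one cluster label and one categorical feature value per sample.
labx = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
feature = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

# Test whether value "a" is over-represented in cluster 0.
in_cluster = [value for value, label in zip(feature, labx) if label == 0]
p_value = hypergeom_pvalue(
    n_total=len(labx),
    n_with_feature=feature.count("a"),
    n_in_cluster=len(in_cluster),
    n_overlap=in_cluster.count("a"),
)
```

A small p-value means the overlap between the cluster and the feature value is unlikely under random assignment; in practice such p-values would also need a multiple-testing correction across all features and clusters.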

fit(X, savemem=False)

Cluster validation.

Parameters:

X (Numpy-array) – Input data. The rows are the samples and the columns are the features.
savemem (bool, (default: False)) – Save memory when working with large datasets. Setting this value to True will not store the dataset in memory, and K-means will be optimized.

Returns:

evaluate: str: Name of the evaluation method that is used.
score: pd.DataFrame(): The scores per number of clusters; the silhouette and dbindex methods provide this information.
labx: list: Cluster labels.
fig: list: Relevant information to make the plot.

Return type:

dict with various keys. Note that the keys listed above can change depending on the evaluation method used.
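To make the idea of a score per number of clusters concrete, here is a minimal pure-Python sketch of the silhouette coefficient, used to rank two candidate labelings. It is an illustration under stated assumptions, not the code that fit runs: the candidate labelings are hard-coded rather than produced by a clustering algorithm, Euclidean distance is assumed, and at least two clusters are required.

```python
from math import dist
from statistics import fmean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (needs >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            fmean([dist(point, other) for other in members])
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return fmean(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
# Hypothetical candidate labelings for 2 and 3 clusters.
candidates = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

The labeling with the highest mean silhouette is kept, which mirrors how a score table over cluster counts can be reduced to one set of cluster labels.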

import_example(data='titanic', url=None, sep=',', params={'n_feat': 2, 'n_samples': 1000, 'noise': 0.05, 'random_state': 170})

Import example dataset from github source.

Import one of the datasets that can be loaded from datazets, or use a URL. For more details, see: https://github.com/erdogant/datazets

Parameters:

data (str) –
- 'blobs' (numeric)
- 'moons' (numeric)
- 'circles' (numeric)
- 'anisotropic' (numeric)
- 'globular' (numeric)
- 'uniform' (numeric)
- 'densities' (numeric)
- etc
url (str) – URL of the dataset.
sep (str) – Delimiter of the data set.
params (dict) – Parameters that are used to generate more customized data sets.

Return type:

pd.DataFrame()

load(filepath='clusteval.pkl')

Restore previous results.

Parameters: filepath (str) – Pathname to the stored pickle file.
Return type: Object.

plot(title=None, xlabel='Nr. clusters', ylabel='Score', figsize=(15, 8), savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, font_properties={'axis_color': '#000000', 'fontcolor': '#000000', 'size_title': 18, 'size_x_axis': 18, 'size_y_axis': 18}, params_line={'color': 'k'}, params_vline={'color': 'r', 'linestyle': '--', 'linewidth': 2}, ax=None, showfig=True, verbose='info')

Make a plot.

Parameters:

figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. For example: {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
verbose (str or int, (default: 'info')) – Print progress to screen. The default is 'info'. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug

Returns:

Figure and axis of the figure.

Return type:

tuple, (fig, ax)
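The numeric verbose levels 10, 20, 30 and 40 coincide with DEBUG, INFO, WARNING and ERROR in Python's logging module, and 60 lies above CRITICAL (50), so it silences every message. The sketch below shows one way such a setting could be translated into a logger level. The helper to_level is hypothetical, and every string name other than 'info' is an assumption, since only 'info' appears as a string in this reference.

```python
import logging

# Only 'info' is documented as a string value; the other names are assumptions.
LEVELS = {"error": 40, "warning": 30, "info": 20, "debug": 10}


def to_level(verbose):
    """Translate a verbose setting (name, number or None) into a numeric level."""
    if verbose is None:
        return 60  # above logging.CRITICAL, so nothing is printed
    if isinstance(verbose, str):
        return LEVELS[verbose.lower()]
    return int(verbose)


logger = logging.getLogger("verbose-sketch")
logger.setLevel(to_level("info"))

logger.debug("hidden at level 'info'")
logger.info("shown at level 'info'")
```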

plot_silhouette(X=None, dot_size=25, jitter=None, embedding=None, cmap='tab20c', figsize=(15, 8), savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, showfig=True)

Make a silhouette plot.

Parameters:

X (array-like, (default: None)) – Input dataset used in the .fit() function. You can also provide PCA or t-SNE coordinates.
dot_size (int, (default: 25)) – Size of the dots in the scatterplot.
jitter (float, (default: None)) – Add jitter to the data points as random normal noise. A value of 0.01 usually works well for separating one-hot data.
embedding (str, (default: None)) – For high-dimensional data, an embedding with t-SNE can be performed. Options: None, 'tsne'.
cmap (string, (default: 'tab20c')) – Colormap.
figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. For example: {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}

Return type:

None.
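The jitter parameter is described as adding random normal data to the points. A minimal sketch of that idea follows, assuming the jitter value is used as the standard deviation of zero-mean Gaussian noise; the helper add_jitter and its seed argument are hypothetical.

```python
import random


def add_jitter(rows, jitter=0.01, seed=0):
    """Return a copy of rows with Gaussian noise (std = jitter) added to every value."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, jitter) for value in row] for row in rows]


# Identical one-hot rows would be drawn on top of each other in a scatterplot.
one_hot = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
jittered = add_jitter(one_hot, jitter=0.01)
```

With a small jitter, identical rows become visually distinguishable while staying close to their original positions.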

save(filepath='clusteval.pkl', overwrite=False)

Save model in pickle file.

Parameters:

filepath (str, (default: 'clusteval.pkl')) – Pathname to store the pickle file.
overwrite (bool, (default: False)) – Overwrite the file if it exists.

Returns:

bool – Status whether the file is saved.

Return type:

[True, False]
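The save and load pair follows the usual pickle round trip with an overwrite guard. The sketch below imitates that behavior with standalone functions, assuming that a refused write returns False. The names save_results and load_results are hypothetical stand-ins, not the methods themselves.

```python
import os
import pickle
import tempfile


def save_results(obj, filepath="clusteval.pkl", overwrite=False):
    """Pickle obj to filepath; return True if written, False if an existing file was kept."""
    if os.path.isfile(filepath) and not overwrite:
        return False
    with open(filepath, "wb") as handle:
        pickle.dump(obj, handle)
    return True


def load_results(filepath="clusteval.pkl"):
    """Restore a previously pickled object."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "clusteval.pkl")
    first = save_results({"labx": [0, 0, 1]}, path)
    second = save_results({"labx": []}, path)  # file exists, write is refused
    third = save_results({"labx": [1, 1, 0]}, path, overwrite=True)
    restored = load_results(path)
```

As with any pickle file, only load files from a source you trust, because unpickling can execute arbitrary code.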

scatter(X=None, s=25, embedding=None, n_feat=2, legend=False, jitter=None, cmap='tab20c', figsize=(25, 15), fontsize=16, fontcolor='k', density=False, grid=True, showfig=True, savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, params_scatterd={'alpha': 0.8, 'dpi': 100, 'edgecolor': None, 'gradient': None, 'marker': 'o'}, interactive=False)

Scatterplot.

Create a scatterplot of the first two features or of a t-SNE embedding. The clusters are colored by cluster label and labeled with the features from the enrichment analysis.

Parameters:

X (DataFrame or Numpy-array) – Input data. The rows are the samples and the columns are the features.
s (int, optional) – Dot size. The default is 25. After an enrichment analysis, the size is based on the -log(P) value for the enriched cluster label.
embedding (str, (default: None)) – For high-dimensional data, an embedding with t-SNE can be performed. Options: None, 'tsne'.
n_feat (int, (default: 2)) – Number of top significant features for the particular label to be plotted. This requires doing an enrichment analysis first.
legend (bool, (default: False)) – Plot the legend.
jitter (float, (default: None)) – Add jitter to the data points as random normal noise. A value of 0.01 usually works well for separating one-hot data.
cmap (String, optional) – Colormap. The default is 'tab20c'. Other options include 'Set1' (discrete colors), 'Set2', 'rainbow', 'bwr' (blue-white-red), 'binary' or 'binary_r', 'seismic' (blue-white-red), 'Blues' (white-to-blue), 'Reds' (white-to-red), 'Pastel1' (discrete colors), 'Paired' (discrete colors).
figsize (tuple, optional) – Figure size. The default is (25, 15).
fontsize (int, (default: 16)) – The font size of the y labels that are plotted in the graph.
fontcolor (str, or list/array of RGB colors with the same size as X, (default: 'k')) – None: use the same color scheme as for c. A single color such as [0,0,0]: all fonts get this color.
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. For example: {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
showfig (bool, optional) – Show the figure.

Returns:

Figure and axis of the figure.

Return type:

tuple, (fig, ax)