API References

clusteval is a Python package for measuring the goodness of an unsupervised clustering.

class clusteval.clusteval.clusteval(cluster='agglomerative', evaluate='silhouette', metric='euclidean', linkage='ward', min_clust=2, max_clust=25, normalize=False, savemem=False, verbose='info', params_dbscan={'eps': None, 'epsres': 50, 'min_samples': 0.01, 'n_jobs': -1, 'norm': False})

Cluster evaluation.

clusteval is a Python package that provides several evaluation approaches for measuring the goodness of an unsupervised clustering.

Parameters:
  • cluster (str, (default: 'agglomerative')) –

    Type of clustering.
    • ’agglomerative’

    • ’kmeans’

    • ’dbscan’

    • ’hdbscan’

    • ’optics’ # TODO

  • evaluate (str, (default: 'silhouette')) –

    Evaluation method for cluster validation.
    • ’silhouette’

    • ’dbindex’

    • ’derivative’

  • metric (str, (default: 'euclidean')) –

    Distance measure. All metrics from sklearn can be used, such as:
    • ’euclidean’

    • ’hamming’

    • ’braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’

  • linkage (str, (default: 'ward')) –

    Linkage type for the clustering.
    • ’ward’

    • ’single’

    • ’complete’

    • ’average’

    • ’weighted’

    • ’centroid’

    • ’median’

  • min_clust (int, (default: 2)) – Minimum number of clusters to evaluate. Only cluster counts greater than or equal to min_clust are evaluated.

  • max_clust (int, (default: 25)) – Maximum number of clusters to evaluate. Only cluster counts smaller than or equal to max_clust are evaluated.

  • normalize (bool, (default: False)) – Normalize the data using Z-scores.

  • savemem (bool, (default: False)) – Save memory when working with large datasets. Note that this option only applies to KMeans.

  • jitter (float, default: None) – Add jitter to data points as random normal data. A value of 0.01 is usually good for separating one-hot data.

  • verbose (str or int, (default: 'info')) – Print progress to screen. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug
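
The normalize option is described only as a Z-score normalization. A minimal sketch of what that typically means, assuming per-feature (column-wise) standardization with the population standard deviation. The helper name zscore_columns is hypothetical and not part of clusteval, and the library's exact implementation may differ:

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns (std == 0) are mapped to 0.0 to avoid division by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Three samples (rows) with two features (columns) on very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_norm = zscore_columns(X)
```

After standardization both features contribute on the same scale to distance-based metrics such as 'euclidean'.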

Returns:

dict

Return type:

dictionary with keys:

Examples

>>> # Import library
>>> from clusteval import clusteval
>>> # Initialize clusteval with default parameters
>>> ce = clusteval()
>>>
>>> # Generate random data
>>> from sklearn.datasets import make_blobs
>>> X, labels_true = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)
>>>
>>> # Fit best clusters
>>> results = ce.fit(X)
>>>
>>> # Make plot
>>> ce.plot()
>>>
>>> # silhouette plot
>>> ce.plot_silhouette()
>>>
>>> # Scatter plot
>>> ce.scatter()
>>>
>>> # Dendrogram
>>> ce.dendrogram()
dendrogram(X=None, labels=None, leaf_rotation=90, leaf_font_size=12, orientation='top', show_contracted=True, max_d=None, showfig=True, metric=None, linkage=None, truncate_mode=None, update_results=False, figsize=(15, 10), savefig={'fname': None, <built-in function format>: 'png', 'dpi ': None, 'orientation': 'portrait', 'facecolor': 'auto'})

Plot Dendrogram.

Parameters:
  • X (numpy-array (default : None)) – Input data.

  • labels (list, (default: None)) – Plot the labels. When None: the index of the original observation is used to label the leaf nodes.

  • leaf_rotation (int, (default: 90)) – Rotation of the labels [0-360].

  • leaf_font_size (int, (default: 12)) – Font size labels.

  • orientation (string, (default: 'top')) – Direction of the dendrogram: ‘top’, ‘bottom’, ‘left’ or ‘right’

  • show_contracted (bool, (default: True)) – The heights of non-singleton nodes contracted into a leaf node are plotted as crosses along the link connecting that leaf node.

  • max_d (float, (default: None)) – Height of the dendrogram at which to draw a horizontal cut-off line.

  • showfig (bool, (default: True)) – Plot the dendrogram.

  • metric (str, (default: 'euclidean')) – Distance measure for the clustering, such as ‘euclidean’, ‘hamming’, etc.

  • linkage (str, (default: 'ward')) – Linkage type for the clustering: ‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’.

  • truncate_mode (string, (default: None)) – Truncation is used to condense the dendrogram, which can be based on: ‘level’, ‘lastp’ or None.

  • figsize (tuple, (default: (15, 10))) – Size of the figure (width, height).

  • savefig (dict) – Options for saving the figure; see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. Example: {‘dpi’: ‘figure’, ‘format’: None, ‘metadata’: None, ‘bbox_inches’: None, ‘pad_inches’: 0.1, ‘facecolor’: ‘auto’, ‘edgecolor’: ‘auto’, ‘backend’: None}

Returns:

results

  • labx : int : Cluster labels based on the input-ordering.

  • order_rows : string : Order of the cluster labels as presented in the dendrogram (left-to-right).

  • max_d : float : maximum distance to set the horizontal threshold line.

  • max_d_lower : float : lower bound of the maximum distance.

  • max_d_upper : float : upper bound of the maximum distance.

Return type:

dict
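
To illustrate how a cut-off height such as max_d turns a hierarchy into cluster labels, here is a minimal sketch in plain Python. It uses single linkage and Euclidean distance for brevity, which is a simplification and not the 'ward' default documented above. The function name cut_single_linkage is hypothetical and not part of clusteval:

```python
from math import dist


def cut_single_linkage(points, max_d):
    """Merge clusters (single linkage) until the closest pair is farther apart than max_d."""
    clusters = [[i] for i in range(len(points))]

    def linkage_distance(a, b):
        # Single linkage: distance between the two closest members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, ia, ib = min(
            (linkage_distance(clusters[ia], clusters[ib]), ia, ib)
            for ia in range(len(clusters))
            for ib in range(ia + 1, len(clusters))
        )
        if d > max_d:
            break  # Every remaining merge lies above the cut-off line.
        clusters[ia].extend(clusters[ib])
        del clusters[ib]

    # Translate cluster membership into one label per input row.
    labx = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labx[i] = label
    return labx


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
labx = cut_single_linkage(points, max_d=2.0)
```

Raising max_d merges more branches and yields fewer clusters; lowering it yields more.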

enrichment(X=None)

Enrichment analysis.

Parameters:

X (DataFrame) – The input dataframe that is used to determine whether there is any enrichment with the detected cluster labels.

Returns:

Dataframe containing significant associations with the cluster labels.

Return type:

pd.DataFrame
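
This reference does not state which statistical test the enrichment analysis uses. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for asking whether a feature value is over-represented in a cluster. The function name hypergeom_pvalue and the toy data are hypothetical:

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n items without replacement from N items, K of which are successes."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


# Hypothetical toy data: one cluster label and one binary feature per sample.
labx = [0, 0, 0, 0, 1, 1, 1, 1]
feature = [1, 1, 1, 1, 0, 0, 0, 1]

N = len(labx)                     # total number of samples
K = sum(feature)                  # samples with feature == 1
n = labx.count(0)                 # samples in cluster 0
k = sum(1 for lab, f in zip(labx, feature) if lab == 0 and f == 1)
p_value = hypergeom_pvalue(N, K, n, k)
```

A small p-value suggests the feature value occurs in the cluster more often than expected by chance.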

fit(X, savemem=False)

Cluster validation.

Parameters:
  • X (Numpy-array) – The rows are the samples and the columns are the features.

  • savemem (bool, (default: False)) – Save memory when working with large datasets. Setting this value to True will not store the dataset in memory, and K-means will be optimized.

Returns:

evaluate: str

Name of the evaluation method that is used for cluster evaluation.

score: pd.DataFrame()

Scoring values per number of clusters. The silhouette and dbindex methods provide this information.

labx: list

Cluster labels.

fig: list

Relevant information to make the plot.

Return type:

dict with various keys. Note that the keys listed here can vary depending on the evaluation method used.
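
To make the 'silhouette' evaluation concrete, the following is a minimal plain-Python sketch of the mean silhouette coefficient, used to compare two candidate labelings and pick the better number of clusters. It assumes Euclidean distance and at least two clusters. The function name, the candidate labelings, and the score dictionary are illustrative and do not mirror clusteval's internal data structures:

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # Convention: singleton clusters score 0.
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
candidates = {
    2: [0, 0, 1, 1],  # labeling with two clusters
    3: [0, 0, 1, 2],  # labeling with three clusters
}
score = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(score, key=score.get)
```

Scores close to 1 indicate compact, well-separated clusters, so the number of clusters with the highest mean score is preferred.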

import_example(data='titanic', url=None, sep=',', params={})

Import example dataset from github source.

Import one of the few datasets from github source or specify your own download url link.

Parameters:
  • data (str) –

    • ‘blobs’ (numeric)

    • ’moons’ (numeric)

    • ’circles’ (numeric)

    • ’anisotropic’ (numeric)

    • ’globular’ (numeric)

    • ’uniform’ (numeric)

    • ’densities’ (numeric)

    • ’sprinkler’ (categorical)

    • ’titanic’ (mixed)

    • ’student’ (categorical)

    • ’fifa’

    • ’cancer’

    • ’waterpump’

    • ’retail’

    • ’breast’

    • ’iris’

  • url (str) – URL link to the dataset.

  • sep (str) – Delimiter of the data set.

Returns:

Dataset containing mixed features.

Return type:

pd.DataFrame()

load(filepath='clusteval.pkl')

Restore previous results.

Parameters:

filepath (str) – Pathname to stored pickle files.

Return type:

Object.

plot(title=None, xlabel='Nr. clusters', ylabel='Score', figsize=(15, 8), savefig={'fname': None, <built-in function format>: 'png', 'dpi ': None, 'orientation': 'portrait', 'facecolor': 'auto'}, font_properties={'size_title': 18, 'size_x_axis': 18, 'size_y_axis': 18}, ax=None, showfig=True, verbose='info')

Make a plot.

Parameters:
  • figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).

  • savefig (dict) – Options for saving the figure; see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. Example: {‘dpi’: ‘figure’, ‘format’: None, ‘metadata’: None, ‘bbox_inches’: None, ‘pad_inches’: 0.1, ‘facecolor’: ‘auto’, ‘edgecolor’: ‘auto’, ‘backend’: None}

  • verbose (str or int, (default: 'info')) – Print progress to screen. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug

Returns:

(fig, ax)

Return type:

tuple
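
The numeric verbose levels listed above line up with the levels of Python's standard logging module. A minimal sketch of such a mapping, assuming a plain dictionary lookup. Only 'info' is confirmed as a string value by this reference; the other string keys, the helper to_log_level, and the logger name are hypothetical:

```python
import logging

# Numeric levels as listed for the verbose parameter. String keys other than
# 'info' are assumed spellings and are not confirmed by this reference.
VERBOSE_LEVELS = {None: 60, "error": 40, "warning": 30, "info": 20, "debug": 10}


def to_log_level(verbose):
    """Translate a verbose value (name, number, or None) into a numeric logging level."""
    if isinstance(verbose, int):
        return verbose
    key = verbose.lower() if isinstance(verbose, str) else verbose
    return VERBOSE_LEVELS[key]


logger = logging.getLogger("clusteval_demo")  # hypothetical logger name
logger.setLevel(to_log_level("info"))
```

Level 60 lies above logging.CRITICAL (50), so it silences every standard message.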

plot_silhouette(X=None, dot_size=25, jitter=None, embedding=None, cmap='tab20c', figsize=(15, 8), savefig={'fname': None, <built-in function format>: 'png', 'dpi ': None, 'orientation': 'portrait', 'facecolor': 'auto'}, showfig=True)

Make a silhouette plot.

Parameters:
  • X (array-like, (default: None)) – Input dataset used in the .fit() function. You can also provide PCA or tSNE coordinates.

  • dot_size (int, (default: 25)) – Size of the dots in the scatterplot.

  • jitter (float, default: None) – Add jitter to data points as random normal data. A value of 0.01 is usually good for separating one-hot data.

  • embedding (str, (default: None)) – For high-dimensional data, an embedding with t-SNE can be performed. Options: None, ‘tsne’.

  • cmap (string (default: 'tab20c')) – Colourmap.

  • figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).

  • savefig (dict) – Options for saving the figure; see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. Example: {‘dpi’: ‘figure’, ‘format’: None, ‘metadata’: None, ‘bbox_inches’: None, ‘pad_inches’: 0.1, ‘facecolor’: ‘auto’, ‘edgecolor’: ‘auto’, ‘backend’: None}

Return type:

None.
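
The jitter parameter is described as adding random normal data to the points. A minimal sketch of that idea, assuming the jitter value is used as the standard deviation of zero-mean Gaussian noise. The helper add_jitter and its seed argument are hypothetical and not part of clusteval:

```python
import random


def add_jitter(rows, jitter=0.01, seed=None):
    """Return a copy of rows with zero-mean Gaussian noise (std = jitter) added to every value."""
    rng = random.Random(seed)
    return [[value + rng.gauss(0.0, jitter) for value in row] for row in rows]


# One-hot rows that would otherwise be plotted exactly on top of each other.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
X_jittered = add_jitter(X, jitter=0.01, seed=0)
```

Identical one-hot rows end up at slightly different coordinates, which makes overlapping points visible in a scatterplot.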

save(filepath='clusteval.pkl', overwrite=False)

Save model in pickle file.

Parameters:
  • filepath (str, (default: 'clusteval.pkl')) – Pathname to store pickle files.

  • overwrite (bool, (default: False)) – Overwrite the file if it exists.

Returns:

Status indicating whether the file was saved.

Return type:

bool
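
The save and load behavior described above can be sketched with the standard pickle module. This is an illustration under the assumption that an existing file is left untouched when overwrite is False. The helpers save_results and load_results are hypothetical stand-ins, not clusteval's own methods:

```python
import os
import pickle
import tempfile


def save_results(results, filepath="clusteval.pkl", overwrite=False):
    """Pickle results to filepath; return False if the file exists and overwrite is False."""
    if os.path.exists(filepath) and not overwrite:
        return False
    with open(filepath, "wb") as handle:
        pickle.dump(results, handle)
    return True


def load_results(filepath="clusteval.pkl"):
    """Restore previously pickled results."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "clusteval.pkl")
    first = save_results({"labx": [0, 0, 1]}, path)
    second = save_results({"labx": [1, 1, 1]}, path)  # refused: file already exists
    restored = load_results(path)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.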

scatter(X=None, s=25, embedding=None, n_feat=2, legend=False, jitter=None, cmap='tab20c', figsize=(25, 15), fontsize=16, fontcolor='k', savefig={'fname': None, <built-in function format>: 'png', 'dpi ': None, 'orientation': 'portrait', 'facecolor': 'auto'}, showfig=True)

Scatterplot.

Create a scatterplot for the first two features or with a t-SNE embedding. The clusters are colored based on the cluster labels and labeled with the features from the enrichment analysis.

Parameters:
  • X (DataFrame or Numpy-array) – The rows are the samples and the columns are the features.

  • s (int, optional) – Dot size. The default is 25. After an enrichment analysis, the size is based on the -log(P) value for the enriched cluster label.

  • embedding (str, (default: None)) – For high-dimensional data, an embedding with t-SNE can be performed. Options: None, ‘tsne’.

  • n_feat (int, (default: 2)) – Number of top significant features for the particular label to be plotted. This requires doing an enrichment analysis first.

  • legend (bool, (default: False)) – Plot the legend.

  • jitter (float, default: None) – Add jitter to data points as random normal data. A value of 0.01 is usually good for separating one-hot data.

  • cmap (str, (default: 'tab20c')) – Colormap. Other options include ‘Set1’ (discrete colors), ‘Set2’, ‘rainbow’, ‘bwr’ (blue-white-red), ‘binary’ or ‘binary_r’, ‘seismic’ (blue-white-red), ‘Blues’ (white-to-blue), ‘Reds’ (white-to-red), ‘Pastel1’ (discrete colors), ‘Paired’ (discrete colors).

  • figsize (tuple, optional) – Figure size. The default is (25, 15).

  • fontsize (int, (default: 16)) – The font size of the labels that are plotted in the graph.

  • fontcolor (str or list/array of RGB colors with same size as X, (default: 'k')) – None: use the same color scheme as the data points. [0,0,0]: if the input is a single color, all fonts will get this color.

  • savefig (dict) – Options for saving the figure; see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html. Example: {‘dpi’: ‘figure’, ‘format’: None, ‘metadata’: None, ‘bbox_inches’: None, ‘pad_inches’: 0.1, ‘facecolor’: ‘auto’, ‘edgecolor’: ‘auto’, ‘backend’: None}

  • showfig (bool, optional) – Show the figure.

Returns:

Figure and axis of the figure.

Return type:

tuple, (fig, ax)

clusteval.clusteval.import_example(data='titanic', url=None, sep=',', params={}, logger=None)

Import example dataset from github source.

Import one of the few datasets from github source or specify your own download url link.

Parameters:
  • data (str) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’, ‘breast’, ‘iris’

  • url (str) – URL link to the dataset.

  • sep (str) – Delimiter of the data set.

Returns:

Dataset containing mixed features.

Return type:

pd.DataFrame()
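
The sep parameter matters when a custom url points at a file that is not comma-separated. A small stdlib sketch of what the delimiter controls, using the csv module rather than the DataFrame that import_example returns. The helper parse_delimited and the sample text are hypothetical:

```python
import csv
import io


def parse_delimited(text, sep=","):
    """Parse delimited text into a list of row dictionaries keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


# Semicolon-separated content, as might be downloaded from a custom url.
raw = "name;age\nAlice;30\nBob;25\n"
rows = parse_delimited(raw, sep=";")
```

With the wrong delimiter every line would be read as a single column, so sep should match the file at the given url.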

class clusteval.clusteval.wget

Retrieve file from url.

download(writepath)

Download.

Parameters:
  • url (str) – Internet source.

  • writepath (str) – Directory to write the file.

Return type:

None.

filename_from_url()

Return filename.
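
How a filename can be derived from a URL is not spelled out here. One possible approach, sketched with the standard library, is to take the last component of the URL path. The function basename_from_url and the example URL are hypothetical and do not reflect the actual implementation of the wget class:

```python
import os
from urllib.parse import urlparse


def basename_from_url(url):
    """Return the last path component of a URL, ignoring query string and fragment."""
    return os.path.basename(urlparse(url).path)


name = basename_from_url("https://example.com/data/titanic_train.zip?raw=true")
```

Parsing the URL first keeps query parameters such as ?raw=true out of the resulting filename.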