API References
clusteval is a Python package for measuring the goodness of unsupervised clustering.
- class clusteval.clusteval.clusteval(cluster='agglomerative', evaluate='silhouette', metric='euclidean', linkage='ward', min_clust=2, max_clust=25, normalize=False, savemem=False, verbose='info', params_dbscan={'eps': None, 'epsres': 50, 'min_samples': 0.01, 'n_jobs': -1, 'norm': False})
Cluster evaluation.
clusteval is a Python package that provides various evaluation approaches to measure the goodness of unsupervised clustering.
- Parameters:
cluster (str, (default: 'agglomerative')) –
- Type of clustering.
'agglomerative'
'kmeans'
'dbscan'
'hdbscan'
'optics' # TODO
evaluate (str, (default: 'silhouette')) –
- Evaluation method for cluster validation.
'silhouette'
'dbindex'
'derivative'
metric (str, (default: 'euclidean')) –
- Distance measure. All metrics from sklearn can be used, such as:
'euclidean'
'hamming'
'braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation', 'cosine', 'dice', 'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'
linkage (str, (default: 'ward')) –
- Linkage type for the clustering.
'ward'
'single'
'complete'
'average'
'weighted'
'centroid'
'median'
min_clust (int, (default: 2)) – Minimum number of clusters to evaluate; only cluster counts greater than or equal to min_clust are considered.
max_clust (int, (default: 25)) – Maximum number of clusters to evaluate; only cluster counts smaller than or equal to max_clust are considered.
normalize (bool, (default: False)) – Normalize the data using Z-scores.
savemem (bool, (default: False)) – Save memory when working with large datasets. Note that this option only applies to KMeans.
jitter (float, default: None) – Add jitter to the data points as random normal noise. A value of 0.01 is usually good for separating one-hot data.
verbose (int or str, (default: 'info')) – Print progress to screen. The default is 'info'. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug
- Returns:
dict
- Return type:
dictionary with keys:
Examples
>>> # Import library
>>> from clusteval import clusteval
>>> # Initialize clusteval with default parameters
>>> ce = clusteval()
>>>
>>> # Generate random data
>>> from sklearn.datasets import make_blobs
>>> X, labels_true = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)
>>>
>>> # Fit best clusters
>>> results = ce.fit(X)
>>>
>>> # Make plot
>>> ce.plot()
>>>
>>> # Silhouette plot
>>> ce.plot_silhouette()
>>>
>>> # Scatter plot
>>> ce.scatter()
>>>
>>> # Dendrogram
>>> ce.dendrogram()
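The normalize parameter above is described only as a Z-score normalization. As a rough illustration of that idea, the sketch below standardizes each column to zero mean and unit standard deviation using only the standard library. The helper name zscore_columns is hypothetical, and the use of the population standard deviation is an assumption; this is not taken from the clusteval source.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Assumption: population standard deviation. Constant columns would
    # give 0.0, so fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Rows are samples, columns are features with very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_norm = zscore_columns(X)
```

After this transformation both columns contain the same values, so neither feature dominates a distance computation purely because of its scale.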
- dendrogram(X=None, labels=None, leaf_rotation=90, leaf_font_size=12, orientation='top', show_contracted=True, max_d=None, showfig=True, metric=None, linkage=None, truncate_mode=None, update_results=False, figsize=(15, 10), savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'})
Plot Dendrogram.
- Parameters:
X (numpy-array (default : None)) – Input data.
labels (list, (default: None)) – Plot the labels. When None: the index of the original observation is used to label the leaf nodes.
leaf_rotation (int, (default: 90)) – Rotation of the labels [0-360].
leaf_font_size (int, (default: 12)) – Font size labels.
orientation (string, (default: 'top')) – Direction of the dendrogram: ‘top’, ‘bottom’, ‘left’ or ‘right’
show_contracted (bool, (default: True)) – The heights of non-singleton nodes contracted into a leaf node are plotted as crosses along the link connecting that leaf node.
max_d (Float, (default: None)) – Height of the dendrogram to make a horizontal cut-off line.
showfig (bool, (default = True)) – Plot the dendrogram.
metric (str, (default: 'euclidean')) – Distance measure for the clustering, such as 'euclidean', 'hamming', etc.
linkage (str, (default: 'ward')) – Linkage type for the clustering: 'ward', 'single', 'complete', 'average', 'weighted', 'centroid', 'median'.
truncate_mode (string, (default: None)) – Truncation is used to condense the dendrogram, which can be based on: 'level', 'lastp' or None
figsize (tuple, (default: (15, 10))) – Size of the figure (width, height).
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html, for example {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
- Returns:
results –
labx : int : Cluster labels in the order of the input.
order_rows : string : Order of the cluster labels as presented in the dendrogram (left-to-right).
max_d : float : Maximum distance used to set the horizontal threshold line.
max_d_lower : float : Lower bound of the maximum distance.
max_d_upper : float : Upper bound of the maximum distance.
- Return type:
dict
- enrichment(X=None)
Enrichment analysis.
- Parameters:
X (DataFrame) – The input dataframe that is used to determine whether there is any enrichment with the detected cluster labels.
- Returns:
DataFrame containing significant associations with the cluster labels.
- Return type:
pd.DataFrame
- fit(X, savemem=False)
Cluster validation.
- Parameters:
X (Numpy-array) – The rows are the samples and the columns are the features.
savemem (bool, (default: False)) – Save memory when working with large datasets. Setting this value to True will not store the dataset in memory, and K-means will be optimized.
- Returns:
- evaluate: str
Name of the evaluation method that is used for cluster evaluation.
- score: pd.DataFrame()
The scores per number of clusters. The silhouette and dbindex methods provide this information.
- labx: list
Cluster labels.
- fig: list
Relevant information to make the plot.
- Return type:
dict with various keys. Note that the available keys can vary with the evaluation method used.
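With evaluate='silhouette', the scores returned by fit are silhouette values. The sketch below computes the mean silhouette coefficient in pure Python from its standard definition, to show what such a score expresses. The function name mean_silhouette and the toy data are hypothetical, and the code is an illustration rather than the clusteval implementation.

```python
from math import dist
from statistics import fmean


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself contributes 0 to the sum).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            fmean([dist(point, other) for other in members])
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return fmean(scores)


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters that mix both groups
```

The labeling that groups neighbouring points scores close to 1, while the mixed labeling scores below 0. Choosing the number of clusters with the highest score follows the same logic.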
- import_example(data='titanic', url=None, sep=',', params={})
Import example dataset from github source.
Import one of the few datasets from github source or specify your own download url link.
- Parameters:
data (str) –
'blobs' (numeric)
'moons' (numeric)
'circles' (numeric)
'anisotropic' (numeric)
'globular' (numeric)
'uniform' (numeric)
'densities' (numeric)
'sprinkler' (categorical)
'titanic' (mixed)
'student' (categorical)
'fifa'
'cancer'
'waterpump'
'retail'
'breast'
'iris'
url (str) – URL link to the dataset.
sep (str) – Delimiter of the data set.
- Returns:
Dataset containing mixed features.
- Return type:
pd.DataFrame()
- load(filepath='clusteval.pkl')
Restore previous results.
- Parameters:
filepath (str) – Path to the stored pickle file.
- Return type:
Object.
- plot(title=None, xlabel='Nr. clusters', ylabel='Score', figsize=(15, 8), savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, font_properties={'size_title': 18, 'size_x_axis': 18, 'size_y_axis': 18}, ax=None, showfig=True, verbose='info')
Make a plot.
- Parameters:
figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html, for example {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
verbose (int or str, (default: 'info')) – Print progress to screen. The default is 'info'. 60: None, 40: Error, 30: Warn, 20: Info, 10: Debug
- Returns:
tuple
- Return type:
(fig, ax)
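The numeric verbose values listed above (40, 30, 20, 10) coincide with the ERROR, WARNING, INFO and DEBUG levels of Python's logging module, and 60 lies above CRITICAL (50), so nothing is reported at that setting. The sketch below shows one way such a setting could be mapped to a logger level. The helper to_log_level, the logger name, and every string name other than 'info' are assumptions made for illustration.

```python
import logging

# Numeric levels as listed in the documentation.
VERBOSE_LEVELS = {
    "silent": 60,                # assumed name for the "None" level
    "error": logging.ERROR,      # 40
    "warning": logging.WARNING,  # 30
    "info": logging.INFO,        # 20
    "debug": logging.DEBUG,      # 10
}


def to_log_level(verbose):
    """Convert a verbose setting (name or number) to a numeric level."""
    if isinstance(verbose, str):
        return VERBOSE_LEVELS[verbose.lower()]
    return int(verbose)


logger = logging.getLogger("verbose_sketch")  # hypothetical logger name

# At 'info', INFO messages pass the level check and DEBUG messages do not.
logger.setLevel(to_log_level("info"))
info_enabled = logger.isEnabledFor(logging.INFO)
debug_enabled = logger.isEnabledFor(logging.DEBUG)

# At 60 even CRITICAL messages are filtered out.
logger.setLevel(to_log_level(60))
critical_enabled = logger.isEnabledFor(logging.CRITICAL)
```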
- plot_silhouette(X=None, dot_size=25, jitter=None, embedding=None, cmap='tab20c', figsize=(15, 8), savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, showfig=True)
Make a silhouette plot.
- Parameters:
X (array-like, (default: None)) – Input dataset used in the .fit() function. You can also provide PCA or t-SNE coordinates.
dot_size (int, (default: 25)) – Size of the dots in the scatterplot.
jitter (float, default: None) – Add jitter to the data points as random normal noise. A value of 0.01 is usually good for separating one-hot data.
embedding (str, (default: None)) – For high-dimensional data, a t-SNE embedding can be computed. Options: None, 'tsne'.
cmap (string, (default: 'tab20c')) – Colormap.
figsize (tuple, (default: (15, 8))) – Size of the figure (width, height).
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html, for example {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
- Return type:
None.
- save(filepath='clusteval.pkl', overwrite=False)
Save model in pickle file.
- Parameters:
filepath (str, (default: 'clusteval.pkl')) – Path of the pickle file to store.
overwrite (bool, (default: False)) – Overwrite the file if it exists.
- Returns:
bool – Status whether the file is saved.
- Return type:
[True, False]
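The save and load methods are documented as storing results in a pickle file, with an overwrite flag and a boolean status. The sketch below mirrors that contract with the standard pickle module. The helper names save_results and load_results are hypothetical, and the behaviour when the file already exists (return False without writing) is an assumption based on the description above, not the clusteval source.

```python
import os
import pickle
import tempfile


def save_results(results, filepath="clusteval.pkl", overwrite=False):
    """Pickle results to filepath; return True if the file was written."""
    if os.path.isfile(filepath) and not overwrite:
        # Assumption: an existing file is left untouched unless overwrite=True.
        return False
    with open(filepath, "wb") as handle:
        pickle.dump(results, handle)
    return True


def load_results(filepath="clusteval.pkl"):
    """Restore previously pickled results."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "clusteval.pkl")
    first = save_results({"labx": [0, 0, 1]}, path)                   # new file
    second = save_results({"labx": [1, 1, 0]}, path)                  # file exists
    third = save_results({"labx": [1, 1, 0]}, path, overwrite=True)   # replaced
    restored = load_results(path)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.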
- scatter(X=None, s=25, embedding=None, n_feat=2, legend=False, jitter=None, cmap='tab20c', figsize=(25, 15), fontsize=16, fontcolor='k', density=False, grid=True, showfig=True, savefig={'dpi ': None, 'facecolor': 'auto', 'fname': None, 'orientation': 'portrait', <built-in function format>: 'png'}, params_scatterd={'alpha': 0.8, 'dpi': 100, 'edgecolor': None, 'gradient': None, 'marker': 'o'}, interactive=False)
Scatterplot.
Create a scatterplot for the first two features or with a t-SNE embedding. The clusters are colored based on the cluster labels and labeled with the features from the enrichment analysis.
- Parameters:
X (DataFrame or Numpy-array) – The rows are the samples and the columns are the features.
s (int, optional) – Dot size. The default is 25. After an enrichment analysis, the size is based on the -log(P) value for the enriched cluster label.
embedding (str, (default: None)) – For high-dimensional data, a t-SNE embedding can be computed. Options: None, 'tsne'.
n_feat (int, (default: 2)) – Number of top significant features for the particular label to be plotted. This requires doing an enrichment analysis first.
legend (bool, (default: False)) – Plot the legend.
jitter (float, default: None) – Add jitter to the data points as random normal noise. A value of 0.01 is usually good for separating one-hot data.
cmap (str, optional) – Colormap. The default is 'tab20c'. Other options include 'Set1' (discrete colors), 'Set2', 'rainbow', 'bwr' (blue-white-red), 'binary' or 'binary_r', 'seismic' (blue-white-red), 'Blues' (white-to-blue), 'Reds' (white-to-red), 'Pastel1' (discrete colors), 'Paired' (discrete colors).
figsize (tuple, optional) – Figure size. The default is (25, 15).
fontsize (int, (default: 16)) – The font size of the y labels that are plotted in the graph.
fontcolor (str or list/array of RGB colors with same size as X, (default: 'k')) – None: use the same color scheme as for the data points. [0,0,0]: if the input is a single color, all fonts will get this color.
savefig (dict) – Arguments for saving the figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html, for example {'dpi': 'figure', 'format': None, 'metadata': None, 'bbox_inches': None, 'pad_inches': 0.1, 'facecolor': 'auto', 'edgecolor': 'auto', 'backend': None}
showfig (bool, optional) – Show the figure.
- Returns:
Figure and axis of the figure.
- Return type:
tuple, (fig, ax)
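The jitter parameter is described as adding random normal data to the points, which helps when many one-hot encoded samples sit on exactly the same coordinates. The sketch below illustrates the idea with the standard library. The helper add_jitter is hypothetical, and interpreting the jitter value as the standard deviation of the noise is an assumption.

```python
import random


def add_jitter(rows, jitter=0.01, seed=None):
    """Return a copy of rows with Gaussian noise added to every value."""
    rng = random.Random(seed)
    # Assumption: the jitter value is the standard deviation of the noise.
    return [[value + rng.gauss(0.0, jitter) for value in row] for row in rows]


# The first two one-hot samples are identical and would overlap in a plot.
one_hot = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
jittered = add_jitter(one_hot, jitter=0.01, seed=42)
```

The noise is small relative to the 0/1 coding, so the overlapping points become distinguishable while each sample stays close to its original position.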
- clusteval.clusteval.import_example(data='titanic', url=None, sep=',', params={}, logger=None)
Import example dataset from github source.
Import one of the few datasets from github source or specify your own download url link.
- Parameters:
data (str) – Name of the dataset: 'sprinkler', 'titanic', 'student', 'fifa', 'cancer', 'waterpump', 'retail', 'breast', 'iris'
url (str) – URL link to the dataset.
sep (str) – Delimiter of the data set.
- Returns:
Dataset containing mixed features.
- Return type:
pd.DataFrame()
- class clusteval.clusteval.wget
Retrieve file from url.
- download(writepath)
Download.
- Parameters:
url (str) – Internet source.
writepath (str) – Directory to write the file.
- Return type:
None.
- filename_from_url()
Return filename.
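The wget class above only states that filename_from_url returns a filename. One plausible way to derive a filename from a URL with the standard library is sketched below. The helper basename_from_url and the example URL are hypothetical, and the approach (last path segment, query string ignored) is an assumption rather than the documented behaviour.

```python
import posixpath
from urllib.parse import unquote, urlparse


def basename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlparse(url).path
    # Decode percent-escapes such as %20 before taking the final segment.
    return posixpath.basename(unquote(path))


# Hypothetical URL, used only to illustrate the parsing.
name = basename_from_url("https://example.com/data/titanic.zip?raw=true")
```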