API References

Python package clustimage is for unsupervised clustering of images.

Name : clustimage.py
Author : E.Taskesen
Contact : erdogant@gmail.com
github : https://github.com/erdogant/clustimage
Licence : See licences

class clustimage.clustimage.Clustimage(method='pca', embedding='tsne', grayscale=False, dim=(128, 128), dim_face=(64, 64), dirpath=None, use_image_cache=True, use_thumbnail_cache=True, tempdir=None, ext=['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic'], params_pca={'n_components': 0.95}, params_hog={'cells_per_block': (1, 1), 'orientations': 8, 'pixels_per_cell': (8, 8)}, params_hash={'hash_size': 8, 'threshold': 0}, params_exif={'exif_location': False, 'max_workers': None, 'min_samples': 2, 'radius_meters': 1000, 'timeframe': 5}, verbose='info')

Clustering of images.

Input images are clustered after pre-processing, feature extraction, feature embedding, and cluster evaluation. These steps are controlled by various input parameters, but not every parameter can be changed in every step of clustimage: some are chosen based on best practice, some are optimized, and others are set as constants.

The following 5 steps are taken:

Step 1. Pre-processing.
Images are imported with specific extensions (['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic']). Each input image can then be grayscaled. Setting the grayscale parameter to True can be especially useful when clustering faces. The final step in pre-processing is resizing all images to the same dimensions, such as (128,128). Note that if an array-like dataset [Samples x Features] is given as input, setting these dimensions is required to restore the images for plotting.
Step 2. Feature-extraction.
Features are extracted from the images using Principal Component Analysis (PCA) or Histogram of Oriented Gradients (HOG), or the raw values are used.
Step 3. Embedding.
The feature space is transformed non-linearly using t-SNE and the coordinates are stored. The embedding is only used for visualization purposes.
Step 4. Cluster evaluation.
The feature space is used as input for the cluster-evaluation method, which determines the optimal number of clusters and returns the cluster labels.
Step 5. Save.
The results are stored in the object and returned by the model. Various (scatter) plots can be made to evaluate the results.

Parameters:

method (str, (default: 'pca')) –
Method to be used to extract features from images.
- None : No feature extraction
- 'pca' : PCA feature extraction
- 'hog' : HOG features are extracted
- 'pca-hog' : PCA features extracted from the HOG descriptor
- 'exif' : Use EXIF information from the file to cluster on datetime (params_exif)
hashmethod : str (default: 'ahash')
- 'ahash' : Average hash
- 'phash' : Perceptual hash
- 'dhash' : Difference hash
- 'whash-haar' : Haar wavelet hash
- 'whash-db4' : Daubechies wavelet hash
- 'colorhash' : HSV color hash
- 'crop-resistant-hash' : Crop-resistant hash
embedding (str, (default: 'tsne')) –
Perform embedding on the extracted features. The xy-coordinates are used for plotting purposes. For UMAP, all default settings are used, with densmap=True.
- 'tsne'
- 'umap'
- None
grayscale (bool, (default: False)) – Convert the image to grayscale. This can be useful when clustering e.g., faces.
dim (tuple, (default: (128,128))) – Rescale images. This is required because the feature space needs to be the same across samples.
dirpath (str, (default: None)) – Directory to write images. The default is the system temp directory.
ext (list, (default: ['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic'])) – Only images with these file extensions are used.
params_pca (dict, default: {'n_components': 0.95}) – Parameters to initialize the pca model.
params_hog (dict, default: {'orientations': 8, 'pixels_per_cell': (8, 8), 'cells_per_block': (1, 1)}) – Parameters to extract hog features.
params_exif (dict, default: {'timeframe': 5, 'radius_meters': 1000, 'min_samples': 2, 'exif_location': False, 'max_workers': None}) – Parameters to process exif information.
- 'timeframe' : Timeframe in hours within which photos are grouped together.
- 'radius_meters' : The radius that is used to cluster the images when using metric='datetime'
- 'min_samples' : Minimum number of samples per cluster
- 'exif_location' : Makes requests to derive the location, such as the street name. Note that the request rate is limited to one photo per second to prevent time-outs. It requires photos with lat/lon coordinates.
use_image_cache (bool, (default: True)) – Used when an image array is provided as input. The images are then stored on disk, which allows using all functionalities for plotting. True: Image arrays are stored on disk. False: Original images are used.
use_thumbnail_cache (bool, (default: True)) – True: To speed up the process of image plotting and comparison, thumbnails are stored in the temp directory and used when available. False: Original images are used.
verbose (str or int, (default: 'info')) – Print progress to screen. The default 'info' corresponds to level 20. 60: None, 40: error, 30: warning, 20: info, 10: debug

Returns:

Object.
model (dict) – dict containing keys with results.

feat : array-like.
Features extracted from the input images.

xycoord : array-like.
x,y coordinates after embedding or alternatively the first 2 features.

pathnames : list of str.
Full path to images that are used in the model.

filenames : list of str.
Filename of the input images.

labels : list.
Cluster labels.

Example

>>> from clustimage import Clustimage
>>>
>>> # Init with default settings
>>> cl = Clustimage(method='pca')
>>>
>>> # Load example with digits (mnist dataset)
>>> X, y = cl.import_example(data='mnist')
>>>
>>> # Cluster digits
>>> results = cl.fit_transform(X)
>>>
>>> # Cluster evaluation
>>> cl.clusteval.plot()
>>> cl.clusteval.scatter(cl.results['xycoord'])
>>> cl.clusteval.plot_silhouette(cl.results['xycoord'])
>>> cl.pca.plot()
>>>
>>> # Unique
>>> cl.plot_unique(img_mean=False)
>>> cl.results_unique.keys()
>>>
>>> # Scatter
>>> cl.scatter(img_mean=False, zoom=3)
>>> cl.scatter(zoom=8, plt_all=True, figsize=(150,100))
>>>
>>> # Plot clustered images
>>> cl.plot(labels=8)
>>>
>>> # Plot dendrogram
>>> cl.dendrogram()
>>>
>>> # Find images
>>> results_find = cl.find(X[0,:], k=None, alpha=0.05)
>>> cl.plot_find()
>>> cl.scatter()
>>>

check_verbosity(): Check the verbosity.

clean_files(clean_tempdir=False)

Clean files.

Return type: None.

clean_init(): Clean or removing previous results and models to ensure correct working.

cluster(cluster='agglomerative', evaluate='silhouette', metric='euclidean', linkage='ward', min_clust=3, max_clust=25, cluster_space='high')

Detect the optimal number of clusters given the input set of features.

This function is built on clusteval, which is a Python package that provides various evaluation methods for unsupervised cluster validation.

Parameters:

cluster_space (str, (default: 'high')) –
Selection of the features that are used for clustering. This can be either the high or the low feature space.
- 'high' : Original feature space.
- 'low' : The input is the xy-coordinates determined by "embedding": either the t-SNE coordinates or the first two PCs or HOG features.
cluster (str, (default: 'agglomerative')) –
Type of clustering.
- ’agglomerative’
- ’kmeans’
- ’dbscan’
- ’hdbscan’
evaluate (str, (default: 'silhouette')) –
Cluster evaluation method.
- ’silhouette’
- ’dbindex’
- ’derivative’
metric (str, (default: 'euclidean').) –
Distance measures. All metrics from sklearn can be used such as:
- ’euclidean’
- ’hamming’
- ’cityblock’
- ’correlation’
- ’cosine’
- ’jaccard’
- ’mahalanobis’
- ’seuclidean’
- ’sqeuclidean’
- ’datetime’: Use photo exif data to cluster photos on datetime (set params_exif)
- ’latlon’: Use photo exif data to cluster photos on lon/lat coordinates (set params_exif)
linkage (str, (default: 'ward')) –
Linkage type for the clustering.
- ’ward’
- ’single’
- ’complete’
- ’average’
- ’weighted’
- ’centroid’
- ’median’
min_clust (int, (default: 3)) – Minimum number of clusters to evaluate; only cluster counts greater than or equal to min_clust are evaluated.
max_clust (int, (default: 25)) – Maximum number of clusters to evaluate; only cluster counts smaller than or equal to max_clust are evaluated.

Returns:

.results['labels'] : Cluster labels.
.clusteval : Model parameters for cluster-evaluation and plotting.

Return type:

array-like

Example

>>> from clustimage import Clustimage
>>>
>>> # Init
>>> cl = Clustimage(method='hog')
>>>
>>> # Load example with flowers
>>> pathnames = cl.import_example(data='flowers')
>>>
>>> # Find clusters
>>> results = cl.fit_transform(pathnames)
>>>
>>> # Evaluate plot
>>> cl.clusteval.plot()
>>> cl.scatter(dotsize=50, img_mean=False)
>>>
>>> # Change the clustering evaluation approach, metric, minimum expected nr. of clusters etc.
>>> labels = cl.cluster(min_clust=5, max_clust=25)
>>>
>>> # Evaluate plot
>>> cl.clusteval.plot()
>>> cl.scatter(dotsize=50, img_mean=False)
>>>
>>> # If you want to cluster on the low-dimensional space.
>>> labels = cl.cluster(min_clust=5, max_clust=25, cluster_space='low', cluster='dbscan')
>>> cl.scatter(dotsize=50, img_mean=False)
>>>
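
How an evaluation method such as the silhouette score can select the number of clusters between min_clust and max_clust can be sketched as follows. This is a simplified NumPy-only illustration: it uses a tiny k-means instead of the clusteval package, and the helper names (kmeans, silhouette, best_k) are hypothetical and not part of clustimage.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # Pick the sample that is farthest from all centers chosen so far.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette score over all samples (Euclidean distance)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = D[i, same].sum() / (same.sum() - 1)   # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def best_k(X, min_clust=2, max_clust=5):
    """Evaluate every cluster count in [min_clust, max_clust] and keep the best."""
    scores = {k: silhouette(X, kmeans(X, k)) for k in range(min_clust, max_clust + 1)}
    return max(scores, key=scores.get), scores

# Three well-separated synthetic blobs stand in for extracted image features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])
k, scores = best_k(X)
print(k, scores)
```

The cluster count with the highest mean silhouette score is kept, which mirrors the role of min_clust and max_clust as the bounds of the search range.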

compute_hash(img, hash_size=None)

Compute hash.

Parameters: img (numpy-array) – Image.
Returns: imghash – Hash.
Return type: numpy-array
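
To illustrate what an image hash is, the sketch below computes an average hash ('ahash') with NumPy only. It is an assumption-based illustration, not the code used by compute_hash(): the functions average_hash and hamming are hypothetical names, and the block-mean downsampling assumes a 2D grayscale image whose sides are multiples of hash_size.

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Average hash of a 2D grayscale image whose sides are multiples of hash_size."""
    h, w = img.shape
    # Downsample to hash_size x hash_size by averaging equally sized blocks.
    small = img.reshape(hash_size, h // hash_size, hash_size, w // hash_size).mean(axis=(1, 3))
    # Each bit states whether a block is brighter than the overall mean.
    return (small > small.mean()).astype(np.uint8)

def hamming(hash1, hash2):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(hash1 != hash2))

img = np.zeros((64, 64))
img[:, 32:] = 255                     # left half dark, right half bright
shifted = np.roll(img, 8, axis=1)     # same picture shifted by one block
h1, h2 = average_hash(img), average_hash(shifted)
print(h1.shape, hamming(h1, h1), hamming(h1, h2))
```

Similar images give hashes with a small Hamming distance, so a distance threshold on the hashes can be used to group near-duplicates.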

dendrogram(max_d=None, figsize=(15, 10), update_labels=True)

Plot Dendrogram.

Parameters:

max_d (float, (default: None)) – Height of the dendrogram to make a horizontal cut-off line.
figsize (tuple, (default: (15, 10))) – Size of the figure (height,width).

Returns:

results – Cluster labels.

Return type:

list

embedding(X, metric='euclidean', embedding=None)

Compute embedding for the extracted features.

Parameters:

X (array-like) – NxM array for which N are the samples and M the features.
metric (str, (default: 'euclidean').) –
Distance measures. All metrics from sklearn can be used such as:
- ’euclidean’
- ’hamming’
- ’cityblock’
- ’correlation’
- ’cosine’
- ’jaccard’
- ’mahalanobis’
- ’seuclidean’
- ’sqeuclidean’
embedding (str, (default: retrieve from init)) –
Perform embedding on the extracted features. The xy-coordinates are used for plotting purposes. For UMAP, the parameters are set to their defaults, with densmap=True.
- 'tsne'
- 'umap'
- None: Return the first two axes of input data X.

Returns:

xycoord – x,y coordinates after embedding or alternatively the first 2 features.

Return type:

array-like.

extract_faces(pathnames)

Detect and extract faces from images.

To cluster faces in images, the faces first need to be detected and extracted, which is done in this function. Faces and eyes are detected using haarcascade_frontalface_default.xml and haarcascade_eye.xml in python-opencv.

Parameters:

pathnames (list of str.) – Full path to images that are used in the model.

Returns:

Object.
model (dict) – dict containing keys with results.

pathnames : list of str.
Full path to images that are used in the model.

filenames : list of str.
Filename of the input images.

pathnames_face : list of str.
Filename of the extracted faces that are stored to disk.

img : array-like.
NxMxC for which N are the samples, M the features and C the number of channels.

coord_faces : array-like.
List of lists containing coordinates of the faces in the original image.

coord_eyes : array-like.
List of lists containing coordinates of the eyes in the extracted (img and pathnames_face) image.

Example

>>> from clustimage import Clustimage
>>>
>>> # Init with default settings
>>> cl = Clustimage(method='pca', grayscale=True)
>>>
>>> # Detect faces
>>> face_results = cl.extract_faces(r'c://temp//my_photos//')
>>> pathnames_face = face_results['pathnames_face']
>>>
>>> # Plot faces
>>> cl.plot_faces(faces=True, eyes=True)
>>>
>>> # load example with faces
>>> pathnames_face, y = cl.import_example(data='faces')
>>>
>>> # Cluster the faces
>>> results = cl.fit_transform(pathnames_face)
>>>
>>>

extract_feat(Xraw)

Extract features based on the input data X.

Parameters: Xraw (dict containing keys:) – img : array-like. pathnames : list of str. filenames : list of str.
Returns: X – Extracted features.
Return type: array-like

extract_hog(X, orientations=8, pixels_per_cell=(16, 16), cells_per_block=(1, 1), flatten=True)

Extract HOG features.

Parameters: X (array-like) – NxM array for which N are the samples and M the features.
Returns: feat – NxF array for which N are the samples and F the reduced feature space.
Return type: array-like

Examples

>>> import matplotlib.pyplot as plt
>>> from clustimage import Clustimage
>>>
>>> # Init
>>> cl = Clustimage(method='hog')
>>>
>>> # Load example data
>>> pathnames = cl.import_example(data='flowers')
>>> # Read image according the preprocessing steps
>>> img = cl.imread(pathnames[0], dim=(128,128))
>>>
>>> # Extract HOG features
>>> img_hog = cl.extract_hog(img)
>>>
>>> fig, axs = plt.subplots(1, 2)
>>> axs[0].imshow(img.reshape(128,128,3))
>>> axs[0].axis('off')
>>> axs[0].set_title('Preprocessed image', fontsize=10)
>>> axs[1].imshow(img_hog.reshape(128,128), cmap='binary')
>>> axs[1].axis('off')
>>> axs[1].set_title('HOG', fontsize=10)
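
The idea behind HOG can be sketched with NumPy only: per cell, a histogram of gradient orientations is built, weighted by gradient magnitude. This is a simplified illustration under stated assumptions, not the routine used by extract_hog(): the function hog_cells is a hypothetical name, block normalisation is omitted, and the output format (one histogram per cell) may differ from what extract_hog() returns.

```python
import numpy as np

def hog_cells(img, orientations=8, pixels_per_cell=(16, 16)):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    gy, gx = np.gradient(img.astype(float))            # gradients along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180       # unsigned orientation in [0, 180)
    bins = np.minimum((angle / (180 / orientations)).astype(int), orientations - 1)
    n_rows = img.shape[0] // pixels_per_cell[0]
    n_cols = img.shape[1] // pixels_per_cell[1]
    hist = np.zeros((n_rows, n_cols, orientations))
    for r in range(n_rows):
        for c in range(n_cols):
            rs = slice(r * pixels_per_cell[0], (r + 1) * pixels_per_cell[0])
            cs = slice(c * pixels_per_cell[1], (c + 1) * pixels_per_cell[1])
            hist[r, c] = np.bincount(bins[rs, cs].ravel(),
                                     weights=magnitude[rs, cs].ravel(),
                                     minlength=orientations)
    return hist

# Synthetic 128x128 image whose brightness increases from left to right.
img = np.tile(np.arange(128, dtype=float), (128, 1))
hist = hog_cells(img)
print(hist.shape)
```

For this synthetic image every gradient points in the same direction, so all weight ends up in the first orientation bin of each cell.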

extract_pca(X)

Extract Principal Components.

Parameters: X (array-like) – NxM array for which N are the samples and M the features.
Returns: feat – NxF array for which N are the samples and F the reduced feature space.
Return type: array-like
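
A fractional n_components, such as the 0.95 in the default params_pca, is commonly read as "keep the fewest components that explain at least this fraction of the variance". The sketch below shows that reading with NumPy's SVD. It is an assumption about the intent rather than the code used by extract_pca(); the function name pca_explained and the synthetic data are hypothetical.

```python
import numpy as np

def pca_explained(X, n_components=0.95):
    """Project X onto the fewest principal components reaching the variance fraction."""
    Xc = X - X.mean(axis=0)                               # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = S**2 / np.sum(S**2)                           # explained variance ratio
    # Index of the first cumulative ratio that reaches the requested fraction.
    n_keep = int(np.searchsorted(np.cumsum(ratio), n_components) + 1)
    return Xc @ Vt[:n_keep].T, ratio[:n_keep]

# 100 samples with 20 features that are driven by only 2 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2)) * [10.0, 5.0]
X = latent @ rng.normal(size=(2, 20)) + rng.normal(scale=0.01, size=(100, 20))
feat, ratio = pca_explained(X, n_components=0.95)
print(feat.shape, ratio.sum())
```

The result is an NxF array in which F is much smaller than the original number of features.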

find(Xnew, metric=None, k=None, alpha=0.05)

Find images that are similar to that of the input image.

Finding images can be performed in two manners:

Based on the k-nearest neighbour

Based on significance after probability density fitting

In both cases, the adjacency matrix is first computed using the distance metric (default Euclidean). For the k-nearest neighbour approach, the k nearest neighbours are determined. For the significance approach, the adjacency matrix is used to estimate the best fit of the loc/scale/arg parameters across various theoretical distributions. The tested distributions are ['norm', 'expon', 'uniform', 'gamma', 't']. The fitted distribution is essentially the similarity distribution of the samples. For each new (unseen) input image, the probability of similarity is computed across all images, and the images with P <= alpha in the lower bound of the distribution are returned. If both k and alpha are specified, the union of the detected samples is taken. Note that the metric can be changed in this function, but this may be confusing because the results will not intuitively match the scatter plots, which are determined using the metric in the fit_transform() function.

Parameters:

Xnew (str, list of str, or array-like) – The new input image(s), given as full path(s) to the image file(s) or as an image array.
metric (str, (default: the input of fit_transform()).) –
Distance measures. All metrics from sklearn can be used such as:
- ’euclidean’
- ’hamming’
- ’cityblock’
- ’correlation’
- ’cosine’
- ’jaccard’
- ’mahalanobis’
- ’seuclidean’
- ’sqeuclidean’
k (int, (default: None)) – Number of nearest neighbours to return.
alpha (float, default: 0.05) – Significance alpha.

Returns:

y_idx : list. Index of the detected/predicted images.
distance : list. Absolute distance to the input image.
y_proba : list. Probability of similarity to the input image.
y_filenames : list. Filename of the detected image.
y_pathnames : list. Pathname to the detected image.
x_pathnames : list. Pathname to the input image.

Return type:

dict containing keys with each input image that contains the following results.

Example

>>> from clustimage import Clustimage
>>>
>>> # Init with default settings
>>> cl = Clustimage(method='pca')
>>>
>>> # Load example with digits (mnist dataset)
>>> X, y = cl.import_example(data='mnist')
>>>
>>> # Cluster digits
>>> results = cl.fit_transform(X)
>>>
>>> # Find images
>>> results_find = cl.find(X[0,:], k=None, alpha=0.05)
>>> cl.plot_find()
>>> cl.scatter(zoom=3)
>>>
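
The two search modes of find() and their union can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it fits only a normal distribution to the pairwise distances (find() tests several distributions), and the function name find_similar and the random data are hypothetical.

```python
import math
import numpy as np

def find_similar(X, x_new, k=None, alpha=0.05):
    """Indices of samples in X similar to x_new: union of k-NN and significance hits."""
    dist = np.linalg.norm(X - x_new, axis=1)
    hits = set()
    if k is not None:
        # k-nearest neighbours: the k smallest distances.
        hits.update(np.argsort(dist)[:k].tolist())
    if alpha is not None:
        # Fit a normal distribution to all pairwise distances of the known samples.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        pairwise = D[np.triu_indices(len(X), k=1)]
        mu, sigma = pairwise.mean(), pairwise.std()
        # Lower-tail probability of each distance under the fitted distribution.
        p = np.array([0.5 * (1 + math.erf((d - mu) / (sigma * math.sqrt(2)))) for d in dist])
        hits.update(np.flatnonzero(p <= alpha).tolist())
    return sorted(hits), dist

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(100, 5))     # stand-in for extracted features
x_new = X[0] + 0.01                       # a new sample very close to sample 0

idx_knn, dist = find_similar(X, x_new, k=3, alpha=None)
idx_sig, _ = find_similar(X, x_new, k=None, alpha=0.05)
idx_both, _ = find_similar(X, x_new, k=3, alpha=0.05)
print(idx_knn, idx_sig, idx_both)
```

With k the number of returned images is fixed, whereas with alpha it depends on how unusually small the distances are compared with the fitted distribution.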

fit_transform(X, cluster='agglomerative', evaluate='silhouette', metric='euclidean', linkage='ward', min_clust=3, max_clust=25, cluster_space='high', black_list=None, recursive=True)

Group samples into clusters that are similar in their feature space.

The fit_transform function detects natural groups or clusters of images. It uses a multi-step process of pre-processing, extracting the features, and evaluating the optimal number of clusters across the feature space. The optimal number of clusters is determined using well-known methods such as silhouette, dbindex, and derivatives in combination with clustering methods such as agglomerative, kmeans, dbscan and hdbscan. Based on the clustering results, the unique images are also gathered.

Parameters:

X (str, list of str, or np.array) –
The input can be:
- "c://temp//" : Path to directory with images
- ['c://temp//image1.png', 'c://image2.png', …] : List of exact pathnames.
- [[.., ..], [.., ..], …] : np.array matrix in the form of [samples x features]
cluster (str, (default: 'agglomerative')) –
Type of clustering.
- ’agglomerative’
- ’kmeans’
- ’dbscan’
- ’hdbscan’
evaluate (str, (default: 'silhouette')) –
Cluster evaluation method.
- ’silhouette’
- ’dbindex’
- ’derivative’
metric (str, (default: 'euclidean').) –
Distance measures. All metrics from sklearn can be used such as:
- ’euclidean’
- ’hamming’
- ’cityblock’
- ’correlation’
- ’cosine’
- ’jaccard’
- ’mahalanobis’
- ’seuclidean’
- ’sqeuclidean’
- ’datetime’: Use photo exif data to cluster photos on datetime (set params_exif)
- ’latlon’: Use photo exif data to cluster photos on lon/lat coordinates (set params_exif)
linkage (str, (default: 'ward')) –
Linkage type for the clustering.
- ’ward’
- ’single’
- ’complete’
- ’average’
- ’weighted’
- ’centroid’
- ’median’
min_clust (int, (default: 3)) – Minimum number of clusters to evaluate; only cluster counts greater than or equal to min_clust are evaluated.
max_clust (int, (default: 25)) – Maximum number of clusters to evaluate; only cluster counts smaller than or equal to max_clust are evaluated.
cluster_space (str, (default: 'high')) –
Selection of the features that are used for clustering. This can be either the high or the low feature space.
- 'high' : Original feature space.
- 'low' : The input is the xy-coordinates determined by "embedding": either the t-SNE coordinates or the first two PCs or HOG features.
black_list (list, (default: None)) – Exclude directory with all subdirectories from processing. Example: ['undouble']
recursive (bool, optional) – Whether to scan subdirectories recursively. Default is True.

Returns:

Object.
model (dict) – dict containing keys with results.

feat : array-like.
Features extracted from the input images.

xycoord : array-like.
x,y coordinates after embedding or alternatively the first 2 features.

pathnames : list of str.
Full path to images that are used in the model.

filenames : list of str.
Filename of the input images.

labels : list.
Cluster labels.

Example

>>> from clustimage import Clustimage
>>>
>>> # Init with default settings
>>> cl = Clustimage(method='pca', grayscale=True)
>>>
>>> # load example with faces
>>> pathnames, y = cl.import_example(data='faces')
>>> # Detect faces
>>> face_results = cl.extract_faces(pathnames)
>>>
>>> # Cluster extracted faces
>>> results = cl.fit_transform(face_results['pathnames_face'])
>>>
>>> # Cluster evaluation
>>> cl.clusteval.plot()
>>> cl.clusteval.scatter(cl.results['xycoord'])
>>>
>>> # Unique
>>> cl.plot_unique(img_mean=False)
>>> cl.results_unique.keys()
>>>
>>> # Scatter
>>> cl.scatter(dotsize=50, img_mean=False)
>>>
>>> # Plot clustered images
>>> cl.plot(labels=8)
>>> # Plot faces
>>> cl.plot_faces()
>>>
>>> # Plot dendrogram
>>> cl.dendrogram()
>>>
>>> # Find images
>>> results_find = cl.find(face_results['pathnames_face'][2], k=None, alpha=0.05)
>>> cl.plot_find()
>>> cl.scatter()
>>> cl.pca.plot()
>>>

get_dim(Xraw, dim=None)

Determine dimension for image vector.

Parameters:

Xraw (array-like float) – Image vector.
dim (tuple (int, int)) – Dimension of the image.

Return type:

None.

import_data(Xraw, flatten=True, black_list=None, recursive=True, use_thumbnail_cache=False)

Import images and return in a consistent manner.

The input for import_data() can have multiple forms: a path to a directory, a list of strings, or an array-like input. Each form needs to be processed in its own manner, but each should return the same structure to make it compatible across all functions. The following steps are used for the import:

Images are imported with specific extensions (['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic']).

Each input image can then be grayscaled. Setting the grayscale parameter to True can be especially useful when clustering faces.

The final step in pre-processing is resizing all images to the same dimensions, such as (128,128). Note that if an array-like dataset [Samples x Features] is given as input, setting these dimensions is required to restore the images for plotting.

Images are saved to disk in case an array-like input is given.

Independent of the input, a dict is returned in a consistent manner.

How the input is processed depends on its type:

Parameters:

Xraw (str, list or array-like.) –
The input can be:
- ”c://temp//” : Path to directory with images
- [‘c://temp//image1.png’, ‘c://image2.png’, …] : List of exact pathnames.
- [[.., ..], [.., ..], …] : Array-like matrix in the form of [samples x features]
flatten (bool, (default: True)) – Flatten the processed NxMxC array to a 1D-vector
black_list (list, (default: None)) – Exclude directory with all subdirectories from processing.
recursive (bool, optional) – Whether to scan subdirectories recursively. Default is True.
use_thumbnail_cache (bool, (default: False)) – True: To speed up the process of image plotting and comparison, thumbnails are stored in the temp directory and used when available. False: Original images are used.

Returns:

Object.
model (dict) – dict containing keys with results.

img : array-like.
Pre-processed images.

pathnames : list of str.
Full path to images that are used in the model.

filenames : list of str.
Filename of the input images.
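
How a directory input can be turned into a list of image paths, honouring ext, black_list and recursive, can be sketched with the standard library only. The helper collect_images is a hypothetical name and the matching rules (case-insensitive extension, black list matched on directory name) are assumptions; import_data() may behave differently in detail.

```python
import os
import tempfile

def collect_images(dirpath, ext=('png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic'),
                   black_list=None, recursive=True):
    """List image files in dirpath, filtered on extension and excluded directories."""
    black_list = set(black_list or [])
    found = []
    for root, dirs, files in os.walk(dirpath):
        # Pruning dirs in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d not in black_list] if recursive else []
        for name in files:
            if os.path.splitext(name)[1].lower().lstrip('.') in ext:
                found.append(os.path.join(root, name))
    return sorted(found)

# Build a small throwaway directory tree to demonstrate the filtering.
demo_dir = tempfile.mkdtemp()
for rel in ('a.png', 'notes.txt', 'sub/b.JPG', 'undouble/c.png'):
    path = os.path.join(demo_dir, *rel.split('/'))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, 'w').close()

all_images = collect_images(demo_dir)
filtered = collect_images(demo_dir, black_list=['undouble'])
top_only = collect_images(demo_dir, recursive=False)
print(len(all_images), len(filtered), len(top_only))
```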

import_example(data='flowers', url=None, sep=',')

Import example dataset from github source.

Import one of the datasets from github source or specify your own download url link.

Parameters:

data (str) –
Images:
- ’faces’
- ’mnist’
Files with images:
- ’southern_nebula’
- ’flowers’
- ’scenes’
- ’cat_and_dog’
url (str) – URL link to the dataset.

Returns:

list of str containing filepath to images.

Return type:

list of str

imread(filepath, colorscale=1, dim=(128, 128), flatten=True, return_succes=False, use_thumbnail_cache=False)

Read and pre-process images.

The pre-processing has 4 steps that are executed in this order:

1. Import data.
2. Conversion to gray-scale (user defined)
3. Scaling color pixels between [0-255]
4. Resizing

Parameters:

filepath (str) – Full path to the image that needs to be imported.
colorscale (int, default: 1 (color)) – Colour-scaling from opencv.
- 0: cv2.IMREAD_GRAYSCALE
- 1: cv2.IMREAD_COLOR
- 2: cv2.IMREAD_ANYDEPTH
- 8: cv2.COLOR_GRAY2RGB
- -1: cv2.IMREAD_UNCHANGED
dim (tuple, (default: (128,128))) – Rescale images. This is required because the feature space needs to be the same across samples.
flatten (bool, (default: True)) – Flatten the processed NxMxC array to a 1D-vector
return_succes (bool, (default: False)) – Also return the success state.
use_thumbnail_cache (bool, (default: False)) – True: To speed up the process of image plotting and comparison, thumbnails are stored in the temp directory and used when available. False: Original images are used.

Returns:

img – Imported and processed image.

Return type:

array-like

Examples

>>> # Import libraries
>>> import cv2
>>> from clustimage import Clustimage
>>> import matplotlib.pyplot as plt
>>>
>>> # Init
>>> cl = Clustimage()
>>>
>>> # Load example dataset
>>> pathnames = cl.import_example(data='flowers')
>>> # Preprocessing of the first image
>>> img = cl.imread(pathnames[0], dim=(128,128), colorscale=1)
>>>
>>> # Plot
>>> fig, axs = plt.subplots(1,2, figsize=(15,10))
>>> axs[0].imshow(cv2.imread(pathnames[0])); axs[0].axis('off')
>>> axs[1].imshow(img.reshape(128,128,3)); axs[1].axis('off')
>>> fig
>>>
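
The pre-processing order described for imread() (gray-scale conversion, scaling to [0-255], resizing, flattening) can be sketched on an in-memory array. This is a NumPy-only illustration under stated assumptions: the function preprocess is a hypothetical name, the resize uses simple nearest-neighbour sampling, and dim is interpreted as (rows, columns), which may differ from the OpenCV-based implementation.

```python
import numpy as np

def preprocess(img, grayscale=False, dim=(128, 128), flatten=True):
    """Optional grayscale, scale to [0, 255], nearest-neighbour resize, flatten."""
    img = np.asarray(img, dtype=float)
    if grayscale and img.ndim == 3:
        # Luminance weights for an RGB image (ITU-R BT.601).
        img = img @ np.array([0.299, 0.587, 0.114])
    # Min-max scaling of the pixel values to the range [0, 255].
    span = img.max() - img.min()
    img = (img - img.min()) / span * 255 if span > 0 else np.zeros_like(img)
    # Nearest-neighbour resize by sampling a regular grid of rows and columns.
    rows = np.arange(dim[0]) * img.shape[0] // dim[0]
    cols = np.arange(dim[1]) * img.shape[1] // dim[1]
    img = img[rows][:, cols].astype(np.uint8)
    return img.ravel() if flatten else img

# Random 200x300 color image with an arbitrary value range as stand-in input.
rng = np.random.default_rng(3)
raw = rng.integers(0, 1000, size=(200, 300, 3))
vec_color = preprocess(raw, dim=(128, 128))
vec_gray = preprocess(raw, grayscale=True, dim=(128, 128))
print(vec_color.shape, vec_gray.shape)
```

Because every image ends up with the same length after flattening, the vectors can be stacked into a [samples x features] matrix.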

load(filepath='clustimage.pkl', verbose=3)

Restore previous results.

Parameters:

filepath (str) – Pathname to stored pickle files.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type:

Object.

move_to_dir(target_labels=None, savedir=None, action='move', overwrite=False, user_input=True)

Move or copy image files into directories based on cluster labels.

Parameters:

target_labels (dict, optional) – A dictionary where keys are cluster labels and values are the target folder names. None: folders are automatically generated with names such as "group_<label>".
savedir (str, optional) – The base directory where the images will be moved.
- 'c:/temp/' : Images are moved to this directory.
- None: Images are moved to the parent directory of their current location.
action (str, (default: 'move')) –
- 'copy': copy files
- 'move': move files
overwrite (bool, (default: False)) –
- True: Overwrite files
- False: Do not overwrite files
user_input (bool, default: True) – True: The user decides for each directory whether to proceed. False: All files are moved without questions.

Notes

If target_labels is not provided, the function will automatically generate folder names based on the cluster labels, e.g., group_0, group_1, etc.
The method will move image files associated with each cluster into the corresponding folder.
Before moving files, the user is prompted to confirm the action for each cluster.
The function assumes that self.results contains the necessary data, including ‘labels’ (cluster labels) and ‘pathnames’ (file paths of the images).

Examples

>>> # Assuming `cl.results` contains 'labels' and 'pathnames'
>>> cl.move_to_dir(target_labels={0: 'screenshots', 1: 'various', 2: 'holiday break'})
>>> # Images from cluster 0 will be moved to "screenshots", from cluster 1 to "various", and from cluster 2 to "holiday break"
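
The mapping from cluster labels to folders can be sketched with the standard library. This is an illustration of the idea only: sort_into_folders is a hypothetical helper, it does not ask for user confirmation, and the real move_to_dir() reads labels and pathnames from the fitted model instead of taking them as arguments.

```python
import os
import shutil
import tempfile

def sort_into_folders(pathnames, labels, savedir, target_labels=None,
                      action='copy', overwrite=False):
    """Copy or move each file into a sub-folder of savedir named after its cluster."""
    transfer = shutil.copy2 if action == 'copy' else shutil.move
    destinations = []
    for path, label in zip(pathnames, labels):
        # Fall back to an automatically generated folder name such as group_1.
        folder = (target_labels or {}).get(label, f'group_{label}')
        os.makedirs(os.path.join(savedir, folder), exist_ok=True)
        dest = os.path.join(savedir, folder, os.path.basename(path))
        if overwrite or not os.path.exists(dest):
            transfer(path, dest)
        destinations.append(dest)
    return destinations

# Create three empty placeholder files that stand in for clustered images.
src, out = tempfile.mkdtemp(), tempfile.mkdtemp()
paths = []
for name in ('img0.png', 'img1.png', 'img2.png'):
    paths.append(os.path.join(src, name))
    open(paths[-1], 'w').close()

moved = sort_into_folders(paths, labels=[0, 0, 1], savedir=out,
                          target_labels={0: 'screenshots'})
print(moved)
```

Copying rather than moving is used in the sketch so that the source files stay in place while experimenting.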

plot(labels=None, show_hog=False, ncols=None, cmap=None, min_samples=2, figsize=(15, 10), blacklist=None, invert_colors=False)

Plot the results.

Parameters:

labels (list, (default: None)) – Cluster label to plot. In case of None, all cluster labels are plotted.
ncols (int, (default: None)) – Number of columns to use in the subplot. The number of rows is estimated based on the columns.
cmap (str, (default: None)) – Colorscheme for the images. 'gray', 'binary', None (uses rgb colorscheme)
show_hog (bool, (default: False)) – Plot the hog features next to the input image.
min_samples (int, (default: 2)) – Plots are created for clusters with >= min_samples
figsize (tuple, (default: (15, 10))) – Size of the figure (height,width).
blacklist (list) –
- None: Show all cluster labels
- [-2]: Do not show the samples without lat/lon coordinates (when using exif method)
- [-1]: Do not show the samples that fall outside the clusters (noise or rest-group in DBSCAN, when using exif method)
- [-2, -1]: Exclude multiple cluster labels.
invert_colors (bool, (default: False)) – Invert colors for the plot. True: RGB -> BGR. False: Keep as is.

Return type:

None.

plot_faces(faces=True, eyes=True, cmap=None)

Plot detected faces.

Plot the detected faces in images after using the fit_transform() function.
- For each input image, rectangles are drawn over the detected faces.
- Each face is plotted separately, with rectangles drawn over the detected eyes.

Parameters:

faces (bool, (default: True)) – Plot the separate faces.
eyes (Bool, (default: True)) – Plot rectangles over the detected eyes.
cmap (str, (default: None)) –
Colorscheme for the images.
- ’gray’
- ’binary’
- None : uses rgb colorscheme

plot_find(cmap=None, figsize=(15, 10), invert_colors=False)

Plot the input image together with the predicted images.

Parameters:

cmap (str, (default: None)) – Colorscheme for the images. 'gray', 'binary', None (uses rgb colorscheme)
figsize (tuple, (default: (15, 10))) – Size of the figure (height,width).
invert_colors (bool, (default: False)) – Invert colors for the plot. True: RGB -> BGR. False: Keep as is.

Return type:

None.

plot_map(cluster_icons=None, polygon=None, dim='default', blacklist_polygon=[-1], clutter_threshold=0.0001, save_path=None, open_in_browser=True, tempdir=None)

Plot a map with clustered images using their EXIF metadata.

This function generates an interactive map using folium, where images are plotted based on their geographic coordinates extracted from EXIF data. Images are rescaled to thumbnails for display. The function supports saving the map as an HTML file and optionally opening it in a web browser.

Parameters:

cluster_icons (bool, optional) – Cluster icons on the map.
- None: Automatically set the boolean based on the metric
- True: Cluster icons when zooming. Note that the location is no longer exact.
- False: Do not cluster icons and show the exact location on the map.
polygon (bool, optional) – Create a line through the list of geographic points defining a polygon to overlay on the map.
- None: Automatically set the boolean based on the metric
- True: Create polygon line
- False: Do not create polygon line
dim ((int, int), optional) –
- 'default': The size of the thumbnails (in pixels) to display on the map.
- None: No thumbnails are created
- (200, 200): Thumbnail size
blacklist_polygon (list, optional) – Shows the polygon line for all clusters except the blacklisted ones. [-1]: Default, as these are the rest or noise images from DBSCAN.
clutter_threshold (float, (default: 1e-4)) – The maximum distance below which points are considered overlapping. This prevents icons from being placed exactly on top of each other.
save_path (str, optional) – The file path (including filename) where the map will be saved as an HTML file. If None, the map is saved in a temporary directory. Default is None.
open_in_browser (bool, optional) – True: Automatically open the generated map in the default web browser. False: Do not open in the browser automatically.
tempdir (str, optional) – The temp directory where thumbnails are stored. This speeds up loading when the same image needs to be loaded multiple times.
- None : Use the default temporary directory that is set during initialization.
- r'c:/temp/clustimage/'

Returns:

A tuple containing:
- m (folium.Map): The generated folium map object.
- save_path (str): The file path where the map was saved.

Return type:

tuple

Notes

This function requires the exif method to be used in the params. If another method is used, the function will log an error and return None.
The save_path must include both the directory and filename. If only the directory is provided or the directory does not exist, an error is logged, and the function returns None.

Examples

>>> cl = Clustimage(method='exif',
...                 params_exif = {'timeframe': 5, 'radius_meters': 1000, 'min_samples': 2, 'exif_location': False},
...                 ext=["jpg", "jpeg", "png", "tiff", "bmp", "gif", "webp", "psd", "raw", "cr2", "nef", "heic", "sr2", "tif"],
...                 verbose='info')
>>> #
>>> # Fit and transform
>>> results = cl.fit_transform(r'c:/temp/', metric='datetime', recursive=True)
>>> #
>>> # Plot
>>> cl.plot_map(
...     cluster_icons=False,
...     polygon=True,
...     dim=(300, 300),
...     save_path="C:/temp/map.html",
...     open_in_browser=True
... )
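
The params_exif settings 'timeframe' (hours) and 'radius_meters' can be made concrete with a small standard-library sketch. It is a simplification under stated assumptions, not the clustering used by clustimage: photos are grouped by a simple gap rule on their timestamps, 'min_samples' is ignored, and the helper names haversine_m and group_by_timeframe are hypothetical.

```python
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2, radius=6371000.0):
    """Great-circle distance in meters between two lat/lon points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def group_by_timeframe(timestamps, timeframe=5):
    """Label photos so that a gap of more than `timeframe` hours starts a new group."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    labels, current = [0] * len(timestamps), 0
    for prev, nxt in zip(order, order[1:]):
        if timestamps[nxt] - timestamps[prev] > timedelta(hours=timeframe):
            current += 1
        labels[nxt] = current
    return labels

# Hypothetical capture times of four photos (not in chronological order).
stamps = [datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 12, 0),
          datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 1, 16, 30)]
labels = group_by_timeframe(stamps, timeframe=5)

# Two hypothetical photo locations 0.01 degrees of longitude apart.
d_near = haversine_m(52.0, 4.0, 52.0, 4.01)
d_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
print(labels, round(d_near), round(d_degree))
```

Two photos can be treated as taken at the same place when their haversine distance is below radius_meters, and as belonging to the same event when they fall in the same time group.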

plot_unique(cmap=None, img_mean=True, show_hog=False, figsize=(15, 10), invert_colors=False)

Plot unique images.

Parameters:

cmap (str, (default: None)) – Colorscheme for the images. 'gray', 'binary', None (uses rgb colorscheme)
img_mean (bool, (default: True)) – Plot the image mean.
show_hog (bool, (default: False)) – Plot the hog features.
figsize (tuple, (default: (15, 10))) – Size of the figure (height, width).
invert_colors (bool, (default: False)) – Invert colors for the plot. True: RGB -> BGR. False: Keep as is.

Return type:

None.

preprocessing(pathnames, grayscale, dim, flatten=True, use_thumbnail_cache=False)

Pre-processing the input images and returning consistent output.

Parameters:

pathnames (list of str.) – Full path to images that are used in the model.
grayscale (Bool, (default: False)) – Colorscaling the image to gray. This can be useful when clustering e.g., faces.
dim (tuple, (default: (128,128))) – Rescale images. This is required because the feature-space needs to be the same across samples.
flatten (Bool, (default: True)) – Flatten the processed NxMxC array to a 1D-vector.
use_thumbnail_cache (bool, (default: False)) – True: To speed up the process of image plotting and comparison, thumbnails are stored in the temp directory and used when available. False: Original images are used.

Returns:

Xraw – dict containing keys: img (array-like), pathnames (list of str), filenames (list of str).

Return type:

dict

save(filepath='clustimage.pkl', overwrite=False)

Save model in pickle file.

Parameters:

filepath (str, (default: 'clustimage.pkl')) – Pathname to store pickle files.
overwrite (bool, (default: False)) – Overwrite file if exists.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns:

Status whether the file is saved (True or False).

Return type:

bool

scatter(dotsize=15, legend=False, zoom=0.3, img_mean=True, text=True, plt_all=False, density=False, figsize=(15, 10), ax=None, args_scatter={})

Plot the samples using a scatterplot.

Parameters:

plt_all (bool, (default: False)) – False: Only plot the centroid images. True: Plot all images on top of the scatter.
dotsize (int, (default: 15)) – Dot size of the scatterpoints.
legend (bool, (default: False)) – Plot the legend.
zoom (float, (default: 0.3)) – Plot the images in the scatterplot at this zoom factor. None : Do not plot the images.
text (bool, (default: True)) – Plot the cluster labels.
density (bool, (default: False)) – Plot the density over the clusters.
figsize (tuple, (default: (15, 10))) – Size of the figure (height, width).
args_scatter (dict, (default: {})) – Arguments for the scatter plot. The following are default: {‘title’: ‘’, ‘fontsize’: 18, ‘fontcolor’: [0, 0, 0], ‘xlabel’: ‘x-axis’, ‘ylabel’: ‘y-axis’, ‘cmap’: ‘Set2’, ‘density’: False, ‘gradient’: None}

Return type:

tuple (fig, ax)

Examples

>>> # Import library
>>> from clustimage import Clustimage
>>>
>>> # Initialize with default settings.
>>> cl = Clustimage()
>>>
>>> # Import example dataset
>>> X, y = cl.import_example(data='mnist')
>>>
>>> # Run the model to find the optimal clusters.
>>> results = cl.fit_transform(X)
>>>
>>> # Make scatter plots
>>> cl.scatter()
>>>
>>> # More input arguments for the scatterplot
>>> cl.scatter(dotsize=35, args_scatter={'fontsize':24, 'density':'#FFFFFF', 'cmap':'Set2'})

unique(metric=None)

Compute the unique images.

The unique images are detected by first computing the center of the cluster, and then taking the image closest to the center.

Parameters:

metric (str, (default: 'euclidean').) –

Distance measures. All metrics from sklearn can be used such as:

’euclidean’
’hamming’
’cityblock’
’correlation’
’cosine’
’jaccard’
’mahalanobis’
’seuclidean’
’sqeuclidean’
etc

Returns:

labels (list) – Cluster label of the detected image.
idx (list) – Index of the original image.
xycoord_center (array-like) – Coordinates of the sample that is most centered.
pathnames (list) – Path location to the file.
img_mean (array-like) – Averaged image in the cluster.

Return type:

dict containing keys with results.

Example

>>> from clustimage import Clustimage
>>>
>>> # Init with default settings
>>> cl = Clustimage()
>>>
>>> # Load example with digits (MNIST)
>>> X, y = cl.import_example(data='mnist')
>>>
>>> # Cluster digits
>>> _ = cl.fit_transform(X)
>>>
>>> # Unique
>>> cl.plot_unique(img_mean=False)
>>> cl.results_unique.keys()
>>>
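The description above (compute the cluster center, then take the sample closest to it) can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the function name closest_to_center, the use of the mean as the center, and the Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def closest_to_center(xy, labels):
    """Map each cluster label to the index of the sample nearest its center.

    Hypothetical helper: the center is taken as the mean of the cluster's
    coordinates and the distance is Euclidean.
    """
    xy = np.asarray(xy, dtype=float)
    labels = np.asarray(labels)
    picked = {}
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)            # members of this cluster
        center = xy[idx].mean(axis=0)                    # cluster center
        dist = np.linalg.norm(xy[idx] - center, axis=1)  # distance of each member
        picked[int(label)] = int(idx[dist.argmin()])     # index in the original data
    return picked

xy = [[0, 0], [1, 0], [5, 0], [10, 10], [11, 10], [12, 10]]
labels = [0, 0, 0, 1, 1, 1]
picked = closest_to_center(xy, labels)
print(picked)
```

The returned indices point back into the original sample order, which is what makes it possible to look up the matching pathname or image afterwards.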

clustimage.clustimage.basename(label): Extract basename from path.

clustimage.clustimage.cluster_datetimes(datetimes, eps_hours=1, min_samples=2, metric='euclidean', dt_format='%Y:%m:%d %H:%M:%S')

Cluster datetime values using a time-based window with DBSCAN. The time window is given in hours.

Parameters:

datetimes – Datetime values to cluster, formatted according to dt_format.
eps_hours (float) – The maximum time gap (in hours) to consider points in the same cluster.
min_samples (int) – The minimum number of samples in a neighborhood to form a cluster.
dt_format (str) – Format of the datetime values, e.g. ‘%Y:%m:%d %H:%M:%S’.

Returns:

DataFrame with an added column for cluster labels.

Return type:

pd.DataFrame

Examples

>>> import pandas as pd
>>> from clustimage.clustimage import cluster_datetimes
>>>
>>> data = {
...     "datetime": [
...         "2024:02:16 19:35:38", "2023:12:17 13:54:10", "2023:12:17 11:27:52",
...         "2023:12:17 11:40:22", "2023:12:16 20:11:36", "2024:02:16 19:37:00",
...         "2024:02:16 19:37:34", "2024:02:16 19:36:52", "2024:02:16 19:37:34",
...         "2024:02:16 19:37:16",
...     ]
... }
>>> df = pd.DataFrame(data)
>>>
>>> # Cluster with a 1-hour window and a minimum of 2 samples per cluster.
>>> # The datetime values are passed directly, following the signature above.
>>> clustered_df = cluster_datetimes(df["datetime"], eps_hours=1, min_samples=2)
>>> print(clustered_df)
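To make the time-window idea concrete, here is a minimal standard-library sketch. It is not the library's code and does not call DBSCAN. It relies on the observation that for one-dimensional data with min_samples=2, DBSCAN amounts to chaining sorted points whose consecutive gaps are at most eps, with isolated points marked as noise (-1). The function name cluster_by_time_gap is hypothetical, and labels are numbered chronologically, which may differ from the numbering DBSCAN would assign.

```python
from datetime import datetime

def cluster_by_time_gap(stamps, eps_hours=1.0, dt_format="%Y:%m:%d %H:%M:%S"):
    """Label timestamps so that chains with gaps <= eps_hours share a label.

    Hypothetical helper. Isolated timestamps get -1, mirroring DBSCAN noise
    for the special case min_samples=2.
    """
    times = [datetime.strptime(s, dt_format) for s in stamps]
    order = sorted(range(len(times)), key=lambda i: times[i])

    # Walk through the timestamps chronologically and start a new chain
    # whenever the gap to the previous timestamp exceeds the window.
    chains = []
    for i in order:
        if chains:
            gap = (times[i] - times[chains[-1][-1]]).total_seconds()
            if gap <= eps_hours * 3600:
                chains[-1].append(i)
                continue
        chains.append([i])

    labels = [-1] * len(times)
    next_label = 0
    for chain in chains:
        if len(chain) >= 2:  # a chain needs at least two members to be a cluster
            for i in chain:
                labels[i] = next_label
            next_label += 1
    return labels

stamps = [
    "2024:02:16 19:35:38", "2023:12:17 13:54:10", "2023:12:17 11:27:52",
    "2023:12:17 11:40:22", "2023:12:16 20:11:36", "2024:02:16 19:37:00",
]
labels = cluster_by_time_gap(stamps, eps_hours=1)
print(labels)
```

For larger min_samples values or multi-dimensional features this shortcut no longer holds, and a full DBSCAN implementation is needed.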

clustimage.clustimage.cluster_latlon(latlon, radius_meters=1000, min_samples=2)

Cluster geolocation data points based on proximity using Haversine distance.

Parameters:

latlon (pandas.DataFrame) – A DataFrame containing ‘lat’ (latitude) and ‘lon’ (longitude) columns. Rows with missing values in either ‘lat’ or ‘lon’ are ignored.
radius_meters (float, optional) – The radius (in meters) within which points are grouped into a single cluster. Default is 1000 meters.

Returns:

An array of cluster labels for each row in the input latlon DataFrame. Rows without valid latitude or longitude will have a label of 0.

Return type:

numpy.ndarray

Notes

The function uses the DBSCAN algorithm with the Haversine metric for clustering.
Input coordinates are converted to radians as required by the Haversine distance computation.
The radius is converted from meters to kilometers, as the Haversine metric operates in kilometers.
DBSCAN assigns cluster labels starting from -1 for noise points. This implementation labels rows without valid coordinates as -2.

Examples

>>> import pandas as pd
>>> latlon = pd.DataFrame({
...     'lat': [52.5200, 52.5201, 52.5300, 48.8566, 48.8567],
...     'lon': [13.4050, 13.4051, 13.4060, 2.3522, 2.3523]
... })
>>> cluster_labels = cluster_latlon(latlon, radius_meters=500)
>>> cluster_labels
array([1, 1, 2, 3, 3])
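The notes above mention the Haversine distance and a radius conversion. The sketch below shows both with the standard library only. It is an illustration under stated assumptions, not the library's code: the helper names and the mean Earth radius constant are chosen for the example. Note that scikit-learn's haversine metric works on coordinates in radians and returns an angular distance, so a radius in meters is typically divided by the Earth's radius before being used as eps. Whether clustimage performs the conversion in exactly this way is an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def radius_to_eps(radius_meters):
    """Convert a radius in meters to an angular distance in radians."""
    return radius_meters / EARTH_RADIUS_M

# Two points in Berlin a few meters apart, and Berlin versus Paris.
near = haversine_m(52.5200, 13.4050, 52.5201, 13.4051)
far = haversine_m(52.5200, 13.4050, 48.8566, 2.3522)
eps = radius_to_eps(500)

print(f"near: {near:.1f} m, far: {far / 1000:.0f} km, eps: {eps:.2e} rad")
```

With a 500 meter radius the two Berlin points fall within the same neighborhood, while Berlin and Paris are far outside it.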

clustimage.clustimage.create_dir(pathname, savedir=None)

Create directory.

Parameters:

pathname (str) – Absolute path location of the image of interest.
savedir (str) – Target directory.

Returns:

movedir (str) – Absolute path to directory.
dirname (str) – Absolute path to directory.
filename (str) – Name of the file.
ext (str) – Extension.

clustimage.clustimage.disable_tqdm(): Set the logger for verbosity messages.

clustimage.clustimage.get_logger(): Return logger status.

clustimage.clustimage.get_params_hash(hashmethod, params_hash={})

Get image hash function.

Parameters:

hashmethod (str (default: 'ahash')) –
- ’ahash’: Average hash
- ’phash’: Perceptual hash
- ’dhash’: Difference hash
- ’whash-haar’: Haar wavelet hash
- ’whash-db4’: Daubechies wavelet hash
- ’colorhash’: HSV color hash
- ’crop-resistant-hash’: Crop-resistant hash

Returns:

hashfunc

Return type:

Object
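As an illustration of what the simplest option, the average hash, computes, the following NumPy sketch shrinks a grayscale image to hash_size x hash_size cells and marks every cell that is brighter than the overall mean. It is a simplified stand-in and not the hash function returned by get_params_hash: the function names are hypothetical, the image dimensions are assumed to be multiples of hash_size, and block averaging replaces the interpolated resize that image hashing libraries normally use.

```python
import numpy as np

def average_hash_bits(gray, hash_size=8):
    """Return a hash_size x hash_size boolean grid for a 2-D grayscale image.

    Hypothetical helper. Assumes both image dimensions are multiples of
    hash_size so the image can be split into equal blocks.
    """
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    # Split into blocks and average each block to downsample the image.
    blocks = gray.reshape(hash_size, h // hash_size, hash_size, w // hash_size)
    small = blocks.mean(axis=(1, 3))
    # A bit is set where the cell is brighter than the mean of all cells.
    return small > small.mean()

def hamming(bits_a, bits_b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(bits_a != bits_b))

img = np.zeros((16, 16))
img[:, 8:] = 255                      # left half dark, right half bright
same = average_hash_bits(img)
shifted = average_hash_bits(np.roll(img, 2, axis=1))

print(hamming(same, same), hamming(same, shifted))
```

Identical images give a Hamming distance of 0, and the distance grows as the images diverge, which is the quantity a hash threshold would be compared against.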

clustimage.clustimage.img_flatten(img): Flatten image.

clustimage.clustimage.import_example(data='flowers', url=None, sep=',', verbose='info')

Import example dataset from github source.

Import one of the example datasets from the github source, or specify your own download url link.

Parameters:

data (str) –
Images:
- ’faces’
- ’mnist’
Files with images:
- ’southern_nebula’
- ’flowers’
- ’scenes’
- ’cat_and_dog’
url (str) – url link to the dataset.

Returns:

list of str containing filepath to images.

Return type:

list of str or numpy array

clustimage.clustimage.imresize(img, dim=(128, 128)): Resize image.

clustimage.clustimage.imscale(img)

Normalize image by scaling.

Scaling in range [0-255] by img*(255/max(img))

Parameters:: img (array-like) – Input image data.
Returns:: img – Scaled image.
Return type:: array-like
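The scaling formula img*(255/max(img)) can be written out in a few lines of NumPy. This is a minimal sketch of the formula as stated above, not the library's code: the function name scale_to_255, the cast to uint8 (which truncates fractions), and the handling of an all-zero image are assumptions.

```python
import numpy as np

def scale_to_255(img):
    """Scale an array so that its maximum maps to 255.

    Hypothetical helper following img * (255 / max(img)). An all-zero image
    is returned as zeros to avoid dividing by zero.
    """
    img = np.asarray(img, dtype=float)
    peak = img.max()
    if peak == 0:
        return np.zeros(img.shape, dtype=np.uint8)
    return (img * (255 / peak)).astype(np.uint8)

scaled = scale_to_255([[0.0, 0.5], [0.25, 1.0]])
print(scaled)
```

Because only the maximum is used, the minimum of the input is not shifted to 0 unless it already is 0.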

clustimage.clustimage.listdir(dirpath, ext=['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic'], black_list=None, recursive=True)

Recursively collect images from path.

Parameters:

dirpath (str) – Path to directory; “/tmp” or “c://temp/”
ext (list, default: ['png', 'tiff', 'tif', 'jpg', 'jpeg', 'heic']) – Extensions to collect from directories.
black_list (list, (default: None)) – Exclude directory with all subdirectories from processing. * [‘undouble’]
recursive (bool, (default: True)) – Walk recursively through all subdirectories.

Returns:

getfiles – Full pathnames to images.

Return type:

list of str.

Example

>>> import clustimage as cl
>>> pathnames = cl.listdir('c://temp//flower_images')
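How the ext, black_list and recursive parameters interact can be sketched with os.walk. This is an illustration only, not the library's implementation: the function name collect_images is hypothetical, and matching extensions case-insensitively and black-listing by directory name are assumptions.

```python
import os
import tempfile

def collect_images(dirpath, ext=("png", "jpg"), black_list=None, recursive=True):
    """Return full pathnames of files whose extension is listed in ext.

    Hypothetical helper. Directories named in black_list are skipped together
    with everything below them.
    """
    black_list = set(black_list or [])
    wanted = {"." + e.lower().lstrip(".") for e in ext}
    found = []
    for root, dirs, files in os.walk(dirpath):
        # Editing dirs in place controls which subdirectories os.walk visits.
        dirs[:] = [d for d in dirs if d not in black_list] if recursive else []
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in wanted:
                found.append(os.path.join(root, name))
    return found

# Build a small throwaway directory tree to try it out.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("a.png", "notes.txt", os.path.join("sub", "b.JPG"),
                os.path.join("undouble", "c.png")):
        path = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    names = sorted(os.path.basename(p) for p in collect_images(tmp, black_list=["undouble"]))
    top_only = [os.path.basename(p) for p in collect_images(tmp, recursive=False)]

print(names, top_only)
```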

clustimage.clustimage.move_files(pathnames, savedir, action='move', overwrite=False)

Move or copy image files into directories based on cluster labels.

Parameters:

pathnames (list or numpy array) – A list or numpy array with files that need to be moved. [‘c:/temp/file.jpg’, ‘c:/file2.jpg’]
savedir (str, optional) – The base directory where the images will be moved. * ‘c:/my_new_dir/’ * None: The images are moved to the parent directory of their current location.
action (str, 'move' default) –
- ‘copy’: copy files
- ’move’: move files
overwrite (Bool, False default) –
- True: Overwrite files
- False: Do not overwrite files
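The combination of action and overwrite can be illustrated with shutil. The sketch below is a simplified stand-in, not the library's implementation: the function name transfer is hypothetical, it places all files directly in savedir, and it silently skips existing targets when overwrite is False.

```python
import os
import shutil
import tempfile

def transfer(pathnames, savedir, action="move", overwrite=False):
    """Move or copy files into savedir and return the new pathnames.

    Hypothetical helper. Existing targets are skipped unless overwrite=True.
    """
    if action not in ("move", "copy"):
        raise ValueError("action must be 'move' or 'copy'")
    os.makedirs(savedir, exist_ok=True)
    done = []
    for src in pathnames:
        dst = os.path.join(savedir, os.path.basename(src))
        if os.path.exists(dst) and not overwrite:
            continue  # keep the existing file untouched
        if action == "copy":
            shutil.copy2(src, dst)
        else:
            shutil.move(src, dst)
        done.append(dst)
    return done

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "file.jpg")
    open(src, "w").close()
    target = os.path.join(tmp, "cluster_0")
    first = transfer([src], target, action="copy")
    second = transfer([src], target, action="copy")   # skipped: target exists
    moved = transfer([src], target, action="move", overwrite=True)
    src_gone = not os.path.exists(src)

print(len(first), len(second), len(moved), src_gone)
```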

clustimage.clustimage.seperate_path(pathname)

Separate a path into its directory, filename and extension.

Parameters:

pathname (str) – Pathname to the image.

Returns:

dirname (str) – directory path.
filename (str) – filename.
ext – Extension.
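Splitting a pathname into these three parts can be done with os.path. This is a minimal sketch, not the library's code: the function name split_path is hypothetical, and whether the library returns the extension with or without the leading dot is not stated above, so the dot is kept here as os.path.splitext does.

```python
import os

def split_path(pathname):
    """Split a pathname into (dirname, filename, ext).

    Hypothetical helper. The extension keeps its leading dot.
    """
    dirname, base = os.path.split(pathname)
    filename, ext = os.path.splitext(base)
    return dirname, filename, ext

parts = split_path("/tmp/flowers/flower_orange.png")
print(parts)
```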

clustimage.clustimage.set_logger(verbose: [<class 'str'>, <class 'int'>] = 'info')

Set the logger for verbosity messages.

Parameters:

verbose ([str, int], default is 'info' or 20) – Set the verbose messages using string or integer values.
- [0, 60, None, ‘silent’, ‘off’, ‘no’]: No message.
- [10, ‘debug’]: Messages from debug level and higher.
- [20, ‘info’]: Messages from info level and higher.
- [30, ‘warning’]: Messages from warning level and higher.
- [50, ‘critical’]: Messages from critical level and higher.

Returns:

None.

Examples

>>> # Set the logger to warning
>>> set_logger(verbose='warning')
>>> # Test with different messages
>>> logger.debug("Hello debug")
>>> logger.info("Hello info")
>>> logger.warning("Hello warning")
>>> logger.critical("Hello critical")
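The mapping from verbose values to logging levels listed above can be sketched with Python's standard logging module. This is an illustration, not the library's implementation: the names LEVELS and to_level and the logger name clustimage_demo are hypothetical. Level 60 lies above logging.CRITICAL (50), which is why it silences all messages.

```python
import logging

# Assumed mapping, taken from the values listed in the parameter description.
LEVELS = {"debug": 10, "info": 20, "warning": 30, "critical": 50,
          "silent": 60, "off": 60, "no": 60}

def to_level(verbose="info"):
    """Translate a verbose value (str, int or None) into a logging level."""
    if verbose is None or verbose == 0:
        return 60                      # above CRITICAL: no messages at all
    if isinstance(verbose, str):
        return LEVELS[verbose.lower()]
    return int(verbose)

logger = logging.getLogger("clustimage_demo")
logger.setLevel(to_level("warning"))

shows_warning = logger.isEnabledFor(logging.WARNING)
shows_info = logger.isEnabledFor(logging.INFO)
print(shows_warning, shows_info)
```

With the level set to warning, info and debug messages are suppressed while warning and critical messages pass through.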

clustimage.clustimage.store_to_disk(Xraw, dim, tempdir, files=None): Store to disk.

clustimage.clustimage.unique_no_sort(x): Uniques without sort.

clustimage.clustimage.url2disk(urls, save_dir)

Write url locations to disk.

Images can also be imported from url locations. Each image is first downloaded and stored in a (specified) temp directory. The example below downloads 5 images from url locations. Note that url images and path locations can be combined.

Parameters:

urls (list) – List of url locations with image path.
save_dir (str) – Location on disk where the images are stored.

Returns:

urls – List of url locations that are now stored on disk.

Return type:

list of str.

Examples

>>> # Import library
>>> import clustimage as cl
>>>
>>> # List the url locations of the images
>>> url_to_images = ['https://erdogant.github.io/datasets/images/flower_images/flower_orange.png',
...                  'https://erdogant.github.io/datasets/images/flower_images/flower_white_1.png',
...                  'https://erdogant.github.io/datasets/images/flower_images/flower_white_2.png',
...                  'https://erdogant.github.io/datasets/images/flower_images/flower_yellow_1.png',
...                  'https://erdogant.github.io/datasets/images/flower_images/flower_yellow_2.png']
>>>
>>> # Download the images and store them on disk
>>> results = cl.url2disk(url_to_images, r'c:/temp/out/')
>>>