API References
The Python package undouble is used to detect (near-)identical images.
- class undouble.undouble.Undouble(method='phash', targetdir='', grayscale=False, dim=(128, 128), hash_size=8, ext=['png', 'tiff', 'jpg', 'jfif', 'jpeg'], verbose=20)
Detect duplicate images.
The Python package undouble detects (near-)identical images based on image hashes.
- The following steps are taken:
Recursively read all images from the directory with the specified extensions.
Compute the image hash.
Group similar images.
Move them if desired.
- Parameters:
method (str, (default: 'phash')) – Image hash method: ‘ahash’ (Average hash), ‘phash’ (Perceptual hash), ‘dhash’ (Difference hash), ‘whash-haar’ (Haar wavelet hash), ‘crop-resistant-hash’ (Crop resistant hash).
targetdir (str, (default: None)) – Directory from which the images are read.
hash_size (integer (default: 8)) – The hash_size is used to scale down the image and create a hash-image of length: hash_size*hash_size.
ext (list, (default: ['png','tiff','jpg'])) – Only images with these file extensions are used.
grayscale (Bool, (default: False)) – Convert the image to grayscale.
dim (tuple, (default: (128,128))) – Rescale images. This is required because the feature space needs to be the same across samples.
verbose (int, (default: 20)) – Print progress to screen. The default is 20. 10: Debug, 20: Info, 30: Warn, 40: Error, 60: None
- Returns:
Object containing a dict with the following keys:
- pathnames (list of str) – Full path to images that are used in the model.
- filenames (list of str) – Filenames of the input images.
Example
>>> # Import library
>>> from undouble import Undouble
>>>
>>> # Init with default settings
>>> model = Undouble(method='phash', hash_size=8)
>>>
>>> # Import example data
>>> targetdir = model.import_example(data='flowers')
>>>
>>> # Import the files from disk, cleaning and pre-processing
>>> model.import_data(targetdir)
>>>
>>> # Compute image-hash
>>> model.compute_hash()
>>>
>>> # Find images with image-hash <= threshold
>>> model.group(threshold=0)
>>>
>>> # Plot the images
>>> model.plot()
>>>
>>> # Move the images
>>> model.move()
References
Blog: https://towardsdatascience.com/detection-of-duplicate-images-using-image-hash-functions-4d9c53f04a75
Documentation: https://erdogant.github.io/undouble/
https://content-blockchain.org/research/testing-different-image-hash-functions/
- bin2hex()
Convert the binary image hash to hexadecimal.
- Returns:
Hex of image hash.
- Return type:
str
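Example
A minimal sketch; it assumes bin2hex() converts the binary hashes stored by compute_hash():
>>> from undouble import Undouble
>>> model = Undouble(method='phash')
>>> targetdir = model.import_example(data='flowers')
>>> model.import_data(targetdir)
>>> model.compute_hash()
>>> # Convert the stored binary image hashes to hexadecimal (assumed behaviour)
>>> hex_hashes = model.bin2hex()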
- clean_files(clean_tempdir=False)
Remove the entire temp directory with all its contents.
- clean_init(params=True, results=True)
Clean or remove previous results and models to ensure a correct working state.
- compute_hash(method=None, hash_size=8, return_dict=False)
Compute the hash for each image.
- Parameters:
method (str, (default: 'phash')) – Image hash method: ‘ahash’ (Average hash), ‘phash’ (Perceptual hash), ‘dhash’ (Difference hash), ‘whash-haar’ (Haar wavelet hash), ‘crop-resistant-hash’ (Crop resistant hash).
hash_size (integer (default: 8)) – The hash_size will be used to scale down the image and create a hash-image of length: hash_size*hash_size.
- Return type:
None.
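Example
A minimal sketch based on the class example above:
>>> from undouble import Undouble
>>> model = Undouble()
>>> targetdir = model.import_example(data='flowers')
>>> model.import_data(targetdir)
>>> # Compute the perceptual hash with a 16x16 hash-image
>>> model.compute_hash(method='phash', hash_size=16)
>>> # The binary hashes are stored in the results dict
>>> model.results['img_hash_bin']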
- compute_imghash(img, hash_size=None, to_array=False)
Compute image hash per image.
- Parameters:
img (Object or RGB-image) – Input image.
hash_size (integer (default: None)) – The hash_size will be used to scale down the image and create a hash-image of length: hash_size*hash_size.
to_array (Bool (default: False)) – True: Return the hash-array in the same size as the scaled image. False: Return the hash-image vector.
Examples
>>> from undouble import Undouble
>>> import matplotlib.pyplot as plt
>>>
>>> # Initialize with method
>>> model = Undouble(method='ahash')
>>>
>>> # Import flowers example
>>> X = model.import_example(data='flowers')
>>> imgs = model.import_data(X, return_results=True)
>>>
>>> # Compute hash for a single image
>>> hashs = model.compute_imghash(imgs['img'][0], to_array=False, hash_size=8)
>>>
>>> # The hash is a binary array or vector.
>>> print(hashs)
>>>
>>> # Plot the image using the undouble plot_hash functionality
>>> model.results['img_hash_bin']
>>> model.plot_hash(idx=0)
>>>
>>> # Plot the image
>>> fig, ax = plt.subplots(1, 2, figsize=(8, 8))
>>> ax[0].imshow(imgs['img'][0])
>>> ax[1].imshow(hashs[0])
>>>
>>> # Compute hash for multiple images
>>> hashs = model.compute_imghash(imgs['img'][0:10], to_array=False, hash_size=8)
- Returns:
imghash – Hash.
- Return type:
numpy-array
- group(threshold=0, return_dict=False)
Find similar images using the hash signatures.
- Parameters:
threshold (float, (default: 0)) – Threshold on the hash value to determine similarity.
- Return type:
None.
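Example
A minimal sketch based on the class example above:
>>> from undouble import Undouble
>>> model = Undouble(method='phash')
>>> targetdir = model.import_example(data='flowers')
>>> model.import_data(targetdir)
>>> model.compute_hash()
>>> # Group images with an image-hash difference <= 5
>>> model.group(threshold=5)
>>> # Plot the grouped images
>>> model.plot()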
- import_data(targetdir, black_list=['undouble'], return_results=False)
Read the images and preprocess them.
- Parameters:
targetdir (str, list of str, or numpy array) – Directory, list of file paths, or image array from which the images are read.
black_list (list, (default: ['undouble'])) – Exclude directories with this name (including all subdirectories) from processing.
Example
>>> # Import library
>>> import os
>>> from undouble import Undouble
>>>
>>> # Init with default settings
>>> model = Undouble()
>>>
>>> # Import example flower data set
>>> list_of_filepaths = model.import_example(data='flowers')
>>>
>>> # Read from file names
>>> model.import_data(list_of_filepaths)
>>>
>>> # Read from directory
>>> input_directory, _ = os.path.split(list_of_filepaths[0])
>>> model.import_data(input_directory)
>>>
>>> # Import from numpy array
>>> IMG = model.import_example(data='mnist')
>>> model.import_data(IMG)
>>>
>>> # Compute hash
>>> model.compute_hash()
>>>
>>> # Find images with image-hash <= threshold
>>> model.group(threshold=0)
>>>
>>> # Plot the images
>>> model.plot()
- Return type:
model.results
- import_example(data='flowers', url=None, sep=',')
Import example dataset from github source.
Import one of the available example datasets from the github source, or specify your own download URL.
- Parameters:
data (str) –
- Images:
‘faces’
‘mnist’
- Files:
‘southern_nebula’
‘flowers’
‘scenes’
‘cat_and_dog’
url (str) – URL link to the dataset.
- Returns:
images
- Return type:
pd.DataFrame()
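Example
A minimal sketch based on the usage in the class example above:
>>> from undouble import Undouble
>>> model = Undouble()
>>> # Example data set that is returned as file paths
>>> pathnames = model.import_example(data='flowers')
>>> # Example data set that is returned as a numpy image array
>>> X = model.import_example(data='mnist')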
- move(filters=None, targetdir=None)
Move images.
Files that are listed by the group() functionality are moved.
- Parameters:
filters (list, (Default: ['location'])) – ‘location’ : Only move images that are seen in the same directory.
targetdir (str (default: None)) – Move similar files to this directory. If None, a subdirectory named “undouble” is created within each directory.
- Return type:
None.
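Example
A minimal sketch based on the class example above:
>>> from undouble import Undouble
>>> model = Undouble()
>>> targetdir = model.import_example(data='flowers')
>>> model.import_data(targetdir)
>>> model.compute_hash()
>>> model.group(threshold=0)
>>> # Move the grouped images; with targetdir=None a subdir named "undouble" is created within each directory
>>> model.move(targetdir=None)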
- plot(cmap=None, figsize=(15, 10))
Plot the results.
- Parameters:
labels (list, (default: None)) – Cluster labels to plot. If None, all cluster labels are plotted.
ncols (int, (default: None)) – Number of columns to use in the subplot. The number of rows is estimated based on the number of columns.
cmap (str, (default: None)) – Colorscheme for the images: ‘gray’, ‘binary’, or None (uses the RGB colorscheme).
show_hog (bool, (default: False)) – Plot the HOG features next to the input image.
min_clust (int, (default: 1)) – Plots are created for clusters with more than min_clust samples.
figsize (tuple, (default: (15, 10))) – Size of the figure (height, width).
- Return type:
None.
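Example
A minimal sketch based on the class example above:
>>> from undouble import Undouble
>>> model = Undouble()
>>> targetdir = model.import_example(data='flowers')
>>> model.import_data(targetdir)
>>> model.compute_hash()
>>> model.group(threshold=0)
>>> # Plot the grouped images in grayscale with a custom figure size
>>> model.plot(cmap='gray', figsize=(20, 10))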
- plot_hash(idx=None, filenames=None)
Plot the image-hash.
- Parameters:
idx (list of int, optional) – The index of the images to plot.
filenames (list of str, optional) – The (list of) filenames to plot.
- Returns:
fig (Figure)
ax (Axis)
Examples
>>> # Import library
>>> from undouble import Undouble
>>>
>>> # Init with default settings
>>> model = Undouble()
>>>
>>> # Import example data
>>> targetdir = model.import_example(data='flowers')
>>>
>>> # Import the files from disk, cleaning and pre-processing
>>> model.import_data(targetdir)
>>>
>>> # Compute image-hash
>>> model.compute_hash(method='phash', hash_size=6)
>>>
>>> # Hashes are stored in the results dict.
>>> model.results['img_hash_bin']
>>>
>>> # Plot the image-hash for a set of indexes
>>> model.plot_hash(idx=[0, 1])
>>>
>>> # Plot the image-hash for a set of filenames
>>> filenames = model.results['filenames'][0:2]
>>> model.plot_hash(filenames=filenames)
- undouble.undouble.compute_blur(pathname)
Compute the amount of blur in an image.
Load the image, convert it to grayscale, and compute the focus measure using the Variance of Laplacian method. Returned scores below 100 generally indicate a blurry image.
- Parameters:
pathname (str) – Absolute path location to image.
- Returns:
fm_score – Score that depicts the amount of blur. Scores below 100 generally indicate a blurry image.
- Return type:
float
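Example
A minimal sketch; the image path is hypothetical:
>>> from undouble.undouble import compute_blur
>>> # Scores below 100 generally indicate a blurry image
>>> fm_score = compute_blur('/path/to/image.jpg')
>>> print(fm_score)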
- undouble.undouble.create_targetdir(pathname, targetdir)
Create directory.
- Parameters:
pathname (str) – Absolute path location of the image of interest.
targetdir (str) – Target directory.
- Returns:
movedir (str) – Absolute path to the directory to which the file will be moved.
dirname (str) – Absolute path to the directory of the input pathname.
filename (str) – Name of the file.
ext (str) – Extension.
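Example
A minimal sketch; both paths are hypothetical:
>>> from undouble.undouble import create_targetdir
>>> # Create the target directory and return the path components of the input image
>>> movedir, dirname, filename, ext = create_targetdir('/path/to/image.jpg', '/path/to/duplicates')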
- undouble.undouble.disable_tqdm()
Determine whether the tqdm progress bar should be disabled, based on the logger verbosity.
- undouble.undouble.filter_checks(pathnames, filters)
Check whether the pathnames pass the given filters.
- Parameters:
pathnames (list of str) – pathnames to the images.
filters (list, (Default: ['location'])) – ‘location’ : Only move images that are seen in the same directory.
- Returns:
True when all filter checks pass, otherwise False.
- Return type:
bool
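Example
A minimal sketch; the pathnames are hypothetical:
>>> from undouble.undouble import filter_checks
>>> pathnames = ['/data/a/image1.jpg', '/data/a/image2.jpg']
>>> # True when all filters pass, e.g. both images are located in the same directory
>>> filter_checks(pathnames, filters=['location'])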
- undouble.undouble.get_existing_pathnames(pathnames)
Get existing pathnames.
- Parameters:
pathnames (list of str) – pathnames to the images.
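Example
A minimal sketch; the pathnames are hypothetical and the return format is assumed:
>>> from undouble.undouble import get_existing_pathnames
>>> pathnames = ['/data/image1.jpg', '/data/missing.jpg']
>>> # Keep only the pathnames that exist on disk (assumed behaviour)
>>> existing = get_existing_pathnames(pathnames)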
- undouble.undouble.seperate_path(pathname)
Separate a path into its directory name, filename, and extension.
- Parameters:
pathname (str) – Path to the image.
- Returns:
dirname (str) – Directory path.
filename (str) – Filename.
ext (str) – Extension.
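Example
A minimal sketch; the path is hypothetical:
>>> from undouble.undouble import seperate_path
>>> # Split a path into its directory, filename and extension
>>> dirname, filename, ext = seperate_path('/path/to/image.png')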
- undouble.undouble.set_logger(verbose=20)
Set the logger for verbosity messages.
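Example
A minimal sketch:
>>> from undouble.undouble import set_logger
>>> # Show debug messages and higher
>>> set_logger(verbose=10)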
- undouble.undouble.sort_images(pathnames, hash_scores=None, sort_first_img=False)
Sort images.
- Sort images on the following conditions:
Resolution
Amount of blur
- Parameters:
pathnames (list of str) – Absolute paths to the images.
- Returns:
images sorted on conditions.
- Return type:
list of str
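Example
A minimal sketch; the pathnames are hypothetical:
>>> from undouble.undouble import sort_images
>>> pathnames = ['/data/image1.jpg', '/data/image2.jpg']
>>> # Sort the images on resolution and amount of blur
>>> sorted_pathnames = sort_images(pathnames)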