API References

The Python package undouble detects (near-)identical images.

class undouble.undouble.Undouble(method='phash', targetdir='', grayscale=False, dim=(128, 128), hash_size=8, ext=['png', 'tiff', 'jpg', 'jfif', 'jpeg'], verbose=20)

Detect duplicate images.

The undouble package detects (near-)identical images based on image hashes.

The following steps are taken:
  1. Recursively read all images with the specified extensions from the directory.

  2. Compute the image hash of each image.

  3. Group similar images.

  4. Move the grouped images, if desired.
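
Step 1 can be pictured with the standard library alone. The sketch below is an illustration under stated assumptions, not the code undouble runs: the helper name `list_images` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
import os


def list_images(targetdir, ext=("png", "tiff", "jpg", "jfif", "jpeg")):
    """Recursively collect files under targetdir whose extension is in ext."""
    # Assumption: extensions are compared case-insensitively, without the dot.
    allowed = {e.lower().lstrip(".") for e in ext}
    found = []
    for root, _dirs, files in os.walk(targetdir):
        for name in files:
            suffix = os.path.splitext(name)[1].lower().lstrip(".")
            if suffix in allowed:
                found.append(os.path.join(root, name))
    return sorted(found)
```

Calling `list_images('.')` returns the matching files below the current directory in sorted order.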

Parameters:
  • method (str, (default: 'phash')) – Image hash method.
    • ‘ahash’: Average hash
    • ‘phash’: Perceptual hash
    • ‘dhash’: Difference hash
    • ‘whash-haar’: Haar wavelet hash
    • ‘crop-resistant-hash’: Crop resistant hash

  • targetdir (str, (default: '')) – Directory to read the images from.

  • hash_size (integer (default: 8)) – The hash_size will be used to scale down the image and create a hash-image of length: hash_size*hash_size.

  • ext (list, (default: ['png', 'tiff', 'jpg', 'jfif', 'jpeg'])) – Only images with these file extensions are used.

  • grayscale (Bool, (default: False)) – Convert the image to grayscale.

  • dim (tuple, (default: (128,128))) – Rescale images. This is required because the feature space needs to be the same across samples.

  • verbose (int, (default: 20)) – Print progress to screen. 10: Debug, 20: Info, 30: Warn, 40: Error, 60: None.
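
To make the hash_size parameter concrete, the following is a minimal pure-Python sketch of the average-hash idea (‘ahash’). It is not undouble's implementation: the function name `average_hash` is hypothetical, and block averaging stands in for whatever resampling the library actually applies.

```python
def average_hash(pixels, hash_size=8):
    """Return a flat list of hash_size*hash_size bits for a 2-D grayscale image.

    pixels is a list of rows. Its height and width are assumed to be
    multiples of hash_size so that block averaging stays simple.
    """
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // hash_size, width // hash_size
    # Scale down: average each bh x bw block into a single value.
    small = []
    for r in range(hash_size):
        for c in range(hash_size):
            block = [pixels[r * bh + i][c * bw + j]
                     for i in range(bh) for j in range(bw)]
            small.append(sum(block) / len(block))
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    mean = sum(small) / len(small)
    return [1 if value > mean else 0 for value in small]


# 16x16 test image: left half dark, right half bright.
image = [[0] * 8 + [255] * 8 for _ in range(16)]
bits = average_hash(image, hash_size=8)
print(len(bits))   # 64
print(bits[:8])    # [0, 0, 0, 0, 1, 1, 1, 1]
```

A larger hash_size keeps more detail per image, so fewer images end up with identical hashes.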

Returns:

  • Object.

  • dict containing keys

    • pathnames (list of str) – Full path to images that are used in the model.

    • filenames (list of str) – Filename of the input images.

Example

>>> # Import library
>>> from undouble import Undouble
>>>
>>> # Init with default settings
>>> model = Undouble(method='phash', hash_size=8)
>>>
>>> # Import example data
>>> targetdir = model.import_example(data='flowers')
>>>
>>> # Import the files from disk, then clean and pre-process them
>>> model.import_data(targetdir)
>>>
>>> # Compute image-hash
>>> model.compute_hash()
>>>
>>> # Find images with image-hash <= threshold
>>> model.group(threshold=0)
>>>
>>> # Plot the images
>>> model.plot()
>>>
>>> # Move the images
>>> model.move()

bin2hex()

Binary to hex.

Returns:

Hex of image hash.

Return type:

str
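
As an illustration of the binary-to-hex conversion, the sketch below packs a flat bit vector into a hexadecimal string. The helper name `bits_to_hex` and its signature are hypothetical; the method documented above takes no arguments and may differ in detail.

```python
def bits_to_hex(bits):
    """Pack a non-empty flat sequence of 0/1 values into a hex string."""
    width = (len(bits) + 3) // 4   # number of hex digits, keeps leading zeros
    value = int("".join(str(int(b)) for b in bits), 2)
    return format(value, "0{}x".format(width))


print(bits_to_hex([1, 1, 1, 1, 0, 0, 0, 0]))   # f0
```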

clean_files(clean_tempdir=False)

Remove the entire temp directory with all its contents.

clean_init(params=True, results=True)

Clean or remove previous results and models to ensure correct operation.

compute_hash(method=None, hash_size=8, return_dict=False)

Compute the hash for each image.

Parameters:
  • method (str, (default: 'phash')) – Image hash method.
    • ‘ahash’: Average hash
    • ‘phash’: Perceptual hash
    • ‘dhash’: Difference hash
    • ‘whash-haar’: Haar wavelet hash
    • ‘crop-resistant-hash’: Crop resistant hash

  • hash_size (integer (default: 8)) – The hash_size will be used to scale down the image and create a hash-image of length: hash_size*hash_size.

Return type:

None.
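
For comparison with the average hash, a difference hash (‘dhash’) encodes whether brightness rises between horizontally adjacent pixels. The sketch below assumes the image has already been scaled to hash_size rows by hash_size + 1 columns; the function name `difference_hash` is hypothetical and this is not the library's own code.

```python
def difference_hash(small):
    """Return one bit per horizontally adjacent pixel pair.

    small is a 2-D grayscale image with hash_size rows and hash_size + 1
    columns, so each row yields hash_size comparisons.
    """
    return [1 if row[c + 1] > row[c] else 0
            for row in small
            for c in range(len(row) - 1)]


# 4 rows x 5 columns, brightness rising from left to right.
gradient = [[0, 10, 20, 30, 40] for _ in range(4)]
print(difference_hash(gradient))   # 16 bits, all 1
```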

compute_imghash(img, hash_size=None, to_array=False)

Compute image hash per image.

Parameters:
  • img (Object or RGB-image.) – Image.

  • hash_size (integer (default: None)) – The hash_size will be used to scale down the image and create a hash-image of length: hash_size*hash_size.

  • to_array (Bool (default: False)) – True: Return the hash-array in the same size as the scaled image. False: Return the hash-image vector.

Examples

>>> from undouble import Undouble
>>> import matplotlib.pyplot as plt
>>>
>>> # Initialize with method
>>> model = Undouble(method='ahash')
>>>
>>> # Import flowers example
>>> X = model.import_example(data='flowers')
>>> imgs = model.import_data(X, return_results=True)
>>>
>>> # Compute hash for a single image
>>> hashs = model.compute_imghash(imgs['img'][0], to_array=False, hash_size=8)
>>>
>>> # The hash is a binary array or vector.
>>> print(hashs)
>>>
>>> # Plot the image using the undouble plot_hash functionality
>>> model.results['img_hash_bin']
>>> model.plot_hash(idx=0)
>>>
>>> # Plot the image
>>> fig, ax = plt.subplots(1, 2, figsize=(8,8))
>>> ax[0].imshow(imgs['img'][0])
>>> ax[1].imshow(hashs[0])
>>>
>>> # Compute hash for multiple images
>>> hashs = model.compute_imghash(imgs['img'][0:10], to_array=False, hash_size=8)
Returns:

imghash – Hash.

Return type:

numpy-array
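
The difference between the two to_array settings can be pictured as a reshape of the same bits. The sketch below uses plain lists rather than a numpy array, and the helper name `to_hash_array` is hypothetical.

```python
def to_hash_array(bits, hash_size):
    """Reshape a flat hash vector into hash_size rows of hash_size bits."""
    if len(bits) != hash_size * hash_size:
        raise ValueError("expected hash_size * hash_size bits")
    return [bits[r * hash_size:(r + 1) * hash_size] for r in range(hash_size)]


print(to_hash_array([0, 1, 1, 0], hash_size=2))   # [[0, 1], [1, 0]]
```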

group(threshold=0, return_dict=False)

Find similar images using the hash signatures.

Parameters:

threshold (float, (default: 0)) – Threshold on the hash value to determine similarity.

Return type:

None.
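
The role of threshold can be sketched as follows, assuming similarity is measured as the number of differing hash bits (Hamming distance). Both that assumption and the greedy grouping strategy are illustrative; the function names `hamming` and `group_by_threshold` are hypothetical.

```python
def hamming(a, b):
    """Count the positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def group_by_threshold(hashes, threshold=0):
    """Greedily group indices whose hash lies within threshold of a group's first member."""
    groups = []
    assigned = set()
    for i in range(len(hashes)):
        if i in assigned:
            continue
        members = [i] + [j for j in range(i + 1, len(hashes))
                         if j not in assigned
                         and hamming(hashes[i], hashes[j]) <= threshold]
        assigned.update(members)
        if len(members) > 1:   # a single image has no duplicate
            groups.append(members)
    return groups


hashes = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 0]]
print(group_by_threshold(hashes, threshold=0))   # [[0, 1]]
print(group_by_threshold(hashes, threshold=1))   # [[0, 1, 3]]
```

With threshold=0 only images with identical hashes are grouped; raising the threshold also admits near-identical images.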

import_data(targetdir, black_list=['undouble'], return_results=False)

Import the files from disk, then clean and pre-process them.

Parameters:

black_list (list, (default: ['undouble'])) – Exclude directory with all subdirectories from processing.

Example

>>> # Import libraries
>>> import os
>>> from undouble import Undouble
>>> #
>>> # Init with default settings
>>> model = Undouble()
>>> #
>>> #
>>> # Import example flower data set
>>> input_list_of_files = model.import_example(data='flowers')
>>> #
>>> # Read from file names
>>> model.import_data(input_list_of_files)
>>> #
>>> #
>>> # Read from directory
>>> input_directory, _ = os.path.split(input_list_of_files[0])
>>> model.import_data(input_directory)
>>> #
>>> #
>>> # Import from numpy array
>>> IMG = model.import_example(data='mnist')
>>> model.import_data(IMG)
>>> # Compute hash
>>> model.compute_hash()
>>> #
>>> #
>>> # Find images with image-hash <= threshold
>>> model.group(threshold=0)
>>> #
>>> # Plot the images
>>> model.plot()
Return type:

model.results
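
The effect of black_list can be illustrated with a small path check. This is a sketch under the assumption that a file is skipped when any of its parent directory names appears in black_list; the helper name `is_blacklisted` is hypothetical.

```python
from pathlib import PurePath


def is_blacklisted(pathname, black_list=("undouble",)):
    """Return True if any directory component of pathname is in black_list."""
    directories = PurePath(pathname).parts[:-1]   # drop the filename itself
    return any(part in black_list for part in directories)


print(is_blacklisted("data/undouble/0001.png"))   # True
print(is_blacklisted("data/flowers/0001.png"))    # False
```

Excluding the “undouble” directory by default is a sensible choice if earlier runs moved duplicates into such subdirectories, because those files would otherwise be picked up again.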

import_example(data='flowers', url=None, sep=',')

Import an example dataset from the GitHub source.

Import one of the example datasets from the GitHub source, or specify your own download URL.

Parameters:
  • data (str) –

    Images:
    • ‘faces’

    • ‘mnist’

    Files:
    • ‘southern_nebula’

    • ‘flowers’

    • ‘scenes’

    • ‘cat_and_dog’

  • url (str) – URL of the dataset.

Returns:

images

Return type:

pd.DataFrame()

move(filters=None, targetdir=None)

Move images.

Moves the files that are listed by the group() functionality.

Parameters:
  • filters (list, (Default: ['location'])) – ‘location’ : Only move images that are seen in the same directory.

  • targetdir (str (default: None)) – Move similar files to this directory. None: a subdirectory named “undouble” is created within each directory.

Return type:

None.
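
The default behaviour (targetdir=None) can be pictured with the sketch below. It assumes that the first file of a group is kept in place and the remaining files are moved; that choice, like the helper name `move_duplicates`, is an assumption made for illustration.

```python
import os
import shutil


def move_duplicates(group, subdir="undouble"):
    """Keep the first file of a group and move the rest into a subdir next to each file."""
    moved = []
    for pathname in group[1:]:
        dirname, filename = os.path.split(pathname)
        movedir = os.path.join(dirname, subdir)
        os.makedirs(movedir, exist_ok=True)   # create the subdir on first use
        destination = os.path.join(movedir, filename)
        shutil.move(pathname, destination)
        moved.append(destination)
    return moved
```

The function returns the new locations, which makes it easy to review or undo a move.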

plot(cmap=None, figsize=(15, 10))

Plot the results.

Parameters:
  • labels (list, (default: None)) – Cluster label to plot. In case of None, all cluster labels are plotted.

  • ncols (int, (default: None)) – Number of columns to use in the subplot. The number of rows is estimated based on the columns.

  • cmap (str, (default: None)) – Colorscheme for the images. ‘gray’, ‘binary’, None (uses rgb colorscheme)

  • show_hog (bool, (default: False)) – Plot the hog features next to the input image.

  • min_clust (int, (default: 1)) – Plots are created for clusters with > min_clust samples

  • figsize (tuple, (default: (15, 10))) – Size of the figure (width, height).

Return type:

None.

plot_hash(idx=None, filenames=None)

Plot the image-hash.

Parameters:
  • idx (list of int, optional) – The index of the images to plot.

  • filenames (list of str, optional) – The (list of) filenames to plot.

Returns:

  • fig (Figure)

  • ax (Axis)

Examples

>>> # Import library
>>> from undouble import Undouble
>>>
>>> # Init with default settings
>>> model = Undouble()
>>>
>>> # Import example data
>>> targetdir = model.import_example(data='flowers')
>>>
>>> # Import the files from disk, then clean and pre-process them
>>> model.import_data(targetdir)
>>>
>>> # Compute image-hash
>>> model.compute_hash(method='phash', hash_size=6)
>>>
>>> # Hashes are stored in the result dict.
>>> model.results['img_hash_bin']
>>>
>>> # Plot the image-hash for a set of indexes
>>> model.plot_hash(idx=[0, 1])
>>>
>>> # Plot the image-hash for a set of filenames
>>> filenames = model.results['filenames'][0:2]
>>> # Or specify the filenames explicitly
>>> filenames = ['0001.png', '0002.png']
>>> model.plot_hash(filenames=filenames)
>>>
undouble.undouble.compute_blur(pathname)

Compute amount of blur in image.

Load the image, convert it to grayscale, and compute the focus measure of the image using the Variance of Laplacian method. Scores below 100 generally indicate a blurry image.

Parameters:

pathname (str) – Absolute path location to image.

Returns:

fm_score – Score that depicts the amount of blur. Scores <100 are generally more blurry.

Return type:

float
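
The Variance of Laplacian idea can be sketched without any image library. The code below applies a 4-neighbour Laplacian to the interior pixels of a 2-D list and returns the variance of the responses. It is a simplified stand-in, so its scores are not comparable to the <100 rule of thumb above; the function name `variance_of_laplacian` is hypothetical.

```python
def variance_of_laplacian(gray):
    """Return the variance of a 4-neighbour Laplacian over the interior pixels.

    gray is a 2-D list of grayscale intensities with at least 3 rows and 3 columns.
    """
    responses = []
    for r in range(1, len(gray) - 1):
        for c in range(1, len(gray[0]) - 1):
            responses.append(gray[r - 1][c] + gray[r + 1][c]
                             + gray[r][c - 1] + gray[r][c + 1]
                             - 4 * gray[r][c])
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)


flat = [[128] * 5 for _ in range(5)]                                    # no edges at all
checker = [[255 * ((r + c) % 2) for c in range(5)] for r in range(5)]   # sharp edges everywhere
print(variance_of_laplacian(flat))                                      # 0.0
print(variance_of_laplacian(checker) > variance_of_laplacian(flat))     # True
```

An image without edges produces no Laplacian response, so a low variance indicates little fine detail, which is what blur looks like numerically.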

undouble.undouble.create_targetdir(pathname, targetdir)

Create directory.

Parameters:
  • pathname (str) – Absolute path location of the image of interest.

  • targetdir (str) – Target directory.

Returns:

  • movedir (str) – Absolute path to directory.

  • dirname (str) – Absolute path to directory.

  • filename (str) – Name of the file.

  • ext (str) – Extension.

undouble.undouble.disable_tqdm()

Set the logger for verbosity messages.

undouble.undouble.filter_checks(pathnames, filters)

Filter checks.

Parameters:
  • pathnames (list of str) – pathnames to the images.

  • filters (list, (Default: ['location'])) – ‘location’ : Only move images that are seen in the same directory.

Returns:

True when all filters are satisfied.

Return type:

bool
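
A ‘location’ check could look like the sketch below, assuming the filter passes when all pathnames share one parent directory. The helper name `same_location` is hypothetical and only this single filter is illustrated.

```python
import os


def same_location(pathnames):
    """Return True when every pathname lives in the same directory."""
    directories = {os.path.dirname(os.path.abspath(p)) for p in pathnames}
    return len(directories) <= 1


print(same_location(["flowers/0001.png", "flowers/0002.png"]))   # True
print(same_location(["flowers/0001.png", "scenes/0001.png"]))    # False
```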

undouble.undouble.get_existing_pathnames(pathnames)

Get existing pathnames.

Parameters:

pathnames (list of str) – pathnames to the images.

undouble.undouble.seperate_path(pathname)

Separate a path into its directory, filename, and extension.

Parameters:

pathname (str) – pathname to the image.

Returns:

  • dirname (str) – directory path.

  • filename (str) – filename.

  • ext (str) – Extension.
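
The same three parts can be obtained with os.path, as in the sketch below. Whether the library returns the extension with or without the leading dot is not stated above; keeping the dot is an assumption, and the helper name `split_path` is hypothetical.

```python
import os


def split_path(pathname):
    """Split a path into (dirname, filename without extension, extension)."""
    dirname, basename = os.path.split(pathname)
    filename, ext = os.path.splitext(basename)   # ext keeps its leading dot
    return dirname, filename, ext


dirname, filename, ext = split_path(os.path.join("data", "flowers", "0001.png"))
print(filename, ext)   # 0001 .png
```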

undouble.undouble.set_logger(verbose=20)

Set the logger for verbosity messages.
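
The verbose values (10, 20, 30, 40, 60) line up with the numeric levels of Python's logging module, where 60 lies above CRITICAL (50) and therefore silences all messages. The sketch below shows that mapping with a stand-alone logger; the function name `make_logger` and the logger names are hypothetical, and the library's own logger setup may differ.

```python
import logging


def make_logger(verbose=20, name="undouble-sketch"):
    """Return a logger whose threshold is the given numeric verbosity."""
    logger = logging.getLogger(name)
    logger.setLevel(verbose)
    return logger


logger = make_logger(verbose=30)
print(logger.isEnabledFor(logging.INFO))      # False: 20 is below the threshold of 30
print(logger.isEnabledFor(logging.WARNING))   # True

silent = make_logger(verbose=60, name="undouble-sketch-silent")
print(silent.isEnabledFor(logging.CRITICAL))  # False: 60 is above every standard level
```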

undouble.undouble.sort_images(pathnames, hash_scores=None, sort_first_img=False)

Sort images.

Sort images on the following conditions:
  1. Resolution

  2. Amount of blur

Parameters:

pathnames (list of str) – Absolute paths to the images.

Returns:

Images sorted on the conditions above.

Return type:

list of str
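
The two sorting conditions can be combined in a single sort key. The sketch below assumes that a higher resolution ranks first and that ties are broken by the higher (sharper) blur score; the ordering direction, the tuple layout, and the function name `sort_by_quality` are all assumptions made for illustration.

```python
def sort_by_quality(images):
    """Sort (pathname, width, height, blur_score) tuples, best candidate first.

    Higher resolution wins; ties are broken by the higher blur score.
    """
    ranked = sorted(images,
                    key=lambda item: (item[1] * item[2], item[3]),
                    reverse=True)
    return [pathname for pathname, _w, _h, _score in ranked]


candidates = [
    ("small.png", 64, 64, 250.0),
    ("large_blurry.png", 128, 128, 40.0),
    ("large_sharp.png", 128, 128, 300.0),
]
print(sort_by_quality(candidates))
# ['large_sharp.png', 'large_blurry.png', 'small.png']
```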