This section describes how to predict new, unseen data points with a previously fitted model.

The :func:`clustimage.clustimage.Clustimage.find` function finds images that are similar to those the fitted model has already seen. The search can be performed in two ways, as described below. In both cases, the adjacency matrix is first computed using the distance metric (default: Euclidean).

k-nearest neighbour
'''''''''''''''''''
The k-nearest neighbour approach searches for the *k* nearest neighbours of the input image using the (default) Euclidean distance metric. This approach does not return a P-value, only the distances to the closest neighbours. In case both *k* and *alpha* are specified, the union of detected samples is taken.
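
The idea behind the k-nearest neighbour search can be sketched with plain NumPy and SciPy. This is a minimal illustration under stated assumptions, not the internal implementation of ``find``: the helper name ``k_nearest`` and the toy feature values are hypothetical and only serve to show how distances to a query are ranked.

.. code:: python

	import numpy as np
	from scipy.spatial.distance import cdist

	def k_nearest(features, query, k=5, metric='euclidean'):
	    """Return indices and distances of the k reference rows closest to query."""
	    # Distances between the single query row and every reference row
	    dist = cdist(np.atleast_2d(query), features, metric=metric).ravel()
	    # Indices of the k smallest distances, ordered from near to far
	    idx = np.argsort(dist)[:k]
	    return idx, dist[idx]

	# Toy reference features (hypothetical values, e.g. two principal components)
	features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [6.0, 8.0]])
	query = np.array([0.0, 0.1])

	idx, dist = k_nearest(features, query, k=3)
	print(idx)
	# [0 1 2]
	print(dist)

Only distances are produced in this sketch, which mirrors why no P-value is available when searching with *k* alone.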

Example to find similar samples for an unseen dataset using the k-nearest neighbour approach.

.. code:: python

	from clustimage import Clustimage
	import numpy as np

	# Init with default settings
	cl = Clustimage(method='pca')
	# load example with digits
	X, y = cl.import_example(data='mnist')

	# Make 1st subset
	idx = np.unique(np.random.randint(0,X.shape[0], 25))
	X1 = X[idx, :]
	X = X[np.setdiff1d(range(0, X.shape[0]), idx), :]

	# Cluster dataset X
	results = cl.fit_transform(X)
	# Results are also stored in the object: cl.results
	cl.results.keys()
	# Scatter results
	cl.scatter(zoom=3, dotsize=50, figsize=(25, 15), legend=False, text=False)

	# Find images for 1st subset of images
	X1_results = cl.find(X1, k=5, alpha=None)
	# Make scatter
	cl.scatter(zoom=5, dotsize=100, text=False, figsize=(35, 20))

	# Collect the keys (skipping the first one) and print the columns for the first remaining key
	keys = list(X1_results.keys())[1:]
	print(X1_results.get(keys[0]).columns)
	# ['y_idx', 'distance', 'y_proba', 'labels', 'y_filenames', 'y_pathnames', 'x_pathnames']

	print(X1_results.get(keys[0])[['labels', 'distance','y_proba']])
	#    labels    distance  y_proba
	# 0       9  189.436546      NaN
	# 1       9  305.387050      NaN
	# 2       9  338.403554      NaN
	# 3       9  342.050496      NaN
	# 4       9  351.139465      NaN

	# Get the most frequently seen class label for each key
	for key in keys:
	    uiy, ycounts = np.unique(X1_results.get(key)['labels'], return_counts=True)
	    y_predict = uiy[np.argmax(ycounts)]
	    print('class:[%s] - %s' %(y_predict, key))

	# class:[9] - b2ea44d9-de55-421b-8bd1-6d13509533f5.png
	# class:[4] - 60dabaf0-7e1a-4c57-bb57-5d464fd2d8fb.png
	# class:[6] - dbc30522-5c83-4563-9c9f-40a86f24a091.png
	# class:[1] - deb11282-a992-4282-8212-ba62ef1e26fc.png
	# class:[3] - 29104a7c-775b-462f-84ba-e824c7c4b9c7.png
	# class:[9] - 45e6f5fd-4423-4743-850e-921914bfb9c9.png
	# class:[9] - 300a866e-5440-444a-b284-d2bfb1b1178b.png
	# class:[8] - 2dd0defc-d72a-4189-abae-bd33a94ee044.png
	# class:[7] - a308d13e-fa3e-4428-8b5e-edb26554d723.png
	# class:[2] - 6dc9c2b5-e1eb-4b6e-891a-db657c663013.png
	# class:[9] - 30477a7a-e7a0-44f2-8bb2-222719ebe12b.png
	# class:[5] - 50261737-f812-4665-b46f-cf8afb3cc88c.png
	# class:[0] - c83ab55a-2983-44a6-9e2e-4964dc13c1b0.png
	# class:[9] - 3eafd007-b6ed-4d17-a9db-3477854525df.png
	# class:[8] - 411c207e-4804-4e30-a1c5-546876e36c51.png
	# class:[4] - 661ac0f2-6f41-493a-bb3a-2c068c632d86.png
	# class:[9] - 9fc65ef1-ff93-4ac6-9cdb-f4605ee9661a.png
	# class:[7] - 6d9e017e-fc6a-424f-8f40-52995db771dc.png
	# class:[4] - dfc5cf51-2157-437f-b1ce-920100b74119.png
	# class:[4] - 507c365d-d35d-4623-985a-e9512c511d11.png
	# class:[0] - 058f3759-2608-490e-ac79-e5fde8d10f7e.png
	# class:[2] - a10dbae8-81f1-4613-8fd9-0bfa4815f1ad.png
	# class:[2] - a889af11-961c-42f6-8b34-3d0f7050c34c.png
	# class:[0] - 2e92d26c-3bf4-4c25-9d7e-4ceaa2e06d69.png
	# class:[4] - b7706e78-c653-4e26-9d7a-bcb512751526.png


Probability density fitting
'''''''''''''''''''''''''''
The probability density fitting method fits a model on the input features to determine the loc/scale/arg parameters across various theoretical distributions. In case of PCA, the input features are the principal components. The tested distributions are *['norm', 'expon', 'uniform', 'gamma', 't']*. The fitted distribution is the similarity distribution of the samples.

For each new (unseen) input image, the probability of similarity is computed across all images, and images with P <= *alpha* (lower bound) are returned. Note that the metric can be changed in this function, but this may lead to confusion because the results will no longer intuitively match the scatter plots, which are determined using the metric set in the ``fit_transform()`` function.
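
The steps above can be sketched with ``scipy.stats``. This is a rough illustration and not the code that ``find`` runs internally. Several choices are assumptions made for the example: the random toy features, selecting the best distribution by the lowest residual sum of squares against the histogram density, and using the lower-tail probability (CDF) as the P-value. The last lines also show how a union of the *k* and *alpha* selections could be formed.

.. code:: python

	import numpy as np
	from scipy import stats
	from scipy.spatial.distance import cdist, pdist

	rng = np.random.default_rng(0)
	# Toy reference features standing in for, e.g., principal components
	features = rng.normal(size=(200, 5))
	# Query: a slightly shifted copy of reference sample 0
	query = features[0] + 0.01

	# 1. Similarity distribution: all pairwise distances between reference samples
	pairwise = pdist(features, metric='euclidean')

	# 2. Fit each candidate distribution and keep the one with the lowest
	#    residual sum of squares against the empirical density (assumed criterion)
	density, edges = np.histogram(pairwise, bins=50, density=True)
	centers = (edges[:-1] + edges[1:]) / 2
	best_rss, best_dist, best_params = np.inf, None, None
	for name in ['norm', 'expon', 'uniform', 'gamma', 't']:
	    dist = getattr(stats, name)
	    params = dist.fit(pairwise)  # shape arguments (if any), loc, scale
	    rss = np.sum((density - dist.pdf(centers, *params)) ** 2)
	    if rss < best_rss:
	        best_rss, best_dist, best_params = rss, dist, params

	# 3. Lower-tail probability of each query-to-reference distance
	d_query = cdist(query[None, :], features, metric='euclidean').ravel()
	p_values = best_dist.cdf(d_query, *best_params)

	# 4. Keep references with P <= alpha, and separately the k nearest
	alpha, k = 0.05, 5
	idx_alpha = np.flatnonzero(p_values <= alpha)
	idx_k = np.argsort(d_query)[:k]

	# 5. Union of both selections, as when k and alpha are both specified
	idx_union = np.union1d(idx_alpha, idx_k)
	print(best_dist.name, len(idx_alpha), len(idx_union))

Small distances fall in the lower tail of the fitted distribution, so a low P-value here means that the reference sample is unusually close to the query.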

Example to find similar samples for an unseen dataset using probability density fitting.

.. code:: python

	from clustimage import Clustimage
	import numpy as np

	# Init with default settings
	cl = Clustimage(method='pca')
	# load example with digits
	X, y = cl.import_example(data='mnist')

	# Make 1st subset
	idx = np.unique(np.random.randint(0,X.shape[0], 25))
	X1 = X[idx, :]
	X = X[np.setdiff1d(range(0, X.shape[0]), idx), :]

	# Cluster dataset X
	results = cl.fit_transform(X)
	# Results are also stored in the object: cl.results
	cl.results.keys()
	# Scatter results
	cl.scatter(zoom=3, dotsize=50, figsize=(25, 15), legend=False, text=False)

	# Find images for 1st subset of images
	X1_results = cl.find(X1, alpha=0.05)
	# Make scatter
	cl.scatter(zoom=5, dotsize=100, text=False, figsize=(35, 20))

	# Collect the keys (skipping the first one) and print the columns for the first remaining key
	keys = list(X1_results.keys())[1:]
	print(X1_results.get(keys[0]).columns)
	# ['y_idx', 'distance', 'y_proba', 'labels', 'y_filenames', 'y_pathnames', 'x_pathnames']
	print(X1_results.get(keys[0])[['labels', 'distance','y_proba']])

	#     labels    distance   y_proba
	# 0        8  189.373756  0.000035
	# 1        8  305.290164  0.000822
	# 2        8  338.588849  0.001812
	# 3        8  341.933366  0.001956
	# 4        8  351.231864  0.002412
	# ..     ...         ...       ...
	# 57       8  506.617886  0.044141
	# 58       8  507.522983  0.044750
	# 59       8  508.852247  0.045657
	# 60       8  509.517201  0.046116
	# 61       8  511.862011  0.047765

	# Get the most frequently seen class label for each key
	for key in keys:
	    uiy, ycounts = np.unique(X1_results.get(key)['labels'], return_counts=True)
	    y_predict = uiy[np.argmax(ycounts)]
	    print('class:[%s] - %s' %(y_predict, key))

	# class:[9] - b2ea44d9-de55-421b-8bd1-6d13509533f5.png
	# class:[4] - 60dabaf0-7e1a-4c57-bb57-5d464fd2d8fb.png
	# class:[6] - dbc30522-5c83-4563-9c9f-40a86f24a091.png
	# class:[1] - deb11282-a992-4282-8212-ba62ef1e26fc.png
	# class:[3] - 29104a7c-775b-462f-84ba-e824c7c4b9c7.png
	# class:[9] - 45e6f5fd-4423-4743-850e-921914bfb9c9.png
	# class:[9] - 300a866e-5440-444a-b284-d2bfb1b1178b.png
	# class:[8] - 2dd0defc-d72a-4189-abae-bd33a94ee044.png
	# class:[7] - a308d13e-fa3e-4428-8b5e-edb26554d723.png
	# class:[2] - 6dc9c2b5-e1eb-4b6e-891a-db657c663013.png
	# class:[9] - 30477a7a-e7a0-44f2-8bb2-222719ebe12b.png
	# class:[5] - 50261737-f812-4665-b46f-cf8afb3cc88c.png
	# class:[0] - c83ab55a-2983-44a6-9e2e-4964dc13c1b0.png
	# class:[9] - 3eafd007-b6ed-4d17-a9db-3477854525df.png
	# class:[8] - 411c207e-4804-4e30-a1c5-546876e36c51.png
	# class:[4] - 661ac0f2-6f41-493a-bb3a-2c068c632d86.png
	# class:[9] - 9fc65ef1-ff93-4ac6-9cdb-f4605ee9661a.png
	# class:[7] - 6d9e017e-fc6a-424f-8f40-52995db771dc.png
	# class:[4] - dfc5cf51-2157-437f-b1ce-920100b74119.png
	# class:[4] - 507c365d-d35d-4623-985a-e9512c511d11.png
	# class:[0] - 058f3759-2608-490e-ac79-e5fde8d10f7e.png
	# class:[2] - a10dbae8-81f1-4613-8fd9-0bfa4815f1ad.png
	# class:[2] - a889af11-961c-42f6-8b34-3d0f7050c34c.png
	# class:[0] - 2e92d26c-3bf4-4c25-9d7e-4ceaa2e06d69.png
	# class:[4] - b7706e78-c653-4e26-9d7a-bcb512751526.png


More examples
'''''''''''''

.. code:: python

	from clustimage import Clustimage
	import matplotlib.pyplot as plt
	import pandas as pd

	# Init with default settings
	cl = Clustimage(method='pca')

	# load example with digits
	X, y = cl.import_example(data='mnist')

	# Cluster digits
	results = cl.fit_transform(X)

	# Let's search for the following image:
	plt.figure(); plt.imshow(X[0,:].reshape(cl.params['dim']), cmap='binary')

	# Find images
	results_find = cl.find(X[0:3,:], k=None, alpha=0.05)

	# Show whatever is found. This looks pretty good.
	cl.plot_find()
	cl.scatter(zoom=3)

	# Extract the first input image name
	filename = [*results_find.keys()][1]

	# Plot the probabilities
	plt.figure(figsize=(8,6))
	plt.plot(results_find[filename]['y_proba'],'.')
	plt.grid(True)
	plt.xlabel('samples')
	plt.ylabel('Pvalue')

	# Extract the cluster labels for the input image
	results_find[filename]['labels']

	# The majority of labels is for class 0
	print(results_find[filename]['labels'].value_counts())
	# 0    171
	# 7      8
	# Name: labels, dtype: int64


.. |figCF1| image:: ../figs/find_digit.png
.. |figCF2| image:: ../figs/find_in_pca.png
.. |figCF3| image:: ../figs/find_proba.png
.. |figCF4| image:: ../figs/find_results.png

.. table:: Find results for digits.
   :align: center

   +----------+----------+
   | |figCF1| | |figCF2| | 
   +----------+----------+
   | |figCF3| | |figCF4| | 
   +----------+----------+


**Example to find similar images based on the pathname as input.**

.. code:: python

        from clustimage import Clustimage
        import matplotlib.pyplot as plt

        # Init with default settings
        cl = Clustimage(method='pca')

        # load example with flowers
        pathnames = cl.import_example(data='flowers')

        # Cluster flowers
        results = cl.fit_transform(pathnames[1:])
        
        # Let's search for the following image:
        img = cl.imread(pathnames[10], colorscale=1)
        plt.figure(); plt.imshow(img.reshape((128,128,3)));plt.axis('off')

        # Find images
        results_find = cl.find(pathnames[10], k=None, alpha=0.05)

        # Show whatever is found. This looks pretty good.
        cl.plot_find()
        cl.scatter()


.. |figCF5| image:: ../figs/find_flowers.png
.. |figCF6| image:: ../figs/find_flowers_scatter.png

.. table:: Find results for the flower using pathname as input.
   :align: center

   +----------+----------+
   | |figCF5| | |figCF6| | 
   +----------+----------+
   

.. _clusteval: https://github.com/erdogant/clusteval


.. include:: add_bottom.add