# Easy clusters

Let’s start with finding clusters using a very easy dataset.

## Generate data

Install the required libraries:

```
pip install scatterd
pip install scikit-learn
```
```
# Imports
from sklearn.datasets import make_blobs
from scatterd import scatterd

# Generate random data
X, y = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)

# Scatter samples
scatterd(X[:,0], X[:,1], label=y, figsize=(6,6));
```

## Cluster Evaluation

Let's determine the optimal number of clusters.

```
# Import
from clusteval import clusteval

# Initialize
ce = clusteval(evaluate='silhouette')

# Fit
ce.fit(X)

# Plot
ce.plot()
ce.dendrogram()
ce.scatter(X)
```

## Cluster Comparison

```
import matplotlib.pyplot as plt
import clusteval

fig, axs = plt.subplots(2, 4, figsize=(25,10))

# dbindex
results = clusteval.dbindex.fit(X)
_ = clusteval.dbindex.plot(results, title='dbindex', ax=axs[0][0], visible=False)
axs[1][0].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][0].grid(True)

# silhouette
results = clusteval.silhouette.fit(X)
_ = clusteval.silhouette.plot(results, title='silhouette', ax=axs[0][1], visible=False)
axs[1][1].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][1].grid(True)

# derivative
results = clusteval.derivative.fit(X)
_ = clusteval.derivative.plot(results, title='derivative', ax=axs[0][2], visible=False)
axs[1][2].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][2].grid(True)

# dbscan
results = clusteval.dbscan.fit(X)
_ = clusteval.dbscan.plot(results, title='dbscan', ax=axs[0][3], visible=False)
axs[1][3].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][3].grid(True)

plt.show()
```

# Snake clusters

The definition of a cluster depends, among other things, on the aim. In this experiment I will evaluate the goodness of clusters when the aim is to find circular or snake-shaped clusters.

```
pip install scatterd
pip install scikit-learn
```
```
# Import some required libraries for this experiment
from sklearn.datasets import make_circles
from scatterd import scatterd

# Generate data
X, y = make_circles(n_samples=2000, factor=0.3, noise=0.05, random_state=4)
scatterd(X[:,0], X[:,1], label=y, figsize=(6,6));
```

## Ward distance

If we aim to determine snake clusters, it is best to use the `single` linkage type, as it hierarchically connects samples to the closest group. For demonstration purposes, I will first show what happens when `ward` linkage is used instead. Note that ward linkage uses cluster centroids to group samples.

```
# Load library
from clusteval import clusteval

# Initialize
ce = clusteval(cluster='agglomerative', linkage='ward', evaluate='silhouette')

# Fit
results = ce.fit(X)

# Plot
ce.plot()
ce.scatter(X)
```

As can be seen, the inner circle is detected as one cluster but the outer circle is divided into multiple smaller clusters.

## Single distance

If the aim is to cluster samples that are in a snake pattern, it is best to use the `single` linkage type as it hierarchically connects samples to the closest group.

```
from clusteval import clusteval

# Initialize
ce = clusteval(cluster='agglomerative', linkage='single', evaluate='silhouette')

# Fit
results = ce.fit(X)

# Plot
ce.plot()
ce.scatter(X)
```

As can be seen, with `single` linkage both circles are now detected as separate clusters.
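The same contrast can be reproduced outside `clusteval` with plain scikit-learn. The sketch below is my own illustration (using `AgglomerativeClustering` and `adjusted_rand_score`, not part of the clusteval API): an adjusted Rand index of 1.0 means the two circles were recovered exactly.

```python
# Compare ward vs single linkage on the two-circles dataset
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=2000, factor=0.3, noise=0.05, random_state=4)

for linkage in ['ward', 'single']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    # Agreement with the true circle membership (1.0 = perfect recovery)
    print(linkage, adjusted_rand_score(y, labels))
```

Single linkage scores near 1.0 here, while ward linkage scores poorly because it groups by centroid proximity rather than connectivity.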

# Different Density Clusters

Many cluster evaluation methods can easily determine the optimal number of clusters when the clusters are evenly distributed and the groups of samples have similar density. Here I will generate groups of samples with various densities to demonstrate the performance of the cluster evaluation methods.

## Generate Dataset

Let’s generate 5 groups of samples, each with a different standard deviation per cluster.

```
# Import libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Create random blobs
X, y = make_blobs(n_samples=200, n_features=2, centers=2, random_state=1)

# Make more blobs with different densities
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[200, 1], [1, 100]], size=[200,])

# Concatenate the data
X = np.concatenate((X, c, d, e),)
y = np.concatenate((y, [2]*len(c), [3]*len(d), [4]*len(e)),)

plt.figure(figsize=(15,10))
plt.scatter(X[:,0], X[:,1])
plt.grid(True); plt.xlabel('Feature 1'); plt.ylabel('Feature 2')
```

## Derivative Method

The derivative method detects an optimum of 4 clusters, but the original dataset consists of 5 clusters. When we make a scatterplot of the samples with the estimated cluster labels, it can be seen that this approach has trouble finding the correct labels for the smaller high-density groups. Note that the results did not change when using different clustering methods, such as 'agglomerative' and 'kmeans'.

```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='agglomerative', evaluate='derivative')

# Cluster evaluation
results = ce.fit(X)

# The cluster labels can be found in:
print(results['labx'])

# Make plots
ce.plot()
ce.scatter(X)
```

## Silhouette Method

The silhouette method also detects an optimum of 4 clusters. The scatterplot shows that it has trouble finding the high-density clusters. Note that similar results are detected when using other clustering methods.
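The silhouette criterion itself can also be computed directly with scikit-learn. The sketch below is a simplified illustration of the idea (using `KMeans` and `silhouette_score`; clusteval's internal procedure differs in details): score each candidate number of clusters and pick the maximum.

```python
# Pick the number of clusters that maximizes the silhouette score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5, random_state=1)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print('optimal k:', best_k)
```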

```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='agglomerative', evaluate='silhouette')

# Cluster evaluation
results = ce.fit(X)

# The cluster labels can be found in:
print(results['labx'])

# Make plots
ce.plot()
ce.scatter(X)
```

## DBindex method

The DBindex score lowers gradually with an increasing number of clusters and reaches its minimum at 22 clusters, which is (almost) the maximum of the default search space. The search space can be altered using `min_clust` and `max_clust` in the `clusteval.fit()` function. It is recommended to set `max_clust=10` to find the local minimum.
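The Davies-Bouldin index is also available directly in scikit-learn as `davies_bouldin_score` (lower is better). A minimal sketch of scanning the number of clusters with a capped search space, as my own illustration rather than clusteval's implementation:

```python
# Scan k and pick the minimum Davies-Bouldin score (lower is better)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5, random_state=1)

scores = {}
for k in range(2, 11):  # capped search space, analogous to max_clust=10
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)
print('optimal k:', best_k)
```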

```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='agglomerative', evaluate='dbindex')

# Cluster evaluation
results = ce.fit(X)

# The cluster labels can be found in:
print(results['labx'])

# Make plots
ce.plot()
ce.scatter(X)
ce.dendrogram()
```

Set `max_clust=10` to find the local minimum.

```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='agglomerative', evaluate='dbindex', max_clust=10)

# Cluster evaluation
results = ce.fit(X)

# Make plots
ce.plot()
ce.scatter(X)
ce.dendrogram()
```

## DBSCAN

The `eps` parameter is grid-searched together with a varying number of clusters. The global maximum is found at the expected 5 clusters. When we scatter the samples with the new cluster labels, it can be seen that this approach works pretty well. However, for the cluster with large deviation, many outliers are marked when using the default parameters.
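The idea of grid-searching `eps` can be sketched with plain scikit-learn. This is a simplified illustration under my own assumptions (a linear `eps` grid scored by silhouette), not clusteval's actual implementation:

```python
# Grid-search eps for DBSCAN, scoring each candidate by silhouette
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5, random_state=1)

best = (None, -1)
for eps in np.linspace(0.1, 2.0, 50):
    labels = DBSCAN(eps=eps).fit_predict(X)
    # Exclude runs that collapse into fewer than 2 clusters (noise is labeled -1)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters >= 2:
        score = silhouette_score(X, labels)
        if score > best[1]:
            best = (eps, score)

print('best eps:', best[0], 'silhouette:', best[1])
```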

```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='dbscan')

# Parameters can be changed for dbscan:
# ce = clusteval(cluster='dbscan', params_dbscan={'epsres' :100, 'norm':True})

# Cluster evaluation
results = ce.fit(X)

# [clusteval] >Fit using dbscan with metric: euclidean, and linkage: ward
# [clusteval] >Gridsearch across epsilon..
# [clusteval] >Evaluate using silhouette..
# 100%|██████████| 245/245 [00:11<00:00, 21.73it/s][clusteval] >Compute dendrogram threshold.
# [clusteval] >Optimal number clusters detected: .
# [clusteval] >Fin.

# The cluster labels can be found in:
print(results['labx'])

# Make plots
ce.plot()
ce.scatter(X)
```

## HDBSCAN

Install the `hdbscan` library first, because this approach is not installed by default with `clusteval`.

```
pip install hdbscan
```
```
# Load library
from clusteval import clusteval

# Initialize model
ce = clusteval(cluster='hdbscan')

# Cluster evaluation
results = ce.fit(X)

# Make plots
ce.plot()
ce.scatter(X)
```

## Comparison methods

A comparison of all four methods when using `kmeans` is shown underneath. In case of groups with varying density, `dbscan` is the best approach.

```
import matplotlib.pyplot as plt
import clusteval

fig, axs = plt.subplots(2, 4, figsize=(25,10))

# dbindex
results = clusteval.dbindex.fit(X)
_ = clusteval.dbindex.plot(results, title='dbindex', ax=axs[0][0], visible=False)
axs[1][0].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][0].grid(True)

# silhouette
results = clusteval.silhouette.fit(X)
_ = clusteval.silhouette.plot(results, title='silhouette', ax=axs[0][1], visible=False)
axs[1][1].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][1].grid(True)

# derivative
results = clusteval.derivative.fit(X)
_ = clusteval.derivative.plot(results, title='derivative', ax=axs[0][2], visible=False)
axs[1][2].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][2].grid(True)

# dbscan
results = clusteval.dbscan.fit(X)
_ = clusteval.dbscan.plot(results, title='dbscan', ax=axs[0][3], visible=False)
axs[1][3].scatter(X[:,0], X[:,1], c=results['labx']); axs[1][3].grid(True)

plt.show()
```