Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example
Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data. In other words, the samples used to train our model do not come with predefined categories. In comparison to other clustering algorithms, DBSCAN is particularly well suited for problems which require:
- Minimal domain knowledge to determine the input parameters (i.e. K in k-means and Dmin in hierarchical clustering)
- Discovery of clusters with arbitrary shapes
- Good efficiency on large databases
If you’re interested in reading up on DBSCAN, the original paper can be found here.
As is the case in most machine learning algorithms, the model’s behaviour is dictated by several parameters. In the proceeding article, we’ll touch on three.
- eps: Two points are considered neighbors if the distance between the two points is below the threshold epsilon.
- min_samples: The minimum number of neighbors a given point should have in order to be classified as a core point. It’s important to note that the point itself is included in the minimum number of samples.
- metric: The metric to use when calculating distance between instances in a feature array (i.e. euclidean distance).
The algorithm works by computing the distance between every point and all other points. We then place the points into one of three categories.
Core point: A point with at least min_samples points whose distance with respect to the point is below the threshold defined by epsilon.
Border point: A point that isn’t in close proximity to at least min_samples points but is close enough to one or more core point. Border points are included in the cluster of the closest core point.
Noise point: Points that aren’t close enough to core points to be considered border points. Noise points are ignored. That is to say, they aren’t part of any cluster.
Let’s take a look at how we could go about implementing DBSCAN in python. To get started, import the following libraries.
import numpy as np from sklearn.datasets.samples_generator import make_blobs from sklearn.neighbors import NearestNeighbors from sklearn.cluster import DBSCAN from matplotlib import pyplot as plt import seaborn as sns sns.set()
As opposed to importing data, we can use
scikit-learn to generate nicely defined clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0) plt.scatter(X[:,0], X[:,1])
As mentioned previously, we must provide a value for epsilon which defines the maximum distance between two points. The following paper, describes an approach for automatically determining the optimal value for Eps.
In layman’s terms, we find a suitable value for epsilon by calculating the distance to the nearest n points for each point, sorting and plotting the results. Then we look to see where the change is most pronounced (think of the angle between your arm and forearm) and select that as epsilon.
We can calculate the distance from each point to its closest neighbour using the
NearestNeighbors. The point itself is included in
kneighbors method returns two arrays, one which contains the distance to the closest
n_neighbors points and the other which contains the index for each of those points.
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
Next, we sort and plot results.
distances = np.sort(distances, axis=0)
distances = distances[:,1]
The optimal value for epsilon will be found at the point of maximum curvature.
We train our model, selecting
eps and setting
m = DBSCAN(eps=0.3, min_samples=5)
labels_ property contains the list of clusters and their respective points.
clusters = m.labels_
Then, we map every individual cluster to a color.
colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
The model classified the densely populated areas. As we can see, all the dark blue points were categorized as noise.
plt.scatter(X[:,0], X[:,1], c=vectorizer(clusters))
Unlike k-means, DBSCAN will figure out the number of clusters. DBSCAN works by determining whether the minimum number of points are close enough to one another to be considered part of a single cluster. DBSCAN is very sensitive to scale since epsilon is a fixed value for the maximum distance between two points.