
DBSCAN: What is it? When to use it? How to use it.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method used in modeling and machine learning. Before we go any further, we need to define what an "unsupervised" learning method is. Unsupervised learning methods are used when there is no clear target or outcome we are trying to predict. Instead, we group the data based on the similarity of the observations. Let's take Netflix as an example for clarity. Based on shows you've watched in the past, Netflix recommends shows to watch next. Anyone who has ever used Netflix has seen a recommendation screen like the following (yes, this picture is taken directly from my Netflix account, and if you've never seen Shameless, I suggest you start as soon as possible).

Since I've watched Shameless, Netflix recommends several similar shows. But where does Netflix get these recommendations from? Given that it is trying to predict the future, namely which show I'll watch next, Netflix has nothing to validate its predictions or recommendations against (no clear outcome). Instead, Netflix looks at other users who have also watched "Shameless" in the past, and at the content those users watched in addition to "Shameless". In this way, Netflix groups its users based on similar interests. This is exactly how unsupervised learning works: observations are simply grouped based on their similarities, in the hope of drawing accurate conclusions from the groups.

Back to DBSCAN. DBSCAN is a clustering method used in machine learning to separate clusters of high density from clusters of low density. Because DBSCAN is a density-based clustering algorithm, it does an excellent job of finding areas of the data that have a high density of observations versus areas that are sparse in observations. DBSCAN can also sort data into clusters of different shapes, another great benefit. DBSCAN works like this:

  • Divides the data set into n dimensions
  • For each point in the data set, DBSCAN forms an n-dimensional shape around that data point and then counts how many data points fall within that shape.
  • DBSCAN counts this shape as a cluster. DBSCAN iteratively expands the cluster by going through each individual point within the cluster and counting the number of other data points nearby. Take the graphic below as an example:

DBSCAN carries out the process above step by step. It first divides the data into n dimensions. After DBSCAN has done this, it starts at a random point (in this case, let's assume it was one of the red dots) and counts how many other points are nearby. DBSCAN continues this process until no more data points are nearby, and then it looks to form a second cluster.
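To make this cluster-expansion step concrete, here is a bare-bones sketch of the process in plain Python/NumPy. It is meant for intuition only (real projects should use scikit-learn's implementation, shown later in this article), and the function name and structure are my own, not a standard API:

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Bare-bones DBSCAN sketch: returns a cluster id per point, -1 for noise."""
    labels = np.full(len(X), -1)           # -1 = noise / not yet assigned
    visited = np.zeros(len(X), dtype=bool)
    cluster_id = 0

    def region_query(i):
        # Indices of all points within eps of point i (Euclidean distance)
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        seeds = region_query(i)
        if len(seeds) < min_samples:
            continue                        # i is not a core point; stays noise
        labels[i] = cluster_id              # start a new cluster at core point i
        while seeds:                        # expand the cluster point by point
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # newly reached point joins the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)   # j is also core: keep growing
        cluster_id += 1
    return labels
```

The `visited` check is what keeps the expansion from looping forever: each point's neighborhood is only counted once.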

As you may have noticed from the graphic, there are a couple of parameters and specifications that we need to give DBSCAN before it does its work. The two parameters we need to provide are:

  • What is the minimum number of data points required to identify a single cluster?
  • How far apart can a point be from the next point within the same cluster?

Referring again to the graphic, epsilon is the radius used to test the distance between data points. If a point falls within the epsilon distance of another point, those two points are in the same cluster.

Additionally, the minimum required number of points is set to 4 in this scenario. When running through each data point, DBSCAN forms a cluster as long as it finds 4 points within epsilon distance of one another.

IMPORTANT: In order for a point to be considered a "core point", it must have at least the minimum number of points within its epsilon distance. So the visualization actually only has TWO core points. Read the documentation here and specifically look at the min_samples parameter.
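To see this in code, here is a small example with made-up 2D points (eps and min_samples match the scenario above; the data is hypothetical). Only the four tightly packed points have at least 4 neighbors, counting themselves, within epsilon of one another, so only they are core points, and the three stragglers come out labeled as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: four tightly packed points plus three stragglers
X = np.array([[1, 2], [2, 2], [2, 3], [1, 3], [8, 7], [8, 8], [25, 80]])

db = DBSCAN(eps=3, min_samples=4).fit(X)

print(db.core_sample_indices_)  # [0 1 2 3] -- only the dense points are core
print(db.labels_)               # [ 0  0  0  0 -1 -1 -1] -- label -1 means noise
```

Note that scikit-learn counts the point itself toward min_samples.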

You will also notice that the blue dot in the graphic does not belong to any cluster. DBSCAN does NOT necessarily assign every data point to a cluster, which makes it ideal for handling outliers in the data set. Let's examine the following graphic:

The picture on the left shows a more traditional clustering method, such as K-Means, that does not take multidimensionality into account, while the picture on the right shows how DBSCAN can contort the data into different shapes and dimensions in order to find similar clusters. In the right image, we also see that the points along the outer edge of the data set are not classified, suggesting that they are outliers among the data.
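This contrast is easy to reproduce on a classic toy data set of two interleaving half-moons. A minimal sketch (the eps value is illustrative, picked for this noise level):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters that are dense but not spherical
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means cuts the moons apart with a straight boundary
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the density and recovers each moon as its own cluster
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```

Plotting the two label sets side by side reproduces the picture above: K-Means draws a straight line through both moons, while DBSCAN traces each moon's shape.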

Advantages of DBSCAN:

  • Can better separate high-density clusters from low-density clusters in a given data set.
  • Is great at dealing with outliers in the data set.

Disadvantages of DBSCAN:

  • Does not work well with clusters of varying densities. While DBSCAN does a good job of separating high-density clusters from low-density clusters, it struggles with clusters of similar density.
  • Struggles with high-dimensional data. I know, I have been saying throughout this article how great DBSCAN is at contorting the data into different dimensions and shapes. However, DBSCAN can only go so far; when given data with too many dimensions, it begins to suffer.

Below, I describe how to implement DBSCAN in Python. Then I explain the metrics used to evaluate your DBSCAN model's performance.

DBSCAN implementation in Python

1. Assign the data to our X values
2. Instantiate our DBSCAN model. In the code below, epsilon = 3, and min_samples is the minimum number of points required to form a cluster.
3. Save the labels created by DBSCAN
4. Identify which points make up our "core points"
5. Calculate the number of clusters
6. Calculate the Silhouette Score
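Here is a sketch of those six steps using scikit-learn. The original data set is not specified, so I substitute a synthetic one from make_blobs (with hand-picked, well-separated centers), and min_samples = 4 follows the scenario discussed earlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Step 1: assign the data to our X values (synthetic stand-in for real data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [12, 12], [24, 0]],
                  cluster_std=0.8, random_state=42)

# Step 2: instantiate our DBSCAN model with epsilon = 3 and min_samples = 4
db = DBSCAN(eps=3, min_samples=4).fit(X)

# Step 3: save the labels created by DBSCAN (-1 marks noise points)
labels = db.labels_

# Step 4: identify which points make up our "core points"
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[db.core_sample_indices_] = True

# Step 5: calculate the number of clusters (the noise label -1 is not a cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Step 6: calculate the Silhouette Score (needs at least 2 distinct labels)
score = silhouette_score(X, labels)

print("Number of clusters:", n_clusters)
print("Silhouette Score: %0.3f" % score)
```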

Metrics to measure DBSCAN performance:

Silhouette Score: The Silhouette Score is calculated using the mean intra-cluster distance between points AND the mean nearest-cluster distance. For example, a cluster with a lot of data points that are very close together (high density) AND that is far away from the nearest cluster (suggesting the cluster is very unique compared to its nearest neighbor) will have a strong silhouette. Silhouette Scores range from -1 to 1, with -1 being the worst possible score and 1 being the best. Silhouette values of 0 indicate overlapping clusters.
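For reference, the silhouette of a single point is given by the standard formula below, where a is the mean distance from the point to the other points in its own cluster and b is the mean distance from the point to the points in the nearest neighboring cluster:

```latex
s = \frac{b - a}{\max(a, b)}
```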

Inertia: Inertia measures the within-cluster sum of squares (the sum of squares being the sum of all squared residuals). Inertia is used to measure how internally coherent the clusters are. The lower the inertia, the better. However, it is important to note that inertia relies heavily on the assumption that the clusters are convex (spherical in shape). DBSCAN does not necessarily divide data into spherical clusters, so inertia is not a good metric for evaluating DBSCAN models (which is why I did not include inertia in the code above). Inertia is more commonly used in other clustering methods, such as K-Means clustering.
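For completeness, here is how you would read inertia off a K-Means model in scikit-learn (synthetic data, purely for illustration; DBSCAN exposes no such attribute):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.inertia_)  # within-cluster sum of squared distances to each centroid
```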

Other resources:

Naftali Harris's blog is a tremendous additional resource.