Machine Learning “Advent Calendar” Day 10: DBSCAN in Excel

It's day 10 of my machine learning “advent calendar”. Thank you very much for your support.

I've been creating these Google Sheets files for years. They have evolved little by little. But when it comes time to publish, it always takes hours to reorganize everything, clean up the layout, and make it readable.

Today we move to DBSCAN.

DBSCAN does not learn parametric models

Like LOF, DBSCAN does not have a parametric model. There are no formulas, no rules, no centroids to store, nothing compact to reuse later.

To use it, we need the entire dataset, because the density structure depends on all points.

Its official name is Density-Based Spatial Clustering of Applications with Noise.

However, please note that this “density” is not a Gaussian density.

It is a count-based notion of density. Just: “how many neighbors live near me?”

What makes DBSCAN special?

As its name suggests, DBSCAN does two things at the same time:

  • It finds clusters.
  • It marks anomalies (points that do not belong to any cluster).

This is exactly why we introduce the algorithms in this order.

  • k-means and GMM are clustering models. They output compact objects (centroids for k-means, means and variances for GMM).
  • Isolation Forest and LOF are pure anomaly detection models. Their only goal is to flag anything out of the ordinary.
  • DBSCAN sits in between. It achieves both clustering and anomaly detection, based solely on the concept of neighborhood density.

A small dataset to keep things intuitive

We use the same small dataset as for LOF: 1, 2, 3, 7, 8, 12.

Looking at these numbers, we can already see two compact groups:
one around 1, 2, 3, another around 7, 8, and 12 living alone.

DBSCAN captures exactly this intuition.

A 3-step summary

DBSCAN asks 3 quick questions for each point:

  1. How many neighbors does it have within a small radius (eps)?
  2. Does it have enough neighbors to be a core point (minPts)?
  3. Once the core points are known, which connected group does it belong to?

An overview of the DBSCAN algorithm in 3 steps:

DBSCAN for Excel – All images by author

Let's go through it one step at a time.

3-step DBSCAN

Now that you understand the concepts of density and neighborhood, DBSCAN becomes much easier to explain.
Everything the algorithm does fits into 3 easy steps.

Step 1 – Count neighbors

The objective is to find out how many points are adjacent to each point.

Take a small radius, eps.

For each point, look at all the other points and mark those whose distance is at most eps.
These are its neighbors.

This gives you a first idea of density.
Points with many neighbors sit in dense regions;
points with few neighbors sit in sparse regions.

For a one-dimensional toy example like ours, a natural choice is:
eps = 2

Draw a small interval of radius 2 around each point.

Why is it called eps?

The name eps comes from the Greek letter ε (epsilon), which in mathematics traditionally denotes a small quantity or a small radius around a point.
In DBSCAN, eps is literally the “small radius of a neighborhood”.

It answers the question:
how far around each point do we look?

So the first step in Excel is a pairwise distance matrix; then count how many neighbors each point has within eps.
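The same step can be sketched outside the spreadsheet. Here is a minimal Python version of step 1, assuming eps = 2 and, as in the standard DBSCAN definition, a neighborhood that includes the point itself:

```python
# Step 1 of DBSCAN on the toy dataset: build the pairwise distance
# matrix and count, for each point, how many points fall within eps.
# The neighborhood includes the point itself.

points = [1, 2, 3, 7, 8, 12]
eps = 2

# Pairwise distances (1-D data, so just absolute differences).
dist = [[abs(a - b) for b in points] for a in points]

# Neighbor count per point: how many points lie within eps.
neighbor_counts = [sum(1 for d in row if d <= eps) for row in dist]

for p, n in zip(points, neighbor_counts):
    print(f"point {p}: {n} neighbors within eps={eps}")
```

The counts already reveal the structure: 1, 2, 3 each see three points, 7 and 8 see two, and 12 sees only itself.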

Step 2 – Core points and density connections

Now that we know the neighborhoods from step 1, we apply minPts to decide which points are core.

Here minPts means the minimum number of points.

This is the minimum number of neighbors a point must have (within the eps radius) to be considered a core point.

A point is core if its eps-neighborhood (the point itself included) contains at least minPts points.
Otherwise, it ends up as a border or noise point.

With eps = 2 and minPts = 2, point 12 is the only non-core point.

Once you know the core points, just check which points are density-reachable from them. Points belong to the same group if they can be reached by hopping from one core point to another within eps.

In Excel, this can be represented as a simple connectivity table showing which points are linked through core neighborhoods.

This connectivity is what DBSCAN uses to form the clusters in step 3.
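As a sketch of step 2 in Python (same toy data, eps = 2, minPts = 2, neighborhoods counting the point itself):

```python
# Step 2: decide which points are core, then build the connectivity
# table. Two points are directly linked when they lie within eps of
# each other and at least one of them is a core point.

points = [1, 2, 3, 7, 8, 12]
eps, min_pts = 2, 2

def neighborhood(i):
    """Indices of points within eps of point i (point i included)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

core = [len(neighborhood(i)) >= min_pts for i in range(len(points))]
core_points = [p for p, c in zip(points, core) if c]
print("core points:", core_points)  # 12 is the only non-core point

# Connectivity table, the Python analogue of the Excel linkage sheet.
linked = [[abs(a - b) <= eps and (core[i] or core[j])
           for j, b in enumerate(points)]
          for i, a in enumerate(points)]
```

For example, 1 and 3 are linked (distance 2, both core), while 8 and 12 are not (distance 4, beyond eps).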

Step 3 – Assign cluster labels

The goal is to turn the connections into actual clusters.

Once the connectivity matrix is ready, the clusters appear naturally.
DBSCAN simply groups all connected points together.

We use a very intuitive rule to give each group a simple and reproducible name:

The cluster label is the smallest point in its connected group.

For example:

  • Group {1, 2, 3} becomes cluster 1.
  • Group {7, 8} becomes cluster 7.
  • A point like 12, with no core neighbor, is labeled noise.

This is exactly what you want to display in Excel using formulas.
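Putting the three steps together, here is a minimal, self-contained Python sketch of the whole procedure (eps = 2, minPts = 2, and the smallest-point labeling rule described above):

```python
# Mini-DBSCAN on the toy dataset: group density-connected points and
# label each cluster with its smallest member; points reachable from
# no core point are labeled "noise".

points = [1, 2, 3, 7, 8, 12]
eps, min_pts = 2, 2

def within_eps(a, b):
    return abs(a - b) <= eps

# Core points: eps-neighborhood (self included) holds >= minPts points.
core = {p for p in points
        if sum(within_eps(p, q) for q in points) >= min_pts}

labels = {}
for p in sorted(points):          # visit smallest points first
    if p in labels:
        continue
    if p not in core:
        labels[p] = "noise"       # relabeled later if a cluster reaches it
        continue
    # Grow the cluster from this core point; its label is p,
    # the smallest member, because we visit points in sorted order.
    cluster, frontier = {p}, [p]
    while frontier:
        c = frontier.pop()
        for q in points:
            if q not in cluster and within_eps(c, q):
                cluster.add(q)
                if q in core:     # only core points keep expanding
                    frontier.append(q)
    for q in cluster:
        labels[q] = p

print(labels)
```

On this dataset the output matches the intuition above: 1, 2, 3 get label 1; 7, 8 get label 7; and 12 is noise.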

Final thoughts

DBSCAN is great for teaching the concept of local density.

There are no probabilities, no Gaussian formulas, and no estimation steps.
Just distance, neighbors, and a small radius.

However, this simplicity also has its limits.
DBSCAN uses one fixed radius for everyone, so it cannot adapt when the dataset contains clusters of different scales.

HDBSCAN keeps the same intuition but removes the single fixed radius, in effect considering all scales at once.
This makes it much more robust and much closer to how humans naturally recognize clusters.

With DBSCAN, it's a natural time to step back and summarize the unsupervised models we've been considering, as well as a few others we haven't covered yet.

Now is a good time to link these algorithms together and draw a little map of where each fits in the broader context.

  • Distance-based models
    K-means, K-medoids, and hierarchical clustering (HAC) work by comparing distances between points or groups.
  • Density-based models
    Mean shift and Gaussian mixture models (GMMs) estimate smooth densities and extract clusters from that structure.
  • Neighborhood-based models
    DBSCAN, OPTICS, HDBSCAN, and LOF define clusters and anomalies from local connectivity rather than global distance.
  • Graph-based models
    Spectral clustering, Louvain, and Leiden rely on structure within the similarity graph.

Each group reflects a different philosophy of what a “cluster” is.
The choice of algorithm often depends more on the shape of the data, the scale of its density, and the type of structure one expects to detect than on theory.

Here's how these methods are connected to each other.

  • K-means is generalized to GMM by replacing hard assignments with probability densities.
  • DBSCAN is generalized to OPTICS when the need for a single eps value is removed.
  • OPTICS naturally leads to HDBSCAN, turning density connectivity into a stable hierarchy.
  • HAC and spectral clustering both build clusters from pairwise distances, but spectral adds a graph-based view.
  • LOF uses the same neighborhood as DBSCAN, but only for anomaly detection.

There are many other models, but this will give you a sense of the landscape and where DBSCAN fits into it.

The Landscape of Distance-Based Unsupervised Learning – Images by Author

Tomorrow we will continue our advent calendar with a more “classic” and widely used model in everyday machine learning.
Thank you for continuing this journey. See you tomorrow.


