AI Sprouts • Intermediate • ⏱️ 25 min read

Clustering: How AI Finds Patterns Without Labels 🔵

Every algorithm you've studied so far has been supervised: you feed it labelled examples and it learns to predict labels for new ones. But what happens when you have a mountain of data and no labels at all?

This is where unsupervised learning comes in — and clustering is one of its most powerful tools.


🔍 What Is Clustering?

Clustering is the task of grouping data points together so that points in the same group (cluster) are more similar to each other than to points in other groups.

Crucially, nobody tells the algorithm how many groups there are or what they represent. It finds the structure on its own.

Think of it like sorting a pile of mixed sweets with your eyes closed, just using touch. You'd group them by feel — round ones together, long ones together, chewy ones apart from hard ones — without anyone defining the categories in advance.

🤯

Clustering is used by astronomers to group galaxies by shape and composition, by geneticists to identify disease subtypes, and by Spotify to generate personalised playlists — all without anyone labelling the data by hand.


📍 K-Means: The Most Famous Clustering Algorithm

K-Means is simple, fast, and surprisingly powerful. Here's the full algorithm in plain English:

  1. Choose K — decide in advance how many clusters you want.
  2. Place K centroids at random — a centroid is just the "centre" of a cluster. A common choice is to start from K randomly chosen data points.
  3. Assign each data point to its nearest centroid — every point belongs to whichever centroid it's closest to.
  4. Recalculate each centroid — move each centroid to the average position of all points assigned to it.
  5. Repeat steps 3 and 4 — keep reassigning and recalculating until the assignments stop changing.
[Figure: three iterations of K-Means — random centroids, first assignment, final converged clusters in different colours. K-Means iterates between assigning points to centroids and recalculating centroid positions until clusters stabilise.]

The analogy: imagine placing K magnets on a map of customer addresses. Each customer is attracted to the nearest magnet. Then you move each magnet to the centre of its customers. Repeat until the magnets stop moving.
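The five numbered steps translate almost line for line into code. Here's a deliberately minimal 2-D sketch in plain Python (the function name and toy points are ours, not from any library); real implementations add smarter initialisation such as k-means++ and vectorised distance maths:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal 2-D K-Means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: random initial centroids
    for _ in range(iters):
        # step 3: assign each point to whichever centroid it's closest to
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # step 4: recalculate each centroid as the mean of its cluster
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 5: stop once nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# two obvious blobs, one near (0, 0) and one near (10, 10)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, cls = kmeans(pts, k=2)
```

On these six points the algorithm converges in a couple of iterations to one centroid per blob, whichever points the random initialisation happened to pick.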

🤔
Think about it:

K-Means depends on the initial random placement of centroids. Two runs on the same data can produce different clusters. How would you decide which result is better? What would you even measure?


🔢 Choosing K: The Elbow Method

A practical problem: how do you choose K? If you set K equal to the number of data points, every point is its own cluster — perfect but useless. If K = 1, everything is one big blob — also useless.

The elbow method helps: run K-Means for K = 1, 2, 3, … N and plot how much "error" (within-cluster variance) decreases as K increases. More clusters always reduce error, but there's usually a K where the improvement starts to slow dramatically — the "elbow" in the curve. That's a good candidate for the right K.

It's not a hard rule, but it gives you a principled starting point.
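To make "error" concrete, here's a sketch (plain Python, hand-picked toy points of our own) of the within-cluster sum of squares that the elbow method plots. With two natural blobs, going from K = 1 to K = 2 collapses the error; that sudden drop is what the elbow plot reveals:

```python
def wcss(clusters):
    """Within-cluster sum of squares: for each cluster, sum the squared
    distance of every point to the cluster's own mean."""
    total = 0.0
    for c in clusters:
        mx = sum(p[0] for p in c) / len(c)
        my = sum(p[1] for p in c) / len(c)
        total += sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in c)
    return total

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(wcss([pts]))               # K = 1: one forced blob, large error
print(wcss([pts[:3], pts[3:]]))  # K = 2: the two natural blobs, tiny error
```

In a real elbow plot you would run K-Means for each K, record the WCSS of the resulting clustering, and look for the K where the curve flattens.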


🌿 Hierarchical Clustering

K-Means requires you to fix K upfront. Hierarchical clustering does not.

Instead, it builds a tree (called a dendrogram) of clusters by either:

  • Agglomerative (bottom-up): start with every point as its own cluster. Repeatedly merge the two most similar clusters until everything is one big cluster.
  • Divisive (top-down): start with everything in one cluster. Repeatedly split the most dissimilar cluster until every point is its own cluster.

The resulting dendrogram lets you "cut" at any level to get any number of clusters, which is extremely flexible. The downside: hierarchical clustering is computationally expensive on large datasets.
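The agglomerative variant can be sketched in a few lines. This toy version (names and points are ours) uses single linkage, meaning the distance between two clusters is the distance between their closest pair of members; production libraries compute the same merges far more efficiently:

```python
def agglomerative(points, target_k):
    """Bottom-up hierarchical clustering: start with every point as its
    own cluster, repeatedly merge the two closest clusters (single
    linkage), and stop once target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (11, 10)]
print(agglomerative(pts, 2))
```

Recording the order of merges, rather than stopping at a fixed `target_k`, is exactly what builds the dendrogram: each merge is one join in the tree.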


🌍 Real-World Applications

Clustering turns up everywhere:

| Application | What Gets Clustered |
| --- | --- |
| Customer segmentation | Customers grouped by purchase behaviour |
| Image compression | Pixel colours reduced to K representative colours |
| Document organisation | Articles grouped by topic without manual labels |
| Anomaly detection | Outliers that don't fit any cluster = suspicious |
| Genetics | Gene expression patterns grouped by disease subtype |
| Social networks | Communities identified by connection patterns |


⚠️ Limitations of Clustering

Clustering is powerful but has real limitations:

  • Interpreting clusters is hard — the algorithm groups the data, but you still have to work out what each group means.
  • K-Means assumes roughly spherical clusters — it struggles with elongated, irregular, or nested cluster shapes.
  • Sensitive to scale — features with large numerical ranges dominate distance calculations; always normalise features before clustering.
  • No ground truth — without labels, evaluating quality is subjective. Two analysts might draw different conclusions from the same clustering.
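The scale problem has a cheap fix. One common option is min-max normalisation, sketched here (the helper name and toy customer data are ours): rescale every feature to [0, 1] before computing any distances.

```python
def min_max_normalise(rows):
    """Rescale every feature (column) to the [0, 1] range so no feature
    dominates distance calculations purely because of its units."""
    cols = list(zip(*rows))
    ranges = [(min(c), max(c)) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, (lo, hi) in zip(row, ranges))
        for row in rows
    ]

# raw distances here are almost entirely income: pounds dwarf years
customers = [(25, 20_000), (40, 21_000), (30, 90_000)]
print(min_max_normalise(customers))
```

After normalising, a 15-year age gap and a £70,000 income gap contribute on comparable scales, so the clusters reflect both features rather than just the one with the biggest numbers.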
🤯

K-Means was first proposed by Stuart Lloyd in 1957 (as an internal Bell Labs technical note) and independently re-described by Forgy in 1965. Despite its age, it remains one of the most widely used algorithms in data science.