🔵 Clustering & K-Means
Finding groups without labels
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Clustering is about finding structure when no answer key has been provided. Instead of predicting a known label, the model searches for natural groupings based on similarity. K-means does this with a simple loop: assign points to the nearest center, move the centers, and repeat.
How this lesson fits
This module is where the course shifts from explicit rules to learned patterns. Instead of telling the machine exactly what to do in every case, we give it examples, define success, and let it infer a decision rule from the data.
The big question
How can a machine study examples, extract useful patterns, and make predictions on cases it has never seen before?
Why You Should Care
Students often assume all machine learning depends on labeled data. Clustering breaks that assumption and shows that one major use of ML is exploratory: revealing patterns, segments, or hidden organization that humans did not label ahead of time.
Where this is used today
- ✓ Customer segmentation, where companies group users with similar behaviors or needs
- ✓ Color quantization in image compression, where many shades are grouped into a smaller palette
- ✓ Exploratory analysis that groups documents, search results, or biological samples by similarity
Think of it like this
Imagine sorting a mixed box of LEGO bricks without instructions. You could organize them by color, by size, or by shape. There may be several reasonable groupings, and the point is to choose one that reveals useful structure.
Easy mistake to make
K-means does not uncover one final, objective truth hiding in the data. Different values of K and different definitions of similarity can produce different but still useful groupings.
By the end, you should be able to:
- Explain why clustering is unsupervised and what information is missing compared with labeled training
- Describe the alternating assignment and centroid-update steps in K-means
- Interpret the elbow method as a practical but imperfect way to choose K
Think about this first
If you had to sort a pile of mixed objects with no labels, what clues would you rely on first, and how would you decide whether two objects belong together?
Words we will keep using
Clustering: Finding Hidden Groups
Clustering is like sorting a bucket of mixed LEGOs when you've lost the instruction manual. You don't know what the groups are supposed to be, so you organize them by what looks similar—color, size, or shape.
K-Means Algorithm
- Guess: Drop K center points (centroids) randomly on the map.
- Assign: Every data point joins the team of the closest centroid.
- Update: Each team finds its new center of gravity and moves the centroid there.
- Repeat: Alternate the assign and update steps until the centroids stop moving.
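The loop above can be sketched in a few lines of plain Python. Everything here is illustrative: the tiny 2-D dataset is two hand-made blobs, and the "Guess" step is simplified to fixed starting centroids rather than random drops.

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# move each centroid to the mean of its cluster, repeat.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assign: every point joins the team of the closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update: each centroid moves to its team's center of gravity
        for i, members in enumerate(clusters):
            if members:  # keep a centroid in place if its cluster is empty
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, clusters

# Two obvious blobs and two deliberately bad starting guesses
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
final_centroids, final_clusters = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)  # centroids settle at the blob means, ~(1.33, 1.33) and ~(8.33, 8.33)
```

With clearly separated blobs the loop converges after the first update; on messier data, different starting guesses can lead to different final clusterings, which is why real implementations try several random initializations.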
Step-by-Step K-Means
Press start and watch the two repeating moves: assign points, then move centroids.
Choosing K — The Elbow Method
How many clusters should you use? The "Elbow Method" is a rule of thumb: keep adding clusters until the improvement slows down. It's like eating pizza—the first slice is amazing, the fifth one is just okay.
Red dot = elbow at K=3. Adding more clusters beyond this gives diminishing returns.
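The elbow calculation can be sketched with a small from-scratch k-means. The quantity on the y-axis is the inertia: the total squared distance from each point to its nearest centroid. The dataset below is two hand-made blobs, and the initialization naively takes the first K points as centroids, so this is a toy sketch rather than a robust implementation.

```python
# Elbow method sketch: run k-means for several K and watch the inertia.
# Expect a sharp drop from K=1 to K=2 (matching the two blobs), then
# only small improvements — that bend is the "elbow".

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iters=10):
    centroids = points[:k]  # naive init: first k points as centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

def inertia(points, centroids):
    # Total squared distance from each point to its nearest centroid
    return sum(min(dist2(p, c) for c in centroids) for p in points)

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
for k in range(1, 5):
    print(k, round(inertia(points, kmeans(points, k)), 2))
```

Running this shows the inertia collapsing once K reaches 2 and barely budging afterward: exactly the diminishing-returns shape the pizza analogy describes.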
The elbow is a heuristic, not a guarantee. Other ways to choose K:
- Silhouette score: asks whether points are close to their own cluster and far from other clusters
- Gap statistic: compares your clustering result to what random data would look like
- Domain knowledge: sometimes you already know how many groups make sense
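The silhouette score can be sketched for a single point. For a point p, let a be its mean distance to the rest of its own cluster and b its mean distance to the nearest other cluster; the silhouette is s = (b - a) / max(a, b), which lands in [-1, 1]. The clusters below are invented toy values, and a full score would average s over every point.

```python
# Silhouette sketch for one point: high s means the point sits snugly
# in its own cluster and far from the nearest other cluster.
from math import dist

def silhouette(p, own, others):
    a = sum(dist(p, q) for q in own if q != p) / (len(own) - 1)  # mean intra-cluster distance
    b = min(sum(dist(p, q) for q in c) / len(c) for c in others)  # nearest other cluster
    return (b - a) / max(a, b)

cluster_a = [(1, 1), (1, 2), (2, 1)]
cluster_b = [(8, 8), (9, 9)]
s = silhouette((1, 1), cluster_a, [cluster_b])
print(round(s, 2))  # → 0.91: (1, 1) clearly belongs to its own cluster
```

A score near 1 says the clustering separates this point cleanly; scores near 0 or below suggest the point sits between clusters, a sign that K or the similarity measure may need rethinking.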