
“K-Means Clustering in Spark: Finding Your Perfect Clusters with Davies-Bouldin and WSS”

Ajay Gurav
4 min read · Sep 28, 2024


Let’s talk about clustering, but not the kind where you sit in a group chat arguing about what to eat. We’re talking K-Means Clustering — an essential tool in machine learning for organizing data into meaningful groups or “clusters.” And to make sure you find the right number of clusters, we’ll use two metrics: the Davies-Bouldin Score and WSS (Within-Cluster Sum of Squares).

What’s K-Means Clustering? (aka “Data Group Therapy”)

Imagine you have a bunch of data points, each representing something like customers, products, or even movies. K-Means helps you split those points into groups (or clusters) based on how similar they are. It’s like sorting your friends into movie-night regulars, hiking buddies, and brunch pals.
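To make the "sorting into groups" idea concrete, here is a minimal plain-Python sketch of the classic K-Means loop: assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. This is not the Spark API, just the core idea on a toy dataset. The function names (`assign`, `update`, `kmeans`) and the "use the first k points as starting centers" rule are assumptions made for illustration; real libraries typically use smarter seeding.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Empty cluster: leave its centroid where it was.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iter=100):
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:  # centroids stopped moving: converged
            break
        centroids = moved
    return assign(points, centroids), centroids


# Toy data: three points near (1, 1.5) and three near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even with a poor starting guess (both initial centers sit in the same corner), the loop pulls the two centers apart within a few passes on this toy data.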

The big question is: How many clusters should you have? Too few, and you’re lumping everything together. Too many, and you’re over-complicating things. Enter our trusty metrics!

Step 1: WSS (Within-Cluster Sum of Squares) — Keeping Things Tight

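WSS is like checking how tight your clusters are. The smaller the WSS, the closer your data points are to the center of their respective clusters. Think of it as minimizing the mess in each group. One catch: WSS keeps shrinking as you add more clusters, so the useful signal is where the drop starts to level off, not the lowest number you can reach.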

How WSS Works:

  • For each data point, you calculate the distance from the point to…
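As a rough illustration of the arithmetic, here is a small plain-Python sketch that treats WSS as the sum of squared Euclidean distances from each point to the center of its own cluster. The function name `wss`, the toy points, and the hand-picked labels and centers are all assumptions made for this example; it is not the Spark API.

```python
def wss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        c = centroids[lab]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total


# Toy data (hypothetical): two tight pairs, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: one cluster centered on the overall mean (5, 1).
wss_k1 = wss(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k = 2: one cluster per pair, centered at (0, 1) and (10, 1).
wss_k2 = wss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(wss_k1)  # 104.0
print(wss_k2)  # 4.0
```

Going from one cluster to two drops WSS from 104 to 4 here, which is the kind of sharp improvement you look for when comparing values of k.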



Written by Ajay Gurav

Senior Data Scientist \ AI Engineer
