
The Machine Learning “Advent Calendar” Day 4: k-Means in Excel



Welcome to Day 4 of the Machine Learning Advent Calendar.

During the first three days, we explored distance-based models for supervised learning: k-NN, the Nearest Centroid classifier, and LDA/QDA/GNB.

In all these models, the idea was the same: we measure distances, and we decide the output based on the nearest points or nearest centers.

Today, we stay in this same family of ideas. But we use the distances in an unsupervised way: k-means.

Now, one question for those who already know this algorithm: k-means looks more similar to which model, the k-NN classifier, or the Nearest Centroid classifier?

And if you remember, for all the models we have seen so far, there was not really a “training” phase or hyperparameter tuning.

  • For k-NN, there is no training at all.
  • For LDA, QDA, or GNB, training is just computing means and variances. And there are also no real hyperparameters.

Now, with k-means, we are going to implement a training algorithm that finally looks like “real” machine learning.

We start with a tiny 1D example. Then we move to 2D.

Goal of k-means

In the training dataset, there are no initial labels.

The goal of k-means is to create meaningful labels by grouping points that are close to each other.

Let us look at the illustration below. You can clearly see two groups of points. Each centroid (the red square and the green square) is in the middle of its cluster, and every point is assigned to the closest one.

This gives a very intuitive picture of how k-means discovers structure using only distances.

And here, k is the number of centers we try to find.

k-means in Excel – image by author

Now, let us answer the question: Which algorithm is k-means closer to, the k-NN classifier or the Nearest Centroid classifier?

Do not be fooled by the k in k-NN and k-means.
They do not mean the same thing:

  • in k-NN, k is the number of neighbors, not the number of classes;
  • in k-means, k is the number of centroids.

K-means is much closer to the Nearest Centroid classifier.

Both models are represented by centroids, and for a new observation we simply compute the distance to each centroid to decide which one it belongs to.

The difference, of course, is that in the Nearest Centroid classifier, we already know the centroids because they come from labeled classes.

In k-means, we do not know the centroids. The whole goal of the algorithm is to discover suitable ones directly from the data.

The business problem is completely different: instead of predicting labels, we are trying to create them.

And in k-means, the value of k (the number of centroids) is unknown. So it becomes a hyperparameter that we can tune.

k-means with Only One Feature

We start with a tiny 1D example so that everything is visible on one axis. And we will choose the values in such a trivial way that we can instantly see the two centroids.

1, 2, 3, 11, 12, 13

Yes: 2 and 12.

But how would the computer know? The machine will “learn” by guessing step by step.

Here comes Lloyd’s algorithm.

We will implement it in Excel with the following loop:

  1. choose initial centroids
  2. compute the distance from each point to each centroid
  3. assign each point to the nearest centroid
  4. recompute the centroids as the average of the points in each cluster
  5. repeat steps 2 to 4 until the centroids no longer move

1. Choose initial centroids

Pick two initial centers. They should be within the data range (between 1 and 13).

k-means in Excel – image by author

2. Compute distances

For each data point x:

  • compute the distance to c_1,
  • compute the distance to c_2.

Typically, we use absolute distance in 1D.

We now have two distance values for each point.

k-means in Excel – image by author

3. Assign clusters

For each point:

  • compare the two distances,
  • assign the point to the cluster with the smaller distance (1 or 2).

In Excel, this is simple IF- or MIN-based logic.
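In Python terms, the same comparison might look like this (the two distances for one point are hypothetical values):

```python
def nearest_cluster(d1, d2):
    """Return 1 if the point is closer to centroid 1, else 2 (ties go to cluster 1)."""
    return 1 if d1 <= d2 else 2

# e.g. a point at distance 3 from centroid 1 and distance 8 from centroid 2
print(nearest_cluster(3, 8))  # → 1
```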

k-means in Excel – image by author

4. Compute the new centroids

For each cluster:

  • take the points assigned to that cluster,
  • compute their average,
  • this average becomes the new centroid.

k-means in Excel – image by author

5. Iterate until convergence

Now in Excel, thanks to the formulas, we can simply paste the new centroid values into the cells of the initial centroids.

The update is immediate, and after doing this a few times, you will see that the values stop changing. That is when the algorithm has converged.

k-means in Excel – image by author

We can also record each step in Excel, so we can see how the centroids and clusters evolve over time.

k-means in Excel – image by author

k-means with Two Features

Now let us use two features. The process is exactly the same; we simply use the Euclidean distance in 2D.
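A 2D sketch under the same caveats (hypothetical points and arbitrary starting centroids, since the article's data lives in the spreadsheet), now using `math.dist` for the Euclidean distance:

```python
import math

def assign(points, centroids):
    """Assign each 2D point to its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[i].append(p)
    return clusters

def update(clusters):
    """New centroid = coordinate-wise mean of each cluster."""
    return [tuple(sum(coord) / len(pts) for coord in zip(*pts)) for pts in clusters]

# Hypothetical 2D points forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # arbitrary starting guesses
for _ in range(10):
    new_centroids = update(assign(points, centroids))
    if new_centroids == centroids:  # converged: centroids no longer move
        break
    centroids = new_centroids
print(centroids)
```

Each pass of the loop corresponds to one copy-paste of the new centroid values in the sheet.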

You can either copy and paste the new centroids as values (with just a few cells to update),

k-means in Excel – image by author

or you can display all the intermediate steps to see the full evolution of the algorithm.

k-means in Excel – image by author

Visualizing the Moving Centroids in Excel

To make the process more intuitive, it is helpful to create plots that show how the centroids move.

Unfortunately, Excel and Google Sheets are not ideal for this kind of visualization, and the data tables quickly become complex to organize.

If you want to see a full example with detailed plots, you can read this article I wrote almost three years ago, where each step of the centroid movement is shown clearly.

k-means in Excel – image by author

As you can see in this picture, the worksheet became quite disorganized, especially compared to the earlier table, which was very straightforward.

k-means in Excel – image by author

Choosing the optimal k: The Elbow Method

So now, it is possible to try k = 2 and k = 3 in our case and compute the inertia for each one. Then we simply compare the values.

We can even begin with k = 1.

For each value of k:

  • we run k-means until convergence,
  • compute the inertia, which is the sum of squared distances between each point and its assigned centroid.

In Excel:

  • For each point, take the distance to its centroid and square it.
  • Sum all these squared distances.
  • This gives the inertia for this k.

For example:

  • for k = 1, the centroid is just the overall mean of x1 and x2,
  • for k = 2 and k = 3, we take the converged centroids from the sheets where you ran the algorithm.
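As a Python sketch with hypothetical 2D points (the article's exact values live in the spreadsheet), where the k = 2 centroids are the converged cluster means:

```python
import math

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Hypothetical 2D points forming two clear groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

# k = 1: the single centroid is just the overall mean
mean = tuple(sum(coord) / len(points) for coord in zip(*points))
print(inertia(points, [mean]))

# k = 2: the converged centroids (the two cluster means)
print(inertia(points, [(4/3, 4/3), (25/3, 25/3)]))
```

For this toy data the inertia collapses from roughly 150 at k = 1 to under 3 at k = 2, which is exactly the kind of drop the elbow method looks for.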

Then we can plot inertia as a function of k, for example for k = 1, 2, 3.

For this dataset:

  • from 1 to 2, the inertia drops a lot,
  • from 2 to 3, the improvement is much smaller.

The “elbow” is the value of k after which the decrease in inertia becomes marginal. In the example, it suggests that k = 2 is sufficient.

k-means in Excel – image by author

Conclusion

K-means is a very intuitive algorithm once you see it step by step in Excel.

We start with simple centroids, compute distances, assign points, update the centroids, and repeat. Now, we can see how “machines learn”, right?

Well, this is only the beginning; we will see that different models “learn” in really different ways.

And here is the transition for tomorrow’s article: the unsupervised version of the Nearest Centroid classifier is indeed k-means.

So what would be the unsupervised version of LDA or QDA? We will answer this in the next article.

k-means – image by author



Angela Shi
