The Machine Learning “Advent Calendar” Day 2: k-NN Classifier in Excel

After Day 1, which introduced the k-NN Regressor and the idea of prediction based on distance, we now look at the k-NN Classifier.

The principle is the same, but classification allows us to introduce several useful variants, such as Radius Nearest Neighbors, Nearest Centroid, multi-class prediction, and probabilistic distance models.

So we will first implement the k-NN classifier, then discuss how it can be improved.

You can use this Excel/Google sheet while reading this article to better follow all the explanations.

k-NN classifier in Excel – image by author

Titanic survival dataset

We will use the Titanic survival dataset, a classic example where each row describes a passenger with features such as class, sex, age, and fare, and the goal is to predict whether the passenger survived.

Titanic survival dataset – image by author – CC0: Public Domain license

Principle of k-NN for Classification

The k-NN classifier is so similar to the k-NN regressor that I could almost have written a single article to explain them both.

In fact, when we look for the k nearest neighbors, we do not use the target value y at all, so its nature (continuous or categorical) does not matter at that stage.

BUT, there are still some interesting facts about how classifiers (binary or multi-class) are built, and how the features can be handled differently.

We begin with the binary classification task, and then move on to multi-class classification.

One Continuous Feature for Binary Classification

So, very quickly, we can do the same exercise for one continuous feature, with this dataset.

For the value of y, we usually use 0 and 1 to distinguish the two classes. But as you will notice, this can be a source of confusion.

k-NN classifier in Excel – One continuous feature – image by author

Now, think about it: 0 and 1 are also numbers, right? So we can do exactly the same process as if we were doing a regression.

That’s right. Nothing changes in the computation, as you see in the following screenshot. And you can of course try to modify the value of the new observation yourself.

k-NN classifier in Excel – prediction for one continuous feature – image by author

The only difference is how we interpret the result. When we take the “average” of the neighbors’ y values, this number is understood as the probability that the new observation belongs to class 1.

So in reality, the “average” is not the right interpretation: the value is the proportion of class 1 among the neighbors.

We can also manually create this plot, to show how the predicted probability changes over a range of x values.

Traditionally, to avoid ending up with a 50 percent probability, we choose an odd value for k, so that we can always decide with majority voting.

k-NN classifier in Excel – predictions for one continuous feature – image by author
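If you prefer code to spreadsheet formulas, here is a minimal NumPy sketch of the same computation. The ages and labels are made up for illustration, and the function name knn_predict_proba is mine, not a library call:

```python
import numpy as np

# Toy training data (made-up values): one continuous feature (age)
# and a binary target (0 = did not survive, 1 = survived).
ages = np.array([4.0, 18.0, 25.0, 32.0, 40.0, 55.0, 63.0])
survived = np.array([1, 1, 1, 0, 0, 0, 0])

def knn_predict_proba(x_new, x_train, y_train, k=3):
    """Proportion of class 1 among the k nearest neighbors of x_new."""
    distances = np.abs(x_train - x_new)      # 1D distance = absolute difference
    nearest = np.argsort(distances)[:k]      # indices of the k closest points
    return y_train[nearest].mean()           # the Excel "average" of the neighbors' y

# Predicted probability over a range of ages, as in the plot above.
for age in (10, 30, 50):
    print(age, knn_predict_proba(age, ages, survived, k=3))
```

With an odd k, the returned proportion can never be exactly 0.5, which is why majority voting always gives a decision.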

Two Continuous Features for Binary Classification

If we have two features, the operation is again almost the same as for the k-NN regressor.

k-NN classifier in Excel – two continuous features – image by author
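The only change compared to the one-feature case is the distance formula. Here is a minimal sketch with made-up age and fare values, using the Euclidean distance over the two features (as with the k-NN regressor, the features should be rescaled first in practice):

```python
import numpy as np

# Toy training data (made-up values): two continuous features, age and fare.
X = np.array([[22.0, 7.25], [38.0, 71.3], [26.0, 7.9], [35.0, 53.1], [54.0, 51.9]])
y = np.array([0, 1, 1, 1, 0])

def knn_predict_proba_2d(x_new, X_train, y_train, k=3):
    """Proportion of class 1 among the k nearest neighbors (Euclidean distance)."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

print(knn_predict_proba_2d(np.array([30.0, 20.0]), X, y, k=3))
```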

One Feature for Multi-class Classification

Now, let’s take an example of three classes for the target variable y.

Here we can see that we cannot use the notion of “average” anymore, since the number that represents the category is only a label, not a true numeric value. It is better to call them “category 0”, “category 1”, and “category 2”, and to predict by majority vote (or by the proportion of each category) among the k neighbors.

k-NN classifier in Excel – multi-class classifier – image by author
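As a sketch, here is the multi-class version: instead of averaging, we count the votes among the k neighbors. The data and the function name are made up for illustration:

```python
import numpy as np
from collections import Counter

# Toy training data (made-up values): one feature, three categories 0, 1, 2.
x_train = np.array([1.0, 2.0, 3.5, 5.0, 6.5, 8.0])
y_train = np.array([0, 0, 1, 1, 2, 2])

def knn_predict_multiclass(x_new, x_train, y_train, k=3):
    """Majority vote among the k nearest neighbors, plus the per-category proportions."""
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    votes = Counter(int(label) for label in y_train[nearest])
    proportions = {label: count / k for label, count in votes.items()}
    return max(votes, key=votes.get), proportions

print(knn_predict_multiclass(4.0, x_train, y_train, k=3))
```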

From k-NN to Nearest Centroids

When k Becomes Too Large

Now, let’s make k large. How large? As large as possible.

Remember, we also did this exercise with the k-NN regressor, and the conclusion was that if k equals the total number of observations in the training dataset, then the k-NN regressor becomes the simple mean estimator: it predicts the global average of y for every new observation.

For the k-NN classifier, it is almost the same. If k equals the total number of observations, then for each class, we will get its overall proportion inside the entire training dataset.

Some people, from a Bayesian point of view, call these proportions the priors!

But this does not help us much to classify a new observation, because these priors are the same for every point.
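A tiny sketch of this k = N case, with made-up labels, shows why it is uninformative:

```python
import numpy as np

# With k = N, the "prediction" is just the overall class proportions: the priors.
y_train = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
classes, counts = np.unique(y_train, return_counts=True)
priors = {int(c): float(n) / len(y_train) for c, n in zip(classes, counts)}
print(priors)  # the same answer for every new observation
```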

The Creation of Centroids

So let us take one more step.

For each class, we can also group together all the feature values x that belong to that class, and compute their average.

These averaged feature vectors are what we call centroids.

What can we do with these centroids?

We can use them to classify a new observation.

Instead of recalculating distances to the entire dataset for every new point, we simply measure the distance to each class centroid and assign the class of the nearest one.

With the Titanic survival dataset, we can start with a single feature, age, and compute the centroids for the two classes: passengers who survived and passengers who did not.

k-NN classifier in Excel – Nearest Centroids – image by author

Now, it is also possible to use multiple continuous features.

For example, we can use the two features age and fare.

k-NN classifier in Excel – Nearest Centroids – image by author
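Here is a minimal NumPy sketch of the same idea with two features, using made-up values (including one missing age) and my own function name:

```python
import numpy as np

# Toy training data (made-up values): age and fare, with one missing age (np.nan).
X = np.array([[22.0, 7.25],
              [38.0, 71.3],
              [np.nan, 7.9],
              [35.0, 53.1],
              [54.0, 51.9],
              [2.0, 21.1]])
y = np.array([0, 1, 1, 1, 0, 1])

# One centroid per class: the per-feature mean of that class's rows.
# np.nanmean simply ignores missing values when computing each mean.
centroids = {int(label): np.nanmean(X[y == label], axis=0) for label in np.unique(y)}

def nearest_centroid_predict(x_new):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    distances = {label: np.linalg.norm(x_new - c) for label, c in centroids.items()}
    return min(distances, key=distances.get)

print(centroids)
print(nearest_centroid_predict(np.array([30.0, 40.0])))
```

For reference, scikit-learn provides sklearn.neighbors.NearestCentroid, which implements the same idea (though you would need to impute missing values first).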

And we can discuss some important characteristics of this model:

  • The scale of the features is important, as we discussed before for the k-NN regressor.
  • Missing values are not a problem here: when we compute the centroids per class, each mean is calculated from the available (non-empty) values.
  • We went from the most “complex” and “largest” model (in the sense that the model is the entire training dataset, which we have to store) to one of the simplest models (we only store one value per feature and per class: the centroids).

From highly nonlinear to naively linear

But now, can you think of one major drawback?

Whereas the basic k-NN classifier is highly nonlinear, the Nearest Centroid method is strictly linear.

In this 1D example, the two centroids are simply the average x values of class 0 and class 1, and the decision boundary is just the midpoint between them.

So instead of a piecewise, jagged boundary that depends on the exact location of many training points (as in k-NN), we obtain a straight cutoff that only depends on two numbers.

This illustrates how Nearest Centroids compresses the entire dataset into a simple and very linear rule.

k-NN classifier in Excel – Nearest Centroids linearity – image by author
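A short sketch, with made-up values, makes the midpoint boundary explicit:

```python
import numpy as np

# Made-up 1D example: the decision boundary is the midpoint of the two centroids.
x_train = np.array([4.0, 18.0, 25.0, 32.0, 40.0, 55.0, 63.0])
y_train = np.array([1, 1, 1, 0, 0, 0, 0])

centroid_1 = x_train[y_train == 1].mean()   # average x of class 1
centroid_0 = x_train[y_train == 0].mean()   # average x of class 0
boundary = (centroid_0 + centroid_1) / 2

# Any x below the boundary is closer to the smaller centroid, any x above it
# is closer to the larger one: a single straight cutoff, not a jagged one.
print(centroid_1, centroid_0, boundary)
```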

A note on regression: why centroids do not apply

Now, this kind of improvement is not possible for the k-NN regressor. Why?

In classification, each class forms a group of observations, so computing the average feature vector for each class makes sense, and this gives us the class centroids.

But in regression, the target y is continuous. There are no discrete groups, no class boundaries, and therefore no meaningful way to compute “the centroid of a class”.

A continuous target has infinitely many possible values, so we cannot group observations by their y value to form centroids.

The only possible “centroid” in regression would be the global mean, which corresponds to the case k = N in k-NN regressor.

And this estimator is far too simple to be useful.

In short, Nearest Centroids Classifier is a natural improvement for classification, but it has no direct equivalent in regression.

Further statistical improvements

What else can we do with the basic k-NN classifier?

Average and variance

With the Nearest Centroids Classifier, we used the simplest statistic, the average. A natural reflex in statistics is to add the variance as well.

So now the distance is no longer Euclidean but the Mahalanobis distance. Using it, we obtain a probability based on the distribution characterized by the mean and variance of each class.
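As a simplified sketch (not the full Mahalanobis distance, which uses the whole covariance matrix), we can scale each feature by its per-class variance. The data and the function name are made up for illustration:

```python
import numpy as np

# Toy training data (made-up values): two features, binary target.
X = np.array([[22.0, 7.25], [38.0, 71.3], [26.0, 7.9], [35.0, 53.1],
              [54.0, 51.9], [2.0, 21.1], [40.0, 8.05]])
y = np.array([0, 1, 1, 1, 0, 1, 0])

# Per class: mean and variance of each feature (a diagonal covariance).
stats = {int(label): (X[y == label].mean(axis=0), X[y == label].var(axis=0))
         for label in np.unique(y)}

def mahalanobis_like_predict(x_new):
    """Assign the class with the smallest variance-scaled distance to its mean."""
    distances = {label: np.sqrt((((x_new - mean) ** 2) / var).sum())
                 for label, (mean, var) in stats.items()}
    return min(distances, key=distances.get)

print(mahalanobis_like_predict(np.array([30.0, 40.0])))
```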

Categorical Features handling

For categorical features, we cannot compute averages or variances. For the k-NN regressor, we saw that it was possible to use one-hot encoding or ordinal/label encoding, but the scale then becomes important and is not easy to determine.

Here, we can do something equally meaningful, in terms of probabilities: we can count the proportions of each category inside a class.

These proportions act exactly like probabilities, describing how likely each category is within each class.

This idea is directly linked to models such as Categorical Naive Bayes, where classes are characterized by frequency distributions over the categories.
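A minimal sketch of these per-class category proportions, with made-up ticket classes and survival labels:

```python
import numpy as np

# Toy categorical feature (made-up values): ticket class, with a binary target.
pclass = np.array(["1st", "3rd", "3rd", "2nd", "1st", "3rd", "2nd", "3rd"])
y = np.array([1, 0, 0, 1, 1, 0, 0, 0])

# For each class of y, the proportion of each category of the feature.
for label in np.unique(y):
    values, counts = np.unique(pclass[y == label], return_counts=True)
    proportions = {v: float(c) / counts.sum() for v, c in zip(values, counts)}
    print(f"class {label}: {proportions}")
```

These are, roughly, the quantities that sklearn.naive_bayes.CategoricalNB estimates (with additional smoothing).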

Weighted Distance

Another direction is to introduce weights, so that closer neighbors count more than distant ones. In scikit-learn, there is the “weights” argument that allows us to do so.
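Here is a minimal usage sketch with made-up data, showing the weights argument of KNeighborsClassifier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made-up values).
X = np.array([[22.0, 7.25], [38.0, 71.3], [26.0, 7.9], [35.0, 53.1], [54.0, 51.9], [2.0, 21.1]])
y = np.array([0, 1, 1, 1, 0, 1])

# weights="uniform" is the plain majority vote; weights="distance" weights each
# neighbor by the inverse of its distance, so closer neighbors count more.
clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X, y)
print(clf.predict([[30.0, 40.0]]), clf.predict_proba([[30.0, 40.0]]))
```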

We can also switch from “k neighbors” to a fixed radius around the new observation, which leads to radius-based classifiers.

Radius Nearest Neighbors

Sometimes, we find the following kind of graphic used to explain the k-NN classifier. But actually, with a radius drawn like this, it reflects more the idea of Radius Nearest Neighbors.

One advantage is the explicit control of the neighborhood. It is especially interesting when the distance has a concrete meaning, such as a geographical distance.

Radius Nearest Neighbors classifier – image by author

But the drawback is that you have to choose a suitable radius in advance.

By the way, this notion of radius nearest neighbors is also suitable for regression.
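A minimal sketch with made-up data, using scikit-learn's RadiusNeighborsClassifier:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy training data (made-up values).
X = np.array([[22.0, 7.25], [38.0, 71.3], [26.0, 7.9], [35.0, 53.1], [54.0, 51.9], [2.0, 21.1]])
y = np.array([0, 1, 1, 1, 0, 1])

# All training points within the given radius vote; outlier_label tells the model
# what to predict when no training point falls inside the radius.
clf = RadiusNeighborsClassifier(radius=25.0, outlier_label=0).fit(X, y)
print(clf.predict([[30.0, 40.0]]))
```

Scikit-learn also offers RadiusNeighborsRegressor for the regression counterpart mentioned above.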

Recap of different variants

All these small changes give different models, each one refining the basic idea of comparing neighbors: a more complex definition of distance, or a control parameter that lets us use local neighbors or a more global characterization of the neighborhood.

We will not explore all these models here. I simply cannot help going a bit too far when a small variation naturally leads to another idea.

For now, consider this as an announcement of the models we will implement later this month.

Variants and improvements of k-NN classifier – image by author

Conclusion

In this article, we explored the k-NN classifier from its most basic form to several extensions.

The central idea does not really change: a new observation is classified by looking at how similar it is to the training data.

But this simple idea can take many different shapes.

With continuous features, similarity is based on geometric distance.
With categorical features, we look instead at how often each category appears among the neighbors.

When k becomes very large, the entire dataset collapses into just a few summary statistics, which leads naturally to the Nearest Centroids Classifier.

Understanding this family of distance-based and probability-based ideas helps us see that many machine-learning models are simply different ways of answering the same question:

Which class does this new observation most resemble?

In the next articles, we will continue exploring density-based models, which can be understood as global measures of similarity between observations and classes.


