
The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel



With Logistic Regression, we learned how to classify into two classes.

Now, what happens if there are more than two classes?

Softmax Regression is simply the multiclass extension of this idea. And we will discuss this model for Day 14 of my Machine Learning “Advent Calendar” (follow this link to get all the information about the approach and the files I use).

Instead of one score, we now create one score per class. Instead of one probability, we apply the Softmax function to produce probabilities that sum to 1.

Understanding the Softmax model

Before training the model, let us first understand what the model is.

Softmax Regression is not about optimization yet.
It is first about how predictions are computed.

A tiny dataset with 3 classes

Let us use a small dataset with one feature x and three classes.

As we said before, the target variable y should not be treated as numerical.
It represents categories, not quantities.

A common way to represent this is one-hot encoding, where each class is represented by its own indicator.
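To make this concrete, here is a minimal Python sketch of one-hot encoding (the x values and labels below are illustrative, not the exact workbook data):

# One-hot encoding: each class gets its own 0/1 indicator column.
# The data below is a made-up example, not the workbook's exact dataset.
x = [0.5, 1.0, 2.5, 3.0, 4.5, 5.0]
y = [0, 0, 1, 1, 2, 2]  # class labels: 0, 1 or 2

one_hot = [[1 if label == k else 0 for k in range(3)] for label in y]

for xi, label, row in zip(x, y, one_hot):
    print(f"x={xi}  y={label}  ->  {row}")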

From this point of view, Softmax Regression can be seen as three Logistic Regressions running in parallel, one per class.

Small datasets are ideal for learning.
You can see every formula, every value, and how each part of the model contributes to the final result.

Softmax regression in Excel – All images by author

Description of the Model

So what is the model, exactly?

Score per class

In logistic regression, the model score is a simple linear expression: score = a * x + b.

Softmax Regression does exactly the same, but with one score per class:

score_0 = a0 * x + b0
score_1 = a1 * x + b1
score_2 = a2 * x + b2

At this stage, these scores are just real numbers.
They are not probabilities yet.
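As a small Python sketch (the coefficient values are arbitrary placeholders, not trained values):

# Three linear scores, one per class, for a single input x.
# Coefficients are arbitrary placeholders, not trained values.
a = [-1.0, 0.2, 1.5]   # a0, a1, a2
b = [2.0, 0.5, -3.0]   # b0, b1, b2

x = 2.0
scores = [a[k] * x + b[k] for k in range(3)]
print(scores)  # plain real numbers, not probabilities yet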

Turning scores into probabilities: the Softmax step

Softmax converts the three scores into three probabilities. Each probability is positive, and all three sum to 1.

The computation is direct:

  1. Exponentiate each score
  2. Compute the sum of all exponentials
  3. Divide each exponential by this sum

This gives us p0, p1, and p2 for each row.

These values represent the model's confidence in each class.
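In Python, the same three steps can be sketched as follows (shifting by the maximum score before exponentiating is a standard numerical-stability trick, not something the recipe above requires):

import math

def softmax(scores):
    # 1. Exponentiate each score (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    # 2. Compute the sum of all exponentials
    total = sum(exps)
    # 3. Divide each exponential by this sum
    return [e / total for e in exps]

p0, p1, p2 = softmax([1.0, 2.0, -0.5])
print(p0, p1, p2, p0 + p1 + p2)  # three positive values summing to 1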

At this point, the model is fully defined.
Training the model will simply consist in adjusting the coefficients a_k and b_k so that these probabilities match the observed classes as well as possible.

Softmax regression in Excel – All images by author

Visualizing the Softmax model

To recap, we have:

  • one linear score per class
  • a Softmax step that turns these scores into probabilities

Once the coefficients have been found, we can visualize the model behavior.

To do this, we take a range of input values, for example x from 0 to 7, and we compute score_0, score_1, score_2 and the corresponding probabilities p0, p1, p2.

Plotting these probabilities gives three smooth curves, one per class.

Softmax regression in Excel – All images by author
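Outside Excel, the same picture can be reproduced with a few lines of Python (a sketch assuming matplotlib is installed; the coefficients are placeholders, since the workbook's trained values are not reproduced here):

import math
import matplotlib.pyplot as plt

# Placeholder coefficients (in the workbook these come from gradient descent).
a = [-2.0, 0.0, 2.0]
b = [4.0, 1.0, -4.0]

xs = [i / 10 for i in range(71)]  # x from 0 to 7
curves = [[], [], []]
for x in xs:
    scores = [a[k] * x + b[k] for k in range(3)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    for k in range(3):
        curves[k].append(exps[k] / total)

for k in range(3):
    plt.plot(xs, curves[k], label=f"p{k}")
plt.xlabel("x")
plt.ylabel("probability")
plt.legend()
plt.show()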

The result is very intuitive.

For small values of x, the probability of class 0 is high.
As x increases, this probability decreases, while the probability of class 1 increases.
For larger values of x, the probability of class 2 becomes dominant.

At every value of x, the three probabilities sum to 1.
The model does not make abrupt decisions; instead, it expresses how confident it is in each class.

This plot makes the behavior of Softmax Regression easy to understand.

  • You can see how the model transitions smoothly from one class to another
  • Decision boundaries correspond to intersections between probability curves
  • The model logic becomes visible, not abstract

This is one of the key benefits of building the model in Excel:
you do not just compute predictions, you can see how the model thinks.

Now that the model is defined, we need a way to evaluate how good it is, and a method to improve its coefficients.

Both steps reuse ideas we already saw with Logistic Regression.

Evaluating the model: Cross-Entropy Loss

Softmax Regression uses the same loss function as Logistic Regression.

For each data point, we look at the probability assigned to the correct class, and we take the negative logarithm:

loss = -log(p_true_class)

If the model assigns a high probability to the correct class, the loss is small.
If it assigns a low probability, the loss becomes large.

In Excel, this is very simple to implement.

We select the correct probability based on the value of y, and apply the logarithm:

loss = -LN( CHOOSE(y + 1, p0, p1, p2) )

Finally, we compute the average loss over all rows.
This average loss is the quantity we want to minimize.

Softmax regression in Excel – All images by author
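The same computation in Python, where indexing p[y] plays the role of Excel's CHOOSE(y + 1, ...) (the probabilities below are placeholders):

import math

# Each row: (p0, p1, p2 for that row, true class y). Values are placeholders.
rows = [
    ([0.80, 0.15, 0.05], 0),
    ([0.30, 0.60, 0.10], 1),
    ([0.10, 0.30, 0.60], 2),
]

# loss = -log(probability assigned to the correct class)
losses = [-math.log(p[y]) for p, y in rows]
print(losses)
print(sum(losses) / len(losses))  # the average loss we want to minimize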

Computing residuals

To update the coefficients, we start by computing residuals, one per class.

For each row:

  • residual_0 = p0 - (1 if y = 0, else 0)
  • residual_1 = p1 - (1 if y = 1, else 0)
  • residual_2 = p2 - (1 if y = 2, else 0)

In other words, for the correct class, we subtract 1.
For the other classes, we subtract 0.

These residuals measure how far the predicted probabilities are from what we expect.
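A minimal Python sketch of the residual computation (probabilities are placeholders):

# residual_k = p_k - 1 for the correct class, p_k - 0 for the others.
rows = [
    ([0.80, 0.15, 0.05], 0),
    ([0.30, 0.60, 0.10], 1),
]

for p, y in rows:
    residuals = [p[k] - (1 if y == k else 0) for k in range(3)]
    print(residuals)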

Computing the gradients

The gradients are obtained by combining the residuals with the feature values.

For each class k:

  • the gradient of ak is the average of residual_k * x
  • the gradient of bk is the average of residual_k

In Excel, this is implemented with simple formulas such as SUMPRODUCT and AVERAGE.

At this point, everything is explicit:
you see the residuals, the gradients, and how each data point contributes.
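A Python equivalent of those SUMPRODUCT and AVERAGE formulas might look like this (data and probabilities are placeholders):

# Gradient of a_k: average of residual_k * x. Gradient of b_k: average of residual_k.
xs = [0.5, 2.0, 4.5]
ps = [[0.80, 0.15, 0.05], [0.30, 0.60, 0.10], [0.10, 0.30, 0.60]]
ys = [0, 1, 2]

n = len(xs)
grad_a = [0.0, 0.0, 0.0]
grad_b = [0.0, 0.0, 0.0]
for x, p, y in zip(xs, ps, ys):
    for k in range(3):
        r = p[k] - (1 if y == k else 0)
        grad_a[k] += r * x / n
        grad_b[k] += r / n

print(grad_a)
print(grad_b)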

Softmax regression in Excel – All images by author

Updating the coefficients

Once the gradients are known, we update the coefficients using gradient descent.

This step is identical to what we saw before, for Logistic Regression and Linear Regression.
The only difference is that we now update six coefficients instead of two.

To visualize learning, we create a second sheet with one row per iteration:

  • the current iteration number
  • the six coefficients (a0, b0, a1, b1, a2, b2)
  • the loss
  • the gradients

Row 2 corresponds to iteration 0, with the initial coefficients.

Row 3 computes the updated coefficients using the gradients from row 2.

By dragging the formulas down for hundreds of rows, we simulate gradient descent over many iterations.

You can then clearly see:

  • the coefficients gradually stabilizing
  • the loss decreasing iteration after iteration

This makes the learning process tangible.
Instead of imagining an optimizer, you can watch the model learn.
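Putting all the pieces together, a compact Python version of the whole loop could look like this (the dataset, learning rate, and iteration count are illustrative choices, not the workbook's exact settings):

import math

# Illustrative dataset: one feature, three classes.
xs = [0.5, 1.0, 2.0, 2.5, 4.0, 5.0]
ys = [0, 0, 1, 1, 2, 2]

a = [0.0, 0.0, 0.0]  # a0, a1, a2
b = [0.0, 0.0, 0.0]  # b0, b1, b2
lr = 0.5             # learning rate (illustrative)
n = len(xs)

for it in range(501):
    grad_a = [0.0, 0.0, 0.0]
    grad_b = [0.0, 0.0, 0.0]
    loss = 0.0
    for x, y in zip(xs, ys):
        scores = [a[k] * x + b[k] for k in range(3)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        p = [e / total for e in exps]
        loss += -math.log(p[y]) / n
        for k in range(3):
            r = p[k] - (1 if y == k else 0)
            grad_a[k] += r * x / n
            grad_b[k] += r / n
    for k in range(3):
        a[k] -= lr * grad_a[k]
        b[k] -= lr * grad_b[k]
    if it % 100 == 0:
        print(it, round(loss, 4))  # the loss decreases iteration after iteration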

Logistic Regression as a Special Case of Softmax Regression

Logistic Regression and Softmax Regression are often presented as different models.

In reality, they are the same idea at different scales.

Softmax Regression computes one linear score per class and turns these scores into probabilities by comparing them.
When there are only two classes, this comparison depends only on the difference between the two scores.

This difference is a linear function of the input, and applying Softmax in this case produces exactly the logistic (sigmoid) function.

In other words, Logistic Regression is simply Softmax Regression applied to two classes, with redundant parameters removed.
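A quick numerical check of this equivalence (the two scores are arbitrary):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

s0, s1 = 0.7, 2.3  # two arbitrary scores

# Softmax probability of class 1...
p1_softmax = math.exp(s1) / (math.exp(s0) + math.exp(s1))
# ...equals the sigmoid of the score difference.
p1_sigmoid = sigmoid(s1 - s0)

print(p1_softmax, p1_sigmoid)  # identical values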

Once this is understood, moving from binary to multiclass classification becomes a natural extension, not a conceptual jump.

Softmax Regression does not introduce a new way of thinking.

It simply shows that Logistic Regression already contained everything we needed.

By duplicating the linear score once per class and normalizing them with Softmax, we move from binary decisions to multiclass probabilities without changing the underlying logic.

The loss is the same idea.
The gradients are the same structure.
The optimization is the same gradient descent we already know.

What changes is only the number of parallel scores.

Another Way to Handle Multiclass Classification?

Softmax is not the only way to deal with multiclass problems in weight-based models.

There is another approach, less elegant conceptually, but very common in practice:
one-vs-rest or one-vs-one classification.

Instead of building a single multiclass model, we train several binary models and combine their results.
This strategy is used extensively with Support Vector Machines.
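As an illustration, here is how one-vs-rest could be set up with scikit-learn (a sketch assuming scikit-learn is installed; the tiny dataset is made up):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Made-up dataset: one feature, three classes.
X = [[0.5], [1.0], [2.0], [2.5], [4.0], [5.0]]
y = [0, 0, 1, 1, 2, 2]

# One binary LinearSVC per class, combined behind a single interface.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)
print(clf.predict([[0.2], [2.2], [6.0]]))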

Tomorrow, we will look at SVM.
And you will see that it can be explained in a rather unusual way… and, as usual, directly in Excel.


