Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel


You learned how to use logistic regression to classify into two classes.

So what if you have more than two classes?

Softmax regression is just a multiclass extension of this idea. I will discuss this model on Day 14 of my Machine Learning “Advent Calendar” (click this link to get all the information about the approach and the files used).

Instead of a single score, the model now computes one score per class. And instead of a single probability, a softmax function turns these scores into probabilities that sum to 1.

Understanding the softmax model

Before training the model, let's first understand what the model is.

At this stage, softmax regression is not about optimization.
First, let's look at how predictions are computed.

Small dataset with 3 classes

Let's use a small dataset with one feature x and three classes.

As mentioned before, the target variable y should not be treated as a number.
It represents a category, not a quantity.

A common way to express this is one-hot encoding, where each class is represented by its own indicator.
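
For example, with three classes, each value of y becomes a triple of indicators (y0, y1, y2):

y = 0 → (1, 0, 0)
y = 1 → (0, 1, 0)
y = 2 → (0, 0, 1)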

From this perspective, softmax regression looks like three logistic regressions running in parallel, one per class.

A small dataset is great for learning.
You can see how every formula, every value, and every part of the model contributes to the final result.

Softmax Regression in Excel – All images by author

Model description

So what exactly is the model?

Score per class

In logistic regression, the model score is simply linear: score = a * x + b.

Softmax regression does exactly the same thing, but with one score per class.

score_0 = a0 * x + b0
score_1 = a1 * x + b1
score_2 = a2 * x + b2

At this stage, these scores are just real numbers.
They are not probabilities yet.
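
As a sketch in the sheet, assuming a hypothetical layout with x in column A and the class 0 coefficients in fixed cells N1 (a0) and O1 (b0), the first score could be:

score_0 (cell C2): =$N$1 * A2 + $O$1

score_1 and score_2 follow the same pattern with their own coefficient cells, and all three formulas are dragged down over the rows of the dataset.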

Turning scores into probabilities: the softmax step

Softmax converts three scores into three probabilities. Each probability is positive and the sum of all three is 1.

The calculation is straightforward:

  1. Exponentiate each score (take EXP of it)
  2. Sum the three exponentials
  3. Divide each exponential by this sum

This will give you p0, p1, and p2 for each row.
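
Written out for class 0 (classes 1 and 2 are symmetric), and assuming the scores sit in cells C2:E2 of the hypothetical layout above:

p0 = EXP(score_0) / (EXP(score_0) + EXP(score_1) + EXP(score_2))

p0 (cell F2): =EXP(C2) / (EXP(C2) + EXP(D2) + EXP(E2))

As a sanity check, =F2 + G2 + H2 should return 1 on every row.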

These values represent the model's confidence in each class.


Softmax Regression in Excel – All images by author

Visualizing Softmax Models

At this point, the model is fully defined.

We have:

  • One linear score per class
  • A softmax step to convert these scores into probabilities

Training the model simply consists of adjusting the coefficients a_k and b_k so that these probabilities match the observed classes as closely as possible.

Once the coefficients are found, you can visualize the model's behavior.

To do this, we take a range of input values (for example, x from 0 to 7) and calculate score_0, score_1, score_2 and the corresponding probabilities p0, p1, p2.
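
A minimal way to build this grid, assuming it lives in a separate range with the first x value, 0, typed into cell A2:

x (cell A3): =A2 + 0.1

Drag this down until x reaches 7, compute the scores and probabilities alongside with the same formulas as before, and insert a line chart of p0, p1, p2 against x.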

Plotting these probabilities yields three smooth curves, one for each class.

Softmax Regression in Excel – All images by author

The results are very intuitive.

For small values of x, the probability of class 0 dominates.
As x increases, this probability decreases while the probability of class 1 rises.
As x increases further, the probability of class 2 becomes dominant.

For every value of x, the three probabilities sum to 1.
The model does not make sudden decisions. Instead, it expresses how confident it is in each class.

This plot helps you understand how softmax regression works.

  • You can see how the model transitions smoothly from one class to another.
  • The decision boundary corresponds to the intersection between the probability curves.
  • The model's logic becomes visible instead of abstract.

This is one of the main advantages of building models in Excel.
Beyond calculating predictions, you can see how the model thinks.

Now that the model is defined, we need to evaluate how good it is and then improve the coefficients.

Both steps reuse ideas we've already seen in logistic regression.

Model evaluation: cross-entropy loss

Softmax regression uses the same loss function as logistic regression.

For each data point, we look at the probability assigned to the correct class, and then take its negative logarithm.

loss = -log(p_correct_class)

If the model assigns a high probability to the correct class, the loss is small.
A lower probability yields a larger loss.

In Excel, this is very easy to implement.

Pick the correct probability based on the value of y (CHOOSE is 1-indexed, hence the y + 1 below) and apply the logarithm.

loss = -LN(CHOOSE(y + 1, p0, p1, p2))

Finally, compute the average loss across all rows.
This average loss is the quantity you want to minimize.
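
In cell terms, assuming the same hypothetical layout (y in column B, p0 to p2 in columns F to H, 30 data rows):

row loss (cell I2): =-LN(CHOOSE(B2 + 1, F2, G2, H2))
average loss:       =AVERAGE(I2:I31)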

Softmax Regression in Excel – All images by author

Calculating residuals

To update the coefficients, we first compute the residuals, one per class.

For each row:

  • residual_0 = p0 - 1 if y equals 0, otherwise residual_0 = p0
  • residual_1 = p1 - 1 if y equals 1, otherwise residual_1 = p1
  • residual_2 = p2 - 1 if y equals 2, otherwise residual_2 = p2

In other words, for the correct class we subtract 1 from the predicted probability; for the other classes we subtract 0: residual_k = p_k - y_k, with y_k the one-hot indicator.

These residuals measure how far the predicted probabilities are from their expected values.
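
As a sketch with this layout (y in column B, p0 in column F), the first residual could be:

residual_0 (cell J2): =F2 - IF(B2 = 0, 1, 0)

residual_1 and residual_2 follow the same pattern with their own probability columns and class values.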

Gradient calculation

The gradients are obtained by combining the residuals with the feature values.

For each class k:

  • The gradient for a_k is the average of residual_k * x
  • The gradient for b_k is the average of residual_k

In Excel, this is implemented with simple formulas using SUMPRODUCT and AVERAGE.
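
For class 0, still with the hypothetical layout (x in A2:A31, residual_0 in J2:J31):

grad_a0: =SUMPRODUCT(A2:A31, J2:J31) / ROWS(A2:A31)
grad_b0: =AVERAGE(J2:J31)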

At this point everything is visible.
You can see the residuals, the gradients, and how each data point contributes.


Update coefficients

Once we know the gradients, we use gradient descent to update the coefficients.

This step is the same as in the linear and logistic regression posts described earlier.
The only difference is that we now update 6 coefficients instead of 2.
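
For each class k, the update rule is the one we already know, where learning_rate is a small step size you pick yourself (0.1 is a reasonable starting point):

a_k = a_k - learning_rate * grad_a_k
b_k = b_k - learning_rate * grad_b_k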

To visualize the training, create a second sheet with one row per iteration, containing:

  • the iteration number
  • the 6 coefficients (a0, b0, a1, b1, a2, b2)
  • the loss
  • the gradients

Row 2 corresponds to iteration 0 and uses the initial coefficients.

Row 3 uses the gradients from row 2 to compute the updated coefficients.
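
As a sketch, if a0 sits in column B of this sheet, its gradient in column H, and the learning rate in cell $L$1 (all hypothetical placements), the formula in row 3 could be:

a0 (cell B3): =B2 - $L$1 * H2

The five other coefficients are updated the same way from their own gradient columns.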

Dragging the formulas down over hundreds of rows simulates gradient descent, iteration after iteration.

Then you can clearly see:

  • The coefficients gradually stabilize
  • The loss decreases over the iterations

This makes the learning process concrete.
Instead of imagining an optimizer, you can watch the model train.

Logistic regression as a special case of softmax regression

Logistic regression and softmax regression are often presented as different models.

In fact, they are the same idea on different scales.

Softmax regression computes one linear score per class and converts the scores into probabilities by comparing them.
If there are only two classes, this comparison reduces to the difference between the two scores.

This difference is a linear function of the input, and applying softmax in this case produces exactly the logistic (sigmoid) function.
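
To see this, write the softmax probability of class 1 with only two scores and divide the numerator and denominator by EXP(score_1):

p1 = EXP(score_1) / (EXP(score_0) + EXP(score_1))
   = 1 / (1 + EXP(score_0 - score_1))
   = sigmoid(score_1 - score_0)

Since score_1 - score_0 = (a1 - a0) * x + (b1 - b0) is itself linear in x, this is exactly the logistic regression model with coefficients a = a1 - a0 and b = b1 - b0.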

In other words, logistic regression is a simple softmax regression applied to two classes with redundant parameters removed.

Understanding this, moving from binary to multiclass classification becomes a natural extension rather than a conceptual leap.

Softmax regression does not introduce new ideas.

It simply shows that logistic regression already had everything we needed.

By replicating the linear score once per class and normalizing with softmax, we move from binary decisions to multiclass probabilities without changing the underlying logic.

The same idea applies to the loss.
The gradients have the same structure.
Optimization is the same gradient descent method we already know.

The only thing that changes is the number of parallel scores.

Another way to handle multi-class classification?

Softmax is not the only way to handle multiclass problems with weight-based models.

There is another approach that is conceptually less elegant, but very common in practice:
one-vs-rest or one-vs-one classification.

Instead of building a single multiclass model, you train several binary models and combine their results.
This strategy is widely used with support vector machines.

Tomorrow we'll look at SVMs.
And it turns out that they can be explained in a rather unusual way… and as always, directly in Excel.


