Machine Learning “Advent Calendar” Day 23: CNN in Excel

CNNs were first introduced with images, which are often easier to understand.

Filters slide over pixels to detect edges, shapes, or textures. To understand how CNN works on images in Excel, you can read this article I wrote earlier.

The idea is the same for text.

Instead of sliding filters over pixels, we slide them over words.

Instead of visual patterns, we detect language patterns.

And many of the important patterns in the text are very local. Let's look at these very simple examples.

  • “good” is positive
  • “bad” is negative
  • “not good” is negative
  • “not bad” is often positive

In the previous article, we learned how to represent words as numbers using embeddings.

We also saw an important limitation. Because we averaged the word embeddings, word order was completely ignored.

From the model's perspective, “not good” and “good not” look exactly the same.

So the next challenge is obvious: we want the model to take word order into account.

1D convolutional neural networks are a natural tool for this because they scan sentences in small sliding windows and react when they recognize familiar local patterns.

1. Understanding 1D CNNs for Text: Architecture and Depth

1.1. Building a 1D CNN for text in Excel

In this article, you will build a 1D CNN architecture in Excel using the following components:

  • Embedding dictionary
    We use a 2D embedding, because one dimension is not enough for this task.
    The first dimension encodes sentiment, and the second dimension encodes negation.
  • Conv1D layer
    This is a core component of CNN architecture.
    It consists of filters that slide across the sentence with a window length of 2 words. We choose 2 words for simplicity.
  • ReLU and global max pooling
    These steps only keep the strongest matches found by the filter.
    We also discuss the fact that ReLU is optional.
  • Logistic regression
    This is the final classification layer and combines the detected patterns into probabilities.
1D CNN in Excel – All images by author

This pipeline corresponds to a standard CNN text classifier.
The only difference here is that we are explicitly writing and visualizing the forward pass in Excel.

1.2. What “deep learning” means in this architecture

Before we go any further, let's take a step back.
Yes, as I often do, seeing the model as a whole is very helpful in understanding it.

The definition of deep learning is often blurry.
For many people, deep learning simply means “many layers.”

Let's look at it from a slightly different perspective.

What really characterizes deep learning is not the number of layers, but the depth of the transformation applied to the input data.

With this definition:

  • Even a model with a single convolutional layer can be considered deep learning.
  • The reason is that the input is transformed into a more structured and abstract representation.

On the other hand, taking raw input data, applying one-hot encoding, and stacking many fully connected layers does not necessarily create a meaningfully deep model.
In theory, one layer is sufficient if there are no transformations.

In CNNs, the presence of multiple layers has very specific motivations.

Consider a sentence like the following:

this movie is not very good

Simple local patterns such as “very + good” can be detected using a single convolutional layer and a small window.

However, higher level patterns such as “not + (very good)” are still not detected.

This is why CNNs are often stacked.

  • The first layer detects simple local patterns.
  • The second layer combines them into something more complex.

In this article, we intentionally focus on 1 convolutional layer.
This makes every step visible and easy to understand in Excel while keeping the logic the same as the deeper CNN architecture.

2. Turn words into embeddings

Let's start with some simple words. Since we are trying to detect negation, we use the following terms, surrounded by other words that we do not model.

  • “good”
  • “bad”
  • “not good”
  • “not bad”

The representation is intentionally small so that you can see every step.

We will only use a dictionary of three words: good, bad, and not.

All other words are mapped to 0, like padding.

2.1 Why one dimension is not enough

In our previous article on sentiment detection, we used a single dimension.
It worked for both “good” and “bad.”

But now we want to handle negation.

One dimension can only represent one concept well.
That is why we need two dimensions:

  • senti: sentiment polarity
  • neg: negation marker

2.2 Embedding dictionary

Each word becomes a 2D vector.

  • good → (senti = +1, neg = 0)
  • bad → (senti = -1, neg = 0)
  • not → (senti = 0, neg = +1)
  • other words → (0, 0)

This is not what real embeddings look like. Real embeddings are learned, high-dimensional, and not directly interpretable.

But for understanding how Conv1D works, this toy embedding is perfect.

In Excel, this is just a lookup table.
In a real neural network, this embedding matrix is trainable.
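Outside Excel, the same lookup table can be sketched in a few lines of Python. This is only an illustration: the names `EMBEDDINGS` and `embed` are made up for this sketch, and it assumes the three dictionary words are good, bad, and not, with whitespace tokenization.

```python
# Toy embedding dictionary: word -> (senti, neg).
# Assumption: the three modeled words are "good", "bad" and "not".
EMBEDDINGS = {
    "good": (1.0, 0.0),
    "bad": (-1.0, 0.0),
    "not": (0.0, 1.0),
}


def embed(sentence):
    """Map each lowercase token to its 2D vector; unknown words get (0, 0)."""
    return [EMBEDDINGS.get(token, (0.0, 0.0)) for token in sentence.lower().split()]


vectors = embed("it is not bad at all")
for token, vector in zip("it is not bad at all".split(), vectors):
    print(token, vector)
```

Every word that is not in the dictionary falls back to (0, 0), which plays the same role as the padding value in the spreadsheet.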

3. Conv1D filter as a sliding pattern detector

Now we arrive at the core idea of a 1D CNN.

A Conv1D filter is nothing mysterious. It is just a small set of weights and a bias that slides over the sentence.

Because:

  • Each word embedding has two values (senti, neg)
  • Our window contains two words

Each filter has:

  • 4 weights (2 dimensions x 2 positions)
  • 1 bias

That's it.

You can think of a filter as asking the same question over and over again in every position.

“Do these two adjacent words match the pattern I'm interested in?”

3.1 Sliding window: how Conv1D sees a sentence

Consider the following sentence:

it is not bad at all

We choose a window size of 2 words.

That is, the model examines all adjacent pairs.

  • (it, is)
  • (is, not)
  • (not, bad)
  • (bad, at)
  • (at, all)

Important point:
The filter slides everywhere, even when both words are neutral (all zeros).
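The enumeration of adjacent pairs can be sketched as a small helper. The function name `sliding_windows` is hypothetical, and the sketch assumes the sentence is split on whitespace into the six tokens "it is not bad at all".

```python
def sliding_windows(tokens, size=2):
    """Return every run of `size` adjacent tokens, in sentence order."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "it is not bad at all".split()
pairs = sliding_windows(tokens)
for pair in pairs:
    print(pair)
```

A sentence of n words gives n - 1 windows of size 2, and the filter is evaluated on each of them, neutral or not.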

3.2 Four intuitive filters

To make the behavior easier to understand, we will use four filters.

Filter 1 – “good”

This filter only looks at the sentiment of the current word.

Plain-text equation for one window:

z = senti(current word)

If the word is “good” then z = 1
If the word is “bad” then z = -1
If the word is neutral, z = 0

After ReLU, negative values become 0. However, this step is optional.

Filter 2 – “bad”

This one is symmetric.

z = -senti(current word)

So:

  • “Bad” → z = 1
  • “Good” → z = -1 → ReLU → 0

Filter 3 – “not good”

This filter looks at two things at the same time:

  • neg(previous word)
  • senti(current word)

Equation:

z = neg(previous word) + senti(current word) - 1

Why “-1”?
It acts like a threshold, so that both conditions must be true.

Results:

  • “not good” → 1 + 1 - 1 = 1 → activated
  • “good” alone → 0 + 1 - 1 = 0 → not activated
  • “not bad” → 1 - 1 - 1 = -1 → ReLU → 0

Filter 4 – “not bad”

Same idea, but with the opposite sign on the sentiment:

z = neg(previous word) + (-senti(current word)) - 1

Results:

  • “not bad” → 1 + 1 - 1 = 1
  • “not good” → 1 - 1 - 1 = -1 → ReLU → 0

This is a very important intuition.

CNN filters can behave like local logical rules, learned from the data.
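To make the "4 weights and 1 bias" view concrete, here is a minimal sketch that writes the four filters as weight vectors. The weight layout (previous senti, previous neg, current senti, current neg), the filter names, and the helper `filter_output` are assumptions chosen for this illustration; the values are picked so that they reproduce the plain-text equations above.

```python
# Toy embeddings: word -> (senti, neg); any other word is (0, 0).
EMBEDDINGS = {"good": (1.0, 0.0), "bad": (-1.0, 0.0), "not": (0.0, 1.0)}

# Each filter: 4 weights (prev senti, prev neg, cur senti, cur neg) and 1 bias.
FILTERS = {
    "good":     ((0.0, 0.0,  1.0, 0.0),  0.0),
    "bad":      ((0.0, 0.0, -1.0, 0.0),  0.0),
    "not_good": ((0.0, 1.0,  1.0, 0.0), -1.0),
    "not_bad":  ((0.0, 1.0, -1.0, 0.0), -1.0),
}


def filter_output(name, previous_word, current_word):
    """Raw output z of one filter on one 2-word window (before ReLU)."""
    weights, bias = FILTERS[name]
    # Concatenate the two 2D embeddings into one 4-value window.
    window = EMBEDDINGS.get(previous_word, (0.0, 0.0)) + EMBEDDINGS.get(current_word, (0.0, 0.0))
    return sum(w * x for w, x in zip(weights, window)) + bias


for name in FILTERS:
    print(name, filter_output(name, "not", "bad"))
```

In a trained network these numbers would be learned rather than written by hand, but the computation per window is the same weighted sum plus bias.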

3.3 Final result of sliding window

The final results of these four filters are:

4. ReLU and max pooling: from local to global

4.1 ReLU

After computing z for all windows, apply ReLU.

ReLU(z) = max(0, z)

Meaning:

  • negative evidence is ignored
  • positive evidence is kept

Each filter becomes a presence detector.

By the way, this is what an activation function is in a neural network. Neural networks aren't that difficult after all.

4.2 Global Max Pooling

Then comes global max pooling.

For each filter, we keep only one value:

the maximum activation across all windows

Interpretation:
“We don't care where the pattern appears, we only care whether it appears strongly somewhere.”

At this point, the entire sentence can be summarized in four numbers.

  • Strongest “good” signal
  • Strongest “bad” signal
  • Strongest “not good” signal
  • Strongest “not bad” signal
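As a sketch of these two steps, the snippet below applies ReLU and then global max pooling to the raw outputs of the four toy filters. The z values are worked out by hand for the five windows of "it is not bad at all", assuming the current word is the second word of each pair; the dictionary keys and function names are illustrative only.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def global_max_pool(values):
    """Keep only the strongest activation across all windows."""
    return max(values)


# Raw outputs z per window: (it, is), (is, not), (not, bad), (bad, at), (at, all).
z_per_filter = {
    "good":     [0.0, 0.0, -1.0, 0.0, 0.0],
    "bad":      [0.0, 0.0, 1.0, 0.0, 0.0],
    "not_good": [-1.0, -1.0, -1.0, -1.0, -1.0],
    "not_bad":  [-1.0, -1.0, 1.0, -1.0, -1.0],
}

features = {
    name: global_max_pool([relu(z) for z in zs])
    for name, zs in z_per_filter.items()
}
print(features)
```

Both the “bad” and the “not bad” detectors fire on the (not, bad) window, so the sentence is summarized as bad = 1 and not_bad = 1, with the other two features at 0.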

4.3 What happens if I remove ReLU?

If you don't use ReLU:

  • Negative values remain negative
  • Max pooling may select negative values

This mixes two ideas.

  • absence of the pattern
  • opposite of the pattern

The filter is no longer a clean detector, but a signed score.

The model still works mathematically, but it is harder to interpret.
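A small comparison makes the difference visible. The sketch below pools the output of a “good” detector (z = senti of the current word) on two tiny inputs, with and without ReLU. The helper `pooled` and the two example inputs are assumptions made for this illustration.

```python
def pooled(zs, use_relu):
    """Global max pooling, optionally preceded by ReLU."""
    values = [max(0.0, z) for z in zs] if use_relu else list(zs)
    return max(values)


# Output of the "good" filter on a single window.
z_absent = [0.0]     # window (it, is): no sentiment word at all
z_opposite = [-1.0]  # window (is, bad): the opposite sentiment

for label, zs in [("absent", z_absent), ("opposite", z_opposite)]:
    print(label, "with ReLU:", pooled(zs, True), "without ReLU:", pooled(zs, False))
```

With ReLU, both inputs give 0, which simply reads as "pattern not found". Without ReLU, the second input gives -1, so the feature now behaves like a signed score rather than a presence flag.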

5. The last layer is logistic regression

Next, combine these signals.

Calculate the score using a linear combination.

Score = 2 × F_good - 2 × F_bad - 3 × F_not_good + 3 × F_not_bad + bias

Next, convert the scores to probabilities.

Probability = 1 / (1 + exp(-score))

It's just a logistic regression.

So, yes:

  • The CNN extracts features. You can think of this step as feature engineering.
  • Logistic regression makes the final decision. This is the classic machine learning model that we are already familiar with.

6. Complete example using sliding filter

Example 1

“It is bad, so it is not good at all.”

The sentence contains:

After max pooling:

  • F_good = 1 (because “good” exists)
  • F_bad = 1
  • F_not_good = 1
  • F_not_bad = 0

The final score will be strongly negative.
Prediction: negative sentiment.

Example 2

“That's good. Yes, it's not bad.”

The sentence contains:

After max pooling:

  • F_good = 1
  • F_bad = 1 (because the word “bad” appears)
  • F_not_good = 0
  • F_not_bad = 1

The final linear layer learns that “not bad” must outweigh “bad.”

Prediction: positive sentiment.

This also shows something important. Max pooling keeps all strong signals.
The last layer determines how they are combined.
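The two examples can be checked numerically with a short sketch of the final layer. The weights below are assumptions for illustration: 2 and -2 for the single-word detectors, -3 for “not good”, and +3 for “not bad”, so that “not bad” outweighs “bad” as described above. The bias is set to 0 because no value is given.

```python
import math

# Assumed weights of the final linear layer (illustrative, not learned here).
WEIGHTS = {"good": 2.0, "bad": -2.0, "not_good": -3.0, "not_bad": 3.0}
BIAS = 0.0  # assumption: no bias value is given in the text


def predict(features):
    """Return (score, probability of positive sentiment)."""
    score = sum(WEIGHTS[name] * value for name, value in features.items()) + BIAS
    probability = 1.0 / (1.0 + math.exp(-score))
    return score, probability


# Pooled features taken from Example 1 and Example 2.
example_1 = {"good": 1, "bad": 1, "not_good": 1, "not_bad": 0}
example_2 = {"good": 1, "bad": 1, "not_good": 0, "not_bad": 1}

score_1, prob_1 = predict(example_1)
score_2, prob_2 = predict(example_2)
print("Example 1:", score_1, round(prob_1, 3))
print("Example 2:", score_2, round(prob_2, 3))
```

Under these assumed weights, the “good” and “bad” contributions cancel in both sentences, and the sign of the score is decided entirely by the two negation features.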

Example 3: a limitation that explains why CNNs go deeper

Try this sentence:

“It’s not that bad.”

With a window of size 2, the model only sees adjacent pairs such as (not, that) and (that, bad).

The “not bad” filter never fires, because the model never sees the pair (not, bad).

This explains why real models use:

  • larger windows
  • multiple convolutional layers
  • Or other architectures for longer dependencies
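The limitation can be reproduced with a short sketch. It compares the 2-word “not bad” detector with a hypothetical 3-word variant that looks at the first and third word of each window. The 3-word filter is not part of the Excel model; it is only an assumption used here to show what a larger window could capture. The sentence is tokenized as "it is not that bad".

```python
# Toy embeddings: word -> (senti, neg); any other word is (0, 0).
EMBEDDINGS = {"good": (1.0, 0.0), "bad": (-1.0, 0.0), "not": (0.0, 1.0)}


def vec(word):
    return EMBEDDINGS.get(word, (0.0, 0.0))


def not_bad_size2(tokens):
    """Max over adjacent pairs of ReLU(neg(previous) - senti(current) - 1)."""
    zs = [vec(a)[1] - vec(b)[0] - 1.0 for a, b in zip(tokens, tokens[1:])]
    return max(max(0.0, z) for z in zs)


def not_bad_size3(tokens):
    """Hypothetical 3-word filter: ReLU(neg(first) - senti(third) - 1)."""
    zs = [vec(a)[1] - vec(c)[0] - 1.0 for a, c in zip(tokens, tokens[2:])]
    return max(max(0.0, z) for z in zs)


tokens = "it is not that bad".split()
print("window of 2:", not_bad_size2(tokens))
print("window of 3:", not_bad_size3(tokens))
```

The 2-word detector stays at 0 because "not" and "bad" never share a window, while the 3-word variant reaches 1 on the window (not, that, bad).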

Conclusion

Excel's strength is visibility.

You can check the following:

  • The embedding dictionary
  • All filter weights and biases
  • All sliding windows
  • All ReLU activations
  • The max pooling results
  • The logistic regression parameters

Training is simply the process of adjusting these numbers.

Once you understand that, CNN becomes less mysterious.

They become what they really are: structured, trainable pattern detectors that slide over your data.


