Machine Learning “Advent Calendar” Day 7: Decision Tree Classifier

how do we Decision tree regressor The optimal partition is selected by minimizing . Mean squared error (MSE).

Today marks day 7 of our machine learning “advent calendar” and we continue with the same approach. Decision tree classifiercorresponds to the classification of yesterday's model.

Intuitive experiments using two simple datasets

Let's start with a very small toy dataset that I generated that contains one numeric feature and one target variable with two classes (0 and 1).

The idea is to split the dataset into two parts based on one rule. But the problem is: What should this rule be? What are the criteria for determining which split is better?

Now, even if you don't know math yet, you can look at your data and deduce possible cut points.

And visually, 8 or 12right?

But the question is, which one is numerically better?

Decision Tree Classifier in Excel – Image by author

Intuitively, it looks like this:

There is a division in 8:
- Left side: no misclassification
- Right side: 1 misclassification
There is a division in 12:
- Right side: no misclassification
- Left side: two misclassifications

So clearly I feel that splitting by 8 is better.

Now let's take a look at an example 3 classes. I added more random data to create three classes.

Label them here 0, 1, 3Plot vertically.

But you have to be careful. These numbers are: just the category namenot a number. They should not be construed as “commanded”.

So the intuition is always: How homogeneous is each region after the split?

However, it is difficult to determine the optimal split visually.

Now we need a mathematical way to express this idea.

This is exactly the topic of the next chapter.

Measures against impurities as a basis for separation

The decision tree regressor already knows:

Regional forecasts are average of the target.
The quality of the split is measured as follows: MSE.

The decision tree classifier looks like this:

Regional forecasts are majority class of the region.
Split quality is measured in the following way: Impurity measures: Gini impurity or entropy.

Both are standard in textbooks and are also available in scikit-learn. By default, Gini is used.

But what exactly is this impurity countermeasure?

If you look at the curve, Gini and entropyboth work the same way.

they are 0 The node is pure (All samples have the same class).
they reach them maximum when there is a class evenly mixed (50 percent / 50 percent).
The curve is smoothis symmetric and increases with disorder.

This is an essential property of everything Impurity measures:

If the groups are clean, the impurity is low; if the groups are mixed, the impurity is high.

Decision Tree Classifier in Excel – Gini and Entropy – Image by author

Therefore, use these measures to decide which splits to create.

Split by one continuous feature

It follows the same structure as for decision tree regressors.

List of all possible splits

Just like the regressor version with one numeric feature, the only split we need to test is the midpoint between consecutive sorted x values.

For each split, calculate the impurities on each side.

For example, consider split values. x = 5.5.

Split the dataset into two regions.

Area L: x < 5.5
Region R: x ≥ 5.5

For each region:

Count the total number of observations
Calculate Gini impurities
Finally, we calculate the weighted impurity of the split.

Select the split with the least impurities

Same as for regressors:

List all possible splits
Calculate each impurity
The optimal division is minimal impurities

Composite table of all partitions

To automate everything in Excel,
organize all calculations 1 tablewhere:

Each row corresponds to one split candidate,
For each row, calculate:
- Gini left region,
- Gini right region,
- and Overall weighted Gini of division.

This table provides a short and concise overview of all possible splits.
And the best split is simply the one with the lowest value in the last column.

Multiple class classification

Up until now, I have been working in two classes. However, the impurity of Gini naturally extends to: 3 classesthe logic of the split remains exactly the same.

The structure of the algorithm remains unchanged.

List all possible splits.
Calculate the impurities on each side.
Taking the weighted average,
Select the split with the least impurities.

Only the Gini impurity formula is a little longer.

Gini impurity with three classes

If the region contains the ratios p1, p2, p3

For the three classes, the Gini impurities are:

Same idea as before:
A region is “pure” if one class dominates.
Mixing classes increases impurities.

left and right areas

For each split:

Region L contains some observations of classes 1, 2, and 3
Region R contains the remaining observations

For each region:

Count the number of points belonging to each class
Calculate the ratios p1, p2, p3
Calculate the Gini impurity using the above formula

Everything is exactly the same as in the binary case, except for one term.

3 class division summary table

As before, collect all calculations into one table.

Each row is one possible split
Count from the left as class 1, class 2, and class 3.
Count class 1, class 2, class 3 on the right side.
Compute the Gini (left), Gini (right), and weighted Gini.

split with least weighted impurity The one selected by the decision tree.

The algorithm can be easily generalized to K classes by calculating the gini or entropy using the following formula:

In reality, how different are impurity countermeasures?

Now, we always refer to Gini or entropy as a criterion, but Is it really different??When you look at the formula, some people may say:

The answer is not very much.

In theory, in almost any real-world situation:

Gini and entropy Select the same split
The tree structure is almost the same
The predictions are as follows same

why?

Because their curves are very similar.

Both peak at 50% mixing and decrease to zero purity.

The only difference is shape Curved:

Gini is quadratic function Penalize misclassifications more linearly.
entropy is logarithm Since it's a function, we penalize the uncertainty a little more strongly around 0.5.

But in reality the difference is small and you can do it in Excel.

What about other measures against impurities?

Another obvious question: Is it possible to devise/use other means?

Yes, you can invent your own functions as long as you meet the following conditions:

it is 0 If the node is pure
it is the biggest When classes are mixed
it is smooth And strictly speaking, “disability” is increasing.

Example: impurity = 4*p0*p1

This is also a valid impurity measure. and it is actually equal to Gini If there are only two classes, multiply by a constant.

Again, it gives same division. If you are not satisfied,

Here are some other measures you can use.

Decision Tree Classifier in Excel – Many Impurity Measurements – Image by Author

Exercises in Excel

Testing with other parameters and features

After building the first split, you can expand the file.

try entropy instead of Ginny
try adding categorical features
Try building next split
try changing maximum depth Observe underfitting and overfitting
Create a confusion matrix for prediction

These simple tests have already given us an intuition about how a real-world decision tree will behave.

Implementing rules for the Titanic Survival Dataset

A natural follow-up exercise is to recreate the well-known decision rule. Titanic survival dataset (CC0 / Public domain).

You can start with just two features: sex and year.

Implementing rules in Excel is long and a little tedious, but that's exactly the point. This will help you understand what the decision rules actually are.

They are just a series of things. IF/ELSE A statement repeated over and over again.

This is the essence of a decision tree. A stack of simple rules.