Data Science Lab
Nearest Centroid Classification of Numeric Data Using C#
This is a complete end-to-end demonstration of what Dr. James McCaffrey of Microsoft Research calls perhaps the simplest classification technique.
The goal of a machine learning classification system is to predict the value of a discrete variable. For example, you might want to predict a person's gender (male or female — binary classification) based on their age, income, etc. Or you might want to predict a person's political leanings (conservative, centrist, liberal — multi-class classification) based on their gender, age, etc.
There are many machine learning classification techniques. Some common ones are logistic regression, neural network classification, naive Bayes classification, decision tree classification, and k-nearest neighbor classification.
This article provides a complete end-to-end demonstration of a technique called nearest centroid classification. Simply put, nearest centroid classification calculates the vector centroid (also known as the mean) in the training data for each class you want to predict. To classify a data item, you calculate the distance between the item and each centroid. The predicted class is the class associated with the nearest centroid.
Nearest centroid classification is probably the simplest classification technique. Four advantages of nearest centroid classification (NCC) compared to other techniques are that NCC is easy to implement, it can work with very small datasets, NCC has good interpretability, and NCC works for both binary and multiclass classification. Two disadvantages of NCC are that basic NCC only works with strictly numeric predictors (although there are new techniques that modify NCC to work with a mix of numeric and categorical predictors), and most importantly, NCC is the least effective classification technique because it does not consider interactions between predictors.
To get an idea of where this article is going, it's best to take a look at the screenshot in Figure 1. The demo program loads a subset of the penguin dataset into memory. The goal is to predict the penguin species (0 = Adelie, 1 = Chinstrap, 2 = Gentoo) from the penguin's beak length, beak width, fin length, and weight. The demo uses only 30 data items for training and 10 data items for testing. The first few raw training items are:
[ 0]  50.0  16.3  230.0  5700.0
[ 1]  39.1  18.7  181.0  3750.0
[ 2]  38.8  17.2  180.0  3800.0
[ 3]  39.3  20.6  190.0  3650.0
. . .
The raw penguin data is normalized so that all predictor values are between 0 and 1. This prevents the weight predictor, with its large values, from overwhelming the other predictors. Data normalization is necessary for nearest centroid classification (and many other classification techniques). The first few normalized training items look like this:
[ 0]  0.851  0.257  1.000  0.941
[ 1]  0.249  0.600  0.020  0.176
[ 2]  0.232  0.386  0.000  0.196
[ 3]  0.260  0.871  0.200  0.137
. . .
Then, the centroids of each of the three classes/species are calculated.
Class centroids:
[0]  0.2700  0.7786  0.2025  0.3174
[1]  0.8432  0.6418  0.2969  0.1931
[2]  0.7103  0.2095  0.6956  0.7996
The resulting model predicts the training data with an accuracy of 0.9333 (28 out of 30 correct) and the test data with an accuracy of 1.0000 (10 out of 10 correct).
The demo program finishes by predicting the species of a never-before-seen penguin with raw values: beak length = 46.5, beak width = 17.9, fin length = 192, and weight = 3500. The raw values are normalized to (0.5962, 0.6316, 0.3182, -0.0182). The NCC model predicts the penguin to be class = 1 = Chinstrap because the normalized values are closest to centroid [1], which is (0.8432, 0.6418, 0.2969, 0.1931).
This article assumes that you have intermediate or higher programming skills, but doesn't assume that you know anything about nearest centroid classification. The source code is too long to present in its entirety in this article, but the complete code and data is available in the accompanying file download. The code and data are also available online. The demo program is implemented using the C# language, but you should have no trouble refactoring the demo to another C-family language if you wish. All normal error checking has been removed from the demo program to make the main ideas as clear as possible.
Understanding Nearest Centroid Classification
Nearest centroid classification is best explained with a concrete example. Suppose your goal is to predict the species of a toad given its height and weight. You have a small training data set with seven items and three classes.
label ht wt
0 56 1800
0 20 5000
1 68 1000
1 64 4200
1 80 7400
2 68 9000
2 44 7400
min 20 1000
max 80 9000
The classes/labels/species are 0 = Asia, 1 = Bolivia, 2 = Canada. The first step is to normalize the data so that the large weight values don't overwhelm the small height values. The demo program uses min-max normalization. With min-max normalization, for each column, x' = (x – min) / (max – min), where x' is the normalized value, x is the raw value, min is the minimum value in the column, and max is the maximum value in the column. For example, the first height value of 56 is normalized to x' = (56 – 20) / (80 – 20) = 36 / 60 = 0.60. The first weight value is normalized to x' = (1800 – 1000) / (9000 – 1000) = 800 / 8000 = 0.10.
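The min-max computation can be sketched in C#. The class and method names here are illustrative, not the demo program's actual names:

```csharp
using System;

public static class MinMaxNormDemo
{
  // Min-max normalize one raw value: x' = (x - min) / (max - min)
  public static double Normalize(double x, double min, double max)
  {
    return (x - min) / (max - min);
  }

  public static void Main()
  {
    // first toad height and weight from the example above
    Console.WriteLine(Normalize(56.0, 20.0, 80.0));       // 0.6
    Console.WriteLine(Normalize(1800.0, 1000.0, 9000.0)); // 0.1
  }
}
```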
The resulting min-max normalized Toad data looks like this:
label  ht'   wt'
0      0.60  0.10
0      0.00  0.50
1      0.80  0.00
1      0.90  0.40
1      1.00  0.80
2      0.80  1.00
2      0.40  0.80
The values in each predictor column range from 0.0 to 1.0, with 0.0 corresponding to the minimum value in the column and 1.0 corresponding to the maximum value in the column.
The centroids for each of the three classes are then calculated from the normalized data.
centroids'
0  (0.30, 0.30)
1  (0.90, 0.40)
2  (0.60, 0.90)
A centroid is also known as a vector mean or vector average. To calculate the centroid of a set of vectors, calculate the arithmetic mean of each vector element. For example, the two vectors that belong to class 0 are (0.60, 0.10) and (0.00, 0.50). Therefore, the centroid of class 0 is ((0.60 + 0.00)/2, (0.10 + 0.50)/2) = (0.30, 0.30).
Similarly, the centroid of class 1 is ((0.80 + 0.90 + 1.00)/3, (0.00 + 0.40 + 0.80)/3) = (0.90, 0.40). The centroid of class 2 is ((0.80 + 0.40)/2, (1.00 + 0.80)/2) = (0.60, 0.90).
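The centroid calculation above can be sketched in C#. This is a simplified illustration, not the demo program's actual Train() method:

```csharp
using System;

public static class CentroidDemo
{
  // Compute the centroid (element-wise mean) of the rows of data whose label is c
  public static double[] Centroid(double[][] data, int[] labels, int c)
  {
    int dim = data[0].Length;
    double[] result = new double[dim];
    int count = 0;
    for (int i = 0; i < data.Length; ++i)
    {
      if (labels[i] != c) continue;
      for (int j = 0; j < dim; ++j)
        result[j] += data[i][j];
      ++count;
    }
    for (int j = 0; j < dim; ++j)
      result[j] /= count;
    return result;
  }

  public static void Main()
  {
    // normalized toad data and labels from the example above
    double[][] x = {
      new double[] { 0.60, 0.10 }, new double[] { 0.00, 0.50 },
      new double[] { 0.80, 0.00 }, new double[] { 0.90, 0.40 },
      new double[] { 1.00, 0.80 }, new double[] { 0.80, 1.00 },
      new double[] { 0.40, 0.80 } };
    int[] y = { 0, 0, 1, 1, 1, 2, 2 };
    double[] c0 = Centroid(x, y, 0);
    Console.WriteLine("(" + c0[0] + ", " + c0[1] + ")");  // centroid of class 0
  }
}
```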
To classify a data item, we compute the Euclidean distance between the normalized item and each of the three centroids, and return the class label associated with the minimum distance. The Euclidean distance between two vectors is the square root of the sum of the squared differences between the vector elements.
For example, say the toad item to be classified is x = (50, 6600). The normalized item is x' = ((50 - 20)/(80 - 20), (6600 - 1000)/(9000 - 1000)) = (0.50, 0.70). The distances from x' to each centroid are:
Distance from x' to centroid[0] = sqrt((0.50 - 0.30)^2 + (0.70 - 0.30)^2)
= sqrt(0.20)
= 0.45
Distance from x' to centroid[1] = sqrt((0.50 - 0.90)^2 + (0.70 - 0.40)^2)
= sqrt(0.25)
= 0.50
Distance from x' to centroid[2] = sqrt((0.50 - 0.60)^2 + (0.70 - 0.90)^2)
= sqrt(0.05)
= 0.22
Because the minimum distance, 0.22, is between the normalized item and centroid[2], the predicted class/label/species is 2 = Canada.
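The decision rule just described can be sketched in C#. The Predict() shown here illustrates the same logic as the demo's class method, assuming plain Euclidean distance; the names are illustrative:

```csharp
using System;

public static class NccPredictDemo
{
  // Euclidean distance: square root of the sum of squared element differences
  public static double EucDistance(double[] v1, double[] v2)
  {
    double sum = 0.0;
    for (int i = 0; i < v1.Length; ++i)
      sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return Math.Sqrt(sum);
  }

  // Return the class label of the centroid closest to x
  public static int Predict(double[] x, double[][] centroids)
  {
    int best = 0;
    double bestDist = double.MaxValue;
    for (int c = 0; c < centroids.Length; ++c)
    {
      double d = EucDistance(x, centroids[c]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
  }

  public static void Main()
  {
    // centroids for the toad example above
    double[][] centroids = {
      new double[] { 0.30, 0.30 },
      new double[] { 0.90, 0.40 },
      new double[] { 0.60, 0.90 } };
    double[] x = { 0.50, 0.70 };  // normalized (50, 6600) toad
    Console.WriteLine("predicted class = " + Predict(x, centroids));  // 2
  }
}
```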
Demo Data
The demo program uses a subset of the penguin dataset. The complete penguin dataset has 345 items, of which 333 are valid (no missing values). The demo data uses only 30 training items (10 of each species) and 10 test items (4 Adelie, 3 Chinstrap, and 3 Gentoo). The 10 test items are:
0, 40.6, 18.6, 183.0, 3550.0
0, 40.5, 18.9, 180.0, 3950.0
0, 37.2, 18.1, 178.0, 3900.0
0, 40.9, 18.9, 184.0, 3900.0
1, 52.0, 19.0, 197.0, 4150.0
1, 49.5, 19.0, 200.0, 3800.0
1, 52.8, 20.0, 205.0, 4550.0
2, 49.2, 15.2, 221.0, 6300.0
2, 48.7, 15.1, 222.0, 5350.0
2, 50.2, 14.3, 218.0, 5700.0
The complete penguin dataset has additional columns. In addition to species, beak length, beak width, fin length, and weight, the dataset has categorical columns for island (where the penguin was found and measured) and sex (male or female). In the demo data, the island and sex columns are not used because standard nearest centroid classification does not work directly with categorical predictors.
The complete Penguin dataset and its documentation can be found here.
Program Structure
The overall structure of the demo program is shown in Listing 1. All the control logic is in the Main() method. All the nearest centroid classification functionality is in the NearestCentroidClassifier class. The Program class has helper functions to load data, normalize data, and display data.
Listing 1: Demo Program Structure
using System;
using System.IO;

namespace NearestCentroidClassification
{
  internal class NearestCentroidProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin nearest " +
        "centroid classification demo ");

      // load raw data into memory
      // normalize training data
      // load and normalize test data
      // instantiate a NearestCentroidClassifier object
      // train the model (compute centroids)
      // evaluate model accuracy
      // use model to make a prediction

      Console.WriteLine("End demo ");
      Console.ReadLine();
    }

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment) { . . }

    public static int[] VecLoad(string fn, int usecol,
      string comment) { . . }

    public static double[][] MatMinMaxValues(double[][] X)
    { . . }

    public static double[][] MatNormalizeUsing(double[][] X,
      double[][] minsMaxs) { . . }

    public static double[] VecNormalizeUsing(double[] x,
      double[][] minsMaxs) { . . }

    public static void MatShow(double[][] M, int dec,
      int wid, int numRows, bool showIndices) { . . }

    public static void VecShow(int[] vec,
      int wid) { . . }

    public static void VecShow(int[] vec, int wid,
      int nItems) { . . }

    public static void VecShow(double[] vec, int decimals,
      int wid) { . . }
  } // Program

  public class NearestCentroidClassifier
  {
    public int numClasses;
    public double[][] centroids;  // of each class

    public NearestCentroidClassifier(int numClasses) { . . }
    public void Train(double[][] trainX, int[] trainY) { . . }
    public int Predict(double[] x) { . . }
    public double[] PredictProbs(double[] x) { . . }
    public double Accuracy(double[][] dataX, int[] dataY) { . . }
    private double EucDistance(double[] v1, double[] v2) { . . }
    public int[][] ConfusionMatrix(double[][] dataX,
      int[] dataY) { . . }
    public void ShowConfusion(int[][] cm) { . . }
  } // class
} // ns
The demo program begins by loading the raw training data into memory and displaying the first four rows.
string trainFile = "..\\..\\..\\Data\\penguin_train_30.txt";
double[][] trainX =
MatLoad(trainFile, new int[] { 1, 2, 3, 4 }, ',', "#");
MatShow(trainX, 1, 9, 4, true);
The raw training data file looks like this:
2, 50.0, 16.3, 230.0, 5700.0
0, 39.1, 18.7, 181.0, 3750.0
1, 38.8, 17.2, 180.0, 3800.0
2, 39.3, 20.6, 190.0, 3650.0
0, 39.2, 19.6, 195.0, 4675.0
. . .
The raw data is comma separated, with the species/class/label in column [0]. The arguments in the call to the program-defined MatLoad() function specify that columns [1], [2], [3], [4] are to be loaded, that the data is comma delimited, and that lines beginning with "#" are comments. The return value is an array-of-arrays style matrix of type double.
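A minimal sketch of what a loader like MatLoad() might look like is shown below. This is an assumption about the implementation, because the demo's actual function body is elided; the actual code may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class LoadDemo
{
  // Load the specified columns of a delimited text file into an
  // array-of-arrays matrix, skipping lines that begin with the comment token
  public static double[][] MatLoad(string fn, int[] usecols,
    char sep, string comment)
  {
    List<double[]> rows = new List<double[]>();
    foreach (string line in File.ReadLines(fn))
    {
      if (line.Trim().Length == 0) continue;   // skip blank lines
      if (line.StartsWith(comment)) continue;  // skip comment lines
      string[] tokens = line.Split(sep);
      double[] row = new double[usecols.Length];
      for (int j = 0; j < usecols.Length; ++j)
        row[j] = double.Parse(tokens[usecols[j]].Trim(),
          CultureInfo.InvariantCulture);
      rows.Add(row);
    }
    return rows.ToArray();
  }

  public static void Main()
  {
    string fn = Path.GetTempFileName();  // tiny demo file
    File.WriteAllText(fn, "# species, beak len, beak wid, fin len, weight\n" +
      "0, 40.6, 18.6, 183.0, 3550.0\n2, 50.2, 14.3, 218.0, 5700.0\n");
    double[][] m = MatLoad(fn, new int[] { 1, 2, 3, 4 }, ',', "#");
    Console.WriteLine(m.Length + " rows, " + m[0].Length + " cols");
    File.Delete(fn);
  }
}
```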
The arguments in the call to the program-defined MatShow() function specify to display the values with one decimal place, using a field width of 9, for only the first four rows, and to display row indices.
Data Normalization
The demo program normalizes the data with this statement:
double[][] minsMaxs = MatMinMaxValues(trainX);
trainX = MatNormalizeUsing(trainX, minsMaxs);
Console.WriteLine("Done ");
Console.WriteLine("X training normalized: ");
MatShow(trainX, 4, 9, 4, true);
The program-defined MatMinMaxValues() function scans the training data and calculates the minimum and maximum values for each column. For the demo data, the values stored in the minsMaxs matrix are:
34.6  14.5  180.0  3300.0
52.7  21.5  230.0  5850.0
The minimum values of the four predictors are in row [0] and the maximum values are in row [1]. These values are needed to normalize the test data, and to normalize any new, never-before-seen data that you want to classify.
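A sketch of how MatMinMaxValues() might compute the two-row result follows. Because the demo's function body is elided, this is an illustrative assumption rather than the demo's exact implementation:

```csharp
using System;

public static class MinMaxValuesDemo
{
  // Return a 2-row matrix: row [0] holds each column's minimum,
  // row [1] holds each column's maximum
  public static double[][] MatMinMaxValues(double[][] X)
  {
    int nCols = X[0].Length;
    double[][] result = new double[2][];
    result[0] = new double[nCols];  // mins
    result[1] = new double[nCols];  // maxs
    for (int j = 0; j < nCols; ++j)
    {
      double min = X[0][j], max = X[0][j];
      for (int i = 1; i < X.Length; ++i)
      {
        if (X[i][j] < min) min = X[i][j];
        if (X[i][j] > max) max = X[i][j];
      }
      result[0][j] = min;
      result[1][j] = max;
    }
    return result;
  }

  public static void Main()
  {
    // raw toad heights and weights from the earlier example
    double[][] toad = {
      new double[] { 56, 1800 }, new double[] { 20, 5000 },
      new double[] { 68, 1000 }, new double[] { 64, 4200 },
      new double[] { 80, 7400 }, new double[] { 68, 9000 },
      new double[] { 44, 7400 } };
    double[][] mm = MatMinMaxValues(toad);
    Console.WriteLine("mins: " + mm[0][0] + " " + mm[0][1]);  // mins: 20 1000
    Console.WriteLine("maxs: " + mm[1][0] + " " + mm[1][1]);  // maxs: 80 9000
  }
}
```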
The MatNormalizeUsing() function returns a matrix of normalized data values. The demo program assigns these values to the original matrix, overwriting the raw values. An alternative design choice is to keep both the raw and the normalized versions of the data.
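A sketch of what MatNormalizeUsing() might look like is shown below. Again, the demo's function body is elided, so this is an illustrative assumption that applies the same x' = (x - min) / (max - min) formula column by column:

```csharp
using System;

public static class NormalizeDemo
{
  // Min-max normalize each column of X using precomputed
  // column mins (row [0]) and maxs (row [1])
  public static double[][] MatNormalizeUsing(double[][] X,
    double[][] minsMaxs)
  {
    int nRows = X.Length, nCols = X[0].Length;
    double[][] result = new double[nRows][];
    for (int i = 0; i < nRows; ++i)
    {
      result[i] = new double[nCols];
      for (int j = 0; j < nCols; ++j)
        result[i][j] = (X[i][j] - minsMaxs[0][j]) /
          (minsMaxs[1][j] - minsMaxs[0][j]);
    }
    return result;
  }

  public static void Main()
  {
    // first toad item (56, 1800) with toad mins/maxs from the earlier example
    double[][] X = { new double[] { 56, 1800 } };
    double[][] mm = { new double[] { 20, 1000 },
                      new double[] { 80, 9000 } };
    double[][] norm = MatNormalizeUsing(X, mm);
    Console.WriteLine(norm[0][0] + " " + norm[0][1]);  // 0.6 0.1
  }
}
```

Note that normalizing using stored mins and maxs, rather than recomputing them, is what allows test data and new items to be normalized consistently with the training data.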
