Using Machine Learning to Predict 2023 Kentucky Derby Winning Race Times | Andrew Jocelyn

Can the weather forecast be used to predict race winning times?

My hypothesis is that weather has a big impact on Kentucky Derby winning times. This analysis uses the Kentucky Derby weather forecast to predict the time of the winning race using machine learning (ML). In previous articles, we discussed the importance of using explainable ML in a business environment to provide business insight and aid buy-in and change management. In this analysis, I’m purely in the pursuit of accuracy, so I’ll ignore this advice and go straight to the more complex and accurate black-box Gradient Boosted Machine (GBM).

The data used is from the National Weather Service.

# Load Libraries #
library(dplyr)
library(tidyr)
library(glmnet)
library(caret)
library(rpart)
library(gbm)
library(rsample)
library(plotly)
library(readr)
library(reticulate)
# Read in Data #
data <- read.csv("...KD Data.csv")
# Declare Year Variables #
year <- data[,1]
# Declare numeric x variables #
numeric <- data[,c(2,3,4)]
# Scale numeric x variables
scaled_x <- scale(numeric)
# check that we get mean of 0 and sd of 1
colMeans(scaled_x)
apply(scaled_x, 2, sd)

Make sure numeric data columns are scaled.
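
To make the scaling check concrete, here is a minimal sketch in base R on made-up numbers. The `toy` data frame and its column names (`temp`, `humidity`, `wind`) are hypothetical stand-ins, since the actual columns of the dataset are not shown. It also shows that `scale()` stores its centering and scaling constants as attributes, so the same transformation could be reapplied to a new row if needed.

```r
# Hypothetical toy data; column names and values are made up for illustration
toy <- data.frame(
  temp     = c(61, 75, 82, 68, 70),
  humidity = c(40, 55, 63, 48, 52),
  wind     = c(5, 12, 8, 10, 7)
)

scaled_toy <- scale(toy)

# Each scaled column should have mean ~0 and standard deviation 1
col_means <- colMeans(scaled_toy)
col_sds   <- apply(scaled_toy, 2, sd)
print(round(col_means, 10))
print(col_sds)

# scale() keeps the constants it used as attributes of the result
centers <- attr(scaled_toy, "scaled:center")
scales  <- attr(scaled_toy, "scaled:scale")

# Apply the same constants to a new (hypothetical) observation
new_row    <- data.frame(temp = 72, humidity = 50, wind = 9)
scaled_new <- scale(new_row, center = centers, scale = scales)
print(scaled_new)
```

In the article's code the 2023 row is scaled together with the historical rows, so this reuse step is not needed there; it is shown only as an alternative.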

#Declare y variable #
y <- data[,6]
# One-Hot Encoding #
data$Weather <- as.factor(data$Weather)
# Build dummy columns for Weather and drop the intercept column
xfactors <- model.matrix(~ Weather, data = data)[, -1]
# Bring prepped data all back together #
scaled_df <- as.data.frame(cbind(year,y,scaled_x,xfactors))
# Isolate pre-2023 data #
# Row 1 is treated as the 2023 race; all other rows are historical races
old_data <- scaled_df[-1,]
new_data <- scaled_df[1,]
# Gradient Boosted Machine #
# Find Max Interaction Depth #
floor(sqrt(NCOL(old_data)))

The heuristic above gives a maximum interaction depth of 3.
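
As a small illustration of the one-hot encoding and the depth heuristic, the sketch below runs the same steps on made-up data. The `toy` data frame, its values and the weather labels are hypothetical; only `model.matrix()`, `cbind()` and the `floor(sqrt(NCOL(...)))` rule are taken from the code above.

```r
# Hypothetical toy data; values and labels are made up for illustration
toy <- data.frame(
  Year    = c(2023, 2022, 2021, 2020),
  y       = c(NA, 121, 122, 123),
  Weather = factor(c("Rain", "Clear", "Cloudy", "Clear"))
)

# model.matrix() creates one 0/1 column per factor level.
# Dropping column 1 removes the intercept; the first level ("Clear")
# becomes the baseline and is represented by all zeros.
dummies <- model.matrix(~ Weather, data = toy)[, -1, drop = FALSE]
print(dummies)

# Reassemble the prepared columns into one data frame
toy_df <- as.data.frame(cbind(Year = toy$Year, y = toy$y, dummies))

# Depth heuristic: floor of the square root of the column count
max_depth <- floor(sqrt(NCOL(toy_df)))
print(max_depth)
```

With four columns in the toy frame the heuristic gives 2; a result of 3 corresponds to a data frame with between 9 and 15 columns.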

# Find Optimal n.trees #
tree_mod <- gbm(
formula = y ~ .,
distribution = "gaussian",
data = old_data,
shrinkage = 0.001, #Small dataset so small shrinkage
interaction.depth = 3, #Determined above
n.minobsinnode = 10, #Default
bag.fraction = 0.99, #Small dataset, so this has to be large
n.trees = 1000, 
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
# Find the number of trees that minimizes the OOB error estimate
best.iter <- gbm.perf(tree_mod, method="OOB", plot.it=TRUE, oobag.curve=TRUE, overlay=TRUE)
print(best.iter)

Plot of the OOB change in squared error loss against the number of trees, showing the optimal number of trees.
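
The gbm package itself notes that the OOB method tends to underestimate the optimal number of trees, so a cross-validated estimate may be worth comparing. Below is a minimal sketch on simulated data, assuming the gbm package is installed. The `sim` data frame, the seed and the hyperparameter values are illustrative assumptions, not the article's settings.

```r
library(gbm)

# Simulated data; this is not the Kentucky Derby dataset
set.seed(42)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 120 + 2 * sim$x1 - sim$x2 + rnorm(n, sd = 0.5)

# Fit once with 5-fold cross-validation switched on
cv_mod <- gbm(
  formula = y ~ .,
  distribution = "gaussian",
  data = sim,
  n.trees = 1000,
  shrinkage = 0.01,
  interaction.depth = 3,
  n.minobsinnode = 10,
  bag.fraction = 0.5,   # leaves half the rows out-of-bag for the OOB estimate
  cv.folds = 5,
  n.cores = 1,
  verbose = FALSE
)

# Compare the two estimates of the best number of trees
best_cv  <- as.numeric(gbm.perf(cv_mod, method = "cv",  plot.it = FALSE))
best_oob <- as.numeric(gbm.perf(cv_mod, method = "OOB", plot.it = FALSE))
print(c(cv = best_cv, oob = best_oob))
```

Cross-validation needs enough rows per fold to satisfy `n.minobsinnode`, so on a very small dataset the minimum node size or the number of folds may have to be reduced.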

# Full GBM Model #
GBM <- gbm(y ~ .,
distribution = "gaussian",
data = old_data,
n.trees = 500,  
interaction.depth = 3, 
shrinkage = 0.001,
n.minobsinnode = 10, 
bag.fraction = 0.99, 
train.fraction = 1, 
n.cores = NULL, 
verbose = FALSE
)

# 2023 Kentucky Derby Prediction #
# predict() only needs the fitted model, the new data and n.trees;
# the training hyperparameters are already stored in the model object
GBM_Prediction <- predict(GBM, newdata = new_data, n.trees = 500)
print(GBM_Prediction)

The 2023 Kentucky Derby winning time is predicted to be 122.12 seconds or 2 minutes and 2.12 seconds.
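
The conversion from seconds to minutes and seconds can be sketched in base R as follows. The helper name `format_race_time` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: format a time in seconds as m:ss.ss
format_race_time <- function(seconds) {
  seconds   <- round(seconds, 2)        # round first so 59.996 cannot print as 60.00
  minutes   <- seconds %/% 60           # whole minutes
  remainder <- seconds - 60 * minutes   # leftover seconds
  sprintf("%d:%05.2f", as.integer(minutes), remainder)
}

print(format_race_time(122.12))  # "2:02.12"
```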

In this article, we chose a more accurate but more complex black-box model to predict the winning time of the Kentucky Derby. The goal here is not to generate insights, gain buy-in or manage change, but to use the most accurate model available so that data-driven gambling can be done. Most business cases give up accuracy for explainability, but in some cases (like this one), accuracy is the key requirement of the model.

This prediction should be treated with caution, as it is based on the weather forecast for Saturday, May 6th, made on Thursday, May 4th. As we all know, even with a vast amount of technology, predicting the weather is very difficult. Using a weather forecast as the input for predicting the winning time adds even more uncertainty. That said, I would bet the over or under against the predicted winning time of 122.12 seconds.
