In the past, if you wanted to know which team would win the World Cup, you had to consult a seer with a crystal ball, read tea leaves, or hope that Paul the Octopus would tell you what was going to happen.
But modern data science can offer a better alternative. As part of a team of statisticians, I helped train machine learning algorithms to predict the most likely course of a tournament.
Probabilistic predictions and loaded dice
The algorithm we built proceeds in two steps.
The first combines sophisticated statistical models with expert insights from bookmakers and the transfer market to determine the strength of every team and its players. In the second step, a machine learning algorithm determines how to best combine the strength estimate with other information about the team.
This generates a probabilistic prediction for each possible match in the tournament. You can think of this as a loaded pair of dice. Rather than showing each number from 1 to 6 with the same probability, these loaded dice have different probabilities for the number of goals scored by each team.
For example, according to our predictions, Mexico is expected to score an average of 1.9 goals in their opening match, while their opponent South Africa is expected to score an average of only 0.7 goals. However, this does not mean that Mexico will definitely win. Instead, a Mexican victory is the most likely outcome with a 65% chance. A draw is less likely (21%) and a South African win is the least likely outcome (14%).
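To see how two goal averages translate into win, draw and loss chances, here is a minimal sketch. It assumes that each team's goals follow an independent Poisson distribution, which is a common simplification and not necessarily the exact model behind our figures, so the results should land near the percentages quoted above rather than exactly on them. The function names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(goals, mean):
    """Probability of scoring exactly `goals` when the average is `mean`."""
    return exp(-mean) * mean ** goals / factorial(goals)


def outcome_probabilities(mean_a, mean_b, max_goals=15):
    """Return (win_a, draw, win_b), assuming independent Poisson goal counts.

    Scorelines above `max_goals` are ignored; their probability is negligible
    for realistic goal averages.
    """
    win_a = draw = win_b = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, mean_a) * poisson_pmf(b, mean_b)
            if a > b:
                win_a += p
            elif a == b:
                draw += p
            else:
                win_b += p
    return win_a, draw, win_b


# Goal averages taken from the Mexico vs. South Africa example in the text.
win, draw, loss = outcome_probabilities(1.9, 0.7)
print(f"win {win:.1%}, draw {draw:.1%}, loss {loss:.1%}")
```

The double loop simply adds up the probability of every scoreline, sorted into the three possible results.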
“Vuelve a casa, el fútbol vuelve a casa!”
A different pair of loaded dice is used to simulate the outcome of each World Cup match. We have taken into account all FIFA rules, including the official tournament bracket and the possibility of extra time and penalty shootouts. We ran the simulation 100,000 times to determine the most likely course of the tournament.
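The idea of replaying a tournament many times can be sketched in a few lines. This is a toy version under loud assumptions: a four-team knockout bracket instead of the full 48-team format, made-up team names and strength values, a made-up rule for turning strengths into goal averages, and a coin flip standing in for extra time and penalties. None of these details come from our actual model.

```python
import random
from collections import Counter
from math import exp


def sample_goals(mean, rng):
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    limit, goals, product = exp(-mean), 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return goals
        goals += 1


def play_knockout(team_a, team_b, strength, rng):
    """Simulate one knockout match and return the winner."""
    # Hypothetical rule: goal averages scale with the ratio of strengths.
    mean_a = 1.3 * strength[team_a] / strength[team_b]
    mean_b = 1.3 * strength[team_b] / strength[team_a]
    goals_a, goals_b = sample_goals(mean_a, rng), sample_goals(mean_b, rng)
    if goals_a == goals_b:
        # Simplified stand-in for extra time and a penalty shootout.
        return rng.choice([team_a, team_b])
    return team_a if goals_a > goals_b else team_b


def simulate_tournament(bracket, strength, rng):
    """Play rounds until one team is left; neighbours in the list meet."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [play_knockout(teams[i], teams[i + 1], strength, rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]


def title_probabilities(bracket, strength, runs=10_000, seed=0):
    """Estimate each team's chance of winning as its share of simulated titles."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(bracket, strength, rng)
                   for _ in range(runs))
    return {team: wins[team] / runs for team in bracket}


# Hypothetical teams and strengths, for illustration only.
strength = {"A": 1.4, "B": 1.0, "C": 1.2, "D": 0.8}
probs = title_probabilities(["A", "B", "C", "D"], strength)
print(probs)
```

Each team's title probability is just the fraction of simulated tournaments it wins, which is why more runs give more stable estimates.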
According to these simulations, Spain is the favorite to win with a probability of 14.5%, followed closely by England and France with 12.4% each and Germany with 11.2%.
With the tournament expanded to 48 teams and a five-round knockout stage, the field of favorites is crowded. Portugal and Argentina also have a good chance of winning, with 8.9% and 8.2% respectively.
Meanwhile, the U.S. has a high chance of reaching the Round of 32, at 78%, the highest of the four teams in its group. But in the knockout stage, every match is win or go home, and the odds of the U.S. team “surviving” decrease relatively quickly. There is a 1% chance that the home team will win the final, which will be played at MetLife Stadium in New Jersey on July 19.
A deeper look into the engine room
Our machine learning algorithms and subsequent simulations are powered by data, expertise, and statistical models.
First, all international matches over the past eight years form the basis of a “retrospective” estimate of each team’s strength. Second, estimates of “future” strength are obtained from the quoted odds of various international bookmakers, reflecting the opinion of experts regarding the upcoming tournament.
Third, individual players are rated based on their contribution to goals at club and international level. Finally, a player’s current quality and future potential are reflected in their expected market value. These values are available from the Transfermarkt website, which uses a wisdom-of-the-crowd approach to estimate each player’s true but unknown market value.
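As an illustration of the second ingredient, quoted odds can be turned into probabilities. The sketch below uses the simplest textbook approach, which is an assumption on my part for this example and not a description of our exact procedure: take the inverse of each decimal odds value and rescale so the probabilities add up to one, which removes the bookmaker's margin. The teams and odds are invented.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds into probabilities that sum to one.

    The raw inverse odds add up to more than one because of the bookmaker's
    margin; dividing by their total is the simplest way to remove it.
    """
    raw = {team: 1.0 / odds for team, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}


# Invented odds for winning a tournament, for illustration only.
quoted = {"Team A": 5.5, "Team B": 7.0, "Team C": 9.0, "Rest of field": 1.6}
implied = implied_probabilities(quoted)
for team, p in implied.items():
    print(f"{team}: {p:.1%}")
```

Shorter odds mean a higher implied probability, so the team quoted at 5.5 comes out ahead of the team quoted at 7.0.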
These four variables are combined with a wider range of relevant information that reflects the current status of the teams and their countries of origin. This includes team-specific details such as FIFA ranking and the number of players in this year’s Champions League semi-finals. We also took into account country-specific socio-economic factors, such as GDP per capita.
Machine learning algorithms are used to determine whether and how these features are related to the actual results of World Cup matches.
Here, a so-called random forest is trained. It consists of a number of decision trees, each fitted to a slightly different subset of the data. The algorithm has been trained on every match played in a major soccer tournament since the 2006 World Cup. In this way, we relate team strength, market value, and other factors to the number of goals scored in World Cup matches. This is the information that loads the dice for the simulation.
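For readers curious about what a random forest does, here is a bare-bones sketch in plain Python. It is not our implementation: the training data are synthetic, the two features (a strength gap and a market value gap) and all settings are hypothetical, and a real analysis would rely on an established library. The sketch shows only the core idea of averaging many small decision trees, each grown on a resampled copy of the data with a random choice of features at every split.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, targets, rng, depth=0, max_depth=3, min_size=5):
    """Grow a regression tree; a leaf is a float, a node is a tuple."""
    leaf = sum(targets) / len(targets)
    if depth >= max_depth or len(rows) < 2 * min_size:
        return leaf
    n_features = len(rows[0])
    # Consider only a random subset of the features at each split.
    candidates = rng.sample(range(n_features), max(1, n_features // 2))
    best = None
    for f in candidates:
        for threshold in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            err = (sse([targets[i] for i in left])
                   + sse([targets[i] for i in right]))
            if best is None or err < best[0]:
                best = (err, f, threshold, left, right)
    if best is None:
        return leaf
    _, f, threshold, left, right = best
    grow = lambda idx: grow_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx],
                                 rng, depth + 1, max_depth, min_size)
    return (f, threshold, grow(left), grow(right))


def predict_tree(tree, row):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def fit_forest(rows, targets, n_trees=20, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, row) for t in forest) / len(forest)


# Synthetic data: [strength gap, market value gap] -> goals scored.
data_rng = random.Random(1)
rows = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(120)]
targets = [max(0, round(1.3 + 0.9 * s + 0.3 * v + data_rng.gauss(0, 0.5)))
           for s, v in rows]

forest = fit_forest(rows, targets)
strong = predict_forest(forest, [0.8, 0.8])    # clearly stronger team
weak = predict_forest(forest, [-0.8, -0.8])    # clearly weaker team
print(f"expected goals: strong {strong:.2f}, weak {weak:.2f}")
```

The predicted goal averages are what would then be fed into the dice: a stronger team gets a higher expected number of goals than a weaker one.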
Learn more
This is not the first time that our team has collaborated on World Cup predictions. It consists of Andreas Groll, Rouven Michels and their colleagues at the Technical University of Dortmund in Germany, Lars Magnus Hvattum at the University of Molde in Norway, Gunther Schauberger at the Technical University of Munich, and myself.
We correctly predicted that the United States would win the 2019 Women’s World Cup. Spain and Argentina, the eventual winners of the 2023 Women’s World Cup and the 2022 Men’s World Cup respectively, were among our strong contenders, but they were not our favorites.
The bottom line is that predictions are about probabilities. Although our program cannot predict the winner with 100% certainty, it may do better than the eight-limbed mollusk.
Achim Zeileis, Professor of Statistics, University of Innsbruck
This article is republished from The Conversation under a Creative Commons license. Read the original article.
