Why Machine Learning Is Not Suitable for Causal Inference | Quentin Gallea, PhD | July 2024

We all know that “correlation does not imply causation,” but why?

There are two main scenarios. First, as shown in Case 1 below, the positive correlation between drowning incidents and ice cream sales is probably due to a common cause – weather. Although both occur when it's sunny, there is no direct causal relationship between drowning incidents and ice cream sales. This is called a spurious correlation. The second scenario is shown in Case 2. Education has a direct effect on performance, but cognitive ability affects both. Therefore, in this situation, the positive correlation between education and job performance is confounded with the effect of cognitive ability.

The main reasons why “correlation does not imply causation”. Arrows represent the direction of causation in a causal graph. Image by author.
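The spurious correlation in Case 1 is easy to reproduce with a small simulation (hypothetical numbers, using only NumPy): drownings and ice cream sales never influence each other, yet because both depend on the weather they end up strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated days

# Common cause: daily temperature (hypothetical scale)
weather = rng.normal(25, 5, n)

# Neither variable causes the other; both respond to the weather
ice_cream_sales = 10 * weather + rng.normal(0, 20, n)
drownings = 0.5 * weather + rng.normal(0, 2, n)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"correlation = {r:.2f}")  # strongly positive despite no causal link
```

Any predictive model trained on this data would happily use ice cream sales to forecast drownings, and it would be right to do so for prediction, which is exactly the point.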

As we said at the beginning, predictive inference relies on correlation. Therefore, anyone who knows that “correlation does not imply causation” should understand that machine learning is inherently unsuited for causal inference. Ice cream sales are a good forecaster of same-day drowning incidents even though there is no causal relationship between them: the relationship is merely correlational and arises from a common cause, the weather.

However, if we want to study the potential causal effect of ice cream sales on drownings, we must take this third variable (weather) into account. Otherwise, the well-known omitted variable bias will distort our causal estimates. If we include this third variable in our analysis, we will almost certainly find that ice cream sales have no effect on drownings. Often, a simple way to address this issue is to include the variable in the model so that it is not “omitted.” However, confounders are often unobserved and therefore cannot simply be included in the model. Causal inference offers many ways to address unobserved confounders, but they are beyond the scope of this article.
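Omitted variable bias can be demonstrated directly by continuing the ice cream simulation (hypothetical numbers, NumPy least squares as a stand-in for any regression routine): a naive regression of drownings on sales finds a spurious positive coefficient, while adding weather as a control drives it to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
weather = rng.normal(25, 5, n)
sales = 10 * weather + rng.normal(0, 20, n)
drownings = 0.5 * weather + rng.normal(0, 2, n)  # sales has NO causal effect

# Naive regression: drownings ~ sales (weather omitted)
X_naive = np.column_stack([np.ones(n), sales])
b_naive, *_ = np.linalg.lstsq(X_naive, drownings, rcond=None)

# Adjusted regression: drownings ~ sales + weather
X_adj = np.column_stack([np.ones(n), sales, weather])
b_adj, *_ = np.linalg.lstsq(X_adj, drownings, rcond=None)

print(f"naive coefficient on sales:    {b_naive[1]:.3f}")  # biased away from 0
print(f"adjusted coefficient on sales: {b_adj[1]:.3f}")    # near 0
```

The naive coefficient is positive only because sales proxies for weather; once the confounder enters the model, the “effect” of ice cream disappears.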

Thus, the main difference between causal and predictive inference is how we select our “features”.

In machine learning, we usually include features that may improve the prediction quality and the algorithm helps to select the best features based on their predictive power. However, in causal inference, some features (confounders/common causes) must be included even if they have low predictive power and their effect is not statistically significant. The main concern is not the predictive power of the confounder, but how it affects the coefficient of the cause under study. Furthermore, there are features that should not be included in a causal inference model, such as mediators. Mediators represent indirect causal paths and controlling for such variables would make it impossible to measure the overall causal effect of interest (see figure below). Thus, the main difference is that including a feature in causal inference depends on the assumed causal relationship between the variables.

Mediator illustration. Here, motivation is a mediator of the effect of training on productivity. Imagine that training your staff not only increases their productivity directly, but also indirectly through motivation. Employees are motivated because they have acquired new skills and see that their employer is committed to improving their skills. When measuring the effectiveness of training, most of the time you want to measure the overall effect of the treatment (direct and indirect), but including motivation as a control variable prevents you from doing so. Image by author.
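The mediator logic can be checked with a small simulation (hypothetical effect sizes, NumPy only): regressing productivity on training alone recovers the total effect (direct plus indirect), while controlling for motivation leaves only the direct effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
training = rng.integers(0, 2, n).astype(float)       # 0/1 treatment
motivation = 1.0 * training + rng.normal(0, 1, n)    # mediator
productivity = 2.0 * training + 1.5 * motivation + rng.normal(0, 1, n)
# True total effect of training = 2.0 (direct) + 1.0 * 1.5 (via motivation) = 3.5

# Regression without the mediator: recovers the total effect
X_total = np.column_stack([np.ones(n), training])
b_total, *_ = np.linalg.lstsq(X_total, productivity, rcond=None)

# Regression controlling for the mediator: only the direct effect remains
X_ctrl = np.column_stack([np.ones(n), training, motivation])
b_ctrl, *_ = np.linalg.lstsq(X_ctrl, productivity, rcond=None)

print(f"total effect estimate:          {b_total[1]:.2f}")  # ~3.5
print(f"after controlling for mediator: {b_ctrl[1]:.2f}")   # ~2.0
```

Note the contrast with the confounder case: adding weather fixed the estimate, while adding motivation breaks it. Whether a control helps or hurts depends entirely on the assumed causal graph.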

This is a sensitive topic; for more information, see “A Crash Course in Good and Bad Controls” by Cinelli et al. (2022).

Imagine if you interpreted the correlation between ice cream sales and drowning deaths as causal: you might want to ban ice cream at all costs, yet this would of course have little or no effect on drownings.

A famous correlation is between chocolate consumption and Nobel Prize winners (Messerli (2012)). The study found a linear correlation coefficient of 0.8 between the two variables at the country level. This seems like a great argument to eat more chocolate, but it should not be interpreted as causal. (The potential causal mechanisms suggested in Messerli (2012) were later disproved; see, e.g., Maurage et al. (2013).)

The positive correlation between the number of Nobel Prize winners per 10 million inhabitants and chocolate consumption (kg/year/person) reported by Messerli (2012). Image by author.

Now, a more serious example: imagine you are trying to optimize posting times for your content creators. To do so, you build an ML model with a number of features. After some analysis, you find that posts published in the late afternoon or evening perform best, so you recommend a precise schedule: posting only between 5pm and 9pm. Once you implement it, you see a sharp drop in impressions per post. What happened? Your ML algorithm makes predictions based on existing patterns and interprets the data as it is: posts created later in the day correlate with higher impressions. But the posts published in the evening were more spontaneous and less planned; the creators were not specifically trying to please their audience, but simply sharing something valuable. In other words, it was not the timing, but the nature of the post. This spontaneity is hard to capture with an ML model, even if you encode features such as length or tone.

In marketing, predictive models are often used to measure the ROI of marketing campaigns.

In many cases, a simple Marketing Mix Model (MMM) suffers from omitted variable bias, making the ROI estimates misleading.

Competitor actions are usually correlated with our campaigns and may also affect sales. If this is not properly taken into account, it may lead to under- or over-estimation of ROI and lead to suboptimal business decisions and advertising expenditures.
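The competitor confounding above can be sketched with hypothetical numbers (NumPy only): our spend and competitor spend both react to shared demand shocks, so an MMM that omits competitor spend gets the ROI coefficient badly wrong.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000  # hypothetical weekly observations

market_heat = rng.normal(0, 1, n)  # shared demand shocks drive both advertisers
our_spend = 5 + 2 * market_heat + rng.normal(0, 1, n)
competitor_spend = 5 + 2 * market_heat + rng.normal(0, 1, n)
# True incremental sales per unit of our spend: +3; competitor spend hurts: -2
sales = 100 + 3 * our_spend - 2 * competitor_spend + rng.normal(0, 5, n)

# Naive MMM: sales ~ our_spend (competitor omitted)
X_naive = np.column_stack([np.ones(n), our_spend])
b_naive, *_ = np.linalg.lstsq(X_naive, sales, rcond=None)

# Adjusted model: sales ~ our_spend + competitor_spend
X_full = np.column_stack([np.ones(n), our_spend, competitor_spend])
b_full, *_ = np.linalg.lstsq(X_full, sales, rcond=None)

print(f"naive ROI coefficient:    {b_naive[1]:.2f}")  # biased (competitor omitted)
print(f"adjusted ROI coefficient: {b_full[1]:.2f}")   # near the true value of 3
```

With these numbers the naive model understates ROI; flip the sign of the competitor effect and it would overstate it instead, which is why either under- or over-estimation can occur.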

This concept is also important for policy and decision-making. At the beginning of the COVID-19 pandemic, French “experts” used graphs to argue that lockdowns were counterproductive (see figure below). The graphs revealed a positive correlation between lockdown stringency and COVID-19-related deaths (the stricter the lockdown, the more deaths). However, this relationship was probably driven by reverse causation: when things are bad (more deaths), countries take stricter measures. Indeed, if one properly studies the trajectory of national cases and deaths before and after lockdowns, controlling for potential confounding factors, one finds a strong negative effect (see Bonardi et al. (2023)).

A reproduction of the graph used to argue that lockdowns were ineffective. Green corresponds to the least restrictive lockdown measures, red the most restrictive. Image by author.
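Reverse causation can produce exactly this pattern, as a small simulation shows (entirely hypothetical numbers, NumPy only): stringency responds to epidemic severity, and lockdowns truly reduce deaths, yet the raw correlation between stringency and deaths comes out positive.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000  # hypothetical country-period observations

severity = rng.normal(100, 30, n)                    # underlying epidemic severity
stringency = 0.05 * severity + rng.normal(0, 1, n)   # worse outbreaks -> stricter lockdowns
deaths = severity - 5.0 * stringency + rng.normal(0, 5, n)  # true lockdown effect: -5

r = np.corrcoef(stringency, deaths)[0, 1]
print(f"naive correlation: {r:.2f}")  # positive, despite the negative causal effect

# Adjusting for severity recovers the true (negative) effect
X = np.column_stack([np.ones(n), stringency, severity])
b, *_ = np.linalg.lstsq(X, deaths, rcond=None)
print(f"adjusted effect of stringency: {b[1]:.2f}")  # near -5
```

A scatter plot of this simulated data would look exactly like the graph above: stricter measures, more deaths. The causal story is the opposite.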

Machine learning and causal inference are both extremely useful, but they serve different purposes.

As is often the case with numbers and statistics, the problem is almost always in the interpretation, not the metric. Thus correlation is useful and only problematic if you blindly interpret it as causation.

When to use causal inference: If you want to understand causal relationships and do impact assessments.

  • Policy evaluation: Determine the impact of new policies, such as the effect of a new educational program on student achievement.
  • Medical research: Evaluate the effects of new medicines and treatments on health outcomes.
  • Economics: Understand how changes in interest rates affect indicators such as inflation and employment.
  • Marketing: Evaluate the impact of marketing campaigns on sales.

Important questions in causal inference:

  • How does X affect Y?
  • Does changing X change Y?
  • What will happen to Y if we intervene in X?

When to use predictive inference: You want to make accurate predictions (associating features with outcomes) and learn patterns from your data.

  • Risk assessment: Predict the likelihood of credit defaults or insurance claims.
  • Recommender systems: Suggest products and content based on a user's past behavior.
  • Diagnosis: Classify medical images for disease detection.

Important questions in predictive inference:

  • What is the expected value of Y given X?
  • Can we predict Y based on new data about X?
  • Using current and historical data for X, how accurately can we predict Y?


