Combining machine learning and panel data: What practitioners need to know

author: Augusto Cerqua, Marco Letta, Gabriele Pinto

Machine learning (ML) plays a central role in economics, social science, and business decision-making. In the public sector, ML is increasingly used for so-called predictive policy problems: settings in which policymakers aim to identify the units at highest risk of a negative outcome and intervene proactively. Examples include targeting public subsidies, predicting regional recessions, and anticipating migration patterns. Similar predictive tasks arise in the private sector when companies seek to predict customer churn or optimize credit risk assessment. In both areas, better predictions lead to more efficient resource allocation and more effective interventions.

To achieve these goals, ML algorithms are increasingly applied to panel data, which consist of repeated observations of the same units over multiple time periods. However, ML models were not originally designed for panel data, which have distinct cross-sectional and longitudinal dimensions. Applying ML to panel data raises the risk of a subtle but serious problem: data leakage. This occurs when information that would not be available at prediction time inadvertently enters the model training process, artificially inflating predictive performance. In our paper “On the (mis)use of machine learning with panel data” (Cerqua, Letta, & Pinto, 2025), recently published in the Oxford Bulletin of Economics and Statistics, we provide the first systematic assessment of data leakage in ML with panel data, propose clear guidelines for practitioners, and illustrate our results through an empirical application using publicly available U.S. county data.

The leakage problem

Panel data combine two structures: a temporal dimension (units observed over time) and a cross-sectional dimension (multiple units such as regions or firms). Standard ML practice randomly splits the sample into training and testing sets, implicitly assuming independent and identically distributed (IID) data. When default ML procedures (such as random splits) are applied to panel data, this assumption is violated, resulting in two main types of leakage.

Temporal leakage: Future information leaks into the model during training, making predictions appear unrealistically accurate. Conversely, past observations may end up in the test set, so that the model “predicts” outcomes retrospectively.
Cross-sectional leakage: The same or very similar units appear in both the training and test sets, so the model has already “seen” much of the cross-sectional dimension of the data.

Figure 1 shows how different splitting strategies affect leakage risk. A random split at the unit-time level (panel A) is the most problematic because it introduces both temporal and cross-sectional leakage. Alternatives such as splitting by unit (panel B), by group (panel C), or by time (panel D) remove one type of leakage but not the other. As a result, no single strategy eliminates the problem entirely. Depending on the task, however, one form of leakage may not be a real concern, so the appropriate choice depends on the prediction goal (see below).

Figure 1 | Training set and test set based on different splitting rules

Note: In this example, the panel data are structured using year as the time variable, county as the unit variable, and state as the grouping variable. Figure created by the authors.
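As a rough illustration of the splitting rules in Figure 1, the sketch below builds a tiny hypothetical panel and flags which leakage type each split leaves in place. It is not code from the paper: the county labels, the years, and the `leakage_report` helper are assumptions made for this example, and the two checks are deliberately crude (any shared unit counts as cross-sectional leakage; any training period at or after the first test period counts as temporal leakage).

```python
import random
from itertools import product

# Hypothetical toy panel: each (county, year) pair is one observation.
counties = ["A", "B", "C", "D", "E", "F"]
years = list(range(2000, 2010))
panel = list(product(counties, years))


def leakage_report(train, test):
    """Flag the two leakage types for a train/test partition of (unit, year) pairs."""
    train_units = {u for u, _ in train}
    test_units = {u for u, _ in test}
    last_train_year = max(y for _, y in train)
    first_test_year = min(y for _, y in test)
    return {
        # Cross-sectional leakage: some test units were also seen in training.
        "cross_sectional": bool(train_units & test_units),
        # Temporal leakage: training data reach the first test period or beyond.
        "temporal": last_train_year >= first_test_year,
    }


# Panel A: random split at the unit-time level.
rng = random.Random(0)
shuffled = panel[:]
rng.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
random_split = (shuffled[:cut], shuffled[cut:])

# Panel B: split by unit (all years of a county stay on the same side).
test_counties = {"E", "F"}
unit_split = (
    [obs for obs in panel if obs[0] not in test_counties],
    [obs for obs in panel if obs[0] in test_counties],
)

# Panel D: split by time (train on earlier years, test on later ones).
time_split = (
    [obs for obs in panel if obs[1] < 2007],
    [obs for obs in panel if obs[1] >= 2007],
)

for name, (train, test) in [
    ("random", random_split),
    ("by unit", unit_split),
    ("by time", time_split),
]:
    print(name, leakage_report(train, test))
```

The random split is flagged on both counts, while the unit and time splits each remove exactly one type of leakage. A split by group (panel C) would work like the unit split, with states rather than counties held out.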

Two types of predictive policy problems

A key insight of the study is that researchers need to clearly define their predictive goals in advance. Predictive policy problems fall into two broad classes.

1. Cross-sectional prediction: The task is to map outcomes across units within the same time period, for example imputing missing data on GDP per capita across regions when reliable measurements are available only for some of them. The appropriate split here is at the unit level: different units are assigned to the training and test sets, while all periods are retained. This eliminates cross-sectional leakage but leaves temporal leakage, which is not a real concern here because forecasting is not the goal.

2. Sequential forecasting: The goal is to predict future outcomes from past data, for example predicting county-level income declines one year ahead to trigger early intervention. Here, the correct split is by time: earlier periods are used for training and later periods for testing. This avoids temporal leakage but not cross-sectional leakage, which is not a real concern here because the same units are being forecast over time.

The wrong approach in both cases is a random split at the unit-time level (panel A of Figure 1). This contaminates the results with both types of leakage and produces misleadingly high performance metrics.
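The two task-appropriate splits can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the record layout (`unit`, `period`), the function names, the 30% test share, and the 2015 cutoff are all hypothetical choices for illustration.

```python
import random


def split_by_unit(rows, test_share=0.3, seed=0):
    """Cross-sectional prediction: hold out whole units, keeping all their periods."""
    units = sorted({r["unit"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(units)
    n_test = max(1, round(test_share * len(units)))
    test_units = set(units[:n_test])
    train = [r for r in rows if r["unit"] not in test_units]
    test = [r for r in rows if r["unit"] in test_units]
    return train, test


def split_by_time(rows, first_test_period):
    """Sequential forecasting: train strictly before the cutoff, test from it onward."""
    train = [r for r in rows if r["period"] < first_test_period]
    test = [r for r in rows if r["period"] >= first_test_period]
    return train, test


# Hypothetical toy panel: 10 counties observed yearly from 2000 to 2019.
rows = [
    {"unit": f"county_{i}", "period": year}
    for i in range(10)
    for year in range(2000, 2020)
]

train_cs, test_cs = split_by_unit(rows, test_share=0.3)
train_seq, test_seq = split_by_time(rows, first_test_period=2015)

print(len(train_cs), len(test_cs))    # observations per side, unit split
print(len(train_seq), len(test_seq))  # observations per side, time split
```

In the unit split no county appears on both sides, and in the time split every training year precedes every test year. Which of the two is right depends only on the prediction goal.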

Practical guidelines

To help practitioners, we’ve compiled a list of do’s and don’ts when applying ML to panel data.

Choose the sample split based on your research question: unit-based for cross-sectional problems and time-based for forecasting.
Temporal leakage can arise not only through observations but also through predictors. For forecasting, use only lagged or time-invariant predictors. Using contemporaneous variables (for example, the 2014 unemployment rate to predict 2014 income) is conceptually wrong and causes temporal leakage.
Adapt cross-validation to panel data. The random k-fold CV offered by most off-the-shelf software packages is inappropriate because it mixes future and past information. Instead, use rolling or expanding windows for forecasting, or CV with folds stratified by unit or group for cross-sectional prediction.
Ensure that out-of-sample performance is tested on truly unseen data rather than on data already encountered during training.
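Two of these guidelines, lagged predictors and expanding-window validation, lend themselves to a short sketch. The code below is an illustration only: the helpers `add_lag` and `expanding_window_folds`, the record layout, and the toy unemployment values are assumptions introduced here, not part of the paper.

```python
def add_lag(rows, var, lag=1):
    """Return copies of rows with `var` lagged by `lag` years within each unit.

    Rows with no observation `lag` years earlier are dropped.
    """
    lookup = {(r["unit"], r["year"]): r[var] for r in rows}
    lagged = []
    for r in rows:
        key = (r["unit"], r["year"] - lag)
        if key in lookup:
            lagged.append({**r, f"{var}_lag{lag}": lookup[key]})
    return lagged


def expanding_window_folds(years, min_train=3, horizon=1):
    """Yield (train_years, test_years) so that training always precedes testing."""
    years = sorted(set(years))
    for end in range(min_train, len(years) - horizon + 1):
        yield years[:end], years[end:end + horizon]


# Hypothetical toy panel with made-up unemployment values (arbitrary units).
rows = [
    {"unit": unit, "year": year, "unemployment": 40 + 10 * i + (year - 2010)}
    for i, unit in enumerate(["county_0", "county_1"])
    for year in range(2010, 2016)
]

# Predict year-t outcomes only from information dated t-1 or earlier.
lagged = add_lag(rows, "unemployment", lag=1)

folds = list(expanding_window_folds(r["year"] for r in lagged))
for train_years, test_years in folds:
    print("train:", train_years, "test:", test_years)
```

Each fold trains on all years up to a cutoff and tests on the next one, so no future information enters training. For cross-sectional prediction the analogous idea is to build folds from whole units or groups. scikit-learn offers `GroupKFold` and `TimeSeriesSplit` for related purposes, though `TimeSeriesSplit` splits by row position, so with panel data the folds would need to be defined on periods rather than on raw rows.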

Empirical application

To illustrate these issues, we analyze a balanced panel of 3,058 U.S. counties from 2000 to 2019, focusing exclusively on sequential forecasting. We consider two tasks: a regression problem that predicts per capita income, and a classification problem that predicts whether income will decline in the following year.

We run hundreds of models that vary the splitting strategy, the use of contemporaneous predictors, the inclusion of lagged outcomes, and the algorithm (Random Forest, XGBoost, Logit, OLS). This comprehensive design allows us to quantify how much leakage inflates performance. Figure 2 below shows the key findings.

Panel A of Figure 2 shows predictive performance for the classification task. Random splits yield very high accuracy, but this is an illusion: the model has already seen very similar data during training.

Panel B shows predictive performance for the regression task. Again, random splits make the model look much better than it actually is, while correct time-based splits reveal a much lower but realistic level of accuracy.

Figure 2 | Temporal leakage in forecasting problems

Panel A – Classification task

Panel B – Regression task

The paper also shows that in years with significant distributional shifts and structural breaks, such as the Great Recession, the overestimation of model accuracy becomes considerably more pronounced, making the results particularly misleading for policy purposes.
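The mechanism behind this inflation can be reproduced on purely synthetic data. The sketch below does not use the paper's data, models, or results: it simulates a hypothetical panel in which each unit's outcome follows a random walk, and it uses a deliberately simple stand-in predictor (the same unit's closest-in-time training value) to mimic how a flexible learner can exploit neighbouring observations. All names and parameter values are assumptions for illustration.

```python
import random
from statistics import mean


def simulate_panel(n_units=50, n_years=20, seed=1):
    """Synthetic outcome: a unit-specific level plus a random walk over time."""
    rng = random.Random(seed)
    panel = {}
    for u in range(n_units):
        y = rng.gauss(0, 5)  # unit-specific starting level
        for t in range(n_years):
            y += rng.gauss(0, 1)  # persistent year-to-year shocks
            panel[(u, t)] = y
    return panel


def nearest_neighbour_mae(panel, train_keys, test_keys):
    """Predict each test point with the same unit's closest-in-time training value."""
    by_unit = {}
    for u, t in train_keys:
        by_unit.setdefault(u, []).append(t)
    errors = []
    for u, t in test_keys:
        if u not in by_unit:
            continue  # unit never seen in training; skipped in this toy setup
        nearest = min(by_unit[u], key=lambda s: abs(s - t))
        errors.append(abs(panel[(u, t)] - panel[(u, nearest)]))
    return mean(errors)


panel = simulate_panel()
keys = list(panel)

# Leaky design: random split at the unit-time level (future years enter training).
rng = random.Random(2)
shuffled = keys[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
mae_random = nearest_neighbour_mae(panel, shuffled[:half], shuffled[half:])

# Correct design for forecasting: train on the first 10 years, test on the last 10.
train_time = [k for k in keys if k[1] < 10]
test_time = [k for k in keys if k[1] >= 10]
mae_time = nearest_neighbour_mae(panel, train_time, test_time)

print(f"MAE, random split: {mae_random:.2f}")
print(f"MAE, time split:   {mae_time:.2f}")
```

Under the random split, most test points have a training observation from an adjacent year, often a later one, so the error looks small. Under the time split the predictor can only rely on the last training year, and the error grows with the forecast horizon, which is the honest measure of forecasting performance in this toy setting.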

Why it matters

Data leakage is more than a technical pitfall: it has real-world consequences. In policy applications, models that appear highly accurate during validation can break down after deployment, leading to misallocated resources, missed crises, or misdirected targeting. In business, the same problems can lead to poor investment decisions, inefficient customer targeting, or misplaced confidence in risk assessments. The danger is particularly acute when machine learning models are intended to act as early warning systems, where false confidence in inflated performance can lead to costly failures.

In contrast, a properly designed model provides honest, reliable predictions: it may look less accurate on paper, but it delivers meaningful information for decision-making.

Takeaway

ML has the potential to transform decision-making in both policy and business, but only if applied correctly. Although panel data offer rich opportunities, they are especially vulnerable to data leakage. To generate reliable insights, practitioners must tailor their ML workflows to their predictive goals, account for both the temporal and the cross-sectional structure of the data, and use validation strategies that prevent overly optimistic estimates and the illusion of high accuracy. By following these principles, practitioners can avoid the trap of inflated performance and build models that genuinely help policymakers allocate resources and companies make sound strategic choices. Given the rapid adoption of ML with panel data in both public and private domains, addressing these pitfalls is now an urgent priority for applied research.

References

A. Cerqua, M. Letta, and G. Pinto, “On the (mis)use of machine learning with panel data,” Oxford Bulletin of Economics and Statistics (2025): 1–13, https://doi.org/10.1111/obes.70019.
