The promise of zero-shot learning rests on foundation models that can make predictions without task-specific training, yet the computational cost of these models is often unknown. IEEE's Aayam Bansal and Ishaan Gangwani, along with colleagues, have published a detailed analysis of the hardware demands of key tabular foundation models. Their work introduces reproducible benchmarks that reveal important trade-offs between predictive accuracy and resource consumption, and demonstrates that traditional tree-based methods often match or exceed the performance of the foundation models while requiring significantly less processing time and memory. The study quantifies the hidden hardware costs of current tabular foundation models and establishes an important baseline for developing more efficient and sustainable machine learning techniques.
TabPFN-1.0 and TabICL-base were evaluated against carefully tuned XGBoost, LightGBM, and Random Forest models on a single NVIDIA T4 GPU. The tree ensembles match or exceed the foundation models' accuracy on three of the four datasets, complete a full test batch in under 0.40 seconds, and use less than 150 MB of RAM with no VRAM. TabICL leads by only 0.8 percentage points on the Higgs dataset, yet incurs roughly 40,000 times more latency (960 seconds) and requires 9 GB of VRAM. TabPFN matches tree accuracy on Wine and Housing but peaks at 4 GB of VRAM and cannot handle the full 100,000-row Higgs table. These findings quantify the steep trade-off between hardware requirements and accuracy and provide an open baseline for future efficiency-oriented research on tabular foundation models.
Benchmarking foundation models on tabular datasets
The study benchmarks zero-shot foundation models against established tree-based methods on tabular data, quantifying both predictive accuracy and hardware demands. The researchers used four publicly available datasets for a comprehensive evaluation: Adult Income, Higgs 100k, Wine Quality, and California Housing. Experiments were hosted on the Kaggle platform and run on a single NVIDIA T4 GPU with 2 vCPUs and 13 GB of RAM to ensure a controlled comparison. Five models were evaluated: XGBoost 1.7, LightGBM 4.3, a scikit-learn random forest, TabPFN-1.0, and TabICL-base. Each tree-based model was tuned with a 15-trial randomized search using stratified three-fold cross-validation to optimize performance per dataset.
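As a rough illustration of that tuning protocol, the sketch below runs a 15-trial randomized search with stratified three-fold cross-validation over an XGBoost classifier. The search space shown is illustrative only; the paper's exact hyperparameter grid is not specified here.

```python
# Minimal sketch of the tree-tuning protocol described above: a 15-trial
# randomized search with stratified three-fold cross-validation.
# The parameter space is an assumption, not the paper's exact grid.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_space = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    XGBClassifier(tree_method="hist"),
    param_distributions=param_space,
    n_iter=15,  # 15 trials, per the paper
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
# search.fit(X_train, y_train) would yield the tuned model used for comparison.
```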
The foundation models, by contrast, were evaluated zero-shot, with no gradient updates or fine-tuning. A key constraint is that TabPFN-1.0 can ingest at most 10,000 rows, so random training subsamples were drawn for the Adult, Higgs, and Housing datasets. Peak RAM and VRAM were tracked with the psutil and torch.cuda libraries. A Friedman test with a Nemenyi post-hoc test was then run on the accuracy ranks to identify significant differences between models, and hardware cost ratios were computed relative to XGBoost as a baseline; both steps are sketched below. Together, this methodology gives a nuanced picture of the trade-off between accuracy and resource consumption in current tabular foundation models.
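A sketch of the zero-shot evaluation loop with resource accounting might look as follows. It assumes the TabPFN v1 package API and NumPy-array inputs; the paper's exact measurement harness may differ.

```python
# Zero-shot TabPFN evaluation with rough resource accounting, assuming the
# TabPFN v1 API (from tabpfn import TabPFNClassifier). Sketch only.
import time

import numpy as np
import psutil
import torch
from tabpfn import TabPFNClassifier

def evaluate_tabpfn(X_train, y_train, X_test, max_rows=10_000, seed=0):
    # TabPFN-1.0 caps the training context at ~10,000 rows, so larger
    # datasets (Adult, Higgs, Housing) are randomly subsampled.
    rng = np.random.default_rng(seed)
    if len(X_train) > max_rows:
        idx = rng.choice(len(X_train), size=max_rows, replace=False)
        X_train, y_train = X_train[idx], y_train[idx]

    torch.cuda.reset_peak_memory_stats()
    ram_before = psutil.Process().memory_info().rss

    clf = TabPFNClassifier(device="cuda")  # zero-shot: no gradient updates
    clf.fit(X_train, y_train)              # "fit" only stores the context
    start = time.perf_counter()
    preds = clf.predict(X_test)            # full-batch inference
    latency = time.perf_counter() - start

    peak_vram = torch.cuda.max_memory_allocated() / 2**30          # GiB
    ram_delta = (psutil.Process().memory_info().rss - ram_before) / 2**20
    # ram_delta is an RSS difference, a coarse proxy for RAM usage.
    return preds, latency, ram_delta, peak_vram
```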
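One way to reproduce the rank-based significance test is SciPy's Friedman test plus a Nemenyi post-hoc from the scikit-posthocs package (an assumed dependency; the accuracy values below are placeholders, not the paper's results).

```python
# Friedman test over accuracy ranks, followed by pairwise Nemenyi p-values.
# All numbers are placeholders for illustration.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows: datasets (blocks); columns: models (groups).
acc = np.array([
    #  XGB    LGBM    RF   TabPFN TabICL
    [0.874, 0.872, 0.861, 0.865, 0.868],  # Adult   (placeholders)
    [0.725, 0.724, 0.718, 0.700, 0.733],  # Higgs
    [0.899, 0.897, 0.890, 0.898, 0.903],  # Wine
    [0.910, 0.908, 0.901, 0.905, 0.902],  # Housing
])

stat, p = friedmanchisquare(*acc.T)  # one sample of accuracies per model
print(f"Friedman chi2={stat:.3f}, p={p:.3f}")

# Pairwise Nemenyi post-hoc over the same block design.
print(sp.posthoc_nemenyi_friedman(acc))
```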
Comparing foundation-model and tuned-tree performance
The study presents a comprehensive benchmark that evaluates zero-shot foundation models on tabular data and compares them directly with tuned gradient-boosted decision trees. On a single NVIDIA T4 GPU, the researchers measured test accuracy, inference latency, peak RAM usage, and peak GPU VRAM consumption across four public datasets: Adult Income, Higgs 100k, Wine Quality, and California Housing. The goal was to quantify the hardware demands of these emerging models and establish a baseline for future efficiency-focused research. The experiments show that tree-based ensembles, especially XGBoost and LightGBM, achieve consistently high accuracy, reaching 87.45% on the Adult Income dataset and over 91% on California Housing. Random Forest also delivers competitive results across all datasets. Among the foundation models, TabICL leads by 0.8 percentage points on the Higgs dataset, reaching 73.29% accuracy, and scores 90.29% on Wine, a strong showing. TabICL pays heavily in resources for these gains, however, and the results reveal a stark trade-off between accuracy and hardware consumption. Although TabICL reaches competitive accuracy on a given dataset, it incurs roughly 40,000 times more latency (up to 960 seconds) and 9 GB of VRAM, whereas the tree-based models complete a full test batch in under 0.40 seconds with minimal RAM. TabPFN, limited to 10,000 rows by architectural constraints, matches tree accuracy on Wine and Housing but peaks at 4 GB of VRAM and cannot process the full 100,000-row Higgs table. These findings show that current tabular foundation models impose a substantial hardware burden and suggest their primary value lies in rapid prototyping on small tables rather than large-scale production inference.
Tabular foundation models show no accuracy advantage
The study presents a first controlled comparison of zero-shot tabular foundation models and tuned decision-tree ensembles that evaluates both accuracy and hardware cost. The researchers found no statistically significant difference in overall accuracy between the foundation models and tree-based methods such as XGBoost and LightGBM, with performance falling within a narrow band across the four public datasets. Large differences emerged in computational efficiency, however. The tree-based approaches completed full-batch inference in under 0.4 seconds using minimal memory, whereas one foundation model required 960 seconds and a large amount of VRAM to achieve a small accuracy improvement on a single dataset.
The findings indicate that current zero-shot tabular foundation models offer no accuracy advantage over tuned gradient-boosted decision trees on medium-sized tabular tasks, while demanding far more hardware. Foundation models remain useful for rapid prototyping and exploring small datasets, but their current hardware requirements rule out deployment in real-time or resource-constrained environments. The authors suggest that future research focus on lightweight variants built with techniques such as quantization and distillation, or on hybrid pipelines that combine features generated by a foundation model with the efficient inference of tree-based learners; a sketch of the latter idea follows. The code and data have been released to facilitate reproducible, hardware-aware evaluation of tabular foundation models.
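A hedged sketch of that hybrid idea: use the foundation model purely as a feature extractor and hand inference to a fast tree learner. The `embed` hook is hypothetical; neither TabPFN-1.0 nor TabICL is assumed to expose embeddings this way.

```python
# Hybrid pipeline sketch: foundation-model features feeding a tree learner.
# `foundation_model.embed` is a hypothetical API, used only for illustration.
import numpy as np
from lightgbm import LGBMClassifier

def hybrid_fit(foundation_model, X_train, y_train):
    # One-off (offline) cost: run the expensive model once on training data.
    Z_train = foundation_model.embed(X_train)   # hypothetical embedding hook
    X_aug = np.hstack([X_train, Z_train])       # raw + learned features
    tree = LGBMClassifier(n_estimators=300)
    tree.fit(X_aug, y_train)
    return tree

def hybrid_predict(foundation_model, tree, X_test):
    # Per-query cost is dominated by the tree, though note that the
    # embedding step here still invokes the foundation model at test time.
    Z_test = foundation_model.embed(X_test)
    return tree.predict(np.hstack([X_test, Z_test]))
```

Whether this recovers any accuracy benefit while keeping tree-level inference cost is exactly the kind of question the released benchmark is meant to make easy to test.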
👉 More information
🗞 Lightweight benchmark reveals hidden hardware costs of zero-shot tabular foundation models
🧠 arXiv: https://arxiv.org/abs/2512.00888
