Winning a Kaggle competition with generative AI-assisted coding

In March 2026, three LLM agents generated over 600,000 lines of code and ran 850 experiments, helping to win first place in the Kaggle Playground competition.

Success in modern machine learning competitions depends on how quickly ideas can be generated, tested, and iterated. The LLM agent combined with GPU acceleration compresses this loop significantly.

Historically, two bottlenecks have limited this experimentation:

  • How quickly can you write code for a new experiment?
  • How quickly can you run these experiments?

GPUs and libraries such as NVIDIA cuDF, NVIDIA cuML, XGBoost, and PyTorch have largely solved the second problem. The LLM agent is now addressing the first problem, enabling rapid iterative experimentation at new scales.

This blog post describes how to use LLM agents to accelerate the discovery of the best-performing tabular data prediction solutions.

Case Study: Kaggle Playground Churn Prediction

The March 2026 Kaggle Playground competition challenged participants to predict telecom customer churn, with performance measured by area under the curve (AUC) and the most accurate solution winning.

The first-place solution is a four-level stack of 150 models selected from 850 models.

Guided LLM agent workflow

For this tabular data competition, we coached LLM agents to follow the Kaggle Grandmaster playbook described in a previous blog post.

Specifically, the LLM agent follows a workflow: it starts with exploratory data analysis (EDA), followed by baseline building, feature engineering, and finally combining models through hill climbing and stacking.

This solution used multiple LLM agents (GPT-5.4 Pro, Gemini 3.1 Pro, Claude Opus 4.6) in a human-in-the-loop workflow.

Step 1: LLM agent runs EDA

The LLM agent needs to understand the data structure before generating a complete pipeline.

The main questions are:

  • How many rows and columns are there in the training and test sets?
  • What is the target column and how is it formatted?
  • Is the task classification or regression?
  • What features are available and how are they formatted?
  • Which features are categorical and which are numerical?
  • Are there any missing values?

This information can be provided upfront or automatically inferred through EDA.

If you are using an LLM in a chat window, you can use a prompt like the following:

“Please write EDA code to explore the CSV file train.csv and test.csv. I will run the code and share the plots and text back with you.” 

If you are using an LLM agent with code execution, such as Claude Code, you can ask it to write and run the code itself to understand your data.

“Please write and run EDA code to understand the CSV files train.csv and test.csv”
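
For example, here is a minimal sketch of the kind of EDA code the agent might produce for this step. The “churn” target name is an assumption; the EDA output itself confirms the actual column names and types.

```python
# Minimal EDA sketch: shapes, dtypes, missing values, and feature types.
# The "churn" target name is a placeholder assumption for this competition.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print("Train shape:", train.shape, "Test shape:", test.shape)
print("\nColumn dtypes:\n", train.dtypes)
print("\nMissing values per column:\n", train.isna().sum())

# Target distribution tells us whether the task is classification or regression
# and how imbalanced the classes are.
if "churn" in train.columns:
    print("\nTarget distribution:\n", train["churn"].value_counts(normalize=True))

# Quick split into categorical vs. numerical features.
categorical = train.select_dtypes(include="object").columns.tolist()
numerical = train.select_dtypes(include="number").columns.tolist()
print("\nCategorical features:", categorical)
print("Numerical features:", numerical)
```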

Step 2: LLM agent builds baseline

Once the LLM understands the data, specifically the feature and target columns, we create the first complete pipeline by asking the LLM to train a k-fold model with a specific algorithm.

“Please write a full code pipeline to read train.csv and test.csv and train a kfold XGBoost model. Save the OOF (out of fold) predictions and the test preds to disk as NumPy files. Display the metric score for each fold and overall.”

Copy and paste the generated code into your codebase, or have a command-line or IDE agent create the Python script or Jupyter notebook directly.

Run the code to get the initial CV metric score and the OOF and test prediction files.

You can ask the LLM to build a variety of baselines, such as GBDT, NN, and classical ML models. Each experiment reports a CV score and saves the predictions to disk as “train_oof_[MODEL]_[VERSION].npy” and “test_preds_[MODEL]_[VERSION].npy”.

These files are important and will be used later.
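
As a concrete illustration, here is a minimal sketch of the kind of k-fold XGBoost baseline the agent might write, following the file naming convention above. The “churn” target name and the assumption that all features are already numeric are placeholders; adapt them to your data.

```python
# Minimal k-fold XGBoost baseline sketch that saves OOF and test predictions.
# Assumes a binary "churn" target and numeric features; encode categoricals first.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

TARGET = "churn"  # assumption: adjust to the actual target column
features = [c for c in train.columns if c not in (TARGET, "id")]

oof = np.zeros(len(train))
test_preds = np.zeros(len(test))
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (tr_idx, va_idx) in enumerate(kf.split(train, train[TARGET])):
    model = xgb.XGBClassifier(
        n_estimators=1000,
        learning_rate=0.05,
        max_depth=6,
        tree_method="hist",
        device="cuda",              # GPU acceleration; drop if no GPU is available
        eval_metric="auc",
        early_stopping_rounds=100,
    )
    model.fit(
        train.iloc[tr_idx][features], train.iloc[tr_idx][TARGET],
        eval_set=[(train.iloc[va_idx][features], train.iloc[va_idx][TARGET])],
        verbose=False,
    )
    oof[va_idx] = model.predict_proba(train.iloc[va_idx][features])[:, 1]
    test_preds += model.predict_proba(test[features])[:, 1] / kf.n_splits
    print(f"Fold {fold}: AUC = {roc_auc_score(train.iloc[va_idx][TARGET], oof[va_idx]):.5f}")

print(f"Overall CV AUC = {roc_auc_score(train[TARGET], oof):.5f}")

# Save predictions using the naming convention from this step.
np.save("train_oof_xgb_v1.npy", oof)
np.save("test_preds_xgb_v1.npy", test_preds)
```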

Step 3: LLM agent performs feature engineering

We now have a diverse collection of models, and we know their baseline CV metric scores. Each model can be improved through feature engineering and model tuning/improvement. Feature engineering focuses on transforming the data so that the model can extract more signal, while model tuning/improvement focuses on changing the model itself to extract more signal. LLM agents excel at both of these tasks.

By repeatedly running experiments and keeping all the ideas that improve the model, you’ll end up with a better and better model. For each experiment, good or bad, we always save the OOF and test predictions to disk.

LLM agents can write experiment code almost instantly, so running the experiments becomes the bottleneck. To accelerate the cycle, we always use GPUs and GPU libraries such as cuDF, cuML, GPU-accelerated gradient boosted decision trees (GBDT), and PyTorch on GPU to run each experiment as fast as possible.

To generate new ideas, suggest them yourself or have your LLM generate them for you. Here are some effective ways to encourage idea generation with your LLM:

  • Ask your LLM to find and read research papers on the topic.
  • Ask your LLM to read forums and publicly shared code on this topic.
  • Have the LLM perform EDA to find relationships between features and the target that suggest feature engineering.
  • Ask your LLM for ideas based on its current knowledge base.
  • Brainstorm with the LLM and create ideas together.

Once you have an idea, you can ask your LLM agent to create new code from existing code.

“Please write me a complete replacement code for the code below that uses XYZ instead of ABC”.

Then run the new experiment.
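
As one illustration, here is a minimal sketch of a feature-engineering experiment using cuDF on the GPU. The column names “contract_type” and “monthly_charges” are hypothetical; use whichever categorical and numeric columns your EDA surfaced.

```python
# Minimal cuDF feature-engineering sketch: per-group aggregate and ratio features.
# "contract_type", "monthly_charges", and the file names are placeholder assumptions.
import cudf

train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

# Per-group statistics of a numeric column, computed on the training data only.
stats = (
    train.groupby("contract_type")["monthly_charges"]
    .agg(["mean", "std"])
    .reset_index()
    .rename(columns={"mean": "charges_mean_by_contract", "std": "charges_std_by_contract"})
)

# Join the group statistics back onto each row of train and test.
train = train.merge(stats, on="contract_type", how="left")
test = test.merge(stats, on="contract_type", how="left")

# New ratio feature: how far each customer's charges sit from their group mean.
train["charges_vs_group"] = train["monthly_charges"] / train["charges_mean_by_contract"]
test["charges_vs_group"] = test["monthly_charges"] / test["charges_mean_by_contract"]
```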

Step 4: LLM agent combines models

At this point, we have many experimental results, each with its own model and feature engineering, stored in Python scripts or Jupyter notebooks. LLM agents excel at combining all these models and ideas and can help you use and manage them in a variety of ways, including:

  • Summarize all model types and feature engineering.
  • Combine ideas from different models and feature engineering to build a new, powerful single model.
  • Build an ensemble from different models.
  • Stack models on top of other models.
  • Use several models to extract pseudo-labels/knowledge into a new powerful single model.

One of the first and most helpful things I do is have the LLM agent summarize all the experiments. You can drag and drop files into the chat window, or use an LLM command-line agent (such as Claude Code) to read and aggregate the results from multiple files. This helps you better understand your data and problem and shows you what’s working.
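
Below is a minimal sketch of what such a summary could look like in code, assuming the OOF file naming convention from Step 2 and a “churn” target column.

```python
# Minimal sketch: score every saved OOF file and print a ranked summary table.
# Assumes the "train_oof_*.npy" naming convention and a "churn" target column.
import glob
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

y = pd.read_csv("train.csv")["churn"].values

rows = []
for path in sorted(glob.glob("train_oof_*.npy")):
    oof = np.load(path)
    rows.append({"experiment": path, "cv_auc": roc_auc_score(y, oof)})

summary = pd.DataFrame(rows).sort_values("cv_auc", ascending=False)
print(summary.to_string(index=False))
```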

One powerful technique is to ask the LLM agent to combine multiple ideas/models into one model.

“Can you read all these IPYNB files and use all these ideas to write full code to train a new single XGBoost model which is stronger than all of these models?” 

Another technique is to transfer knowledge from some or all of the models into a single model. We use OOF and test predictions (essentially pseudo-labels) to transfer knowledge into a new, powerful, single model.

“Can you please train a new single NN or GBDT using knowledge distillation from all our OOF and Test PREDs and make a new high performing single model?”
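
Here is a minimal sketch of one way such a distillation could look, using the mean of all saved test predictions as soft pseudo-labels for the test rows; the “churn” target and “id” column names are assumptions, and this is only one of several reasonable distillation setups.

```python
# Minimal knowledge-distillation sketch via pseudo-labels: average all saved test
# predictions, label the test rows with them, and train one new single model.
# "churn", "id", and the file names are placeholder assumptions.
import glob
import numpy as np
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
TARGET = "churn"

# Soft labels for the test set: the mean of all saved test prediction files.
test_pred_files = sorted(glob.glob("test_preds_*.npy"))
pseudo = np.mean([np.load(f) for f in test_pred_files], axis=0)

features = [c for c in train.columns if c not in (TARGET, "id")]
X = pd.concat([train[features], test[features]], ignore_index=True)
y = np.concatenate([train[TARGET].values, pseudo])  # hard labels + soft pseudo-labels

# Regressing on soft targets is one common way to distill an ensemble into a
# single model; a classifier with sample weights is another option.
model = xgb.XGBRegressor(
    n_estimators=2000, learning_rate=0.02, max_depth=6,
    tree_method="hist", device="cuda",
)
model.fit(X, y)
distilled_test_preds = model.predict(test[features])
```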

Both of the above techniques generate new experiments and new OOF and test prediction files. Each baseline model and each experiment with new feature engineering or model improvements has an associated OOF and test prediction file. It is common to have hundreds of files. You can now ask your LLM to combine them using hill climbing and stacking.

“Can you please try combining all our OOF and Test PREDs using various meta models? Please try Hill Climbing, Ridge/Logistic regression, NN, and GBDT stackers. Thanks”
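
For reference, here is a minimal sketch of greedy hill climbing over the saved OOF files, assuming the naming convention from Step 2 and a “churn” target column; it is not the competition’s exact meta-model code.

```python
# Minimal hill-climbing sketch: greedily add (with repetition) the OOF file that
# most improves the blended CV AUC. Assumes "train_oof_*.npy" files and a "churn" target.
import glob
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

y = pd.read_csv("train.csv")["churn"].values
oofs = {f: np.load(f) for f in sorted(glob.glob("train_oof_*.npy"))}

selected = []
blend = np.zeros(len(y))
best_score = 0.0

for _ in range(50):
    best_candidate, best_candidate_score = None, best_score
    for name, oof in oofs.items():
        candidate = (blend * len(selected) + oof) / (len(selected) + 1)
        score = roc_auc_score(y, candidate)
        if score > best_candidate_score:
            best_candidate, best_candidate_score = name, score
    if best_candidate is None:
        break  # no model improves the blend any further
    selected.append(best_candidate)
    blend = (blend * (len(selected) - 1) + oofs[best_candidate]) / len(selected)
    best_score = best_candidate_score
    print(f"Added {best_candidate}: CV AUC = {best_score:.5f}")
```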

Results

Follow the four steps above to create a set of diverse models. Next, improve the performance of each model. Finally, combine everything into a powerful solution. The advantage is that LLM agents write the code faster and GPU-accelerated execution runs the experiments faster, so you can explore more ideas quickly. Anyone looking for the best-performing solution for tabular data prediction tasks can use these techniques.

Let’s get started

Ready to accelerate your results? Start by exploring the cuDF and cuML libraries and CUDA-X for data science.

Want to learn more? Sharpen your skills with the DLI Workshop on Feature Engineering. Pick up professional strategies from the post The Kaggle Grandmasters Playbook: 7 Battle-Tested Modeling Techniques for Tabular Data.
