Hugging Face has launched ML Intern, an open-source AI agent that autonomously explores, writes, and executes machine learning code. Early benchmark results show it outperforming Anthropic's Claude Code on scientific reasoning and OpenAI's Codex on medical evaluation.
The project was built by Hugging Face's AI agent team and is pitched as an automated version of the post-training research loop the company's ML researchers use. It is currently available as a CLI and as mobile and desktop web apps.
Aksel Joonas Reedi, who works on the AI agent at Hugging Face, announced the release on LinkedIn and said the agent will retrieve papers from arXiv and hf.co/papers, examine citation graphs, retrieve and reformat datasets, and start training jobs with Hugging Face Jobs if local GPUs are not available. Hugging Face is also provisioning $1,000 in GPU resources and Anthropic credits for early users.
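The citation-graph step can be pictured as a breadth-first walk outward from a seed paper. The sketch below is a toy, in-memory illustration (the paper IDs and the `explore_citations` helper are hypothetical, not ML Intern's actual code, which would fetch real records from arXiv or hf.co/papers):

```python
from collections import deque

def explore_citations(graph, start, max_papers=10):
    """Breadth-first walk over a citation graph, returning papers
    in the order an agent might retrieve them."""
    seen, order = {start}, [start]
    queue = deque([start])
    while queue and len(order) < max_papers:
        for cited in graph.get(queue.popleft(), []):
            if cited not in seen:
                seen.add(cited)
                order.append(cited)
                queue.append(cited)
    return order

# Toy graph keyed by hypothetical arXiv-style IDs.
graph = {
    "2504.00001": ["2503.12345", "2502.54321"],
    "2503.12345": ["2501.11111"],
}
found = explore_citations(graph, "2504.00001")
print(found)  # ['2504.00001', '2503.12345', '2502.54321', '2501.11111']
```

The breadth-first order matters: it surfaces directly cited work before second-hand citations, which mirrors how a researcher would triage a reading list.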
ML Intern beats Claude Code and Codex in benchmarks
Reedi said on LinkedIn that ML Intern was tasked with training the best LLM for scientific reasoning, and that the agent surfaced NVIDIA research, including OpenScience and Nemotron-CrossThink, through citation searches before running 12 supervised fine-tuning passes on Qwen3-1.7B. Claude Code's best score on the same task was 22.99%.
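Twelve fine-tuning passes suggests a small hyperparameter sweep. A minimal sketch of that pattern, assuming a 4 x 3 grid of learning rates and epoch counts (both hypothetical; the `run_sft` stand-in returns a dummy score rather than actually training Qwen3-1.7B):

```python
from itertools import product

# Hypothetical sweep: 12 SFT runs = 4 learning rates x 3 epoch counts.
learning_rates = [1e-5, 2e-5, 5e-5, 1e-4]
epoch_counts = [1, 2, 3]

def run_sft(lr, epochs):
    """Stand-in for a real fine-tuning run; returns a dummy
    eval score instead of training anything."""
    return 1.0 - abs(lr - 2e-5) * 1e3 + 0.01 * epochs

results = {(lr, ep): run_sft(lr, ep)
           for lr, ep in product(learning_rates, epoch_counts)}
best = max(results, key=results.get)
print(f"{len(results)} runs; best config: lr={best[0]}, epochs={best[1]}")
```

In a real sweep the agent would compare held-out eval scores across runs and keep the best checkpoint; the grid loop is the same.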
Reedi said that during another test, this one medical, the agent determined that the quality of the existing dataset was too low, so it wrote a script that generated 1,100 synthetic data points covering emergency, clinical, and multilingual communication, then upsampled the data 50 times for training. He said the result "outperformed Codex on HealthBench by 60%." He added that for a competitive math task, the agent wrote a complete GRPO training script, launched it on an A100 GPU via Hugging Face Spaces, and ran ablations until it succeeded, even after the reward initially collapsed.
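The 50x upsampling step described above can be sketched in a few lines: replicate the small synthetic set and shuffle. The record layout and `upsample` helper here are hypothetical, shown only to make the arithmetic concrete (1,100 examples x 50 = 55,000 training rows):

```python
import random

def upsample(examples, factor, seed=0):
    """Replicate a small dataset `factor` times and shuffle,
    mirroring the 50x upsampling described in the article."""
    out = examples * factor
    random.Random(seed).shuffle(out)
    return out

# 1,100 hypothetical synthetic records with made-up domain tags.
synthetic = [{"id": i, "domain": d}
             for i, d in enumerate(["a", "b", "c"] * 367)][:1100]
train_rows = upsample(synthetic, 50)
print(len(train_rows))  # 55000
```

Naive replication like this only changes sampling frequency, not content, so it is usually paired with shuffling (as here) so identical copies are spread across training batches.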
What developers and researchers get
According to the project's public documentation, ML Intern runs agent loops of up to 300 iterations per task, using a context manager that handles message history and automatic compaction, plus a tool router covering Hugging Face docs, datasets, jobs, and papers, as well as GitHub code search and sandboxed execution. The CLI is installed via uv and accepts any inference-provider model ID, with defaults pointing to an Anthropic Claude model.
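The loop the documentation describes, an iteration cap, history compaction, and a tool router, can be sketched as follows. This is a minimal skeleton under stated assumptions (the compaction threshold, `compact` strategy, and tool names are all hypothetical, not ML Intern's implementation):

```python
MAX_ITERATIONS = 300  # per-task cap reported in the documentation
COMPACT_AT = 8        # hypothetical history-length threshold

def compact(history):
    """Naive compaction: summarize everything but the last two messages."""
    return [f"[summary of {len(history) - 2} messages]"] + history[-2:]

def run_agent(task, tools, model):
    history = [task]
    for _ in range(MAX_ITERATIONS):
        if len(history) > COMPACT_AT:
            history = compact(history)
        action, arg = model(history)             # model picks a tool call
        if action == "finish":
            return arg
        history.append(str(tools[action](arg)))  # route to the named tool
    return None  # iteration cap reached without finishing

# Toy run: a scripted "model" that searches papers once, then finishes.
tools = {"papers": lambda q: f"results for {q!r}"}
script = iter([("papers", "GRPO"), ("finish", "done")])
result = run_agent("train a model", tools, lambda h: next(script))
print(result)  # done
```

Real agents replace the scripted model with an LLM call and the lambda tools with arXiv search, dataset loading, and job submission, but the control flow is the same: cap iterations, compact context when it grows, and dispatch each model-chosen action through the router.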
Reedi said on LinkedIn that the agent "deeply embodies the work and thinking of researchers" and "knows what data should look like and what a good model feels like."
Hugging Face offers $1,000 in GPU credits at launch
Reedi added that Hugging Face is offering $1,000 in GPU resources and Anthropic credits to the tool's earliest users, and that both the CLI and web app are now live. The incentive arrives as universities, bootcamps, and EdTech startups face pressure to give students and staff hands-on access to model training without paying commercial cloud fees.
The open question is how ML Intern's autonomy profile holds up outside carefully selected benchmarks, especially on messy real-world education datasets where data quality, consent, and licensing constraints all apply. Hugging Face says the agent is open source and built on its own ecosystem, which lets the community test those limits in public.
