Introducing AutoAgent: an open source library that lets an AI engineer and optimize its own agent harness overnight

There is a particular kind of tedium that every AI engineer knows: the prompt-tuning loop. You write a system prompt, run your agent against a benchmark, read the failure traces, tweak the prompt, add a tool, and rerun. Repeat this a few dozen times and you might move the needle. It is grunt work dressed up in Python files. Now a new open source library called AutoAgent, built by Kevin Gu of thirdlayer.inc, proposes an unsettling alternative: don’t do that work yourself. Let an AI do it.

AutoAgent is an open source library for autonomously improving an agent on any domain. In a 24-hour run, it reached #1 on SpreadsheetBench with a score of 96.5% and achieved the top GPT-5 score on Terminal Bench with 55.1%.

https://x.com/kevingu/status/2039843234760073341

What actually is AutoAgent?

AutoAgent is described as being “like autoresearch, but for agent engineering.” The idea is to give an AI agent a task and let it autonomously build and iterate on an agent harness overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

To understand the analogy: Andrej Karpathy’s autoresearch does the same thing for ML training. It loops through propose, train, and evaluate cycles, keeping only the changes that improve validation loss. AutoAgent carries the same ratchet loop over from ML training to agent engineering. Instead of optimizing model weights or training hyperparameters, it optimizes the harness: the system prompts, tool definitions, routing logic, and orchestration strategies that determine how an agent behaves on a task.
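The ratchet loop described above can be sketched in a few lines of Python. This is a minimal illustration, not AutoAgent's actual code: the function names `ratchet`, `propose`, and `evaluate` are assumptions made for this example, and the toy "harness" is a single number rather than a real prompt or tool set.

```python
import random
from typing import Callable, Tuple, TypeVar

H = TypeVar("H")  # whatever represents a harness; a plain dict in the toy run


def ratchet(
    initial: H,
    propose: Callable[[H], H],
    evaluate: Callable[[H], float],
    iterations: int,
) -> Tuple[H, float]:
    """Keep a candidate only if it strictly beats the best score so far."""
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = propose(best)  # always mutate the current best
        score = evaluate(candidate)
        if score > best_score:  # the ratchet: the kept score never goes down
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "benchmark" rewards temperatures close to 0.7.
    # A real loop would edit prompts and tools and run real tasks instead.
    rng = random.Random(0)

    def toy_propose(h: dict) -> dict:
        return {"temperature": h["temperature"] + rng.uniform(-0.1, 0.1)}

    def toy_evaluate(h: dict) -> float:
        return 1.0 - abs(h["temperature"] - 0.7)

    best, score = ratchet({"temperature": 0.0}, toy_propose, toy_evaluate, 200)
    print(best, round(score, 3))
```

Because a candidate is accepted only on strict improvement, a bad proposal costs one evaluation and nothing else, which is what makes it reasonable to leave such a loop running unattended.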

A harness, in this context, is the scaffolding around an LLM: the system prompt it receives, the tools it can invoke, how it routes between subagents, and how tasks are formatted as input. Most agent engineers handcraft this scaffolding. AutoAgent automates the iteration of the scaffold itself.

Architecture: 2 agents, 1 file, 1 directive

The GitHub repository has an intentionally simple structure. agent.py is the entire harness under test in a single file. It contains the configuration, tool definitions, agent registry, routing/orchestration, and the Harbor adapter boundary. The adapter section is explicitly marked as fixed; the rest is the meta-agent’s primary edit surface. program.md contains the meta-agent’s instructions together with the directive (what kind of agent to build), and it is the only file the human edits.

Think of this as a separation of concerns between human and machine. The human sets the direction inside program.md. The meta-agent (a separate, higher-level AI) reads that directive, inspects agent.py, runs the benchmark, diagnoses what failed, rewrites the relevant parts of agent.py, and repeats. The human never touches agent.py directly.
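One way to picture the split between the fixed adapter section and the editable rest of agent.py is a guard that rejects any rewrite touching the fixed part. The sketch below is purely illustrative: the marker string and both function names are hypothetical, and the article does not say how the repository actually marks or enforces the boundary.

```python
from typing import Tuple

# Hypothetical marker line; the real file may mark the fixed section differently.
FIXED_MARKER = "# --- FIXED ADAPTER BOUNDARY ---"


def split_harness(source: str, marker: str = FIXED_MARKER) -> Tuple[str, str]:
    """Return (editable_part, fixed_part); the fixed part starts at the marker."""
    head, sep, tail = source.partition(marker)
    if not sep:
        raise ValueError("marker not found; cannot locate the fixed section")
    return head, sep + tail


def edit_is_allowed(old_source: str, new_source: str) -> bool:
    """Accept an edit only if the fixed section is unchanged character for character."""
    try:
        return split_harness(old_source)[1] == split_harness(new_source)[1]
    except ValueError:
        return False  # a rewrite that drops the marker is rejected as well
```

A check like this would let the meta-agent rewrite prompts and tools freely while the benchmark-facing adapter stays stable across experiments.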

A key piece of infrastructure that keeps the loop consistent across iterations is results.tsv, an experiment log automatically created and maintained by the meta-agent. It tracks every experiment run, giving the meta-agent a history to learn from and to calibrate what to try next. The full project structure also includes Dockerfile.base, an optional .agent/ directory for reusable agent workspace artifacts such as prompts and skills, a tasks/ folder for benchmark payloads (added per benchmark branch), and a jobs/ directory for Harbor job output.
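To make the role of an experiment log concrete, here is a small sketch of appending to and reading back a tab-separated file with Python's standard `csv` module. The column names and both helper functions are assumptions for illustration; the article does not describe the real schema of results.tsv.

```python
import csv
from pathlib import Path
from typing import Optional

COLUMNS = ["experiment", "change", "score", "kept"]  # hypothetical schema


def log_experiment(path: Path, experiment: int, change: str, score: float, kept: bool) -> None:
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow([experiment, change, f"{score:.4f}", "yes" if kept else "no"])


def best_kept_score(path: Path) -> Optional[float]:
    """Read the log back and return the highest score among kept experiments."""
    with path.open(newline="", encoding="utf-8") as f:
        rows = csv.DictReader(f, delimiter="\t")
        kept = [float(row["score"]) for row in rows if row["kept"] == "yes"]
    return max(kept) if kept else None
```

Recording discarded experiments alongside kept ones is a deliberate choice in this sketch: a history of what failed is as useful for deciding the next change as a history of what worked.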

The metric is the total score produced by the benchmark’s task test suites. The meta-agent hill-climbs on this score. Every experiment produces a numeric score: if it is better, the change is kept; if not, it is discarded. This is the same loop as autoresearch.

Task format and Harbor integration

Benchmarks are expressed as tasks in Harbor format. Each task lives under tasks/my-task/ and includes a task.toml for configuration such as timeouts and metadata, an instruction.md that holds the prompt sent to the agent, and a tests/ directory with a test.sh entry point that writes the score to /logs/reward.txt and a test.py for verification using either deterministic checks or LLM-as-judge. An environment/Dockerfile defines the task container, and a files/ directory holds reference files mounted into the container. Tests write a score between 0.0 and 1.0 to the verifier logs. The meta-agent hill-climbs on this score.
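As a rough sketch of what a deterministic test.py might do, the example below scores an answer and writes a value between 0.0 and 1.0 to a reward file. The scoring rule, the function names, and the plain-text file format are assumptions for illustration; only the /logs/reward.txt path and the 0.0 to 1.0 range come from the description above.

```python
from pathlib import Path
from typing import List


def score_answer(expected: List[str], actual: List[str]) -> float:
    """Fraction of expected lines reproduced exactly, position by position."""
    if not expected:
        return 0.0
    hits = sum(1 for e, a in zip(expected, actual) if e.strip() == a.strip())
    return hits / len(expected)


def write_reward(score: float, reward_path: Path = Path("/logs/reward.txt")) -> None:
    """Clamp the score to [0.0, 1.0] and write it as plain text."""
    clamped = max(0.0, min(1.0, score))
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{clamped}\n", encoding="utf-8")
```

Partial credit, as in `score_answer`, gives the meta-agent a smoother signal to climb than a strict pass/fail check, although either fits the 0.0 to 1.0 contract.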

The LLM-as-judge pattern here is worth flagging. Instead of only checking answers deterministically (like a unit test), the test suite can use another LLM to evaluate whether the agent’s output is “good enough.” This is common in agent benchmarks where the correct answer is not reducible to a string match.
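The two verification styles can be combined in one grader, sketched below under stated assumptions. The `judge` parameter is a hypothetical callable standing in for a call to another LLM (prompt in, reply text out), and the prompt wording and parsing rule are invented for this example; none of it is taken from AutoAgent or Harbor.

```python
import re
from typing import Callable, Optional

# Hypothetical interface: takes a prompt string, returns the judge model's reply.
Judge = Callable[[str], str]


def grade(task: str, output: str, expected: Optional[str], judge: Judge) -> float:
    """Exact match when a reference answer exists, otherwise ask the judge."""
    if expected is not None:
        return 1.0 if output.strip() == expected.strip() else 0.0
    prompt = (
        "Rate how well the output completes the task on a scale from 0 to 1.\n"
        f"Task: {task}\nOutput: {output}\nReply with a single number."
    )
    match = re.search(r"\d+(?:\.\d+)?", judge(prompt))
    if match is None:
        return 0.0  # an unparseable verdict is treated as a failure
    return max(0.0, min(1.0, float(match.group())))  # clamp into [0.0, 1.0]
```

In a test, the judge can be a stub such as `lambda prompt: "Score: 0.8"`, which keeps the grading logic checkable without any model call. Judge scores are noisier than exact checks, so preferring the deterministic path whenever a reference answer exists is a sensible default.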

Key takeaways

Autonomous harness engineering works: AutoAgent shows that a meta-agent can take over the human prompt-tuning loop entirely, iterating on agent.py overnight without any human touching the harness file directly.
Benchmark results validate the approach: in a 24-hour run, AutoAgent reached first place on SpreadsheetBench (96.5%) and the top GPT-5 score on Terminal Bench (55.1%), beating every other entry, all of which were hand-engineered by humans.
“Model empathy” may be a real phenomenon: a Claude meta-agent optimizing a Claude task agent appeared to diagnose failures more accurately than one optimizing a GPT-based agent, suggesting that pairing models from the same family may matter when designing an AutoAgent loop.
The human’s job shifts from engineer to director: you do not write or edit agent.py. You write program.md, a plain Markdown directive that steers the meta-agent. This distinction reflects a broader shift in agent engineering, from writing code to setting goals.
Plug and play with any benchmark: because tasks follow Harbor’s open format and agents run inside Docker containers, AutoAgent is domain-agnostic. Any task that can be scored, whether spreadsheets, terminal commands, or your own custom domain, can be a target for autonomous self-optimization.
