What does an end-to-end stack for a terminal agent look like when you combine a structured toolkit, a synthetic RL environment, and benchmark-aligned evaluation? A team of researchers from CAMEL AI, Eigent AI, and other collaborators has released SETA, a toolkit and environment stack focused on reinforcement learning for terminal agents. The project targets agents that run within a Unix-style shell and must complete verifiable tasks under a benchmark harness such as Terminal Bench.
Three main contributions:
- State-of-the-art terminal agents on Terminal Bench: The team reports state-of-the-art performance with Claude Sonnet 4.5-based agents on Terminal Bench 2.0 and GPT 4.1-based agents on Terminal Bench 1.0. Comparisons are limited to agents that use the same base model.
- Scalable RL training using a synthetic terminal environment: The research team released an initial synthetic dataset containing 400 terminal tasks covering a range of difficulty levels. Of these, 260 tasks are used for RLVR fine-tuning of the Qwen3-8B model.
- Clean agent design that generalizes across training and evaluation frameworks: The same agent implementation is used for both local task execution and the official Terminal Bench evaluation harness.
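RLVR (reinforcement learning with verifiable rewards) depends on a reward that can be checked programmatically, which is why the tasks must be verifiable. The article does not describe SETA's actual reward function, so the following is only a minimal sketch that assumes a binary pass/fail reward derived from a test command's exit code. The name `verifiable_reward` and its parameters are hypothetical.

```python
import subprocess
import sys


def verifiable_reward(command, cwd=None, timeout_s=600.0):
    """Return 1.0 if the test command exits with status 0, else 0.0.

    Assumption: a binary pass/fail reward. The reward actually used in
    SETA's RL pipeline is not specified in the article.
    """
    try:
        completed = subprocess.run(
            command, cwd=cwd, capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test run is treated as a failed episode.
        return 0.0
    return 1.0 if completed.returncode == 0 else 0.0


if __name__ == "__main__":
    # Stand-in commands; a real task would invoke its own test script.
    print(verifiable_reward([sys.executable, "-c", "raise SystemExit(0)"]))
    print(verifiable_reward([sys.executable, "-c", "raise SystemExit(1)"]))
```

In a real setup the command would presumably be the task's own test script, run inside the task's container rather than on the host.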
Terminal toolkit and log structure
The SETA code repository introduces a terminal toolkit that turns language models into executable terminal agents. Each time a task runs, the framework creates a structured log directory under evaluation/terminal_bench_run. The README shows the concrete layout for a task called play-zork.
The main files include:
- chatagent.log records the complete history of agent messages and tool calls, including test results.
- A sessions directory containing session_logs captures terminal interactions from the toolkit.
- Within session_logs, files such as blocking_commands.log, session_run_zork_1_correct_path.log, session_zork-1.log, and session_zork_start.log store command output for different sessions and modes.
- tests.log and tests.log.strip record the output of the test run; the latter has terminal control characters removed.
This structure provides a concrete way to debug an agent. You can trace a run from the high-level chat decisions in chatagent.log down to the individual shell commands in the session logs, then confirm success or failure from the test logs.
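The debugging flow described here can be sketched in a few lines of Python. This is an illustration rather than SETA's own tooling: the helper names `strip_control_chars` and `collect_logs` are hypothetical, the exact nesting of the directories is not assumed (files are found by name), and the regular expression is only one plausible way to produce something like tests.log.strip.

```python
import re
from pathlib import Path

# Matches CSI sequences (colors, cursor movement) and OSC sequences (window titles).
ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)")


def strip_control_chars(text: str) -> str:
    """Remove ANSI escape sequences and carriage returns from terminal output."""
    return ANSI_RE.sub("", text).replace("\r", "")


def collect_logs(task_dir: Path) -> dict:
    """Group the log files of one task run by role, wherever they are nested."""
    return {
        "chat": sorted(task_dir.rglob("chatagent.log")),
        "sessions": sorted(
            p for p in task_dir.rglob("*.log") if p.parent.name == "session_logs"
        ),
        "tests": sorted(task_dir.rglob("tests.log*")),
    }


if __name__ == "__main__":
    # Hypothetical location of one run; adjust to the actual directory on disk.
    logs = collect_logs(Path("evaluation/terminal_bench_run"))
    for role, paths in logs.items():
        print(role, [str(p) for p in paths])
```

Reading the three groups in order (chat, sessions, tests) mirrors the top-down trace from agent decisions to shell commands to test outcomes.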
For the official Terminal Bench evaluation, the GitHub repository provides a separate entry point under evaluation/terminal_bench_eval. Developers can move into that directory and run run_eval.sh for Terminal Bench 1.0 or run_tb2.sh for Terminal Bench 2.0.
The results are written to evaluation/terminal_bench_eval/run/{run_id}/results.json. Task-specific session logs are located under evaluation/terminal_bench_eval/logs/camel_logs/{task_id}. The agent class that binds the CAMEL agent to the benchmark is implemented in tbench_camel_agent.py.
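To locate the outputs of a given run from a script, the two path templates above can be wrapped in small helpers. This is a sketch under assumptions: the function names are hypothetical, and since the schema of results.json is not described in the article, the file is loaded as generic JSON without assuming any keys.

```python
import json
from pathlib import Path

# Path templates as given in the article, relative to the repository root.
EVAL_ROOT = Path("evaluation/terminal_bench_eval")


def results_path(run_id: str) -> Path:
    """Path of the results file for one evaluation run."""
    return EVAL_ROOT / "run" / run_id / "results.json"


def task_log_dir(task_id: str) -> Path:
    """Directory holding the session logs of one task."""
    return EVAL_ROOT / "logs" / "camel_logs" / task_id


def load_results(run_id: str, repo_root: Path = Path(".")):
    """Load results.json as plain JSON, or return None if the run has no results yet."""
    path = repo_root / results_path(run_id)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

A caller would pass the run identifier printed by the evaluation script and inspect the returned object interactively before relying on any particular field.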
Note-taking toolkit as persistent memory
The research team also introduces a note-taking toolkit, described as persistent memory for long-horizon tasks. The team shows examples of note-taking tool calls in which the agent writes and reads notes in a structured manner while solving a terminal task. The current public documentation focuses on the existence and use cases of this toolkit; the full training objectives for note use have not yet been described.
The important point is that the agent has an explicit channel, apart from the raw terminal buffer, that allows it to externalize intermediate results and hints.
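The idea of an explicit memory channel can be illustrated with a small file-backed store. This is not the SETA or CAMEL note-taking API, whose interface the article does not show; `NoteStore` and its methods are hypothetical names used only to make the concept concrete.

```python
from pathlib import Path


class NoteStore:
    """Hypothetical file-backed notes; NOT the SETA/CAMEL note-taking API."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        return self.directory / f"{name}.md"

    def append_note(self, name: str, content: str) -> None:
        """Append one entry to a named note, creating the note if needed."""
        with self._path(name).open("a", encoding="utf-8") as fh:
            fh.write(content.rstrip("\n") + "\n")

    def read_note(self, name: str) -> str:
        """Return the full text of a note, or an empty string if it does not exist."""
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.is_file() else ""

    def list_notes(self) -> list:
        """Return the names of all stored notes, sorted alphabetically."""
        return sorted(p.stem for p in self.directory.glob("*.md"))
```

Because the notes live on disk rather than in the terminal buffer, they survive screen clears and long command output, which is the property the article highlights.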
Understanding performance
SETA's agent harness achieves state-of-the-art results on Terminal Bench. The CAMEL terminal agent, which uses Claude Sonnet 4.5 as its backbone, reaches an accuracy of 46.5% on Terminal Bench 2.0 across 89 real-world tasks, ranking first and outperforming the second-place system by 3 percentage points. It performs particularly well on git workflows, DevOps automation, and code security tasks. On Terminal Bench 1.0, the GPT 4.1-based agent achieves 35% accuracy, which is 4.7 percentage points above the next entry within the same model family. By comparison, the supervised Qwen3 8B baseline reaches 3.4% on Terminal Bench 2.0, and the Qwen3 8B terminal agent trained with the SETA RL pipeline improves over this baseline in selected synthetic environments.
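The margins quoted above imply the scores of the runner-up entries, which the article does not state directly. The short calculation below derives them from the reported figures only; the function names are hypothetical, and the derived values are implied numbers rather than results quoted by the source.

```python
def runner_up(score_pct: float, margin_pts: float) -> float:
    """Score of the next entry implied by a reported lead, in percent."""
    return round(score_pct - margin_pts, 1)


def mean_tasks_solved(accuracy_pct: float, num_tasks: int) -> float:
    """Average number of tasks solved that corresponds to an accuracy figure."""
    return accuracy_pct * num_tasks / 100


tb2_runner_up = runner_up(46.5, 3.0)      # implied Terminal Bench 2.0 runner-up
tb1_runner_up = runner_up(35.0, 4.7)      # implied Terminal Bench 1.0 runner-up
tb2_solved = mean_tasks_solved(46.5, 89)  # about 41.4 of 89 tasks

print(tb2_runner_up, tb1_runner_up, tb2_solved)
```

The Terminal Bench 2.0 figure corresponds to a non-integer number of solved tasks, so the reported accuracy is presumably an average over several runs; the article does not say how many.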
Key takeaways
- SETA is a collaborative community project that provides both an agent toolkit built specifically for terminal agents and a synthetic RL environment tailored to the Terminal Bench evaluation format.
- The framework reports state-of-the-art performance for the CAMEL terminal agent on Terminal Bench 1.0 and 2.0, using GPT 4.1 and Claude Sonnet 4.5 respectively as base models, evaluated against agents built on the same model family.
- The SETA RL dataset on Hugging Face contains 400 synthetic terminal tasks, each packaged as task.yaml, Dockerfile, and run-tests.sh, with 260 tasks used for RLVR fine-tuning of a Qwen3-8B-based agent.
- The open-source SETA codebase exposes a terminal toolkit with structured logging and a note-taking toolkit for long-term memory, and integrates directly with the Terminal Bench evaluation scripts and log paths in the seta GitHub repository.
- The overall design presents a clean path from synthetic RL environments to benchmark-validated agents, giving developers a reproducible stack to train, debug, and evaluate terminal agents rather than relying on ad-hoc tool invocation examples.
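The three-file task packaging mentioned in the takeaways lends itself to a quick sanity check before training. The sketch below assumes one directory per task under a dataset root, which is an assumption about the on-disk layout rather than something the article confirms; `missing_files` and `find_valid_tasks` are hypothetical helper names.

```python
from pathlib import Path

# File names per task as listed in the article.
REQUIRED_FILES = ("task.yaml", "Dockerfile", "run-tests.sh")


def missing_files(task_dir: Path) -> list:
    """Return the required files that are absent from one task directory."""
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


def find_valid_tasks(dataset_root: Path) -> list:
    """Return the names of task directories that contain all required files.

    Assumption: each task lives in its own subdirectory of dataset_root.
    """
    return sorted(
        d.name
        for d in dataset_root.iterdir()
        if d.is_dir() and not missing_files(d)
    )
```

Running `find_valid_tasks` on a local copy of the dataset gives a list that can be compared against the expected task count before any fine-tuning starts.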

Michal Sutter is a data science expert with a master's degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
