Artificial intelligence learns faster in 1,000 new virtual worlds

Machine Learning


Researchers are tackling the challenge of reliably training autonomous agents with Agent World Model (AWM), a new pipeline for generating fully synthetic environments. Zhaoyang Wang of the University of North Carolina at Chapel Hill, Canwen Xu and Boyi Liu of Snowflake, along with Yite Wang, Siwei Han, Zhewei Yao, and others, present a system that creates 1,000 diverse environments based on everyday scenarios for training agents. The work addresses a critical limitation of current agent training methods: the lack of diverse and reliable environments needed to scale training. By using code-driven, database-backed simulation, AWM provides consistent state transitions and efficient agent interactions, ultimately enabling strong out-of-distribution generalization, as demonstrated on three benchmarks.

Unlike existing methods that rely on expensive real-world data or potentially unreliable LLM-simulated environments, AWM creates database-backed, code-driven environments that ensure consistent and predictable state transitions. Each environment exposes an average of 35 tools, supporting complex multi-turn interactions and high-quality observations.

The core of the innovation lies in AWM's systematic approach to environment synthesis, which mirrors established software development practices. Starting from a high-level scenario description, the pipeline generates user requirements and a corresponding database schema. This schema guides the creation of the toolset and backend code, giving each tool a clear data model to operate on.

A unified interface built on the Model Context Protocol (MCP) lets agents interact with every environment in the same way, and automatically generated validation code provides reliable reward signals for reinforcement learning. Automated execution and self-correction during generation make it possible to create environments at scale.

To demonstrate the effectiveness of this resource, the researchers performed large-scale reinforcement learning with multi-turn tool-using agents. The fully executable environments and accessible database state enabled the design of robust reward functions, significantly improving agent performance.

Experiments on three established benchmarks show that agents trained only within these synthetic environments generalize strongly out of distribution, outperforming benchmark-specific training. This suggests a path toward more versatile and adaptive AI agents that can tackle real-world challenges.

Building synthetic environments using large language models and database-driven state management

The Agent World Model (AWM) pipeline is a fully synthetic environment generation system designed to scale agent training. The research team used it to generate 1,000 diverse environments representing everyday scenarios, each equipped with an average of 35 tools for agents to interact with.

These environments are code-driven and use databases to ensure reliable and consistent state transitions, unlike environments that rely on LLM-simulated responses. The methodology begins with scenario generation, in which large language models write descriptions of stateful applications such as e-commerce platforms and CRM systems.
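
As a rough illustration of what a code-driven, database-backed tool could look like, the sketch below uses Python's built-in sqlite3 module. The `orders` table, the `cancel_order` tool, and its return format are hypothetical and are not taken from AWM; the point is only that the same call on the same state always produces the same result.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Create an in-memory database holding the environment state (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        [(1, "pending"), (2, "shipped")],
    )
    conn.commit()
    return conn


def cancel_order(conn: sqlite3.Connection, order_id: int) -> dict:
    """Hypothetical tool: cancel an order only if it is still pending."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return {"ok": False, "error": "order not found"}
    if row[0] != "pending":
        return {"ok": False, "error": f"cannot cancel an order that is {row[0]}"}
    # The state transition is an ordinary SQL update, so it is fully reproducible.
    conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = ?", (order_id,))
    conn.commit()
    return {"ok": True, "order_id": order_id, "status": "cancelled"}
```

Because the tool reads and writes real rows rather than asking a language model to imagine the outcome, a second attempt to cancel the same order fails in a predictable way.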

A filtering pipeline combining an LLM-based classifier with embedding-based deduplication selects scenarios that involve core CRUD (create, read, update, delete) operations and keeps the 1,000 generated scenarios diverse. This initial stage establishes a wide range of potential interaction contexts for the agent.
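
The deduplication step can be pictured with a minimal sketch. It assumes each scenario already has an embedding vector and drops any scenario whose cosine similarity to an already kept one reaches a threshold. The greedy strategy, the 0.9 threshold, and the toy three-dimensional vectors are illustrative assumptions, not details reported for AWM.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(
    items: list[tuple[str, list[float]]], threshold: float = 0.9
) -> list[str]:
    """Greedily keep a scenario only if it is not too similar to any kept one."""
    kept: list[tuple[str, list[float]]] = []
    for name, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((name, vec))
    return [name for name, _ in kept]


# Toy embeddings: the first two scenarios point in almost the same direction.
scenarios = [
    ("online bookstore", [1.0, 0.0, 0.0]),
    ("e-book shop", [0.98, 0.1, 0.0]),
    ("CRM system", [0.0, 1.0, 0.0]),
]
unique = deduplicate(scenarios)
```

With these toy vectors, "e-book shop" is dropped as a near-duplicate of "online bookstore", while "CRM system" is kept.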

After scenario creation, the pipeline proceeds to task synthesis and database design. The system generates a task set for each environment and designs a corresponding database that defines the state space. Synthesized data then populates this database, providing an initial state for the agent to interact with and enabling grounded feedback during reinforcement learning.
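
To make the database design step concrete, here is a small sketch of a schema and seed data for a CRM-style scenario, again using sqlite3. The table names, columns, and rows are invented for illustration; AWM's generated schemas are not shown in this article.

```python
import sqlite3

# Hypothetical schema for a CRM-style environment.
SCHEMA = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    subject TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'open'
);
"""

# Hypothetical synthesized rows that form the initial state.
SEED_CUSTOMERS = [(1, "Ada", "ada@example.com"), (2, "Lin", "lin@example.com")]
SEED_TICKETS = [(1, 1, "Login fails", "open"), (2, 2, "Refund request", "closed")]


def build_database() -> sqlite3.Connection:
    """Create the schema and load the seed data into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", SEED_CUSTOMERS)
    conn.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)", SEED_TICKETS)
    conn.commit()
    return conn
```

Every episode can start from a fresh call to `build_database()`, so the agent always begins from the same known state.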

This database-backed approach is central to ensuring consistency and reliability in the simulated environments. A key innovation is code-augmented validation: the research team designed validation code for each task, allowing reliable reward function design and objective evaluation of agent performance.
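
A per-task validator of this kind might look like the sketch below: a function that inspects the final database state and returns a reward. The `tickets` table, the task "close ticket 7", and the binary 0.0/1.0 reward are assumptions made for illustration, not AWM's actual validation code.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory state; the `tickets` table is a hypothetical example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO tickets (id, status) VALUES (7, 'open')")
    conn.commit()
    return conn


def verify_ticket_closed(conn: sqlite3.Connection, ticket_id: int) -> float:
    """Hypothetical validator for the task 'close ticket <ticket_id>'.

    Returns a reward of 1.0 if the final database state shows the ticket as
    closed, and 0.0 otherwise (including when the ticket does not exist).
    """
    row = conn.execute(
        "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
    ).fetchone()
    return 1.0 if row is not None and row[0] == "closed" else 0.0
```

Since the check reads the database rather than the agent's final message, an agent that merely claims to have closed the ticket receives no reward.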

This validation process, combined with executable environments, supports multi-turn reinforcement learning for tool-using agents. The resulting AWM dataset comprises 35,062 tools and 10,000 tasks, making it the largest open-source set of tool-use environments currently available.

Synthetic environment generation supports robust multi-tool agent generalization

Researchers developed Agent World Model (AWM), a pipeline that generates fully synthetic environments, and scaled it to 1,000 diverse environments for training autonomous agents. Each environment includes an average of 35 tools, allowing agents to interact with a rich toolset and receive high-quality observations.

At the core of AWM are code-driven environments backed by databases, which ensure more reliable and consistent state transitions than environments simulated by large language models. The study demonstrates large-scale reinforcement learning with multi-turn tool-using agents, running 1,024 environment instances at each training step.

Accessible database state makes it easier to design the reliable reward functions that effective agent training requires. Experiments across three benchmarks showed that agents trained only within these synthetic environments achieved strong out-of-distribution generalization.

AWM formalizes each synthesized environment as a partially observable Markov decision process (POMDP), consisting of a state space, an action space, an observation space, a transition function, and a task-specific reward function. The pipeline goes through scenario synthesis, task creation, database design, interface synthesis, and validation to arrive at a fully executable environment suitable for online reinforcement learning.
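
One way to write down the five components listed above is shown here. The symbols are assumed for illustration, since the article names the components but not the notation; treating the transition as deterministic and the reward as a function of the final state are likewise assumptions.

```latex
\[
\mathcal{M}_\tau = \bigl(\mathcal{S},\ \mathcal{A},\ \mathcal{O},\ T,\ R_\tau\bigr),
\qquad
T : \mathcal{S} \times \mathcal{A} \to \mathcal{S} \times \mathcal{O},
\qquad
R_\tau : \mathcal{S} \to \mathbb{R}
\]
```

Under this reading, a state in $\mathcal{S}$ is the database contents, an action in $\mathcal{A}$ is a tool call, an observation in $\mathcal{O}$ is the tool's response, $T$ is implemented by the backend code, and $R_\tau$ is the validation code for task $\tau$.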

This approach avoids dependence on existing task sets or API documentation and reduces potential copyright concerns. The generated environments use database-backed state management to enforce consistency and enable code-augmented validation for reinforcement learning. To date, AWM is the largest open-source set of tool-use environments, comprising 1,000 environments, 35,062 tools, and 10,000 tasks paired with corresponding validation code. This extensive resource provides a robust platform for training and evaluating agents in complex, realistic scenarios.
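
The per-environment averages follow directly from these totals, as the short calculation below shows. The variable names are arbitrary; the three totals are the figures reported above.

```python
ENVIRONMENTS = 1_000
TOOLS = 35_062
TASKS = 10_000

# Average number of tools and tasks per environment.
tools_per_env = TOOLS / ENVIRONMENTS
tasks_per_env = TASKS / ENVIRONMENTS

print(f"{tools_per_env:.3f} tools and {tasks_per_env:.0f} tasks per environment")
```

This is where the "average of 35 tools" figure comes from, and it implies an average of 10 tasks per environment.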

Synthetic environments enhance tool use and enable agent generalization

Researchers developed the Agent World Model, a scalable pipeline for creating executable environments used to train tool-using agents. The pipeline generates 1,000 diverse environments containing an average of 35 tools each and 10,000 tasks in total, supporting large-scale reinforcement learning for multi-turn tool-using agents.

These environments are implemented in code and backed by SQL databases, ensuring reliable and consistent state transitions and enabling more efficient agent interactions than real-world environments allow. The significance of this work lies in its ability to improve out-of-distribution generalization in agents.

Experiments across three benchmarks demonstrate that training agents only within these synthetic environments yields strong performance in unseen scenarios, outperforming both training in environments simulated by large language models and concurrent synthesis methods. The authors acknowledge limitations: computing resource constraints restricted training to 526 of the 1,000 generated environments, and the experiments focused on the Qwen3 model family at the 4B, 8B, and 14B scales.

Future research directions include self-evolution paradigms, in which trained agents contribute to environment synthesis, and improving the synthesis pipeline with proactive error detection by large language models, optionally supplemented by human inspection. Although the 1,000 synthetic environments and the scalable pipeline are a valuable resource for the research community, care must be taken when deploying agents trained on synthetic data in real-world applications.


