New Alibaba AI framework skips loading all tools and reduces agent token usage by 99%

As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing subtasks to the appropriate tools and skills. An agent may have access to hundreds of tools and skills, and choosing the right one at each step of a workflow is hard.

To address this challenge, researchers at Alibaba developed SkillWeaver, a framework that creates an execution graph for a given task and selects the appropriate skill for each node. They also introduce skill-aware decomposition (SAD), a new technique that uses feedback loops to let agents repeatedly retrieve and vet relevant tool candidates. This compositional approach and feedback loop distinguish SkillWeaver from other tool routing frameworks that select tools in a one-shot fashion.

SkillWeaver is relevant to real-world AI applications in which agents autonomously coordinate multi-tool ecosystems, such as those built on the Model Context Protocol (MCP), to perform multi-step business operations like downloading datasets, transforming information, and creating visual reports.

In fact, the researchers’ experiments showed that this retrieval and routing approach significantly improved accuracy while reducing token consumption by more than 99% compared to simply exposing the agent to the entire tool library.

The main takeaway for practitioners building AI agents is that the granularity of task decomposition is the biggest bottleneck to accurate tool selection.

Skill routing challenges

Skills are a key pattern in modern LLM agent architectures: modular, reusable tool specifications documented in structured natural language.

As enterprise agents integrate with larger tool ecosystems, accurately routing user queries to the right skill becomes difficult. Exposing an entire library to an LLM so it can find the right tool is highly inefficient: it quickly exceeds context limits and consumes hundreds of thousands of tokens.

Most current tool-routing frameworks attempt to solve this problem with hierarchical structures that treat API retrieval, document matching, or routing strictly as a single-skill selection or step-by-step problem.

However, this single-skill paradigm is insufficient for enterprise environments because real queries are compositional in nature. A standard business request such as “download and transform a dataset to create a visual report” cannot be fulfilled by a single tool. The agent has to decompose the prompt and arrange API clients, data processors, and visualization tools into a coherent multi-step execution plan.

How SkillWeaver and SAD work

To address this, the researchers framed the problem of handling complex tasks that require multiple skills as “compositional skill routing.” Given a complex user prompt and a vast library of tools, the agent must work out how to break the request into a set of atomic subtasks, how to map each subtask to the single best available skill, and how to combine those skills into an executable plan.

SkillWeaver orchestrates this process through three distinct stages: decomposition, retrieval, and composition. In the first stage, an LLM acts as a task decomposer, dividing the user’s complex query into a set of subtasks that each require one skill. Once the subtasks are well defined, the system uses an embedding model to compare each subtask against the skill library and shortlist the top candidate tools for each step.

In the final stage, the planner evaluates the retrieved candidates based on how well they work together. It checks compatibility between skills so that the output of one tool flows naturally into the input of the next. The final execution plan is then created as a directed acyclic graph (DAG) that maps dependencies, allowing independent tasks to potentially execute in parallel.
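To make the DAG idea concrete, here is a minimal Python sketch of how a dependency graph can be turned into batches of steps that could run in parallel. It is not SkillWeaver's planner; the step names (`fetch`, `parse`, `summarize`, `chart`, `report`) are hypothetical, and it relies only on the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps whose output it needs.
plan = {
    "fetch": set(),
    "parse": {"fetch"},
    "summarize": {"parse"},
    "chart": {"parse"},
    "report": {"summarize", "chart"},
}

def parallel_batches(dag):
    """Group steps into batches; steps in the same batch do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(plan)
for i, batch in enumerate(batches, start=1):
    print(f"batch {i}: {batch}")
```

In this toy plan, `summarize` and `chart` land in the same batch because both depend only on `parse`, which is the kind of independence a DAG-shaped plan exposes.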

For example, a user might ask an AI agent to “download a dataset, transform it, and create a visual report.” In the decomposition stage, the decomposer LLM divides this into three different subtasks: downloading the dataset, transforming the data, and creating the report.

During the retrieval stage, the system searches the library and finds candidates such as “api-client” or “http-fetch” for task 1 and “csv-parser” or “etl-pipeline” for task 2. Finally, the composition stage evaluates these options, selects the most compatible combination of “api-client”, “csv-parser”, and “chart-gen”, and ties them together into the final ready-to-run workflow.
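The selection step in this example can be sketched in a few lines of Python. The article does not describe how the planner actually scores compatibility, so everything below is an assumption made for illustration: the `consumes`/`produces` type labels, the similarity scores, and the simplification to a linear chain rather than a general DAG. Only the skill names come from the example above.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Skill:
    name: str
    consumes: str  # hypothetical input type label
    produces: str  # hypothetical output type label
    score: float   # hypothetical retrieval similarity for its subtask

# Candidate lists per subtask, in plan order (types and scores are made up).
candidates = [
    [Skill("api-client", "url", "csv", 0.82), Skill("http-fetch", "url", "bytes", 0.85)],
    [Skill("csv-parser", "csv", "table", 0.80), Skill("etl-pipeline", "parquet", "table", 0.78)],
    [Skill("chart-gen", "table", "png", 0.90)],
]

def compatible(chain):
    """A chain is compatible when each output type matches the next input type."""
    return all(a.produces == b.consumes for a, b in zip(chain, chain[1:]))

def compose(candidates):
    """Pick the compatible combination with the highest total retrieval score."""
    valid = [chain for chain in product(*candidates) if compatible(chain)]
    if not valid:
        return None
    return max(valid, key=lambda chain: sum(s.score for s in chain))

best = compose(candidates)
names = [s.name for s in best]
print(names)
```

Note that `http-fetch` has the higher individual score in this toy data but is rejected because its output does not feed either parser, which is the point of scoring candidates jointly rather than one step at a time.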

The main challenge with this pipeline is that the generic step descriptions the LLM produces often do not match the specific technical terminology of the skills actually available in the library. To fix this, SkillWeaver introduces a new feedback loop: Iterative Skill-Aware Decomposition (SAD). SAD works by having the LLM draft an initial plan, running a preliminary search to find loosely matching skills, and feeding the retrieved skills back to the LLM as hints. This allows the LLM to rewrite its decomposition so that its granularity and vocabulary line up with the real-world tools.
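The draft, retrieve, and rewrite cycle described above can be sketched as a short loop. This is a toy under stated assumptions, not the authors' implementation: `stub_llm` is a hard-coded placeholder for the decomposer LLM call, `retrieve` uses word overlap instead of an embedding search, and the skill descriptions, stopping rule, and `max_rounds` value are invented for illustration.

```python
import re

# Hypothetical skill library: name -> one-line description.
SKILLS = {
    "api-client": "download a dataset from a rest api url",
    "csv-parser": "parse and transform csv data into tables",
    "chart-gen": "create charts and a visual report from tables",
}
STOPWORDS = {"a", "an", "and", "the", "from", "into", "it"}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def retrieve(subtask):
    """Toy retriever: best skill by word overlap (stand-in for embedding search)."""
    query = tokens(subtask)
    return max(sorted(SKILLS), key=lambda name: len(query & tokens(SKILLS[name])))

def stub_llm(query, hints=None):
    """Placeholder for the decomposer LLM call.
    Without hints it over-splits the task; with hints it rewrites the plan in
    the vocabulary of the retrieved skills, one step per distinct skill."""
    if hints is None:
        return ["find the dataset url", "download the dataset",
                "transform the data", "create a visual report"]
    steps = []
    for name in hints:
        if SKILLS[name] not in steps:
            steps.append(SKILLS[name])
    return steps

def skill_aware_decompose(query, max_rounds=3):
    subtasks = stub_llm(query)                         # 1. initial draft
    for _ in range(max_rounds):
        hints = [retrieve(step) for step in subtasks]  # 2. preliminary retrieval
        revised = stub_llm(query, hints)               # 3. feed skills back
        if revised == subtasks:                        # 4. stop once the plan is stable
            break
        subtasks = revised
    return subtasks

plan = skill_aware_decompose("download a dataset, transform it, and create a visual report")
print(plan)
```

In this toy run the four-step draft collapses to three steps, one per skill, after the first round of feedback. With a real model, the rewrite would come from a prompt that includes the retrieved skill descriptions.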

SkillWeaver in action

To evaluate how SkillWeaver performs in a realistic corporate scenario, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of varying difficulty. To reflect real-world environments, they used a library of 2,209 real-world skills from the public MCP ecosystem, covering 24 functional categories such as cloud infrastructure, finance, and databases.

For the core engine, the researchers mainly used a lightweight 7-billion-parameter model (Qwen2.5-7B-Instruct) for task decomposition, combined with a standard semantic search retriever (MiniLM with a FAISS index) for finding tools. SkillWeaver was evaluated against three main baselines: a brute-force “LLM-Direct” approach that crams all tool names into a large model’s prompt, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.

The experiments show that task decomposition is the main bottleneck. When dealing with large tool libraries, standard LLM behavior is insufficient, but the SAD feedback loop changes that dramatically. In the vanilla setting, the 7B model achieved only 51.0% decomposition accuracy (i.e., it predicted the correct number of steps 51.0% of the time). With the SAD feedback loop enabled, accuracy jumped to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). For “difficult” tasks that require four to five different skills, SAD improved accuracy by 50%.
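As described here, decomposition accuracy counts a plan as correct when it has the right number of steps. A minimal sketch of that metric, assuming this step-count reading is the whole definition (the paper may define it more precisely), with made-up predictions and gold counts:

```python
def decomposition_accuracy(predicted_plans, gold_step_counts):
    """Fraction of queries whose predicted plan has the gold number of steps."""
    if len(predicted_plans) != len(gold_step_counts):
        raise ValueError("need one gold step count per predicted plan")
    hits = sum(len(plan) == gold for plan, gold in zip(predicted_plans, gold_step_counts))
    return hits / len(gold_step_counts)

# Hypothetical data: four queries, one of which is under-decomposed.
predicted = [["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d", "e"], ["a"]]
gold = [3, 3, 5, 1]
accuracy = decomposition_accuracy(predicted, gold)
print(f"{accuracy:.1%}")
```

Under this reading, the move from 51.0% to 67.7% for the 7B model is a gain of 16.7 percentage points.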

SkillWeaver results: compared to naive approaches, SkillWeaver reduces token consumption by more than 99% (Source: arXiv)

One interesting finding was that larger models can actually perform worse without guidance. In the vanilla setting, the larger model with 14 billion parameters tended to over-decompose tasks into small, unnecessary steps, leaving its accuracy below that of the 7B model. Once SAD was introduced, the retrieved tool hints anchored the model to reality and improved accuracy. This suggests that aligning an agent with the vocabulary of its tools is often more effective than paying for a large and expensive LLM.

Another important point is the token savings. The LLM-Direct baseline, run with the very large Qwen-Max model, showed that feeding every tool into a large model’s prompt fails. Despite near-perfect task decomposition capabilities, the large model picked the appropriate tool category only 21.1% of the time when faced with that many tool options. SkillWeaver’s targeted retrieval and routing approach significantly outperformed this in accuracy while reducing context window consumption from an estimated 884,000 tokens per query to approximately 1,160 tokens (a 99.9% reduction). For practitioners, this translates directly into significantly lower API costs and faster response times.
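The reduction figure follows directly from the two token counts quoted above, as this short check shows. The variable names are illustrative; the numbers are the ones reported in the article.

```python
baseline_tokens = 884_000    # estimated tokens per query, LLM-Direct baseline
skillweaver_tokens = 1_160   # approximate tokens per query, SkillWeaver

reduction = 1 - skillweaver_tokens / baseline_tokens
ratio = baseline_tokens / skillweaver_tokens

print(f"reduction: {reduction:.1%}")
print(f"ratio: {ratio:.0f}x fewer tokens")
```

Put differently, the brute-force prompt is roughly 760 times larger per query than the retrieval-based one.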

Finally, the traditional ReAct baseline failed completely, achieving 0% decomposition accuracy. Its loop naturally collapses multi-step plans into discrete actions rather than explicitly mapping out a coherent multi-tool sequence.

Considerations for developers

Although the researchers have not yet released the source code for SkillWeaver, their work builds on off-the-shelf tools and should be easy to reproduce.

Skill-Aware Decomposition (SAD), the key innovation at the heart of the framework, comes down to smart prompt engineering and a retrieval loop. The authors share their prompt templates in the paper, and developers can implement the loop themselves using standard orchestration libraries such as LangChain or LlamaIndex, or even raw Python scripts.

For the retrieval component, the authors used the open source embedding model all-MiniLM-L6-v2 to build the core framework. They found that replacing it with a slightly more powerful off-the-shelf encoder (BGE-base-en-v1.5) immediately improved accuracy without any fine-tuning. Off-the-shelf bi-encoders place a relevant tool among the top 10 candidates almost 70% of the time, but they struggle to consistently rank the best tool first, managing that only about 37% of the time. To bridge this gap, teams may need to add a secondary cross-encoder or LLM-based reranker to sort the top 10 candidates.
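The two-stage pattern suggested here, a cheap first pass followed by a more careful reranker over the shortlist, can be sketched as follows. Both scorers are toy stand-ins chosen so the example runs without any model: `overlap` plays the role of the bi-encoder and `bigram_overlap` the role of a cross-encoder or LLM reranker. The skill descriptions and function names are hypothetical.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(query: str, docs: Sequence[str],
                         first_stage: Scorer, second_stage: Scorer,
                         k: int = 10) -> List[str]:
    """Shortlist the top-k docs with a cheap scorer, then reorder them with a better one."""
    shortlist = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k]
    return sorted(shortlist, key=lambda d: second_stage(query, d), reverse=True)

def overlap(query: str, doc: str) -> float:
    """Stage 1 stand-in for a bi-encoder: order-insensitive word overlap."""
    return float(len(set(query.split()) & set(doc.split())))

def bigram_overlap(query: str, doc: str) -> float:
    """Stage 2 stand-in for a cross-encoder: rewards matching word order."""
    def bigrams(text: str):
        words = text.split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))

skills = ["convert json to csv", "convert csv to json", "plot csv as chart", "send email"]
query = "convert csv to json"

first_pass = sorted(skills, key=lambda d: overlap(query, d), reverse=True)
reranked = retrieve_then_rerank(query, skills, overlap, bigram_overlap, k=3)
print("first pass top-1:", first_pass[0])
print("reranked top-1:  ", reranked[0])
```

The first pass gets the right skill into the shortlist but cannot separate it from its mirror image, while the second scorer resolves the tie. That mirrors the gap between top-10 recall and top-1 accuracy described above, and the reranker only ever sees `k` candidates, so its extra cost stays bounded.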

One prerequisite is to vectorize the tool library and build the FAISS index in advance. In practice, this is a negligible hurdle: embedding and indexing all 2,209 skills in the benchmark took just 15 seconds. Once the index is built, retrieving tools from it adds less than 15ms of latency per query. In an enterprise environment, keeping the tool index in sync is a simple background job.

A current limitation of SkillWeaver is the lack of error recovery. Although SkillWeaver successfully plans a compatible DAG for execution, the authors’ pilot studies revealed challenges with the multi-step tool chain. For example, if an API call fails at step 2, the entire chain breaks. The core contribution of the paper is limited to the routing and planning stages. Production deployments will need to build their own error recovery, fallback, and retry mechanisms on top of the composition stage to handle real API timeouts and malformed output.
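As a rough illustration of what such a layer might look like, the sketch below wraps a single plan step with retries, exponential backoff, and a list of fallback skills. It is not part of SkillWeaver; `run_step`, its parameters, and the simulated failures are all hypothetical, and a real deployment would likely catch narrower exception types and validate outputs before passing them downstream.

```python
import time

def run_step(primary, fallbacks=(), retries=2, backoff_s=0.0):
    """Call `primary` up to `retries + 1` times, then try each fallback once."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception as exc:  # broad on purpose for the sketch
            last_error = exc
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff between attempts
    for fallback in fallbacks:
        try:
            return fallback()
        except Exception as exc:
            last_error = exc
    raise RuntimeError("step failed after retries and fallbacks") from last_error

# Simulated skills: a fetch that times out twice, and a parser that always fails.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "dataset.csv"

def broken_parser():
    raise ValueError("simulated malformed output")

fetched = run_step(flaky_fetch, retries=2)
parsed = run_step(broken_parser, fallbacks=[lambda: "parsed-by-fallback"], retries=1)
print(fetched, parsed)
```

One natural source of fallbacks is the candidate list the retrieval stage already produced for each subtask, so a failed skill can be swapped for the next compatible candidate without re-planning the whole chain.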
