AI agents can’t teach themselves new tricks, but humans can teach them • The Register



Teach an AI agent how to fish for information and it can feed itself with data. Tell an AI agent to figure things out on its own and it may make things worse.

An AI agent is a machine learning model (such as Claude Opus 4.6) that accesses other software through a CLI harness (such as Claude Code) and operates in an iterative loop. These agents can be instructed to handle a variety of tasks, some of which may not be covered by the training data.

If your software agents lack the right training for a task, you can give them access to new “skills,” which are essentially added reference material that provides domain-specific capabilities. “Skills” in this context are instructions, metadata, and other resources such as scripts and templates that agents load to obtain procedural knowledge.

For example, you can use a skill consisting of markdown text, code, libraries, and API references to tell an AI agent how to process a PDF. The agent may have some idea how to do this from training data, but more specific guidance should improve performance.
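As a rough illustration of the idea, a skill can be modeled as a small bundle of instructions plus supporting files that is placed ahead of the agent's task. The `Skill` class, its field names, the `build_prompt` helper, and the sample PDF skill below are all hypothetical; they are not taken from any particular agent harness.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical container for one skill: instructions plus supporting files."""
    name: str
    description: str   # short metadata the agent can scan before loading
    instructions: str  # markdown text describing the procedure
    resources: dict[str, str] = field(default_factory=dict)  # filename -> contents


def build_prompt(task: str, skills: list[Skill]) -> str:
    """Place each skill's instructions ahead of the task description."""
    sections = []
    for skill in skills:
        sections.append(
            f"## Skill: {skill.name}\n{skill.description}\n\n{skill.instructions}"
        )
        for filename in skill.resources:
            sections.append(f"(resource available: {filename})")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)


# Made-up example skill for the PDF case mentioned above.
pdf_skill = Skill(
    name="pdf-processing",
    description="How to extract text and tables from PDF files.",
    instructions="1. Open the file.\n2. Extract text page by page.\n3. Validate the output.",
    resources={"extract_tables.py": "# helper script contents would go here"},
)

prompt = build_prompt("Summarize report.pdf", [pdf_skill])
print(prompt)
```

Real harnesses may load skills lazily or by file path rather than by concatenating text, so treat this only as a picture of what a skill contains.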

But asking agents to develop those skills on their own is likely to lead to disappointment, according to a recent study, “SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks.” The “intelligence” part of artificial intelligence is somewhat exaggerated.

At least that is true at large language model (LLM) inference time, when the trained model is being used, as opposed to during the training process.

New benchmark

Certain forms of machine learning, such as deep learning, can be applied in ways that improve the performance of neural network models in domain-specific tasks such as video games.

The explosion of AI agents, such as Anthropic’s Claude Code, Google’s Gemini CLI, and OpenAI’s Codex CLI, has been accompanied by the rapid development of skills that extend what those agents can do. Skill directories are proliferating like weeds. And considering how OpenClaw agents have been teaching each other within the Moltbook automated community network, it seems long past time to find out how well they do in this area.

So far, there has been no common way to check whether these skills have the desired effect. So a team of 40 (!) computer scientists from companies like Amazon, BenchFlow, ByteDance, Foxconn, and Zennity and various universities including Carnegie Mellon, Stanford, UC Berkeley, and Oxford set out to develop a benchmark test to assess how much agent skills improve performance at inference time.

The authors, led by Xiangyi Li, founder of agent measurement startup BenchFlow, developed a test they called SkillsBench and described their results in the preprint paper mentioned above.

The researchers examined seven agent-model setups across 84 tasks, for a total of 7,308 trajectories. A trajectory is one agent’s attempt to solve a single task under a specific skills condition. Three conditions were tested: no skills, curated skills, and self-generated skills.
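To make the unit of measurement concrete, here is a minimal sketch of how trajectories might be recorded and summarized per condition. The `Trajectory` fields, the condition labels, and the sample records are invented for illustration; they are not the paper's data or code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One agent's attempt at one task under one skills condition (hypothetical schema)."""
    agent: str
    task: str
    condition: str
    passed: bool


def pass_rate_by_condition(trajectories):
    """Return the percentage of passing trajectories for each condition."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for t in trajectories:
        totals[t.condition] += 1
        passes[t.condition] += int(t.passed)
    return {cond: 100.0 * passes[cond] / totals[cond] for cond in totals}


# Made-up records: one agent, two tasks, three conditions each.
sample = [
    Trajectory("agent-a", "task-1", "no_skills", False),
    Trajectory("agent-a", "task-1", "curated_skills", True),
    Trajectory("agent-a", "task-1", "self_generated_skills", False),
    Trajectory("agent-a", "task-2", "no_skills", True),
    Trajectory("agent-a", "task-2", "curated_skills", True),
    Trajectory("agent-a", "task-2", "self_generated_skills", False),
]

rates = pass_rate_by_condition(sample)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.1f}% pass rate")
```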

Agents using curated skills designed by people completed tasks 16.2 percent more often on average than agents without skills, though with high variance.

One example given in the study is a flood risk analysis task. Agents without skills achieved a pass rate of only 2.9 percent because they did not apply the appropriate statistical methods. A curated skill that told the agent to use the Pearson Type III probability distribution, apply the appropriate standard USGS methodology, and follow other specifics such as scipy function calls and parameter interpretation raised the agent’s task success rate to 80 percent.
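For readers curious what that kind of guidance looks like in practice, below is a simplified sketch of a log-Pearson Type III flood estimate. It is not the skill from the study and not the full USGS procedure: it fits by the method of moments on log-transformed flows, uses the Wilson-Hilferty approximation for the frequency factor so that it runs on the Python standard library alone, and omits refinements such as outlier handling. The annual peak flows are made-up numbers. With scipy installed, `scipy.stats.pearson3` provides the distribution directly.

```python
import math
from statistics import NormalDist, mean, stdev


def sample_skew(values):
    """Bias-corrected sample skewness coefficient."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    return n * sum((v - m) ** 3 for v in values) / ((n - 1) * (n - 2) * s ** 3)


def frequency_factor(skew, return_period):
    """Approximate Pearson Type III frequency factor (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / return_period)
    if abs(skew) < 1e-9:
        return z  # zero skew reduces to the normal quantile
    k = skew / 6.0
    return (2.0 / skew) * ((1.0 + k * z - k * k) ** 3 - 1.0)


def flood_quantile(annual_peaks, return_period):
    """Estimate the flow exceeded on average once every `return_period` years."""
    logs = [math.log10(q) for q in annual_peaks]
    k_t = frequency_factor(sample_skew(logs), return_period)
    return 10 ** (mean(logs) + k_t * stdev(logs))


# Made-up annual peak flows in cubic meters per second.
ANNUAL_PEAKS = [312.0, 455.0, 289.0, 610.0, 398.0, 520.0,
                275.0, 830.0, 367.0, 442.0, 505.0, 298.0]

estimates = {t: flood_quantile(ANNUAL_PEAKS, t) for t in (2, 10, 100)}
for years, flow in estimates.items():
    print(f"{years}-year flood: {flow:.0f} m^3/s")
```

The point of the example is the one the study makes: none of these steps is exotic, but an agent that is not told which distribution and which transformation to use is unlikely to guess them.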

Broken down by knowledge domain, curated skills for healthcare (+51.9 percentage points) and manufacturing (+41.9 percentage points) helped AI agents the most, while curated skills for mathematics (+6.0 percentage points) and software engineering (+4.5 percentage points) had a smaller effect. The authors explain this by observing that domains requiring specialized expertise tend to be underrepresented in the training data. Therefore, it makes sense for humans to augment agents that tackle tasks in these domains.

And when you do, less is more. Skills with only a few (2-3) modules perform better than large data dumps.

This also applies to model scale. Curated skills help smaller models punch above their weight class when it comes to completing tasks. Anthropic’s Claude Haiku 4.5 model with skills (27.7 percent) outperformed both Haiku 4.5 without skills (11 percent) and Claude Opus 4.5 without skills (22 percent).
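The comparison is easier to follow as percentage-point gaps. The short sketch below uses only the three pass rates quoted above; the dictionary keys are labels chosen for illustration.

```python
# Pass rates in percent, as quoted in the article.
pass_rates = {
    "Haiku 4.5 with skills": 27.7,
    "Haiku 4.5 without skills": 11.0,
    "Opus 4.5 without skills": 22.0,
}

reference = pass_rates["Haiku 4.5 with skills"]

# Gap between the small model with skills and each model without skills.
gaps = {
    label: round(reference - rate, 1)
    for label, rate in pass_rates.items()
    if label != "Haiku 4.5 with skills"
}

for label, gap in gaps.items():
    print(f"Haiku 4.5 with skills leads {label} by {gap} percentage points")
```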

When it came time to have agents teach themselves the skills, study authors instructed agents to:

  1. Analyze task requirements, domain knowledge, and required APIs.
  2. Create one to five modular skill documents to solve the task.
  3. Save each skill as a markdown file.
  4. Use the generated references to solve the task.
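The four steps can be sketched as a small pipeline. The `agent` callable here is a stand-in for a real model call, and the prompts, file names, and `fake_agent` stub are all hypothetical; the study's actual harness is not described in this article.

```python
import tempfile
from pathlib import Path
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, text out


def solve_with_self_generated_skills(
    agent: Agent, task: str, skills_dir: Path, num_skills: int = 3
) -> str:
    if not 1 <= num_skills <= 5:
        raise ValueError("the procedure calls for one to five skill documents")

    # Step 1: analyze the task requirements, domain knowledge, and APIs.
    analysis = agent(f"Analyze requirements, domain knowledge, and APIs for: {task}")

    # Steps 2 and 3: write each modular skill document and save it as markdown.
    skills_dir.mkdir(parents=True, exist_ok=True)
    skill_paths = []
    for i in range(1, num_skills + 1):
        text = agent(f"Write skill document {i} of {num_skills} based on:\n{analysis}")
        path = skills_dir / f"skill_{i}.md"
        path.write_text(text, encoding="utf-8")
        skill_paths.append(path)

    # Step 4: solve the task using the generated references.
    references = "\n\n".join(p.read_text(encoding="utf-8") for p in skill_paths)
    return agent(f"Using these references:\n{references}\n\nSolve: {task}")


def fake_agent(prompt: str) -> str:
    """Stub that stands in for a model so the sketch runs offline."""
    return f"[model output for a {len(prompt)}-character prompt]"


with tempfile.TemporaryDirectory() as tmp:
    skills_path = Path(tmp) / "skills"
    answer = solve_with_self_generated_skills(
        fake_agent, "estimate flood risk", skills_path, num_skills=2
    )
    written = sorted(p.name for p in skills_path.iterdir())

print(written)
print(answer)
```

Note that every document in this loop comes from the same model that then has to solve the task, so no knowledge enters the process that the model did not already have.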

Agents that tried this performed worse than if they had not tried at all.

The authors state that “the effect of self-generated skills is negligible or negative (-1.3 percentage points on average), indicating that effective skills require hand-picked human expertise.”

At least for now, the AI revolution will not be fully automated. Machines still need a human teacher to guide them down the right path. ®


