How to test agents during development

Why testing agents is so difficult

Ensuring that AI agents perform as expected is not easy. Even small adjustments to components such as prompt versions, agent orchestration, and models can have large and unexpected impacts.

Key challenges include:

Non-deterministic output

The fundamental problem is that agents are non-deterministic: the same input can produce two different outputs.

How can you test for expected results if you don't know what the expected results will be? Simply put, testing for tightly defined outputs doesn't work.

Unstructured output

A second, less-discussed challenge of testing agent systems is that the output is often unstructured. At its core, an agent system is a large language model, after all.

It's much easier to define tests for structured data: for example, the id field must not be NULL and must always be an integer. But how do you define quality for a large free-text field?

Cost and scale

LLM-as-judge is the most common methodology for evaluating the quality and trustworthiness of AI agents. However, it is an expensive workload, and each user interaction (trace) can consist of hundreds of steps (spans).

So we rethought our agent testing strategy. In this post, we share what we learned, including key new concepts that have proven critical to ensuring reliability at scale.

Testing the agents

We have two agents in production with over 30,000 users. Our troubleshooting agent sifts through hundreds of signals to identify the root cause of data reliability incidents, while our monitoring agent makes smart recommendations for monitoring data quality.

We test the troubleshooting agent on three main aspects: semantic distance, groundedness, and tool usage. Here's how we test each:

Semantic distance

We use clear, explainable, and cost-effective deterministic tests where appropriate. For example, it's relatively easy to deploy tests that verify a subagent's output is valid JSON, that the output doesn't exceed a certain length, or that your guardrails are being invoked as intended.
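As an illustration, deterministic checks like these can be expressed as plain assertions. The function names and the `guardrail_check` trace event below are hypothetical, not from our codebase:

```python
import json

def is_valid_json(output: str) -> bool:
    """Check that a subagent's raw output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length(output: str, max_chars: int = 4000) -> bool:
    """Check that the output doesn't exceed a length budget."""
    return len(output) <= max_chars

def guardrail_invoked(trace_events: list[str]) -> bool:
    # Assumes the trace logs an event name when a guardrail fires.
    return "guardrail_check" in trace_events

output = '{"root_cause": "schema change", "confidence": 0.9}'
assert is_valid_json(output)
assert within_length(output)
assert guardrail_invoked(["tool_call", "guardrail_check", "final_answer"])
```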

However, sometimes deterministic tests don't get the job done. For example, we considered embedding both the expected and observed outputs as vectors and running a cosine similarity test. We saw this as a cheaper and faster way to assess the semantic distance (are they similar in meaning?) between observed and expected outputs.

However, we found there were too many cases where the words were similar but the meanings were different.
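For context, the cosine approach we tried can be sketched as follows, assuming you already have embedding vectors for both outputs. The vectors below are illustrative, not real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors; real embeddings come from an embedding model.
expected_vec = [0.1, 0.8, 0.3]
observed_vec = [0.12, 0.75, 0.4]
score = cosine_similarity(expected_vec, observed_vec)  # close to 1.0
```

The weakness is exactly what we hit in practice: embeddings of texts that share vocabulary can score high even when the meanings diverge.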

Instead, we ask an LLM to compare the new output against the expected output for the current configuration and score their similarity on a scale of 0 to 1.
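A minimal sketch of that judge call, with `call_llm` standing in for whatever model client you use; the prompt wording and the stubbed response are illustrative:

```python
def judge_similarity(expected: str, observed: str, call_llm) -> float:
    # call_llm is a placeholder for your model client, not a real API.
    prompt = (
        "On a scale of 0 to 1, score how similar in meaning the OBSERVED "
        "output is to the EXPECTED output. Reply with only the number.\n"
        f"EXPECTED: {expected}\nOBSERVED: {observed}"
    )
    return float(call_llm(prompt).strip())

# Stubbed client for illustration:
score = judge_similarity(
    "The table failed because of a schema change",
    "A schema change caused the table failure",
    call_llm=lambda prompt: "0.92",
)
```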

Groundedness

To test groundedness, we verify that important context is present when it should be, but also that the agent refuses to answer when important context is missing or the question is out of scope.

This is important because LLMs are eager to please and will hallucinate when they lack the right context.
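A crude illustration of the refusal side of this check; in practice this is typically an LLM-as-judge evaluation rather than keyword matching, and the marker strings below are hypothetical:

```python
def refuses_when_ungrounded(answer: str) -> bool:
    # Hypothetical refusal markers; a real groundedness check would use an
    # LLM judge rather than string matching.
    refusal_markers = ("not enough context", "cannot answer", "out of scope")
    return any(marker in answer.lower() for marker in refusal_markers)

assert refuses_when_ungrounded("I cannot answer this without the schema history.")
assert not refuses_when_ungrounded("The root cause is a schema change.")
```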

Tool usage

For tool usage, we let an LLM act as judge and evaluate whether the agent performed as expected against predefined scenarios. This means verifying that:

  • No tool was expected, and no tool was called
  • A tool was expected, and an allowed tool was called
  • No required tools were omitted
  • No disallowed tools were used
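The deterministic side of this checklist can be sketched with set comparisons; the tool names are made up, and our actual evaluation uses an LLM judge against predefined scenarios:

```python
def evaluate_tool_usage(called: list[str], required: list[str],
                        allowed: list[str]) -> dict[str, bool]:
    """Compare the tools an agent called against a scenario's expectations."""
    called_s, required_s, allowed_s = set(called), set(required), set(allowed)
    return {
        # No required tools were omitted.
        "no_required_omitted": required_s <= called_s,
        # No disallowed tools were used.
        "no_disallowed_used": called_s <= allowed_s,
        # A tool was called if and only if one was expected.
        "call_matches_expectation": bool(called_s) == bool(required_s),
    }

result = evaluate_tool_usage(
    called=["query_logs", "check_schema"],   # hypothetical tool names
    required=["query_logs"],
    allowed=["query_logs", "check_schema", "list_tables"],
)
assert all(result.values())
```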

The real magic is not in deploying these tests, but in how they are applied. Here's our current setup after some painful trial and error.

Agent testing best practices

It's important to note that not only is the agent non-deterministic, the LLM evaluation is non-deterministic too. These best practices are primarily designed to address that inherent shortcoming.

Soft failures

For obvious reasons, hard thresholds can be noisy in non-deterministic tests. So we introduced the concept of a “soft failure.”

Evaluations return scores between 0 and 1. Below 0.5 is a fail; above 0.8 is a pass. Scores between 0.5 and 0.8 are soft failures.

Changes with soft failures can still be merged. However, once a certain threshold of soft failures is exceeded, they become a hard failure and the merge is blocked.

For our agents, we currently treat it as a hard failure if 33% or more of the tests soft-fail, or if there are more than two soft failures in total. A hard failure prevents the change from being merged.
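A sketch of that gating logic under the thresholds above; the function names are illustrative:

```python
def classify(score: float) -> str:
    """Map a judge score to pass / soft_fail / fail bands."""
    if score < 0.5:
        return "fail"
    if score > 0.8:
        return "pass"
    return "soft_fail"

def gate_merge(scores: list[float]) -> bool:
    """Return True if the change is allowed to merge."""
    results = [classify(s) for s in scores]
    if "fail" in results:
        return False
    soft = results.count("soft_fail")
    # Hard failure if 33% or more of the tests soft-fail,
    # or there are more than two soft failures in total.
    return soft <= 2 and soft / len(results) < 0.33

assert gate_merge([0.9, 0.85, 0.6, 0.95])   # one soft failure in four: merge allowed
assert not gate_merge([0.6, 0.9, 0.9])      # a third of tests soft-failed: blocked
```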

Re-evaluate soft failures

Soft failures can be the canary in the coal mine, but in some cases they're just noise. Approximately 10% of soft failures are the result of judge hallucinations. When a soft failure occurs, the evaluation is automatically rerun; if the rerun passes, we assume the original result was incorrect.
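A minimal sketch of the rerun logic, with `run_eval` standing in for a full evaluation run:

```python
def evaluate_with_retry(run_eval, fail_below: float = 0.5,
                        pass_above: float = 0.8) -> float:
    """run_eval() returns a judge score; rerun once on a soft failure."""
    score = run_eval()
    if fail_below <= score <= pass_above:   # soft-failure band
        retry = run_eval()
        if retry > pass_above:
            return retry  # treat the original soft failure as judge noise
    return score

scores = iter([0.65, 0.9])                  # soft failure, then a clean pass
assert evaluate_with_retry(lambda: next(scores)) == 0.9
```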

Explanations

If a test fails, you need to understand why. We now ask all of our LLM judges to provide not only a score but also an explanation. While imperfect, this helps build confidence in the evaluation and often speeds up debugging.

Remove unstable tests

You need to test your tests. Particularly with LLM-judge evaluations, how the prompt is written can have a significant impact on results. If you run a test multiple times and the delta between results is too large, fix the prompt or remove the unstable test.
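A simple spread check along these lines; the `max_delta` tolerance is an illustrative choice, not our production value:

```python
def is_stable(run_eval, n: int = 5, max_delta: float = 0.2) -> bool:
    """Run the same evaluation n times; flag it unstable if scores spread too far."""
    scores = [run_eval() for _ in range(n)]
    return max(scores) - min(scores) <= max_delta

assert is_stable(lambda: 0.85)   # a constant judge is trivially stable
```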

Monitoring in production

Testing agents is new and difficult, but it's easy compared to monitoring agent behavior and output in production. The input is even messier, there is no expected output to baseline against, and everything runs at a much larger scale.

Needless to say, the stakes are much higher. System reliability issues quickly become business issues.

This is our current focus. We are leveraging agent observability tools to address these challenges and will report new findings in future posts.

The troubleshooting agent is one of the most impactful features we've ever released. Developing trusted agents is a career-defining journey, and we're excited to share it with you.


Michael Segner is a product strategist at Monte Carlo and author of the O'Reilly report, “Enhancing Data + AI Trust with Observability.” This post is co-authored by Elor Arieli and Alik Peltinovich.


