How to test agents during development

Why testing agents is so difficult

Ensuring that AI agents perform as expected is not easy. Even small adjustments to components such as prompt versions, agent orchestration, and models can have large and unexpected impacts.

Key challenges include:

Non-deterministic output

The fundamental problem is that agents are non-deterministic: the same input can produce two different outputs.

How can you test for expected results if you don't know what the expected results will be? Simply put, testing for tightly defined outputs doesn't work.

Unstructured output

A second, less-discussed challenge of testing agent systems is that the output is often unstructured. At its core, an agent system is a large language model, after all.

It's much easier to define tests for structured data: for example, the id field must not be NULL and must always be an integer. But how do you define quality for a large free-text field?

Cost and scale

LLM-as-judge is the most common methodology for evaluating the quality and trustworthiness of AI agents. However, it is an expensive workload, and each user interaction (trace) can consist of hundreds of steps (spans).

So we rethought our agent testing strategy. In this post, we share what we learned, including key new concepts that have proven critical to ensuring reliability at scale.

Testing the agents

We have two agents in production with over 30,000 users. Our troubleshooting agent sifts through hundreds of signals to identify the root cause of data reliability incidents, while our monitoring agent makes smart recommendations for monitoring data quality.

We test the troubleshooting agent on three main aspects: semantic distance, groundedness, and tool usage. Here's how we test each:

Semantic distance

We use clear, explainable, and cost-effective deterministic tests where appropriate. For example, it's relatively easy to deploy tests that verify a subagent's output is valid JSON, that the output doesn't exceed a certain length, or that your guardrails are being invoked as intended.
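As an illustration, deterministic checks like these can be expressed as plain assertions. The function names and the `guardrail_check` trace event below are hypothetical, not from our codebase:

```python
import json

def is_valid_json(output: str) -> bool:
    """Check that a subagent's raw output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length(output: str, max_chars: int = 4000) -> bool:
    """Check that the output doesn't exceed a length budget."""
    return len(output) <= max_chars

def guardrail_invoked(trace_events: list[str]) -> bool:
    # Assumes the trace logs an event name when a guardrail fires.
    return "guardrail_check" in trace_events

output = '{"root_cause": "schema change", "confidence": 0.9}'
assert is_valid_json(output)
assert within_length(output)
assert guardrail_invoked(["tool_call", "guardrail_check", "final_answer"])
```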

However, sometimes deterministic tests don't get the job done. For example, we considered embedding both the expected and observed outputs as vectors and running a cosine similarity test. We saw this as a cheaper and faster way to assess the semantic distance (are they similar in meaning?) between observed and expected outputs.

However, we found there were too many cases where the words were similar but the meanings were different.
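For context, the cosine approach we tried can be sketched as follows, assuming you already have embedding vectors for both outputs. The vectors below are illustrative, not real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors; real embeddings come from an embedding model.
expected_vec = [0.1, 0.8, 0.3]
observed_vec = [0.12, 0.75, 0.4]
score = cosine_similarity(expected_vec, observed_vec)  # close to 1.0
```

The weakness is exactly what we hit in practice: embeddings of texts that share vocabulary can score high even when the meanings diverge.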

Instead, we ask an LLM to compare the new output against the expected output for the current configuration and score their similarity on a scale of 0 to 1.
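A minimal sketch of that judge call, with `call_llm` standing in for whatever model client you use; the prompt wording and the stubbed response are illustrative:

```python
def judge_similarity(expected: str, observed: str, call_llm) -> float:
    # call_llm is a placeholder for your model client, not a real API.
    prompt = (
        "On a scale of 0 to 1, score how similar in meaning the OBSERVED "
        "output is to the EXPECTED output. Reply with only the number.\n"
        f"EXPECTED: {expected}\nOBSERVED: {observed}"
    )
    return float(call_llm(prompt).strip())

# Stubbed client for illustration:
score = judge_similarity(
    "The table failed because of a schema change",
    "A schema change caused the table failure",
    call_llm=lambda prompt: "0.92",
)
```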

Groundedness

To test groundedness, we verify that important context is present when it should be, but also that the agent refuses to answer when important context is missing or the question is out of scope.

This is important because LLMs are eager to please and will hallucinate when they lack the right context.
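A crude illustration of the refusal side of this check; in practice this is typically an LLM-as-judge evaluation rather than keyword matching, and the marker strings below are hypothetical:

```python
def refuses_when_ungrounded(answer: str) -> bool:
    # Hypothetical refusal markers; a real groundedness check would use an
    # LLM judge rather than string matching.
    refusal_markers = ("not enough context", "cannot answer", "out of scope")
    return any(marker in answer.lower() for marker in refusal_markers)

assert refuses_when_ungrounded("I cannot answer this without the schema history.")
assert not refuses_when_ungrounded("The root cause is a schema change.")
```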

Tool usage

For tool usage, we let an LLM act as judge and evaluate whether the agent performed as expected against predefined scenarios. This means verifying that:

  • No tool was expected, and no tool was called
  • A tool was expected, and an allowed tool was called
  • No required tools were omitted
  • No disallowed tools were used
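The deterministic side of this checklist can be sketched with set comparisons; the tool names are made up, and our actual evaluation uses an LLM judge against predefined scenarios:

```python
def evaluate_tool_usage(called: list[str], required: list[str],
                        allowed: list[str]) -> dict[str, bool]:
    """Compare the tools an agent called against a scenario's expectations."""
    called_s, required_s, allowed_s = set(called), set(required), set(allowed)
    return {
        # No required tools were omitted.
        "no_required_omitted": required_s <= called_s,
        # No disallowed tools were used.
        "no_disallowed_used": called_s <= allowed_s,
        # A tool was called if and only if one was expected.
        "call_matches_expectation": bool(called_s) == bool(required_s),
    }

result = evaluate_tool_usage(
    called=["query_logs", "check_schema"],   # hypothetical tool names
    required=["query_logs"],
    allowed=["query_logs", "check_schema", "list_tables"],
)
assert all(result.values())
```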

The real magic is not in deploying these tests, but in how they are applied. Here's our current setup after some painful trial and error.

Agent testing best practices

It's important to note that not only is the agent non-deterministic, the LLM evaluation is non-deterministic too. These best practices are primarily designed to address that inherent shortcoming.

Soft failures

For obvious reasons, hard thresholds can be noisy in non-deterministic tests. So we introduced the concept of a “soft failure.”

Evaluations return scores between 0 and 1. Below 0.5 is a fail; above 0.8 is a pass. Scores between 0.5 and 0.8 are soft failures.

Changes with soft failures can still be merged. However, once a certain threshold of soft failures is exceeded, they become a hard failure and the merge is blocked.

For our agents, we currently treat it as a hard failure if 33% or more of the tests soft-fail, or if there are more than two soft failures in total. A hard failure prevents the change from being merged.
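A sketch of that gating logic under the thresholds above; the function names are illustrative:

```python
def classify(score: float) -> str:
    """Map a judge score to pass / soft_fail / fail bands."""
    if score < 0.5:
        return "fail"
    if score > 0.8:
        return "pass"
    return "soft_fail"

def gate_merge(scores: list[float]) -> bool:
    """Return True if the change is allowed to merge."""
    results = [classify(s) for s in scores]
    if "fail" in results:
        return False
    soft = results.count("soft_fail")
    # Hard failure if 33% or more of the tests soft-fail,
    # or there are more than two soft failures in total.
    return soft <= 2 and soft / len(results) < 0.33

assert gate_merge([0.9, 0.85, 0.6, 0.95])   # one soft failure in four: merge allowed
assert not gate_merge([0.6, 0.9, 0.9])      # a third of tests soft-failed: blocked
```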

Re-evaluate soft failures

Soft failures can be the canary in the coal mine, but in some cases they're just noise. Approximately 10% of soft failures are the result of judge hallucinations. When a soft failure occurs, the evaluation is automatically rerun; if the rerun passes, we assume the original result was incorrect.
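A minimal sketch of the rerun logic, with `run_eval` standing in for a full evaluation run:

```python
def evaluate_with_retry(run_eval, fail_below: float = 0.5,
                        pass_above: float = 0.8) -> float:
    """run_eval() returns a judge score; rerun once on a soft failure."""
    score = run_eval()
    if fail_below <= score <= pass_above:   # soft-failure band
        retry = run_eval()
        if retry > pass_above:
            return retry  # treat the original soft failure as judge noise
    return score

scores = iter([0.65, 0.9])                  # soft failure, then a clean pass
assert evaluate_with_retry(lambda: next(scores)) == 0.9
```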

Explanations

If a test fails, you need to understand why. We now ask all of our LLM judges to provide not only a score but also an explanation. While imperfect, this helps build confidence in the evaluation and often speeds up debugging.

Remove unstable tests

You need to test your tests. Particularly with LLM-judge evaluations, how the prompt is written can have a significant impact on results. If you run a test multiple times and the delta between results is too large, fix the prompt or remove the unstable test.
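A simple spread check along these lines; the `max_delta` tolerance is an illustrative choice, not our production value:

```python
def is_stable(run_eval, n: int = 5, max_delta: float = 0.2) -> bool:
    """Run the same evaluation n times; flag it unstable if scores spread too far."""
    scores = [run_eval() for _ in range(n)]
    return max(scores) - min(scores) <= max_delta

assert is_stable(lambda: 0.85)   # a constant judge is trivially stable
```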

Monitoring in production

Testing agents is new and difficult, but it's easy compared to monitoring agent behavior and output in production. The input is even messier, there is no expected output to baseline against, and everything runs at a much larger scale.

Needless to say, the stakes are much higher. System reliability issues quickly become business issues.

This is our current focus. We are leveraging agent observability tools to address these challenges and will report new findings in future posts.

The troubleshooting agent is one of the most impactful features we've ever released. Developing trusted agents is a career-defining journey, and we're excited to share it with you.


Michael Segner is a product strategist at Monte Carlo and author of the O'Reilly report, “Enhancing Data + AI Trust with Observability.” This post is co-authored by Elor Arieli and Alik Peltinovich.


