Generate recommendations from production traces, validate with batch evaluation and A/B testing, and ship with confidence.
Just because an AI agent performs well at launch doesn’t mean it will stay that way. Models evolve, user behavior changes, and prompts get reused in contexts they were never designed for. Agent quality degrades silently. For most teams, the improvement process still looks the same: without an automated feedback loop, a user complains, a developer reads the trace, forms a hypothesis, rewrites the prompt, tests a few cases, and ships a fix. The cycle then repeats, often creating new problems for different users. Until today, Amazon Bedrock AgentCore provided the building blocks for manual debugging and custom implementations: check evaluation scores to detect quality degradation, dig into traces to identify root causes, and update agents with improved settings. Developers end up being the optimization engine, relying on intuition rather than systematic, data-backed evidence. Dedicated science teams and large, purpose-built benchmarks help, but for most product teams they are neither a practical nor a timely solution. And even where such teams exist, they tend to run on weekly or monthly cycles, while agents change in production every day.
AgentCore is a platform for building, connecting, and operating agents at scale, with security built into the infrastructure layer. Thousands of developers already use AgentCore to build agents that reason, plan, and act across complex workflows. Today, we’re announcing a new capability in AgentCore that closes the loop of observing, evaluating, and improving agent performance and quality: recommendations, plus two ways to validate them.
Recommendations analyze production traces and evaluation output to optimize system prompts or tool descriptions against the evaluator you specify. Batch evaluation tests recommendations against predefined test datasets and reports aggregate scores, so you catch regressions in the cases you already know matter. If manually authored scenarios aren’t enough, you can use LLM-powered actors that play the role of end users to simulate datasets. A/B testing runs controlled comparisons between agent versions through AgentCore Gateway, splitting live production traffic by a percentage you set and reporting results with confidence intervals and statistical significance. Together, recommendations suggest the changes and batch evaluation and A/B testing validate them, replacing the manual cycle of reading traces, inferring fixes, and deploying blind.
“Continuous evaluation and improvement of agents is essential to data-driven value creation. With AgentCore, processes that previously required weeks of manual prompt adjustments have become fast, repeatable cycles. By deriving improvement recommendations from operational trace data and validating their impact through A/B testing, organizations can optimize performance while ensuring accuracy and effectiveness. This approach enables continuous, highly efficient improvement at scale.” Yoshiharu Okuda, Head of Generative AI Business Strategy, NTT Data.
How the loop actually executes
Here is how the loop executes in a model upgrade scenario. This pattern is the same for any change, whether it’s a prompt refactoring, a tool set update, or a framework upgrade.
AgentCore’s end-to-end traceability captures every model call, tool call, and reasoning step as OpenTelemetry-compatible traces managed by AgentCore Observability. Evaluation automatically scores these traces across dimensions such as goal success rate, tool selection accuracy, helpfulness, and safety using built-in evaluators, ground-truth comparison, or custom LLM-as-judge scoring.
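To make the trace side of this concrete, here is a minimal sketch of an agent loop emitting OpenTelemetry spans for a tool call and a model call. The span names, attributes, and stub functions are illustrative, not the exact schema AgentCore Observability uses; in practice the AgentCore SDK and framework instrumentation emit these spans for you.

```python
# Minimal sketch: emitting OpenTelemetry spans from an agent loop.
# Span names and attributes are illustrative, not the AgentCore schema;
# normally the SDK / framework instrumentation does this for you.
from opentelemetry import trace

tracer = trace.get_tracer("market-trends-agent")

def get_stock_data(query: str) -> dict:
    return {"ticker": "EXAMPLE", "price": 123.45}   # stub standing in for a real tool

def call_model(query: str, data: dict) -> str:
    return f"Based on {data['ticker']}, here is an analysis of '{query}'."  # stub model call

def answer(query: str) -> str:
    with tracer.start_as_current_span("agent.invocation") as span:
        span.set_attribute("agent.query", query)

        with tracer.start_as_current_span("tool.get_stock_data") as tool_span:
            tool_span.set_attribute("tool.name", "get_stock_data")
            data = get_stock_data(query)

        with tracer.start_as_current_span("model.invoke") as model_span:
            model_span.set_attribute("model.id", "model-id-goes-here")
            reply = call_model(query, data)

        return reply
```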
Generate recommendations. Point the Recommendations API at the CloudWatch log group where the agent writes its traces. Choose the reward signal to optimize for, either one of AgentCore’s built-in evaluators or a custom evaluator you created, and choose what to optimize (the system prompt or tool descriptions). AgentCore analyzes the traces in light of that reward signal and generates recommendations aimed at improving it. Tool description recommendations only refine the description, never the tool implementation. You decide which of the service’s suggestions to carry forward into validation.
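A sketch of what requesting recommendations could look like from code. The client name, operation names, and field names below are assumptions made for illustration, not the published AgentCore API; consult the documentation for the real request shape.

```python
# Illustrative only: client, operation, and field names here are assumptions,
# not the published AgentCore API. Check the documentation for the real shapes.
import boto3

client = boto3.client("bedrock-agentcore-control")   # assumed client name

# Hypothetical call: point the service at the trace log group, pick the reward
# signal (a built-in or custom evaluator), and choose what to optimize.
job = client.start_recommendation_job(
    logGroupName="/aws/agentcore/market-trends-agent/traces",
    evaluatorId="builtin.goal_success_rate",
    optimizationTarget="SYSTEM_PROMPT",               # or "TOOL_DESCRIPTION"
)

# Hypothetical call: fetch the generated recommendations for review.
result = client.get_recommendation_job(jobId=job["jobId"])
for rec in result.get("recommendations", []):
    print(rec["rationale"])
    print(rec["suggestedSystemPrompt"])
```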
Package your changes as a configuration bundle. A bundle is an immutable, versioned snapshot of the agent’s configuration (model ID, system prompt, tool descriptions) keyed to the runtime ARN. The agent reads the active configuration dynamically through the AgentCore SDK at runtime, so swapping a prompt or model is a configuration change rather than a code change. Create one bundle for your current configuration and another for the recommended one. Bundling is optional; for changes that involve code, deploy to a separate runtime endpoint instead.
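Conceptually, the runtime lookup works like the sketch below. The helper names and configuration fields are hypothetical stand-ins for the AgentCore SDK; the point is that the agent resolves its active bundle per request instead of hard-coding a prompt or model ID.

```python
# Sketch of per-request configuration lookup. get_active_configuration() and
# invoke_model() are hypothetical stand-ins; prompts, model IDs, and tool
# descriptions come from whichever bundle version is currently active.
RUNTIME_ARN = "arn:aws:bedrock-agentcore:...:runtime/market-trends-agent"

def get_active_configuration(runtime_arn: str) -> dict:
    # Stand-in: the real SDK would resolve the bundle version marked active.
    return {
        "modelId": "model-id-goes-here",
        "systemPrompt": "You are a market intelligence assistant for brokers...",
        "toolDescriptions": {"get_stock_data": "Fetch real-time quotes for a ticker."},
    }

def invoke_model(model_id: str, system_prompt: str, tools: dict, user_input: str) -> str:
    return f"[{model_id}] {user_input}"   # stub standing in for the actual model call

def handle_request(payload: dict) -> str:
    config = get_active_configuration(RUNTIME_ARN)
    return invoke_model(
        model_id=config["modelId"],
        system_prompt=config["systemPrompt"],
        tools=config["toolDescriptions"],
        user_input=payload["prompt"],
    )
```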
Validate offline: batch evaluation. Run the agent against a curated dataset using the new bundle, evaluate the resulting sessions as a batch, and compare the aggregate scores to your baseline. This catches regressions in the use cases you have already defined as important. Teams typically wire batch evaluation into a CI/CD pipeline so that no configuration change reaches production without passing the known good cases.
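As a sketch of such a CI gate, the check below fails the build if the candidate bundle regresses on any evaluator beyond a small tolerance. The run_batch_evaluation() helper and the score names are assumptions standing in for the actual batch evaluation call.

```python
# CI gate sketch: block promotion if the candidate bundle regresses on any
# evaluator relative to the recorded baseline. run_batch_evaluation() is a
# hypothetical stand-in for the actual batch evaluation call.
import sys

REGRESSION_TOLERANCE = 0.02   # tolerate up to 0.02 of score noise per evaluator

def run_batch_evaluation(bundle_version: str, dataset: str) -> dict:
    # Stand-in: would run the agent on the dataset and return aggregate scores.
    return {"goal_success_rate": 0.84, "tool_selection_accuracy": 0.90}

baseline = {"goal_success_rate": 0.81, "tool_selection_accuracy": 0.88}
candidate = run_batch_evaluation(bundle_version="v2", dataset="broker-conversations")

regressions = [
    name for name, base in baseline.items()
    if candidate.get(name, 0.0) < base - REGRESSION_TOLERANCE
]

if regressions:
    print("Regression detected on:", ", ".join(regressions))
    sys.exit(1)

print("No regressions; candidate bundle can proceed to A/B testing.")
```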
Validate against live traffic: A/B testing. Configure AgentCore Gateway to split live production traffic between two variants, using the current version as the control and the candidate as the treatment. Variants can be different bundle versions on the same runtime for configuration-only changes, or different gateway targets pointing at different runtime endpoints for changes that include code. Online evaluation scores each session with the evaluator you designate, and A/B test results include confidence intervals and p-values. Once you have enough data to be confident in the new version’s performance, stop the test, set the new variant as the default, and promote it. To roll back, pause the test and the agent reverts to its existing configuration.
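To illustrate what the statistics behind an A/B result mean, the sketch below compares per-session evaluator scores for a control and a treatment variant with a Welch t-test and an approximate 95% confidence interval on the lift. The numbers are made up; the A/B test report provides these figures for you.

```python
# Plain two-sample comparison illustrating the statistics behind an A/B result.
# Scores are made-up per-session evaluator scores for control and treatment.
from math import sqrt
from statistics import mean, stdev
from scipy import stats

control = [0.72, 0.80, 0.68, 0.75, 0.79, 0.71, 0.77, 0.74]
treatment = [0.81, 0.85, 0.78, 0.83, 0.88, 0.80, 0.84, 0.82]

# Welch's t-test: does the treatment mean differ from the control mean?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Approximate 95% confidence interval on the mean lift (normal approximation).
lift = mean(treatment) - mean(control)
se = sqrt(stdev(treatment) ** 2 / len(treatment) + stdev(control) ** 2 / len(control))
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

print(f"mean lift {lift:+.3f}, 95% CI [{ci_low:+.3f}, {ci_high:+.3f}], p = {p_value:.4f}")
if p_value < 0.05 and ci_low > 0:
    print("Treatment wins with statistical significance; promote it as the default.")
```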
“Manual prompt iteration used to take weeks, but with AgentCore we now have a repeatable cycle that generates recommendations from operational traces, validates them against live traffic with statistical significance, and deploys the best configuration. Each cycle produces the baseline data for the next, so improvements compound over time.” — Masashi Shimizu, Managing Director, Nomura Research Institute, Ltd.
Where we are heading
Today’s preview is developer-driven: you choose when to generate recommendations, which evaluators to target, and whether to promote the results. Our vision is a flywheel where traces feed evaluations, evaluations surface drift, recommendations turn that signal into concrete changes, and A/B testing proves the changes work. The winning configuration becomes the new baseline, and the traces it generates become the input for the next cycle. Over time, the flywheel needs less and less of a push. Recommendations will be weighed across multiple evaluators, with trade-offs surfaced as evidence. Optimization will also extend to skills, suggesting new ones or improving existing ones based on production usage. Trace analysis will cluster production failures into patterns so you can address them before they escalate. Monitoring alarms will automatically trigger recommendation and validation runs when evaluator scores fall below a threshold and place the results in a review queue. Once you decide to ship, the system does the heavy lifting to get you there.
See it in action
The Market Trends Agent sample on GitHub is a market intelligence agent built for investment brokers, covering real-time stock data, sector analysis, news search, and personalized broker profiles. For an agent serving brokers with different risk profiles, sector interests, and conversational styles, quality regressions are hard to spot and hard to remediate without the right tools.
The sample walks through the complete improvement loop. Generate recommendations that highlight where the agent fails to tailor its advice to a broker’s defined strategy, or chooses the wrong tools when queries span multiple sectors. Package the changes as a configuration bundle version. Validate the fixes through batch evaluation across a curated set of broker conversations. Then A/B test the configuration against real broker sessions with statistical confidence before promoting it to production.
Let’s get started
These features are currently available in preview through Amazon Bedrock AgentCore in AWS Regions where AgentCore evaluation is available. During preview, AgentCore Optimization targets system prompts and tool descriptions for agents that are deployed to AgentCore Runtime and use AgentCore Observability and evaluation.
Start from the AgentCore console or CLI. Read the documentation and follow the step-by-step tutorial here.
About the author
