The new approach closes the feedback loop between agent observability and agent performance, enabling continuous improvement without the need for prompt engineering.
StarlightSearch, a startup building infrastructure for self-improving AI agents, today announced the launch of Reflect, a utility-ranked memory layer that ranks guidance by real-world outcomes, not just semantic similarity.
“Agents in production need utility, outcome-based scores that are learned from results, rather than similarity scores.”
— Sonam Pankaj
This announcement addresses a persistent gap in production AI systems. Most organizations today run robust observability stacks that capture agent traces and evaluation frameworks that measure pass/fail rates, but the two are largely disconnected. The agent starts each task from a blank slate, unable to learn from its previous mistakes.
“Every AI team we talk to has the same frustration,” says Sonam Pankaj, founder of StarlightSearch. “They can see exactly where their agents are failing. Dashboards are full of traces. But turning those failures into better behavior requires manual intervention. We built Reflect to automate that learning loop.”
How Reflect works: The utility dimension
Traditional memory systems for large language models rely on semantic similarity, retrieving content that seems relevant to the current query. Reflect adds a second dimension: utility, a score that tracks whether following a particular piece of guidance actually led to success.
The system uses a weighted scoring formula that balances semantic relevance against demonstrated utility. Memories that have been retrieved repeatedly and consistently contributed to successful outcomes rank higher than memories that merely sound similar to the current task.
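A minimal sketch of what such a blended ranking could look like appears below. The linear formula, the `alpha` weight, and the field names are illustrative assumptions; StarlightSearch has not published Reflect’s actual scoring rule.

```python
# Illustrative sketch only: Reflect's real formula is not public.
# Assumes each memory carries a precomputed semantic similarity to the
# current query plus a utility score learned from past outcomes.

def rank_memories(memories, alpha=0.6):
    """Rank memories by a blend of semantic relevance and proven utility.

    `alpha` (hypothetical) sets how much weight similarity gets
    relative to the outcome-based utility score.
    """
    def score(m):
        return alpha * m["similarity"] + (1 - alpha) * m["utility"]

    return sorted(memories, key=score, reverse=True)

memories = [
    {"text": "Immediately issue a refund.", "similarity": 0.92, "utility": 0.20},
    {"text": "Check payment status before refunding.", "similarity": 0.85, "utility": 0.95},
]

for m in rank_memories(memories):
    print(m["text"])
# The lower-similarity but higher-utility memory ranks first.
```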
“Think of it like a credit score,” Pankaj explained. “A credit score doesn’t just record that you were given credit; it tracks whether you repaid it. Similarly, utility tracks whether the memory actually helped the agent succeed.”
From facts to inferences
Unlike traditional memory layers that store static facts (user preferences, document chunks, conversation history), Reflect stores inferences about outcomes. When a trace is reviewed and marked as passing or failing, Reflect generates a reflection: a condensed lesson about what went wrong and what to do differently next time.
For example, a customer support agent handling refund requests might initially retrieve the advice to “immediately issue a refund” for a duplicate-charge complaint. If the agent skips checking the payment status and issues a double refund, that memory’s utility drops. At the same time, Reflect stores a new reflection: “For duplicate-billing complaints, check the payment status before initiating a refund.” On subsequent similar tickets, the newer reflection ranks higher and the older advice is deprioritized, with no prompt engineering and no human in the loop.
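One plausible way to implement that outcome feedback is an exponential moving average, sketched below. The update rule, the `learning_rate` value, and the field names are assumptions for illustration, not Reflect’s documented behavior.

```python
# Hypothetical sketch of a utility update; Reflect's real rule is not public.
# Each time a retrieved memory is used, the review outcome (pass/fail)
# nudges its utility score toward 1.0 or 0.0.

def update_utility(memory, passed, learning_rate=0.3):
    """Move utility toward the observed outcome via an exponential moving average."""
    outcome = 1.0 if passed else 0.0
    memory["utility"] += learning_rate * (outcome - memory["utility"])
    return memory

advice = {"text": "Immediately issue a refund.", "utility": 0.8}
update_utility(advice, passed=False)  # double refund: utility drops to 0.56
update_utility(advice, passed=False)  # another failure: utility drops to ~0.39
print(round(advice["utility"], 2))
```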
Three-layer integration
Reflect sits between three existing components of a production agent architecture:
– Observability: Tracing captures all tool calls, LLM completions, and exceptions.
– Evaluation: Evaluations mark results as pass or fail, or attach detailed feedback.
– Action: The agent retrieves relevant memories before executing the task.
The company’s approach treats traces as training signals rather than passive audit logs. When a review marks a trace as failed, the system extracts a reflection and stores it as a memory linked to the task. As similar tasks arrive and succeed or fail, the utility scores update accordingly.
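Put together, the loop might look like the following sketch, which reuses `rank_memories` and `update_utility` from the earlier examples. Every name here, including `extract_reflection`, is a hypothetical stand-in, not Reflect’s actual API.

```python
# Hypothetical end-to-end loop wiring observability, evaluation, and retrieval.

def extract_reflection(trace: dict) -> str:
    """Stand-in for condensing a failed trace into a lesson.
    In practice this would likely be an LLM call over the trace."""
    return f"For tasks like '{trace['task']}': {trace['failure_note']}"

memory_store: list[dict] = []

def on_review(trace: dict, passed: bool) -> None:
    """Evaluation layer: turn a reviewed trace into a training signal."""
    # Outcome feedback flows back to every memory the agent consulted.
    for mem in trace["memories_used"]:
        update_utility(mem, passed)  # EMA update from the earlier sketch
    # Failures additionally mint a new reflection, linked to the task.
    if not passed:
        memory_store.append({
            "text": extract_reflection(trace),
            "similarity": 0.0,
            "utility": 0.5,  # neutral prior until outcomes accumulate
        })

def before_execution(similarities: dict[str, float]) -> list[dict]:
    """Retrieval layer: surface the top memories by the blended score."""
    for mem in memory_store:
        mem["similarity"] = similarities.get(mem["text"], 0.0)
    return rank_memories(memory_store)[:3]
```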
“What allows teams to put this into production quickly is that retrieval reacts to results,” Pankaj says. “You’re not searching by keywords; you’re searching by semantic similarity, weighted by whether those memories historically helped or hurt you. Your evaluation results become first-class signals in retrieval rankings.”
Market context
The release comes as more companies deploy AI agents for customer support, code review, and task automation: use cases where consistent, reliable behavior matters more than one-off peak performance.
Existing memory frameworks have focused primarily on user continuity: personalization, preferences, and conversation history. Academic work such as Reflexion has demonstrated that agents can learn from verbal self-reflection, reaching a 91% pass rate on coding benchmarks. However, these approaches do not use learned utility signals to rank which experiences surface.
Reflect’s differentiation lies in its quantitative ranking layer: semantic memory retrieves what is likely to be relevant, while Reflect retrieves what has proven trustworthy across repeated reviews.