Asynchronous verified semantic caching for tiered LLM architectures

Machine Learning


Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, which makes semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated responses, mined from logs and vetted offline, backed by a dynamic cache populated online. In practice, both tiers are usually governed by a single embedding-similarity threshold, which forces a hard trade-off: conservative thresholds miss opportunities for safe reuse, while aggressive thresholds risk serving semantically incorrect responses. Krites is an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When a prompt's nearest static neighbor falls just below the static threshold, Krites asynchronously invokes an LLM judge to check whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, so future repeats and paraphrases can reuse the curated static answer, and static reach grows over time. In trace-driven simulations of conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9x relative to a tuned baseline, across conversational traffic and search-style queries, while leaving critical-path latency unchanged.
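The policy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `KritesSketch`, the `grey_margin` parameter (how far below the static threshold a neighbor may fall and still be sent to the judge), the in-memory `pending` queue, and the `judge` callable are all hypothetical, and embeddings are assumed to be supplied by the caller.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Entry = Tuple[Vector, str]  # (prompt embedding, cached answer)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(entries: List[Entry], emb: Vector) -> Tuple[float, Optional[str]]:
    """Brute-force nearest neighbor: (best similarity, its answer)."""
    best_sim, best_answer = -1.0, None
    for cached_emb, answer in entries:
        sim = cosine(cached_emb, emb)
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_sim, best_answer


@dataclass
class KritesSketch:
    static_entries: List[Entry]            # curated, vetted offline
    judge: Callable[[str, str], bool]      # hypothetical LLM judge: (prompt, answer) -> ok?
    tau_static: float = 0.95               # assumed static threshold
    tau_dynamic: float = 0.95              # assumed dynamic threshold
    grey_margin: float = 0.15              # assumed width of the "just below" band
    dynamic_entries: List[Entry] = field(default_factory=list)
    pending: Deque[Tuple[str, Vector, str]] = field(default_factory=deque)

    def serve(self, prompt: str, emb: Vector) -> Tuple[Optional[str], str]:
        """Critical path: plain threshold lookups, no judge call."""
        static_sim, static_answer = nearest(self.static_entries, emb)
        if static_sim >= self.tau_static:
            return static_answer, "static"
        dyn_sim, dyn_answer = nearest(self.dynamic_entries, emb)
        if dyn_sim >= self.tau_dynamic:
            return dyn_answer, "dynamic"
        # Near miss on the static tier: queue for off-path verification.
        # (Queuing only on a full miss is a simplification of this sketch.)
        if static_answer is not None and static_sim >= self.tau_static - self.grey_margin:
            self.pending.append((prompt, emb, static_answer))
        return None, "miss"  # caller falls back to the backend LLM

    def run_judge(self) -> int:
        """Background step: promote approved static answers to the dynamic tier."""
        promoted = 0
        while self.pending:
            prompt, emb, answer = self.pending.popleft()
            if self.judge(prompt, answer):
                self.dynamic_entries.append((emb, answer))
                promoted += 1
        return promoted
```

In this sketch `serve` never calls the judge, so its latency matches a plain threshold policy; a real deployment would presumably run `run_judge` on a worker and replace the brute-force `nearest` with a vector index.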


