Dnotitia presents STAR-KV, which achieves up to 20x KV cache compression, selected as ICML 2026 spotlight paper

  • Introduces a low-rank approach to compressing the KV cache, one of the key bottlenecks in long-context AI
  • Speeds up attention computation by up to 6.9x and overall generation throughput by up to 3.1x, delivering faster inference in addition to memory savings
  • Selected as a spotlight paper at ICML 2026; spotlight papers account for approximately 2.2% of reviewed submissions and approximately 8.4% of accepted papers
  • Following the spotlight on Google’s TurboQuant at ICLR 2026, STAR-KV presents another approach to advancing KV cache compression
  • The paper is available on arXiv, and the source code is published on GitHub

Seoul, South Korea, July 1, 2026 /PRNewswire/ — Dnotitia Inc. (Dnotitia), which specializes in long-term memory AI and semiconductor-based AI infrastructure technologies, has published the paper and source code for “STAR-KV: Low-Rank KV Cache Compression with Soft Threshold for Adaptive Rank Control.” The technology was developed through a collaborative research effort involving researchers from UC San Diego’s VVIP Lab and Dnotitia, and the paper was selected as a spotlight paper at ICML 2026 (International Conference on Machine Learning 2026), one of the world’s leading conferences on machine learning.

In experiments reported in the paper, low-rank compression alone reduced KV cache size by up to 75%. Combined with the mixed-precision quantization technique proposed in the paper, STAR-KV compressed the full KV cache by up to 20x. Custom GPU kernels also make computation faster, increasing attention speed by up to 6.9x and overall generation throughput by up to 3.1x. STAR-KV also showed higher accuracy than existing major KV cache compression methods.
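To make the idea of low-rank compression concrete, the following is a minimal sketch that uses a plain truncated SVD on synthetic data. It is not the STAR-KV algorithm: the release does not describe the soft-threshold rank control or the quantization scheme in detail, so the function name `low_rank_compress`, the matrix sizes, and the fixed rank are all illustrative assumptions. The closing arithmetic further assumes that the two "up to" figures multiply and that the baseline is 16-bit, neither of which the release states.

```python
import numpy as np

def low_rank_compress(mat, rank):
    """Return factors (a, b) such that a @ b approximates mat at the given rank."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # per-token coefficients, shape (tokens, rank)
    b = vt[:rank, :]             # shared basis, shape (rank, dim)
    return a, b

rng = np.random.default_rng(0)
tokens, dim, true_rank = 1024, 128, 16

# Synthetic, approximately low-rank "key" matrix (illustrative data only).
keys = rng.standard_normal((tokens, true_rank)) @ rng.standard_normal((true_rank, dim))
keys += 0.01 * rng.standard_normal((tokens, dim))

rank = 32  # hypothetical fixed rank: store 32 values per token instead of 128
a, b = low_rank_compress(keys, rank)

rel_err = np.linalg.norm(keys - a @ b) / np.linalg.norm(keys)
per_token_saving = 1 - rank / dim            # 0.75, i.e. a 75% reduction
print(f"relative error: {rel_err:.4f}, per-token saving: {per_token_saving:.0%}")

# Back-of-the-envelope view of the headline numbers, under the stated assumptions.
low_rank_factor = 1 / (1 - 0.75)             # 75% smaller is 4x compression
quant_factor = 20 / low_rank_factor          # extra factor needed to reach 20x
avg_bits = 16 / quant_factor                 # average bits per value, 16-bit baseline
print(f"low-rank: {low_rank_factor:.0f}x, quantization: {quant_factor:.0f}x, "
      f"about {avg_bits:.1f} bits per value")
```

In this toy setting the cached per-token data shrinks by 75% because only the rank-32 coefficients are stored per token, while the small basis matrix is shared across all tokens.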

KV cache compression has become a key technical challenge in AI infrastructure. As research to alleviate memory bottlenecks in long-context AI gains momentum, including the attention given to Google’s TurboQuant at ICLR 2026, STAR-KV presents a new approach that combines low-rank compression with quantization and GPU execution optimization.

The KV cache is temporary data kept in GPU memory so that large language models (LLMs) do not have to recompute context that has already been processed. As AI evolves into agent systems that use multiple documents, conversation histories, code, search results, and outputs from external tools, the amount of context that models must process is rapidly increasing. In this environment, the KV cache has emerged as a major bottleneck that affects both GPU memory usage and inference cost.
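As a rough illustration of why the cache avoids recomputation, the toy single-head attention loop below stores each token's key and value once and reuses them at every later step. The weights, dimensions, and the helper name `attend` are hypothetical and are not taken from the paper.

```python
import numpy as np

def attend(q, k, v):
    """Single-head scaled dot-product attention for one query vector."""
    scores = k @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

rng = np.random.default_rng(0)
d = 8
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
xs = rng.standard_normal((5, d))  # hidden states of 5 tokens (synthetic)

# With a KV cache: each step projects only the newest token.
k_cache, v_cache, cached_out = [], [], []
for x in xs:
    k_cache.append(x @ wk)
    v_cache.append(x @ wv)
    cached_out.append(attend(x @ wq, np.array(k_cache), np.array(v_cache)))

# Without a cache: every step re-projects the entire prefix.
full_out = []
for t in range(len(xs)):
    prefix = xs[: t + 1]
    full_out.append(attend(xs[t] @ wq, prefix @ wk, prefix @ wv))

same = np.allclose(cached_out, full_out)
print("cached and recomputed outputs match:", same)
```

Both loops produce the same outputs. The cached version trades memory for compute, which is why the cache grows with context length and becomes the bottleneck described above.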

According to the STAR-KV paper, when the LLaMA-3.1-8B model runs with a 128K-token context at batch size 4, the KV cache occupies approximately 81% of total GPU memory. As long-context AI becomes more widely used, KV cache compression is gaining increasing attention as a core AI infrastructure technology for processing long contexts at low cost.
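A back-of-the-envelope calculation suggests where a figure of this size can come from. The sketch below uses the publicly documented LLaMA-3.1-8B shape (32 layers, 8 key-value heads, head dimension 128) and assumes 16-bit storage, 128K meaning 131,072 tokens, and GPU memory consisting only of model weights plus KV cache. These assumptions are ours; the paper's own accounting may differ.

```python
# Publicly documented LLaMA-3.1-8B shape.
layers, kv_heads, head_dim = 32, 8, 128
params = 8.03e9                 # approximate parameter count

# Assumptions, not stated in the release:
bytes_per_value = 2             # 16-bit weights and cache entries
context_tokens = 128 * 1024     # "128K" read as 131,072 tokens
batch_size = 4

# Keys and values (factor 2) for every layer and KV head, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
kv_cache_bytes = kv_bytes_per_token * context_tokens * batch_size
weight_bytes = params * bytes_per_value

# Share of memory taken by the cache, ignoring activations and other overhead.
kv_share = kv_cache_bytes / (kv_cache_bytes + weight_bytes)

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache total:     {kv_cache_bytes / 2**30:.0f} GiB")
print(f"KV cache share:     {kv_share:.0%}")
```

Under these assumptions the cache comes to 64 GiB against roughly 16 GB of weights, which is consistent with the approximately 81% share cited from the paper.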

ICML, where the STAR-KV paper was accepted, is widely recognized as one of the top international conferences in AI and machine learning, along with NeurIPS and ICLR. ICML 2026 will be held from July 6 to 11 at COEX in Seoul. This year, 23,918 papers were reviewed, 6,352 were accepted, and 536 were selected as spotlight papers. Spotlight papers represent approximately 2.2% of all reviewed submissions and approximately 8.4% of accepted papers.

Going forward, Dnotitia plans to further develop STAR-KV for use in real-world AI service environments and to consider integrating it into open-source LLM inference frameworks such as vLLM.

“Technology that allows AI to process longer contexts faster and at lower cost is advancing rapidly,” said MK Chung, CEO of Dnotitia. “STAR-KV addresses key bottlenecks in KV cache capacity and attention processing speed, and Dnotitia aims to contribute to the AI inference ecosystem through open source.”

Source Dnotitia Inc.


