Streaming Attention Approximation via Discrepancy Theory
A Google TechTalk, presented by Ekaterina Kochetkova, 2025-10-23
ABSTRACT: The memory requirements of LLM inference grow rapidly with the context length due to the demands of attention computation. We present BalanceKV, an algorithm that leverages the geometric properties of the key-value cache to compress it without significantly affecting the quality of attention computation. BalanceKV has strong theoretical guarantees grounded in discrepancy theory and shows empirically validated performance improvements over existing methods. The full paper is available at arXiv:2502.07861.
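To make the memory claim concrete, here is a rough back-of-the-envelope sketch (not taken from the talk or the paper) of how the size of an uncompressed key-value cache scales with context length. The function name kv_cache_bytes and the model shape (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence, uncompressed."""
    # Factor 2: each layer stores one tensor of keys and one of values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, assumed purely for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, **shape) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly from 0.5 GiB at 4,096 tokens to 16 GiB at 131,072 tokens. That growth is the cost a cache-compression method such as BalanceKV aims to reduce.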
About the Speaker: Ekaterina Kochetkova is a third-year CS PhD student at EPFL working with Michael Kapralov. She is broadly interested in applying theoretical insights to develop efficient algorithms for large-scale machine learning. Her recent work focuses on optimizing the memory and runtime of LLM inference and on sublinear graph clustering methods that utilize learned vertex features. More information is available at https://ekaterina-kochetkova.github.io/e_kochetkova.github.io/.