🗓️ Posted: January 29th, 2026

Aditya Desai (SkyLight Team), Audrey Cheng, Ion Stoica, and the ADRS team

<aside> 💡

This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems.

In this post, we study sparse attention for accelerating decoding in Large Language Models (LLMs). The goal is to reduce the memory traffic and latency during the decoding phase, which is a fundamental bottleneck in LLM inference.

We explore how AI, specifically a Cursor agent within the SkyLight framework, can evolve towards state-of-the-art solutions like vAttention. Starting from a simple sink + sliding-window attention, the agent iteratively refines the approach, discovering components like top-k selection and positional bias. While the agent made significant progress, human oversight was crucial to correct subtle numerical issues and fully realize vAttention-level quality.

</aside>


The Problem: Designing Sparse Attention for Accelerating Decoding

Modern autoregressive LLM inference proceeds in two distinct phases: prefill and decode. In a typical single-turn interaction, the model first processes the entire prompt, which may include long textual instructions or visual inputs, and computes token representations across all transformer layers while materializing the key-value (KV) cache. This stage, referred to as the prefill phase, is largely compute-bound and can be executed efficiently on modern accelerators.
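To make the two phases concrete, the sketch below shows a single attention head in toy NumPy code: prefill projects the whole prompt at once and materializes the KV cache, while each decode step appends one entry and then reads back the entire cache. The names, shapes, and dimensions are illustrative assumptions rather than actual serving code, and real prefill also runs attention over the prompt; only the cache construction is shown here.

```python
# A minimal single-head sketch (toy NumPy, illustrative names and shapes only)
# of how prefill materializes the KV cache and how decode reuses and grows it.
import numpy as np

D = 64                                            # head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_h):
    """Project the whole prompt at once and materialize the KV cache.
    (Real prefill also runs attention over the prompt; omitted for brevity.)"""
    return prompt_h @ Wk, prompt_h @ Wv           # K, V: each (T, D)

def decode_step(h_t, K_cache, V_cache):
    """Generate one token: append one cache entry, then read the whole cache."""
    q = h_t @ Wq
    K_cache = np.vstack([K_cache, h_t @ Wk])      # cache grows by one row
    V_cache = np.vstack([V_cache, h_t @ Wv])
    w = softmax(K_cache @ q / np.sqrt(D))         # touches every cached key
    return w @ V_cache, K_cache, V_cache          # touches every cached value

prompt_h = rng.standard_normal((512, D))          # stand-in for prompt hidden states
K, V = prefill(prompt_h)                          # compute-bound, done once
h = rng.standard_normal(D)
for _ in range(4):                                # memory-bound, done per output token
    h, K, V = decode_step(h, K, V)
```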

The subsequent decode phase behaves very differently, leading to performance challenges. During decoding, the model generates one token at a time while repeatedly accessing the previously constructed KV cache. Despite operating on far fewer tokens per step, decoding is often significantly slower on a per-token basis. This slowdown is fundamental: the decode phase is dominated by memory traffic rather than computation, which limits effective GPU utilization and makes memory bandwidth and latency the primary bottlenecks (Figure 1).
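A rough back-of-envelope estimate shows why. The numbers below (model size, cache shape, bandwidth, peak throughput) are assumed, round figures for illustration rather than measurements from any particular GPU, but the gap they expose is representative:

```python
# Back-of-envelope illustration of why decoding is memory-bound.
# Every number below is an assumed, round figure for illustration only.
params = 7e9                       # ~7B-parameter model (assumption)
weight_bytes = params * 2          # fp16/bf16 weights, 2 bytes each

# KV cache for one request: 2 tensors (K, V) * 2 bytes * layers * heads * head_dim * seq_len
kv_cache_bytes = 2 * 2 * 32 * 32 * 128 * 8192    # ~4.3 GB at 8K context (assumed shape)

bytes_per_token = weight_bytes + kv_cache_bytes  # read from HBM on every decode step (batch size 1)

hbm_bandwidth = 2e12               # ~2 TB/s, roughly a recent datacenter GPU (assumption)
peak_flops = 300e12                # ~300 TFLOP/s dense fp16 (assumption)
flops_per_token = 2 * params       # one multiply-accumulate per weight

t_memory = bytes_per_token / hbm_bandwidth
t_compute = flops_per_token / peak_flops
print(f"memory-bound floor : {t_memory * 1e3:.2f} ms/token")   # ~9 ms
print(f"compute-bound floor: {t_compute * 1e3:.3f} ms/token")  # ~0.05 ms
# The memory term dominates compute by roughly two orders of magnitude, and the
# KV-cache share of that traffic grows with context length and batch size,
# which is exactly the traffic that sparse attention targets.
```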

For more information, check out this blog from FlashInfer.

Figure 1. Decoding is bounded by memory (courtesy of FlashInfer).

Sparse Attention for Decoding

Sparse attention is a widely explored paradigm for accelerating the decode phase of LLM inference. Instead of attending to all previously generated tokens, sparse attention selects only a subset of tokens to participate in the attention computation. This selection is typically performed independently for each attention head and each transformer layer. By limiting attention to a small set of relevant tokens, sparse attention significantly reduces KV cache accesses and overall memory traffic. As a result, memory pressure is lowered and the latency of each decoding step is reduced.
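As a minimal illustration of this select-then-attend pattern, here is a toy top-k sparse attention step for a single head, written in NumPy. The function name and sizes are our own, and the selection uses exact scores purely for clarity; a practical kernel would rely on a cheap or approximate relevance estimate so that choosing the tokens does not itself require reading the full cache.

```python
# Toy top-k sparse attention for one decode step and one head (NumPy sketch).
# Exact scores are used only to keep the example short; a real kernel would
# avoid reading every cached key just to decide which ones to keep.
import numpy as np

def sparse_decode_attention(q, K_cache, V_cache, k=64):
    """Attend only over the k most relevant cached tokens for this head."""
    d = q.shape[-1]
    scores = K_cache @ q / np.sqrt(d)            # (T,) relevance per cached token
    topk = np.argpartition(scores, -k)[-k:]      # indices of the k highest scores
    sel = scores[topk]
    w = np.exp(sel - sel.max())
    w /= w.sum()                                 # softmax over the selected subset only
    return w @ V_cache[topk]                     # reads just k rows of V

rng = np.random.default_rng(0)
T, D = 8192, 128                                 # toy cache length and head dim
q = rng.standard_normal(D)
K_cache = rng.standard_normal((T, D))
V_cache = rng.standard_normal((T, D))
out = sparse_decode_attention(q, K_cache, V_cache, k=64)   # (D,) attention output
```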