Shu Liu, Jeff Chen, Audrey Cheng, Melissa Pan, Ion Stoica, and ADRS Team
🗓️ Posted: November 6th, 2025
This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems.
In this blog, we revisit a recent research challenge from our MLSys ’25 paper, “Optimizing LLM Queries in Relational Data Analytics Workloads”, which tackles the high cost and latency of executing LLM queries over relational data. We show how ADRS, using OpenEvolve, autonomously discovered a 3× faster algorithm that achieves the same prefix reuse ratio as the published solution. This case study shows how AI can independently improve even state-of-the-art algorithms in LLM-based data analytics.
Large language models (LLMs) are increasingly used in relational data analytics workloads, where SQL queries or DataFrame operations invoke LLMs over relational (table-based) data. Major analytics platforms, such as AWS Redshift, Databricks, and Google BigQuery, now integrate LLM functionality directly into their SQL APIs. Similarly, modern DataFrame libraries like LOTUS support LLM calls over structured data.
For example, a user might issue the following query to check whether a product description matches its review:
```sql
SELECT LLM(
  "Is the description consistent with the review?",
  Description, Review
) FROM Products;
```
This simple-looking query triggers one LLM inference per row, often thousands or millions of times. Running such queries on commercial APIs or GPUs is prohibitively slow and expensive. For instance, as shown in Figure 2, executing such a query over a 100 GB table with Llama-3-8B on NVIDIA L4 GPUs can take 30 minutes and cost over $5K, and scaling to full-dataset analytics can reach $15K per query.

Figure 2. Running LLM queries with 100 GB of relational data.
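To make this fan-out concrete, here is a minimal sketch of how such a query expands into one prompt, and one inference call, per row. The `call_llm` helper and the prompt format are hypothetical placeholders, not the actual implementation of any of the platforms mentioned above.

```python
# Minimal sketch: one LLM SQL query fans out into one inference per row.
# `call_llm` and the prompt layout are hypothetical, for illustration only.
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call (commercial API or local GPU inference).
    return "yes"

rows = [
    {"Description": "Wireless mouse, 2.4 GHz", "Review": "Works great, no lag."},
    {"Description": "USB-C cable, 1 m", "Review": "Stopped charging after a week."},
]

question = "Is the description consistent with the review?"
results = []
for row in rows:
    # Every row produces its own prompt, so a table with millions of rows
    # triggers millions of separate LLM inferences.
    prompt = f"{question}\nDescription: {row['Description']}\nReview: {row['Review']}"
    results.append(call_llm(prompt))
```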
One of the key insights from the MLSys ’25 paper is that relational tables often contain repeated values, such as identical timestamps, product categories, and metadata fields shared across many rows. However, LLMs can reuse computation only when the shared parts appear as a prefix in consecutive prompts. In that case, the model can avoid reprocessing the repeated tokens by reusing their intermediate computed states, stored as key and value (KV) vectors in the KV-cache.
In practice, standard table orderings rarely align these shared prefixes, forcing the model to redundantly re-compute identical values across rows. The MLSys ’25 paper shows that by reordering the table’s columns and rows so that similar prefixes appear consecutively, the LLM can reuse its KV-cache across successive inference operations. This eliminates redundant computation and significantly reduces both latency and cost, since cached prefix tokens are 2–10× cheaper to process than uncached ones. Figure 3 illustrates this intuition.

Figure 3. Example of two different column orderings: when “time” and “description” columns are swapped to appear before “review,” the model reuses cached tokens across rows, converting repeated misses into hits.
To exploit the structure of relational tables, the reordering algorithm aims to maximize the prefix hit rate (PHR), i.e., the fraction of tokens reused across consecutive LLM inference requests. However, finding the optimal ordering is combinatorially intractable: a table with $n$ rows and $m$ fields has $n! \times (m!)^n$ possible orderings (every permutation of the rows combined with a per-row permutation of the fields).
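To make the metric concrete, here is a minimal sketch of the PHR idea. Whitespace-separated words stand in for real model tokens, and the rows, fields, and serialization format are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of prefix hit rate (PHR): the fraction of tokens that match
# the prefix of the previous prompt. Words stand in for model tokens.
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_hit_rate(prompts):
    """Fraction of tokens reused from the prefix of the preceding prompt."""
    total = hits = 0
    prev = []
    for p in prompts:
        tokens = p.split()
        hits += shared_prefix_len(prev, tokens)
        total += len(tokens)
        prev = tokens
    return hits / total if total else 0.0

rows = [
    {"time": "2024-01-01", "description": "USB-C cable", "review": "broke fast"},
    {"time": "2024-01-01", "description": "USB-C cable", "review": "works fine"},
]

def serialize(row, column_order):
    return " ".join(f"{col}: {row[col]}" for col in column_order)

# Placing the highly repeated columns first yields a much higher PHR.
for order in (["review", "time", "description"], ["time", "description", "review"]):
    prompts = [serialize(r, order) for r in rows]
    print(order, round(prefix_hit_rate(prompts), 2))
```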
The Greedy Group Recursion (GGR) algorithm proposed in the paper provides a practical approximation. It works by recursively scanning the table to find the value that contributes most to prefix reuse (typically one that is both frequent and long). Once this value is found, GGR prioritizes the column in which it occurs, moving that field earlier in the ordering. After selecting that column, GGR partitions the table into smaller subsets: rows sharing that value form one group, and the rest form another. The same process is then applied recursively within each subset (Figure 4).
This recursive grouping naturally aligns rows that share prefixes, enabling high cache reuse across LLM inferences. To further improve algorithm efficiency, GGR uses functional dependencies and table statistics (e.g., average string length, value cardinality) to decide when to stop recursion early.
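A simplified sketch of this greedy-group-recursion idea follows. It is not the paper's implementation: it scores each (column, value) pair by frequency × string length as a rough proxy for reuse benefit, and it omits the functional-dependency and statistics-based pruning described above.

```python
# Simplified sketch of greedy group recursion: pick the (column, value) pair
# with the largest estimated reuse benefit, move that column to the front,
# group the matching rows together, then recurse inside each group.
from collections import Counter

def reorder(rows, columns):
    """Return (reordered rows, per-row column orders) for a list of dict rows."""
    if len(rows) <= 1 or not columns:
        return rows, [list(columns) for _ in rows]

    # 1. Greedy step: score every repeated (column, value) by frequency * length.
    best_score, best_col, best_val = -1, None, None
    for col in columns:
        for val, freq in Counter(r[col] for r in rows).items():
            score = freq * len(str(val))
            if freq > 1 and score > best_score:
                best_score, best_col, best_val = score, col, val

    if best_col is None:  # no repeated value left to exploit
        return rows, [list(columns) for _ in rows]

    # 2. Partition: rows sharing the chosen value vs. the rest.
    hit = [r for r in rows if r[best_col] == best_val]
    miss = [r for r in rows if r[best_col] != best_val]
    rest = [c for c in columns if c != best_col]

    # 3. Recurse inside each group; the chosen column is emitted first.
    hit_rows, hit_orders = reorder(hit, rest)
    miss_rows, miss_orders = reorder(miss, columns)

    ordered = hit_rows + miss_rows
    orders = [[best_col] + o for o in hit_orders] + miss_orders
    return ordered, orders
```

Note that in this sketch each row can end up with its own field ordering, which is where the $(m!)^n$ term in the search space above comes from.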

Figure 4. Demonstration of GGR recursive reordering. GGR first selects the value that contributes most to prefix reuse (highlighted in yellow) and then partitions the table into sub-tables (red and blue boxes) based on that value.
Despite achieving near-optimal cache reuse in practice, GGR remains computationally expensive: it repeatedly scans tables and performs deep recursion, which adds runtime overhead and makes it difficult to scale to larger tables. This inefficiency makes it a natural target for automated optimization using ADRS.