🗓️ Posted: January 8th, 2026

Jiarong Xing, Yifan Qiao, Audrey Cheng, Shu Liu, Ion Stoica, and the ADRS team

<aside> 💡

This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems.

Low GPU utilization is a long-standing pain point for cloud LLM inference providers. As GPUs become more powerful (e.g., faster compute and larger memory), sharing GPUs across multiple models becomes a natural way to improve efficiency. In our Prism project, we explore flexible GPU sharing for multi-LLM serving.

A central challenge in Prism is model placement: co-locating the right models on shared GPUs to avoid latency SLO violations under dynamic workloads. In this blog, we first present a manually designed placement heuristic, then show how OpenEvolve automatically rediscovers it and improves on it by 17%.

</aside>


Background: Multi-LLM Serving with GPU Sharing

In real-world deployments, LLM inference providers host many models with long-tailed popularity: a small number of models are hot and receive most requests, while many others are cold and remain largely idle. Moreover, inference workloads are highly dynamic: request rates fluctuate over time, and models frequently switch between hot and cold states. The traditional approach of dedicating GPUs to individual models therefore leads to poor GPU utilization, especially for cold models with low resource demand.

Prism addresses this inefficiency by enabling flexible GPU sharing across multiple models. By co-locating models on shared GPUs, Prism significantly improves resource utilization and reduces serving costs. At its core, Prism introduces kvcached, a memory sharing mechanism that dynamically allocates KV cache across models using a virtual-memory abstraction. On top of this foundation, Prism provides scheduling algorithms that coordinate resource sharing among models to meet their diverse latency SLO requirements.

kvcached enables co-located models to dynamically share a GPU to improve utilization and reduce costs.
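
To make the virtual-memory idea concrete, here is a minimal CPU-side Python sketch of the demand-paging principle behind kvcached. All names here (`PhysicalPagePool`, `VirtualKVCache`, `append_tokens`) are hypothetical illustrations, not kvcached's actual API, which manages GPU virtual memory directly:

```python
# Toy sketch of demand-paged KV-cache sharing (hypothetical names, not the
# real kvcached API). Each model reserves a large *virtual* KV cache, but
# physical pages are committed from a shared pool only as tokens arrive,
# so idle models hold almost no memory and hot models can grow on demand.

PAGE_TOKENS = 16  # tokens covered by one physical page (illustrative)

class PhysicalPagePool:
    """Shared pool of physical KV-cache pages on one GPU."""
    def __init__(self, total_pages: int):
        self.free_pages = total_pages

    def allocate(self, n: int) -> bool:
        if n > self.free_pages:
            return False            # pool exhausted: caller must preempt or evict
        self.free_pages -= n
        return True

    def release(self, n: int) -> None:
        self.free_pages += n

class VirtualKVCache:
    """Per-model virtual KV cache, backed lazily by the shared pool."""
    def __init__(self, pool: PhysicalPagePool):
        self.pool = pool
        self.mapped_pages = 0
        self.tokens = 0

    def append_tokens(self, n: int) -> bool:
        needed = -(-(self.tokens + n) // PAGE_TOKENS)   # ceil division
        extra = needed - self.mapped_pages
        if extra > 0 and not self.pool.allocate(extra):
            return False            # out of physical memory
        self.mapped_pages = max(self.mapped_pages, needed)
        self.tokens += n
        return True

    def release_all(self) -> None:
        self.pool.release(self.mapped_pages)
        self.mapped_pages = self.tokens = 0

# Two co-located models sharing one pool: the hot model maps pages as its
# load grows, while the cold model holds only a couple of pages.
pool = PhysicalPagePool(total_pages=4096)
hot, cold = VirtualKVCache(pool), VirtualKVCache(pool)
hot.append_tokens(50_000)
cold.append_tokens(32)
print(pool.free_pages)  # remaining pages stay available to whichever model needs them next
```

Decoupling virtual reservation from physical commitment is the same trick operating systems use for demand paging; it is what lets kvcached shift KV-cache capacity between co-located models as their loads change.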

The Model-to-GPU Placement Problem

In GPU sharing, the primary bottleneck is often GPU memory, particularly the KV cache, whose footprint grows with the number of active tokens. When multiple models are co-located on the same GPU, they compete for KV cache capacity; once memory becomes scarce, the system must preempt or evict work, leading to degraded time-to-first-token (TTFT) and time-per-output-token (TPOT).
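
To get a feel for the scale, a quick back-of-the-envelope calculation helps. The numbers below assume a Llama-3-8B-style configuration (32 layers, 8 KV heads of dimension 128, fp16 KV cache); they are illustrative assumptions, not measurements from Prism:

```python
# Rough KV-cache footprint for an assumed Llama-3-8B-style model
# (illustrative numbers only, not measurements from the Prism paper).
num_layers   = 32
num_kv_heads = 8      # grouped-query attention
head_dim     = 128
dtype_bytes  = 2      # fp16 / bf16 KV cache

# 2x for the key and value tensors at every layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KiB per token")  # 128 KiB

# 200K active tokens, e.g., a few hundred concurrent long-context requests.
active_tokens = 200_000
print(f"{active_tokens * bytes_per_token / 2**30:.1f} GiB of KV cache")  # ~24.4 GiB
```

At this scale, two or three busy models can exhaust the KV-cache budget of an 80 GB GPU once model weights are loaded, which is why co-location decisions matter so much.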

As a result, model co-location decisions are critical. Placing multiple memory-hungry models on the same GPU can quickly exhaust KV cache capacity, triggering resource contention and ultimately causing latency SLO violations. Prism needs to find a model placement that minimizes SLO violations.
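
As a strawman, one can view co-location as a bin-packing feasibility check over estimated peak KV demand. The sketch below is only that strawman with made-up numbers, not Prism's placement algorithm, which also has to handle dynamic load and per-model SLOs:

```python
# Naive co-location feasibility check (illustrative only, not Prism's
# placement heuristic). Demands are hypothetical per-model estimates of
# peak KV-cache usage in GB.

def fits_on_gpu(kv_demands_gb: list[float], kv_budget_gb: float,
                headroom: float = 0.1) -> bool:
    """True if the summed peak KV demand fits with some safety headroom."""
    return sum(kv_demands_gb) <= kv_budget_gb * (1.0 - headroom)

# An 80 GB GPU with ~20 GB of weights loaded leaves roughly 60 GB for KV cache.
print(fits_on_gpu([28.0, 16.0, 8.0], kv_budget_gb=60.0))  # True:  52 <= 54
print(fits_on_gpu([30.0, 30.0, 8.0], kv_budget_gb=60.0))  # False: 68 >  54
```

The catch is that peak demands are not static: a model that was cold when placed can turn hot an hour later, which is exactly what makes the placement problem hard.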

This problem is deceptively simple to state (see Prism §6.2 for more details):