🗓️ Posted: October 23, 2025

Audrey Cheng, Bowen Wang, Shu Liu, Melissa Pan, Ion Stoica, and the ADRS team

<aside> 🛠️

This post is the first in a series of case studies in which we apply ADRS to optimize performance in various systems. In this blog, we discuss the optimization of a key component of large language model (LLM) inference. Specifically, we demonstrate how OpenEvolve independently discovers algorithms that surpass highly optimized ones engineered by human experts, achieving a 5.0x speedup.

https://github.com/UCB-ADRS/ADRS

The Problem: Balancing Load for MoE Inference

The immense scale of modern LLMs is made manageable by architectures like Mixture-of-Experts (MoE). In this model, a router dynamically sends each token of an input to a small subset of specialized "expert" networks. This allows requests to be processed using only a fraction of the model's total parameters, greatly improving inference efficiency. However, this architecture introduces the critical performance challenge of balancing the load across these experts.
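To make the routing step concrete, here is a toy sketch of top-k routing (illustrative only; the shapes, names, and k=2 are our own assumptions, not any particular model's router). Whichever experts are chosen most often become the "hot" experts discussed next.

```python
import torch

def route_tokens(hidden_states, router_weights, k=2):
    """Toy MoE router: score every expert for each token, keep only the top-k."""
    logits = hidden_states @ router_weights            # [num_tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_expert_ids = torch.topk(probs, k, dim=-1)
    # Each token activates only k experts, so only a fraction of the model runs.
    return topk_probs, topk_expert_ids

# Example: 8 tokens, hidden size 512, 64 experts (all values are placeholders).
tokens = torch.randn(8, 512)
router = torch.randn(512, 64)
probs, expert_ids = route_tokens(tokens, router)
```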

Inevitably, some experts become more popular or "hot," creating computational bottlenecks. The GPUs hosting these hot experts are overwhelmed, while others sit idle, wasting valuable resources (Figure 1).

Figure 1. An unbalanced MoE system: the bright yellow spots represent "hot" experts, showing load imbalance and GPU underutilization. “Physical experts” refer to the model weights residing on GPUs, which may include both regular “logical” experts without EPLB and their replicated counterparts, as illustrated in the following figure.


The solution is an Expert Parallelism Load Balancer (EPLB): an algorithm that dynamically rearranges experts across GPUs to minimize load imbalance and maximize system throughput. The basic EPLB algorithm runs in two stages:

Given a workload, an MoE setup, and a set of GPUs, it first determines the number of replicas for each expert and then maps these replicas onto GPUs.
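One way to picture the replica-planning stage (a hand-rolled sketch under our own assumptions, not the policy used by any of the implementations discussed below): keep granting spare replica slots to whichever expert currently carries the most load per replica.

```python
def plan_replicas(expert_loads, num_extra_replicas):
    """Sketch of replica planning: every expert gets one replica, then each
    spare replica slot goes to the expert with the highest load per replica."""
    replicas = [1] * len(expert_loads)
    for _ in range(num_extra_replicas):
        hottest = max(range(len(expert_loads)),
                      key=lambda e: expert_loads[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

# Example: a 4-expert layer with one very hot expert and 2 spare slots.
print(plan_replicas([900, 100, 80, 60], 2))  # -> [3, 1, 1, 1]
```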

The EPLB algorithm has two objectives:

  1. Minimize imbalance: Distribute the load as evenly as possible.
  2. Minimize runtime: The rearrangement process itself must be fast to avoid becoming a new bottleneck. (Both objectives are measured in the sketch after this list.)
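Both objectives are straightforward to measure. Here is a minimal sketch, assuming a hypothetical rebalance_fn(expert_loads, num_gpus) interface that returns per-GPU loads, and using the load-balance factor reported later in this post (the ratio of average to maximum load per GPU):

```python
import time

def balance_factor(gpu_loads):
    """Ratio of average to maximum per-GPU load; 1.0 means perfectly balanced."""
    return sum(gpu_loads) / len(gpu_loads) / max(gpu_loads)

def evaluate_rebalance(rebalance_fn, expert_loads, num_gpus):
    """Score a candidate EPLB algorithm on both objectives: balance and runtime."""
    start = time.perf_counter()
    gpu_loads = rebalance_fn(expert_loads, num_gpus)   # hypothetical interface
    runtime_ms = (time.perf_counter() - start) * 1e3
    return balance_factor(gpu_loads), runtime_ms
```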

The EPLB algorithm has a direct impact on the cost and performance of production LLM serving (Figure 2).

Figure 2. With load balancing under an EPLB algorithm, GPUs can be more fully utilized to lower costs and provide better LLM serving performance. In this figure, we have 64 logical experts and 16 replicated experts.


Existing EPLB Algorithms

We consider two baselines in searching for a better EPLB algorithm.

First, we evaluate DeepSeek's open-source EPLB implementation. This employs a greedy bin-packing strategy: experts are sorted by load in descending order, and each is placed onto the least-loaded GPU that has capacity (Figure 3a, Example 1). While simple, this solution is slow because it is written in Python and uses a for-loop to perform a linear search for the best-fit GPU. On average, it takes about 540 ms to re-balance the experts and achieves a load balance factor of 0.66 (calculated as the ratio of average to maximum tokens generated per GPU).
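To make the description concrete, here is a simplified sketch of that greedy strategy (our own toy reconstruction, not DeepSeek's code; it ignores replication and node hierarchy, and the slots_per_gpu parameter is an assumption):

```python
def greedy_pack(expert_loads, num_gpus, slots_per_gpu):
    """Greedy bin packing (sketch): sort experts by load, then place each onto
    the least-loaded GPU that still has a free expert slot."""
    assert len(expert_loads) <= num_gpus * slots_per_gpu, "not enough slots"
    gpu_loads = [0.0] * num_gpus
    free_slots = [slots_per_gpu] * num_gpus
    placement = {}  # expert index -> GPU index
    for expert in sorted(range(len(expert_loads)),
                         key=lambda e: expert_loads[e], reverse=True):
        # Linear scan for the least-loaded GPU with capacity left; doing this
        # in a Python loop for every expert is what makes the baseline slow.
        candidates = [g for g in range(num_gpus) if free_slots[g] > 0]
        best = min(candidates, key=lambda g: gpu_loads[g])
        placement[expert] = best
        gpu_loads[best] += expert_loads[expert]
        free_slots[best] -= 1
    return placement, gpu_loads
```

The gpu_loads returned here can be fed into the balance_factor helper sketched earlier to score how even the resulting placement is.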