🗓️ Posted: January 15th, 2026

Jacopo Tagliabue, Audrey Cheng, Shu Liu, Ion Stoica, and the ADRS team

<aside> 💡

This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems. We feature exciting work from Bauplan this week!

In this blog, we describe how we repeatedly sample frontier models to generate scheduling policies for workloads in a FaaS lakehouse, Bauplan. Leveraging our FaaS simulator, Eudoxia, as a fast verifier, we share preliminary findings (to be presented at AAAI26) from our journey of applying AI to real-world systems.

</aside>


“In Eudoxia, (...), a carpet is preserved in which you can observe the city’s true form. At first sight nothing seems to resemble Eudoxia less than the design of that carpet (...), but if you pause and examine it carefully, you become convinced that each place in the carpet corresponds to a place in the city and all the things contained in the city are included in the design.” I. Calvino, Invisible Cities

The Problem: Designing Flexible Scheduling Policies

Bauplan is a platform for running data pipelines, and at its core sits a scheduler: when multiple users want to run their DAGs, which one goes first? How many resources should each be given? When should failed work be retried?

In practice, this is a complex, multi-dimensional problem. Tasks within an organization may differ in how much time and how many resources they require, and different organizations may have wildly different workload distributions: some use Bauplan mostly for long-running batch jobs, while others execute frequent mini-batches and online queries. Moreover, schedulers can be designed to optimize different metrics: high throughput, low latency for high-priority jobs, or predictable performance.

Our challenge is therefore the following: given a distribution of workloads and a target KPI, can we use LLMs to automatically write and improve scheduling policies?
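To make the artifact concrete, here is a minimal sketch of what one candidate policy could look like. The interface (`PendingRun`, `pick_next_run`, the field names) is a hypothetical stand-in we use for illustration, not Bauplan's actual scheduler API:

```python
from dataclasses import dataclass

@dataclass
class PendingRun:
    run_id: str
    priority: int          # higher means more important
    est_runtime_s: float   # estimated runtime from past executions
    cpu_request: int       # CPUs requested by the run

def pick_next_run(pending: list[PendingRun], free_cpus: int) -> PendingRun | None:
    """Toy policy: among runs that fit in the free CPUs, prefer higher priority,
    then shorter estimated runtime (a priority-aware shortest-job-first)."""
    runnable = [r for r in pending if r.cpu_request <= free_cpus]
    if not runnable:
        return None
    return min(runnable, key=lambda r: (-r.priority, r.est_runtime_s))
```

A policy like this is small, self-contained, and easy to score in simulation, which is exactly what makes it a good target for automatic generation.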

[Figure: The ADRS loop with a simulator as the verifier.]
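The loop in the figure can be sketched in a few lines. The sketch below assumes (a) an LLM client that returns Python source for a candidate policy and (b) a simulator call that replays a workload trace under that policy and returns the target KPI (higher is better here); `propose_policy` and `simulate` are hypothetical stand-ins, not the actual Eudoxia or Bauplan APIs:

```python
def adrs_loop(propose_policy, simulate, workload_trace, n_rounds=20):
    best_policy, best_kpi, feedback = None, float("-inf"), None
    for _ in range(n_rounds):
        # 1. Sample a frontier model for a new candidate policy,
        #    conditioning on feedback from previous rounds.
        candidate_src = propose_policy(feedback)
        try:
            namespace = {}
            exec(candidate_src, namespace)          # materialize the generated policy
            candidate = namespace["pick_next_run"]  # assumes the interface sketched above
        except Exception as err:
            feedback = f"candidate failed to load: {err}"
            continue
        # 2. Use the simulator as a fast verifier on the workload distribution.
        kpi = simulate(candidate, workload_trace)
        # 3. Keep the best policy so far and feed the score back to the model.
        if kpi > best_kpi:
            best_policy, best_kpi = candidate, kpi
        feedback = f"last candidate scored {kpi:.3f}, best so far {best_kpi:.3f}"
    return best_policy, best_kpi
```

The key design choice is that the verifier is a simulator rather than the production scheduler: candidate policies can be scored in seconds, so the loop can afford many rounds of sampling and refinement.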

Not a Warehouse, not a Lambda, but a Third New Thing

The Data Lakehouse (DLH) is the de facto standard for analytics, data engineering, and AI workloads. In traditional DLHs, however, supporting such diverse workloads comes at a cost: debugging data pipelines, running workloads on a schedule, and querying tables require practitioners to move between several UIs and master a plethora of tools, each with its own mental model:

Traditional DLH

| Interaction | UX | Infrastructure |
| --- | --- | --- |
| Batch pipeline | Submit API | One-off cluster |
| Dev. pipeline | Notebook Session | Dev. cluster |
| Inter. query | Web Editor (JDBC Driver) | Warehouse |