Posted: December 18, 2025
Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Shubham Agarwal, Mert Cemri, Ion Stoica, and the ADRS team
This post expands our work on AI-Driven Research for Systems (ADRS). We evaluate three open-source frameworks across ten real-world research problems, demonstrating their ability to generate solutions that outperform human experts, including a 13x speedup in load balancing and 35% cost savings in cloud scheduling. Based on these findings, we outline best practices for problem specification, evaluation, and feedback, providing a roadmap for applying these tools effectively.
One of the most ambitious goals of artificial intelligence is to automate the scientific discovery process itself, from algorithm design to experiment execution. We argue that computer systems research is uniquely positioned to benefit from this shift. Traditionally, improving performance in this field relies on the meticulous, human-driven design of algorithms for routing, scheduling, or resource management. However, this is changing. We are moving from treating systems as black boxes (using AI merely to tune configuration knobs) to viewing them as white boxes, where AI tools can rewrite the system code itself. We term this approach AI-Driven Research for Systems (ADRS).
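To make the black-box versus white-box distinction concrete, here is a toy sketch (not taken from this post; the cache, trace, and policy names are all hypothetical). In the black-box view, an optimizer only tunes a configuration knob such as the cache size; in the white-box view, the eviction code itself is rewritten.

```python
# Hypothetical illustration of the shift ADRS describes, using a toy cache.
# Black box: AI tunes a knob (cache size) around a fixed policy (LRU).
# White box: the eviction policy code itself is rewritten.
from collections import OrderedDict
import random


def hit_rate(cache_size, evict, trace):
    """Replay a key trace through a toy cache and report the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= cache_size:
                evict(cache)  # the policy decides which entry to drop
            cache[key] = True
    return hits / len(trace)


# Black box: the policy stays fixed (LRU); only the size knob is searched.
def evict_lru(cache):
    cache.popitem(last=False)  # drop the least recently used entry


# White box: a rewritten eviction heuristic (toy example, not a real result)
# that spares the single oldest entry once the cache has warmed up.
def evict_rewritten(cache):
    keys = list(cache)
    victim = keys[0] if len(keys) < 4 else keys[1]
    del cache[victim]


if __name__ == "__main__":
    random.seed(0)
    trace = [random.randint(0, 49) for _ in range(5000)]
    best_knob = max(hit_rate(size, evict_lru, trace) for size in (8, 16, 32))
    rewritten = hit_rate(32, evict_rewritten, trace)
    print(f"black box (best knob, LRU):   {best_knob:.2%}")
    print(f"white box (rewritten policy): {rewritten:.2%}")
```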
Building on our prior work, this article expands the scope of ADRS by rigorously evaluating multiple frameworks and suggesting best practices for applying them effectively. Our focus here is to determine how to deploy these tools to solve real problems.
To validate the capability of this approach, we test three emerging open-source ADRS frameworks (OpenEvolve, GEPA, and ShinkaEvolve) across ten research tasks. The results confirm that these frameworks can already generate solutions that match or exceed the human state of the art, such as a 13x speedup for MoE load balancing and 35% greater cost savings when scheduling jobs across spot instances.
With the efficacy of these tools established, we turn to best practices. Based on extensive ablation studies, we outline the strategies necessary for success along three critical axes: problem specification (where "less is often more"), evaluation (where the solution is only as good as the verifier), and feedback (where granularity determines convergence).
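As a rough illustration of how these three axes fit together, the sketch below shows a stripped-down ADRS-style loop: a short problem specification, a verifier that scores each candidate, and per-case feedback carried into the next proposal. Everything here is hypothetical (propose_candidate is only a stand-in for the LLM call, and the scheduling workload is a toy); it is not the API of OpenEvolve, GEPA, or ShinkaEvolve.

```python
# Hypothetical ADRS-style loop: specification, evaluation, and feedback.
import random

# Problem specification axis: a short, focused statement of the task.
SPEC = "Write schedule(servers, job) that minimizes makespan on 4 servers."


def evaluate(candidate, workloads):
    """Evaluation axis: the verifier. Returns an overall score plus
    per-workload feedback; that granularity is what drives convergence."""
    per_case = []
    for w in workloads:
        servers = [0.0] * 4
        for job in w:
            servers[candidate(servers, job)] += job
        per_case.append(max(servers))  # makespan, lower is better
    feedback = [f"workload {i}: makespan {m:.1f}" for i, m in enumerate(per_case)]
    return -sum(per_case), feedback


def propose_candidate(spec, feedback):
    """Stand-in for the LLM proposal step. A real framework would send the
    spec and the feedback text to a model and parse the returned code;
    here we just sample a randomized variant of a greedy heuristic."""
    slack = random.uniform(0.0, 2.0)

    def schedule(servers, job):
        best = min(servers)
        options = [i for i, s in enumerate(servers) if s <= best + slack]
        return random.choice(options)

    return schedule


if __name__ == "__main__":
    random.seed(0)
    workloads = [[random.uniform(1, 10) for _ in range(100)] for _ in range(5)]
    best_score, feedback = float("-inf"), []
    for _ in range(20):  # iteration budget
        candidate = propose_candidate(SPEC, feedback)
        score, feedback = evaluate(candidate, workloads)
        best_score = max(best_score, score)
    print(f"best score found: {best_score:.1f}")
```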
To rigorously evaluate the capability of ADRS, we expanded our investigation to three open-source frameworks (GEPA, OpenEvolve, and ShinkaEvolve) on ten real-world research problems across diverse sub-domains, including networking, databases, and core systems. We use GPT-5 and Gemini-3.0 as the underlying models, cap each run at 100 iterations to ensure a fair comparison, and provide the specific configs in the appendix of our paper.

Table 1. Summary of results achieved by ADRS frameworks.
Table 1 presents an overview of selected case studies. In nearly all cases, LLMs were able to discover solutions that outperformed state-of-the-art baselines.
Most of these solutions were discovered in under 8 hours, at a cost of less than $30. Importantly, the results we're sharing should be seen as a starting point; as the frameworks and underlying models improve, we expect even larger gains.