Posted: December 18, 2025
Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Shubham Agarwal, Mert Cemri, Ion Stoica, and the ADRS team
This post expands our work on AI-Driven Research for Systems (ADRS). We evaluate three open-source frameworks across ten real-world research problems, demonstrating their ability to generate solutions that outperform human experts, including a 13x speedup in load balancing and 35% cost savings in cloud scheduling. Based on these findings, we outline best practices for problem specification, evaluation, and feedback, providing a roadmap for applying these tools effectively.
One of the most ambitious goals of artificial intelligence is to automate the scientific discovery process itself, from algorithm design to experiment execution. We argue that computer systems research is uniquely positioned to benefit from this shift. Traditionally, improving performance in this field relies on the meticulous, human-driven design of algorithms for routing, scheduling, or resource management. However, this is changing. We are moving from treating systems as black boxes (using AI merely to tune configuration knobs) to viewing them as white boxes, where AI tools can rewrite the system code itself. We term this approach AI-Driven Research for Systems (ADRS).
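To make the black-box versus white-box distinction concrete, here is a toy sketch (not taken from this post; the cache, trace, and policy names are all hypothetical). In the black-box view, an optimizer only tunes a configuration knob such as the cache size; in the white-box view, the eviction code itself is rewritten.

```python
# Hypothetical illustration of the shift ADRS describes, using a toy cache.
# Black box: AI tunes a knob (cache size) around a fixed policy (LRU).
# White box: the eviction policy code itself is rewritten.
from collections import OrderedDict
import random


def hit_rate(cache_size, evict, trace):
    """Replay a key trace through a toy cache and report the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= cache_size:
                evict(cache)  # the policy decides which entry to drop
            cache[key] = True
    return hits / len(trace)


# Black box: the policy stays fixed (LRU); only the size knob is searched.
def evict_lru(cache):
    cache.popitem(last=False)  # drop the least recently used entry


# White box: a rewritten eviction heuristic (toy example, not a real result)
# that spares the single oldest entry once the cache has warmed up.
def evict_rewritten(cache):
    keys = list(cache)
    victim = keys[0] if len(keys) < 4 else keys[1]
    del cache[victim]


if __name__ == "__main__":
    random.seed(0)
    trace = [random.randint(0, 49) for _ in range(5000)]
    best_knob = max(hit_rate(size, evict_lru, trace) for size in (8, 16, 32))
    rewritten = hit_rate(32, evict_rewritten, trace)
    print(f"black box (best knob, LRU):   {best_knob:.2%}")
    print(f"white box (rewritten policy): {rewritten:.2%}")
```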
Building on our prior work, this article expands the scope of ADRS by rigorously evaluating multiple frameworks and suggesting best practices for applying them effectively. Our focus here is to determine how to deploy these tools to solve real problems.
To validate the capability of this approach, we test three emerging open-source ADRS frameworks (OpenEvolve, GEPA, and ShinkaEvolve) across ten research tasks. The results confirm that these frameworks can already generate solutions that match or exceed the human state of the art, such as a 13x speedup for MoE load balancing and 35% greater cost savings when scheduling jobs across spot instances.
With the efficacy of these tools established, we turn to best practices. Based on extensive ablation studies, we outline the strategies necessary for success along three critical axes: problem specification (where "less is often more"), evaluation (where the solution is only as good as the verifier), and feedback (where granularity determines convergence).
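As a rough illustration of how these three axes fit together, the sketch below shows a stripped-down ADRS-style loop: a short problem specification, a verifier that scores each candidate, and per-case feedback carried into the next proposal. Everything here is hypothetical (propose_candidate is only a stand-in for the LLM call, and the scheduling workload is a toy); it is not the API of OpenEvolve, GEPA, or ShinkaEvolve.

```python
# Hypothetical ADRS-style loop: specification, evaluation, and feedback.
import random

# Problem specification axis: a short, focused statement of the task.
SPEC = "Write schedule(servers, job) that minimizes makespan on 4 servers."


def evaluate(candidate, workloads):
    """Evaluation axis: the verifier. Returns an overall score plus
    per-workload feedback; that granularity is what drives convergence."""
    per_case = []
    for w in workloads:
        servers = [0.0] * 4
        for job in w:
            servers[candidate(servers, job)] += job
        per_case.append(max(servers))  # makespan, lower is better
    feedback = [f"workload {i}: makespan {m:.1f}" for i, m in enumerate(per_case)]
    return -sum(per_case), feedback


def propose_candidate(spec, feedback):
    """Stand-in for the LLM proposal step. A real framework would send the
    spec and the feedback text to a model and parse the returned code;
    here we just sample a randomized variant of a greedy heuristic."""
    slack = random.uniform(0.0, 2.0)

    def schedule(servers, job):
        best = min(servers)
        options = [i for i, s in enumerate(servers) if s <= best + slack]
        return random.choice(options)

    return schedule


if __name__ == "__main__":
    random.seed(0)
    workloads = [[random.uniform(1, 10) for _ in range(100)] for _ in range(5)]
    best_score, feedback = float("-inf"), []
    for _ in range(20):  # iteration budget
        candidate = propose_candidate(SPEC, feedback)
        score, feedback = evaluate(candidate, workloads)
        best_score = max(best_score, score)
    print(f"best score found: {best_score:.1f}")
```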
To rigorously evaluate the capability of ADRS, we expanded our investigation to three open-source frameworks (GEPA, OpenEvolve, and ShinkaEvolve) on ten real-world research problems across diverse sub-domains, including networking, databases, and core systems. We use GPT-5 and Gemini-3.0 as the underlying models, cap each run at 100 iterations to ensure a fair comparison, and provide the specific configs in the appendix of our paper.

Table 1. Summary of results achieved by ADRS frameworks.
Table 1 presents an overview of selected case studies. In nearly all cases, LLMs were able to discover solutions that outperformed state-of-the-art baselines.
Most of these solutions were discovered in under 8 hours, at a cost of less than $30. Importantly, the results we're sharing should be seen as a starting point; as the frameworks and underlying models improve, we expect even larger gains.