🗓️ Posted: December 15, 2025
Mert Cemri, Melissa Pan, Audrey Cheng, Shu Liu, Ion Stoica, and the ADRS team
<aside> 💡
This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems.
Designing effective multi-agent systems typically requires debugging workloads via execution logs and iteratively refining the agentic systems’ behavior. Previously, we demonstrated how the MAST Annotator provides scalable, systematic feedback on failure modes to guide agent builders in making design improvements. However, that approach still relied on hand-crafted solutions and implementations.
In this blog, we replace hand-tuning with OpenEvolve to optimize the Multi-Agent System (MAS) code directly. By leveraging MAST feedback, OpenEvolve continuously mutates the architecture, automatically converging toward a more reliable system and reducing the failure rate by 7x.
In this work, we demonstrate that you can replace manual agent debugging with automated architectural search. By combining OpenEvolve (evolutionary optimization) with MAST (fine-grained failure signals), we took a standard MetaGPT-style software development team and let the code rewrite itself.
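To make the loop concrete, here is a minimal sketch of how an evaluator could combine task success with MAST failure counts into a single fitness signal for the evolutionary search. The function parameters (`run_mas`, `annotate_with_mast`), the trace format, and the penalty weight are illustrative assumptions, not the exact interfaces used in this work.

```python
from typing import Callable

def evaluate_candidate(
    mas_source: str,
    tasks: list[dict],
    run_mas: Callable[[str, dict], tuple[list[dict], float]],  # executes the candidate MAS, returns (trace, task_score)
    annotate_with_mast: Callable[[list[dict]], list[str]],     # returns MAST failure-mode labels for a trace
    failure_weight: float = 0.05,                              # illustrative penalty per detected failure
) -> dict:
    """Score one evolved MAS implementation for the evolutionary search."""
    task_scores, failure_counts = [], []
    for task in tasks:
        trace, score = run_mas(mas_source, task)
        failures = annotate_with_mast(trace)
        task_scores.append(score)
        failure_counts.append(len(failures))

    avg_score = sum(task_scores) / len(task_scores)
    avg_failures = sum(failure_counts) / len(failure_counts)

    # Reward task success and penalize MAST-detected failures, so the search
    # is pushed toward reliable architectures rather than lucky single runs.
    return {
        "fitness": avg_score - failure_weight * avg_failures,
        "avg_task_score": avg_score,
        "avg_failures_per_trace": avg_failures,
    }
```

In this kind of setup, OpenEvolve calls an evaluator like this on each mutated candidate and keeps the highest-fitness variants for the next generation.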
Over 46 iterations, the system autonomously evolved a fragile baseline (0.136 score, ~7 failures/trace) into a robust architecture (0.50 score, ~1 failure/trace), with 7x fewer failures. Crucially, the optimizer discovered sophisticated design patterns that usually take humans days to identify:
- A SimpleVerifier agent to decouple execution from checking.

We also expose the risks of automated design: without strict guardrails (like mandatory evidence gates), the evolutionary search creates "reward hacks," optimizing for score by simply deleting the agents responsible for reporting failures.
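As a sketch of what such a guardrail could look like, the evaluator can refuse to award any score to a trace that lacks verification evidence. The message schema and evidence keys below are assumptions for illustration, not the exact mechanism used in this work.

```python
# Hypothetical evidence gate inside the evaluator. The trace format and the
# evidence keys are assumptions; the point is that a candidate which deletes
# its verifier produces traces with no verification evidence and scores zero,
# closing off the "delete the reporter" reward hack.
REQUIRED_EVIDENCE = {"verifier_report", "test_results"}

def gated_fitness(trace: list[dict], raw_fitness: float) -> float:
    """Return the raw fitness only if the trace contains verification evidence."""
    has_evidence = any(msg.get("type") in REQUIRED_EVIDENCE for msg in trace)
    return raw_fitness if has_evidence else 0.0
```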
Multi-agent systems are easy to prototype but painful to improve. When a run goes wrong, you rarely get a single clear signal. Often, you see a spaghetti trace where agents repeat steps, drift roles, drop context, ignore each other’s messages, “verify” hand-wavily, or declare success early. The result is that every design decision becomes a guessing game: How many agents? How should the task be broken down? How should agents communicate with each other?
In a previous blog, we introduced MAST [NeurIPS'25 Spotlight], a failure taxonomy and LLM annotator that turns those messy agent traces into structured, actionable failure signals (instead of vibes-based log reading). But even with good diagnostics, the development loop is still mostly manual: a human reads the failure reports, hypothesizes a fix, edits the prompts or agent topology, and reruns the workload.
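For intuition, the structured signal for a single trace might look something like the example below. The schema and agent names are illustrative assumptions; the failure-mode labels follow MAST's taxonomy.

```python
# Illustrative shape of one annotated trace (the schema is an assumption;
# the failure-mode labels follow MAST's taxonomy of specification,
# inter-agent misalignment, and verification failures).
mast_annotation = {
    "trace_id": "run_017",
    "failures": [
        {"mode": "step_repetition",               "agent": "Engineer",       "step": 12},
        {"mode": "information_withholding",       "agent": "Architect",      "step": 18},
        {"mode": "no_or_incomplete_verification", "agent": "QA",             "step": 31},
        {"mode": "premature_termination",         "agent": "ProductManager", "step": 34},
    ],
    "failure_count": 4,
}
```

Per-trace, per-mode counts like these are what make the signal actionable for an automated optimizer rather than just a human reader.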
So we asked: can we automate the iteration loop itself?