🗓️ Posted: December 15, 2025
Mert Cemri, Melissa Pan, Audrey Cheng, Shu Liu, Ion Stoica, and the ADRS team
<aside> 💡
This post is part of our AI-Driven Research for Systems (ADRS) case study series, where we use AI to automatically discover better algorithms for real-world systems problems.
Designing effective multi-agent systems typically requires debugging workloads via execution logs and iteratively refining the agentic systems’ behavior. Previously, we demonstrated how the MAST Annotator provides scalable, systematic feedback on failure modes to guide agent builders in making design improvements. However, that approach still relied on hand-crafted solutions and implementations.
In this blog, we replace hand-tuning with OpenEvolve to optimize the Multi-Agent System (MAS) code directly. By leveraging MAST feedback, OpenEvolve continuously mutates the architecture, automatically converging toward a more reliable system and reducing the failure rate by 7x.
In this work, we demonstrate that you can replace manual agent debugging with automated architectural search. By combining OpenEvolve (evolutionary optimization) with MAST (fine-grained failure signals), we took a standard MetaGPT-style software development team and let the code rewrite itself.
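To make the loop concrete, here is a minimal sketch of how an evaluator could combine task success with MAST failure counts into a single fitness signal for the evolutionary search. The function parameters (`run_mas`, `annotate_with_mast`), the trace format, and the penalty weight are illustrative assumptions, not the exact interfaces used in this work.

```python
from typing import Callable

def evaluate_candidate(
    mas_source: str,
    tasks: list[dict],
    run_mas: Callable[[str, dict], tuple[list[dict], float]],  # executes the candidate MAS, returns (trace, task_score)
    annotate_with_mast: Callable[[list[dict]], list[str]],     # returns MAST failure-mode labels for a trace
    failure_weight: float = 0.05,                              # illustrative penalty per detected failure
) -> dict:
    """Score one evolved MAS implementation for the evolutionary search."""
    task_scores, failure_counts = [], []
    for task in tasks:
        trace, score = run_mas(mas_source, task)
        failures = annotate_with_mast(trace)
        task_scores.append(score)
        failure_counts.append(len(failures))

    avg_score = sum(task_scores) / len(task_scores)
    avg_failures = sum(failure_counts) / len(failure_counts)

    # Reward task success and penalize MAST-detected failures, so the search
    # is pushed toward reliable architectures rather than lucky single runs.
    return {
        "fitness": avg_score - failure_weight * avg_failures,
        "avg_task_score": avg_score,
        "avg_failures_per_trace": avg_failures,
    }
```

In this kind of setup, OpenEvolve calls an evaluator like this on each mutated candidate and keeps the highest-fitness variants for the next generation.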
Over 46 iterations, the system autonomously evolved a fragile baseline (0.136 score, ~7 failures/trace) into a robust architecture (0.50 score, ~1 failure/trace), with 7x fewer failures. Crucially, the optimizer discovered sophisticated design patterns that usually take humans days to identify:
- A SimpleVerifier agent to decouple execution from checking.

We also expose the risks of automated design: without strict guardrails (like mandatory evidence gates), the evolutionary search creates "reward hacks," optimizing for score by simply deleting the agents responsible for reporting failures.
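As a sketch of what such a guardrail could look like, the evaluator can refuse to award any score to a trace that lacks verification evidence. The message schema and evidence keys below are assumptions for illustration, not the exact mechanism used in this work.

```python
# Hypothetical evidence gate inside the evaluator. The trace format and the
# evidence keys are assumptions; the point is that a candidate which deletes
# its verifier produces traces with no verification evidence and scores zero,
# closing off the "delete the reporter" reward hack.
REQUIRED_EVIDENCE = {"verifier_report", "test_results"}

def gated_fitness(trace: list[dict], raw_fitness: float) -> float:
    """Return the raw fitness only if the trace contains verification evidence."""
    has_evidence = any(msg.get("type") in REQUIRED_EVIDENCE for msg in trace)
    return raw_fitness if has_evidence else 0.0
```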
Multi-agent systems are easy to prototype but painful to improve. When a run goes wrong, you rarely get a single clear signal. Often, you see a spaghetti trace where agents repeat steps, drift roles, drop context, ignore each other’s messages, “verify” hand-wavily, or declare success early. The result is that every design decision becomes a guessing game: How many agents? How should the task be broken down? How should agents communicate with each other?
In a previous blog, we introduced MAST [NeurIPS'25 Spotlight], a failure taxonomy and LLM annotator that turns those messy agent traces into structured, actionable failure signals (instead of vibes-based log reading). But even with good diagnostics, the development loop is still mostly manual: a human reads the failure reports, hypothesizes a fix, edits the prompts or agent topology, and reruns the workload.
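For intuition, the structured signal for a single trace might look something like the example below. The schema and agent names are illustrative assumptions; the failure-mode labels follow MAST's taxonomy.

```python
# Illustrative shape of one annotated trace (the schema is an assumption;
# the failure-mode labels follow MAST's taxonomy of specification,
# inter-agent misalignment, and verification failures).
mast_annotation = {
    "trace_id": "run_017",
    "failures": [
        {"mode": "step_repetition",               "agent": "Engineer",       "step": 12},
        {"mode": "information_withholding",       "agent": "Architect",      "step": 18},
        {"mode": "no_or_incomplete_verification", "agent": "QA",             "step": 31},
        {"mode": "premature_termination",         "agent": "ProductManager", "step": 34},
    ],
    "failure_count": 4,
}
```

Per-trace, per-mode counts like these are what make the signal actionable for an automated optimizer rather than just a human reader.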
So we asked: can we automate the iteration loop itself?