🗓️ Posted: December 4th, 2025
Jai Menon, Rohan Kulkarni, Sesh Nalla, and the ADRS team
> 💡 This post is part of the AI-Driven Research for Systems (ADRS) blog series, where we explore how AI can be applied to systems research. This week we feature exciting work from Datadog!
In this blog post, we examine the problem of generating production-ready, optimized GPU code from an evolutionary search perspective. Specifically, we share results from BitsEvolve, an ADRS framework built at Datadog. BitsEvolve targets a range of modalities, from optimizing hotspots in CPU-bound code to policy/configuration tuning for applications such as inference serving frameworks (e.g., vLLM), load balancers, garbage collectors, and more. Through profile guidance and robust evaluation mechanisms, we show how BitsEvolve-generated code can outperform compiled models, achieving speedups of up to 1.6x at reasonable search cost.
AI workloads are devouring compute. Across both datacenters and the edge, current trends point to a rising share for GPUs and accelerators. This momentum has also driven rapid maturation, and increasing complexity, in the GPU software stack: we have evolved from programming GPUs in raw CUDA to a landscape of DSLs and compiler-driven code generation, all with varying levels of efficacy. While well-known examples exist of hand-optimized kernels that approach peak speed-of-light (SOL) throughput, such efforts are restricted to a handful of core primitives. As with high-performance CPU optimization, the specialized skill set required is rare, and applying it to every niche problem is often not ROI-positive.
Furthermore, given the pace of innovation, the set of possible optimization targets is ever-growing. We must constantly adapt to new GPU architectures and compute capability levels, cost-efficient SKUs, evolving numeric formats (e.g., quantized types, microscaling formats), and shifting model configurations.
So, in the spirit of ADRS, we ask the question: Can we use LLM-based coding agents as our “GPU kernel engineers” to continuously optimize AI/ML workloads?
To explore answers to that question, we built a GPU code optimization and kernel generation flow in BitsEvolve, an agentic optimization system. Automated kernel generation has seen significant interest recently; BitsEvolve builds on this line of work but takes a more holistic, production-first approach.

BitsEvolve is an ADRS framework that takes a base ML model (e.g., PyTorch model code) as input, generates an evaluation harness with the model code built in, and executes an LLM-guided evolutionary search as described in our previous Datadog blog post. The result is an optimized model that functions as a drop-in replacement for the original base model.
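To make the "drop-in replacement" contract concrete, here is a minimal sketch of what such a generated harness might look like. All names and thresholds here (`evaluate`, the tolerances, the timing loop) are illustrative assumptions, not BitsEvolve's actual generated code:

```python
# Illustrative sketch of an evaluation harness (hypothetical names; not
# BitsEvolve's actual generated code). The harness checks that a candidate
# model is numerically faithful to the baseline, then times both on GPU.
import torch

def evaluate(baseline_model, candidate_model, example_inputs,
             rtol=1e-3, atol=1e-3, warmup=10, iters=100):
    baseline_model.eval()
    candidate_model.eval()
    with torch.no_grad():
        # Correctness gate: the candidate must be a drop-in replacement.
        ref = baseline_model(*example_inputs)
        out = candidate_model(*example_inputs)
        torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)

        def time_model(model):
            for _ in range(warmup):  # warm up kernels and caches
                model(*example_inputs)
            torch.cuda.synchronize()
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(iters):
                model(*example_inputs)
            end.record()
            torch.cuda.synchronize()
            return start.elapsed_time(end) / iters  # ms per iteration

        # Speedup > 1.0 means the candidate is faster than the baseline;
        # this becomes the fitness signal for the evolutionary search.
        return time_model(baseline_model) / time_model(candidate_model)
```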
We build BitsEvolve on top of ShinkaEvolve, adding customizations that we are currently upstreaming (including support for languages like Rust and LLM query streaming). In the core evolutionary loop, we use frontier models: specifically GPT-5, GPT-5.1 (at varying reasoning efforts), and Gemini 2.5 Pro (with dynamic thinking).
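In spirit, each generation of the loop asks a frontier model to mutate a parent program, scores the result with a harness like the one above, and keeps the fittest variants. The following is a minimal sketch under our own assumptions (`propose_variant` and `evaluate` are hypothetical stand-ins; this is not ShinkaEvolve's actual API):

```python
# Minimal sketch of an LLM-guided evolutionary loop (hypothetical helpers;
# not ShinkaEvolve's actual API). Each generation mutates a parent program
# via an LLM, scores it, and grows the population with surviving variants.
import random

def evolve(seed_program, propose_variant, evaluate, generations=50):
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        # Tournament selection: pick the fittest of a small random sample.
        parent, _ = max(random.sample(population, min(3, len(population))),
                        key=lambda p: p[1])
        child = propose_variant(parent)  # LLM-generated mutation
        try:
            score = evaluate(child)      # correctness gate + speedup
        except Exception:
            continue                     # reject broken or incorrect variants
        population.append((child, score))
    return max(population, key=lambda p: p[1])[0]
```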
In comparison to the related work mentioned previously, BitsEvolve aims to take a more holistic, layered approach to GPU code optimization. More concretely: