🗓️ Posted: November 20, 2025

Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao, and the ADRS team

<aside> 💡

This post is part of the AI-Driven Research for Systems (ADRS) blog series, where we explore how AI can be applied to systems research. This post was contributed by a team of our colleagues at UC Berkeley’s SLICE Lab!

In this blog post, we go down the stack and explore how AI is being used to speed up AI—at the kernel level. Specifically, we highlight Autocomp, the first LLM-driven code optimizer for low-resource tensor accelerators. Autocomp helps hardware designers extract the full performance of tensor accelerators, outperforming human expert kernel writers by up to 17x on AWS Trainium while being highly portable and easy to use. Read below and see the Autocomp 📝 paper and 👩‍💻 GitHub repo for full technical details!


The Problem: Accelerators are Hard to Program

NVIDIA (briefly) became a $5 trillion company by accelerating AI. Tensor accelerator offerings from Amazon, Apple, Cerebras, Google, Groq, Meta, Qualcomm, and many other companies promise to do even better — but why aren’t these accelerators dominating the market?

One key reason is software: in practice, these accelerators don’t dominate because their software stacks are immature. Each accelerator requires custom kernels, compilers, and runtime code tuned to its unique programming model, and writing this software is slow and error-prone. GPUs succeed largely because of their deep, battle-tested software ecosystem, not just their hardware. This raises the question: can ADRS help close the software gap? To answer it, let’s look in more depth at what a tensor accelerator is and why writing software for one is so challenging.

What is a Tensor Accelerator?

Tensor accelerators are specialized hardware architectures optimized to run AI models. Because AI models rely on regular, tensor-based computations with low precision requirements, tensor accelerators can devote a larger fraction of chip area (compared to CPUs/GPUs) to specialized structures like systolic arrays. This can yield orders-of-magnitude improvements in performance and energy efficiency for AI workloads such as LLMs. However, despite being simpler than CPUs or GPUs, accelerators still vary widely in size, dataflow, and programming model, ranging from tiny devices like the Raspberry Pi AI HAT to wafer-scale systems like Cerebras’s CS-3. Furthermore, simpler accelerator hardware often means that more of the performance optimization burden falls on software, which makes that software more complex. Fig. 1, based on Gemmini’s architecture, shows how a representative tensor accelerator’s hardware might be organized.

Fig. 1: Architecture and dataflow of a tensor accelerator system.

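To make Fig. 1 concrete, here is a minimal Python/NumPy sketch of the kind of instruction sequence a Gemmini-like accelerator executes for a single output tile: data is explicitly staged into a small scratchpad, pushed through a fixed-size systolic array, and written back out, all under software control. The `mvin`/`matmul_tile`/`mvout` helpers, the 16×16 tile size, and the dict-based scratchpad are illustrative stand-ins, not Gemmini’s actual API.

```python
import numpy as np

TILE = 16  # systolic array dimension (illustrative; real accelerators vary)

scratchpad = {}  # toy stand-in for the on-chip scratchpad / accumulator

def mvin(name, dram, row, col):
    """Explicitly move one TILE x TILE block from 'DRAM' into the scratchpad."""
    scratchpad[name] = dram[row:row + TILE, col:col + TILE].copy()

def matmul_tile(a, b, acc):
    """One fixed-size matmul 'instruction': the systolic array multiplies two
    scratchpad tiles and accumulates the result on chip."""
    scratchpad[acc] = scratchpad.get(acc, np.zeros((TILE, TILE))) + \
        scratchpad[a] @ scratchpad[b]

def mvout(acc, dram, row, col):
    """Move an accumulator tile back out to 'DRAM'."""
    dram[row:row + TILE, col:col + TILE] = scratchpad.pop(acc)

# Computing one 16x16 output block of C = A @ B: software, not hardware,
# decides what to move on chip, when, and in what order.
A, B = np.random.rand(64, 64), np.random.rand(64, 64)
C = np.zeros((64, 64))
for k in range(0, 64, TILE):
    mvin("a", A, 0, k)
    mvin("b", B, k, 0)
    matmul_tile("a", "b", "acc")
mvout("acc", C, 0, 0)
assert np.allclose(C[:TILE, :TILE], (A @ B)[:TILE, :TILE])
```

Even this toy version shows why software carries so much of the optimization burden: the hardware provides only the fixed-size multiply and the explicit data moves, while tiling, ordering, and data reuse are left entirely to the kernel writer.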

How is Accelerator Code Written Today?

So, how are accelerators like Gemmini programmed? If you’ve built a machine learning model, you’ve likely written hardware-agnostic code in libraries like PyTorch or JAX. On NVIDIA GPUs, this code is compiled to CUDA, PTX, and SASS—but other accelerators are less straightforward. Compilers like XLA, TVM, and Triton support a few hardware backends, but none are universal. Building a new accelerator almost always requires developing a custom software stack, which is challenging.
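
As a toy illustration of how hardware-agnostic this framework-level code is, consider the sketch below (assuming PyTorch is installed; the comments describe the usual GPU path, not any particular accelerator backend):

```python
import torch

# Hardware-agnostic model code: nothing here mentions the target device's ISA.
a = torch.randn(512, 512)
b = torch.randn(512, 512)

c = a @ b  # on CPU this typically dispatches to a BLAS library; moved to an
           # NVIDIA GPU, it is lowered to CUDA kernels (and ultimately PTX/SASS)

# For a new tensor accelerator, someone has to build the missing layer that
# lowers this matmul to the device's own instructions: custom kernels, a
# compiler backend, and runtime support.
```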

Adapting compilers to new hardware platforms has always been difficult, due to factors such as vendor- and implementation-specific ISAs (instruction set architectures). As a result, new accelerators often need hand-optimized kernels for key operations like matrix multiplication and convolution. And even once a compiler exists, generating performant code requires good scheduling, i.e., deciding which optimizations to apply and in what order, a process refined over years for CPUs and GPUs but still lacking for new accelerators.
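
To give a feel for what scheduling means here, the sketch below (plain NumPy, with hypothetical schedule knobs) runs the same tiled matrix multiplication under different tile sizes and loop orders; on real accelerators, choices like these, plus data movement and buffering decisions, determine whether a kernel approaches peak throughput:

```python
import itertools
import time
import numpy as np

def tiled_matmul(A, B, tile, loop_order):
    """Toy tiled matmul whose speed depends on the chosen 'schedule': the tile
    size and the loop order (stand-ins for real scheduling decisions such as
    tiling, loop reordering, and double buffering)."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    ranges = {d: range(0, N, tile) for d in "ijk"}
    for vals in itertools.product(*(ranges[d] for d in loop_order)):
        idx = dict(zip(loop_order, vals))
        i, j, k = idx["i"], idx["j"], idx["k"]
        C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256).astype(np.float32)
B = np.random.rand(256, 256).astype(np.float32)
ref = A @ B

# A "schedule" = which transformations to apply, in what order, with what
# parameters. Different schedules compute the same result at different speeds.
for tile, order in itertools.product([16, 128], ["ijk", "kij"]):
    start = time.perf_counter()
    C = tiled_matmul(A, B, tile, order)
    assert np.allclose(C, ref, rtol=1e-3, atol=1e-3)
    print(f"tile={tile:3d} order={order}: {time.perf_counter() - start:.4f}s")
```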

As researchers at UC Berkeley’s SLICE Lab, we have been exploring the use of LLMs to write low-level software for low-resource tensor accelerators for some time. Our prior work shows that LLMs perform poorly in zero-shot settings, which is unsurprising given each accelerator’s unique interface and the scarcity of training data for these specific platforms.

Fig. 2: Tensor accelerator developers hard at work! (as imagined by GPT-4o)


Tensor Accelerator Programming

So, what do these difficult-to-write programs actually look like? To begin with, programming tensor accelerators differs greatly from programming general-purpose CPUs. Tensor accelerators focus on efficiently executing fixed-size (e.g., 16×16) matrix multiplications. Rather than trying to reduce the number or type of these instructions, software optimization emphasizes: