Posted: October 30, 2025
Zhifei Li, Tian Xia, Shu Liu, Audrey Cheng, Melissa Pan, Ion Stoica, and ADRS Team
<aside> 💡
This post is the second in our AI-Driven Research Systems (ADRS) series, where we apply AI to optimize complex system problems. Here, we tackle spot instance scheduling, a classic cost-versus-reliability problem in the cloud. Specifically, we demonstrate how OpenEvolve discovers novel algorithms that surpass the algorithm from an NSDI '24 Best Paper, achieving up to 16% and 48% cost savings in single-region and multi-region setups, respectively, on our final evaluation traces.
GitHub - UCB-ADRS/ADRS: AI-Driven Research For Systems (ADRS)
AI compute costs have become a significant bottleneck for innovation (as a16z recently pointed out). Compounding this challenge, many applications have strict deadlines, and missing them means lost revenue and degraded performance. Examples include recommendation systems processing user data and trading applications analyzing market information.
Cloud providers offer spot instances at 60-90% discounts to reduce costs, but these can be preempted at any moment, making it difficult to guarantee completion for deadline-sensitive workloads. This creates a critical challenge: how to minimize costs while ensuring jobs finish on time?
The NSDI '24 Best Paper Can't Be Late tackled this problem by introducing a state-of-the-art spot instance scheduling algorithm for a single region. We use OpenEvolve to improve upon it and further extend the approach to multi-region scheduling.
Consider a typical scenario: running a large-scale data analysis that needs 40 hours of computation and must complete within 60 hours, before a conference submission deadline. This 20-hour buffer gives the scheduler room to choose between cheap-but-risky spot instances and expensive-but-guaranteed on-demand instances.
The scheduler continuously monitors the system and makes decisions whenever the state changes - when a spot instance becomes available or unavailable, or when the current instance gets preempted.
At each moment, the scheduler must pick one of three actions: (1) Wait - no instance running, no cost, no progress; (2) Spot - run on a cheap instance that may be preempted at any time; or (3) On-Demand - run on an expensive but guaranteed instance.
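This action space can be sketched in a few lines. The following is a toy illustration, not the paper's algorithm: the `Action` enum, the `decide` signature, and the 2-hour `safety_buffer` are all hypothetical choices made for this example.

```python
from enum import Enum

class Action(Enum):
    WAIT = "wait"            # no instance running: no cost, but no progress
    SPOT = "spot"            # cheap (60-90% off), but may be preempted
    ON_DEMAND = "on_demand"  # expensive, but guaranteed to keep running

def decide(spot_available: bool, remaining_work: float,
           time_left: float, safety_buffer: float = 2.0) -> Action:
    """Toy policy over the three actions: prefer spot when available;
    otherwise wait while slack remains, and switch to on-demand once
    the slack drops below a (hypothetical) safety buffer, in hours."""
    slack = time_left - remaining_work
    if spot_available:
        return Action.SPOT
    return Action.ON_DEMAND if slack < safety_buffer else Action.WAIT
```

A real scheduler would invoke such a function at every state change (spot availability flips, a preemption occurs) rather than on a fixed clock.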
The simplest policy, Always On-Demand, runs the job entirely on on-demand instances and ignores all spot opportunities. It guarantees completion but at maximum cost (~$4k for a two-day run), since it never uses cheaper spot capacity.
A better policy is the Greedy Policy (a baseline from the original NSDI paper), which uses spot instances whenever available, until it can no longer risk another cold-start delay and must fall back to on-demand instances. But once the fallback happens, it can never return to spot instances, even if these cheaper instances become available again later.
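The greedy baseline's one-way fallback can be sketched as follows. This is a minimal reconstruction from the description above, not the paper's code; the function name, the returned `(action, fell_back)` pair, and the `cold_start_delay` parameterization are assumptions of this sketch.

```python
def greedy_action(spot_available: bool, fell_back: bool,
                  remaining_work: float, time_left: float,
                  cold_start_delay: float) -> tuple[str, bool]:
    """Sketch of the Greedy baseline: use spot whenever available,
    until the remaining slack can no longer absorb another cold
    start; then fall back to on-demand permanently, even if spot
    capacity returns later. All times are in hours (assumed)."""
    if fell_back:
        return "on_demand", True       # one-way switch: never back to spot
    slack = time_left - remaining_work
    if slack <= cold_start_delay:      # another preemption would miss the deadline
        return "on_demand", True
    if spot_available:
        return "spot", False
    return "wait", False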
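The greedy baseline's one-way fallback can be sketched as follows. This is a minimal reconstruction from the description above, not the paper's code; the function name, the returned `(action, fell_back)` pair, and the `cold_start_delay` parameterization are assumptions of this sketch.

```python
def greedy_action(spot_available: bool, fell_back: bool,
                  remaining_work: float, time_left: float,
                  cold_start_delay: float) -> tuple[str, bool]:
    """Sketch of the Greedy baseline: use spot whenever available,
    until the remaining slack can no longer absorb another cold
    start; then fall back to on-demand permanently, even if spot
    capacity returns later. All times are in hours (assumed)."""
    if fell_back:
        return "on_demand", True       # one-way switch: never back to spot
    slack = time_left - remaining_work
    if slack <= cold_start_delay:      # another preemption would miss the deadline
        return "on_demand", True
    if spot_available:
        return "spot", False
    return "wait", False
```

The caller threads the returned flag back into the next call, which is what makes the fallback permanent, and it is exactly this inflexibility that leaves savings on the table.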
These limitations motivated the **Can't Be Late** paper to design a more adaptive scheduling algorithm.
Figure 1. Uniform Progress Algorithm. Upper: instance usage over time; Lower: progress curve showing target line (pink) vs actual progress (green). The algorithm tracks a target progress line and switches to on-demand whenever actual progress falls behind, even if cheaper spots would become available shortly.
The Uniform Progress algorithm proposed in the NSDI '24 paper follows a simple idea: use on-demand instances early and stay close to the target progress line (illustrated in Figure 1), which represents the ideal steady progress rate, ensuring sufficient slack time to fall back to on-demand instances.
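The core rule can be sketched in a few lines. This is an illustrative reduction of the idea described above, not the paper's implementation; the function name and the bare spot/wait tie-breaking are assumptions of this sketch.

```python
def uniform_progress_action(spot_available: bool, elapsed: float,
                            progress_done: float, total_work: float,
                            deadline: float) -> str:
    """Sketch of the Uniform Progress idea: the target line grows at
    the ideal steady rate total_work / deadline; whenever actual
    progress falls behind it, run on-demand to catch up, even if
    cheaper spot capacity might return shortly."""
    target = total_work * (elapsed / deadline)   # the pink line in Figure 1
    if progress_done < target:
        return "on_demand"                       # behind schedule: use guaranteed capacity
    return "spot" if spot_available else "wait"  # on/ahead of schedule: stay cheap
```

For the running example (40 hours of work, 60-hour deadline), the target line advances at 2/3 of an hour of progress per wall-clock hour, so a job halfway to the deadline must have at least 20 hours of work done to keep using spot.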