Posted: October 30, 2025
Zhifei Li, Tian Xia, Shu Liu, Audrey Cheng, Melissa Pan, Ion Stoica, and ADRS Team
<aside> 💡
This post is the second in our AI-Driven Research Systems (ADRS) series, where we apply AI to optimize complex system problems. Here, we tackle spot instance scheduling, a classic cost-versus-reliability problem in the cloud. Specifically, we demonstrate how OpenEvolve discovers novel algorithms that surpass the algorithm from an NSDI '24 Best Paper, achieving up to 16% and 48% cost savings in single-region and multi-region setups, respectively, on our final evaluation traces.
GitHub - UCB-ADRS/ADRS: AI-Driven Research For Systems (ADRS)
AI compute costs have become a significant bottleneck for innovation (as a16z recently pointed out). Compounding this challenge, many applications have strict deadlines, and missing them means lost revenue and degraded performance. Examples include recommendation systems processing user data and trading applications analyzing market information.
Cloud providers offer spot instances at 60-90% discounts to reduce costs, but these can be preempted at any moment, making it difficult to guarantee completion for deadline-sensitive workloads. This creates a critical challenge: how to minimize costs while ensuring jobs finish on time?
The NSDI '24 Best Paper Can't Be Late tackled this problem by introducing a state-of-the-art spot instance scheduling algorithm for a single region. We use OpenEvolve to improve upon it and further extend the approach to multi-region scheduling.
Consider a typical scenario: running a large-scale data analysis that needs 40 hours of computation and must complete within 60 hours, before a conference submission deadline. This 20-hour buffer gives the scheduler room to choose between cheap-but-risky spot instances and expensive-but-guaranteed on-demand instances.
The scheduler continuously monitors the system and makes decisions whenever the state changes - when a spot instance becomes available or unavailable, or when the current instance gets preempted.
At each moment, the scheduler must pick one of three actions: (1) Wait - no instance running, no cost, no progress; (2) Spot - run on a cheap instance that may be preempted at any time; or (3) On-Demand - run on an expensive but guaranteed instance.
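This action space can be sketched in a few lines. The following is a toy illustration, not the paper's algorithm: the `Action` enum, the `decide` signature, and the 2-hour `safety_buffer` are all hypothetical choices made for this example.

```python
from enum import Enum

class Action(Enum):
    WAIT = "wait"            # no instance running: no cost, but no progress
    SPOT = "spot"            # cheap (60-90% off), but may be preempted
    ON_DEMAND = "on_demand"  # expensive, but guaranteed to keep running

def decide(spot_available: bool, remaining_work: float,
           time_left: float, safety_buffer: float = 2.0) -> Action:
    """Toy policy over the three actions: prefer spot when available;
    otherwise wait while slack remains, and switch to on-demand once
    the slack drops below a (hypothetical) safety buffer, in hours."""
    slack = time_left - remaining_work
    if spot_available:
        return Action.SPOT
    return Action.ON_DEMAND if slack < safety_buffer else Action.WAIT
```

A real scheduler would invoke such a function at every state change (spot availability flips, a preemption occurs) rather than on a fixed clock.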
The simplest policy, Always On-Demand, runs the job entirely on on-demand instances and ignores all spot opportunities. It guarantees completion but at maximum cost (~$4k for a two-day run), since it never uses cheaper spot capacity.
A better policy is the Greedy Policy (a baseline from the original NSDI paper), which uses spot instances whenever available, until it can no longer risk another cold-start delay and must fall back to on-demand instances. But once the fallback happens, it can never return to spot instances, even if these cheaper instances become available again later.
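The greedy baseline's one-way fallback can be sketched as follows. This is a minimal reconstruction from the description above, not the paper's code; the function name, the returned `(action, fell_back)` pair, and the `cold_start_delay` parameterization are assumptions of this sketch.

```python
def greedy_action(spot_available: bool, fell_back: bool,
                  remaining_work: float, time_left: float,
                  cold_start_delay: float) -> tuple[str, bool]:
    """Sketch of the Greedy baseline: use spot whenever available,
    until the remaining slack can no longer absorb another cold
    start; then fall back to on-demand permanently, even if spot
    capacity returns later. All times are in hours (assumed)."""
    if fell_back:
        return "on_demand", True       # one-way switch: never back to spot
    slack = time_left - remaining_work
    if slack <= cold_start_delay:      # another preemption would miss the deadline
        return "on_demand", True
    if spot_available:
        return "spot", False
    return "wait", False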
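The greedy baseline's one-way fallback can be sketched as follows. This is a minimal reconstruction from the description above, not the paper's code; the function name, the returned `(action, fell_back)` pair, and the `cold_start_delay` parameterization are assumptions of this sketch.

```python
def greedy_action(spot_available: bool, fell_back: bool,
                  remaining_work: float, time_left: float,
                  cold_start_delay: float) -> tuple[str, bool]:
    """Sketch of the Greedy baseline: use spot whenever available,
    until the remaining slack can no longer absorb another cold
    start; then fall back to on-demand permanently, even if spot
    capacity returns later. All times are in hours (assumed)."""
    if fell_back:
        return "on_demand", True       # one-way switch: never back to spot
    slack = time_left - remaining_work
    if slack <= cold_start_delay:      # another preemption would miss the deadline
        return "on_demand", True
    if spot_available:
        return "spot", False
    return "wait", False
```

The caller threads the returned flag back into the next call, which is what makes the fallback permanent, and it is exactly this inflexibility that leaves savings on the table.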
These limitations motivated the **Can't Be Late** paper to design a more adaptive scheduling algorithm.
Figure 1. Uniform Progress Algorithm. Upper: instance usage over time; Lower: progress curve showing target line (pink) vs actual progress (green). The algorithm tracks a target progress line and switches to on-demand whenever actual progress falls behind, even if cheaper spots would become available shortly.
The Uniform Progress algorithm proposed in the NSDI '24 paper follows a simple idea: use on-demand instances early and stay close to the target progress line (illustrated in Figure 1), which represents the ideal steady progress rate, ensuring sufficient slack time to fall back to on-demand instances.
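The core rule can be sketched in a few lines. This is an illustrative reduction of the idea described above, not the paper's implementation; the function name and the bare spot/wait tie-breaking are assumptions of this sketch.

```python
def uniform_progress_action(spot_available: bool, elapsed: float,
                            progress_done: float, total_work: float,
                            deadline: float) -> str:
    """Sketch of the Uniform Progress idea: the target line grows at
    the ideal steady rate total_work / deadline; whenever actual
    progress falls behind it, run on-demand to catch up, even if
    cheaper spot capacity might return shortly."""
    target = total_work * (elapsed / deadline)   # the pink line in Figure 1
    if progress_done < target:
        return "on_demand"                       # behind schedule: use guaranteed capacity
    return "spot" if spot_available else "wait"  # on/ahead of schedule: stay cheap
```

For the running example (40 hours of work, 60-hour deadline), the target line advances at 2/3 of an hour of progress per wall-clock hour, so a job halfway to the deadline must have at least 20 hours of work done to keep using spot.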