Architecture and reliability

How Perpetual Compute is built for speed and reliability within the AWS Spot interruption window.

The 120-second migration window

When AWS decides to reclaim a Spot instance, it issues an interruption notice, typically two minutes (120 seconds) before the instance is reclaimed. The instance can read this notice through the instance metadata service (IMDSv2). All migration steps (freeze, sync to S3, provision a new instance, restore) must complete within that window so that your workload resumes on the new instance before the old one is terminated.

We optimize for this constraint: checkpointing uses CRIU and high-throughput S3 sync (AWS CRT), and orchestration is stateless (Lambda) so we can scale and react quickly. We do not run latency-heavy logic or complex external API calls during the migration pipeline.
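To make the constraint concrete, here is a minimal sketch of a deadline tracker that a migration pipeline could consult before starting each step. The class name `MigrationDeadline`, its methods, and the per-step estimates are hypothetical and only illustrate the idea; they are not the platform's actual code.

```python
import time


class MigrationDeadline:
    """Tracks how much of the interruption window is left (hypothetical helper)."""

    def __init__(self, window_seconds=120.0, clock=time.monotonic):
        # The clock is injectable so the logic can be exercised without waiting.
        self._clock = clock
        self._start = clock()
        self.window_seconds = window_seconds

    def remaining(self):
        """Seconds left in the window, never negative."""
        elapsed = self._clock() - self._start
        return max(0.0, self.window_seconds - elapsed)

    def require(self, phase, estimated_seconds):
        """Return the slack left after `phase`, or raise if it cannot fit."""
        left = self.remaining()
        if estimated_seconds > left:
            raise TimeoutError(
                f"{phase} needs {estimated_seconds:.0f}s but only {left:.0f}s remain"
            )
        return left - estimated_seconds
```

A pipeline built this way would call `require("sync", estimate)` before each step and fail fast, rather than start a step that cannot finish before termination.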

Design for resumability

Design your workload to write its own checkpoints or outputs to external storage (e.g. S3), so that progress is not lost even in edge cases where a migration cannot complete. See Workload input and output.
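As an illustration of a resumable workload, the sketch below saves its progress after every step and picks up from the last saved step on restart. It assumes a local JSON file as a stand-in for external storage; a real workload would upload the same state to S3 or another destination. The function names and the state layout are hypothetical.

```python
import json
import os
import tempfile


def save_progress(path, state):
    """Write state atomically so a crash mid-write never corrupts the last checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_progress(path):
    """Return the last saved state, or a fresh one if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(path, total_steps, stop_after=None):
    """Process steps from the last checkpoint; `stop_after` simulates an interruption."""
    state = load_progress(path)
    done_this_run = 0
    for step in range(state["next_step"], total_steps):
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["total"] += step  # stand-in for real work
        state["next_step"] = step + 1
        save_progress(path, state)
        done_this_run += 1
    return state
```

Running `run(path, 10, stop_after=4)` and then `run(path, 10)` gives the same final state as a single uninterrupted run, which is the property that protects progress in edge cases.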

Orchestrator

The “brain” of the platform runs in AWS Lambda. There is no long-running master server. The orchestrator handles:

Deployment requests (provisioning instances, launching your container)
Pricing and Spot price lookups
Health checks and migration coordination (triggering restore on a new instance when a checkpoint is ready)
Metering (reporting usage to AWS Marketplace)

State is stored in DynamoDB, so any Lambda invocation can pick up work. The orchestrator's compute therefore holds no state of its own, and a failed invocation loses nothing that another invocation cannot read back.
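The pattern can be sketched as a handler that reads and updates shared state with a guarded transition, so that two invocations reacting to the same event do not both trigger a restore. The `StateStore` class below is an in-memory stand-in, not DynamoDB; with DynamoDB the same guard would typically be a conditional write. The status names and function names are hypothetical.

```python
class StateStore:
    """In-memory stand-in for a table keyed by workload id (not a real DynamoDB client)."""

    def __init__(self):
        self._items = {}

    def put(self, workload_id, status):
        self._items[workload_id] = {"status": status, "version": 0}

    def get(self, workload_id):
        return dict(self._items[workload_id])

    def transition(self, workload_id, expected, new):
        """Move to `new` only if the current status is `expected` (compare-and-set)."""
        item = self._items[workload_id]
        if item["status"] != expected:
            return False
        item["status"] = new
        item["version"] += 1
        return True


def handle_checkpoint_ready(store, workload_id):
    """Stateless handler: everything it needs comes from the store."""
    if store.transition(workload_id, "CHECKPOINT_READY", "RESTORING"):
        return "restore triggered"
    return "skipped"
```

Because the handler keeps nothing between calls, a duplicate or retried invocation is simply a no-op.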

Worker engine

A worker agent runs on each EC2 node that hosts your workloads. It:

Monitors IMDSv2 for Spot interruption notices
Triggers the checkpoint pipeline (Docker + CRIU) when an interruption is detected
Syncs checkpoint data to S3 using high-throughput (e.g. 10 Gb/s+) transfer
Coordinates with the orchestrator so a new instance can be provisioned and the checkpoint restored
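The monitoring step in the list above can be sketched with the standard library alone. IMDSv2 requires a session token obtained with a PUT request, and the `spot/instance-action` metadata path returns HTTP 404 until an interruption is scheduled. The function names, timeouts, and polling approach are assumptions for illustration, not the worker agent's actual code.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_token(ttl_seconds=21600, timeout=1.0):
    """Request an IMDSv2 session token (only works on an EC2 instance)."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def fetch_instance_action(token, timeout=1.0):
    """Return the raw interruption notice, or None if none is scheduled."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def seconds_until_action(body, now):
    """Parse a notice and return (action, seconds remaining until it happens)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

An agent would call `fetch_instance_action` every few seconds and start the checkpoint pipeline as soon as it returns a notice, using `seconds_until_action` to learn how much of the window is left.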

Docker runs in experimental mode on the node to support CRIU checkpoint and restore. Your container is frozen, synced, and then restored on the new instance without you changing your image.
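For orientation, these are the experimental Docker CLI commands for checkpoint and restore, built as argument lists in a small sketch. The sketch covers only the Docker CLI; the helper names, container name, and directory are hypothetical, and whether the platform invokes Docker exactly this way is an assumption. Note that `docker start --checkpoint` needs a container created from the same image to already exist on the target host.

```python
import subprocess


def checkpoint_command(container, name, checkpoint_dir):
    """Freeze a running container and write its CRIU checkpoint to checkpoint_dir."""
    return [
        "docker", "checkpoint", "create",
        "--checkpoint-dir", checkpoint_dir,
        container, name,
    ]


def restore_command(container, name, checkpoint_dir):
    """Start an existing (stopped) container from a previously written checkpoint."""
    return [
        "docker", "start",
        "--checkpoint", name,
        "--checkpoint-dir", checkpoint_dir,
        container,
    ]


def run(cmd):
    """Execute a command and raise if it fails (requires Docker with CRIU installed)."""
    subprocess.run(cmd, check=True)
```

For example, `run(checkpoint_command("trainer", "cp1", "/mnt/checkpoints"))` would freeze a container named `trainer`; by default the container stops after the checkpoint is written.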

S3 and high-throughput sync

Checkpoint data (RAM and GPU state) can be large. We use the AWS Common Runtime (CRT) for S3 uploads so that we can achieve 10 Gb/s+ throughput and finish the sync within the interruption window. Standard S3 uploaders are not used for this path.
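A quick back-of-the-envelope calculation shows why throughput matters here. The sketch below converts a sustained link rate into a sync time; the 64 GiB checkpoint size and the 60-second share of the window are illustrative assumptions, not measured figures.

```python
def bytes_per_second(throughput_gbps):
    """Convert gigabits per second to bytes per second (1 Gb = 1e9 bits)."""
    return throughput_gbps * 1e9 / 8


def sync_seconds(checkpoint_gib, throughput_gbps):
    """Seconds needed to upload a checkpoint of the given size at a sustained rate."""
    return checkpoint_gib * 2**30 / bytes_per_second(throughput_gbps)


def max_checkpoint_bytes(throughput_gbps, budget_seconds):
    """Largest checkpoint that fits in the time budget at a sustained rate."""
    return bytes_per_second(throughput_gbps) * budget_seconds
```

At a sustained 10 Gb/s (1.25 GB/s), a 64 GiB checkpoint takes about 55 seconds, and a 60-second sync budget covers roughly 75 GB. At 1 Gb/s the same checkpoint would take over nine minutes, far beyond the window.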

Checkpoints and manifests are stored in our internal S3 buckets and are used for the migration pipeline only. For your own artifacts (model weights, results), have your workload write to your own S3 bucket or another destination; see Result destinations.

Reliability summary

The system is designed for the 120-second rule: freeze, sync, provision, restore within the AWS warning window. Orchestration is stateless (Lambda + DynamoDB), workers are single-purpose (monitor + CRIU + sync), and S3 transfer is optimized for speed. You get stability-as-a-service without managing any of this yourself.

Support and operations