Documentation

Everything you need to understand and use Perpetual Compute.

Overview

Perpetual Compute delivers stability-as-a-service for AWS Spot GPU instances. Run long AI training and inference workloads at significant savings versus On-Demand, without the typical risk of interruption.

When AWS sends a Spot termination notice (typically 120 seconds of warning), we checkpoint your workload with CRIU, migrate it to another instance, and resume within the warning window. You pay only a small stability fee on top of Spot pricing.

How it works

1

Spot interruption detected

A worker agent on each node polls the EC2 instance metadata service (IMDSv2) for Spot interruption notices. As soon as AWS signals that it will reclaim the instance, the migration pipeline starts.
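
As a rough illustration of this detection step, the sketch below requests an IMDSv2 session token and then reads the `spot/instance-action` metadata path, which returns 404 until AWS has scheduled an interruption. The function names, timeouts, and polling interval are assumptions made for the example; this is not Perpetual Compute's actual agent code.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"  # link-local metadata endpoint on EC2


def fetch_token(ttl_seconds=60, timeout=2.0):
    """Request a short-lived IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def parse_instance_action(body):
    """Parse a notice such as {"action": "terminate", "time": "...Z"}."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def check_interruption(timeout=2.0):
    """Return (action, time) if an interruption is scheduled, else None."""
    token = fetch_token(timeout=timeout)
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise


def wait_for_interruption(poll_seconds=2.0):
    """Block until a notice appears (poll interval is an assumed value)."""
    while True:
        notice = check_interruption()
        if notice is not None:
            return notice
        time.sleep(poll_seconds)
```

The network calls only work on an EC2 instance. `parse_instance_action` can be exercised anywhere with a sample notice body.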

2

CRIU freeze and S3 sync

Docker checkpoints your container using CRIU. The full RAM and GPU state is frozen and synced to S3 at 10 Gb/s or more.
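
One way to picture this step is Docker's experimental `docker checkpoint create` command followed by an upload of the resulting files. The sketch below assumes the Docker daemon has experimental features and CRIU enabled, that `boto3` is installed, and that each checkpoint gets its own directory. The bucket, the `checkpoints/...` key layout, and the function names are hypothetical. It uses a plain `upload_file` call rather than reproducing the service's high-throughput path, and it covers only the process state CRIU captures.

```python
import subprocess
from pathlib import Path


def checkpoint_command(container, name, checkpoint_dir):
    """Build the Docker CLI call that freezes a container with CRIU.

    Without --leave-running, Docker stops the container after the dump.
    """
    return [
        "docker", "checkpoint", "create",
        "--checkpoint-dir", checkpoint_dir,
        container, name,
    ]


def s3_key(workload_id, name, relative_path):
    """Hypothetical key layout for checkpoint files in S3."""
    return f"checkpoints/{workload_id}/{name}/{relative_path}"


def checkpoint_and_upload(container, name, checkpoint_dir, bucket, workload_id):
    """Checkpoint the container, then upload every file under checkpoint_dir."""
    import boto3  # imported lazily so the helpers above work without it

    subprocess.run(checkpoint_command(container, name, checkpoint_dir), check=True)
    s3 = boto3.client("s3")
    root = Path(checkpoint_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            s3.upload_file(str(path), bucket, s3_key(workload_id, name, rel))
```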

3

Provision and restore

A new Spot instance is provisioned in an optimal region. The checkpoint is restored there, and your workload continues exactly where it left off.
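
On the replacement instance, the restore side could look roughly like the following, using Docker's experimental `docker start --checkpoint` flags. This sketch assumes the checkpoint files have already been downloaded into `checkpoint_dir` and that a container with the same name has been created from the same image on the new host. The function names are illustrative, not part of any Perpetual Compute API.

```python
import subprocess


def restore_command(container, name, checkpoint_dir):
    """Build the Docker CLI call that resumes a container from a checkpoint."""
    return [
        "docker", "start",
        "--checkpoint", name,
        "--checkpoint-dir", checkpoint_dir,
        container,
    ]


def restore(container, name, checkpoint_dir):
    """Resume the container; raises CalledProcessError if Docker reports failure."""
    subprocess.run(restore_command(container, name, checkpoint_dir), check=True)
```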

Quick start

Submit your workload as a Docker image. Perpetual Compute handles the execution environment—no SSH or infrastructure management required.

  1. Subscribe through AWS Marketplace.
  2. Sign in to the app portal.
  3. Deploy your Docker image and start your workload.
  4. Migration runs automatically when Spot interruptions occur.

Architecture at a glance

Perpetual Compute is built for reliability and speed. Orchestration lives in AWS Lambda—stateless, scalable, and resilient. State is tracked in DynamoDB. Worker agents on each EC2 node monitor for interruptions and trigger checkpointing.

  • Orchestrator: Lambda functions for deployment, pricing, health checks, and migration coordination.
  • Worker engine: Runs on EC2 nodes, monitors IMDSv2, and executes CRIU checkpoint and restore.
  • S3 + CRT: Checkpoints are synced at 10 Gb/s or more using the AWS Common Runtime (CRT) for maximum throughput.
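
To get a feel for what the throughput figure means against the 120-second notice, here is a back-of-the-envelope estimate. It assumes a sustained 10 Gb/s and a fixed 20-second allowance for the freeze and other non-transfer work. That allowance is an illustrative guess, not a measured or published number.

```python
def upload_seconds(checkpoint_gib, throughput_gbps=10.0):
    """Seconds needed to move a checkpoint of the given size (GiB) to S3."""
    bits = checkpoint_gib * 1024**3 * 8
    return bits / (throughput_gbps * 1e9)


def fits_in_window(checkpoint_gib, throughput_gbps=10.0,
                   window_s=120.0, overhead_s=20.0):
    """True if upload time plus the assumed overhead fits in the notice window."""
    return upload_seconds(checkpoint_gib, throughput_gbps) + overhead_s <= window_s


print(round(upload_seconds(64), 1))  # about 55 seconds for a 64 GiB checkpoint
print(fits_in_window(64))            # True under these assumptions
print(fits_in_window(128))           # False: about 110 s of transfer plus 20 s overhead
```

Under these assumptions, a checkpoint of roughly 116 GiB is the largest that would upload before the instance is reclaimed.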

Next steps