Docker workloads

Prepare a Docker image that runs your workload and is ready to deploy on Perpetual Compute.

Why Docker

Perpetual Compute runs your workload inside a Docker container on AWS EC2. You do not get SSH access or a bare machine; you submit a Docker image, and we run it. This keeps the environment consistent, secure, and compatible with our migration technology (CRIU checkpoint/restore).

If you are new to Docker, we recommend the official Docker getting started guide. Below we focus on what you need to build and run an image that works well on our platform.

Docker basics for this platform

You need: (1) a Dockerfile that defines your app and dependencies, (2) a way to build and tag the image, and (3) a registry (e.g. Amazon ECR) where the image is stored. The portal asks for the image URI (e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest) and pulls it when starting your deployment.

Use a base image that matches the kind of workload you run (e.g. CUDA for GPU, Python, or a minimal OS). Keep the image as small and focused as possible: a smaller image pulls faster, so your deployment starts sooner.

Example Dockerfile

A minimal Dockerfile that runs a Python script and is ready to push to ECR (or another registry):

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Optional: default command; can be overridden in the portal
CMD ["python", "train.py"]

Replace train.py and requirements.txt with your own files. For GPU workloads, use a CUDA base image (e.g. nvidia/cuda:12.x-runtime) and install your framework (PyTorch, etc.) in the Dockerfile.

Build, tag, and push

Typical flow:

  1. Build. From the directory containing your Dockerfile:
       docker build -t my-job:latest .
  2. Tag for registry. Tag the image with the full URI of your registry (e.g. ECR):
       docker tag my-job:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest
  3. Authenticate and push. Log in to the registry (for ECR):
       aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
     Then push the image:
       docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest

Use a specific tag (e.g. a git SHA or version) instead of latest for reproducibility: latest is a mutable tag, so the same URI can point to different images over time. The portal accepts any tag that exists in the registry you specify.
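The flow above can also be scripted. The following is a minimal Python sketch, not an official tool: the function names (image_uri, registry_host, build_tag_push, git_sha) are hypothetical, and it assumes the docker, aws, and git command-line tools are installed and that your AWS credentials are already configured. It runs the same commands as the steps above, but tags the image with a short git SHA rather than latest.

```python
import subprocess


def registry_host(account_id, region):
    """Hostname of the ECR registry for an AWS account and region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def image_uri(account_id, region, repo, tag):
    """Full image URI, in the form you would paste into the portal."""
    return f"{registry_host(account_id, region)}/{repo}:{tag}"


def git_sha():
    """Short SHA of the current commit, used as a reproducible tag."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_tag_push(account_id, region, repo, tag, context="."):
    """Build, tag, authenticate, and push. Returns the pushed image URI."""
    local_name = f"{repo}:{tag}"
    uri = image_uri(account_id, region, repo, tag)

    # 1. Build from the directory containing the Dockerfile.
    subprocess.run(["docker", "build", "-t", local_name, context], check=True)

    # 2. Tag the local image with the full registry URI.
    subprocess.run(["docker", "tag", local_name, uri], check=True)

    # 3. Authenticate: pass the ECR password to docker login on stdin,
    #    which is what the shell pipe in the steps above does.
    password = subprocess.run(
        ["aws", "ecr", "get-login-password", "--region", region],
        check=True, capture_output=True, text=True,
    ).stdout
    subprocess.run(
        ["docker", "login", "--username", "AWS", "--password-stdin",
         registry_host(account_id, region)],
        input=password, text=True, check=True,
    )

    # 4. Push the tagged image.
    subprocess.run(["docker", "push", uri], check=True)
    return uri
```

For example, build_tag_push("123456789012", "us-east-1", "my-job", git_sha()) would push the image and return the URI to enter in the portal. Because check=True is set on every call, the script stops at the first failing command instead of pushing a stale image.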

Registry access

The EC2 instances that run your workload must be able to pull from the image registry. For private ECR, ensure the instance role (or our platform’s pull configuration) has the ecr:GetDownloadUrlForLayer and ecr:BatchGetImage permissions. For public registries, no extra auth is needed.
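As an illustration of the private ECR case, the sketch below generates an IAM policy document with the two permissions named above, scoped to a single repository. The helper name ecr_pull_policy and the statement IDs are hypothetical. The extra ecr:GetAuthorizationToken statement is an assumption based on how ECR authentication generally works, not something this page specifies; check with the platform whether its pull configuration already covers it.

```python
import json


def ecr_pull_policy(account_id, region, repo):
    """Build an IAM policy document that allows pulling one ECR repository."""
    repo_arn = f"arn:aws:ecr:{region}:{account_id}:repository/{repo}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The two permissions named in this section.
                "Sid": "PullImage",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": repo_arn,
            },
            {
                # Assumption: ECR clients normally also need an auth token,
                # and this action cannot be scoped to a single repository.
                "Sid": "GetAuthToken",
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
        ],
    }


# Print the policy as JSON, ready to review before attaching it to a role.
print(json.dumps(ecr_pull_policy("123456789012", "us-east-1", "my-job"), indent=2))
```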

Runtime expectations

Your container runs in a managed environment. We use Docker in experimental mode to support CRIU checkpoint and restore. A few best practices:

  • Write important state and outputs to external storage (e.g. S3) so that after a migration, nothing is lost.
  • Avoid relying on local disk for long-term data; local storage may not persist across migration.
  • Use idempotent or resumable logic where possible (e.g. checkpointing training steps) so that if we restore from a checkpoint, your job can continue cleanly.
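
The resumable pattern in the last bullet can be sketched as follows. This is a minimal illustration using only the Python standard library, not the platform's own mechanism: the function names and the JSON state layout are hypothetical, and the checkpoint path stands in for external storage. In a real job you would also copy the checkpoint to durable storage such as S3 after each save, since local disk may not persist across migration.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_state(path, state):
    """Write the checkpoint atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace swaps the finished file into place in a single step.
    os.replace(tmp_path, path)


def run(path, num_steps, stop_after=None):
    """Run or resume the job. stop_after simulates an interruption."""
    state = load_state(path)
    while state["step"] < num_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        save_state(path, state)  # record progress after every completed step
    return state
```

Calling run("ckpt.json", 5, stop_after=2) and then run("ckpt.json", 5) completes all five steps exactly once: the second call picks up at step 2 instead of starting over. Calling it a third time does nothing, which is the idempotent behavior you want if the job is restarted after it has already finished.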

You can also push custom metrics (e.g. loss, epoch) to the platform; see Workload metrics.

For more on what the platform produces and where to send results, see Workload input and output.

Next steps