Operate Helmr

Use these checks when operating a self-hosted environment.

| Area | Practice |
| --- | --- |
| Control rollout | Run migrations before starting a new control image. Use /readyz as the readiness probe. |
| Workers | Drain workers before host replacement or AMI rollout. Scale capacity from Auto Scaling settings, not by manually editing instances. |
| Database | Keep RDS backups and deletion protection enabled for production. Restore into a separate environment before destructive testing. |
| Redis/Valkey | Keep Redis available for run dispatch and schedule next-fire leases. Dispatcher repair repopulates schedule entries after restarts. |
| Secrets | Rotate GitHub and auth secrets from Secrets Manager. Keep secret values out of Terraform variables and logs. |
| Checkpoints | Keep the same checkpoint encryption key available to all workers that may restore a paused run. |
| Networking | Private workers need outbound access to GitHub, S3, ECR, AWS APIs, and the control URL. |
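
The control rollout practice can be sketched as a small gate that polls the readiness probe once migrations have run and the new control image is starting. This is a minimal sketch, not Helmr tooling: the function names, retry counts, and the assumption that /readyz answers HTTP 200 when the control plane is ready are all illustrative.

```python
import time
import urllib.error
import urllib.request


def probe_readyz(base_url, timeout=2.0):
    """Return True if GET <base_url>/readyz answers with HTTP 200.

    Assumption: a ready control plane returns 200. The source only states
    that /readyz is the readiness probe.
    """
    url = base_url.rstrip("/") + "/readyz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A rollout script might call `wait_until_ready(lambda: probe_readyz(control_url))` with its own control URL and stop the rollout if the result is `False`.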

Worker status is command-based:

helmr-worker status

The command exits non-zero unless the worker can authenticate to the control plane and is active.
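
Because the result is carried by the exit code, a health check can wrap the command directly. The following is a minimal sketch: the wrapper name, the timeout, and the choice to treat a missing binary or a hung command as unhealthy are assumptions, not documented Helmr behavior.

```python
import subprocess


def worker_is_healthy(run=subprocess.run, timeout=30):
    """Return True only when `helmr-worker status` exits with code 0."""
    try:
        result = run(["helmr-worker", "status"], timeout=timeout)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Assumption: a missing binary or a hung command counts as unhealthy.
        return False
    return result.returncode == 0
```

The `run` parameter exists only so the check can be exercised without a real worker; in normal use, call `worker_is_healthy()` with no arguments.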

Checkpoint restore verifies runtime compatibility before resuming a run. Worker backend, architecture, ABI, kernel digest, rootfs digest, runtime config digest, vCPU count, memory, and CNI profile must match the checkpoint metadata.
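
The compatibility rule amounts to a field-by-field equality check. The sketch below illustrates that idea only: the key names and the dictionary shape are hypothetical, since the source lists the properties but not how the checkpoint metadata is stored.

```python
# Hypothetical key names for the properties listed above.
COMPAT_FIELDS = (
    "backend",
    "architecture",
    "abi",
    "kernel_digest",
    "rootfs_digest",
    "runtime_config_digest",
    "vcpu_count",
    "memory",
    "cni_profile",
)

_MISSING = object()


def mismatched_fields(checkpoint_meta, worker_runtime):
    """Return the fields that differ, or are absent, between the two mappings."""
    mismatches = []
    for field in COMPAT_FIELDS:
        expected = checkpoint_meta.get(field, _MISSING)
        actual = worker_runtime.get(field, _MISSING)
        if expected is _MISSING or actual is _MISSING or expected != actual:
            mismatches.append(field)
    return mismatches


def can_restore(checkpoint_meta, worker_runtime):
    """A worker may resume the run only if every field matches."""
    return not mismatched_fields(checkpoint_meta, worker_runtime)
```

Returning the list of mismatched fields, rather than a bare boolean, makes it easier to see why a given worker cannot restore a paused run.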

Checkpoint objects are encrypted by the worker before upload to object storage.