Operate Helmr

Use these practices and checks when operating a self-hosted Helmr environment.

Area	Practice
Control rollout	Run migrations before starting a new control image. Use `/readyz` as the readiness probe.
Workers	Drain workers before host replacement or AMI rollout. Scale capacity from Auto Scaling settings, not by manually editing instances.
Database	Keep RDS backups and deletion protection enabled for production. Restore into a separate environment before destructive testing.
Redis/Valkey	Keep Redis or Valkey available for run dispatch and schedule next-fire leases. Dispatcher repair repopulates schedule entries after restarts.
Secrets	Rotate GitHub and auth secrets from Secrets Manager. Keep secret values out of Terraform variables and logs.
Checkpoints	Keep the same checkpoint encryption key available to all workers that may restore a paused run.
Networking	Private workers need outbound access to GitHub, S3, ECR, AWS APIs, and the control URL.
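
The control rollout practice in the table can be scripted as a gate that waits for `/readyz` after the new control image starts. The following is a minimal sketch, not Helmr tooling: the base URL, the retry counts, and the assumption that readiness is reported as HTTP 200 are all illustrative, and `probe_ready` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/readyz answers with HTTP 200.

    Assumption: the control service signals readiness with a 200 response.
    """
    url = base_url.rstrip("/") + "/readyz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx responses all count as not ready.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A rollout script would run migrations first, start the new control image, and then call something like `wait_until_ready(lambda: probe_ready("https://control.example.internal"))`, where the hostname is a placeholder.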

Check worker status from the command line:

helmr-worker status

The command exits non-zero unless the worker can authenticate to the control plane and is active.
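
Because the result is carried by the exit code, the command is easy to wrap in a health check. The sketch below relies only on the documented exit-code behavior. The function name `worker_is_healthy`, the 30-second timeout, and treating a missing binary as unhealthy are assumptions made for illustration.

```python
import subprocess
from typing import Sequence


def worker_is_healthy(
    command: Sequence[str] = ("helmr-worker", "status"),
    timeout: float = 30.0,
) -> bool:
    """Return True only if the status command exits with code 0."""
    try:
        completed = subprocess.run(
            list(command),
            capture_output=True,  # keep command output out of the caller's stdout
            text=True,
            timeout=timeout,
            check=False,  # inspect the return code instead of raising
        )
    except (OSError, subprocess.TimeoutExpired):
        # Assumption: a missing binary or a hung command counts as unhealthy.
        return False
    return completed.returncode == 0
```

The `command` parameter exists so the check can be exercised in tests without a real worker installed.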

Checkpoint restore verifies runtime compatibility before resuming a run. Worker backend, architecture, ABI, kernel digest, rootfs digest, runtime config digest, vCPU count, memory, and CNI profile must match the checkpoint metadata.
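
The compatibility rule amounts to an equality check over a fixed set of fields. The sketch below illustrates that idea only. The dictionary keys, the sample values, and the choice to treat a missing field as a mismatch are assumptions, not Helmr's actual metadata schema or restore logic.

```python
from typing import Any, List, Mapping

# Hypothetical key names for the fields listed above.
COMPAT_FIELDS = (
    "backend",
    "architecture",
    "abi",
    "kernel_digest",
    "rootfs_digest",
    "runtime_config_digest",
    "vcpu_count",
    "memory",
    "cni_profile",
)


def incompatible_fields(
    checkpoint: Mapping[str, Any], worker: Mapping[str, Any]
) -> List[str]:
    """Return the fields that are missing on either side or differ in value."""
    mismatched = []
    for field in COMPAT_FIELDS:
        if field not in checkpoint or field not in worker:
            # Assumption: a missing field is treated as incompatible.
            mismatched.append(field)
        elif checkpoint[field] != worker[field]:
            mismatched.append(field)
    return mismatched


def can_restore(checkpoint: Mapping[str, Any], worker: Mapping[str, Any]) -> bool:
    """A worker may restore the checkpoint only if every field matches."""
    return not incompatible_fields(checkpoint, worker)


# Illustrative values only.
checkpoint_meta = {
    "backend": "example-backend",
    "architecture": "x86_64",
    "abi": "1",
    "kernel_digest": "sha256:aaa",
    "rootfs_digest": "sha256:bbb",
    "runtime_config_digest": "sha256:ccc",
    "vcpu_count": 2,
    "memory": 4096,
    "cni_profile": "default",
}
worker_profile = dict(checkpoint_meta, vcpu_count=4)
```

With these sample values, `incompatible_fields(checkpoint_meta, worker_profile)` returns `["vcpu_count"]`, so `can_restore` is false until the run is placed on a worker with a matching vCPU count.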

Checkpoint objects are encrypted by the worker before upload to object storage.