Troubleshooting

| Symptom | First checks |
| --- | --- |
| `/healthz` fails | Control task is not running, load balancer target is unhealthy, or the public URL does not route to the service. |
| `/readyz` fails | Database URL or `HELMR_REDIS_URL` is wrong, RDS or Redis/Valkey is unavailable, or migrations have not run successfully. |
| GitHub login fails | Callback URL must be `<control_url>/auth/github/callback`; OAuth client secret must match the OAuth app. |
| Run stays queued | Dispatcher is not running, Redis/Valkey is unavailable, no active workers exist, desired capacity is zero, worker bootstrap failed, or worker cannot reach the control plane. |
| Worker does not activate | Check KVM, Firecracker, jailer, CNI, BuildKit, guest artifacts, bootstrap token, and outbound network access. |
| External repository access fails | Check the task secret, token scope, payload values, and worker egress. |
| Image build fails | Check BuildKit service status and worker egress to registries and AWS APIs. |
| Parked wait resume fails | Check worker availability and checkpoint runtime compatibility. |
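
The `/healthz`, `/readyz`, and GitHub login checks above can be run from any machine that can reach the public URL. The following is a minimal sketch, not part of the product: the helper names and the `CONTROL_URL` variable are hypothetical, and it assumes `curl` is installed and that the endpoints answer with a 2xx status when healthy, which this page does not state.

```shell
#!/bin/sh
# Hypothetical helpers; CONTROL_URL is an assumed variable name.

# Strip one trailing slash so joined paths do not contain "//".
control_base() {
  printf '%s\n' "${1%/}"
}

# Print the GitHub OAuth callback URL expected for a given control URL.
github_callback_url() {
  printf '%s/auth/github/callback\n' "$(control_base "$1")"
}

# Probe one endpoint and print its HTTP status code.
# curl prints "000" when no HTTP response was received at all.
probe() {
  curl -sS -o /dev/null -m 10 -w '%{http_code}\n' "$(control_base "$1")$2"
}

# Usage:
#   probe "$CONTROL_URL" /healthz
#   probe "$CONTROL_URL" /readyz
#   github_callback_url "$CONTROL_URL"
```

A `000` result usually means the request never reached the service (DNS, routing, or load balancer), while an HTTP error status means something answered but reported a problem. Compare the output of `github_callback_url` character for character with the callback URL configured in the GitHub OAuth app.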

For control or dispatcher task failures, inspect ECS service events and CloudWatch logs for the affected ECS service. For worker failures, use SSM to inspect systemd journals on the worker instance.
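
These inspection steps might look like the following with AWS CLI v2. This is a sketch under assumptions: the helper names are hypothetical, and the cluster, service, log group, instance ID, and systemd unit names are placeholders because this page does not name them.

```shell
#!/bin/sh
# All resource names passed to these helpers are placeholders (assumptions).

# Show the first ten service events in the response
# (deployment and task placement messages).
ecs_events() {
  aws ecs describe-services \
    --cluster "$1" --services "$2" \
    --query 'services[0].events[:10].[createdAt,message]' \
    --output text
}

# Print the last 30 minutes of a CloudWatch log group (requires AWS CLI v2).
recent_logs() {
  aws logs tail "$1" --since 30m
}

# Open an interactive SSM session on a worker instance
# (requires the Session Manager plugin on the local machine).
worker_shell() {
  aws ssm start-session --target "$1"
}

# Usage:
#   ecs_events <cluster-name> <service-name>
#   recent_logs <log-group-name>
#   worker_shell <instance-id>
# Inside the session, read a unit's journal, for example:
#   sudo journalctl -u <unit-name> --since "1 hour ago" --no-pager
```

Service events are the quickest way to see why tasks fail to start or keep getting replaced; the log group shows what the task itself printed before it stopped.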

Do not debug by copying secrets into Terraform files. Read the secret ARN from `tofu output -json secret_arns`, then update the value in Secrets Manager.
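
A sketch of that workflow, assuming the AWS CLI is configured and the new value has been written to a local file outside the repository. The helper names and the file path are hypothetical. The shape of the `secret_arns` output is not specified on this page, so the first helper only prints it for inspection.

```shell
#!/bin/sh
# Hypothetical helper names; the file path passed in is an assumption.

# Print the secret_arns output as JSON so the ARN can be copied from it.
list_secret_arns() {
  tofu output -json secret_arns
}

# Store a new value for one secret. The value is read from a file (file://)
# so it does not appear in shell history or in any Terraform file.
update_secret() {
  aws secretsmanager put-secret-value \
    --secret-id "$1" \
    --secret-string "file://$2"
}

# Usage:
#   list_secret_arns
#   update_secret <secret-arn> /path/outside/repo/new-value.txt
```

If the consuming service compares the value byte for byte, write the file without a trailing newline, and delete the file once the update succeeds.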