# Recovery Runbook — Ops Dashboard

This runbook covers the four most common failure scenarios. All commands must be run as root (or with sudo) on the host over SSH.


## 1. Agent crashed / ops-agent not responding

**Symptoms:** The dashboard shows "Agent unreachable" or HTTP 502/504 on flow endpoints; `curl http://127.0.0.1:3099/healthz` times out.

```bash
# Check current status and recent logs
systemctl status ops-agent
journalctl -u ops-agent -n 50 --no-pager

# Restart the agent
systemctl restart ops-agent

# Verify it came back
systemctl status ops-agent
curl -sf http://127.0.0.1:3099/healthz && echo OK
```

If the agent exits immediately on restart, the most common causes are:

| Cause | Fix |
| --- | --- |
| Missing `/etc/ops-agent/secret` | `openssl rand -hex 32 \| sudo tee /etc/ops-agent/secret && sudo chown root:ops-agent /etc/ops-agent/secret && sudo chmod 0640 /etc/ops-agent/secret` |
| Missing or invalid `commands.yml` | `cp /opt/ops-agent/commands.yml.example /etc/ops-agent/commands.yml` |
| Port 3099 already in use | `ss -tlnp \| grep 3099`, then kill the conflicting process |
| Node.js missing | `apt install nodejs`, or reinstall via `deploy/ops-agent/setup.sh` |
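The file and binary checks from the table can be collapsed into one quick triage pass. This is a sketch, not part of the repo: the `triage` name and the optional path prefix (handy for dry-running against a scratch directory) are invented here; the paths are the ones listed above, and the port check is left to `ss` as shown in the table.

```shell
# triage [PREFIX] — print one line per missing prerequisite.
# PREFIX is prepended to the absolute paths for testing; omit it on the real host.
triage() {
  prefix=${1:-}
  [ -r "$prefix/etc/ops-agent/secret" ]       || echo "missing: /etc/ops-agent/secret"
  [ -f "$prefix/etc/ops-agent/commands.yml" ] || echo "missing: /etc/ops-agent/commands.yml"
  command -v node >/dev/null 2>&1             || echo "missing: node"
}

triage   # run on the host, as root
```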

If the binary itself is corrupt, reinstall from the repo:

```bash
cd /srv/ops/repos/ops-dashboard
sudo deploy/ops-agent/setup.sh
```

## 2. Database corruption (`ops_dashboard` DB)

**Symptoms:** Dashboard shows database errors; Prisma throws `P1001`/`P1002`; `psql` commands fail against the `ops_dashboard` database.

### 2a. Restore from latest backup

Backups are stored in `/var/backups/ops-dashboard/` (the default path from the backup flow). Each file is a plain-SQL `pg_dump` dump named `ops_dashboard_YYYY-MM-DDTHH-MM-SSZ.sql`.
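Given that naming scheme, the newest dump can be selected programmatically instead of eyeballed. A minimal sketch — the `newest_backup` helper is hypothetical, ordering by modification time, which matches the timestamped names:

```shell
# newest_backup DIR — print the path of the most recent ops_dashboard_*.sql in DIR
newest_backup() {
  ls -1t "$1"/ops_dashboard_*.sql 2>/dev/null | head -n 1
}

BACKUP=$(newest_backup /var/backups/ops-dashboard)
echo "Would restore from: ${BACKUP:-<none found>}"
```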

```bash
# List available backups (newest first)
ls -lt /var/backups/ops-dashboard/*.sql | head -10

# Drop the damaged DB and restore
BACKUP=/var/backups/ops-dashboard/<chosen-file>.sql
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ops_dashboard';"
sudo -u postgres dropdb ops_dashboard
sudo -u postgres createdb ops_dashboard
sudo -u postgres psql ops_dashboard < "$BACKUP"
```

Then restart the dashboard to re-open connections:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml restart ops-dashboard
```

### 2b. No usable backup available

If no backup is usable, run the Prisma migrations to re-create an empty schema. This recovers the structure only — all previous data is lost:

```bash
cd /srv/ops/repos/ops-dashboard
sudo -u postgres createdb ops_dashboard   # if it was dropped
docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
  npx prisma migrate deploy
```

## 3. Container refuses to start

**Symptoms:** `docker compose up ops-dashboard` exits immediately; the container appears in `docker ps -a` as `Exited`.

```bash
# Show exit logs
docker logs ops-dashboard --tail 50

# Inspect exit code
docker inspect ops-dashboard --format='ExitCode: {{.State.ExitCode}}'
```

### Common causes and fixes

| Exit code / symptom | Likely cause | Fix |
| --- | --- | --- |
| `EACCES` / permission denied | Env file unreadable | `chmod 640 /srv/ops/ops-dashboard.env` |
| `DATABASE_URL` connect error | DB not ready or wrong credentials | Check `docker ps` for postgres; verify `DATABASE_URL` in the env file |
| Port 3000 already bound | Another process on 3000 | `ss -tlnp \| grep 3000`; adjust the port mapping in the Compose file |
| Invalid `AUTH_SECRET` | Secret too short for Next.js Auth | Regenerate: `openssl rand -base64 32` |
| Image not found | Build step skipped | `docker compose -f ... build ops-dashboard`, then `up -d` |

Force a clean rebuild:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml build --no-cache ops-dashboard
docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d --force-recreate ops-dashboard
```
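After recreating, the updater's smoke-test step can be reproduced by hand with a small retry loop. A sketch only: the `wait_for_http` name and the port-3000 URL are assumptions here — adjust the URL to the actual port mapping in the Compose file.

```shell
# wait_for_http URL TRIES — return 0 once URL answers with a success status,
# 1 after TRIES failed attempts (one second apart)
wait_for_http() {
  url=$1 tries=$2 i=0
  while [ "$i" -lt "$tries" ]; do
    curl -sf -o /dev/null "$url" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

wait_for_http http://127.0.0.1:3000/ 10 && echo "dashboard is up" || echo "still down after 10s"
```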

## 4. TLS certificate expired

**Symptoms:** Browser shows `ERR_CERT_DATE_INVALID`; Caddy logs show `certificate expired`.

Caddy manages TLS automatically via Let's Encrypt. A certificate should never expire under normal operation because Caddy renews ~30 days before expiry.

### Check certificate status

```bash
# View Caddy logs for ACME activity
journalctl -u caddy -n 100 --no-pager | grep -i acme

# Check the expiry of the live certificate
echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
  | openssl x509 -noout -dates
```
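To feed the expiry into monitoring, the `-dates` output can be reduced to a day count. A sketch with an invented helper name, `cert_days_left`; it assumes GNU `date` for parsing and operates on a PEM file, e.g. one saved from the `s_client` output:

```shell
# cert_days_left CERT.pem — print whole days until the certificate's notAfter
cert_days_left() {
  end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Example: save the live cert, then check it
#   echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
#     | openssl x509 > /tmp/live.pem
#   cert_days_left /tmp/live.pem
```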

### Force renewal

```bash
# Stop Caddy, clear the ACME cache, restart
systemctl stop caddy
rm -rf /var/lib/caddy/.local/share/caddy/certificates
systemctl start caddy
journalctl -u caddy -f   # watch for successful issuance
```

### If ACME fails (rate-limited or DNS not resolving)

1. Confirm DNS for jp-visser.nl points to this server's public IP.
2. Confirm port 80 is open inbound (required for the HTTP-01 challenge).
3. If rate-limited (Let's Encrypt allows 5 duplicate certificates per week for the same set of names), wait until the limit resets or switch to the staging CA temporarily:

   ```
   # In the Caddyfile, add:
   acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
   # Remove after the rate-limit window passes.
   ```

### Emergency: self-signed certificate

If renewal is blocked and you need access now:

```bash
# Generate a temporary self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout /etc/caddy/selfsigned.key \
  -out /etc/caddy/selfsigned.crt -days 30 -nodes \
  -subj "/CN=jp-visser.nl"

# Point the Caddyfile at the self-signed cert (tls directive):
#   tls /etc/caddy/selfsigned.crt /etc/caddy/selfsigned.key
systemctl reload caddy
```

Remember to revert to automatic TLS once the ACME issue is resolved.