feat(ops): self-update script, systemd units, README install guide, recovery runbook

- deploy/ops-dashboard-updater/update.sh: git pull → docker build → force-recreate → smoke-test
- deploy/ops-dashboard-updater/install.sh: installs script + systemd units to host
- ops-dashboard-updater.service / .timer: oneshot + daily 03:00 scheduled trigger
- README.md: Installation and Configuration sections (env files, ops-agent, updater)
- docs/runbooks/recovery.md: agent-crash, DB corruption/restore, container failure, cert expiry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 09050d5ce7
commit caeb5f3306
6 changed files with 361 additions and 0 deletions
74
README.md
@@ -4,6 +4,80 @@ Single-user ops dashboard for jp-visser.nl.

See `docs/runbooks/` for setup, deployment, and operational procedures.

## Installation

### Prerequisites

- Docker + Docker Compose (plugin) installed on the host
- A PostgreSQL service named `postgres` already running in the same Compose stack
- The repository cloned to `/srv/ops/repos/ops-dashboard`
- `/srv/scrum4me/compose/docker-compose.yml` as the shared Compose file

### 1. Configure environment

```
cp deploy/ops-dashboard.env.example /srv/ops/ops-dashboard.env
# Edit /srv/ops/ops-dashboard.env: set DATABASE_URL, AUTH_SECRET, etc.
```
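
Before moving on, it can help to confirm that the keys named above are actually set. The following is a minimal sketch, not part of the repository: `missing_env_keys` is a hypothetical helper, and it assumes the env file uses plain `KEY=value` lines.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repo): print each requested key
# that is missing from, or empty in, an env file with plain KEY=value lines.
missing_env_keys() {
  local file=$1 key
  shift
  for key in "$@"; do
    # "^KEY=.+" matches only when the key has at least one character of value.
    grep -Eq "^${key}=.+" "$file" || echo "$key"
  done
}

# Example call, using the path and key names from the step above:
#   missing_env_keys /srv/ops/ops-dashboard.env DATABASE_URL AUTH_SECRET
```

An empty result means every listed key has a value; any printed name still needs editing.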

### 2. Install ops-agent

```
sudo deploy/ops-agent/setup.sh
```

This creates the `ops-agent` system user, installs `/opt/ops-agent`, generates
`/etc/ops-agent/secret`, and enables the systemd unit.

Copy the generated secret into the web-app env file:

```
sudo cat /etc/ops-agent/secret
# Paste the value as OPS_AGENT_SECRET= in /srv/ops/ops-dashboard.env
```
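
If pasting by hand is error-prone, the copy can be scripted. This is only a sketch: `set_env_key` is a hypothetical helper, and it assumes the env file holds one `KEY=value` assignment per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): set KEY=VALUE in an env file,
# replacing an existing assignment of the same key or appending a new one.
set_env_key() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp)
  # Keep every line except an existing assignment of this key.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back through the existing file so its owner and mode are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example call (run as root), using the paths from the step above:
#   set_env_key /srv/ops/ops-dashboard.env OPS_AGENT_SECRET "$(cat /etc/ops-agent/secret)"
```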

### 3. Build and start the dashboard

```
sudo docker compose -f /srv/scrum4me/compose/docker-compose.yml build ops-dashboard
sudo docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d ops-dashboard
```

The dashboard is now reachable on `127.0.0.1:3001` (proxied by Caddy).
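
The container may need a few seconds before it answers. A small retry loop, sketched below, can confirm reachability from the host. `retry` is a hypothetical helper, and probing the root path with `curl` is an assumption about which URL the app serves.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to TRIES times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local tries=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    if (( i < tries )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example call (address taken from the step above; the path is an assumption):
#   retry 12 5 curl -sf -o /dev/null http://127.0.0.1:3001/ && echo reachable
```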

### 4. Install the self-update script

```
sudo deploy/ops-dashboard-updater/install.sh
```

To enable scheduled updates (daily at 03:00):

```
sudo systemctl enable --now ops-dashboard-updater.timer
```

To trigger a manual update via SSH:

```
sudo systemctl start ops-dashboard-updater.service
# or:
sudo /opt/ops-dashboard-updater/update.sh
```

> **Never** trigger updates through the dashboard UI, because the script restarts
> the container that serves the UI.
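
To see whether the timer is armed and what the last run did, standard systemd tooling is enough. The wrapper below is a hypothetical convenience function. The unit names and the journal identifier `ops-dashboard-update` are the ones defined by the unit files in this commit.

```bash
#!/usr/bin/env bash
# Hypothetical convenience wrapper around standard systemctl/journalctl calls.
updater_status() {
  # Is the timer enabled, and when does it fire next?
  systemctl is-enabled ops-dashboard-updater.timer
  systemctl list-timers ops-dashboard-updater.timer --no-pager
  # Result of the most recent run, followed by its log lines.
  systemctl status ops-dashboard-updater.service --no-pager
  journalctl -t ops-dashboard-update -n 50 --no-pager
}

# Example call (run as root):
#   updater_status
```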

## Configuration

| File | Purpose |
|---|---|
| `/srv/ops/ops-dashboard.env` | Web-app environment (DATABASE_URL, AUTH_SECRET, OPS_AGENT_SECRET, …) |
| `/etc/ops-agent/secret` | Shared HMAC secret between web-app and ops-agent |
| `/etc/ops-agent/commands.yml` | Whitelist of commands the ops-agent may run |
| `/etc/ops-agent/flows/` | Flow YAML files (backup, caddy reload, etc.) |
| `/srv/scrum4me/compose/docker-compose.yml` | Main Compose file (add ops-dashboard fragment from `deploy/`) |

## Ops-agent auth

The web-app communicates with the ops-agent via a shared secret stored in
33
deploy/ops-dashboard-updater/install.sh
Normal file
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# Install the ops-dashboard self-update script and systemd units.
# Run as root from within the repo.
set -euo pipefail

REPO_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
INSTALL_DIR=/opt/ops-dashboard-updater
SERVICE_DIR=/etc/systemd/system

echo "==> Installing update script to ${INSTALL_DIR}"
mkdir -p "${INSTALL_DIR}"
install -m 0750 -o root -g root \
  "${REPO_DIR}/deploy/ops-dashboard-updater/update.sh" \
  "${INSTALL_DIR}/update.sh"

echo "==> Installing systemd units"
install -m 0644 -o root -g root \
  "${REPO_DIR}/deploy/ops-dashboard-updater/ops-dashboard-updater.service" \
  "${SERVICE_DIR}/ops-dashboard-updater.service"
install -m 0644 -o root -g root \
  "${REPO_DIR}/deploy/ops-dashboard-updater/ops-dashboard-updater.timer" \
  "${SERVICE_DIR}/ops-dashboard-updater.timer"

systemctl daemon-reload

echo ""
echo "==> Done. To enable automatic scheduled updates:"
echo " systemctl enable --now ops-dashboard-updater.timer"
echo ""
echo " To run a manual update now:"
echo " systemctl start ops-dashboard-updater.service"
echo " # or directly:"
echo " /opt/ops-dashboard-updater/update.sh"
14
deploy/ops-dashboard-updater/ops-dashboard-updater.service
Normal file
@@ -0,0 +1,14 @@
[Unit]
Description=Self-update ops-dashboard (oneshot, triggered by timer or SSH)
After=network.target docker.service

[Service]
Type=oneshot
User=root
ExecStart=/opt/ops-dashboard-updater/update.sh
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ops-dashboard-update

[Install]
WantedBy=multi-user.target
11
deploy/ops-dashboard-updater/ops-dashboard-updater.timer
Normal file
@@ -0,0 +1,11 @@
[Unit]
Description=Scheduled self-update for ops-dashboard (optional)

[Timer]
# Check for updates every day at 03:00 local time.
# Disable this timer if you prefer manual-only updates via SSH.
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
56
deploy/ops-dashboard-updater/update.sh
Normal file
@@ -0,0 +1,56 @@
#!/usr/bin/env bash
# Self-update script for ops-dashboard.
# Run as root via SSH or the systemd oneshot service below.
# Do NOT invoke this through the UI: it restarts the container serving the UI.
set -euo pipefail

REPO_DIR=/srv/ops/repos/ops-dashboard
COMPOSE_FILE=/srv/scrum4me/compose/docker-compose.yml
SERVICE=ops-dashboard
LOG_TAG=ops-dashboard-update
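
log() { local msg="[$(date -u +%FT%TZ)] $*"; echo "$msg"; [[ -n "${INVOCATION_ID:-}" ]] || systemd-cat -t "$LOG_TAG" -p info <<<"$msg" 2>/dev/null || true; }  # under systemd, stdout already reaches the journal; manual runs are mirrored via systemd-cat
die() { echo "[$(date -u +%FT%TZ)] ERROR: $*" >&2; exit 1; }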

# ── 1. Pull latest code ────────────────────────────────────────────────────────
log "Pulling latest code from origin..."
git -C "$REPO_DIR" fetch --prune origin
CURRENT=$(git -C "$REPO_DIR" rev-parse HEAD)
git -C "$REPO_DIR" reset --hard origin/main
NEW=$(git -C "$REPO_DIR" rev-parse HEAD)

if [[ "$CURRENT" == "$NEW" ]]; then
  log "Already up-to-date at $NEW; nothing to rebuild."
  exit 0
fi
log "Updated $CURRENT → $NEW"

# ── 2. Build new image ─────────────────────────────────────────────────────────
log "Building Docker image..."
docker compose -f "$COMPOSE_FILE" build "$SERVICE"

# ── 3. Restart container ───────────────────────────────────────────────────────
log "Restarting $SERVICE with new image..."
docker compose -f "$COMPOSE_FILE" up -d --force-recreate "$SERVICE"

# ── 4. Smoke test ──────────────────────────────────────────────────────────────
log "Waiting for container to become healthy..."
for i in $(seq 1 12); do
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$SERVICE" 2>/dev/null || true)
  if [[ "$STATUS" == "healthy" ]]; then
    log "Container is healthy after $(( (i - 1) * 5 )) s."
    break
  fi
  # Fallback: accept running if no HEALTHCHECK is defined
  RUNNING=$(docker inspect --format='{{.State.Running}}' "$SERVICE" 2>/dev/null || echo false)
  if [[ "$RUNNING" == "true" && -z "$STATUS" ]]; then
    log "Container running (no HEALTHCHECK defined)."
    break
  fi
  if [[ $i -eq 12 ]]; then
    die "Container did not become healthy within 60 s. Check: docker logs $SERVICE"
  fi
  sleep 5
done

log "Update complete: $SERVICE is running commit $NEW."
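
Note on the up-to-date check: the script compares `HEAD` before and after the reset, so a run whose build or smoke test fails leaves `HEAD` at the new commit, and the next run reports "already up-to-date" without rebuilding. One possible way around this is to record the last commit that was deployed successfully and compare against that instead. The sketch below is not part of `update.sh` as committed; `STATE_FILE`, `needs_deploy`, and `record_deploy` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch only: STATE_FILE and both function names are hypothetical and not part
# of update.sh as committed.
STATE_FILE=${STATE_FILE:-/var/lib/ops-dashboard-updater/last-deployed}

# Succeeds (exit 0) when the given commit differs from the last recorded deploy,
# including the case where nothing has been recorded yet.
needs_deploy() {
  local new=$1 last=""
  if [[ -r "$STATE_FILE" ]]; then
    last=$(<"$STATE_FILE")
  fi
  [[ "$new" != "$last" ]]
}

# Record a commit as deployed; call only after the smoke test has passed.
record_deploy() {
  mkdir -p "$(dirname "$STATE_FILE")"
  printf '%s\n' "$1" > "$STATE_FILE"
}

# Possible use inside update.sh, in place of the CURRENT/NEW comparison:
#   needs_deploy "$NEW" || { log "Already deployed $NEW."; exit 0; }
#   ... build, restart, smoke test ...
#   record_deploy "$NEW"
```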
173
docs/runbooks/recovery.md
Normal file
@@ -0,0 +1,173 @@
# Recovery Runbook: Ops Dashboard

This runbook covers the four most common failure scenarios. All commands must
be run as root (or with `sudo`) on the host over SSH.

---

## 1. Agent crashed / ops-agent not responding

**Symptoms:** The dashboard shows "Agent unreachable" or HTTP 502/504 on flow
endpoints; `curl http://127.0.0.1:3099/healthz` times out.

```bash
# Check current status and recent logs
systemctl status ops-agent
journalctl -u ops-agent -n 50 --no-pager

# Restart the agent
systemctl restart ops-agent

# Verify it came back
systemctl status ops-agent
curl -sf http://127.0.0.1:3099/healthz && echo OK
```

If the agent exits immediately on restart, the most common causes are:

| Cause | Fix |
|---|---|
| Missing `/etc/ops-agent/secret` | `openssl rand -hex 32 \| sudo tee /etc/ops-agent/secret >/dev/null && sudo chown root:ops-agent /etc/ops-agent/secret && sudo chmod 0640 /etc/ops-agent/secret`, then update `OPS_AGENT_SECRET` in `/srv/ops/ops-dashboard.env` to the new value |
| Missing or invalid `commands.yml` | `cp /opt/ops-agent/commands.yml.example /etc/ops-agent/commands.yml` |
| Port 3099 already in use | `ss -tlnp \| grep 3099` then kill the conflicting process |
| Node.js missing | `apt install nodejs` or reinstall via `deploy/ops-agent/setup.sh` |

If the binary itself is corrupt, reinstall from the repo:

```bash
cd /srv/ops/repos/ops-dashboard
sudo deploy/ops-agent/setup.sh
```

---

## 2. Database corruption (ops_dashboard DB)

**Symptoms:** Dashboard shows database errors; Prisma throws `P1001` / `P1002`;
`psql` commands fail against the `ops_dashboard` database.

### 2a. Restore from latest backup

Backups are stored in `/var/backups/ops-dashboard/` (default path from the
backup flow). Each file is a plain-SQL `pg_dump` dump named
`ops_dashboard_YYYY-MM-DDTHH-MM-SSZ.sql`.

```bash
# List available backups (newest first)
ls -lt /var/backups/ops-dashboard/*.sql | head -10

# Drop the damaged DB and restore
BACKUP=/var/backups/ops-dashboard/<chosen-file>.sql
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ops_dashboard';"
sudo -u postgres dropdb ops_dashboard
sudo -u postgres createdb ops_dashboard
sudo -u postgres psql ops_dashboard < "$BACKUP"
```
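
Because the backup file names embed a fixed-width UTC timestamp, sorting them alphabetically also sorts them chronologically. The sketch below uses that to pick the newest file instead of copying a name by hand. `latest_backup` is a hypothetical helper; the directory and naming pattern are the ones described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest backup in a directory, relying on the
# timestamped file names sorting chronologically. Fails if none are found.
latest_backup() {
  local dir=${1:-/var/backups/ops-dashboard}
  local files
  files=$(ls -1 "$dir"/ops_dashboard_*.sql 2>/dev/null) || return 1
  printf '%s\n' "$files" | sort | tail -n 1
}

# Example call:
#   BACKUP=$(latest_backup) && echo "Would restore from $BACKUP"
```

Check the chosen file's size and timestamp before dropping the database, since the newest backup may itself have been taken after the corruption occurred.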

Then restart the dashboard to re-open connections:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml restart ops-dashboard
```

### 2b. No usable backup available

Run Prisma migrations to re-create the schema. The database starts empty, so
all previous data is lost:

```bash
cd /srv/ops/repos/ops-dashboard
sudo -u postgres createdb ops_dashboard # if it was dropped
docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
  npx prisma migrate deploy
```

---

## 3. Container refuses to start

**Symptoms:** `docker compose up ops-dashboard` exits immediately; the container
appears in `docker ps -a` with `Exited`.

```bash
# Show exit logs
docker logs ops-dashboard --tail 50

# Inspect exit code
docker inspect ops-dashboard --format='ExitCode: {{.State.ExitCode}}'
```
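
The exit code alone often narrows the cause. The sketch below maps a code to the usual shell conventions (126, 127, and 128 + signal number); `describe_exit_code` is a hypothetical helper, and the interpretation is a rule of thumb rather than anything specific to this application.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a process exit code using common conventions.
describe_exit_code() {
  local code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    *)
      if (( code > 128 && code <= 192 )); then
        # 128 + N conventionally means "terminated by signal N".
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application error (see container logs)"
      fi
      ;;
  esac
}

# Example call:
#   describe_exit_code "$(docker inspect ops-dashboard --format='{{.State.ExitCode}}')"
```

Signal 9 (exit code 137) is `SIGKILL`, which commonly points to the kernel's out-of-memory killer or a forced stop; signal 15 (exit code 143) is `SIGTERM`.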

### Common causes and fixes

| Exit code / symptom | Likely cause | Fix |
|---|---|---|
| `EACCES` / permission denied | Env file unreadable | `chmod 640 /srv/ops/ops-dashboard.env` |
| `DATABASE_URL` connect error | DB not ready or wrong credentials | Check `docker ps` for `postgres`; verify `DATABASE_URL` in env file |
| Port 3000 already bound | Another process on 3000 | `ss -tlnp \| grep 3000`; adjust port mapping in Compose file |
| Invalid `AUTH_SECRET` | Secret too short for Next.js Auth | Regenerate: `openssl rand -base64 32` |
| Image not found | `build` step skipped | `docker compose -f ... build ops-dashboard` then `up -d` |

Force a clean rebuild:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml build --no-cache ops-dashboard
docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d --force-recreate ops-dashboard
```

---

## 4. TLS certificate expired

**Symptoms:** Browser shows `ERR_CERT_DATE_INVALID`; Caddy logs show
`certificate expired`.

Caddy manages TLS automatically via Let's Encrypt. A certificate should never
expire under normal operation because Caddy renews ~30 days before expiry.

### Check certificate status

```bash
# View Caddy logs for ACME activity
journalctl -u caddy -n 100 --no-pager | grep -i acme

# Check the expiry of the live certificate
echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
  | openssl x509 -noout -dates
```
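
To turn the `notAfter` date into a number that is easy to alert on, the date string can be converted to days remaining. This is a sketch that assumes GNU `date` (for the `-d` option); `days_until` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: whole days between now (or a given epoch, second
# argument) and a date string such as the notAfter value printed by
# `openssl x509 -noout -enddate`. Requires GNU date for the -d option.
days_until() {
  local when=$1 now=${2:-$(date +%s)}
  local end
  end=$(date -u -d "$when" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}

# Example call, reusing the openssl pipeline from above:
#   end=$(echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   days_until "$end"
```

A negative result means the certificate has already expired. For a simple yes/no check, `openssl x509 -checkend <seconds>` is an alternative that needs no date arithmetic.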

### Force renewal

```bash
# Stop Caddy, clear the stored certificates, restart
systemctl stop caddy
rm -rf /var/lib/caddy/.local/share/caddy/certificates
systemctl start caddy
journalctl -u caddy -f # watch for successful issuance
```

### If ACME fails (rate-limited or DNS not resolving)

1. Confirm DNS for `jp-visser.nl` points to this server's public IP.
2. Confirm port 80 is open inbound (required for HTTP-01 challenge).
3. If rate-limited (Let's Encrypt allows 5 certificates per week for the same
   exact set of hostnames, and 50 per registered domain), wait until the limit
   resets or use a staging cert temporarily. Staging certificates are not
   trusted by browsers, so this only verifies that issuance works:
   ```
   # In the Caddyfile global options block, add:
   #   acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
   # Remove after the rate-limit window passes.
   ```

### Emergency: self-signed certificate

If renewal is blocked and you need access now:

```bash
# Generate a temporary self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout /etc/caddy/selfsigned.key \
  -out /etc/caddy/selfsigned.crt -days 30 -nodes \
  -subj "/CN=jp-visser.nl"

# Point Caddyfile at the self-signed cert (tls section):
# tls /etc/caddy/selfsigned.crt /etc/caddy/selfsigned.key
systemctl reload caddy
```

Remember to revert to automatic TLS once the ACME issue is resolved.