docs(PBI-58): add developer manual chapters under docs/manual/
Adds a 7-file English-language manual targeted at new human contributors: index, overview, statuses & transitions (with mermaid state diagrams), git workflow, MCP integration, docker, and troubleshooting. The manual is the *map*: it cross-references existing runbooks/ADRs/architecture docs rather than duplicating their content. Regenerates docs/INDEX.md and validates with check-doc-links.mjs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent e8562d4018
commit e75bac9375
8 changed files with 873 additions and 0 deletions
149
docs/manual/05-docker.md
Normal file
@@ -0,0 +1,149 @@
---
title: "Docker"
status: active
audience: [contributor]
language: en
last_updated: 2026-05-07
when_to_read: "Before running the worker locally, debugging a stuck job, or operating the Mac/NAS deployment."
---

# 05 — Docker

This chapter is the contributor's tour of the Docker side of Scrum4Me. Two facts to know up front:

1. **The Next.js app is not containerised.** The web UI, API routes, server actions, and database connection all run on **Vercel** (serverless functions + Edge runtime). There is no `Dockerfile` and no `docker-compose.yml` in this repo.
2. **Only the worker is containerised.** The "worker" is a Claude Code agent in a long-running container that polls the Scrum4Me job queue via MCP and executes `TASK_IMPLEMENTATION` / `IDEA_GRILL` / `IDEA_MAKE_PLAN` / `SPRINT_IMPLEMENTATION` jobs.

The container image and its supporting scripts live in a **separate repo**: [`madhura68/scrum4me-docker`](https://github.com/madhura68/scrum4me-docker). This manual documents the consumer side: what the worker is, how it relates to Scrum4Me, and how to diagnose issues. The container internals (Dockerfile, entrypoint, agent provisioning) are out of scope; see that repo's README.

> **Note:** A separate sandbox repo `scrum4me-sbx` ([`SC-4`](https://github.com/madhura68/scrum4me-sbx)) exists for Docker exploration. Treat it as a scratchpad, not as the production worker.

## Topology

```mermaid
flowchart LR
  subgraph Vercel
    App[Next.js app<br/>+ API routes]
  end
  subgraph Neon
    DB[(Postgres)]
  end
  subgraph Mac["Mac (default) / NAS (opt-in)"]
    Worker[Worker container<br/>Claude Code + MCP]
  end
  Worker -- MCP over HTTPS --> App
  App -- Prisma --> DB
  Worker -- git push --> GH[GitHub]
  GH -- webhooks --> App
```

- The worker **never connects to the database directly**. All state changes go through MCP tools, which call the Vercel-hosted REST API, which writes to Neon via Prisma.
- The worker **does** push commits directly to GitHub. GitHub then notifies Vercel and the auto-PR flow ([03 — Git Workflow](./03-git-workflow.md)) takes over.

## Mac vs NAS

| Flow | When to use | Status |
|---|---|---|
| **Mac-native (arm64)** | Default for development and small teams | Active |
| **NAS** | Self-hosted always-on worker on a Synology / Asustor / similar | Opt-in, validated by historical smoke tests in [`docs/docker-smoke/`](../docker-smoke/) |

The Mac flow is the default because it doesn't require dedicated hardware. The container runs natively on Apple Silicon (arm64), with no x86 emulation overhead.

## Environment variables the worker needs

The worker container needs **only** what's required to authenticate to MCP, call the Claude API, and push to GitHub:

| Var | Purpose |
|---|---|
| `SCRUM4ME_BEARER_TOKEN` | Bearer token bound to a product. Issued on the user's API-token settings page. |
| `SCRUM4ME_BASE_URL` | Usually `https://scrum4me.vercel.app` (or the user's domain). |
| `GITHUB_TOKEN` | Personal access token with `repo` scope, used by `git push` and `gh pr create`. |
| `ANTHROPIC_API_KEY` | The Claude API key used by the worker process. |
| `MIN_QUOTA_PCT` | Optional. Worker pauses if Anthropic quota drops below this percentage. |

> **Hardstop:** the worker does **not** need `DATABASE_URL`, `SESSION_SECRET`, or `CRON_SECRET`. Those belong to the Next.js app; the worker only talks to MCP. If you find yourself adding DB env vars to the worker, stop: you're solving the wrong problem.

The full list and provisioning instructions live in the [`scrum4me-docker` README](https://github.com/madhura68/scrum4me-docker). **TODO:** link to specific sections of that README once it's stable.
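
The rules in the table and the hardstop above can be turned into a small pre-flight check. This is a minimal sketch, not part of `scrum4me-docker`: the function name `check_worker_env` and the message wording are hypothetical, and it assumes `MIN_QUOTA_PCT` is a number between 0 and 100.

```python
# Hypothetical pre-flight check; variable names come from the table above,
# everything else (function name, messages) is illustrative.
REQUIRED = (
    "SCRUM4ME_BEARER_TOKEN",
    "SCRUM4ME_BASE_URL",
    "GITHUB_TOKEN",
    "ANTHROPIC_API_KEY",
)
# App-only secrets that should never reach the worker container.
FORBIDDEN = ("DATABASE_URL", "SESSION_SECRET", "CRON_SECRET")


def check_worker_env(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"missing required variable: {name}")
    for name in FORBIDDEN:
        if name in env:
            problems.append(f"app-only variable set on worker: {name}")
    raw = env.get("MIN_QUOTA_PCT")
    if raw is not None:
        try:
            pct = float(raw)
        except ValueError:
            problems.append("MIN_QUOTA_PCT is not a number")
        else:
            # Assumption: the value is a percentage in [0, 100].
            if not 0 <= pct <= 100:
                problems.append("MIN_QUOTA_PCT must be between 0 and 100")
    return problems
```

Calling `check_worker_env(dict(os.environ))` before starting the loop would surface a stray `DATABASE_URL` early instead of letting it sit unused in the container.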

## What the worker loop does, on a single iteration

```mermaid
sequenceDiagram
  participant W as Worker
  participant Q as worker-quota-probe.sh
  participant M as MCP server
  W->>Q: probe Anthropic quota
  Q-->>W: { pct, reset_at_iso }
  alt pct < MIN_QUOTA_PCT
    W->>M: worker_heartbeat(pct, last_quota_check_at)
    W->>W: sleep until reset_at_iso (cap 1h)
  else quota ok
    W->>M: worker_heartbeat(pct, last_quota_check_at)
    W->>M: wait_for_job (block ≤600s, claim atomically)
    alt queue empty
      W->>W: continue (no work, loop again)
    else got job
      W->>W: execute by `kind`
      W->>M: update_job_status(done|failed)
    end
  end
  Note over W: continue forever
```

The loop is described authoritatively in [`docs/runbooks/mcp-integration.md`](../runbooks/mcp-integration.md#batch-loop-verplichte-agent-flow) and [`docs/runbooks/worker-idempotency.md`](../runbooks/worker-idempotency.md).
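
As a reading aid, one iteration of the diagram can be sketched in Python. This is an illustration under stated assumptions, not the worker's actual code: the real worker is a Claude Code agent calling MCP tools, and every callable passed in here (`probe_quota`, `heartbeat`, `wait_for_job`, `execute`, `update_job_status`, `sleep`) as well as the job shape `{"id": ...}` is a hypothetical stand-in.

```python
from datetime import datetime, timedelta, timezone

MAX_PAUSE = timedelta(hours=1)   # "cap 1h" in the diagram
WAIT_TIMEOUT_S = 600             # wait_for_job blocks for at most 600 s


def run_iteration(probe_quota, heartbeat, wait_for_job, execute,
                  update_job_status, sleep, min_quota_pct=20.0, now=None):
    """Run one loop iteration and return 'paused', 'idle', 'done' or 'failed'."""
    now = now or datetime.now(timezone.utc)
    quota = probe_quota()  # assumed shape: {"pct": float, "reset_at_iso": str}
    # Both branches of the diagram send a heartbeat first.
    heartbeat(quota["pct"], now.isoformat())

    if quota["pct"] < min_quota_pct:
        # Assumes reset_at_iso carries an explicit UTC offset such as +00:00.
        reset_at = datetime.fromisoformat(quota["reset_at_iso"])
        pause = min(max(reset_at - now, timedelta(0)), MAX_PAUSE)
        sleep(pause.total_seconds())
        return "paused"

    job = wait_for_job(WAIT_TIMEOUT_S)  # None stands for "queue empty"
    if job is None:
        return "idle"
    try:
        execute(job)
    except Exception:
        update_job_status(job["id"], "failed")
        return "failed"
    update_job_status(job["id"], "done")
    return "done"
```

Passing the side effects in as callables keeps the control flow testable without a container, an API key, or an MCP server.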

### Quota probe

`bin/worker-quota-probe.sh` (in `scrum4me-docker`) makes a tiny call to the Anthropic API to read the current quota percentage and reset time. Cost: ~1 output token per probe (~12 tokens/hour at 5-minute intervals). The default `MIN_QUOTA_PCT` is **20%**. On Pro/Max plans the remaining quota typically stays above that threshold, so the worker rarely pauses during normal working hours.
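
The cost figure follows from simple arithmetic, shown here as a short sketch. The helper names are hypothetical, and the one-token-per-probe figure is the approximation quoted above, not a measured value.

```python
def probe_tokens_per_hour(interval_minutes, tokens_per_probe=1):
    """Approximate output tokens spent on quota probes per hour."""
    probes_per_hour = 60 / interval_minutes  # 60 / 5 = 12 probes per hour
    return probes_per_hour * tokens_per_probe


def should_pause(pct, min_quota_pct=20.0):
    """True when the probed quota is below the MIN_QUOTA_PCT threshold."""
    return pct < min_quota_pct
```

With a 5-minute interval this gives 12 tokens per hour, or 288 per day if the worker runs around the clock, which is negligible next to the cost of a single job.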

### Heartbeat

Every iteration the worker calls `worker_heartbeat({ last_quota_pct, last_quota_check_at })`. The MCP server emits an SSE event so the NavBar in the Next.js app shows the worker as live. A heartbeat older than 15 seconds is rendered as "offline" / "stand-by" in the UI.
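
The 15-second rule amounts to a timestamp comparison. The sketch below is an assumption about how such a check could look, not the NavBar's actual implementation; `worker_status` is a hypothetical name, and it collapses "offline" and "stand-by" into one value because the criteria that separate them are not described here.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TTL = timedelta(seconds=15)


def worker_status(last_heartbeat_at, now):
    """Return 'live' if the last heartbeat is at most 15 s old, else 'offline'."""
    if last_heartbeat_at is None:
        # No heartbeat has ever been received.
        return "offline"
    age = now - last_heartbeat_at
    return "live" if age <= HEARTBEAT_TTL else "offline"
```

Both timestamps should be timezone-aware (or both naive); mixing the two raises a `TypeError` in Python.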

### Stale-claim recovery

If a worker dies mid-job (process crash, container kill, network partition), its claimed job stays as `CLAIMED` in the database. After **30 minutes** the next `wait_for_job` call automatically requeues it (`CLAIMED → QUEUED`) before claiming a fresh one. No manual intervention is required for clean recovery.
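
The requeue rule can be illustrated with an in-memory sketch. The real logic runs server-side against the database inside `wait_for_job`; the field names `status` and `claimed_at`, the function name, and the treatment of exactly 30 minutes as stale are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)


def requeue_stale_claims(jobs, now):
    """Flip CLAIMED jobs older than 30 min back to QUEUED; return their ids."""
    requeued = []
    for job in jobs:
        if job["status"] != "CLAIMED":
            continue
        # Assumption: a claim that is exactly 30 minutes old counts as stale.
        if now - job["claimed_at"] >= STALE_AFTER:
            job["status"] = "QUEUED"
            job["claimed_at"] = None
            requeued.append(job["id"])
    return requeued
```

Running this step before every claim is what makes recovery automatic: any live worker that asks for work also cleans up after a dead one.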

When you **do** need to requeue a job manually (e.g. you killed it intentionally and don't want to wait 30 minutes), the operator route is the admin board → "Requeue job" button. **TODO:** confirm the exact UI path; this is not yet documented in `docs/runbooks/`.

## Running the worker locally

The intended local workflow is **Mac-native Docker** (recorded in the project's standing `project_docker_default_target` memory). High-level steps (verify against the [scrum4me-docker README](https://github.com/madhura68/scrum4me-docker) for exact commands):

1. Clone `scrum4me-docker` next to `Scrum4Me/` (so `~/Development/Scrum4Me/scrum4me-docker/`).
2. Provision the env vars above (typically a `.env` file in that repo, **not committed**).
3. `docker build` the image and `docker run` it with the env file passed in (for example via `--env-file`).
4. Watch container logs for the heartbeat/quota cycle.
5. Trigger a job from the UI ("Voer alle uit", i.e. "Run all", on the Solo Board) and verify the worker picks it up within ~5 seconds.

> **TODO:** once the `scrum4me-docker` README has stabilised, replace the steps above with copy-paste-ready commands. Until then, defer to that repo for canonical instructions.

## Debugging a stuck worker

| Symptom | Likely cause | Fix |
|---|---|---|
| Worker shows offline in NavBar but container is running | `worker_heartbeat` not reaching MCP | Check `SCRUM4ME_BASE_URL` and `SCRUM4ME_BEARER_TOKEN`; tail container logs for HTTP errors |
| Worker logs say "stand-by" indefinitely | `pct < MIN_QUOTA_PCT` and `reset_at_iso` not yet reached | Lower `MIN_QUOTA_PCT` for testing, or wait for the printed `reset_at_iso` |
| Job stuck `CLAIMED` for >30 min | Worker died mid-job | Wait: auto-requeue triggers on the next `wait_for_job` |
| Worker claims job but never updates status | Crashed before `update_job_status`; container restarted in a loop | Check `docker logs`; the next `wait_for_job` will requeue stale claims |
| `update_job_status` returns `403` | Bearer token doesn't match `claimed_by_token_id` | The token was rotated mid-run; restart with a fresh token |

For deeper troubleshooting, see [chapter 06, Troubleshooting](./06-troubleshooting.md).

## Smoke-test references

Historical Docker smoke tests live in [`docs/docker-smoke/`](../docker-smoke/). They validated the worktree-isolation + branch-per-story flow when the Docker worker was first introduced. They are **historical**, so don't expect them to be runnable as-is, but they're a useful reference when you want to verify the same flow on a new container image.

## Deep links

| Topic | Source |
|---|---|
| Container image, Dockerfile, build | [`scrum4me-docker` repo](https://github.com/madhura68/scrum4me-docker) |
| Worker loop & quota check | [`docs/runbooks/mcp-integration.md`](../runbooks/mcp-integration.md#pre-flight-quota-check-m13) |
| Worker idempotency / job-status protocol | [`docs/runbooks/worker-idempotency.md`](../runbooks/worker-idempotency.md) |
| Historical smoke tests | [`docs/docker-smoke/`](../docker-smoke/) |
| Sandbox / exploration repo | [`scrum4me-sbx` repo](https://github.com/madhura68/scrum4me-sbx) |

## What's next

→ [06 — Troubleshooting](./06-troubleshooting.md) covers error codes and recovery procedures across the full stack.