Adds a 7-file English-language manual targeted at new human contributors: index, overview, statuses & transitions (with mermaid state diagrams), git workflow, MCP integration, docker, and troubleshooting. The manual is the *map* — it cross-references existing runbooks/ADRs/architecture docs rather than duplicating their content. Regenerates docs/INDEX.md and validates with check-doc-links.mjs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| title | status | audience | language | last_updated | when_to_read |
|---|---|---|---|---|---|
| Docker | active | | en | 2026-05-07 | Before running the worker locally, debugging a stuck job, or operating the Mac/NAS deployment. |
# 05 — Docker
This chapter is the contributor's tour of the Docker side of Scrum4Me. Two important up-front facts:
- The Next.js app is not containerised. The web UI, API routes, server actions, and database connection all run on Vercel (serverless functions + Edge runtime). There is no `Dockerfile` in this repo and no `docker-compose.yml`.
- Only the worker is containerised. The "worker" is a Claude Code agent in a long-running container that polls the Scrum4Me job queue via MCP and executes `TASK_IMPLEMENTATION` / `IDEA_GRILL` / `IDEA_MAKE_PLAN` / `SPRINT_IMPLEMENTATION` jobs.
The container image and its supporting scripts live in a separate repo: madhura68/scrum4me-docker. This manual documents the consumer side — what the worker is, how it relates to Scrum4Me, and how to diagnose issues. The container internals (Dockerfile, entrypoint, agent provisioning) are out of scope for this manual; see that repo's README.
Note: A separate sandbox repo `scrum4me-sbx` (SC-4) exists for Docker exploration. Treat it as a scratchpad, not as the production worker.
## Topology
```mermaid
flowchart LR
  subgraph Vercel
    App[Next.js app<br/>+ API routes]
  end
  subgraph Neon
    DB[(Postgres)]
  end
  subgraph Mac["Mac (default) / NAS (opt-in)"]
    Worker[Worker container<br/>Claude Code + MCP]
  end
  Worker -- MCP over HTTPS --> App
  App -- Prisma --> DB
  Worker -- git push --> GH[GitHub]
  GH -- webhooks --> App
```
- The worker never connects to the database directly. All state changes go through MCP tools, which call the Vercel-hosted REST API, which writes to Neon via Prisma.
- The worker does push commits directly to GitHub. GitHub then notifies Vercel and the auto-PR flow (03 — Git Workflow) takes over.
## Mac vs NAS
| Flow | When to use | Status |
|---|---|---|
| Mac-native (arm64) | Default for development and small teams | Active |
| NAS | Self-hosted always-on worker on a Synology / Asustor / similar | Opt-in, validated by historical smoke tests in docs/docker-smoke/ |
The Mac flow is the default because it doesn't require dedicated hardware. The container runs natively on Apple Silicon (arm64) — no x86 emulation overhead.
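To confirm you really are running natively rather than under emulation, a quick check (the image name `scrum4me-worker` is a placeholder, not the real tag):

```bash
# Host architecture as Docker sees it: expect aarch64 on Apple Silicon.
docker info --format '{{.Architecture}}'

# Architecture inside the (hypothetical) worker image: expect aarch64, not x86_64.
docker run --rm --entrypoint uname scrum4me-worker -m
```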
## Environment variables the worker needs
The worker container needs only what's required to authenticate to MCP and push to GitHub:
| Var | Purpose |
|---|---|
| `SCRUM4ME_BEARER_TOKEN` | Bearer token bound to a product. Returned by the user's API-token settings page. |
| `SCRUM4ME_BASE_URL` | Usually `https://scrum4me.vercel.app` (or the user's domain). |
| `GITHUB_TOKEN` | Personal access token with `repo` scope, used by `git push` and `gh pr create`. |
| `ANTHROPIC_API_KEY` | The Claude API key used by the worker process. |
| `MIN_QUOTA_PCT` | Optional. Worker pauses if Anthropic quota drops below this percentage. |
Hardstop: the worker does not need `DATABASE_URL`, `SESSION_SECRET`, or `CRON_SECRET`. Those belong to the Next.js app; the worker only talks to MCP. If you find yourself adding DB env vars to the worker, stop — you're solving the wrong problem.
The full list and provisioning instructions live in the scrum4me-docker README. TODO: link to specific sections of that README once it's stable.
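Until then, a sketch of the shape the worker's `.env` usually takes; every value below is a placeholder:

```bash
# .env for the worker container (sketch; the scrum4me-docker README is canonical)
SCRUM4ME_BEARER_TOKEN=s4m_xxxxxxxxxxxxxxxx   # from the API-token settings page
SCRUM4ME_BASE_URL=https://scrum4me.vercel.app
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxx            # PAT with repo scope
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxx
MIN_QUOTA_PCT=20                             # optional; pause threshold
```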
## What the worker loop does, on a single iteration
```mermaid
sequenceDiagram
  participant W as Worker
  participant Q as worker-quota-probe.sh
  participant M as MCP server
  W->>Q: probe Anthropic quota
  Q-->>W: { pct, reset_at_iso }
  alt pct < MIN_QUOTA_PCT
    W->>M: worker_heartbeat(pct, last_quota_check_at)
    W->>W: sleep until reset_at_iso (cap 1h)
  else quota ok
    W->>M: worker_heartbeat(pct, last_quota_check_at)
    W->>M: wait_for_job (block ≤600s, claim atomically)
    alt queue empty
      W->>W: continue (no work, loop again)
    else got job
      W->>W: execute by `kind`
      W->>M: update_job_status(done|failed)
    end
  end
  Note over W: continue forever
```
The loop is described authoritatively in docs/runbooks/mcp-integration.md and docs/runbooks/worker-idempotency.md.
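For intuition only, here is the same iteration as a bash sketch; `probe_quota`, `mcp_heartbeat`, `mcp_wait_for_job`, `run_job`, and `mcp_update_job_status` are hypothetical stand-ins for the real scripts and MCP tools:

```bash
#!/usr/bin/env bash
# Non-authoritative sketch of one worker iteration; see the runbooks above.
MIN_QUOTA_PCT="${MIN_QUOTA_PCT:-20}"

while true; do
  read -r pct reset_at_iso < <(probe_quota)   # worker-quota-probe.sh
  mcp_heartbeat "$pct" "$(date -u +%FT%TZ)"   # heartbeat on every iteration

  if (( pct < MIN_QUOTA_PCT )); then
    sleep 3600                                # real loop: until $reset_at_iso, capped at 1h
    continue
  fi

  job="$(mcp_wait_for_job --block 600)"       # blocks <=600s; the claim is atomic server-side
  [[ -z "$job" ]] && continue                 # queue empty: loop again

  if run_job "$job"; then                     # dispatch on the job's `kind`
    mcp_update_job_status "$job" done
  else
    mcp_update_job_status "$job" failed
  fi
done
```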
### Quota probe
`bin/worker-quota-probe.sh` (in scrum4me-docker) makes a tiny call to the Anthropic API to read the current quota percentage and reset time. Cost: ~1 output token per probe (~12 tokens/hour at 5-minute intervals). The default `MIN_QUOTA_PCT` is 20% — typically high enough on Pro/Max plans that the worker never pauses during normal day-job hours.
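The exact probe lives in scrum4me-docker; as an illustration of the mechanism only, the Anthropic Messages API returns documented `anthropic-ratelimit-*` response headers, so a minimal probe could send a 1-output-token request and parse them (model id and header handling here are assumptions, not the real script):

```bash
# Illustration only; bin/worker-quota-probe.sh is the real probe.
headers=$(curl -s -D - -o /dev/null https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5","max_tokens":1,"messages":[{"role":"user","content":"."}]}')

remaining=$(grep -i '^anthropic-ratelimit-tokens-remaining:' <<<"$headers" | tr -dc '0-9')
limit=$(grep -i '^anthropic-ratelimit-tokens-limit:' <<<"$headers" | tr -dc '0-9')
reset=$(grep -i '^anthropic-ratelimit-tokens-reset:' <<<"$headers" | awk '{print $2}' | tr -d '\r')

echo "$(( remaining * 100 / limit )) $reset"   # -> "pct reset_at_iso"
```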
### Heartbeat
Every iteration the worker calls `worker_heartbeat({ last_quota_pct, last_quota_check_at })`. The MCP server emits an SSE event so the NavBar in the Next.js app shows the worker as live. A heartbeat older than 15 seconds is rendered as "offline" / "stand-by" in the UI.
### Stale-claim recovery
If a worker dies mid-job (process crash, container kill, network partition), its claimed job stays `CLAIMED` in the database. After 30 minutes the next `wait_for_job` call automatically requeues it (`CLAIMED` → `QUEUED`) before claiming a fresh one. No manual intervention is required for clean recovery.
When you do need to manually requeue a job (e.g. you killed it intentionally and don't want to wait 30 min), the operator route is the admin board → "Requeue job" button. TODO: confirm the exact UI path; this is not yet documented in docs/runbooks/.
## Running the worker locally
The intended local workflow is Mac-native Docker (per the project's standing `project_docker_default_target` memory). High-level steps (verify against the scrum4me-docker README for exact commands):

- Clone `scrum4me-docker` next to `Scrum4Me/` (so `~/Development/Scrum4Me/scrum4me-docker/`).
- Provision the env vars above (typically a `.env` file in that repo, not committed).
- `docker build` the image and `docker run` it with the env file mounted.
- Watch container logs for the heartbeat/quota cycle.
- Trigger a job from the UI ("Voer alle uit", i.e. "Run all", on the Solo Board) and verify the worker picks it up within ~5 seconds.
TODO: once the `scrum4me-docker` README has stabilised, replace the bullets above with copy-paste-ready commands. Until then, defer to that repo for canonical instructions.
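In the meantime, a rough sketch of the shape those commands usually take (image name, tag, and flags are assumptions):

```bash
# Sketch; the scrum4me-docker README is canonical.
cd ~/Development/Scrum4Me/scrum4me-docker
docker build -t scrum4me-worker .
docker run -d --name scrum4me-worker --env-file .env scrum4me-worker
docker logs -f scrum4me-worker   # expect a heartbeat/quota line every iteration
```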
## Debugging a stuck worker
| Symptom | Likely cause | Fix |
|---|---|---|
| Worker shows offline in NavBar but container is running | `worker_heartbeat` not reaching MCP | Check `SCRUM4ME_BASE_URL` and `SCRUM4ME_BEARER_TOKEN`; tail container logs for HTTP errors |
| Worker logs say "stand-by" indefinitely | `pct < MIN_QUOTA_PCT` and `reset_at` not reached | Lower `MIN_QUOTA_PCT` for testing, or wait for the printed `reset_at_iso` |
| Job stuck `CLAIMED` for >30 min | Worker died mid-job | Wait — auto-requeue triggers on the next `wait_for_job` |
| Worker claims job but never updates status | Crashed before `update_job_status`; container restarted in a loop | Check `docker logs`; the next `wait_for_job` will requeue stale claims |
| `update_job_status` returns 403 | Bearer token doesn't match `claimed_by_token_id` | The token was rotated mid-run; restart with a fresh token |
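Whatever the symptom, the same first-pass checks apply (the container name is a placeholder):

```bash
docker ps --filter name=scrum4me-worker    # is the container actually up?
docker logs --tail 100 scrum4me-worker     # heartbeat/quota lines, HTTP errors
curl -s -o /dev/null -w '%{http_code}\n' "$SCRUM4ME_BASE_URL"   # MCP host reachable?
```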
For deeper troubleshooting see 06 — Troubleshooting.
## Smoke-test references
Historical Docker smoke tests live in docs/docker-smoke/. They validated the worktree-isolation + branch-per-story flow when the Docker worker was first introduced. They are historical — don't expect them to be runnable as-is — but they're a useful reference when you want to verify the same flow on a new container image.
## Deep links
| Topic | Source |
|---|---|
| Container image, Dockerfile, build | scrum4me-docker repo |
| Worker loop & quota check | docs/runbooks/mcp-integration.md |
| Worker idempotency / job-status protocol | docs/runbooks/worker-idempotency.md |
| Historical smoke tests | docs/docker-smoke/ |
| Sandbox / exploration repo | scrum4me-sbx repo |
## What's next
→ 06 — Troubleshooting covers error codes and recovery procedures across the full stack.