fix(agent): backoff long on Anthropic 529 Overloaded, don't UNHEALTHY-cascade #14

Merged
janpeter merged 1 commit from fix/api-overloaded-backoff into master 2026-05-27 14:04:24 +02:00

Problem

Claude API Error: Overloaded (= Anthropic HTTP 529 server-capacity, ≠ HTTP 429 rate-limit) caused a cascade:

  1. claude exits 1 with an overloaded payload
  2. run-agent.sh takes the exit-code-1 path → 5s backoff → retry
  3. Anthropic is still overloaded → exit 1 → 10s backoff
  4. After 5 attempts: too many consecutive failures → UNHEALTHY → run-agent.sh halts
  5. Manual restart required, and by now the 2nd worker is in the same state
  6. The job stays stuck even though Anthropic had calmed down again after ~30s

Confirmed live on 2026-05-27 13:05 CEST during sprint sub-agent dispatch (Sonnet 4.6 peak load).
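
To make the timing of the old failure path concrete, here is a rough model of it. It is only a sketch: it assumes the backoff doubles after every failure (the description shows just the first two steps, 5s and 10s), that the failure limit is 5, and that each attempt fails instantly, so the result is a lower bound. The function name simulate_cascade is hypothetical and does not exist in run-agent.sh.

```bash
#!/usr/bin/env bash
# Hypothetical model of the pre-fix failure path, not the real run-agent.sh.
# Assumptions: backoff doubles per failure, attempts fail instantly.

simulate_cascade() {
  local max_failures=$1 backoff=$2
  local failures=0 slept=0
  while :; do
    failures=$((failures + 1))
    # The worker gives up as soon as the failure limit is reached.
    if [ "$failures" -ge "$max_failures" ]; then
      echo "UNHEALTHY after ${failures} failures, ${slept}s of backoff"
      return 0
    fi
    slept=$((slept + backoff))
    backoff=$((backoff * 2))
  done
}

simulate_cascade 5 5
# prints: UNHEALTHY after 5 failures, 75s of backoff
```

Under these assumptions a worker spends only 75 seconds in backoff before it halts, which is short compared with the 300s overload sleep this PR introduces.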

Fix in two files

bin/run-one-job.ts

  • New constant API_OVERLOAD_PATTERNS (matches API Error: Overloaded + "status_code":529)
  • Detection runs after the TOKEN_EXPIRED check, only on a non-zero exit. Sets apiOverloaded = true
  • rollbackClaim with reason=api_overloaded for traceability (cleanup effect via mcp PR #19)
  • Exit code 4 (new), which distinguishes this case from the generic exit 1
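
The detection order described in these bullets can be sketched in shell as follows. This is an illustration, not the TypeScript in bin/run-one-job.ts: the two pattern strings are the ones named above, but the exact regex, the function name classify_run, and its argument order are assumptions, and the TOKEN_EXPIRED check is only marked by a comment.

```bash
#!/usr/bin/env bash
# Sketch only: the real check lives in bin/run-one-job.ts.
# The alternatives are the two strings named in the PR; the regex form is assumed.
API_OVERLOAD_PATTERNS='API Error: Overloaded|"status_code":529'

# classify_run EXIT_CODE OUTPUT
# Prints the exit code the job runner would report for this run.
classify_run() {
  local exit_code=$1 output=$2
  # Overload detection only applies to failed runs.
  if [ "$exit_code" -eq 0 ]; then
    echo 0
    return 0
  fi
  # (The TOKEN_EXPIRED check would run here, before overload detection.)
  if printf '%s\n' "$output" | grep -Eq "$API_OVERLOAD_PATTERNS"; then
    echo 4   # dedicated overload exit code
  else
    echo "$exit_code"   # any other failure keeps its original exit code
  fi
}
```

For example, `classify_run 1 'API Error: Overloaded'` prints 4, while the same text with exit code 0 prints 0, because a successful run is never reclassified.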

bin/run-agent.sh

  • New env AGENT_OVERLOAD_SLEEP (default 300s)
  • Slotted in between the TOKEN_EXPIRED block (exit 3) and the generic exit≠0 path
  • exit_code=4: sleep AGENT_OVERLOAD_SLEEP (300s by default), state api-overloaded, continue without CONSEC_FAILURES++ or BACKOFF progression
  • Effect: a 529 cascade = one 5-minute nap, no UNHEALTHY. Once Anthropic recovers, the normal queue poll resumes automatically.
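
The exit-code dispatch described in these bullets might look roughly like the sketch below. The names AGENT_OVERLOAD_SLEEP, CONSEC_FAILURES and BACKOFF come from the PR; handle_exit, STATE, SLEEP_FOR, the doubling backoff, the reset on success, and the exit-3 placeholder are assumptions made for illustration. The function only computes the sleep duration instead of sleeping, so it can be tried out safely.

```bash
#!/usr/bin/env bash
# Sketch of the exit-code dispatch; not the real run-agent.sh.
AGENT_OVERLOAD_SLEEP="${AGENT_OVERLOAD_SLEEP:-300}"
CONSEC_FAILURES=0
BACKOFF=5
STATE="idle"
SLEEP_FOR=0

# handle_exit EXIT_CODE
# Updates the loop state and sets SLEEP_FOR to the seconds to wait.
handle_exit() {
  local exit_code=$1
  case "$exit_code" in
    0)
      # Assumed: a successful job resets the failure tracking.
      CONSEC_FAILURES=0; BACKOFF=5; STATE="ok"; SLEEP_FOR=0
      ;;
    3)
      # Placeholder: TOKEN_EXPIRED handling is not described in the PR.
      STATE="token-expired"; SLEEP_FOR=0
      ;;
    4)
      # API overloaded: one long nap, failure counters left untouched.
      STATE="api-overloaded"; SLEEP_FOR=$AGENT_OVERLOAD_SLEEP
      ;;
    *)
      # Generic failure: count it and progress the backoff (doubling assumed).
      CONSEC_FAILURES=$((CONSEC_FAILURES + 1))
      STATE="failed"; SLEEP_FOR=$BACKOFF
      BACKOFF=$((BACKOFF * 2))
      ;;
  esac
}
```

In a loop, the caller would run `handle_exit "$?"` after each job and then `sleep "$SLEEP_FOR"`. The key property is that the exit-4 branch changes neither CONSEC_FAILURES nor BACKOFF, so an overload can never push a worker toward UNHEALTHY.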

Verification post-merge + deploy

  1. Trigger update_mcp_worker on scrum4me-srv
  2. Force-test (without real Anthropic load):
    # Write an overloaded payload to a test run log
    echo 'API Error: Overloaded' > /tmp/fake-overload.log
    # Prove the detection regex: grep -E '<patterns>' /tmp/fake-overload.log

  3. At the next real 529 incident, look in the worker log for [run-one-job] API_OVERLOADED detected + the run-agent.sh log line API OVERLOADED (exit=4) — sleeping 300s
  4. Verify the DB: after a rollback with PR #19 → claude_jobs.status='QUEUED', clean state, retryable
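
Step 3 can be scripted with a small helper like the one below. It is a sketch: the function name check_overload_log is hypothetical, and the location of the worker log is deployment-specific and not given in this PR, so the file is passed in as an argument. Fixed-string matching (grep -F) is used so the brackets and parentheses in the log lines are taken literally.

```bash
#!/usr/bin/env bash
# check_overload_log LOGFILE
# Succeeds only if both log lines quoted in the PR appear in LOGFILE.
check_overload_log() {
  local log=$1
  # Line logged by run-one-job when the overload pattern matches.
  grep -Fq '[run-one-job] API_OVERLOADED detected' "$log" &&
    # Line logged by run-agent.sh when it takes the exit-4 path.
    grep -Fq 'API OVERLOADED (exit=4)' "$log"
}
```

Usage would be along the lines of `check_overload_log /path/to/worker.log && echo "overload path taken"`, with the path replaced by wherever the worker writes its log.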

Out of scope

  • AGENT_MAX_FAILURES bump from 5 → 15: no longer needed once this PR + mcp #19 are in place. The counter stays at 5 for real worker failures.
  • 529 detection in the Anthropic SDK call itself (could be more precise, but the claude CLI exposes only exit code + stdout). Pattern-matching on stdout is sufficient for the vast majority of cases.

Rollback

Revert commit + update_mcp_worker flow → back to the generic exit-1 path. No state impact.

Dependency

Scrum4me-mcp PR #19 (rollbackClaim cleanup) is co-dependent: without it, the worktree + sprint_task_executions are left hanging after an overload rollback too → the retry after the 300s sleep would still fail on stale state. Both deploy via a single update_mcp_worker rebuild (the cache-bust pulls the latest scrum4me-mcp main + bundles the scrum4me-docker bin/* updates).

When Claude returns `API Error: Overloaded` (Anthropic HTTP 529,
server-capacity exhaustion) every 5s retry is counterproductive AND
counts toward AGENT_MAX_FAILURES → both workers UNHEALTHY in <2 min →
run-agent.sh halts, manual restart required.

Live observed 2026-05-27 13:05 CEST: peak-hour Sonnet capacity
shortage during sprint sub-agent dispatch → cascade above.

Fix in two parts:

run-one-job.ts:
- New API_OVERLOAD_PATTERNS (matches "API Error: Overloaded" + 529)
- Detection AFTER token-expiry check, only on non-zero exit
- Sets apiOverloaded = true, logs detection, rollbackClaim with
  reason=api_overloaded (rollback cleanup happens via mcp PR #19)
- Returns exit code 4 (new) instead of generic 1

run-agent.sh:
- New AGENT_OVERLOAD_SLEEP env (default 300s)
- exit_code=4 path: sleep AGENT_OVERLOAD_SLEEP, write state
  "api-overloaded", continue loop WITHOUT incrementing CONSEC_FAILURES
  or progressing BACKOFF
- Slot AFTER TOKEN_EXPIRED handling, BEFORE generic failure block

Effect: a 529-cascade triggers one 5-min nap per worker instead of
5×5s-retries-then-UNHEALTHY. When Anthropic recovers, normal queue
poll resumes. Pairs with scrum4me-mcp PR #19 which makes the rollback
actually retry-able (cleans worktree + sprint_task_executions).
janpeter merged commit b420361e44 into master 2026-05-27 14:04:24 +02:00
Reference
janpeter/scrum4me-docker!14