feat(content-policy): Product.content_policy + AVG checker (sub-project C, Phase 0) #16

Merged
janpeter merged 6 commits from feat/copilot-content-policy into main 2026-06-13 21:25:01 +02:00

Sub-project C, Phase 0: the canonical foundation for the per-product AVG (GDPR) content-policy gate.

What

  • prisma/schema.prisma: Product.content_policy Json? — additive, nullable (null = no restriction; no behaviour change for existing products).
  • lib/content-policy.ts (dependency-free):
    • parseContentPolicy — fail-closed on malformed JSON and on a self-contradictory policy (an allowedFieldTerm that would mask its own forbidden field).
    • checkContentPolicy — standalone, boundary-aware allowlist masking + substring field/feature match. Pure string-ops (no RegExp built from policy data → no injection/ReDoS), code-point-correct boundaries, anti-evasion normalize (zero-width/format-char strip + whitespace-collapse).
  • __tests__/content-policy.test.ts — 23 tests.

Policy shape

{ forbiddenFields: string[], forbiddenFeatureTerms: string[], allowedFieldTerms: string[] } | null
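
As a rough illustration, the shape above could be written as a TypeScript type like the one below. The example values are hypothetical (the terms `bsn`, `adres`, `e-mailadres` and `ip-adres` appear in the commit messages of this PR, `medisch dossier` is made up) and do not come from any real product seed.

```typescript
// Shape as stated in this PR; null means "no restriction".
type ContentPolicy = {
  forbiddenFields: string[];
  forbiddenFeatureTerms: string[];
  allowedFieldTerms: string[];
} | null;

// Hypothetical example values, for illustration only.
const examplePolicy: ContentPolicy = {
  forbiddenFields: ["bsn", "adres"],
  forbiddenFeatureTerms: ["medisch dossier"],
  allowedFieldTerms: ["e-mailadres", "ip-adres"],
};

// An existing product without a policy keeps its current behaviour.
const unrestricted: ContentPolicy = null;
```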

Review record

3 codex rounds + 3 independent adversarial rounds; every finding was reproduced against the real code and fixed: 1×P1 (prefix-compound bypass), 4×P2 (incl. allowlist-side re-open + non-BMP/astral boundary), several P3. codex final verdict: approved, no P0–P3 findings. npm run verify green (131/131).

Next (NOT in this PR)

Canonical schema + checker only. The live-DB migration is 154's lane (Phase 1: bump the consumer submodule → prisma migrate dev authors the SQL → prisma migrate deploy on the live DB, migrate-first), after which the enforcement gates land in scrum4me-mcp + Scrum4Me-web (Phases 2/3). Design: spec rev3.1; plan: the C implementation plan.

🤖 Generated with Claude Code

Adversarial review found two bypasses in the lexical checker:
- invisible chars (ZWSP/BOM/soft-hyphen) inside a forbidden field passed
- multi-word feature terms split by double-space/newline/tab missed
  (call-sites join title+\n+description, so a split phrase silently missed)

normalize() now strips \p{Cf} characters and U+00AD, and collapses whitespace. +2 regression tests.
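
A minimal sketch of such a normalizer, assuming it only needs the two steps named above. The function name `normalizeSketch` and the final `trim()` are assumptions, and the real `normalize()` may do more (for example case folding). The regular expressions are constants, so nothing is built from policy data.

```typescript
// Sketch only: strips format characters and collapses whitespace.
function normalizeSketch(text: string): string {
  return text
    // \p{Cf} covers ZWSP (U+200B), BOM (U+FEFF) and other format chars;
    // U+00AD (soft hyphen) is listed explicitly, as in the commit message.
    .replace(/[\p{Cf}\u00AD]/gu, "")
    // Any run of spaces, tabs or newlines becomes a single space.
    .replace(/\s+/g, " ")
    .trim(); // trimming is an assumption, not stated in the source
}

// Hypothetical inputs: an invisible char inside a field name, and a
// multi-word term ("medisch dossier" is a made-up example) split by whitespace.
const hiddenField = normalizeSketch("b\u200Bs\u00ADn");
const splitTerm = normalizeSketch("medisch  \n\tdossier");
```

After normalization a plain substring match sees the same text regardless of which invisible characters or whitespace the author used.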

Adversarial review found the lookbehind was only a LEFT anchor, so prefix-
compounds of a forbidden field passed: clientbsn, woonadres, mijnbsn, clientadres.
These are the most natural forbidden field names in a healthcare app, so this
was a real bypass.

Field-matching is now substring AFTER stripping the policy's allowedFieldTerms
(e-mailadres/ip-adres) from the text. No RegExp is built from per-product data
anymore, so the injection/ReDoS surface is gone too. ContentPolicy gains
allowedFieldTerms; DigiPlein's seed (sub-project D) must provide it + curate
forbiddenFields to avoid generic-token false-positives (finding #5).

A focused adversarial recheck found the rev3 allowlist-strip RE-OPENED the
prefix-compound bypass from the allowlist side: a greedy substring strip of
'ip-adres' also ate the 'adres' inside larger compounds, so ip-adresboek,
ip-adreslijst, ip-adresgegevens, e-mailadresboek all passed. Plus an order-
dependence + space-fusion P3.
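
The bypass can be reproduced with a small reconstruction of the superseded behaviour. This is an assumption about how the rev3 strip worked (in particular, that stripped terms were replaced by the empty string), not the actual code.

```typescript
// Reconstruction for illustration only: greedy substring strip, then substring match.
function greedyStripCheck(
  text: string,
  forbiddenFields: string[],
  allowedFieldTerms: string[],
): string | null {
  let stripped = text;
  for (const term of allowedFieldTerms) {
    // split/join removes every occurrence without building a RegExp.
    stripped = stripped.split(term).join("");
  }
  // Return the first forbidden field still present, or null if none is.
  return forbiddenFields.find((field) => stripped.includes(field)) ?? null;
}

// "ip-adresboek" loses its "ip-adres" prefix, leaving "boek": the forbidden
// "adres" is gone and the compound passes.
const bypass = greedyStripCheck("ip-adresboek", ["adres"], ["ip-adres"]);
// A compound that contains no allowed term is still caught.
const caught = greedyStripCheck("woonadres", ["adres"], ["ip-adres"]);
```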

Allowed compounds are now masked only when STANDALONE (bounded by non-token
chars, hyphen counted as part of the compound), computed on the original
haystack so the verdict is order-independent and masking can never fuse a new
multi-word field. +4 regression tests (128/128).
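
The standalone-masking idea can be sketched as follows. All names are hypothetical, and the real checker may differ in detail (this sketch skips normalization and treats a partially masked hit as a violation, which is an assumption). The mask is computed on the original text and nothing is removed, so the order of the allowed terms does not matter and no words can fuse.

```typescript
// Token chars: letters, digits and the hyphen (hyphen counts as part of a compound).
const isTokenChar = (ch: string | undefined): boolean =>
  ch !== undefined && /^[\p{L}\p{N}-]$/u.test(ch);

// True at every code-point position covered by a STANDALONE allowed term.
function standaloneMask(chars: string[], allowedTerms: string[]): boolean[] {
  const mask = new Array<boolean>(chars.length).fill(false);
  for (const term of allowedTerms) {
    const t = Array.from(term);
    if (t.length === 0) continue;
    for (let i = 0; i + t.length <= chars.length; i++) {
      if (!t.every((c, k) => chars[i + k] === c)) continue;
      const before = i > 0 ? chars[i - 1] : undefined;
      const after = chars[i + t.length];
      // Mask only when both neighbours are non-token chars (or text edges).
      if (!isTokenChar(before) && !isTokenChar(after)) {
        for (let k = 0; k < t.length; k++) mask[i + k] = true;
      }
    }
  }
  return mask;
}

// Returns the first forbidden field found outside masked spans, or null.
function findForbiddenField(
  text: string,
  forbiddenFields: string[],
  allowedFieldTerms: string[],
): string | null {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const mask = standaloneMask(chars, allowedFieldTerms);
  for (const field of forbiddenFields) {
    const f = Array.from(field);
    if (f.length === 0) continue;
    for (let i = 0; i + f.length <= chars.length; i++) {
      const hit = f.every((c, k) => chars[i + k] === c);
      // A hit is ignored only if it lies entirely inside a masked span.
      if (hit && !f.every((_, k) => mask[i + k] === true)) return field;
    }
  }
  return null;
}

const allowed = ["e-mailadres", "ip-adres"];
const standaloneOk = findForbiddenField("e-mailadres en ip-adres", ["adres"], allowed);
const compoundBlocked = findForbiddenField("ip-adresboek", ["adres"], allowed);
```

Under these assumptions the standalone terms pass while `ip-adresboek` is reported, because the `b` after `ip-adres` is a token char and the term is therefore not masked.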

Also synced the stale top-comment + schema comment to the rev3 shape (codex P3).

Round-3 review. codex (P2): the standalone-masking read neighbour chars as UTF-16
code units, so an astral letter/digit beside an allowed term was a lone surrogate
-> not a token char -> term wrongly 'standalone' -> masked. Repro: '𠀀ip-adres' /
'ip-adres𠀀' passed while 'xip-adres'/'ip-adres9' blocked. Now reads whole code
points (codePointBefore/codePointAtIndex).
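
A hedged sketch of what code-point-aware neighbour reads can look like. The helper names follow the commit message, but their bodies and signatures here are assumptions, as is `isTokenCodePoint`.

```typescript
// Code point that ends just before UTF-16 index `index`, or undefined at the start.
function codePointBefore(s: string, index: number): number | undefined {
  if (index <= 0) return undefined;
  const low = s.charCodeAt(index - 1);
  if (low >= 0xdc00 && low <= 0xdfff && index >= 2) {
    const high = s.charCodeAt(index - 2);
    // A low surrogate preceded by a high surrogate forms one astral code point.
    if (high >= 0xd800 && high <= 0xdbff) return s.codePointAt(index - 2);
  }
  return low;
}

// Code point that starts at UTF-16 index `index`, or undefined past the end.
function codePointAtIndex(s: string, index: number): number | undefined {
  return index < s.length ? s.codePointAt(index) : undefined;
}

// Letters, digits and hyphen count as token chars (hypothetical helper).
function isTokenCodePoint(cp: number | undefined): boolean {
  return cp !== undefined && /[\p{L}\p{N}-]/u.test(String.fromCodePoint(cp));
}

// U+20000 is the astral CJK ideograph used in the repro above.
const left = "\u{20000}ip-adres";
const start = left.indexOf("ip-adres"); // 2: the ideograph takes two UTF-16 units
const wholeCodePoint = isTokenCodePoint(codePointBefore(left, start));
const loneSurrogate = isTokenCodePoint(left.charCodeAt(start - 1));
```

Reading the single UTF-16 unit before the term yields a lone low surrogate, which is neither a letter nor a digit; reading the whole code point yields a letter, so the term is no longer treated as standalone.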

Adversarial recheck (P3): a self-contradictory policy where an allowedFieldTerm
standalone-swallows a forbidden field (e.g. allowed 'adres' vs forbidden 'adres',
or allowed 'biologisch geslacht' vs forbidden 'geslacht') silently disabled it.
parseContentPolicy now fails closed on that (intended suffix-overlaps like 'adres'
inside 'e-mailadres' stay legal). +3 tests (131/131).
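
One way to express that fail-closed rule is sketched below. The name `parsePolicySketch`, the choice to throw, and the exact contradiction test (a forbidden field occurring standalone inside an allowed term) are assumptions; the real `parseContentPolicy` may signal failure differently.

```typescript
type Policy = {
  forbiddenFields: string[];
  forbiddenFeatureTerms: string[];
  allowedFieldTerms: string[];
};

const isTokenChar = (ch: string | undefined): boolean =>
  ch !== undefined && /^[\p{L}\p{N}-]$/u.test(ch);

// True if `needle` occurs in `haystack` bounded by non-token chars or text edges.
function occursStandalone(needle: string, haystack: string): boolean {
  const h = Array.from(haystack);
  const n = Array.from(needle);
  if (n.length === 0) return false;
  for (let i = 0; i + n.length <= h.length; i++) {
    if (!n.every((c, k) => h[i + k] === c)) continue;
    const before = i > 0 ? h[i - 1] : undefined;
    if (!isTokenChar(before) && !isTokenChar(h[i + n.length])) return true;
  }
  return false;
}

const isStringArray = (v: unknown): v is string[] =>
  Array.isArray(v) && v.every((x) => typeof x === "string");

// Returns null for "no restriction"; throws on anything malformed or contradictory.
function parsePolicySketch(raw: unknown): Policy | null {
  if (raw === null || raw === undefined) return null;
  if (typeof raw !== "object") throw new Error("content policy must be an object");
  const o = raw as Record<string, unknown>;
  const forbiddenFields = o["forbiddenFields"];
  const forbiddenFeatureTerms = o["forbiddenFeatureTerms"];
  const allowedFieldTerms = o["allowedFieldTerms"];
  if (
    !isStringArray(forbiddenFields) ||
    !isStringArray(forbiddenFeatureTerms) ||
    !isStringArray(allowedFieldTerms)
  ) {
    throw new Error("content policy has the wrong shape");
  }
  for (const allowedTerm of allowedFieldTerms) {
    for (const field of forbiddenFields) {
      // An allowed term that contains the field as a standalone word would mask it.
      if (occursStandalone(field, allowedTerm)) {
        throw new Error(`allowed term "${allowedTerm}" masks forbidden field "${field}"`);
      }
    }
  }
  return { forbiddenFields, forbiddenFeatureTerms, allowedFieldTerms };
}
```

With this rule, allowed `biologisch geslacht` against forbidden `geslacht` is rejected, while allowed `e-mailadres` or `ip-adres` against forbidden `adres` is accepted because `adres` is preceded by a token char inside the compound.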
Reference
janpeter/scrum4me-shared!16