Data Quality Layers (A / B / C / D)

Codified 2026-05-12 (commits e43de39b → 7f2789c5 → d2567b7b → migration 068). Trigger: the 2026-05-12 audit found 258 abandoned docs (pending > 48 h), 40 duplicate doctor profiles, 312 Lorem ipsum chunks sitting silently in the production corpus for weeks. None caught by pytest. None caught by deploy.

The project enforces four parallel code-quality gates: ruff (style), pyright (types), tsc (frontend types), eslint (frontend style). Until this sprint, the data layer had no equivalent: bad data could land in the corpus and stay there for months without surfacing.

Layers A / B / C / D mirror the four code-side gates at the data layer. Each gate has a single owner, a single failure mode, and a clear escalation:

Layer	Mirror	What	Failure escalation
A	Daily observability	`audit_data_quality.py` cron at 03:30 UTC; emits markdown report; exit codes 0 / 1 / 2 (clean / warning / alert)	Structured log entry `data_quality_audit_completed` with `severity` field — existing log infra alerts on WARNING / ALERT
B	Canonicalization + dedup at ingest	Pre-insert `find_document_by_source_url` check + canonical-URL extraction; migration 068 partial unique index on `(tenant_id, normalized_title) WHERE status='completed'`	`IntegrityError` at the DB layer — application gets it as `DuplicateDocument`; the ingest pipeline skips the duplicate and logs it
C	Sanitization + extraction	`sanitize_corpus_text` (strips template Lorem ipsum) + `extract_consultation_schedule` (structured JSON from ZOL Drupal tables) called at `document_service.py:1262+` and `:1314+`	Bad chunks rejected at ingest; existing chunks backfilled by `scripts/backfill_consultation_schedule.py` (idempotent)
D	Post-completion verification	Verification logic inside Layer A audit script — confirms what landed matches what was crawled	Same as Layer A — markdown report + log severity

Layer A — Daily observability

backend/scripts/audit_data_quality.py (~325 lines, single file, no new deps). Wired into APScheduler at 03:30 UTC (30 min after the 03:00 ingest cron) in commit d2567b7b.

Metrics emitted to tests/evaluation/results/data-quality-<YYYYMMDD>.md:

Pending-age p50/p95 (soft >24 h, hard >48 h). PostgreSQL percentile uses PERCENTILE_DISC not PERCENTILE_CONT — interpolated timestamps don't cast back to timestamp with time zone (commit 9fa6fa16).
Failure-reason histogram (soft >25, hard >100). Two classes split per refinement 7f2789c5:
- failed_real — threshold-checked. Real ingest failures.
- intentional_softdeletes — informational only. Rows with processing_error LIKE 'superseded-by-%' OR 'broken-extraction-%'. The histogram view marks each soft-delete row with ✱.
Duplicate-title count (alert on ≥1; should always be 0 with Layer B's partial unique index).
Per-tenant doctor coverage (% with schedule, % with Lorem ipsum).
Chunk-quality stats (short / empty / lorem).

Layer B — Canonicalization + dedup gate

Two complementary mechanisms at document_service.py:1262+:

Canonical-URL extraction. The crawler already extracted <link rel="canonical"> but discarded it. Now stored in doc_metadata.canonical_url.
Pre-insert dedup check. Before INSERT, query for an existing completed doc whose normalized title (with | <brand> suffix stripped) matches. If found, skip.

Migration 068 (commit d2567b7b) adds a partial unique index on (tenant_id, normalized_title) WHERE status='completed'. Schema-level enforcement: even if application-layer dedup is bypassed (manual insert, race, future code path), the DB rejects the duplicate. Partial because pending/processing/failed rows transit through transient duplicate states by design.

Companion fix 26ba1562 rejects ingest of docs whose title normalizes to empty string (a common drift mode for very-short pages — a single image with no readable text would normalize to empty and silently dedup against other empty-normalized docs).

Layer C — Sanitization + extraction

Sanitizer (`cfcb4e2b`)

One-shot script (scripts/sanitize_lorem_chunks.py) that walks existing document_chunks, detects template Lorem ipsum runs (/lorem ipsum dolor sit amet/i over ≥10 consecutive tokens), and strips them. Idempotent — safe to run on a corpus that's been sanitized before. Backfill-safe — only chunks with status='completed' are touched.

Layer B's empty-normalized-title guard (26ba1562) is the prevention side of Layer C's cleanup: future Lorem-only docs are rejected at ingest, so Layer C's sanitizer drains a finite backlog.

Schedule-table extractor (`7f2789c5`)

extract_consultation_schedule() in data_quality.py. Parses the ZOL Drupal schedule-table format:

| | MA | Di | WO | DO | VR | ZA |
| VM | RP2w | RP2w | RP2w | RP | RP2w | |
| NM | RP   |      |      |    |      | |

into structured JSON stored at doc_metadata.consultation_schedule:

{
  "MA": {"VM": "RP2w", "NM": "RP"},
  "Di": {"VM": "RP2w"},
  "WO": {"VM": "RP2w"},
  "DO": {"VM": "RP"},
  "VR": {"VM": "RP2w"}
}

Empty cells are omitted; all-empty tables return None. Future cell codes (RP3w, etc.) pass through verbatim — no fixed vocabulary constraint.

Hooked into document_service.py:1314+ so every new ingested doc gets its schedule extracted. Existing docs backfilled by scripts/backfill_consultation_schedule.py (one-shot, idempotent).

Important pairing: this extractor is the structured-data fallback to ADR-0057's ZOL_DOCTOR_SCHEDULE_RULE prompt rule. The prompt rule teaches the LLM to parse markdown when consumers haven't been upgraded yet; the extractor makes the structured JSON available for any future code path that wants deterministic schedule answers without LLM table parsing.

extract_consultation_schedule is tenant-scoped to the ZOL format. Future tenants get their own extractor analogous to _TENANT_CHAT_ADDENDUMS — same architectural pattern as ADR-0057.

Layer D — Post-completion verification

Verification logic inside Layer A's audit script. Confirms what landed in documents + document_chunks matches what was crawled (URL → chunk count, status flow consistency, expected-vs-actual completion rate per ingest run).

Layer D is the what we check dimension; Layer A is the when we check it dimension. There's no separate Layer D commit beyond e43de39b — both ship together because the verification rules are most expressive in the same script that already walks the corpus.

Why four layers, not one

A monolithic "data quality" script would have three problems:

Ownership ambiguity — when the dedup gate fires, is that a Layer A audit problem or a Layer B ingest problem?
Latency mixing — Layer B runs synchronously per insert (must be fast); Layer A runs nightly (can be slow); Layer C runs opportunistically (per ingest); Layer D runs async after Layer A finishes. Bundling them would force one cadence to bend to another.
Failure isolation — a Layer C extractor bug (e.g., the empty-table edge case in e03ff4cb) should not block ingest. Splitting C from B keeps the gate close to the failure mode.

The four-layer split mirrors how the code-side gates work: ruff runs in seconds, pyright runs in minutes, tsc runs on the frontend tree only, eslint runs after build. Same separation of cadence + concern.

Companion fixes that surfaced during implementation

Commit	What	Why surfaced
`357725cb`	Use `d.metadata` column (DB) not `d.doc_metadata` (Python attribute)	Backfill SQL was reading the ORM attribute name
`e03ff4cb`	Pin department-overview format → None decision	Department-overview pages have no schedule by design; the extractor must return None, not raise
`f0a577aa` + `6c3ddfb5`	Multi-match handling in `find_document_by_source_url`	Layer B's dedup needs deterministic behavior when the same canonical URL surfaces in two different ingest passes
`9fa6fa16`	`PERCENTILE_DISC` for timestamp percentile	`PERCENTILE_CONT` interpolates between two timestamps, producing fractional-microsecond values PostgreSQL can't cast back to `timestamp with time zone`

References

Commits: e43de39b (Layer A/B/C/D shape) · 7f2789c5 (Layer C extractor + Layer A refinement) · d2567b7b (migration 068 + scheduler wiring) · cfcb4e2b (Lorem ipsum sanitizer)
Schema: backend/alembic/versions/068_data_quality_dedup_index.py — partial unique index on (tenant_id, normalized_title) WHERE status='completed'
Audit script: backend/scripts/audit_data_quality.py
Extractor: backend/app/services/data_quality.py — extract_consultation_schedule()
Backfill: backend/scripts/backfill_consultation_schedule.py (one-shot, idempotent)
Related: ADR-0057 Tenant-Scoped Prompt Addendums + Doctor-Profile Boost — Layer C's schedule extractor is the structured-data half of ADR-0057's defense-in-depth pair.

Layer A — Daily observability​

Layer B — Canonicalization + dedup gate​

Layer C — Sanitization + extraction​

Sanitizer (cfcb4e2b)​

Schedule-table extractor (7f2789c5)​

Layer D — Post-completion verification​

Why four layers, not one​

Companion fixes that surfaced during implementation​

References​