Data Quality Layers (A / B / C / D)
Codified 2026-05-12 (commits
e43de39b→7f2789c5→d2567b7b→ migration 068). Trigger: the 2026-05-12 audit found 258 abandoned docs (pending > 48 h), 40 duplicate doctor profiles, 312 Lorem ipsum chunks sitting silently in the production corpus for weeks. None caught bypytest. None caught by deploy.
The project enforces four parallel code-quality gates: ruff (style), pyright (types), tsc (frontend types), eslint (frontend style). Until this sprint, the data layer had no equivalent: bad data could land in the corpus and stay there for months without surfacing.
Layers A / B / C / D mirror the four code-side gates at the data layer. Each gate has a single owner, a single failure mode, and a clear escalation:
| Layer | Mirror | What | Failure escalation |
|---|---|---|---|
| A | Daily observability | audit_data_quality.py cron at 03:30 UTC; emits markdown report; exit codes 0 / 1 / 2 (clean / warning / alert) | Structured log entry data_quality_audit_completed with severity field — existing log infra alerts on WARNING / ALERT |
| B | Canonicalization + dedup at ingest | Pre-insert find_document_by_source_url check + canonical-URL extraction; migration 068 partial unique index on (tenant_id, normalized_title) WHERE status='completed' | IntegrityError at the DB layer — application gets it as DuplicateDocument; the ingest pipeline skips the duplicate and logs it |
| C | Sanitization + extraction | sanitize_corpus_text (strips template Lorem ipsum) + extract_consultation_schedule (structured JSON from ZOL Drupal tables) called at document_service.py:1262+ and :1314+ | Bad chunks rejected at ingest; existing chunks backfilled by scripts/backfill_consultation_schedule.py (idempotent) |
| D | Post-completion verification | Verification logic inside Layer A audit script — confirms what landed matches what was crawled | Same as Layer A — markdown report + log severity |
Layer A — Daily observability
backend/scripts/audit_data_quality.py (~325 lines, single file, no new deps). Wired into APScheduler at 03:30 UTC (30 min after the 03:00 ingest cron) in commit d2567b7b.
Metrics emitted to tests/evaluation/results/data-quality-<YYYYMMDD>.md:
- Pending-age p50/p95 (soft >24 h, hard >48 h). PostgreSQL percentile uses
PERCENTILE_DISCnotPERCENTILE_CONT— interpolated timestamps don't cast back totimestamp with time zone(commit9fa6fa16). - Failure-reason histogram (soft >25, hard >100). Two classes split per refinement
7f2789c5:failed_real— threshold-checked. Real ingest failures.intentional_softdeletes— informational only. Rows withprocessing_error LIKE 'superseded-by-%' OR 'broken-extraction-%'. The histogram view marks each soft-delete row with✱.
- Duplicate-title count (alert on ≥1; should always be 0 with Layer B's partial unique index).
- Per-tenant doctor coverage (% with schedule, % with Lorem ipsum).
- Chunk-quality stats (short / empty / lorem).
Layer B — Canonicalization + dedup gate
Two complementary mechanisms at document_service.py:1262+:
- Canonical-URL extraction. The crawler already extracted
<link rel="canonical">but discarded it. Now stored indoc_metadata.canonical_url. - Pre-insert dedup check. Before INSERT, query for an existing completed doc whose normalized title (with
| <brand>suffix stripped) matches. If found, skip.
Migration 068 (commit d2567b7b) adds a partial unique index on (tenant_id, normalized_title) WHERE status='completed'. Schema-level enforcement: even if application-layer dedup is bypassed (manual insert, race, future code path), the DB rejects the duplicate. Partial because pending/processing/failed rows transit through transient duplicate states by design.
Companion fix 26ba1562 rejects ingest of docs whose title normalizes to empty string (a common drift mode for very-short pages — a single image with no readable text would normalize to empty and silently dedup against other empty-normalized docs).
Layer C — Sanitization + extraction
Sanitizer (cfcb4e2b)
One-shot script (scripts/sanitize_lorem_chunks.py) that walks existing document_chunks, detects template Lorem ipsum runs (/lorem ipsum dolor sit amet/i over ≥10 consecutive tokens), and strips them. Idempotent — safe to run on a corpus that's been sanitized before. Backfill-safe — only chunks with status='completed' are touched.
Layer B's empty-normalized-title guard (26ba1562) is the prevention side of Layer C's cleanup: future Lorem-only docs are rejected at ingest, so Layer C's sanitizer drains a finite backlog.
Schedule-table extractor (7f2789c5)
extract_consultation_schedule() in data_quality.py. Parses the ZOL Drupal schedule-table format:
| | MA | Di | WO | DO | VR | ZA |
| VM | RP2w | RP2w | RP2w | RP | RP2w | |
| NM | RP | | | | | |
into structured JSON stored at doc_metadata.consultation_schedule:
{
"MA": {"VM": "RP2w", "NM": "RP"},
"Di": {"VM": "RP2w"},
"WO": {"VM": "RP2w"},
"DO": {"VM": "RP"},
"VR": {"VM": "RP2w"}
}
Empty cells are omitted; all-empty tables return None. Future cell codes (RP3w, etc.) pass through verbatim — no fixed vocabulary constraint.
Hooked into document_service.py:1314+ so every new ingested doc gets its schedule extracted. Existing docs backfilled by scripts/backfill_consultation_schedule.py (one-shot, idempotent).
Important pairing: this extractor is the structured-data fallback to ADR-0057's ZOL_DOCTOR_SCHEDULE_RULE prompt rule. The prompt rule teaches the LLM to parse markdown when consumers haven't been upgraded yet; the extractor makes the structured JSON available for any future code path that wants deterministic schedule answers without LLM table parsing.
extract_consultation_schedule is tenant-scoped to the ZOL format. Future tenants get their own extractor analogous to _TENANT_CHAT_ADDENDUMS — same architectural pattern as ADR-0057.
Layer D — Post-completion verification
Verification logic inside Layer A's audit script. Confirms what landed in documents + document_chunks matches what was crawled (URL → chunk count, status flow consistency, expected-vs-actual completion rate per ingest run).
Layer D is the what we check dimension; Layer A is the when we check it dimension. There's no separate Layer D commit beyond e43de39b — both ship together because the verification rules are most expressive in the same script that already walks the corpus.
Why four layers, not one
A monolithic "data quality" script would have three problems:
- Ownership ambiguity — when the dedup gate fires, is that a Layer A audit problem or a Layer B ingest problem?
- Latency mixing — Layer B runs synchronously per insert (must be fast); Layer A runs nightly (can be slow); Layer C runs opportunistically (per ingest); Layer D runs async after Layer A finishes. Bundling them would force one cadence to bend to another.
- Failure isolation — a Layer C extractor bug (e.g., the empty-table edge case in
e03ff4cb) should not block ingest. Splitting C from B keeps the gate close to the failure mode.
The four-layer split mirrors how the code-side gates work: ruff runs in seconds, pyright runs in minutes, tsc runs on the frontend tree only, eslint runs after build. Same separation of cadence + concern.
Companion fixes that surfaced during implementation
| Commit | What | Why surfaced |
|---|---|---|
357725cb | Use d.metadata column (DB) not d.doc_metadata (Python attribute) | Backfill SQL was reading the ORM attribute name |
e03ff4cb | Pin department-overview format → None decision | Department-overview pages have no schedule by design; the extractor must return None, not raise |
f0a577aa + 6c3ddfb5 | Multi-match handling in find_document_by_source_url | Layer B's dedup needs deterministic behavior when the same canonical URL surfaces in two different ingest passes |
9fa6fa16 | PERCENTILE_DISC for timestamp percentile | PERCENTILE_CONT interpolates between two timestamps, producing fractional-microsecond values PostgreSQL can't cast back to timestamp with time zone |
References
- Commits:
e43de39b(Layer A/B/C/D shape) ·7f2789c5(Layer C extractor + Layer A refinement) ·d2567b7b(migration 068 + scheduler wiring) ·cfcb4e2b(Lorem ipsum sanitizer) - Schema:
backend/alembic/versions/068_data_quality_dedup_index.py— partial unique index on(tenant_id, normalized_title) WHERE status='completed' - Audit script:
backend/scripts/audit_data_quality.py - Extractor:
backend/app/services/data_quality.py—extract_consultation_schedule() - Backfill:
backend/scripts/backfill_consultation_schedule.py(one-shot, idempotent) - Related: ADR-0057 Tenant-Scoped Prompt Addendums + Doctor-Profile Boost — Layer C's schedule extractor is the structured-data half of ADR-0057's defense-in-depth pair.