Skip to main content

Data Quality Layers (A / B / C / D)

Codified 2026-05-12 (commits e43de39b7f2789c5d2567b7b → migration 068). Trigger: the 2026-05-12 audit found 258 abandoned docs (pending > 48 h), 40 duplicate doctor profiles, 312 Lorem ipsum chunks sitting silently in the production corpus for weeks. None caught by pytest. None caught by deploy.

The project enforces four parallel code-quality gates: ruff (style), pyright (types), tsc (frontend types), eslint (frontend style). Until this sprint, the data layer had no equivalent: bad data could land in the corpus and stay there for months without surfacing.

Layers A / B / C / D mirror the four code-side gates at the data layer. Each gate has a single owner, a single failure mode, and a clear escalation:

LayerMirrorWhatFailure escalation
ADaily observabilityaudit_data_quality.py cron at 03:30 UTC; emits markdown report; exit codes 0 / 1 / 2 (clean / warning / alert)Structured log entry data_quality_audit_completed with severity field — existing log infra alerts on WARNING / ALERT
BCanonicalization + dedup at ingestPre-insert find_document_by_source_url check + canonical-URL extraction; migration 068 partial unique index on (tenant_id, normalized_title) WHERE status='completed'IntegrityError at the DB layer — application gets it as DuplicateDocument; the ingest pipeline skips the duplicate and logs it
CSanitization + extractionsanitize_corpus_text (strips template Lorem ipsum) + extract_consultation_schedule (structured JSON from ZOL Drupal tables) called at document_service.py:1262+ and :1314+Bad chunks rejected at ingest; existing chunks backfilled by scripts/backfill_consultation_schedule.py (idempotent)
DPost-completion verificationVerification logic inside Layer A audit script — confirms what landed matches what was crawledSame as Layer A — markdown report + log severity

Layer A — Daily observability

backend/scripts/audit_data_quality.py (~325 lines, single file, no new deps). Wired into APScheduler at 03:30 UTC (30 min after the 03:00 ingest cron) in commit d2567b7b.

Metrics emitted to tests/evaluation/results/data-quality-<YYYYMMDD>.md:

  • Pending-age p50/p95 (soft >24 h, hard >48 h). PostgreSQL percentile uses PERCENTILE_DISC not PERCENTILE_CONT — interpolated timestamps don't cast back to timestamp with time zone (commit 9fa6fa16).
  • Failure-reason histogram (soft >25, hard >100). Two classes split per refinement 7f2789c5:
    • failed_real — threshold-checked. Real ingest failures.
    • intentional_softdeletes — informational only. Rows with processing_error LIKE 'superseded-by-%' OR 'broken-extraction-%'. The histogram view marks each soft-delete row with .
  • Duplicate-title count (alert on ≥1; should always be 0 with Layer B's partial unique index).
  • Per-tenant doctor coverage (% with schedule, % with Lorem ipsum).
  • Chunk-quality stats (short / empty / lorem).

Layer B — Canonicalization + dedup gate

Two complementary mechanisms at document_service.py:1262+:

  1. Canonical-URL extraction. The crawler already extracted <link rel="canonical"> but discarded it. Now stored in doc_metadata.canonical_url.
  2. Pre-insert dedup check. Before INSERT, query for an existing completed doc whose normalized title (with | <brand> suffix stripped) matches. If found, skip.

Migration 068 (commit d2567b7b) adds a partial unique index on (tenant_id, normalized_title) WHERE status='completed'. Schema-level enforcement: even if application-layer dedup is bypassed (manual insert, race, future code path), the DB rejects the duplicate. Partial because pending/processing/failed rows transit through transient duplicate states by design.

Companion fix 26ba1562 rejects ingest of docs whose title normalizes to empty string (a common drift mode for very-short pages — a single image with no readable text would normalize to empty and silently dedup against other empty-normalized docs).

Layer C — Sanitization + extraction

Sanitizer (cfcb4e2b)

One-shot script (scripts/sanitize_lorem_chunks.py) that walks existing document_chunks, detects template Lorem ipsum runs (/lorem ipsum dolor sit amet/i over ≥10 consecutive tokens), and strips them. Idempotent — safe to run on a corpus that's been sanitized before. Backfill-safe — only chunks with status='completed' are touched.

Layer B's empty-normalized-title guard (26ba1562) is the prevention side of Layer C's cleanup: future Lorem-only docs are rejected at ingest, so Layer C's sanitizer drains a finite backlog.

Schedule-table extractor (7f2789c5)

extract_consultation_schedule() in data_quality.py. Parses the ZOL Drupal schedule-table format:

| | MA | Di | WO | DO | VR | ZA |
| VM | RP2w | RP2w | RP2w | RP | RP2w | |
| NM | RP | | | | | |

into structured JSON stored at doc_metadata.consultation_schedule:

{
"MA": {"VM": "RP2w", "NM": "RP"},
"Di": {"VM": "RP2w"},
"WO": {"VM": "RP2w"},
"DO": {"VM": "RP"},
"VR": {"VM": "RP2w"}
}

Empty cells are omitted; all-empty tables return None. Future cell codes (RP3w, etc.) pass through verbatim — no fixed vocabulary constraint.

Hooked into document_service.py:1314+ so every new ingested doc gets its schedule extracted. Existing docs backfilled by scripts/backfill_consultation_schedule.py (one-shot, idempotent).

Important pairing: this extractor is the structured-data fallback to ADR-0057's ZOL_DOCTOR_SCHEDULE_RULE prompt rule. The prompt rule teaches the LLM to parse markdown when consumers haven't been upgraded yet; the extractor makes the structured JSON available for any future code path that wants deterministic schedule answers without LLM table parsing.

extract_consultation_schedule is tenant-scoped to the ZOL format. Future tenants get their own extractor analogous to _TENANT_CHAT_ADDENDUMS — same architectural pattern as ADR-0057.

Layer D — Post-completion verification

Verification logic inside Layer A's audit script. Confirms what landed in documents + document_chunks matches what was crawled (URL → chunk count, status flow consistency, expected-vs-actual completion rate per ingest run).

Layer D is the what we check dimension; Layer A is the when we check it dimension. There's no separate Layer D commit beyond e43de39b — both ship together because the verification rules are most expressive in the same script that already walks the corpus.

Why four layers, not one

A monolithic "data quality" script would have three problems:

  1. Ownership ambiguity — when the dedup gate fires, is that a Layer A audit problem or a Layer B ingest problem?
  2. Latency mixing — Layer B runs synchronously per insert (must be fast); Layer A runs nightly (can be slow); Layer C runs opportunistically (per ingest); Layer D runs async after Layer A finishes. Bundling them would force one cadence to bend to another.
  3. Failure isolation — a Layer C extractor bug (e.g., the empty-table edge case in e03ff4cb) should not block ingest. Splitting C from B keeps the gate close to the failure mode.

The four-layer split mirrors how the code-side gates work: ruff runs in seconds, pyright runs in minutes, tsc runs on the frontend tree only, eslint runs after build. Same separation of cadence + concern.

Companion fixes that surfaced during implementation

CommitWhatWhy surfaced
357725cbUse d.metadata column (DB) not d.doc_metadata (Python attribute)Backfill SQL was reading the ORM attribute name
e03ff4cbPin department-overview format → None decisionDepartment-overview pages have no schedule by design; the extractor must return None, not raise
f0a577aa + 6c3ddfb5Multi-match handling in find_document_by_source_urlLayer B's dedup needs deterministic behavior when the same canonical URL surfaces in two different ingest passes
9fa6fa16PERCENTILE_DISC for timestamp percentilePERCENTILE_CONT interpolates between two timestamps, producing fractional-microsecond values PostgreSQL can't cast back to timestamp with time zone

References

  • Commits: e43de39b (Layer A/B/C/D shape) · 7f2789c5 (Layer C extractor + Layer A refinement) · d2567b7b (migration 068 + scheduler wiring) · cfcb4e2b (Lorem ipsum sanitizer)
  • Schema: backend/alembic/versions/068_data_quality_dedup_index.py — partial unique index on (tenant_id, normalized_title) WHERE status='completed'
  • Audit script: backend/scripts/audit_data_quality.py
  • Extractor: backend/app/services/data_quality.pyextract_consultation_schedule()
  • Backfill: backend/scripts/backfill_consultation_schedule.py (one-shot, idempotent)
  • Related: ADR-0057 Tenant-Scoped Prompt Addendums + Doctor-Profile Boost — Layer C's schedule extractor is the structured-data half of ADR-0057's defense-in-depth pair.