Document Ingestion Pipeline

Before the search system can answer questions, hospital content must be transformed from its raw formats into searchable representations. The ingestion pipeline converts documents into vector embeddings for semantic search and tsvectors for BM25 keyword search. The entity taxonomy is seeded separately from curated taxonomy data and hub pages.

Pipeline-Level Trade-offs

Decision	Chosen	Alternatives considered	Rejected because
Embedding model	OpenAI `text-embedding-3-large` (1,536-dim, @openai2024embeddings)	Self-hosted BGE-M3 (@chen2024bgem3) at 1,024-dim via Ollama; OpenAI `text-embedding-3-small` (1,536-dim, cheaper)	BGE-M3 was the previous baseline (ADR-0033). The migration (ADR-0048) accepted the ~$0.20/month cost for stronger Dutch retrieval (MTEB-NL ~64.6 vs 60.0) and removal of the Ollama operational dependency. text-embedding-3-small was rejected because the empirical retrieval gap on Dutch medical content was ~3 points MTEB-NL — small but measurable.
Chunking strategy	Markdown-header-aware split with 350-token target, 70-token overlap, hard 450-token ceiling	Fixed-token sliding window with no structural awareness; recursive character-based; semantic-similarity-based chunking	Hospital content (brochures, condition pages) is inherently sectioned. A fixed-token window severs section boundaries and dilutes the semantic signal; semantic-similarity chunking adds a per-document inference call we can't justify at corpus scale. Markdown-aware splitting respects existing structure with one regex pass.
Subprocess isolation for PDFs	`ProcessPoolExecutor` with 120 s timeout, image-only detection	In-process PyMuPDF; external service call	A misbehaving PDF in-process can OOM, segfault, or hang the FastAPI worker; we observed both during the initial 1,000-brochure import. An external service adds operational surface. Subprocess isolation kills only the worker on a bad PDF and lets the next URL through.
Re-ingestion strategy	Content-hash-based incremental update	Periodic full re-crawl; last-modified-header check	A full re-crawl is prohibitively slow, and Drupal rarely populates the HTTP `Last-Modified` header correctly. A content-hash diff captures actual changes, in line with the findings of Cho & Garcia-Molina (2003) (cited in §Incremental Updates below).

Pipeline Overview

Step 1: Content Extraction

The extraction layer normalizes diverse document formats into a common Markdown representation:

Source Format	Extraction Tool	Notes
PDF	PyMuPDF (fitz)	Subprocess-isolated extraction with 120s timeout
DOCX	python-docx	Preserves heading structure
HTML	crawl4ai	Async crawling with JavaScript rendering
Sitemap	Custom parser	Discovers URLs for crawl4ai

Subprocess-Isolated PDF Extraction

PDF extraction runs in a ProcessPoolExecutor rather than the main async event loop. This isolates the ingestion pipeline from problematic PDFs that could crash, hang, or consume excessive memory:

120-second timeout: Each PDF extraction call has a hard timeout. If a PDF takes longer (e.g., a 200-page scanned brochure), the subprocess is killed and the document is marked as failed.
Image-only PDF detection: The _is_image_only_pdf() helper inspects the first pages of a PDF for extractable text. If a PDF contains only scanned images with no text layer (common for older hospital brochures), it is classified as image-only and skipped with a descriptive error rather than producing empty chunks.
Process isolation: A segfault or memory leak in PyMuPDF kills only the worker process, not the FastAPI server.
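The image-only check can be sketched without PyMuPDF by working on page text that has already been extracted. The function below is a minimal illustration, not the project's `_is_image_only_pdf()`: its signature, the three-page window, and the 20-character threshold are all assumptions.

```python
from itertools import islice
from typing import Iterable

# Hypothetical thresholds; the real helper's values are not documented here.
MAX_PAGES_TO_INSPECT = 3
MIN_TEXT_CHARS = 20


def is_image_only(page_texts: Iterable[str],
                  max_pages: int = MAX_PAGES_TO_INSPECT,
                  min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when none of the first `max_pages` pages has a usable text layer."""
    for text in islice(page_texts, max_pages):
        # A page without a text layer yields empty or whitespace-only text.
        if len(text.strip()) >= min_chars:
            return False
    return True
```

With PyMuPDF, `page_texts` could be supplied lazily as `(page.get_text() for page in doc)`, so that only the inspected pages are parsed. A document with no pages is also reported as image-only here, which matches the goal of never producing empty chunks.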

Sitemap-driven crawling is particularly important for ZOL. The hospital website (built on Drupal by partner agency Novation) exposes a comprehensive sitemap. The crawler uses this sitemap to systematically discover and ingest all public-facing content, ensuring complete coverage.

The extracted Markdown is stored in MinIO as the canonical source, enabling re-processing without re-crawling when chunking or embedding strategies change.

Step 2: Text Chunking

Text chunking is the process of splitting documents into segments suitable for embedding. The chunking strategy significantly impacts retrieval quality -- chunks that are too large dilute semantic signals, while chunks that are too small lose context.

Chunking Configuration

Parameter	Value	Rationale
Target size	350 tokens	Optimal for `text-embedding-3-large` (empirically tested; unchanged from BGE-M3 baseline at the ADR-0048 migration)
Maximum size	450 tokens	Hard ceiling to prevent oversized chunks
Overlap	70 tokens	Preserves context across chunk boundaries
Tokenizer	Tiktoken (cl100k_base)	Matches OpenAI-family tokenization
Split awareness	Markdown headers	Respects document structure

Why Markdown-Aware Splitting?

Hospital content is inherently structured. A brochure about knee surgery has sections like "Voorbereiding" (Preparation), "Procedure", and "Nazorg" (Aftercare). Splitting at heading boundaries ensures that each chunk represents a coherent topic, rather than an arbitrary slice of text. This decision is documented in ADR-0001.
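A header-aware splitter with the configured sizes might look like the sketch below. Two things are assumptions made for illustration: whitespace splitting stands in for the cl100k_base tokenizer so the example has no dependencies, and the 450-token ceiling is interpreted as "keep a section whole if it fits under the ceiling", which may differ from the production rule.

```python
import re

HEADER_RE = re.compile(r"^#{1,6}[ \t]+(.+)$", re.MULTILINE)


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at header lines."""
    sections: list[tuple[str, str]] = []
    heading, cursor = "", 0
    for match in HEADER_RE.finditer(markdown):
        body = markdown[cursor:match.start()].strip()
        if body:
            sections.append((heading, body))
        heading, cursor = match.group(1).strip(), match.end()
    tail = markdown[cursor:].strip()
    if tail:
        sections.append((heading, tail))
    return sections


def chunk_tokens(tokens: list[str], target: int = 350,
                 overlap: int = 70, ceiling: int = 450) -> list[list[str]]:
    """Window one section's tokens; sections at or under the ceiling stay whole."""
    if len(tokens) <= ceiling:
        return [tokens] if tokens else []
    step = target - overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            return chunks
        start += step


def chunk_markdown(markdown: str) -> list[dict]:
    """Chunk each section separately so that no chunk crosses a heading."""
    chunks = []
    for heading, body in split_sections(markdown):
        # Whitespace split is a stand-in for tiktoken's cl100k_base (assumption).
        for window in chunk_tokens(body.split()):
            chunks.append({"heading": heading, "text": " ".join(window)})
    return chunks
```

Because windows are built per section, a 400-token "Voorbereiding" section stays a single chunk, while a 1,000-token "Nazorg" section is cut into 350-token windows that share 70 tokens with their neighbours.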

Metadata Attachment

Each chunk carries metadata that enables downstream filtering and boosting:

Source URL: The original page or document URL
Source title: The document or page title
Section headers: The Markdown heading hierarchy above this chunk
Category: Inferred from the source, normalized to title case (e.g., "Department", not "department")
Ingestion timestamp: For recency-based boosting
Canonical questions: 1-2 Dutch questions answered by this chunk (see Step 7 below)

Title keyword extraction replaced

The previous title_keywords metadata field has been replaced by BM25 tsvector search (ADR-007). The document title is now included directly in the tsvector, so title terms are searchable via BM25 without a separate metadata field.

Category Normalization

Categories are normalized to title case during ingestion (e.g., "department" becomes "Department"). This prevents silent metadata boost failures that could occur when category casing differs between ingested documents and the intent-to-category mapping used at query time.
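The silent failure is easy to reproduce with a small sketch. The boost table and its values below are hypothetical; only the title-case rule comes from the pipeline description.

```python
# Hypothetical intent-to-category boost table used at query time.
CATEGORY_BOOSTS = {"Department": 1.2, "Condition": 1.1}


def normalize_category(raw: str) -> str:
    """Normalize a category label to title case, e.g. 'department' -> 'Department'."""
    return raw.strip().title()


def boost_for(category: str) -> float:
    """Look up the boost; unknown categories silently fall back to 1.0."""
    return CATEGORY_BOOSTS.get(category, 1.0)
```

Here `boost_for("department")` returns the neutral 1.0 without raising an error, whereas `boost_for(normalize_category("department"))` finds the intended entry.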

Step 3: PII Detection

The PII detection layer scans content for personally identifiable information using regex patterns:

Pattern	Example	Action
Email addresses	facturatie@zol.be	Log detection
Phone numbers	089 32 50 50	Log detection
BSN (Dutch SSN)	9-digit pattern	Log detection

Intentional Design Choice

PII masking is disabled for ZOL content. Hospital contact information (email addresses, phone numbers) is intentionally public -- patients need this information to book appointments and reach departments. The PII detector still logs detections for audit purposes, but does not redact content. See PII Protection for the full rationale.
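A detect-and-log pass without redaction can be sketched as follows. The regular expressions are illustrative guesses at the three pattern families in the table, not the production patterns, and the logger name is an assumption.

```python
import logging
import re

logger = logging.getLogger("pii_detector")  # hypothetical logger name

# Illustrative patterns only; the production regexes are not shown in this document.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b0\d{1,2}(?:[ ./]?\d{2}){3}\b"),
    "bsn": re.compile(r"\b\d{9}\b"),
}


def detect_pii(text: str) -> dict[str, list[str]]:
    """Log PII detections for audit and return them; the text itself is never modified."""
    detections = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    for name, matches in detections.items():
        if matches:
            # Log counts rather than the matched values themselves.
            logger.info("PII detected: type=%s count=%d", name, len(matches))
    return detections
```

The function returns its findings instead of a masked copy of the text, which mirrors the log-only behaviour described above. Note that with these simple patterns an unformatted nine-digit phone number would be reported under both `phone` and `bsn`.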

Step 4: Embedding Generation

Each text chunk is converted into a 1,536-dimensional dense vector using OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings; the dense bi-encoder retrieval pattern follows @karpukhin2020dpr):

Model: text-embedding-3-large (OpenAI)
Dimensions: 1,536 (truncated from the model's native 3,072 to fit pgvector's HNSW 2,000-dim limit; see @pgvector_docs)
Tokenizer: cl100k_base (same OpenAI tokenizer used for chunk-size accounting → exact token counts)
Language support: Strong multilingual; MTEB-NL retrieval ~64.6 (above BGE-M3's 60.0)
Inference: OpenAI API; cost ~$0.13 per 1M tokens (75% prompt-cache discount; ~$0.20/month at 25,000 monthly queries)
Configuration: EMBEDDING_PROVIDER=openai, EMBEDDING_MODEL=text-embedding-3-large, EMBEDDING_DIMENSIONS=1536

Chunk texts are sent to OpenAI in a single batch API call rather than one request per chunk, which reduces round trips during ingestion. The previous BGE-M3 / Ollama path is retained as the configurable fallback (EMBEDDING_PROVIDER=ollama) and as the model used for ColBERT reranking.
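The batch call with truncated dimensions can be sketched as below. The `client` argument is assumed to be an `openai.OpenAI()` instance, whose `embeddings.create` accepts `model`, `input`, and `dimensions`; the function name and the dimension check are illustrative additions, and the client is passed in so the example does not import the SDK itself.

```python
from typing import Any, Sequence

EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 1536


def embed_chunks(client: Any, texts: Sequence[str]) -> list[list[float]]:
    """Embed all chunk texts in one request and return vectors in input order."""
    if not texts:
        return []
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=list(texts),
        dimensions=EMBEDDING_DIMENSIONS,  # ask the API for 1,536-dim vectors
    )
    # Each returned item carries the index of its input; sort defensively.
    ordered = sorted(response.data, key=lambda item: item.index)
    vectors = [item.embedding for item in ordered]
    if any(len(vector) != EMBEDDING_DIMENSIONS for vector in vectors):
        raise ValueError("unexpected embedding dimensionality")
    return vectors
```

Checking the vector length before storage is a cheap guard against a misconfigured `EMBEDDING_DIMENSIONS`, since a mismatch would otherwise only surface when inserting into the pgvector column. Very large batches may need to be split to respect API request limits.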

Step 5: Vector Storage (pgvector)

Embedded chunks are stored in PostgreSQL with the pgvector extension. Each record includes the chunk text, its embedding vector, and all metadata. An HNSW index enables approximate nearest neighbor search at query time.

Step 6: BM25 Search Vector Generation

After chunks are stored in pgvector, each chunk receives a PostgreSQL tsvector for BM25 keyword search (see ADR-007). The tsvector is constructed by concatenating the parent document's title with the chunk content, then applying PostgreSQL's to_tsvector() function. Only chunks that have not yet been indexed are processed, making the operation idempotent for incremental ingestion.

Key design choices:

'simple' text configuration: No language-specific stemming. Dutch medical terms like "cardioversie" and "orthopedie" should be matched exactly, not reduced to stems
Title included: The document title is prepended to each chunk's tsvector, so title terms are searchable via BM25
GIN index: A GIN index on search_vector enables sub-millisecond keyword lookup
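The design choices above translate into roughly the following statements, shown here as Python constants next to the helper that builds the indexed text. The `id` column, the index name, and the psycopg-style named parameters are assumptions; `search_vector`, `document_chunks`, and the `'simple'` configuration come from this document.

```python
# Assumed schema details: `id` column, index name, psycopg-style parameters.
CREATE_GIN_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_document_chunks_search_vector
ON document_chunks USING GIN (search_vector)
"""

# `search_vector IS NULL` restricts the update to chunks not yet indexed,
# which is what makes repeated runs idempotent.
UPDATE_SEARCH_VECTOR_SQL = """
UPDATE document_chunks
SET search_vector = to_tsvector('simple', %(search_text)s)
WHERE id = %(chunk_id)s AND search_vector IS NULL
"""


def build_search_text(title: str, content: str) -> str:
    """Prepend the parent document title to the chunk content."""
    return f"{title.strip()}\n{content.strip()}".strip()
```

A caller would pass `{"search_text": build_search_text(title, content), "chunk_id": ...}` as the statement parameters, so the text never has to be interpolated into the SQL string.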

Step 7: Canonical Question Generation (Background)

Conceptual overview

For the why behind canonical questions and page summaries — the HyDE-at-index-time and Anthropic-contextual-retrieval rationale, and how both feed the query-time retrieval-steering triad — see Ingestion Enrichment: Canonical Questions & Page Summaries. This section documents the procedural how.

After tsvector generation, the pipeline generates 1-2 Dutch questions per chunk using the Tier 2 (standard) model via OpenAI. When contextual embeddings are enabled, questions are generated synchronously (before embedding) with asyncio.gather concurrency. Otherwise, this runs as a background task.

Prompt (Dutch):

Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) die door deze tekst beantwoord worden. Geef enkel de vragen terug, één per regel.

Example:

Chunk about visiting hours → "Wat zijn de bezoekuren op cardiologie?"
Chunk about Dr. Van den Berg → "Wie is de orthopedisch chirurg bij ZOL?"

The generated questions are:

Stored in chunk_metadata.canonical_questions (list of strings)
Appended to the search_vector tsvector, so BM25 search matches against both content and canonical questions

This enrichment significantly improves BM25 recall for question-phrased queries, which are the most common user input pattern.
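The generation step can be sketched with the LLM call injected as a coroutine, so the example stays independent of any particular client. `ask_llm`, `parse_questions`, and the list-marker cleanup are hypothetical; the prompt text and the use of `asyncio.gather` come from the description above.

```python
import asyncio
import re
from typing import Awaitable, Callable

PROMPT = (
    "Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen "
    "(in het Nederlands) die door deze tekst beantwoord worden. "
    "Geef enkel de vragen terug, één per regel."
)

_LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def parse_questions(raw: str, limit: int = 2) -> list[str]:
    """Keep at most `limit` non-empty lines, stripping list markers the model may add."""
    lines = (_LIST_MARKER.sub("", line).strip() for line in raw.splitlines())
    return [line for line in lines if line][:limit]


async def generate_questions(chunks: list[str],
                             ask_llm: Callable[[str], Awaitable[str]]) -> list[list[str]]:
    """Run one LLM call per chunk concurrently and parse each response."""
    responses = await asyncio.gather(
        *(ask_llm(f"{PROMPT}\n\n{chunk}") for chunk in chunks)
    )
    return [parse_questions(response) for response in responses]
```

Capping the parsed output at two questions keeps the stored metadata consistent with the prompt even when the model returns extra lines.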

Step 8: LLM Page Summary Generation

During graph extraction, the LLM generates a page summary for every page — a 2-3 sentence Dutch description of what the page covers. This summary is stored in chunk_metadata.page_summary (JSONB) and prepended to the first chunk from each document during context assembly, implementing Anthropic's contextual retrieval pattern.

Page summaries are generated for all pages, not just hub pages

Even though non-hub pages do not write entities to the taxonomy, the LLM still generates a page summary for every document. This ensures all chunks benefit from contextual retrieval at query time. See ADR-0014 for details.

Enrichment Retry & Gap Detection

LLM enrichment steps (canonical questions, page summaries, graph extraction) can fail for individual documents due to transient API errors, rate limits, or timeouts. Rather than re-running the entire pipeline, the system supports inline gap detection and backfill:

Gap detection: Before starting enrichment, the pipeline queries for documents that completed chunking but are missing enrichment artifacts (e.g., chunks without canonical questions, documents without a page summary).
Selective backfill: Only the missing enrichment steps are re-executed for gap documents, skipping the extraction and chunking stages entirely.
Retry integration: Gap backfill runs as part of the normal ingestion flow -- operators do not need a separate "retry" action.
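Gap detection amounts to comparing what each document should have against what it has. The sketch below works on a hypothetical in-memory record instead of the real database query; the field names and step labels are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DocumentState:
    """Hypothetical view of one document's enrichment artifacts."""
    url: str
    chunk_count: int
    chunks_with_questions: int = 0
    page_summary: Optional[str] = None


def find_gaps(documents: Iterable[DocumentState]) -> dict[str, list[str]]:
    """Map each document URL to the enrichment steps it is still missing."""
    gaps: dict[str, list[str]] = {}
    for doc in documents:
        if doc.chunk_count == 0:
            continue  # chunking never completed, so this is not an enrichment gap
        missing = []
        if doc.chunks_with_questions < doc.chunk_count:
            missing.append("canonical_questions")
        if not doc.page_summary:
            missing.append("page_summary")
        if missing:
            gaps[doc.url] = missing
    return gaps
```

Returning the missing steps per document, instead of a flat list of documents, is what allows the backfill to re-run only those steps.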

LLM Rate Limiting

To stay within the LLM API's rate limits and keep costs predictable, the enrichment pipeline enforces a 200 ms delay between consecutive LLM enrichment calls. This applies to canonical question generation, page summary generation, and graph extraction. The delay is implemented as an asyncio.sleep(0.2) between calls, a simple throttle that caps sequential enrichment at no more than five calls per second and helps avoid 429 (Too Many Requests) errors during large batch ingestion runs.
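A fixed-delay throttle of this kind is only a few lines. In the sketch below the function name and the injectable `sleep` parameter are assumptions added for illustration and testability; the 0.2-second pause between calls is the documented behaviour.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

LLM_CALL_DELAY_SECONDS = 0.2


async def run_throttled(calls: Sequence[Callable[[], Awaitable[str]]],
                        delay: float = LLM_CALL_DELAY_SECONDS,
                        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep) -> list[str]:
    """Await each LLM call in turn, pausing `delay` seconds between consecutive calls."""
    results = []
    for position, call in enumerate(calls):
        if position > 0:
            await sleep(delay)  # no pause before the first call
        results.append(await call())
    return results
```

Because the calls run one after another, the delay sets an upper bound on the request rate; the actual rate is lower once the latency of each call is added.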

Taxonomy Seeding (Separate from Ingestion)

The entity taxonomy in PostgreSQL is not populated during regular page ingestion. Instead, it is seeded in a separate process from two curated sources, described below.

Why separate taxonomy seeding?

The taxonomy requires high-quality, validated data to support structured lookups (e.g., "which doctors work in cardiology?"). Allowing every crawled page to write entities caused several quality problems during early development:

Cross-product relationships: Departments linked to ALL campuses instead of the correct one
Garbage entity names: Body parts and job titles parsed as doctor names
Scope leakage: Conditions associated with departments that don't treat them

The solution was to restrict taxonomy writes to two curated sources:

Frozen taxonomy (zol_taxonomy.py): A manually curated single source of truth containing all departments, conditions, treatments, examinations, campus mappings, and their relationships. This file encodes institutional knowledge (e.g., "Cardiologie is located at ZOL Sint-Jan" and "Cardiologie handles Hartfalen").
Hub pages: Structural listing pages (automatically classified as hub by an LLM binary classifier) that contain validated doctor-department-specialty information. Only these pages pass the graph_golden_only gate (enabled by default). The previous 8-type page classification (GOLDEN_SEED, GOLDEN_LISTING, etc.) has been replaced by a binary hub/detail classifier.

This architecture is documented in ADR-0028.

What happens during regular page ingestion?

When graph_golden_only = true (the default), the ingestion pipeline still runs regex extraction and LLM validation on every page, but uses the results only for page summaries and search metadata; no entities are written to the taxonomy. The flow is:

Regex extraction runs → identifies entities and relationships
LLM validation runs → generates a page summary (2-3 sentence Dutch description)
Page summary stored in chunk_metadata.page_summary (pgvector)
Taxonomy storage skipped: extracted entities are not persisted; only the summary and the denormalized entity types are kept
Entity types denormalized into doc_metadata for search boosting

Entity Type and Campus Denormalization

After graph extraction, entity types and campus names found in the document are denormalized back onto the document's metadata:

doc_metadata.entity_types -- list of entity types discovered (e.g., ["doctors", "departments", "conditions"])
doc_metadata.campus -- list of campus names referenced (e.g., ["ZOL Sint-Jan", "ZOL André Dumont"])

This denormalization enables the metadata boosting stage in the query pipeline to apply entity type match and campus match boosts without performing runtime graph lookups, keeping the boosting latency under 5ms.
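The denormalization step can be sketched as a pure function over the extracted entities. The entity shape used here (a dict with `type`, `name`, and an optional `campus`) is an assumption; only the two output keys come from the description above.

```python
def denormalize(entities: list[dict]) -> dict[str, list[str]]:
    """Collect distinct entity types and campus names, preserving first-seen order."""
    entity_types = list(dict.fromkeys(entity["type"] for entity in entities))
    campuses = list(dict.fromkeys(
        entity["campus"] for entity in entities if entity.get("campus")
    ))
    return {"entity_types": entity_types, "campus": campuses}
```

Storing plain lists on the document means the boosting stage only needs a membership test per candidate, with no join against the taxonomy at query time.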

URL Crawling & Orchestration

Before content processing begins, URLs must be discovered, classified, and managed. The crawl-to-ingest pipeline handles this with two persistent models:

Crawl Sessions

A CrawlSession represents a sitemap crawl that discovers URLs. The crawler parses the hospital's sitemap.xml, classifies each URL by type (HTML, PDF, DOCX, or ignored), and stores them as CrawledUrl records with status tracking (discovered → indexed / failed / skipped).

Ingestion Jobs

An IngestionJob is a batch operation that processes discovered URLs. Each URL gets an IngestionResult record for granular tracking. The orchestration layer provides:

Feature	Implementation	Purpose
Concurrent processing	10 async workers × batches of 50	Throughput without overwhelming external services
Live progress	Redis hash with SSE streaming	Real-time UI updates (current URL, chunk counts, entity names)
Job cancellation	Redis flag polled before each batch	Graceful stop without killing the server
Content deduplication	Title-based dedup check	ZOL's Drupal site has multiple URL paths to the same content
Fault tolerance	Rescue sessions for stuck results	If error recording fails, a rescue session forces "failed" status
Per-URL timeout	5-minute per-URL ingestion timeout	Prevents a single slow URL from blocking the entire batch
Timeout protection	120s document processing timeout	Raises error instead of silently accepting 0-chunk results
Infinite loop guard	Max iterations on batch loop	Prevents infinite re-fetch if results stay "pending"
Redis non-critical	try/except on all Redis helpers	Live details are UI candy — never crash processing
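Several rows of the table (batching, bounded concurrency, cancellation, per-URL timeout, loop guard) combine into one control loop, sketched below. The `ingest` and `is_cancelled` callables are injected stand-ins for the real ingestion function and the Redis flag, and `MAX_BATCHES` is a hypothetical cap.

```python
import asyncio
from typing import Awaitable, Callable

WORKERS = 10
BATCH_SIZE = 50
PER_URL_TIMEOUT_SECONDS = 300  # the 5-minute per-URL limit
MAX_BATCHES = 1000             # hypothetical cap for the infinite-loop guard


async def run_job(urls: list[str],
                  ingest: Callable[[str], Awaitable[str]],
                  is_cancelled: Callable[[], bool]) -> dict[str, str]:
    """Process URLs in batches of 50 with at most 10 running concurrently."""
    semaphore = asyncio.Semaphore(WORKERS)
    results: dict[str, str] = {}

    async def worker(url: str) -> None:
        async with semaphore:
            try:
                results[url] = await asyncio.wait_for(ingest(url), PER_URL_TIMEOUT_SECONDS)
            except Exception:
                # A failing or timed-out URL must not abort the rest of the batch.
                results[url] = "failed"

    pending = list(urls)
    for _ in range(MAX_BATCHES):
        # The cancellation flag is polled once before each batch.
        if not pending or is_cancelled():
            break
        batch, pending = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
        await asyncio.gather(*(worker(url) for url in batch))
    return results
```

Polling for cancellation between batches means a cancelled job finishes the batch in flight and then stops, which is the graceful behaviour the table describes.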

URL Type Classification

The system classifies URLs before processing to select the appropriate extraction strategy:

URL Type	Extension/Pattern	Extraction Method
HTML	`.html`, no extension	crawl4ai (JS rendering) + httpx fallback
PDF	`.pdf`	PyMuPDF (fitz) text extraction
DOCX	`.docx`	python-docx (paragraphs + tables)
Ignored	`.jpg`, `.png`, `.gif`, `.css`, `.js`, etc.	Skipped (not content)
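The classification table maps onto a short function keyed on the URL path's extension. Treating every unlisted extension as ignored is an assumption made here to cover the "etc." in the last row; the real classifier may use an explicit list instead.

```python
import posixpath
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Classify a URL as 'html', 'pdf', 'docx', or 'ignored' from its path extension."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    if extension in ("", ".html"):
        return "html"
    if extension == ".pdf":
        return "pdf"
    if extension == ".docx":
        return "docx"
    return "ignored"  # assumption: anything unlisted is not content
```

Parsing the URL first matters for links such as `brochure.pdf?download=1`, where a naive `endswith(".pdf")` check on the full URL would miss the document.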

Scheduled Ingestion

The ingestion pipeline runs nightly under operator-controlled gating. The scheduler is implemented in backend/app/services/scheduler_service.py and gated by the INGEST_MODE operational setting (backend/app/config.py: ingest_mode: Literal["off", "manual", "auto"]).

`INGEST_MODE`	Behaviour
`off`	Skip both crawl and ingest. Used during incident response or when a long manual ingestion is in flight.
`manual`	Crawl runs (sitemap discovery + URL classification), but the ingestion phase halts. Admins drive ingestion via the Pipeline Wizard UI.
`auto`	Full pipeline (fetch + chunk + embed + delete-handling) executes nightly. This is the production setting on pilot since 2026-04-22.
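The gating in the table reduces to a mapping from mode to the phases that run. The names `NightlyPlan` and `plan_for_mode` are hypothetical; the three modes and their behaviour are taken from the table.

```python
from typing import Literal, NamedTuple

IngestMode = Literal["off", "manual", "auto"]


class NightlyPlan(NamedTuple):
    run_crawl: bool
    run_ingest: bool


def plan_for_mode(mode: IngestMode) -> NightlyPlan:
    """Decide which phases the nightly run executes for a given INGEST_MODE."""
    if mode == "off":
        return NightlyPlan(run_crawl=False, run_ingest=False)
    if mode == "manual":
        return NightlyPlan(run_crawl=True, run_ingest=False)
    if mode == "auto":
        return NightlyPlan(run_crawl=True, run_ingest=True)
    # Fail loudly on a typo instead of silently picking a default.
    raise ValueError(f"unknown INGEST_MODE: {mode!r}")
```

Raising on an unknown value is a deliberate choice in this sketch: a misspelled mode in the environment file should stop the run, not trigger an unintended full ingestion.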

Audit Trail — `IngestRun`

Every nightly invocation, regardless of mode, creates one app.ingest_runs row (backend/app/models/database.py:1006, schema added in alembic migration 061). The row records:

Column	Purpose
`id`, `tenant_id`	Run identity, scoped per hospital
`started_at`, `completed_at`	Wall-clock duration
`mode`	The active `INGEST_MODE` for this run
`urls_discovered`, `urls_indexed`, `urls_failed`, `urls_skipped`	Final counts from the orchestration layer
`failure_class`	Coarse categorisation (e.g., `DEAD_EMPTY_CONTENT`, `SITEMAP_TIMEOUT`) for alerting
`notes`	Free-text context (most-frequent failure URLs, retry hints)

Disarming Auto-Ingest Without a Deploy

To disarm auto-ingest without a deploy: set INGEST_MODE=manual in backend/.env and restart uvicorn. The next 03:00 UTC cycle respects the new mode. The full design lives in the spec at docs/superpowers/specs/2026-04-17-nightly-ingest-design.md.

Incremental Updates

The pipeline supports incremental ingestion, a well-established best practice in web crawling literature. Olston and Najork (2010) identify three fundamental strategies for maintaining crawl freshness: periodic re-crawling, change-driven selective updates, and frequency-based scheduling. Cho and Garcia-Molina (2000) demonstrated that incremental crawlers that selectively update their index rather than performing batch refreshes achieve significantly higher "freshness" — defined as the fraction of the collection that is up-to-date at any given time.

The ZOL system implements a content-hash-based change detection strategy, which Cho and Garcia-Molina (2003) showed can improve crawl efficiency by up to 35% by focusing resources on content that has actually changed:

New documents: Crawl sitemap, identify new URLs, ingest only new content
Updated documents: Compare SHA-256 content hashes (content_hash in the document_chunks table), re-ingest only changed documents — this implements the hash-based change detection approach recommended by Olston and Najork (2010, §4.2)
Deleted documents: Mark chunks as inactive (soft delete for audit trail), preserving provenance
Re-ingestion: "Clear Databases" resets URL status to discovered without re-crawling, enabling re-processing with updated chunking or embedding strategies

This approach avoids full re-ingestion, which would be prohibitively slow for the ~1,000 brochures and hundreds of web pages in the ZOL corpus.
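The change-detection logic can be sketched as a diff between stored hashes and freshly extracted content. The function names and the dictionary shapes are assumptions for illustration; the SHA-256 comparison and the new, updated, and deleted cases follow the strategy listed above.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """SHA-256 hex digest of the extracted Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def diff_corpus(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (url -> hash) with fresh content (url -> markdown)."""
    plan: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": [], "deleted": []}
    for url, markdown in crawled.items():
        if url not in stored:
            plan["new"].append(url)
        elif stored[url] != content_hash(markdown):
            plan["updated"].append(url)
        else:
            plan["unchanged"].append(url)
    # URLs that vanished from the sitemap are candidates for soft deletion.
    plan["deleted"] = [url for url in stored if url not in crawled]
    return plan
```

Only the `new` and `updated` buckets need to pass through chunking and embedding, which is where the savings over a full re-ingestion come from.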

Academic Foundation

The incremental ingestion architecture draws on foundational web crawling research:

Freshness optimisation: Cho, J. & Garcia-Molina, H. (2000). The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000.
Change frequency estimation: Cho, J. & Garcia-Molina, H. (2003). Estimating Frequency of Change. ACM Transactions on Internet Technology, 3(3), 256–290.
Comprehensive survey: Olston, C. & Najork, M. (2010). Web Crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.
