Document Ingestion Pipeline
Before the search system can answer questions, hospital content must be transformed from its raw formats into searchable representations. The ingestion pipeline converts documents into vector embeddings for semantic search, while the entity taxonomy is seeded separately from curated taxonomy data and hub pages.
Pipeline-Level Trade-offs
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Embedding model | OpenAI text-embedding-3-large (1,536-dim, @openai2024embeddings) | Self-hosted BGE-M3 (@chen2024bgem3) at 1,024-dim via Ollama; OpenAI text-embedding-3-small (1,536-dim, cheaper) | BGE-M3 was the previous baseline (ADR-0033). The migration (ADR-0048) accepted the ~$0.20/month cost for stronger Dutch retrieval (MTEB-NL ~64.6 vs 60.0) and removal of the Ollama operational dependency. text-embedding-3-small was rejected because the empirical retrieval gap on Dutch medical content was ~3 points MTEB-NL — small but measurable. |
| Chunking strategy | Markdown-header-aware split with 350-token target, 70-token overlap, hard 450-token ceiling | Fixed-token sliding window with no structural awareness; recursive character-based; semantic-similarity-based chunking | Hospital content (brochures, condition pages) is inherently sectioned. A fixed-token window severs section boundaries and dilutes the semantic signal; semantic-similarity chunking adds a per-document inference call we can't justify at corpus scale. Markdown-aware splitting respects existing structure with one regex pass. |
| Subprocess isolation for PDFs | ProcessPoolExecutor with 120 s timeout, image-only detection | In-process PyMuPDF; external service call | A misbehaving PDF in-process can OOM, segfault, or hang the FastAPI worker; we observed both during the initial 1,000-brochure import. An external service adds operational surface. Subprocess isolation kills only the worker on a bad PDF and lets the next URL through. |
| Re-ingestion strategy | Content-hash-based incremental update | Periodic full re-crawl; last-modified-header check | A full re-crawl is prohibitively slow, and Drupal rarely populates the HTTP Last-Modified header correctly. A content-hash diff captures actual changes, in line with the findings of Cho & Garcia-Molina (2003) (cited in §Incremental Updates below). |
Pipeline Overview
Step 1: Content Extraction
The extraction layer normalizes diverse document formats into a common Markdown representation:
| Source Format | Extraction Tool | Notes |
|---|---|---|
| PDF | PyMuPDF (fitz) | Subprocess-isolated extraction with 120s timeout |
| DOCX | python-docx | Preserves heading structure |
| HTML | crawl4ai | Async crawling with JavaScript rendering |
| Sitemap | Custom parser | Discovers URLs for crawl4ai |
Subprocess-Isolated PDF Extraction
PDF extraction runs in a ProcessPoolExecutor rather than the main async event loop. This isolates the ingestion pipeline from problematic PDFs that could crash, hang, or consume excessive memory:
- 120-second timeout: Each PDF extraction call has a hard timeout. If a PDF takes longer (e.g., a 200-page scanned brochure), the subprocess is killed and the document is marked as failed.
- Image-only PDF detection: The _is_image_only_pdf() helper inspects the first pages of a PDF for extractable text. If a PDF contains only scanned images with no text layer (common for older hospital brochures), it is classified as image-only and skipped with a descriptive error rather than producing empty chunks.
- Process isolation: A segfault or memory leak in PyMuPDF kills only the worker process, not the FastAPI server.
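The timeout and isolation behaviour above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the source uses a ProcessPoolExecutor, whereas this sketch swaps in `subprocess.run` with a child Python interpreter, because that call kills the child when the timeout expires and runs without a PDF library. The name `run_isolated` and the shape of the returned dictionary are hypothetical.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float) -> dict:
    """Run `code` in a separate Python interpreter and report the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry the child is killed, then TimeoutExpired is raised
        )
    except subprocess.TimeoutExpired:
        # Only the child is gone; the calling process carries on with the next URL.
        return {"status": "failed", "error": f"timed out after {timeout_s}s"}
    if proc.returncode != 0:
        # A crash in the child surfaces as a non-zero exit code.
        return {"status": "failed", "error": f"worker exited with code {proc.returncode}"}
    return {"status": "ok", "text": proc.stdout}
```

In the pipeline described here, the child would run the PyMuPDF extraction and the timeout would be 120 seconds.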
Sitemap-driven crawling is particularly important for ZOL. The hospital website (built on Drupal by partner agency Novation) exposes a comprehensive sitemap. The crawler uses this sitemap to systematically discover and ingest all public-facing content, ensuring complete coverage.
The extracted Markdown is stored in MinIO as the canonical source, enabling re-processing without re-crawling when chunking or embedding strategies change.
Step 2: Text Chunking
Text chunking is the process of splitting documents into segments suitable for embedding. The chunking strategy significantly impacts retrieval quality -- chunks that are too large dilute semantic signals, while chunks that are too small lose context.
Chunking Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Target size | 350 tokens | Optimal for text-embedding-3-large (empirically tested; unchanged from BGE-M3 baseline at the ADR-0048 migration) |
| Maximum size | 450 tokens | Hard ceiling to prevent oversized chunks |
| Overlap | 70 tokens | Preserves context across chunk boundaries |
| Tokenizer | Tiktoken (cl100k_base) | Matches OpenAI-family tokenization |
| Split awareness | Markdown headers | Respects document structure |
Hospital content is inherently structured. A brochure about knee surgery has sections like "Voorbereiding" (Preparation), "Procedure", and "Nazorg" (Aftercare). Splitting at heading boundaries ensures that each chunk represents a coherent topic, rather than an arbitrary slice of text. This decision is documented in ADR-0001.
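A minimal sketch of header-aware chunking with the configured sizes, under stated assumptions: whitespace-separated words stand in for cl100k_base tokens (the real pipeline counts tokens with Tiktoken), and the function names `split_sections` and `chunk_section` are hypothetical. A section that fits under the ceiling stays whole; a longer one is cut into target-sized windows that overlap.

```python
import re

# A Markdown heading: 1-6 '#' characters, a space or tab, then the title.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+.*$", re.MULTILINE)


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (header, body) pairs; text before the first header gets header ''."""
    matches = list(HEADER_RE.finditer(markdown))
    sections = []
    first_start = matches[0].start() if matches else len(markdown)
    preamble = markdown[:first_start].strip()
    if preamble:
        sections.append(("", preamble))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((match.group(0).strip(), markdown[match.end():end].strip()))
    return sections


def chunk_section(body: str, target: int = 350, overlap: int = 70, ceiling: int = 450) -> list[str]:
    """Split one section body into chunks of at most `ceiling` tokens."""
    tokens = body.split()  # assumption: words approximate tokens in this sketch
    if len(tokens) <= ceiling:
        return [" ".join(tokens)] if tokens else []
    step = target - overlap  # each window starts `overlap` tokens before the previous one ends
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target]))
        if start + target >= len(tokens):
            break
    return chunks
```

With these defaults, a 1,000-word section yields four chunks, and the last 70 words of each chunk reappear at the start of the next.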
Metadata Attachment
Each chunk carries metadata that enables downstream filtering and boosting:
- Source URL: The original page or document URL
- Source title: The document or page title
- Section headers: The Markdown heading hierarchy above this chunk
- Category: Inferred from the source, normalized to title case (e.g., "Department", not "department")
- Ingestion timestamp: For recency-based boosting
- Canonical questions: 1-2 Dutch questions answered by this chunk (see Step 7 below)
The previous title_keywords metadata field has been replaced by BM25 tsvector search (ADR-007). The document title is now included directly in the tsvector, so title terms are searchable via BM25 without a separate metadata field.
Category Normalization
Categories are normalized to title case during ingestion (e.g., "department" becomes "Department"). This prevents silent metadata boost failures that could occur when category casing differs between ingested documents and the intent-to-category mapping used at query time.
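A small sketch of why the casing matters. The mapping `INTENT_TO_CATEGORY` and both function names are hypothetical; the point is only that an exact-match lookup fails silently when ingestion and query time disagree on case.

```python
# Hypothetical intent-to-category mapping used at query time.
INTENT_TO_CATEGORY = {"find_department": "Department", "find_doctor": "Doctor"}


def normalize_category(raw: str) -> str:
    """Normalize a category label to title case, e.g. 'department' -> 'Department'."""
    return raw.strip().title()


def category_boost_applies(intent: str, chunk_category: str) -> bool:
    """Exact-match check: a differently cased category never matches and never errors."""
    return INTENT_TO_CATEGORY.get(intent) == chunk_category
```

Without normalization, `category_boost_applies("find_department", "department")` is simply false, so the boost is skipped with no error to notice.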
Step 3: PII Detection
The PII detection layer scans content for personally identifiable information using regex patterns:
| Pattern | Example | Action |
|---|---|---|
| Email addresses | facturatie@zol.be | Log detection |
| Phone numbers | 089 32 50 50 | Log detection |
| BSN (Dutch SSN) | 9-digit pattern | Log detection |
PII masking is disabled for ZOL content. Hospital contact information (email addresses, phone numbers) is intentionally public -- patients need this information to book appointments and reach departments. The PII detector still logs detections for audit purposes, but does not redact content. See PII Protection for the full rationale.
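A minimal sketch of log-only detection. The regular expressions are illustrative assumptions, not the pipeline's actual patterns, and `detect_pii` is a hypothetical name. The function reports matches and logs them but never alters the text.

```python
import logging
import re

logger = logging.getLogger("pii")

# Illustrative patterns only; the real detector's patterns are not shown in this document.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b0\d{2}\s?\d{2}\s?\d{2}\s?\d{2}\b"),
    "bsn": re.compile(r"\b\d{9}\b"),
}


def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs and log each detection; the text is left as-is."""
    detections = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            detections.append((kind, match.group(0)))
            # Log the span rather than the value so the audit log does not duplicate the PII.
            logger.info("PII detected: kind=%s span=%s", kind, match.span())
    return detections
```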
Step 4: Embedding Generation
Each text chunk is converted into a 1,536-dimensional dense vector using OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings; the dense bi-encoder retrieval pattern follows @karpukhin2020dpr):
- Model: text-embedding-3-large (OpenAI)
- Dimensions: 1,536 (truncated from the model's native 3,072 to fit pgvector's HNSW 2,000-dim limit; see @pgvector_docs)
- Tokenizer: cl100k_base (same OpenAI tokenizer used for chunk-size accounting → exact token counts)
- Language support: Strong multilingual; MTEB-NL retrieval ~64.6 (above BGE-M3's 60.0)
- Inference: OpenAI API; cost ~$0.13 per 1M tokens (75% prompt-cache discount; ~$0.20/month at 25,000 monthly queries)
- Configuration: EMBEDDING_PROVIDER=openai, EMBEDDING_MODEL=text-embedding-3-large, EMBEDDING_DIMENSIONS=1536
All texts are sent to OpenAI in a single batched API call rather than one request per chunk, which reduces round trips. The previous BGE-M3 / Ollama path is retained as the configurable fallback (EMBEDDING_PROVIDER=ollama) and as the model used for ColBERT reranking.
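A sketch of the batched call with truncated dimensions, assuming the official `openai` Python SDK, whose `embeddings.create` accepts `model`, `input`, and `dimensions`. The helper `embed_batch` is hypothetical, and `StubEmbeddings` is a stand-in so the example runs without an API key; in practice the client would be an `openai.OpenAI()` instance.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def embed_batch(client: Any, texts: Sequence[str],
                model: str = "text-embedding-3-large",
                dimensions: int = 1536) -> list[list[float]]:
    """Embed all texts in one request and return vectors in input order."""
    response = client.embeddings.create(model=model, input=list(texts), dimensions=dimensions)
    ordered = sorted(response.data, key=lambda item: item.index)
    vectors = [item.embedding for item in ordered]
    if any(len(vector) != dimensions for vector in vectors):
        raise ValueError("unexpected embedding dimensionality")
    return vectors


class StubEmbeddings:
    """Stand-in for a client's `embeddings` resource (dry runs only, returns zero vectors)."""

    def __init__(self) -> None:
        self.calls = 0

    def create(self, model, input, dimensions):
        self.calls += 1
        data = [SimpleNamespace(index=i, embedding=[0.0] * dimensions) for i in range(len(input))]
        return SimpleNamespace(data=data)


stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vectors = embed_batch(stub_client, ["Bezoekuren cardiologie", "Voorbereiding knieoperatie"])
```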
Step 5: Vector Storage (pgvector)
Embedded chunks are stored in PostgreSQL with the pgvector extension. Each record includes the chunk text, its embedding vector, and all metadata. An HNSW index enables approximate nearest neighbor search at query time.
Step 6: BM25 Search Vector Generation
After chunks are stored in pgvector, each chunk receives a PostgreSQL tsvector for BM25 keyword search (see ADR-007). The tsvector is constructed by concatenating the parent document's title with the chunk content, then applying PostgreSQL's to_tsvector() function. Only chunks that have not yet been indexed are processed, making the operation idempotent for incremental ingestion.
Key design choices:
- 'simple' text configuration: No language-specific stemming. Dutch medical terms like "cardioversie" and "orthopedie" should be matched exactly, not reduced to stems.
- Title included: The document title is prepended to each chunk's tsvector, so title terms are searchable via BM25.
- GIN index: A GIN index on search_vector enables sub-millisecond keyword lookup.
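The design choices above can be sketched as the text that gets indexed plus the two SQL statements involved. The schema is an assumption: the column names `id` and `search_vector` on `document_chunks`, the index name, and the psycopg-style named parameters are all hypothetical. `to_tsvector('simple', ...)` and `USING gin` are standard PostgreSQL.

```python
def tsvector_source(title: str, content: str) -> str:
    """Prepend the document title to the chunk content before indexing."""
    return " ".join(part.strip() for part in (title, content) if part and part.strip())


# Idempotent: rows that already have a search_vector are left untouched.
UPDATE_SQL = """
UPDATE document_chunks
SET search_vector = to_tsvector('simple', %(source)s)
WHERE id = %(chunk_id)s AND search_vector IS NULL
"""

# GIN index for keyword lookup on the tsvector column.
INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_document_chunks_search_vector
ON document_chunks USING gin (search_vector)
"""
```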
Step 7: Canonical Question Generation (Background)
For the why behind canonical questions and page summaries — the HyDE-at-index-time and Anthropic-contextual-retrieval rationale, and how both feed the query-time retrieval-steering triad — see Ingestion Enrichment: Canonical Questions & Page Summaries. This section documents the procedural how.
After tsvector generation, the pipeline generates 1-2 Dutch questions per chunk using the Tier 2 (standard) model via OpenAI. When contextual embeddings are enabled, questions are generated synchronously (before embedding) with asyncio.gather concurrency. Otherwise, this runs as a background task.
Prompt (Dutch):
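Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) die door deze tekst beantwoord worden. Geef enkel de vragen terug, één per regel. (English: "Given this text from a hospital website, generate 1-2 questions (in Dutch) that are answered by this text. Return only the questions, one per line.")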
Example:
- Chunk about visiting hours → "Wat zijn de bezoekuren op cardiologie?" ("What are the visiting hours in cardiology?")
- Chunk about Dr. Van den Berg → "Wie is de orthopedisch chirurg bij ZOL?" ("Who is the orthopedic surgeon at ZOL?")
The generated questions are:
- Stored in chunk_metadata.canonical_questions (list of strings)
- Appended to the search_vector tsvector, so BM25 search matches against both content and canonical questions
This enrichment significantly improves BM25 recall for question-phrased queries, which are the most common user input pattern.
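A minimal sketch of the generation step with asyncio.gather concurrency. The LLM is passed in as an async callable, and `fake_llm` is a stand-in that returns a canned reply; `parse_questions` and `generate_questions` are hypothetical names. The prompt string is the one quoted above.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

PROMPT = (
    "Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) "
    "die door deze tekst beantwoord worden. Geef enkel de vragen terug, één per regel."
)


def parse_questions(raw: str, limit: int = 2) -> list[str]:
    """One question per line; drop blank lines and list dashes, keep at most `limit`."""
    lines = [line.strip(" -\t") for line in raw.splitlines()]
    return [line for line in lines if line][:limit]


async def generate_questions(chunks: Sequence[str],
                             llm: Callable[[str], Awaitable[str]]) -> list[list[str]]:
    """Ask the LLM about every chunk concurrently and parse each reply."""
    replies = await asyncio.gather(*(llm(f"{PROMPT}\n\n{chunk}") for chunk in chunks))
    return [parse_questions(reply) for reply in replies]


async def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call; always returns the same question."""
    return "- Wat zijn de bezoekuren op cardiologie?\n\n"
```

Capping the parsed list at two guards against a model that returns more lines than the prompt asked for.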
Step 8: LLM Page Summary Generation
During graph extraction, the LLM generates a page summary for every page — a 2-3 sentence Dutch description of what the page covers. This summary is stored in chunk_metadata.page_summary (JSONB) and prepended to the first chunk from each document during context assembly, implementing Anthropic's contextual retrieval pattern.
Even though non-hub pages do not write entities to the taxonomy, the LLM still generates a page summary for every document. This ensures all chunks benefit from contextual retrieval at query time. See ADR-0014 for details.
Enrichment Retry & Gap Detection
LLM enrichment steps (canonical questions, page summaries, graph extraction) can fail for individual documents due to transient API errors, rate limits, or timeouts. Rather than re-running the entire pipeline, the system supports inline gap detection and backfill:
- Gap detection: Before starting enrichment, the pipeline queries for documents that completed chunking but are missing enrichment artifacts (e.g., chunks without canonical questions, documents without a page summary).
- Selective backfill: Only the missing enrichment steps are re-executed for gap documents, skipping the extraction and chunking stages entirely.
- Retry integration: Gap backfill runs as part of the normal ingestion flow -- operators do not need a separate "retry" action.
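The gap-detection step can be sketched over in-memory records. The record shape (dictionaries with `url`, `chunks`, and `page_summary` keys) and the name `find_gaps` are assumptions for illustration; the real pipeline queries the database.

```python
def find_gaps(documents: list[dict]) -> dict[str, list[str]]:
    """Map each chunked document's URL to the enrichment steps it is still missing."""
    gaps: dict[str, list[str]] = {}
    for doc in documents:
        if not doc["chunks"]:
            continue  # chunking never completed, so this is not an enrichment gap
        missing = []
        if any(not chunk.get("canonical_questions") for chunk in doc["chunks"]):
            missing.append("canonical_questions")
        if not doc.get("page_summary"):
            missing.append("page_summary")
        if missing:
            gaps[doc["url"]] = missing
    return gaps
```

Only the steps listed for a URL would then be re-run, leaving extraction and chunking untouched.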
LLM Rate Limiting
To avoid overwhelming the LLM API (and to stay within rate limits for cost control), the enrichment pipeline enforces a 200ms delay between consecutive LLM enrichment calls. This applies to canonical question generation, page summary generation, and graph extraction. The delay is implemented as an asyncio.sleep(0.2) between calls, providing a simple but effective throttle that prevents 429 (Too Many Requests) errors during large batch ingestion runs.
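A minimal sketch of the fixed-delay throttle. The name `run_throttled` is hypothetical, and `fake_call` is a stand-in for an LLM request that records when it ran.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence

call_times: list[float] = []


async def run_throttled(calls: Sequence[Callable[[], Awaitable[str]]],
                        delay_s: float = 0.2) -> list[str]:
    """Run the calls one after another, sleeping `delay_s` between consecutive calls."""
    results = []
    for index, call in enumerate(calls):
        if index:  # no delay before the first call
            await asyncio.sleep(delay_s)
        results.append(await call())
    return results


async def fake_call() -> str:
    """Stand-in for one enrichment request; records its start time."""
    call_times.append(time.monotonic())
    return "ok"
```

A fixed sleep bounds the request rate at roughly five calls per second regardless of how quickly each call returns.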
Taxonomy Seeding (Separate from Ingestion)
The entity taxonomy in PostgreSQL is not populated during regular page ingestion. Instead, it is seeded in a separate process from two curated sources, the frozen taxonomy and hub pages, both described below.
Why separate taxonomy seeding?
The taxonomy requires high-quality, validated data to support structured lookups (e.g., "which doctors work in cardiology?"). Allowing every crawled page to write entities caused several quality problems during early development:
- Cross-product relationships: Departments linked to ALL campuses instead of the correct one
- Garbage entity names: Body parts and job titles parsed as doctor names
- Scope leakage: Conditions associated with departments that don't treat them
The solution was to restrict taxonomy writes to two curated sources:
- Frozen taxonomy (zol_taxonomy.py): A manually curated single source of truth containing all departments, conditions, treatments, examinations, campus mappings, and their relationships. This file encodes institutional knowledge (e.g., "Cardiologie is located at ZOL Sint-Jan" and "Cardiologie handles Hartfalen").
- Hub pages: Structural listing pages (automatically classified as hub by an LLM binary classifier) that contain validated doctor-department-specialty information. Only these pages pass the graph_golden_only gate (enabled by default). The previous 8-type page classification (GOLDEN_SEED, GOLDEN_LISTING, etc.) has been replaced by a binary hub/detail classifier.
This architecture is documented in ADR-0028.
What happens during regular page ingestion?
When graph_golden_only = true (the default), the ingestion pipeline still runs regex extraction and LLM validation on every page, but only for generating page summaries — no entities are written to the taxonomy. The flow is:
- Regex extraction runs → identifies entities and relationships
- LLM validation runs → generates a page summary (2-3 sentence Dutch description)
- Page summary stored in chunk_metadata.page_summary (pgvector)
- Taxonomy storage skipped: entities are discarded, only the summary is kept
- Entity types denormalized into doc_metadata for search boosting
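The gate in the flow above can be sketched as a pure function. The name `apply_graph_gate` and the returned dictionary are hypothetical, and the behaviour with the gate switched off (every page may write entities) is an assumption, since the source only describes the default.

```python
def apply_graph_gate(entities: list[dict], summary: str, is_hub: bool,
                     graph_golden_only: bool = True) -> dict:
    """Decide what one page contributes after extraction and LLM validation."""
    # Assumption: with the gate off, any page may write; with it on, only hub pages do.
    writes_taxonomy = is_hub or not graph_golden_only
    return {
        "page_summary": summary,  # always kept
        "taxonomy_entities": list(entities) if writes_taxonomy else [],
        "entity_types": sorted({entity["type"] for entity in entities}),  # kept for boosting
    }
```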
Entity Type and Campus Denormalization
After graph extraction, entity types and campus names found in the document are denormalized back onto the document's metadata:
- doc_metadata.entity_types -- list of entity types discovered (e.g., ["doctors", "departments", "conditions"])
- doc_metadata.campus -- list of campus names referenced (e.g., ["ZOL Sint-Jan", "ZOL André Dumont"])
This denormalization enables the metadata boosting stage in the query pipeline to apply entity type match and campus match boosts without performing runtime graph lookups, keeping the boosting latency under 5ms.
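A sketch of the denormalization and of a boost check that reads only the document's own metadata. The entity record shape and the names `denormalize` and `matched_boosts` are hypothetical; boost weights are left out because the source does not state them.

```python
def denormalize(doc_metadata: dict, entities: list[dict]) -> dict:
    """Copy the entity types and campus names found on a page onto its metadata."""
    entity_types = sorted({entity["type"] for entity in entities})
    campuses = sorted({entity["campus"] for entity in entities if entity.get("campus")})
    return {**doc_metadata, "entity_types": entity_types, "campus": campuses}


def matched_boosts(doc_metadata: dict, wanted_type: str, wanted_campus: str) -> list[str]:
    """Name the boosts that apply, using list membership instead of a graph lookup."""
    matches = []
    if wanted_type in doc_metadata.get("entity_types", []):
        matches.append("entity_type")
    if wanted_campus in doc_metadata.get("campus", []):
        matches.append("campus")
    return matches
```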
URL Crawling & Orchestration
Before content processing begins, URLs must be discovered, classified, and managed. The crawl-to-ingest pipeline handles this with two persistent models:
Crawl Sessions
A CrawlSession represents a sitemap crawl that discovers URLs. The crawler parses the hospital's sitemap.xml, classifies each URL by type (HTML, PDF, DOCX, or ignored), and stores them as CrawledUrl records with status tracking (discovered → indexed / failed / skipped).
Ingestion Jobs
An IngestionJob is a batch operation that processes discovered URLs. Each URL gets an IngestionResult record for granular tracking. The orchestration layer provides:
| Feature | Implementation | Purpose |
|---|---|---|
| Concurrent processing | 10 async workers × batches of 50 | Throughput without overwhelming external services |
| Live progress | Redis hash with SSE streaming | Real-time UI updates (current URL, chunk counts, entity names) |
| Job cancellation | Redis flag polled before each batch | Graceful stop without killing the server |
| Content deduplication | Title-based dedup check | ZOL's Drupal site has multiple URL paths to the same content |
| Fault tolerance | Rescue sessions for stuck results | If error recording fails, a rescue session forces "failed" status |
| Per-URL timeout | 5-minute per-URL ingestion timeout | Prevents a single slow URL from blocking the entire batch |
| Timeout protection | 120s document processing timeout | Raises error instead of silently accepting 0-chunk results |
| Infinite loop guard | Max iterations on batch loop | Prevents infinite re-fetch if results stay "pending" |
| Redis non-critical | try/except on all Redis helpers | Live details are UI candy — never crash processing |
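Several rows of this table (bounded concurrency, batching, cancellation polling, per-URL timeout, iteration guard) can be sketched together. This is an illustration under assumptions: the cancellation flag is a plain callable rather than a Redis key, the status strings are invented, and `run_job` and `fake_ingest` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_job(urls: Sequence[str],
                  ingest_one: Callable[[str], Awaitable[None]],
                  is_cancelled: Callable[[], bool],
                  batch_size: int = 50, workers: int = 10,
                  per_url_timeout_s: float = 300.0,
                  max_iterations: int = 1000) -> dict[str, str]:
    """Process URLs in batches with bounded concurrency; return a status per URL."""
    results: dict[str, str] = {}
    semaphore = asyncio.Semaphore(workers)

    async def guarded(url: str) -> None:
        async with semaphore:  # at most `workers` ingestions run at once
            try:
                await asyncio.wait_for(ingest_one(url), timeout=per_url_timeout_s)
                results[url] = "indexed"
            except asyncio.TimeoutError:
                results[url] = "failed: timeout"
            except Exception:
                results[url] = "failed: error"

    pending = list(urls)
    for _ in range(max_iterations):  # guard against an endless batch loop
        if not pending or is_cancelled():  # flag is polled before each batch
            break
        batch, pending = pending[:batch_size], pending[batch_size:]
        await asyncio.gather(*(guarded(url) for url in batch))
    return results


async def fake_ingest(url: str) -> None:
    """Stand-in worker: URLs ending in 'slow' hang, URLs ending in 'bad' raise."""
    if url.endswith("slow"):
        await asyncio.sleep(1)
    elif url.endswith("bad"):
        raise ValueError("boom")
```

Polling the flag between batches means a cancelled job finishes the batch in flight and then stops, rather than being interrupted mid-URL.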
URL Type Classification
The system classifies URLs before processing to select the appropriate extraction strategy:
| URL Type | Extension/Pattern | Extraction Method |
|---|---|---|
| HTML | .html, no extension | crawl4ai (JS rendering) + httpx fallback |
| PDF | .pdf | PyMuPDF (fitz) text extraction |
| DOCX | .docx | python-docx (paragraphs + tables) |
| Ignored | .jpg, .png, .gif, .css, .js, etc. | Skipped (not content) |
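The table above maps onto a small classifier. This is a sketch: `classify_url` is a hypothetical name, the ignored set contains only the extensions the table lists, and treating any other unknown extension as ignored is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

IGNORED_SUFFIXES = {".jpg", ".png", ".gif", ".css", ".js"}  # only those named in the table


def classify_url(url: str) -> str:
    """Return 'html', 'pdf', 'docx', or 'ignored' based on the URL path's extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in ("", ".html"):
        return "html"
    # Assumption: listed asset types and any other extension are skipped.
    return "ignored"
```

Parsing the URL first keeps query strings such as `?file=x.pdf` from being mistaken for the extension.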
Scheduled Ingestion
The ingestion pipeline runs nightly under operator-controlled gating. The scheduler is implemented in backend/app/services/scheduler_service.py and gated by the INGEST_MODE operational setting (backend/app/config.py: ingest_mode: Literal["off", "manual", "auto"]).
| INGEST_MODE | Behaviour |
|---|---|
| off | Skip both crawl and ingest. Used during incident response or when a long manual ingestion is in flight. |
| manual | Crawl runs (sitemap discovery + URL classification), but the ingestion phase halts. Admins drive ingestion via the Pipeline Wizard UI. |
| auto | Full pipeline (fetch + chunk + embed + delete-handling) executes nightly. This has been the production setting on the pilot since 2026-04-22. |
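The gating can be sketched as a function from mode to the phases the nightly run executes. The `Literal` type mirrors the `ingest_mode` setting quoted above; the function name `phases_for` and the phase labels are hypothetical.

```python
from typing import Literal

IngestMode = Literal["off", "manual", "auto"]


def phases_for(mode: IngestMode) -> list[str]:
    """Return the phases a nightly run executes under the given mode."""
    if mode == "off":
        return []  # neither crawl nor ingest
    if mode == "manual":
        return ["crawl"]  # discovery only; admins trigger ingestion themselves
    if mode == "auto":
        return ["crawl", "ingest"]
    raise ValueError(f"unknown INGEST_MODE: {mode!r}")
```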
Audit Trail — IngestRun
Every nightly invocation, regardless of mode, creates one app.ingest_runs row (backend/app/models/database.py:875, schema added in alembic migration 061). The row records:
| Column | Purpose |
|---|---|
| id, tenant_id | Run identity, scoped per hospital |
| started_at, completed_at | Wall-clock duration |
| mode | The active INGEST_MODE for this run |
| urls_discovered, urls_indexed, urls_failed, urls_skipped | Final counts from the orchestration layer |
| failure_class | Coarse categorisation (e.g., DEAD_EMPTY_CONTENT, SITEMAP_TIMEOUT) for alerting |
| notes | Free-text context (most-frequent failure URLs, retry hints) |
Disarming Auto-Ingest Without a Deploy
To disarm auto-ingest without a deploy: set INGEST_MODE=manual in backend/.env and restart uvicorn. The next 03:00 UTC cycle respects the new mode. The full design lives in the spec at docs/superpowers/specs/2026-04-17-nightly-ingest-design.md.
Incremental Updates
The pipeline supports incremental ingestion, a well-established best practice in web crawling literature. Olston and Najork (2010) identify three fundamental strategies for maintaining crawl freshness: periodic re-crawling, change-driven selective updates, and frequency-based scheduling. Cho and Garcia-Molina (2000) demonstrated that incremental crawlers that selectively update their index rather than performing batch refreshes achieve significantly higher "freshness" — defined as the fraction of the collection that is up-to-date at any given time.
The ZOL system implements a content-hash-based change detection strategy, which Cho and Garcia-Molina (2003) showed can improve crawl efficiency by up to 35% by focusing resources on content that has actually changed:
- New documents: Crawl sitemap, identify new URLs, ingest only new content
- Updated documents: Compare SHA-256 content hashes (content_hash in the document_chunks table), re-ingest only changed documents. This implements the hash-based change detection approach recommended by Olston and Najork (2010, §4.2)
- Deleted documents: Mark chunks as inactive (soft delete for audit trail), preserving provenance
- Re-ingestion: "Clear Databases" resets URL status to discovered without re-crawling, enabling re-processing with updated chunking or embedding strategies
This approach avoids full re-ingestion, which would be prohibitively slow for the ~1,000 brochures and hundreds of web pages in the ZOL corpus.
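A minimal sketch of hash-based change detection. It assumes the stored state is a mapping from URL to SHA-256 hex digest and the crawl result is a mapping from URL to extracted text; `content_hash` as a function name and `plan_update` are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the extracted content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Sort URLs into new, updated, unchanged, and deleted by comparing hashes."""
    plan: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": [], "deleted": []}
    for url, text in crawled.items():
        if url not in stored:
            plan["new"].append(url)
        elif stored[url] != content_hash(text):
            plan["updated"].append(url)
        else:
            plan["unchanged"].append(url)
    # URLs that were indexed before but no longer appear are candidates for soft delete.
    plan["deleted"] = [url for url in stored if url not in crawled]
    return plan
```

Only the `new` and `updated` lists need to go through chunking and embedding again.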
The incremental ingestion architecture draws on foundational web crawling research:
- Freshness optimisation: Cho, J. & Garcia-Molina, H. (2000). The Evolution of the Web and Implications for an Incremental Crawler. VLDB 2000.
- Change frequency estimation: Cho, J. & Garcia-Molina, H. (2003). Estimating Frequency of Change. ACM Transactions on Internet Technology, 3(3), 256–290.
- Comprehensive survey: Olston, C. & Najork, M. (2010). Web Crawling. Foundations and Trends in Information Retrieval, 4(3), 175–246.