
Document Ingestion Pipeline

Before the search system can answer questions, hospital content must be transformed from its raw formats into searchable representations. The ingestion pipeline converts documents into vector embeddings for semantic search, while the entity taxonomy is seeded separately from curated taxonomy data and hub pages.

Pipeline-Level Trade-offs

| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Embedding model | OpenAI text-embedding-3-large (1,536-dim, @openai2024embeddings) | Self-hosted BGE-M3 (@chen2024bgem3) at 1,024-dim via Ollama; OpenAI text-embedding-3-small (1,536-dim, cheaper) | BGE-M3 was the previous baseline (ADR-0033). The migration (ADR-0048) accepted the ~$0.20/month cost for stronger Dutch retrieval (MTEB-NL ~64.6 vs 60.0) and removal of the Ollama operational dependency. text-embedding-3-small was rejected because the empirical retrieval gap on Dutch medical content was ~3 points MTEB-NL: small but measurable. |
| Chunking strategy | Markdown-header-aware split with 350-token target, 70-token overlap, hard 450-token ceiling | Fixed-token sliding window with no structural awareness; recursive character-based; semantic-similarity-based chunking | Hospital content (brochures, condition pages) is inherently sectioned. A fixed-token window severs section boundaries and dilutes the semantic signal; semantic-similarity chunking adds a per-document inference call we can't justify at corpus scale. Markdown-aware splitting respects existing structure with one regex pass. |
| Subprocess isolation for PDFs | ProcessPoolExecutor with 120 s timeout, image-only detection | In-process PyMuPDF; external service call | A misbehaving PDF in-process can OOM, segfault, or hang the FastAPI worker; we observed such failures during the initial 1,000-brochure import. An external service adds operational surface. Subprocess isolation kills only the worker on a bad PDF and lets the next URL through. |
| Re-ingestion strategy | Content-hash-based incremental update | Periodic full re-crawl; last-modified-header check | A full re-crawl is prohibitively slow, and HTTP Last-Modified is rarely populated correctly by Drupal. A content-hash diff captures actual changes, in line with the findings of Cho & Garcia-Molina (2003) (cited in §Incremental Updates below). |

Pipeline Overview

Step 1: Content Extraction

The extraction layer normalizes diverse document formats into a common Markdown representation:

| Source Format | Extraction Tool | Notes |
|---|---|---|
| PDF | PyMuPDF (fitz) | Subprocess-isolated extraction with 120s timeout |
| DOCX | python-docx | Preserves heading structure |
| HTML | crawl4ai | Async crawling with JavaScript rendering |
| Sitemap | Custom parser | Discovers URLs for crawl4ai |

Subprocess-Isolated PDF Extraction

PDF extraction runs in a ProcessPoolExecutor rather than the main async event loop. This isolates the ingestion pipeline from problematic PDFs that could crash, hang, or consume excessive memory:

  • 120-second timeout: Each PDF extraction call has a hard timeout. If a PDF takes longer (e.g., a 200-page scanned brochure), the subprocess is killed and the document is marked as failed.
  • Image-only PDF detection: The _is_image_only_pdf() helper inspects the first pages of a PDF for extractable text. If a PDF contains only scanned images with no text layer (common for older hospital brochures), it is classified as image-only and skipped with a descriptive error rather than producing empty chunks.
  • Process isolation: A segfault or memory leak in PyMuPDF kills only the worker process, not the FastAPI server.
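The timeout and crash containment described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: for brevity it swaps in `subprocess.run` with a child Python interpreter in place of the `ProcessPoolExecutor`, and `run_isolated` is a hypothetical name. In practice the child would run the PyMuPDF extraction.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float = 120.0) -> tuple[str, str]:
    """Run a Python snippet in a child interpreter and return (status, detail)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return ("failed", f"timed out after {timeout_s}s")
    if proc.returncode != 0:
        # A crash in the child surfaces here as a non-zero exit code;
        # the parent process keeps running.
        return ("failed", f"worker exited with code {proc.returncode}")
    return ("ok", proc.stdout.strip())
```

A hanging or crashing child yields a `"failed"` status for that one document, and the caller can move on to the next URL.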

Sitemap-driven crawling is particularly important for ZOL. The hospital website (built on Drupal by partner agency Novation) exposes a comprehensive sitemap. The crawler uses this sitemap to systematically discover and ingest all public-facing content, ensuring complete coverage.

The extracted Markdown is stored in MinIO as the canonical source, enabling re-processing without re-crawling when chunking or embedding strategies change.

Step 2: Text Chunking

Text chunking is the process of splitting documents into segments suitable for embedding. The chunking strategy significantly impacts retrieval quality -- chunks that are too large dilute semantic signals, while chunks that are too small lose context.

Chunking Configuration

| Parameter | Value | Rationale |
|---|---|---|
| Target size | 350 tokens | Optimal for text-embedding-3-large (empirically tested; unchanged from BGE-M3 baseline at the ADR-0048 migration) |
| Maximum size | 450 tokens | Hard ceiling to prevent oversized chunks |
| Overlap | 70 tokens | Preserves context across chunk boundaries |
| Tokenizer | Tiktoken (cl100k_base) | Matches OpenAI-family tokenization |
| Split awareness | Markdown headers | Respects document structure |

Why Markdown-Aware Splitting?

Hospital content is inherently structured. A brochure about knee surgery has sections like "Voorbereiding" (Preparation), "Procedure", and "Nazorg" (Aftercare). Splitting at heading boundaries ensures that each chunk represents a coherent topic, rather than an arbitrary slice of text. This decision is documented in ADR-0001.
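A minimal sketch of heading-aware splitting with a 350-token target and 70-token overlap. Several things here are assumptions for illustration: whitespace-separated words stand in for cl100k_base tokens, the function names are hypothetical, and the 450-token ceiling is not modelled separately because no window exceeds the target.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at heading lines."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        text = "\n".join(body).strip()
        if text:  # heading-only sections produce no chunk
            sections.append((heading, text))
        heading, body = match.group(1).strip(), []
    text = "\n".join(body).strip()
    if text:
        sections.append((heading, text))
    return sections


def window(tokens: list[str], target: int = 350, overlap: int = 70) -> list[list[str]]:
    """Slide a target-sized window, stepping by (target - overlap) tokens."""
    step = target - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # the last window already reaches the end of the section
    return windows


def chunk_markdown(markdown: str) -> list[dict]:
    """Chunk each section independently so no chunk spans a heading."""
    chunks = []
    for heading, text in split_sections(markdown):
        for tokens in window(text.split()):  # whitespace stand-in for tiktoken
            chunks.append({"section": heading, "text": " ".join(tokens)})
    return chunks
```

Because windows are computed per section, a chunk about "Voorbereiding" never bleeds into "Nazorg", while long sections still overlap internally.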

Metadata Attachment

Each chunk carries metadata that enables downstream filtering and boosting:

  • Source URL: The original page or document URL
  • Source title: The document or page title
  • Section headers: The Markdown heading hierarchy above this chunk
  • Category: Inferred from the source, normalized to title case (e.g., "Department", not "department")
  • Ingestion timestamp: For recency-based boosting
  • Canonical questions: 1-2 Dutch questions answered by this chunk (see Step 7 below)

Title keyword extraction replaced

The previous title_keywords metadata field has been replaced by BM25 tsvector search (ADR-007). The document title is now included directly in the tsvector, so title terms are searchable via BM25 without a separate metadata field.

Category Normalization

Categories are normalized to title case during ingestion (e.g., "department" becomes "Department"). This prevents silent metadata boost failures that could occur when category casing differs between ingested documents and the intent-to-category mapping used at query time.
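The failure mode is easy to reproduce. In this small sketch the mapping `INTENT_TO_CATEGORY` and both function names are hypothetical; only the title-case rule comes from the text above.

```python
# Hypothetical intent-to-category mapping used at query time.
INTENT_TO_CATEGORY = {"find_department": "Department"}


def normalize_category(raw: str) -> str:
    """Normalize a category label to title case, as done during ingestion."""
    return raw.strip().title()


def category_matches(chunk_category: str, intent: str) -> bool:
    """Exact string comparison: any casing mismatch silently disables the boost."""
    return chunk_category == INTENT_TO_CATEGORY.get(intent)
```

Without normalization, `category_matches("department", "find_department")` is simply `False`: no error is raised, and the boost never fires.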

Step 3: PII Detection

The PII detection layer scans content for personally identifiable information using regex patterns:

| Pattern | Example | Action |
|---|---|---|
| Email addresses | facturatie@zol.be | Log detection |
| Phone numbers | 089 32 50 50 | Log detection |
| BSN (Dutch SSN) | 9-digit pattern | Log detection |

Intentional Design Choice

PII masking is disabled for ZOL content. Hospital contact information (email addresses, phone numbers) is intentionally public -- patients need this information to book appointments and reach departments. The PII detector still logs detections for audit purposes, but does not redact content. See PII Protection for the full rationale.
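A minimal sketch of detect-and-log without redaction. The regular expressions are illustrative assumptions, not the pipeline's actual patterns: the phone pattern only covers the `089 32 50 50` layout shown in the table, and the logger name and function name are hypothetical.

```python
import logging
import re

logger = logging.getLogger("pii_audit")  # hypothetical logger name

# Illustrative patterns only; the real detector's regexes are not shown in this document.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b0\d{2}(?: \d{2}){3}\b"),
    "bsn": re.compile(r"\b\d{9}\b"),
}


def scan_for_pii(text: str) -> dict[str, int]:
    """Count PII matches per type and log them; the text itself is never modified."""
    counts = {name: len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}
    for name, count in counts.items():
        if count:
            logger.info("PII detected: type=%s count=%d", name, count)
    return counts
```

The function returns counts only, so public contact details pass through to the chunks unchanged while the audit log still records what was seen.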

Step 4: Embedding Generation

Each text chunk is converted into a 1,536-dimensional dense vector using OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings; the dense bi-encoder retrieval pattern follows @karpukhin2020dpr):

  • Model: text-embedding-3-large (OpenAI)
  • Dimensions: 1,536 (truncated from the model's native 3,072 to fit pgvector's HNSW 2,000-dim limit; see @pgvector_docs)
  • Tokenizer: cl100k_base (same OpenAI tokenizer used for chunk-size accounting → exact token counts)
  • Language support: Strong multilingual; MTEB-NL retrieval ~64.6 (above BGE-M3's 60.0)
  • Inference: OpenAI API; cost ~$0.13 per 1M tokens (75% prompt-cache discount; ~$0.20/month at 25,000 monthly queries)
  • Configuration: EMBEDDING_PROVIDER=openai, EMBEDDING_MODEL=text-embedding-3-large, EMBEDDING_DIMENSIONS=1536

All texts are sent to OpenAI in a single batched API call rather than one request per chunk, which keeps ingestion fast without changing embedding quality. The previous BGE-M3 / Ollama path is retained as the configurable fallback (EMBEDDING_PROVIDER=ollama) and as the model used for ColBERT reranking.
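The shortening from 3,072 to 1,536 dimensions is normally requested from the API through its dimensions setting. To the best of our understanding of OpenAI's guidance, shortening a vector by hand amounts to truncating it and L2-normalising what remains; the sketch below shows that operation in plain Python. The function name is hypothetical, and the pipeline itself may never need to do this client-side.

```python
import math


def shorten_embedding(vector: list[float], dims: int = 1536) -> list[float]:
    """Keep the first `dims` components and rescale them to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalised
    return [x / norm for x in head]
```

Re-normalising matters because cosine and inner-product comparisons assume unit-length vectors, and a truncated vector is generally shorter than 1.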

Step 5: Vector Storage (pgvector)

Embedded chunks are stored in PostgreSQL with the pgvector extension. Each record includes the chunk text, its embedding vector, and all metadata. An HNSW index enables approximate nearest neighbor search at query time.

Step 6: BM25 Search Vector Generation

After chunks are stored in pgvector, each chunk receives a PostgreSQL tsvector for BM25 keyword search (see ADR-007). The tsvector is constructed by concatenating the parent document's title with the chunk content, then applying PostgreSQL's to_tsvector() function. Only chunks that have not yet been indexed are processed, making the operation idempotent for incremental ingestion.

Key design choices:

  • 'simple' text configuration: No language-specific stemming. Dutch medical terms like "cardioversie" and "orthopedie" should be matched exactly, not reduced to stems
  • Title included: The document title is prepended to each chunk's tsvector, so title terms are searchable via BM25
  • GIN index: A GIN index on search_vector enables sub-millisecond keyword lookup
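The design choices above could translate into SQL along the following lines, shown here as Python string constants. Only `document_chunks`, `search_vector`, `to_tsvector()` and the 'simple' configuration come from this document; the `documents` table, the `title`, `content`, `id` and `document_id` columns, and the index name are assumptions for illustration.

```python
# Assumed schema: documents(id, title) and document_chunks(document_id, content, search_vector).

ADD_GIN_INDEX = """
CREATE INDEX IF NOT EXISTS idx_document_chunks_search_vector
    ON document_chunks USING GIN (search_vector);
"""

# Title is prepended to the chunk content; 'simple' applies no stemming.
# The IS NULL filter restricts the update to chunks not yet indexed,
# which is what makes repeated runs idempotent.
BACKFILL_SEARCH_VECTORS = """
UPDATE document_chunks AS c
SET search_vector = to_tsvector('simple', coalesce(d.title, '') || ' ' || c.content)
FROM documents AS d
WHERE d.id = c.document_id
  AND c.search_vector IS NULL;
"""
```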

Step 7: Canonical Question Generation (Background)

Conceptual overview

For the why behind canonical questions and page summaries — the HyDE-at-index-time and Anthropic-contextual-retrieval rationale, and how both feed the query-time retrieval-steering triad — see Ingestion Enrichment: Canonical Questions & Page Summaries. This section documents the procedural how.

After tsvector generation, the pipeline generates 1-2 Dutch questions per chunk using the Tier 2 (standard) model via OpenAI. When contextual embeddings are enabled, questions are generated synchronously (before embedding) with asyncio.gather concurrency. Otherwise, this runs as a background task.

Prompt (Dutch):

Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen (in het Nederlands) die door deze tekst beantwoord worden. Geef enkel de vragen terug, één per regel.

Example:

  • Chunk about visiting hours → "Wat zijn de bezoekuren op cardiologie?"
  • Chunk about Dr. Van den Berg → "Wie is de orthopedisch chirurg bij ZOL?"

The generated questions are:

  1. Stored in chunk_metadata.canonical_questions (list of strings)
  2. Appended to the search_vector tsvector, so BM25 search matches against both content and canonical questions

This enrichment significantly improves BM25 recall for question-phrased queries, which are the most common user input pattern.
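A minimal sketch of the generation step with `asyncio.gather` concurrency. The `ask_llm` callable is a hypothetical stand-in for the model client, the chunk dictionaries use an assumed shape, and the parsing rules (strip list markers, keep lines ending in a question mark, cap at two) are illustrative rather than the pipeline's actual behaviour.

```python
import asyncio
import re

PROMPT = (
    "Gegeven deze tekst van een ziekenhuiswebsite, genereer 1-2 vragen "
    "(in het Nederlands) die door deze tekst beantwoord worden. "
    "Geef enkel de vragen terug, één per regel."
)

LIST_MARKER = re.compile(r"^\s*(?:[-•*]|\d+[.)])\s*")


def parse_questions(response: str, limit: int = 2) -> list[str]:
    """Turn a one-question-per-line reply into at most `limit` clean questions."""
    lines = (LIST_MARKER.sub("", line).strip() for line in response.splitlines())
    return [line for line in lines if line.endswith("?")][:limit]


async def generate_questions(chunks: list[dict], ask_llm) -> None:
    """Ask the model once per chunk, concurrently, and store the parsed questions."""

    async def enrich(chunk: dict) -> None:
        reply = await ask_llm(f"{PROMPT}\n\n{chunk['content']}")
        metadata = chunk.setdefault("chunk_metadata", {})
        metadata["canonical_questions"] = parse_questions(reply)

    await asyncio.gather(*(enrich(chunk) for chunk in chunks))
```

The stored questions would then be appended to the chunk's `search_vector` in a follow-up step.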

Step 8: LLM Page Summary Generation

During graph extraction, the LLM generates a page summary for every page — a 2-3 sentence Dutch description of what the page covers. This summary is stored in chunk_metadata.page_summary (JSONB) and prepended to the first chunk from each document during context assembly, implementing Anthropic's contextual retrieval pattern.

Page summaries are generated for all pages, not just hub pages

Even though non-hub pages do not write entities to the taxonomy, the LLM still generates a page summary for every document. This ensures all chunks benefit from contextual retrieval at query time. See ADR-0014 for details.
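How the summary might be prepended during context assembly, as a minimal sketch. The chunk dictionary shape (`document_id`, `content`, `chunk_metadata`) and the function name are assumptions for illustration.

```python
def assemble_context(chunks: list[dict]) -> str:
    """Join retrieved chunks, prefixing each document's first chunk with its page summary."""
    seen_documents: set[str] = set()
    parts = []
    for chunk in chunks:
        text = chunk["content"]
        summary = chunk.get("chunk_metadata", {}).get("page_summary")
        if chunk["document_id"] not in seen_documents and summary:
            text = f"{summary}\n\n{text}"
        seen_documents.add(chunk["document_id"])
        parts.append(text)
    return "\n\n".join(parts)
```

Prepending once per document gives the model page-level context without repeating the same summary for every chunk from that page.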

Enrichment Retry & Gap Detection

LLM enrichment steps (canonical questions, page summaries, graph extraction) can fail for individual documents due to transient API errors, rate limits, or timeouts. Rather than re-running the entire pipeline, the system supports inline gap detection and backfill:

  1. Gap detection: Before starting enrichment, the pipeline queries for documents that completed chunking but are missing enrichment artifacts (e.g., chunks without canonical questions, documents without a page summary).
  2. Selective backfill: Only the missing enrichment steps are re-executed for gap documents, skipping the extraction and chunking stages entirely.
  3. Retry integration: Gap backfill runs as part of the normal ingestion flow -- operators do not need a separate "retry" action.
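The gap-detection step can be sketched over in-memory records. In the real pipeline this would be a database query; the record shape, the `"chunked"` status value and the function name below are assumptions for illustration.

```python
def find_enrichment_gaps(documents: list[dict]) -> dict[str, list[str]]:
    """Map each chunked document's URL to the enrichment artifacts it is missing."""
    gaps: dict[str, list[str]] = {}
    for doc in documents:
        if doc["status"] != "chunked":  # hypothetical status: chunking completed
            continue
        missing = []
        if any(not chunk.get("canonical_questions") for chunk in doc["chunks"]):
            missing.append("canonical_questions")
        if not doc.get("page_summary"):
            missing.append("page_summary")
        if missing:
            gaps[doc["url"]] = missing
    return gaps
```

The backfill then runs only the steps listed for each URL, so extraction and chunking are never repeated for a document that merely hit a transient API error.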

LLM Rate Limiting

To avoid overwhelming the LLM API (and to stay within rate limits for cost control), the enrichment pipeline enforces a 200ms delay between consecutive LLM enrichment calls. This applies to canonical question generation, page summary generation, and graph extraction. The delay is implemented as an asyncio.sleep(0.2) between calls, providing a simple but effective throttle that prevents 429 (Too Many Requests) errors during large batch ingestion runs.
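A minimal sketch of the fixed-delay throttle. The function name and the injectable `sleep` parameter are assumptions added so the behaviour can be exercised without waiting; by default it uses `asyncio.sleep` with the 0.2 s delay described above.

```python
import asyncio


async def run_throttled(calls, delay_s: float = 0.2, sleep=asyncio.sleep) -> list:
    """Await each zero-argument coroutine function in turn, pausing between calls."""
    results = []
    for index, call in enumerate(calls):
        if index:  # no pause before the first call
            await sleep(delay_s)
        results.append(await call())
    return results
```

At 200 ms per gap this caps throughput at roughly five calls per second per enrichment loop, before counting the latency of the calls themselves.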

Taxonomy Seeding (Separate from Ingestion)

The entity taxonomy in PostgreSQL is not populated during regular page ingestion. Instead, it is seeded in a separate process from two curated sources: a frozen taxonomy file and hub pages.

Why separate taxonomy seeding?

The taxonomy requires high-quality, validated data to support structured lookups (e.g., "which doctors work in cardiology?"). Allowing every crawled page to write entities caused several quality problems during early development:

  • Cross-product relationships: Departments linked to ALL campuses instead of the correct one
  • Garbage entity names: Body parts and job titles parsed as doctor names
  • Scope leakage: Conditions associated with departments that don't treat them

The solution was to restrict taxonomy writes to two curated sources:

  1. Frozen taxonomy (zol_taxonomy.py): A manually curated single source of truth containing all departments, conditions, treatments, examinations, campus mappings, and their relationships. This file encodes institutional knowledge (e.g., "Cardiologie is located at ZOL Sint-Jan" and "Cardiologie handles Hartfalen").

  2. Hub pages: Structural listing pages (automatically classified as hub by an LLM binary classifier) that contain validated doctor-department-specialty information. Only these pages pass the graph_golden_only gate (enabled by default). The previous 8-type page classification (GOLDEN_SEED, GOLDEN_LISTING, etc.) has been replaced by a binary hub/detail classifier.

This architecture is documented in ADR-0028.

What happens during regular page ingestion?

When graph_golden_only = true (the default), the ingestion pipeline still runs regex extraction and LLM validation on every page, but only for generating page summaries — no entities are written to the taxonomy. The flow is:

  1. Regex extraction runs → identifies entities and relationships
  2. LLM validation runs → generates a page summary (2-3 sentence Dutch description)
  3. Page summary stored in chunk_metadata.page_summary (pgvector)
  4. Taxonomy storage skipped — entities are discarded, only the summary is kept
  5. Entity types denormalized into doc_metadata for search boosting
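The gate in the flow above can be sketched as a single decision. The `is_hub` flag, the shape of the `extraction` dictionary and the function name are assumptions for illustration; only the `graph_golden_only` setting and its default come from this document.

```python
def plan_graph_writes(is_hub: bool, extraction: dict, *, graph_golden_only: bool = True) -> dict:
    """Decide what one page contributes: always a summary, entities only past the gate."""
    passes_gate = is_hub or not graph_golden_only
    return {
        "page_summary": extraction["page_summary"],
        "taxonomy_entities": extraction["entities"] if passes_gate else [],
    }
```

With the default setting, a detail page keeps its summary and drops its entities, while a hub page writes both.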

Entity Type and Campus Denormalization

After graph extraction, entity types and campus names found in the document are denormalized back onto the document's metadata:

  • doc_metadata.entity_types -- list of entity types discovered (e.g., ["doctors", "departments", "conditions"])
  • doc_metadata.campus -- list of campus names referenced (e.g., ["ZOL Sint-Jan", "ZOL André Dumont"])

This denormalization enables the metadata boosting stage in the query pipeline to apply entity type match and campus match boosts without performing runtime graph lookups, keeping the boosting latency under 5ms.
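A minimal sketch of the denormalization step. The entity record shape (`type`, optional `campus`) and the function name are assumptions; the two target fields are the ones listed above.

```python
def denormalize_entities(doc_metadata: dict, entities: list[dict]) -> dict:
    """Copy distinct entity types and campus names onto the document metadata."""
    entity_types = sorted({entity["type"] for entity in entities})
    campuses = sorted({entity["campus"] for entity in entities if entity.get("campus")})
    return {**doc_metadata, "entity_types": entity_types, "campus": campuses}
```

At query time the boosting stage can then compare plain lists on the document row instead of joining against the taxonomy.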

URL Crawling & Orchestration

Before content processing begins, URLs must be discovered, classified, and managed. The crawl-to-ingest pipeline handles this with two persistent models:

Crawl Sessions

A CrawlSession represents a sitemap crawl that discovers URLs. The crawler parses the hospital's sitemap.xml, classifies each URL by type (HTML, PDF, DOCX, or ignored), and stores them as CrawledUrl records with status tracking (discovered → indexed / failed / skipped).

Ingestion Jobs

An IngestionJob is a batch operation that processes discovered URLs. Each URL gets an IngestionResult record for granular tracking. The orchestration layer provides:

| Feature | Implementation | Purpose |
|---|---|---|
| Concurrent processing | 10 async workers × batches of 50 | Throughput without overwhelming external services |
| Live progress | Redis hash with SSE streaming | Real-time UI updates (current URL, chunk counts, entity names) |
| Job cancellation | Redis flag polled before each batch | Graceful stop without killing the server |
| Content deduplication | Title-based dedup check | ZOL's Drupal site has multiple URL paths to the same content |
| Fault tolerance | Rescue sessions for stuck results | If error recording fails, a rescue session forces "failed" status |
| Per-URL timeout | 5-minute per-URL ingestion timeout | Prevents a single slow URL from blocking the entire batch |
| Timeout protection | 120s document processing timeout | Raises error instead of silently accepting 0-chunk results |
| Infinite loop guard | Max iterations on batch loop | Prevents infinite re-fetch if results stay "pending" |
| Redis non-critical | try/except on all Redis helpers | Live details are UI candy; never crash processing |
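Three of these features (bounded concurrency, cancellation polled before each batch, and the per-URL timeout) can be sketched together. This is an illustration under assumptions: `ingest_one` and `is_cancelled` are hypothetical callables standing in for the real ingestion step and the Redis flag, and results are kept in a dictionary rather than IngestionResult rows.

```python
import asyncio


async def run_job(urls, ingest_one, is_cancelled, *, workers: int = 10,
                  batch_size: int = 50, url_timeout_s: float = 300.0) -> dict[str, str]:
    """Process URLs batch by batch with at most `workers` ingestions in flight."""
    results: dict[str, str] = {}
    semaphore = asyncio.Semaphore(workers)

    async def guarded(url: str) -> None:
        async with semaphore:
            try:
                await asyncio.wait_for(ingest_one(url), timeout=url_timeout_s)
                results[url] = "indexed"
            except asyncio.TimeoutError:
                results[url] = "failed: timeout"
            except Exception as exc:  # one bad URL must not abort the batch
                results[url] = f"failed: {exc}"

    # Iterating over a fixed range bounds the loop, so it cannot spin forever.
    for start in range(0, len(urls), batch_size):
        if is_cancelled():  # polled once before each batch
            break
        batch = urls[start:start + batch_size]
        await asyncio.gather(*(guarded(url) for url in batch))
    return results
```

Cancellation takes effect at the next batch boundary, so URLs already in flight finish or time out normally.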

URL Type Classification

The system classifies URLs before processing to select the appropriate extraction strategy:

| URL Type | Extension/Pattern | Extraction Method |
|---|---|---|
| HTML | .html, no extension | crawl4ai (JS rendering) + httpx fallback |
| PDF | .pdf | PyMuPDF (fitz) text extraction |
| DOCX | .docx | python-docx (paragraphs + tables) |
| Ignored | .jpg, .png, .gif, .css, .js, etc. | Skipped (not content) |
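A minimal sketch of extension-based classification. The function name is hypothetical, the ignored set lists only the extensions named in the table, and treating any other unknown extension as ignored is an assumption rather than documented behaviour.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Only the extensions named in the table; the real list is longer ("etc.").
IGNORED_SUFFIXES = {".jpg", ".png", ".gif", ".css", ".js"}


def classify_url(url: str) -> str:
    """Classify a URL by the file extension of its path, ignoring the query string."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in ("", ".html"):
        return "html"
    # Covers IGNORED_SUFFIXES and, as an assumption, any other unknown extension.
    return "ignored"
```

Parsing the URL first keeps query strings such as `?download=1` from hiding the real extension.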

Scheduled Ingestion

The ingestion pipeline runs nightly under operator-controlled gating. The scheduler is implemented in backend/app/services/scheduler_service.py and gated by the INGEST_MODE operational setting (backend/app/config.py: ingest_mode: Literal["off", "manual", "auto"]).

| INGEST_MODE | Behaviour |
|---|---|
| off | Skip both crawl and ingest. Used during incident response or when a long manual ingestion is in flight. |
| manual | Crawl runs (sitemap discovery + URL classification), but the ingestion phase halts. Admins drive ingestion via the Pipeline Wizard UI. |
| auto | Full pipeline (fetch + chunk + embed + delete-handling) executes nightly. This is the production setting on pilot since 2026-04-22. |
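The gating reduces to a small mapping from mode to phases. A minimal sketch: the `Literal` type mirrors the one quoted from `config.py`, while the function name and the phase labels `"crawl"` and `"ingest"` are assumptions for illustration.

```python
from typing import Literal

IngestMode = Literal["off", "manual", "auto"]


def phases_for(mode: IngestMode) -> list[str]:
    """Return the nightly phases that run under the given INGEST_MODE."""
    if mode == "off":
        return []
    if mode == "manual":
        return ["crawl"]  # discovery only; ingestion is driven from the UI
    if mode == "auto":
        return ["crawl", "ingest"]
    raise ValueError(f"unknown INGEST_MODE: {mode!r}")
```

Rejecting unknown values keeps a typo in the environment file from silently behaving like one of the valid modes.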

Audit Trail — IngestRun

Every nightly invocation, regardless of mode, creates one app.ingest_runs row (backend/app/models/database.py:875, schema added in alembic migration 061). The row records:

| Column | Purpose |
|---|---|
| id, tenant_id | Run identity, scoped per hospital |
| started_at, completed_at | Wall-clock duration |
| mode | The active INGEST_MODE for this run |
| urls_discovered, urls_indexed, urls_failed, urls_skipped | Final counts from the orchestration layer |
| failure_class | Coarse categorisation (e.g., DEAD_EMPTY_CONTENT, SITEMAP_TIMEOUT) for alerting |
| notes | Free-text context (most-frequent failure URLs, retry hints) |

Disarming Auto-Ingest Without a Deploy

To disarm auto-ingest without a deploy: set INGEST_MODE=manual in backend/.env and restart uvicorn. The next 03:00 UTC cycle respects the new mode. The full design lives in the spec at docs/superpowers/specs/2026-04-17-nightly-ingest-design.md.

Incremental Updates

The pipeline supports incremental ingestion, a well-established best practice in web crawling literature. Olston and Najork (2010) identify three fundamental strategies for maintaining crawl freshness: periodic re-crawling, change-driven selective updates, and frequency-based scheduling. Cho and Garcia-Molina (2000) demonstrated that incremental crawlers that selectively update their index rather than performing batch refreshes achieve significantly higher "freshness" — defined as the fraction of the collection that is up-to-date at any given time.

The ZOL system implements a content-hash-based change detection strategy, which Cho and Garcia-Molina (2003) showed can improve crawl efficiency by up to 35% by focusing resources on content that has actually changed:

  1. New documents: Crawl sitemap, identify new URLs, ingest only new content
  2. Updated documents: Compare SHA-256 content hashes (content_hash in the document_chunks table), re-ingest only changed documents — this implements the hash-based change detection approach recommended by Olston and Najork (2010, §4.2)
  3. Deleted documents: Mark chunks as inactive (soft delete for audit trail), preserving provenance
  4. Re-ingestion: "Clear Databases" resets URL status to discovered without re-crawling, enabling re-processing with updated chunking or embedding strategies

This approach avoids full re-ingestion, which would be prohibitively slow for the ~1,000 brochures and hundreds of web pages in the ZOL corpus.
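The first three cases can be sketched as a diff between stored hashes and freshly crawled content. The function names and the in-memory dictionaries are assumptions for illustration; in the pipeline the stored side would come from `content_hash` in `document_chunks`.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """SHA-256 hex digest of the extracted Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def plan_update(stored: dict[str, str], crawled: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored url->hash against crawled url->markdown and bucket each URL."""
    plan: dict[str, list[str]] = {"new": [], "updated": [], "unchanged": [], "deleted": []}
    for url, markdown in crawled.items():
        if url not in stored:
            plan["new"].append(url)
        elif stored[url] != content_hash(markdown):
            plan["updated"].append(url)
        else:
            plan["unchanged"].append(url)
    # URLs that were stored before but no longer appear are soft-delete candidates.
    plan["deleted"] = [url for url in stored if url not in crawled]
    return plan
```

Only the `new` and `updated` buckets need chunking and embedding, which is where the savings over a full re-ingestion come from.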

Academic Foundation

The incremental ingestion architecture draws on foundational web crawling research: