Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0014: LLM Entity Validation and Contextual Retrieval

Date: 2026-02-08 | Status: Accepted

Context

The knowledge graph extraction pipeline uses the regex-based MedicalEntityExtractor to extract doctors, departments, conditions, treatments, and relationships from ZOL hospital web pages. While fast and cost-free, regex extraction produces systematic semantic errors that blocklist expansion cannot solve:

| Problem | Example | Why Regex Can't Fix It |
| --- | --- | --- |
| Fake doctor names | "Borstkas" (chest), "Hoofdverpleegkundige" (head nurse) parsed as person names | These are valid Dutch words that match the capitalized-noun-after-prefix pattern |
| Boilerplate hub nodes | "Behandeling" (treatment), "Onderzoek" (examination) connected to dozens of pages | Blocking these removes legitimate specific uses too |
| Wrong entity types | A department classified as a treatment | Requires understanding the meaning, not just the pattern |
| Implausible relationships | A department "treats" a campus | Co-occurrence inference has no semantic validation |

Adding more regex blocklists is an iterative patching approach — each fix creates new edge cases. An LLM understands Dutch language context and can make the nuanced validation decisions that regex fundamentally cannot.

The Contextual Retrieval Opportunity

Since an LLM call per page is now required for validation, we can generate a page summary at near-zero marginal cost. This implements a technique that Anthropic describes in their research on Contextual Retrieval:

The core insight is that individual chunks lose context about their parent document. A chunk that says "De raadpleging duurt gemiddeld 30 minuten" (the consultation lasts 30 minutes on average) gives the embedding model no signal about which department it refers to. Prepending a brief document summary gives both the embedding model and the LLM that missing context.

Anthropic's research demonstrates that contextual retrieval can reduce retrieval failure rates by 49% when combined with hybrid search (BM25 + embeddings). Since the ZOL system already employs hybrid search (ADR-0012), adding page summaries is the natural next step to maximize retrieval quality.

Decision

1. Post-Extraction LLM Validation Gate

A new LLMEntityValidator sits between regex extraction and Neo4j storage.

A single LLM call per page performs three tasks simultaneously:

  1. Entity validation: keep, reject, or rename each extracted entity
  2. Relationship validation: keep or reject each inferred relationship
  3. Page summary generation: 2-3 sentences in Dutch describing the page content
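The three tasks can share one structured response. The sketch below is illustrative only: the JSON shape (`entities`, `page_summary`) and the names `Entity` and `apply_validation` are assumptions, since this ADR does not specify the schema that LLMEntityValidator actually uses. Relationship decisions are left out for brevity and would follow the same keep/reject pattern.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str


def apply_validation(entities, raw_response):
    """Apply the LLM's per-entity decisions and return (entities, summary).

    Assumed response shape (hypothetical, not the project's real schema):
      {"entities": {"<name>": {"action": "keep" | "reject" | "rename",
                               "new_name": "..."}},
       "page_summary": "..."}
    """
    response = json.loads(raw_response)
    decisions = response.get("entities", {})
    validated = []
    for entity in entities:
        # Entities the LLM did not mention are kept, so a partial answer
        # never silently drops regex output.
        decision = decisions.get(entity.name, {"action": "keep"})
        action = decision.get("action", "keep")
        if action == "reject":
            continue  # drop semantic garbage such as "Borstkas"
        if action == "rename" and decision.get("new_name"):
            entity = Entity(decision["new_name"], entity.entity_type)
        validated.append(entity)
    return validated, response.get("page_summary", "")


extracted = [
    Entity("Borstkas", "doctor"),
    Entity("Dienst Cardiologie", "department"),
]
raw = json.dumps({
    "entities": {
        "Borstkas": {"action": "reject"},
        "Dienst Cardiologie": {"action": "rename", "new_name": "Cardiologie"},
    },
    "page_summary": "Deze pagina beschrijft de afdeling Cardiologie van ZOL.",
})
validated, summary = apply_validation(extracted, raw)
print(validated, summary)
```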

2. Cross-Page Entity Cache

Once an entity is validated (e.g., "Hoofdverpleegkundige" rejected as not a real doctor name), the decision is cached in-memory. If the same entity appears on subsequent pages, the cached decision is applied without an LLM call. This reduces total LLM calls by an estimated 10-25%.
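A minimal sketch of such a cache, assuming decisions are keyed by entity type plus a case-insensitive name. The class name `EntityDecisionCache`, the key scheme, and the `split` helper are hypothetical; the ADR only states that decisions are cached in memory and reused across pages.

```python
class EntityDecisionCache:
    """In-memory cache of validation decisions, shared across pages."""

    def __init__(self):
        self._decisions = {}

    @staticmethod
    def _key(name, entity_type):
        # Assumption: a decision applies to a (type, name) pair, compared
        # case-insensitively. The real cache key is not specified in the ADR.
        return (entity_type, name.strip().casefold())

    def put(self, name, entity_type, decision):
        self._decisions[self._key(name, entity_type)] = decision

    def split(self, entities):
        """Separate (name, type) pairs into cached decisions and pending ones."""
        cached, pending = {}, []
        for name, entity_type in entities:
            decision = self._decisions.get(self._key(name, entity_type))
            if decision is None:
                pending.append((name, entity_type))
            else:
                cached[(name, entity_type)] = decision
        return cached, pending


cache = EntityDecisionCache()
cache.put("Hoofdverpleegkundige", "doctor", "reject")
cached, pending = cache.split([
    ("Hoofdverpleegkundige", "doctor"),
    ("Cardiologie", "department"),
])
print(cached, pending)
```

Only the `pending` entities would still need an LLM decision; the cached ones can be applied directly.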

3. Page Summary Storage (Contextual Retrieval)

Page summaries are stored in the existing chunk_metadata JSONB column — no database migration required:

{
"section_header": "Cardiologie",
"page_summary": "Deze pagina beschrijft de afdeling Cardiologie van ZOL, inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan en André Dumont."
}
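Because a JSONB column accepts arbitrary keys, adding the summary amounts to merging one more key into each chunk's metadata before it is written. A minimal sketch, assuming the metadata is handled as a plain dict in Python; `with_page_summary` is a hypothetical helper, and how the project actually writes the column is not shown in this ADR.

```python
import json


def with_page_summary(chunk_metadata, page_summary):
    """Return a copy of the chunk metadata with the page summary added."""
    updated = dict(chunk_metadata)  # shallow copy; the input is left untouched
    updated["page_summary"] = page_summary
    return updated


metadata = {"section_header": "Cardiologie"}
updated = with_page_summary(
    metadata, "Deze pagina beschrijft de afdeling Cardiologie van ZOL."
)
# ensure_ascii=False keeps Dutch characters such as "é" readable in the output
print(json.dumps(updated, ensure_ascii=False, indent=2))
```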

4. RAG Context Enhancement

During query processing, the page summary is prepended to the first chunk from each document in the context window:

[1] Uit cardiologie.pdf (pagina 1):
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL,
inclusief de artsen, behandelingen en consultatiemogelijkheden op campus
Sint-Jan en André Dumont.]

De afdeling Cardiologie biedt gespecialiseerde zorg voor patiënten met
hart- en vaatziekten...

Summaries are deduplicated per document — when multiple chunks from the same document are retrieved, only the first chunk gets the summary prefix. This prevents wasting context window tokens on repeated summaries.
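The deduplication rule can be sketched as follows. This is an illustration under assumptions, not the code in query_service.py: the chunk dict keys (`source`, `page`, `text`, `metadata`) and the function name `build_context` are hypothetical, and the output format simply mirrors the example above.

```python
def build_context(chunks):
    """Format retrieved chunks, adding each document's summary only once."""
    seen_sources = set()
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        lines = [f"[{index}] Uit {chunk['source']} (pagina {chunk['page']}):"]
        summary = chunk.get("metadata", {}).get("page_summary")
        if summary and chunk["source"] not in seen_sources:
            # First chunk with a summary for this document: prepend it
            lines.append(f"[Pagina context: {summary}]")
            lines.append("")
            seen_sources.add(chunk["source"])
        lines.append(chunk["text"])
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


summary = "Deze pagina beschrijft de afdeling Cardiologie van ZOL."
chunks = [
    {"source": "cardiologie.pdf", "page": 1,
     "text": "De afdeling Cardiologie biedt gespecialiseerde zorg...",
     "metadata": {"page_summary": summary}},
    {"source": "cardiologie.pdf", "page": 2,
     "text": "De raadpleging duurt gemiddeld 30 minuten.",
     "metadata": {"page_summary": summary}},
]
context = build_context(chunks)
print(context)
```

In this sketch a document is marked as seen only once its summary has been emitted, so a first chunk that happens to lack a summary does not suppress the prefix for later chunks of the same document.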

Why This Improves RAG Quality

The page summary serves two distinct purposes:

  1. Better embedding similarity (at ingestion time): When chunks are re-embedded in the future, the summary provides document-level context that disambiguates generic chunks. A chunk about "consultatie-uren" (consultation hours) without context could match any department; with a summary mentioning "Cardiologie", the embedding captures the correct domain.

  2. Better LLM comprehension (at query time): The response generation model receives not just a chunk fragment, but also a summary of what the source document is about. This helps the model synthesize more accurate and contextually appropriate answers.

Anthropic Research

This technique is based on Anthropic's research on Contextual Retrieval, which demonstrates that prepending document context to chunks reduces retrieval failure rates by 49% when combined with BM25 hybrid search — a combination the ZOL system already employs.

Configuration

| Setting | Default | Description |
| --- | --- | --- |
| GRAPH_LLM_VALIDATION_ENABLED | true | Enable/disable the validation gate |
| GRAPH_VALIDATION_MODEL | Tier 2 model via OpenAI | LLM model for validation via OpenAI |
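As an illustration of how these two settings might be read, the sketch below parses them from environment variables using only the standard library. The function `load_validation_settings` and the set of accepted truthy strings are assumptions; the real fields live in backend/app/config.py and may be defined differently. No model name is filled in, because the ADR only identifies the default as a Tier 2 model.

```python
import os

# Assumption: these strings count as "enabled"; anything else disables the gate.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def load_validation_settings(env=os.environ):
    """Read the validation gate settings from a mapping of environment variables."""
    enabled_raw = env.get("GRAPH_LLM_VALIDATION_ENABLED", "true")
    return {
        "enabled": enabled_raw.strip().lower() in _TRUE_VALUES,
        # None means "use the project's configured Tier 2 default".
        "model": env.get("GRAPH_VALIDATION_MODEL"),
    }


print(load_validation_settings({"GRAPH_LLM_VALIDATION_ENABLED": "false"}))
```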

Taxonomy Context in Validation Prompt

The validation system prompt is enriched with context from the taxonomy module (ADR-0015: zol_taxonomy.py). This includes the list of valid campuses, canonical department names, known dual-entities, and doctor name blocklists. By grounding the LLM in authoritative taxonomy data, validation accuracy improves — the LLM does not need to rely solely on its training data to determine whether an entity is valid in the ZOL context.
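One way to picture this grounding is a prompt builder that lists each taxonomy category explicitly. The sketch below is hypothetical throughout: the function name, the prompt wording, and the argument names are assumptions, and the real prompt and the zol_taxonomy.py interface are not shown in this ADR.

```python
def build_validation_system_prompt(campuses, departments, dual_entities,
                                   doctor_blocklist):
    """Assemble a system prompt that embeds authoritative taxonomy lists."""

    def section(title, items):
        # Sort for a stable prompt; show "(none)" rather than an empty list
        lines = [f"- {item}" for item in sorted(items)] or ["- (none)"]
        return f"{title}:\n" + "\n".join(lines)

    return "\n\n".join([
        "You validate entities extracted from ZOL hospital web pages.",
        section("Valid campuses", campuses),
        section("Canonical department names", departments),
        section("Known dual-entities", dual_entities),
        section("Words that are never doctor names", doctor_blocklist),
    ])


prompt = build_validation_system_prompt(
    campuses=["Sint-Jan", "André Dumont"],
    departments=["Cardiologie"],
    dual_entities=[],
    doctor_blocklist=["Borstkas", "Hoofdverpleegkundige"],
)
print(prompt)
```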

Fault Tolerance

If the LLM call fails (timeout, rate limit, JSON parse error), the original regex extraction result passes through unchanged. The system never blocks on validation failures — this is a quality enhancement, not a critical path dependency.
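The pass-through behaviour can be sketched as a wrapper that catches any validation failure and returns the regex output untouched. The names `validate_or_passthrough` and `call_llm` are hypothetical, and the sketch assumes the LLM returns a JSON object with the same shape as the extraction result.

```python
import json
import logging

logger = logging.getLogger(__name__)


def validate_or_passthrough(extraction, call_llm):
    """Return (result, was_validated); never raise on a validation failure."""
    try:
        validated = json.loads(call_llm(extraction))
        if not isinstance(validated, dict):
            raise ValueError("validation response is not a JSON object")
        return validated, True
    except Exception as exc:  # timeout, rate limit, JSON parse error, ...
        # Validation is a quality enhancement, so fall back to regex output.
        logger.warning("LLM validation failed, using regex output: %s", exc)
        return extraction, False


def timing_out_llm(_extraction):
    raise TimeoutError("simulated timeout")


regex_result = {"entities": ["Cardiologie", "Borstkas"]}
result, was_validated = validate_or_passthrough(regex_result, timing_out_llm)
print(result, was_validated)
```

Catching `Exception` broadly is a deliberate choice in this sketch: the stated goal is that no validation failure ever blocks extraction.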

Consequences

Positive

  • Eliminates semantic garbage: LLM understands that "Borstkas" is a body part, not a doctor name
  • Catches wrong entity types: Departments misclassified as treatments are corrected
  • Normalizes name variants: "Cardiologie" and "Dienst Cardiologie" are recognized as the same entity
  • Improves RAG relevance: Page summaries provide document-level context to chunks (see Anthropic's Contextual Retrieval research)
  • Low cost: ~$0.50-1 per full corpus extraction (~2000 pages with the Tier 2 model)
  • No migration needed: Uses existing JSONB column for summary storage

Negative

  • Extraction time increase: ~30 minutes added to full corpus extraction (2000 sequential LLM calls)
  • External dependency: Extraction now requires OpenAI API access (previously offline-capable)
  • Cost per run: ~$0.50-1 per full extraction (was $0 with regex-only)

Neutral

  • Regex extraction still runs first; the LLM only keeps, rejects, or renames what regex extracted and never adds new entities
  • Existing regex patterns and blocklists remain in place (LLM is additive quality gate)
  • No changes to query pipeline latency (summaries are pre-computed at ingestion time)
  • TypedNodeStorage and Neo4j schema are unchanged

Alternatives Considered

Alternative 1: Expanded Regex Blocklists

More regex patterns to reject known bad entities. Rejected: fundamentally limited by regex's inability to understand language semantics. Each blocklist entry creates new edge cases.

Alternative 2: GLiNER / NER Model

Replace regex with a trained Named Entity Recognition model. Deferred: better entity boundary detection but still cannot validate entity plausibility. Also lacks page summary capability. May be combined with LLM validation in the future.

Alternative 3: LLM-Only Extraction (Replace Regex)

Skip regex entirely and use the LLM for both extraction and validation. Rejected: 10-50x more expensive, slower, non-deterministic, and harder to debug. The regex+LLM hybrid is more cost-effective and debuggable.

Implementation

| File | Purpose |
| --- | --- |
| backend/app/services/graph/llm_entity_validation.py | LLMEntityValidator class |
| backend/app/config.py | Configuration fields |
| backend/scripts/extract_and_populate_graph.py | Integration with extraction pipeline |
| backend/app/services/graph/query_service.py | Page summary injection in context building |

References