This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.
ADR-0014: LLM Entity Validation and Contextual Retrieval
Date: 2026-02-08 | Status: Accepted
Context
The knowledge graph extraction pipeline uses the regex-based MedicalEntityExtractor to extract doctors, departments, conditions, treatments, and relationships from ZOL hospital web pages. While fast and cost-free, regex extraction produces systematic semantic errors that expanding the blocklists cannot solve:
| Problem | Example | Why Regex Can't Fix It |
|---|---|---|
| Fake doctor names | "Borstkas" (chest), "Hoofdverpleegkundige" (head nurse) parsed as person names | These are valid Dutch words that match the capitalized-noun-after-prefix pattern |
| Boilerplate hub nodes | "Behandeling" (treatment), "Onderzoek" (examination) connected to dozens of pages | Blocking these removes legitimate specific uses too |
| Wrong entity types | A department classified as a treatment | Requires understanding the meaning, not just the pattern |
| Implausible relationships | A department "treats" a campus | Co-occurrence inference has no semantic validation |
Adding more regex blocklists is an iterative patching approach — each fix creates new edge cases. An LLM understands Dutch language context and can make the nuanced validation decisions that regex fundamentally cannot.
The Contextual Retrieval Opportunity
Since an LLM call per page is now required for validation, we can generate a page summary at near-zero marginal cost. This implements a technique that Anthropic describes in their research on Contextual Retrieval:
The core insight is that individual chunks lose context about their parent document. A chunk that says "De raadpleging duurt gemiddeld 30 minuten" (The consultation lasts 30 minutes on average) gives the embedding model no signal about which department this refers to. By prepending a brief document summary, both the embedding model and the LLM gain critical context.
Anthropic's research demonstrates that contextual retrieval can reduce retrieval failure rates by 49% when combined with hybrid search (BM25 + embeddings). Since the ZOL system already employs hybrid search (ADR-0012), adding page summaries is the natural next step to maximize retrieval quality.
Decision
1. Post-Extraction LLM Validation Gate
A new LLMEntityValidator sits between regex extraction and Neo4j storage.
A single LLM call per page performs three tasks simultaneously:
- Entity validation: keep, reject, or rename each extracted entity
- Relationship validation: keep or reject each inferred relationship
- Page summary generation: 2-3 sentences in Dutch describing the page content
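A minimal sketch of how the three results of one call might be applied to the regex output. The response shape (`entities`, `relationships`, `page_summary`, with `action` values `keep`, `reject`, `rename`), the relationship type names, and the function name `apply_validation` are assumptions for illustration; this ADR does not specify the actual schema.

```python
import json
from typing import Any


def apply_validation(
    entities: list[dict[str, str]],
    relationships: list[dict[str, str]],
    llm_response: str,
) -> dict[str, Any]:
    """Apply keep/reject/rename decisions from one LLM response to regex output."""
    verdict = json.loads(llm_response)
    entity_decisions = {d["name"]: d for d in verdict.get("entities", [])}
    rejected_rels = {
        (d["source"], d["type"], d["target"])
        for d in verdict.get("relationships", [])
        if d["action"] == "reject"
    }

    kept_entities: list[dict[str, str]] = []
    renames: dict[str, str] = {}
    dropped: set[str] = set()
    for entity in entities:
        # Entities the LLM did not mention are kept unchanged.
        decision = entity_decisions.get(entity["name"], {"action": "keep"})
        if decision["action"] == "reject":
            dropped.add(entity["name"])
            continue
        if decision["action"] == "rename":
            renames[entity["name"]] = decision["new_name"]
            entity = {**entity, "name": decision["new_name"]}
        kept_entities.append(entity)

    kept_rels: list[dict[str, str]] = []
    for rel in relationships:
        key = (rel["source"], rel["type"], rel["target"])
        # Drop rejected relationships and any that touch a rejected entity.
        if key in rejected_rels or rel["source"] in dropped or rel["target"] in dropped:
            continue
        kept_rels.append({
            **rel,
            "source": renames.get(rel["source"], rel["source"]),
            "target": renames.get(rel["target"], rel["target"]),
        })

    return {
        "entities": kept_entities,
        "relationships": kept_rels,
        "page_summary": verdict.get("page_summary", ""),
    }
```

Because the function only filters and renames what regex produced, it cannot introduce entities the extractor never saw.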
2. Cross-Page Entity Cache
Once an entity is validated (e.g., "Hoofdverpleegkundige" rejected as not a real doctor name), the decision is cached in-memory. If the same entity appears on subsequent pages, the cached decision is applied without an LLM call. This reduces total LLM calls by an estimated 10-25%.
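The cache could look roughly like the sketch below. The class name `EntityDecisionCache`, the `(type, lowercased name)` key, and the `partition` helper are assumptions, not the actual implementation. The sketch also leaves open whether a page whose entities are all cached still needs a call for its summary.

```python
from typing import Optional


class EntityDecisionCache:
    """In-memory cache of validation decisions, keyed by (entity type, normalized name)."""

    def __init__(self) -> None:
        self._decisions: dict[tuple[str, str], str] = {}

    @staticmethod
    def _key(entity_type: str, name: str) -> tuple[str, str]:
        # Normalize so that casing and stray whitespace do not cause cache misses.
        return (entity_type, name.strip().lower())

    def get(self, entity_type: str, name: str) -> Optional[str]:
        return self._decisions.get(self._key(entity_type, name))

    def put(self, entity_type: str, name: str, decision: str) -> None:
        self._decisions[self._key(entity_type, name)] = decision

    def partition(self, entities: list[dict]) -> tuple[list[tuple[dict, str]], list[dict]]:
        """Split entities into (already decided, still needing an LLM decision)."""
        decided: list[tuple[dict, str]] = []
        pending: list[dict] = []
        for entity in entities:
            decision = self.get(entity["type"], entity["name"])
            if decision is None:
                pending.append(entity)
            else:
                decided.append((entity, decision))
        return decided, pending
```

Only the `pending` entities would need to be sent to the LLM for the next page.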
3. Page Summary Storage (Contextual Retrieval)
Page summaries are stored in the existing chunk_metadata JSONB column — no database migration required:
{
"section_header": "Cardiologie",
"page_summary": "Deze pagina beschrijft de afdeling Cardiologie van ZOL, inclusief de artsen, behandelingen en consultatiemogelijkheden op campus Sint-Jan en André Dumont."
}
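As an illustration of why no migration is needed, the summary can be merged into the metadata dictionary before the chunk is written. The helper name `with_page_summary` is hypothetical, and the actual write to the database is omitted.

```python
import json
from typing import Optional


def with_page_summary(chunk_metadata: Optional[dict], page_summary: str) -> dict:
    """Return a copy of the chunk metadata with the page summary added."""
    merged = dict(chunk_metadata or {})
    summary = page_summary.strip()
    if summary:
        # Skip empty summaries so existing metadata is never overwritten with blanks.
        merged["page_summary"] = summary
    return merged


metadata = with_page_summary(
    {"section_header": "Cardiologie"},
    " Deze pagina beschrijft de afdeling Cardiologie van ZOL. ",
)
# Serialized form, ready to be stored in the chunk_metadata JSONB column.
print(json.dumps(metadata, ensure_ascii=False, indent=2))
```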
4. RAG Context Enhancement
During query processing, the page summary is prepended to the first chunk from each document in the context window:
[1] Uit cardiologie.pdf (pagina 1):
[Pagina context: Deze pagina beschrijft de afdeling Cardiologie van ZOL,
inclusief de artsen, behandelingen en consultatiemogelijkheden op campus
Sint-Jan en André Dumont.]
De afdeling Cardiologie biedt gespecialiseerde zorg voor patiënten met
hart- en vaatziekten...
Summaries are deduplicated per document — when multiple chunks from the same document are retrieved, only the first chunk gets the summary prefix. This prevents wasting context window tokens on repeated summaries.
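The prefixing and per-document deduplication can be sketched as follows. The chunk dictionary keys (`document`, `page`, `text`, `metadata`) and the function name `build_context` are assumptions; the real logic lives in query_service.py and may differ.

```python
def build_context(chunks: list[dict]) -> str:
    """Format retrieved chunks, adding the page summary once per document."""
    seen_documents: set[str] = set()
    blocks: list[str] = []
    for index, chunk in enumerate(chunks, start=1):
        lines = [f"[{index}] Uit {chunk['document']} (pagina {chunk['page']}):"]
        summary = chunk.get("metadata", {}).get("page_summary")
        # Only the first retrieved chunk of each document gets the prefix.
        if summary and chunk["document"] not in seen_documents:
            lines.append(f"[Pagina context: {summary}]")
        seen_documents.add(chunk["document"])
        lines.append(chunk["text"])
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

Tracking seen documents in a set keeps the check constant-time per chunk, so the deduplication adds no meaningful cost to context building.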
Why This Improves RAG Quality
The page summary serves two distinct purposes:
- Better embedding similarity (at ingestion time): When chunks are re-embedded in the future, the summary provides document-level context that disambiguates generic chunks. A chunk about "consultatie-uren" (consultation hours) without context could match any department; with a summary mentioning "Cardiologie", the embedding captures the correct domain.
- Better LLM comprehension (at query time): The response generation model receives not just a chunk fragment, but also a summary of what the source document is about. This helps the model synthesize more accurate and contextually appropriate answers.
This technique is based on Anthropic's research on Contextual Retrieval, which demonstrates that prepending document context to chunks reduces retrieval failure rates by 49% when combined with BM25 hybrid search — a combination the ZOL system already employs.
Configuration
| Setting | Default | Description |
|---|---|---|
| GRAPH_LLM_VALIDATION_ENABLED | true | Enable/disable the validation gate |
| GRAPH_VALIDATION_MODEL | Tier 2 model via OpenAI | LLM model used for validation |
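A sketch of how these two settings might be read from the environment. The dataclass, the `load_settings` helper, and the `<tier-2-model>` placeholder are assumptions; this ADR does not name the concrete default model or describe how config.py loads its fields.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class GraphValidationSettings:
    llm_validation_enabled: bool
    validation_model: str


def load_settings(
    env: Optional[Mapping[str, str]] = None,
    default_model: str = "<tier-2-model>",  # placeholder, not a real model name
) -> GraphValidationSettings:
    """Read the validation settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_enabled = env.get("GRAPH_LLM_VALIDATION_ENABLED", "true")
    return GraphValidationSettings(
        llm_validation_enabled=raw_enabled.strip().lower() in _TRUE_VALUES,
        validation_model=env.get("GRAPH_VALIDATION_MODEL", default_model),
    )
```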
Taxonomy Context in Validation Prompt
The validation system prompt is enriched with context from the taxonomy module (ADR-0015: zol_taxonomy.py). This includes the list of valid campuses, canonical department names, known dual-entities, and doctor name blocklists. By grounding the LLM in authoritative taxonomy data, validation accuracy improves — the LLM does not need to rely solely on its training data to determine whether an entity is valid in the ZOL context.
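One way the taxonomy data could be rendered into the system prompt is sketched below. The function name, the section titles, and the prompt wording are all assumptions; the actual prompt and the interface of zol_taxonomy.py are not described in this ADR.

```python
def build_validation_system_prompt(
    campuses: list[str],
    departments: list[str],
    dual_entities: list[str],
    doctor_blocklist: list[str],
) -> str:
    """Render taxonomy data into the system prompt for the validation call."""

    def bullet_list(items: list[str]) -> str:
        # Deduplicate and sort so the prompt is stable between runs.
        return "\n".join(f"- {item}" for item in sorted(set(items))) or "- (none)"

    sections = [
        ("Valid campuses", campuses),
        ("Canonical department names", departments),
        ("Known dual-entities", dual_entities),
        ("Not doctor names", doctor_blocklist),
    ]
    parts = ["You validate entities extracted from ZOL hospital web pages."]
    for title, items in sections:
        parts.append(f"{title}:\n{bullet_list(items)}")
    return "\n\n".join(parts)
```

A stable, sorted prompt also makes validation runs easier to compare when debugging.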
Fault Tolerance
If the LLM call fails (timeout, rate limit, JSON parse error), the original regex extraction result passes through unchanged. The system never blocks on validation failures — this is a quality enhancement, not a critical path dependency.
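The pass-through behavior amounts to a guarded call, sketched here. The name `validate_with_fallback` and the `validate` callable are hypothetical stand-ins for the real validator, which would raise on timeouts, rate limits, or unparseable JSON.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def validate_with_fallback(extraction: dict, validate: Callable[[dict], dict]) -> dict:
    """Return the validated extraction, or the regex output unchanged on any failure."""
    try:
        return validate(extraction)
    except Exception as exc:  # deliberately broad: validation must never block extraction
        logger.warning("LLM validation failed, passing regex output through: %s", exc)
        return extraction


def broken_validator(extraction: dict) -> dict:
    # Simulates an LLM reply that is not valid JSON.
    return json.loads("not json")


regex_output = {"entities": [{"name": "Cardiologie", "type": "department"}]}
result = validate_with_fallback(regex_output, broken_validator)
```

Catching the broad `Exception` is a deliberate choice in this sketch, since the ADR treats validation as a quality enhancement and not a critical path dependency.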
Consequences
Positive
- Eliminates semantic garbage: LLM understands that "Borstkas" is a body part, not a doctor name
- Catches wrong entity types: Departments misclassified as treatments are corrected
- Normalizes name variants: "Cardiologie" and "Dienst Cardiologie" are recognized as the same entity
- Improves RAG relevance: Page summaries provide document-level context to chunks (see Anthropic, 2024, under References)
- Low cost: ~$0.50-1 per full corpus extraction (~2000 pages with the Tier 2 model)
- No migration needed: Uses existing JSONB column for summary storage
Negative
- Extraction time increase: ~30 minutes added to full corpus extraction (2000 sequential LLM calls)
- External dependency: Extraction now requires OpenAI API access (previously offline-capable)
- Cost per run: ~$0.50-1 per full extraction (was $0 with regex-only)
Neutral
- Regex extraction still runs first: the validator only keeps, rejects, or renames entities that regex produced and never adds new ones
- Existing regex patterns and blocklists remain in place (LLM is additive quality gate)
- No changes to query pipeline latency (summaries are pre-computed at ingestion time)
- TypedNodeStorage and Neo4j schema are unchanged
Alternatives Considered
Alternative 1: Expanded Regex Blocklists
More regex patterns to reject known bad entities. Rejected: fundamentally limited by regex's inability to understand language semantics. Each blocklist entry creates new edge cases.
Alternative 2: GLiNER / NER Model
Replace regex with a trained Named Entity Recognition model. Deferred: better entity boundary detection but still cannot validate entity plausibility. Also lacks page summary capability. May be combined with LLM validation in the future.
Alternative 3: LLM-Only Extraction (Replace Regex)
Skip regex entirely and use the LLM for both extraction and validation. Rejected: 10-50x more expensive, slower, non-deterministic, and harder to debug. The regex+LLM hybrid is more cost-effective and debuggable.
Implementation
| File | Purpose |
|---|---|
| backend/app/services/graph/llm_entity_validation.py | LLMEntityValidator class |
| backend/app/config.py | Configuration fields |
| backend/scripts/extract_and_populate_graph.py | Integration with extraction pipeline |
| backend/app/services/graph/query_service.py | Page summary injection in context building |
References
- Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
- Hogan, A., et al. (2021). Knowledge graphs. ACM Computing Surveys, 54(4), 1-37. https://doi.org/10.1145/3447772