What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is the family of architectures formalised by Lewis et al. 2020 in which a language model is conditioned at inference time on documents retrieved from an external corpus, rather than relying on parametric memory alone. ZOL Intelligent Search is a production RAG system: every answer the user reads is generated against a context window built from chunks retrieved from the hospital's own published content, with each claim citation-traceable to a specific source.

A foundation model used in isolation has three structural limitations that disqualify it from a hospital information role:

| Limitation | Consequence in a hospital context |
| --- | --- |
| Training-data cutoff | The model cannot know about content published after its cutoff: new doctors, new department phone numbers, the hospital's current parking tariff. |
| Hallucination | Without a retrieval substrate the model fills gaps by generating plausible-but-fabricated text. A hallucinated doctor name or department directs a patient to the wrong place. |
| No source attribution | Even when the parametric output happens to be correct, the user cannot verify it. Every claim must be traceable to a source the operator can audit. |

In a setting where a wrong answer can send a patient to the wrong campus or imply a treatment that the hospital does not offer, grounding outputs in retrieved-and-cited evidence is a safety constraint, not a quality optimisation.

The retrieve–augment–generate paradigm

The approach introduced by Lewis et al. 2020 is commonly described as three phases (Retrieve, Augment, Generate), each of which our pipeline implements with substantial extensions.

Phase 1 — Retrieve

The user query is embedded with the same model used to embed the corpus, and the embedding is queried against an HNSW-indexed pgvector store of document chunks (the dense bi-encoder pattern of Karpukhin et al. 2020). In our pipeline this dense path is fused with sparse lexical search (BM25 over PostgreSQL tsvector) and a taxonomy lookup over typed entity tables; see Hybrid Search for the fusion algorithm.
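The fusion of the three ranked lists can be sketched with Reciprocal Rank Fusion, which this page names as the Stage 5 fusion method. The snippet is a minimal illustration, not the production code: the function name, the chunk ids, and the constant `k=60` (the value suggested by Cormack, Clarke & Büttcher 2009) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 follows the RRF paper; the value used
    in the real pipeline is not stated on this page (assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the three retrieval paths.
dense = ["c7", "c2", "c9"]
sparse = ["c2", "c4", "c7"]
taxonomy = ["c4", "c2"]

fused = reciprocal_rank_fusion([dense, sparse, taxonomy])
print(fused)  # ['c2', 'c4', 'c7', 'c9']
```

A chunk that appears in all three lists, like `c2` here, outranks a chunk that tops only one list, which is the behaviour that makes rank fusion robust to score scales that differ between the dense and sparse paths.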

Phase 2 — Augment

Retrieved chunks are not handed to the LLM verbatim. The Context Assembly stage expands each chunk with its ±1 neighbours, deduplicates the chunking overlap, groups by document, and enforces a token budget calibrated against the Lost-in-the-Middle finding that LLMs under-attend to mid-context tokens. The augmented prompt then carries the system prompt, the prior conversation turns, and the assembled context.
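The neighbour expansion, grouping, and budget steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the data shapes, and the whitespace-based token count are hypothetical, and trimming of overlapping text between adjacent chunks is left out.

```python
def assemble_context(hits, corpus, token_budget):
    """Expand hits with their +/-1 neighbours under a token budget.

    hits: list of (doc_id, chunk_index) in relevance order.
    corpus: dict mapping doc_id to its chunk texts in document order.
    Returns (context, used) where context is a list of
    (doc_id, sorted chunk indices), grouped by document.
    """
    selected = {}  # doc_id -> set of chunk indices; dict keeps hit order
    used = 0
    for doc_id, idx in hits:
        chunks = corpus[doc_id]
        # Try the hit itself first, then its previous and next neighbour.
        for i in (idx, idx - 1, idx + 1):
            if not 0 <= i < len(chunks):
                continue  # neighbour falls outside the document
            if i in selected.get(doc_id, set()):
                continue  # already included via an earlier hit
            cost = len(chunks[i].split())  # crude token estimate (assumption)
            if used + cost > token_budget:
                continue  # skip chunks that would exceed the budget
            selected.setdefault(doc_id, set()).add(i)
            used += cost
    context = [(doc_id, sorted(ix)) for doc_id, ix in selected.items()]
    return context, used


# Hypothetical two-document corpus with tiny chunks.
corpus = {
    "knee": ["a b", "c d e", "f", "g h"],
    "parking": ["p q", "r"],
}
hits = [("knee", 1), ("parking", 0), ("knee", 2)]
context, used = assemble_context(hits, corpus, token_budget=8)
print(context, used)  # [('knee', [0, 1, 2]), ('parking', [0])] 8
```

Processing hits in relevance order means the budget is spent on the strongest evidence first; lower-ranked neighbours are dropped once the budget is reached.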

Phase 3 — Generate

The LLM generates the answer conditioned on the augmented prompt. The system prompt enforces strict grounding: the model is instructed to cite with [N] markers, to refuse claims unsupported by the context, and never to invent a citation. See Prompt Engineering for the section-by-section breakdown of the rules that govern generation.
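One way to picture the [N] convention is a numbered context block plus a check that every marker in the answer points at a source that was actually supplied. The sketch below is illustrative only: the function names, the example URLs, and the idea of running this check in code are assumptions, not a description of the production safety pass.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def build_context_block(sources):
    """Number each (url, text) source from 1 so it can be cited as [N]."""
    return "\n".join(
        f"[{n}] {url}\n{text}"
        for n, (url, text) in enumerate(sources, start=1)
    )


def invalid_citations(answer, source_count):
    """Return cited numbers that do not match any supplied source."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= source_count)


# Hypothetical sources and model output.
sources = [
    ("https://example.org/knee-surgery", "Bring your identity card."),
    ("https://example.org/parking", "Car park P1 is open to visitors."),
]
answer = "Bring your identity card [1]. Fasting is required [3]."

print(build_context_block(sources))
print(invalid_citations(answer, len(sources)))  # [3]
```

Here marker [3] has no matching source, so a downstream check could reject or regenerate the answer instead of showing an invented citation.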

Trade-offs in our RAG shape

Three high-level shape decisions distinguish ZOL Intelligent Search from a textbook RAG implementation. Each was made deliberately and is captured in an ADR with its rejected alternatives.

| Decision | Chosen | Alternatives considered | Rejected because |
| --- | --- | --- | --- |
| Retrieval shape | Hybrid: dense pgvector + sparse BM25 + typed taxonomy | Vector-only (Karpukhin et al. 2020); BM25-only (Robertson & Zaragoza 2009); ColBERT-only (Khattab & Zaharia 2020) | Vector-only loses recall on rare Dutch medical terms and proper names ("cardioversie", "Dr. Vanderstraeten") that BM25 picks up by exact match; BM25-only loses recall on cross-language and synonymic queries that vectors capture; ColBERT-only forces re-encoding the entire corpus on every model change, which is operationally untenable. |
| Knowledge representation | Typed PostgreSQL taxonomy tables (doctors, departments, conditions and their relationships) sitting alongside the chunk store | Embed graph structure into chunk text; standalone Neo4j Graph Data Science (Neo4j Graph Data Science manual) | Embedding the graph into chunk text loses the structured-lookup property used by the doctor-and-department resolvers; a separate Neo4j service added an operational system, a second access-control plane, and a synchronisation surface against the relational tenant + taxonomy data without ever being load-bearing in retrieval. See ADR-0053 for the consolidation. |
| Generation grounding | Citation-traceable inline [N] markers + always-on safety guard | Free-form generation with post-hoc citation extraction; trust-the-LLM grounding | A medical-information assistant cannot ship "we'll check the citation later". The prompt enforces inline [N] markers immediately after each claim; the post-generation safety pass blocks medical-advice patterns that survive the retrieval gate. See Safety Architecture. |

Why RAG fits the ZOL use case

The ZOL search problem is an archetypal RAG application:

| Requirement | How RAG addresses it |
| --- | --- |
| Content changes weekly (new brochures, doctor roster, tariffs) | Nightly auto-ingest re-embeds new and changed pages; the next query sees the new content without retraining anything. |
| Accuracy must be defensible | Every claim must be traceable to a chunk the operator can audit; the parametric model cannot be the source of truth. |
| Sources must be verifiable by the user | Citations resolve to clickable hospital URLs; the user can confirm the answer in the source brochure. |
| Multilingual input over Dutch content | The intent classifier reformulates input from any of eight supported languages into canonical Dutch before retrieval; embeddings handle cross-lingual matches; the LLM generates back in the user's language. |
| Mixed content types (HTML pages + PDF brochures + structured taxonomy) | All three are embedded into the same pgvector store; retrieval is shape-agnostic. |
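The "re-embed only new and changed pages" idea in the first row can be illustrated with a content-hash comparison. This page does not say how the nightly ingest detects changes, so the hashing approach, the function name, and the URLs below are all assumptions made for the sake of the example.

```python
import hashlib


def pages_to_reembed(pages, stored_hashes):
    """Return URLs whose text is new or differs from the last ingest.

    pages: dict mapping url to the page's current text.
    stored_hashes: dict mapping url to the SHA-256 hex digest recorded
    at the previous ingest (hypothetical bookkeeping table).
    """
    changed = []
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(url) != digest:
            changed.append(url)
    return sorted(changed)


# Hypothetical state: one unchanged page, one edited page, one new page.
old_tariff = "Parking costs 2 euro per hour."
stored = {
    "https://example.org/tariff": hashlib.sha256(old_tariff.encode("utf-8")).hexdigest(),
    "https://example.org/roster": hashlib.sha256(b"Dr. A, Dr. B").hexdigest(),
}
pages = {
    "https://example.org/tariff": old_tariff,
    "https://example.org/roster": "Dr. A, Dr. B, Dr. C",
    "https://example.org/brochure": "Preparing for knee surgery.",
}

todo = pages_to_reembed(pages, stored)
print(todo)
```

Only the edited roster and the new brochure are queued, so the embedding cost of a nightly run scales with the amount of changed content and not with the size of the corpus.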

How three retrieval paradigms answer the same question

Consider the query "Wat moet ik meebrengen voor een knieoperatie?" ("What should I bring for a knee operation?").

Keyword search (Elasticsearch baseline)

Tokenises the query and matches against an inverted index, ranking by TF-IDF or BM25 score. Returns ranked links; the patient does the synthesis. Fails completely on paraphrase ("voorbereiding heupprothese" — hip prosthesis preparation) because there's no vocabulary bridge between query and corpus.

Semantic search (vector-only)

Embeds the query into a vector space and returns chunks ranked by cosine similarity. Bridges the vocabulary gap (knows that knieoperatie and orthopedische ingreep are related) but still returns links, so the patient still does the synthesis.
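Ranking by cosine similarity can be shown with toy vectors. The three-dimensional vectors below are hand-made for illustration and do not come from any embedding model; the chunk labels are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embedding of the query "knieoperatie" (assumption, not real model output).
query = [0.9, 0.1, 0.0]

# Toy chunk embeddings: the related chunk points in a similar direction.
chunks = {
    "orthopedische ingreep": [0.8, 0.2, 0.1],
    "parkeertarief": [0.0, 0.1, 0.9],
}

ranked = sorted(chunks, key=lambda name: cosine(query, chunks[name]), reverse=True)
print(ranked)  # ['orthopedische ingreep', 'parkeertarief']
```

The related chunk ranks first even though it shares no tokens with the query, which is the vocabulary bridge that keyword search lacks.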

RAG

Retrieves the same semantically relevant chunks and synthesises them into a direct answer, with [N] markers anchoring each claim to a specific source URL the patient can click. The operator can audit; the patient gets an actionable answer in one step.

How ZOL extends canonical RAG

The pipeline implements the Lewis et al. 2020 retrieve-then-generate paradigm and then layers nine extensions on top of it. Each extension is documented in its own page; the table below is an index.

| Extension | Stage in our pipeline | Where to read more |
| --- | --- | --- |
| Hybrid retrieval | Stage 5: pgvector + BM25 + taxonomy fused via Reciprocal Rank Fusion (Cormack, Clarke & Büttcher 2009) | Hybrid Search |
| Intent-aware routing | Stage 2: twelve-class classifier; four classes are blocked at the gate | Query Pipeline |
| Multi-layer safety | Stage 2 (pre-retrieval) + post-generation validation + fast quality gate | Safety Overview |
| Quality evaluation | Stage 8: fast embedding-similarity gate + asynchronous DeepEval (Faithfulness, Answer Relevancy) | Quality Evaluation |
| Conversational context | Stage 3: combined classify-and-rewrite pass resolves anaphora using prior turn citations as topic hints | Query Pipeline §Stage 3 |
| Contextual retrieval | Stage 7: page-level summaries pre-computed at ingest time and prepended to the first chunk per document at query time (49 % retrieval-failure reduction reported by Anthropic when paired with hybrid search) | Context Assembly |
| Value Framework affinity rerank | Stage 5b: intent × content_category multiplier prevents wheelchair-vs-cardiology cross-category contamination | Reranking & Evaluation, Query Pipeline §Stage 5b |
| Synthetic doctor-list injection | Stage 5c: guarantees the LLM has the full department roster when the user asks an "all doctors of X" question | Taxonomy Query Enrichment, Query Pipeline §Stage 5c |
| Query decomposition | Stage 1b: feature-flagged LLM gate splits multi-hop questions into focused sub-questions retrieved in parallel | Query Decomposition |
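The intent × content_category multiplier used by the affinity rerank can be sketched as a lookup table applied to retrieval scores. The intent names, category names, and multiplier values below are invented for illustration; the real Value Framework table is documented in Reranking & Evaluation.

```python
# Hypothetical affinity table: (intent, content_category) -> multiplier.
AFFINITY = {
    ("doctor_lookup", "cardiology"): 1.2,
    ("doctor_lookup", "mobility_aids"): 0.5,
}


def apply_affinity(intent, candidates, default=1.0):
    """Rescale retrieval scores by intent/category affinity and re-sort.

    candidates: list of (chunk_id, content_category, score).
    Pairs missing from the table keep their score (multiplier = default).
    """
    rescored = [
        (chunk_id, score * AFFINITY.get((intent, category), default))
        for chunk_id, category, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# A wheelchair page scores higher on raw similarity than the cardiology page.
candidates = [
    ("wheelchair-rental", "mobility_aids", 0.80),
    ("cardio-team", "cardiology", 0.70),
]

reranked = apply_affinity("doctor_lookup", candidates)
print([chunk_id for chunk_id, _ in reranked])  # ['cardio-team', 'wheelchair-rental']
```

With the multiplier applied, the cardiology chunk overtakes the wheelchair chunk for a doctor-lookup intent, which is the cross-category contamination the table row refers to.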

Where this pipeline fits in the published RAG taxonomy

Gao et al. 2024 distinguish three RAG generations:

| Generation | What it adds | ZOL implementation |
| --- | --- | --- |
| Naive RAG | Single-shot retrieve-then-generate | Baseline capability, retained as the fallback path when downstream stages disable themselves. |
| Advanced RAG | Pre-retrieval rewriting; hybrid retrieval; post-retrieval reranking | Implemented in full: intent classification + canonical reformulation (pre-retrieval); pgvector + BM25 + taxonomy fusion (hybrid); Value Framework affinity + cross-encoder reranking (post-retrieval). |
| Modular RAG | Composable modules with routing, fusion, and feedback | Implemented in part: intent-driven strategy selection routes to HYBRID by default; the fast quality gate provides a feedback signal; query decomposition is a routed module that fires only when the heuristic gate accepts it. |

ZOL Intelligent Search is a production-grade Advanced + Modular RAG system, with the Modular elements introduced for the cross-category contamination, list-completeness, and multi-hop failure modes specific to hospital information delivery.

References