What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is the family of architectures formalised by Lewis et al. 2020 in which a language model is conditioned at inference time on documents retrieved from an external corpus, rather than relying on parametric memory alone. ZOL Intelligent Search is a production RAG system: every answer the user reads is generated against a context window built from chunks retrieved from the hospital's own published content, with each claim citation-traceable to a specific source.

Why a parametric LLM is not enough for hospital search

A foundation model used in isolation has three structural limitations that disqualify it from a hospital information role:

Limitation	Consequence in a hospital context
Training-data cutoff	The model cannot know about content published after its cutoff — new doctors, new department phone numbers, the hospital's current parking tariff.
Hallucination	Without a retrieval substrate the model fills gaps by generating plausible-but-fabricated text. A hallucinated doctor name or department directs a patient to the wrong place.
No source attribution	Even when the parametric output happens to be correct, the user cannot verify it. Every claim must be traceable to a source the operator can audit.

In a setting where a wrong answer can send a patient to the wrong campus or imply a treatment that the hospital does not offer, grounding outputs in retrieved-and-cited evidence is a safety constraint, not a quality optimisation.

The retrieve–augment–generate paradigm

The paradigm introduced by Lewis et al. 2020 is commonly described as three phases (Retrieve, Augment, Generate), each of which our pipeline implements with substantial extensions.

Phase 1 — Retrieve

The user query is embedded with the same model used to embed the corpus, and the embedding is queried against an HNSW-indexed pgvector store of document chunks (the dense bi-encoder pattern of Karpukhin et al. 2020). In our pipeline this dense path is fused with sparse lexical search (BM25 over PostgreSQL tsvector) and a taxonomy lookup over typed entity tables; see Hybrid Search for the fusion algorithm.
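As a minimal sketch of the dense path only, the snippet below ranks chunks by cosine similarity to a query vector. It is an exhaustive exact search over toy three-dimensional vectors; the production path described above uses an approximate HNSW index in pgvector instead. The names `cosine_similarity`, `top_k`, and `CHUNKS`, and all vector values, are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, similarity) pairs closest to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk embeddings (real embeddings have hundreds of dimensions).
CHUNKS = {
    "knee-prep": [0.9, 0.1, 0.0],
    "parking": [0.0, 0.2, 0.9],
    "hip-prep": [0.7, 0.3, 0.1],
}

results = top_k([1.0, 0.0, 0.0], CHUNKS, k=2)
print([chunk_id for chunk_id, _ in results])
```

The query and the corpus must be embedded with the same model for these similarities to be meaningful, which is why the text above stresses that constraint.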

Phase 2 — Augment

Retrieved chunks are not handed to the LLM verbatim. The Context Assembly stage expands each chunk with its ±1 neighbours, removes the text duplicated by chunking overlap, groups the result by document, and enforces a token budget calibrated against the Lost-in-the-Middle finding (Liu et al. 2024) that LLMs under-attend to mid-context tokens. The augmented prompt then carries the system prompt, the prior conversation turns, and the assembled context.
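The four assembly steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `Chunk` type, the `merge_overlap` and `assemble_context` helpers, the use of whitespace-separated words as a stand-in for tokens, and the choice to truncate the last document at the budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int  # position of the chunk inside its document
    text: str


def merge_overlap(left, right):
    """Join two word lists, dropping the longest prefix of `right`
    that repeats the end of `left` (the chunking overlap)."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return left + right


def assemble_context(hits, store, token_budget):
    """hits: list of (doc_id, index); store: {(doc_id, index): Chunk}."""
    # Step 1: expand each hit with its +/-1 neighbours; the set removes repeats.
    wanted = set()
    for doc_id, index in hits:
        for neighbour in (index - 1, index, index + 1):
            if (doc_id, neighbour) in store:
                wanted.add((doc_id, neighbour))

    # Step 2: group by document, keeping documents in first-hit order.
    doc_order = list(dict.fromkeys(doc_id for doc_id, _ in hits))

    sections, used = [], 0
    for doc_id in doc_order:
        remaining = token_budget - used
        if remaining <= 0:
            break  # Step 4: budget exhausted, drop the remaining documents
        words, previous = [], None
        for i in sorted(idx for d, idx in wanted if d == doc_id):
            chunk_words = store[(doc_id, i)].text.split()
            if previous is not None and i == previous + 1:
                # Step 3: adjacent chunks share overlapping text; keep it once.
                words = merge_overlap(words, chunk_words)
            else:
                words = words + chunk_words
            previous = i
        words = words[:remaining]  # Step 4: truncate to the remaining budget
        used += len(words)
        sections.append((doc_id, " ".join(words)))
    return sections


chunks = [
    Chunk("knee", 0, "Bring your identity card"),
    Chunk("knee", 1, "identity card and medication list"),
    Chunk("knee", 2, "medication list and crutches"),
    Chunk("parking", 0, "Parking costs are listed online"),
]
store = {(c.doc_id, c.index): c for c in chunks}
sections = assemble_context([("knee", 1), ("parking", 0)], store, token_budget=12)
for doc_id, text in sections:
    print(doc_id, "->", text)
```

A real implementation would count tokens with the generator model's tokenizer and would probably prefer dropping a whole chunk over cutting mid-sentence; the word-level truncation here only keeps the sketch short.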

Phase 3 — Generate

The LLM generates the answer conditioned on the augmented prompt. The system prompt enforces strict grounding: the model is instructed to cite with [N] markers, to refuse claims unsupported by the context, and never to invent a citation. See Prompt Engineering for the section-by-section breakdown of the rules that govern generation.
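The grounding rules lend themselves to a mechanical check after generation. The sketch below numbers the sources for the prompt and then flags `[N]` markers that point at no source, plus sentences that carry no marker at all. The function names and the sentence-splitting heuristic are assumptions made for illustration; the source does not describe how, or whether, the pipeline validates markers this way.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def build_context(sources):
    """Number the sources so the model can cite them as [1], [2], ..."""
    return "\n".join(f"[{n}] {text}" for n, text in enumerate(sources, start=1))


def invalid_citations(answer, source_count):
    """Markers that reference a source number that was never provided."""
    cited = {int(number) for number in CITATION.findall(answer)}
    return sorted(n for n in cited if n < 1 or n > source_count)


def uncited_sentences(answer):
    """Sentences with no [N] marker (naive split on ., ! and ?)."""
    parts = re.split(r"(?<=[.!?])\s+", answer)
    sentences = [part.strip() for part in parts if part.strip()]
    return [s for s in sentences if not CITATION.search(s)]


sources = ["Bring your identity card.", "Parking is paid per hour."]
answer = "Bring your identity card [1]. Parking is free [3]. Arrive early."

print(build_context(sources))
print(invalid_citations(answer, len(sources)))
print(uncited_sentences(answer))
```

An invented marker such as `[3]` above is detectable without any model call, which makes this kind of check cheap to run on every answer. Whether a cited source actually supports the claim is a separate question that a regex cannot answer.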

Trade-offs in our RAG shape

Three high-level shape decisions distinguish ZOL Intelligent Search from a textbook RAG implementation. Each was made deliberately and is captured in an ADR with its rejected alternatives.

Decision	Chosen	Alternatives considered	Rejected because
Retrieval shape	Hybrid: dense pgvector + sparse BM25 + typed taxonomy	Vector-only (Karpukhin et al. 2020); BM25-only (Robertson & Zaragoza 2009); ColBERT-only (Khattab & Zaharia 2020)	Vector-only loses recall on rare Dutch medical terms and proper names ("cardioversie", "Dr. Vanderstraeten") that BM25 picks up by exact match; BM25-only loses recall on cross-language and synonymic queries that vectors capture; ColBERT-only forces re-encoding the entire corpus on every model change, which is operationally untenable.
Knowledge representation	Typed PostgreSQL taxonomy tables (doctors, departments, conditions and their relationships) sitting alongside the chunk store	Embed graph structure into chunk text; standalone Neo4j Graph Data Science (@neo4j_gds_manual)	Embedding the graph into chunk text loses the structured-lookup property used by the doctor-and-department resolvers; a separate Neo4j service added an operational system, a second access-control plane, and a synchronisation surface against the relational tenant + taxonomy data without ever being load-bearing in retrieval. See ADR-0053 for the consolidation.
Generation grounding	Citation-traceable inline `[N]` markers + always-on safety guard	Free-form generation with post-hoc citation extraction; trust-the-LLM grounding	A medical-information assistant cannot ship "we'll check the citation later". The prompt enforces inline `[N]` markers immediately after each claim; the post-generation safety pass blocks medical-advice patterns that survive the retrieval gate. See Safety Architecture.

Why RAG fits the ZOL use case

The ZOL search problem is an archetypal RAG application:

Requirement	How RAG addresses it
Content changes weekly (new brochures, doctor roster, tariffs)	Nightly auto-ingest re-embeds new and changed pages; the next query sees the new content without retraining anything.
Accuracy must be defensible	Every claim carries a citation to a chunk the operator can audit; the parametric model is never the source of truth.
Sources must be verifiable by the user	Citations resolve to clickable hospital URLs; the user can confirm the answer in the source brochure.
Multilingual input over Dutch content	The intent classifier reformulates input from any of eight supported languages into canonical Dutch before retrieval; embeddings handle cross-lingual matches; the LLM generates back in the user's language.
Mixed content types (HTML pages + PDF brochures + structured taxonomy)	All three are embedded into the same pgvector store; retrieval is shape-agnostic.
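The "new and changed pages" part of the nightly ingest implies some form of change detection. One common approach, shown here purely as an assumption about how it could be done, is to compare a content hash of each crawled page against the hash stored at the previous ingest. The helper names and the URL paths are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reembed(crawled, stored_hashes):
    """Return URLs that are new, or whose text changed since the last ingest.

    crawled: {url: current text}; stored_hashes: {url: hash from last ingest}.
    """
    changed = [
        url
        for url, text in crawled.items()
        if stored_hashes.get(url) != content_hash(text)
    ]
    return sorted(changed)


stored_hashes = {
    "/tarieven": content_hash("Parking: old tariff"),
    "/contact": content_hash("Call the reception desk"),
}
crawled = {
    "/tarieven": "Parking: new tariff",        # changed
    "/contact": "Call the reception desk",     # unchanged
    "/artsen/nieuw": "New doctor profile",     # new page
}

print(pages_to_reembed(crawled, stored_hashes))
```

Only the returned pages need to be re-chunked and re-embedded, so the cost of a nightly run scales with the amount of change rather than with the size of the corpus.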

How three retrieval paradigms answer the same question

Consider "Wat moet ik meebrengen voor een knieoperatie?" — what should I bring for a knee operation?

Keyword search (Elasticsearch baseline)

Tokenises the query and matches against an inverted index, ranking by TF-IDF or BM25 score. Returns ranked links; the patient does the synthesis. Fails on paraphrase (for example "voorbereiding heupprothese", hip prosthesis preparation) because exact-term matching provides no vocabulary bridge between query and corpus.
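The vocabulary gap is easy to reproduce with a small BM25 scorer. The sketch below uses the widely used formulation with a smoothed IDF and the conventional defaults `k1=1.5` and `b=0.75`; the whitespace tokeniser, the parameter values, and the three toy documents are assumptions, not the configuration of any system described here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    if not docs:
        return []
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # no exact match, no contribution
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "voorbereiding knieoperatie meebrengen",
    "voorbereiding orthopedische ingreep",  # related topic, different words
    "parkeertarief campus",
]
scores = bm25_scores("knieoperatie meebrengen", docs)
print(scores)
```

The second document concerns a closely related topic but shares no term with the query, so it scores exactly zero, the same as the unrelated parking page. That is the failure the semantic path is meant to cover.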

Semantic search

Embeds the query into a vector space and returns chunks ranked by cosine similarity. Bridges the vocabulary gap (knieoperatie and orthopedische ingreep land close together in the embedding space) but still returns links, so the patient still does the synthesis.

RAG

Retrieves the same semantically relevant chunks and synthesises them into a direct answer, with [N] markers anchoring each claim to a specific source URL the patient can click. The operator can audit; the patient gets an actionable answer in one step.

How ZOL extends canonical RAG

The pipeline implements the Lewis et al. 2020 retrieve-then-generate paradigm and then layers nine extensions on top of it. Each extension is documented in its own page; the table below is an index.

Extension	Stage in our pipeline	Where to read more
Hybrid retrieval	Stage 5 — pgvector + BM25 + taxonomy fused via Reciprocal Rank Fusion (Cormack, Clarke & Büttcher 2009)	Hybrid Search
Intent-aware routing	Stage 2 — twelve-class classifier; four classes are blocked at the gate	Query Pipeline
Multi-layer safety	Stage 2 (pre-retrieval) + post-generation validation + fast quality gate	Safety Overview
Quality evaluation	Stage 8 — fast embedding-similarity gate + asynchronous DeepEval (Faithfulness, Answer Relevancy)	Quality Evaluation
Conversational context	Stage 3 — combined classify-and-rewrite pass resolves anaphora using prior turn citations as topic hints	Query Pipeline §Stage 3
Contextual retrieval	Stage 7 — page-level summaries pre-computed at ingest time and prepended to the first chunk per document at query time (49 % retrieval-failure reduction reported by Anthropic when paired with hybrid search)	Context Assembly
Value Framework affinity rerank	Stage 5b — `intent × content_category` multiplier prevents wheelchair-vs-cardiology cross-category contamination	Reranking & Evaluation, Query Pipeline §Stage 5b
Synthetic doctor-list injection	Stage 5c — guarantees the LLM has the full department roster when the user asks an "all doctors of X" question	Taxonomy Query Enrichment, Query Pipeline §Stage 5c
Query decomposition	Stage 1b — feature-flagged LLM gate splits multi-hop questions into focused sub-questions retrieved in parallel	Query Decomposition
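Reciprocal Rank Fusion, named in the hybrid retrieval row above, combines ranked lists using ranks alone: each document receives the sum of 1 / (k + rank) over every list in which it appears. The sketch below uses k = 60, the constant from Cormack, Clarke & Büttcher 2009; whether the pipeline uses that value, or weights its three retrievers equally, is not stated in this page and is assumed here. The document ids are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: iterable of lists, each ordered best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]     # hypothetical pgvector ranking
sparse = ["b", "d", "a"]    # hypothetical BM25 ranking
taxonomy = ["b"]            # hypothetical taxonomy hit

fused = reciprocal_rank_fusion([dense, sparse, taxonomy])
print(fused)
```

Because only ranks enter the formula, the cosine similarities and BM25 scores never need to be normalised onto a common scale, which is the main practical reason to prefer rank fusion over score averaging.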

Where this pipeline fits in the published RAG taxonomy

Gao et al. 2024 distinguish three RAG generations:

Generation	What it adds	ZOL implementation
Naive RAG	Single-shot retrieve-then-generate	Baseline capability — retained as the fallback path when downstream stages disable themselves.
Advanced RAG	Pre-retrieval rewriting; hybrid retrieval; post-retrieval reranking	Implemented in full — intent classification + canonical reformulation (pre-retrieval); pgvector + BM25 + taxonomy fusion (hybrid); Value Framework affinity + cross-encoder reranking (post-retrieval).
Modular RAG	Composable modules with routing, fusion, and feedback	Implemented in part — intent-driven strategy selection routes to HYBRID by default; the fast quality gate provides a feedback signal; query decomposition is a routed module that fires only when the heuristic gate accepts it.

ZOL Intelligent Search is a production-grade Advanced + Modular RAG system, with the Modular elements introduced for the cross-category contamination, list-completeness, and multi-hop failure modes specific to hospital information delivery.
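To make the cross-category contamination case concrete, here is a minimal sketch of an `intent × content_category` multiplier applied after retrieval. The intent labels, category labels, multiplier values, and the default of 1.0 for unlisted pairs are all invented for illustration; the actual affinity table is documented under Reranking & Evaluation.

```python
def affinity_rerank(candidates, intent, affinity, default=1.0):
    """Rescale retrieval scores by how well each chunk's category fits the intent.

    candidates: list of (chunk_id, content_category, retrieval_score).
    affinity: {(intent, content_category): multiplier}.
    """
    rescored = [
        (chunk_id, score * affinity.get((intent, category), default))
        for chunk_id, category, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical affinity table: a practical-information question should
# favour practical pages over clinical department pages.
AFFINITY = {
    ("practical_info", "practical"): 1.2,
    ("practical_info", "cardiology"): 0.5,
}

candidates = [
    ("cardiology-page", "cardiology", 0.80),   # ranked first by raw score
    ("wheelchair-rental", "practical", 0.75),
]

reranked = affinity_rerank(candidates, "practical_info", AFFINITY)
print([chunk_id for chunk_id, _ in reranked])
```

In this toy case the cardiology page wins on raw retrieval score but falls behind the wheelchair page once the multiplier is applied, which is the behaviour the extension is meant to guarantee.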

References

Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Karpukhin, V., Oğuz, B., Min, S., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 12.
pgvector contributors. (2024). pgvector — vector similarity for Postgres.
Anthropic. (2024). Introducing Contextual Retrieval. — 49 % retrieval-failure reduction with contextual embeddings + hybrid search.
Gao, Y., et al. (2024). Retrieval-augmented generation for large language models: A survey. arXiv 2312.10997.
