What is Retrieval-Augmented Generation?
Retrieval-Augmented Generation (RAG) is the family of architectures formalised by Lewis et al. 2020 in which a language model is conditioned at inference time on documents retrieved from an external corpus, rather than relying on parametric memory alone. ZOL Intelligent Search is a production RAG system: every answer the user reads is generated against a context window built from chunks retrieved from the hospital's own published content, with each claim citation-traceable to a specific source.
Why a parametric LLM is not enough for hospital search
A foundation model used in isolation has three structural limitations that disqualify it from a hospital information role:
| Limitation | Consequence in a hospital context |
|---|---|
| Training-data cutoff | The model cannot know about content published after its cutoff — new doctors, new department phone numbers, the hospital's current parking tariff. |
| Hallucination | Without a retrieval substrate the model fills gaps by generating plausible-but-fabricated text. A hallucinated doctor name or department directs a patient to the wrong place. |
| No source attribution | Even when the parametric output happens to be correct, the user cannot verify it. Every claim must be traceable to a source the operator can audit. |
In a setting where a wrong answer can send a patient to the wrong campus or imply a treatment that the hospital does not offer, grounding outputs in retrieved-and-cited evidence is a safety constraint, not a quality optimisation.
The retrieve–augment–generate paradigm
Following Lewis et al. 2020, knowledge-grounded generation is commonly decomposed into three phases: Retrieve, Augment, and Generate. Our pipeline implements each phase with substantial extensions.
Phase 1 — Retrieve
The user query is embedded with the same model used to embed the corpus, and the embedding is queried against an HNSW-indexed pgvector store of document chunks (the dense bi-encoder pattern of Karpukhin et al. 2020). In our pipeline this dense path is fused with sparse lexical search (BM25 over PostgreSQL tsvector) and a taxonomy lookup over typed entity tables; see Hybrid Search for the fusion algorithm.
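The dense path can be illustrated with a minimal sketch. It assumes toy three-dimensional vectors and a brute-force scan; the production path described above uses an approximate HNSW index in pgvector, and the chunk identifiers and function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Exhaustively score every chunk and return the k most similar ones.

    An HNSW index approximates this ranking without scanning every vector.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunk embeddings (real embeddings have hundreds of dimensions).
chunks = {
    "knee-brochure#2": [0.9, 0.1, 0.0],
    "parking-tariff#1": [0.0, 0.2, 0.9],
    "orthopaedics#4": [0.7, 0.3, 0.1],
}
hits = dense_top_k([1.0, 0.0, 0.0], chunks, k=2)
print([chunk_id for chunk_id, _ in hits])
```

The query and corpus must be embedded by the same model, because similarity scores are only meaningful within one vector space.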
Phase 2 — Augment
Retrieved chunks are not handed to the LLM verbatim. The Context Assembly stage expands each chunk with its ±1 neighbours, removes the text duplicated by chunk overlap, groups chunks by document, and enforces a token budget calibrated against the Lost-in-the-Middle finding (Liu et al. 2024) that LLMs use information in the middle of a long context less reliably than information at its start or end. The augmented prompt then carries the system prompt, the prior conversation turns, and the assembled context.
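The assembly steps can be sketched as follows. This is an illustration under stated assumptions, not the Context Assembly implementation: tokens are approximated by whitespace-separated words, deduplication is done by chunk identity rather than by trimming overlapping text, and the `Chunk` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int  # position of the chunk within its document
    text: str


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def assemble_context(
    hits: list[Chunk],
    store: dict[tuple[str, int], Chunk],
    token_budget: int,
) -> dict[str, list[Chunk]]:
    """Expand hits with their neighbours, dedupe, group by document, cap by budget."""
    seen: set[tuple[str, int]] = set()
    by_doc: dict[str, list[Chunk]] = {}
    used = 0
    for hit in hits:  # hits arrive in rank order, best first
        # Take the hit itself before its neighbours so it gets budget priority.
        for idx in (hit.index, hit.index - 1, hit.index + 1):
            key = (hit.doc_id, idx)
            chunk = store.get(key)
            if chunk is None or key in seen:
                continue
            cost = count_tokens(chunk.text)
            if used + cost > token_budget:
                continue  # this chunk does not fit in the remaining budget
            seen.add(key)
            used += cost
            by_doc.setdefault(hit.doc_id, []).append(chunk)
    # Restore reading order inside each document.
    return {doc: sorted(chunks, key=lambda c: c.index) for doc, chunks in by_doc.items()}


store = {("knee", i): Chunk("knee", i, f"knee text {i}") for i in range(4)}
store[("parking", 0)] = Chunk("parking", 0, "parking text 0")
hits = [store[("knee", 1)], store[("knee", 2)], store[("parking", 0)]]
context = assemble_context(hits, store, token_budget=12)
print({doc: [c.index for c in chunks] for doc, chunks in context.items()})
```

In this sketch the budget is spent in rank order, so a low-ranked hit can be dropped in favour of the neighbours of a higher-ranked one. Whether that trade-off is right depends on how much the neighbours contribute to answer quality.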
Phase 3 — Generate
The LLM generates the answer conditioned on the augmented prompt. The system prompt enforces strict grounding: the model is instructed to cite with [N] markers, to refuse claims unsupported by the context, and never to invent a citation. See Prompt Engineering for the section-by-section breakdown of the rules that govern generation.
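A minimal sketch of how the [N] convention can be set up and checked. It assumes sources are numbered from 1 in the order they appear in the context; the function names and example URLs are hypothetical, and the check only catches markers that point at no source, not claims that are cited but unsupported.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def build_context_block(sources: list[tuple[str, str]]) -> str:
    """Number each (url, text) source so the model can cite it as [N]."""
    return "\n\n".join(
        f"[{n}] {url}\n{text}" for n, (url, text) in enumerate(sources, start=1)
    )


def invalid_citations(answer: str, source_count: int) -> list[int]:
    """Return cited numbers that do not correspond to any supplied source."""
    cited = {int(number) for number in CITATION.findall(answer)}
    return sorted(n for n in cited if not 1 <= n <= source_count)


# Placeholder sources; the URLs and texts are illustrative only.
sources = [
    ("https://example.org/knee-brochure", "Bring your identity card."),
    ("https://example.org/admission", "Report to the admission desk."),
]
answer = "Bring your identity card [1]. Parking is free [3]."
print(build_context_block(sources))
print(invalid_citations(answer, len(sources)))
```

A non-empty result from `invalid_citations` signals an invented citation, which the grounding rules above forbid.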
Trade-offs in our RAG shape
Three high-level shape decisions distinguish ZOL Intelligent Search from a textbook RAG implementation. Each was made deliberately and is captured in an ADR with its rejected alternatives.
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Retrieval shape | Hybrid: dense pgvector + sparse BM25 + typed taxonomy | Vector-only (Karpukhin et al. 2020); BM25-only (Robertson & Zaragoza 2009); ColBERT-only (Khattab & Zaharia 2020) | Vector-only loses recall on rare Dutch medical terms and proper names ("cardioversie", "Dr. Vanderstraeten") that BM25 picks up by exact match; BM25-only loses recall on cross-language and synonymic queries that vectors capture; ColBERT-only forces re-encoding the entire corpus on every model change, which is operationally untenable. |
| Knowledge representation | Typed PostgreSQL taxonomy tables (doctors, departments, conditions and their relationships) sitting alongside the chunk store | Embed graph structure into chunk text; standalone Neo4j Graph Data Science (Neo4j GDS manual) | Embedding the graph into chunk text loses the structured-lookup property used by the doctor-and-department resolvers; a separate Neo4j service added an operational system, a second access-control plane, and a synchronisation surface against the relational tenant + taxonomy data without ever being load-bearing in retrieval. See ADR-0053 for the consolidation. |
| Generation grounding | Citation-traceable inline [N] markers + always-on safety guard | Free-form generation with post-hoc citation extraction; trust-the-LLM grounding | A medical-information assistant cannot ship "we'll check the citation later". The prompt enforces inline [N] markers immediately after each claim; the post-generation safety pass blocks medical-advice patterns that survive the retrieval gate. See Safety Architecture. |
Why RAG fits the ZOL use case
The ZOL search problem is an archetypal RAG application:
| Requirement | How RAG addresses it |
|---|---|
| Content changes weekly (new brochures, doctor roster, tariffs) | Nightly auto-ingest re-embeds new and changed pages; the next query sees the new content without retraining anything. |
| Accuracy must be defensible | Every claim is traceable to a retrieved chunk the operator can audit; the parametric model is never the source of truth. |
| Sources must be verifiable by the user | Citations resolve to clickable hospital URLs; the user can confirm the answer in the source brochure. |
| Multilingual input over Dutch content | The intent classifier reformulates input from any of eight supported languages into canonical Dutch before retrieval; embeddings handle cross-lingual matches; the LLM generates back in the user's language. |
| Mixed content types (HTML pages + PDF brochures + structured taxonomy) | All three are embedded into the same pgvector store; retrieval is shape-agnostic. |
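The nightly auto-ingest row above implies some way of deciding which pages are new or changed. One common approach, assumed here rather than taken from the ZOL pipeline, is to compare a hash of each crawled page against the hash stored at the previous ingest. The function names and URLs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reembed(crawled: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or differs from the stored fingerprint."""
    return sorted(
        url
        for url, text in crawled.items()
        if stored_hashes.get(url) != content_hash(text)
    )


# Hashes recorded by the previous (hypothetical) ingest run.
stored = {
    "https://example.org/parking": content_hash("old tariff text"),
    "https://example.org/knee": content_hash("knee brochure text"),
}
# Tonight's crawl: one page changed, one unchanged, one new.
crawled = {
    "https://example.org/parking": "new tariff text",
    "https://example.org/knee": "knee brochure text",
    "https://example.org/new-doctor": "new doctor page",
}
changed = pages_to_reembed(crawled, stored)
print(changed)
```

Only the changed and new pages need re-chunking and re-embedding, which keeps the nightly run proportional to the amount of edited content rather than to the size of the corpus.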
How three retrieval paradigms answer the same question
Consider "Wat moet ik meebrengen voor een knieoperatie?" — what should I bring for a knee operation?
Keyword search (Elasticsearch baseline)
Tokenises the query and matches against an inverted index, ranking by TF-IDF or BM25 score. Returns ranked links; the patient does the synthesis. Fails completely on paraphrase: a query such as "voorbereiding heupprothese" (hip prosthesis preparation) matches nothing when the relevant brochure uses different wording, because there is no vocabulary bridge between query and corpus.
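The vocabulary gap is easy to reproduce with a small Okapi BM25 scorer. The sketch below assumes whitespace tokenisation, the commonly used defaults k1 = 1.5 and b = 0.75, and a two-document toy corpus; it is not the Elasticsearch or PostgreSQL scoring code.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(tokens) for tokens in tokenised) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "wat meebrengen voor uw knieoperatie",  # what to bring for your knee operation
    "parkeren op de campus",                # parking on campus
]
exact = bm25_scores("knieoperatie meebrengen", docs)
paraphrase = bm25_scores("voorbereiding orthopedische ingreep", docs)
print(exact, paraphrase)
```

The exact-term query scores the knee brochure above zero, while the paraphrase scores zero on every document because it shares no token with the corpus.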
Semantic search
Embeds the query into a vector space and returns chunks ranked by cosine similarity. Bridges the vocabulary gap (knows that knieoperatie and orthopedische ingreep are related) but still returns links — the patient still does the synthesis.
RAG
Retrieves the same semantically relevant chunks and synthesises them into a direct answer, with [N] markers anchoring each claim to a specific source URL the patient can click. The operator can audit; the patient gets an actionable answer in one step.
How ZOL extends canonical RAG
The pipeline implements the retrieve-then-generate paradigm of Lewis et al. 2020 and then layers nine extensions on top of it. Each extension is documented on its own page; the table below is an index.
| Extension | Stage in our pipeline | Where to read more |
|---|---|---|
| Hybrid retrieval | Stage 5 — pgvector + BM25 + taxonomy fused via Reciprocal Rank Fusion (Cormack, Clarke & Büttcher 2009) | Hybrid Search |
| Intent-aware routing | Stage 2 — twelve-class classifier; four classes are blocked at the gate | Query Pipeline |
| Multi-layer safety | Stage 2 (pre-retrieval) + post-generation validation + fast quality gate | Safety Overview |
| Quality evaluation | Stage 8 — fast embedding-similarity gate + asynchronous DeepEval (Faithfulness, Answer Relevancy) | Quality Evaluation |
| Conversational context | Stage 3 — combined classify-and-rewrite pass resolves anaphora using prior turn citations as topic hints | Query Pipeline §Stage 3 |
| Contextual retrieval | Stage 7 — page-level summaries pre-computed at ingest time and prepended to the first chunk per document at query time (49 % retrieval-failure reduction reported by Anthropic when paired with hybrid search) | Context Assembly |
| Value Framework affinity rerank | Stage 5b — intent × content_category multiplier prevents wheelchair-vs-cardiology cross-category contamination | Reranking & Evaluation, Query Pipeline §Stage 5b |
| Synthetic doctor-list injection | Stage 5c — guarantees the LLM has the full department roster when the user asks an "all doctors of X" question | Taxonomy Query Enrichment, Query Pipeline §Stage 5c |
| Query decomposition | Stage 1b — feature-flagged LLM gate splits multi-hop questions into focused sub-questions retrieved in parallel | Query Decomposition |
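The hybrid retrieval row names Reciprocal Rank Fusion. As a minimal sketch: each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The constant k = 60 is the value used by Cormack, Clarke & Büttcher (2009); whether the ZOL pipeline uses the same constant or weights the three paths equally is not stated here, so treat both as assumptions. The chunk identifiers are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


dense = ["a", "b", "c"]   # pgvector path
sparse = ["b", "d"]       # BM25 path
taxonomy = ["b", "a"]     # taxonomy lookup path
fused = reciprocal_rank_fusion([dense, sparse, taxonomy])
print([chunk_id for chunk_id, _ in fused])
```

Because only ranks are used, the three paths do not need comparable score scales, which is what makes the method convenient for mixing cosine similarity, BM25, and structured lookups.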
Where this pipeline fits in the published RAG taxonomy
Gao et al. 2024 distinguish three RAG generations:
| Generation | What it adds | ZOL implementation |
|---|---|---|
| Naive RAG | Single-shot retrieve-then-generate | Baseline capability — retained as the fallback path when downstream stages disable themselves. |
| Advanced RAG | Pre-retrieval rewriting; hybrid retrieval; post-retrieval reranking | Implemented in full — intent classification + canonical reformulation (pre-retrieval); pgvector + BM25 + taxonomy fusion (hybrid); Value Framework affinity + cross-encoder reranking (post-retrieval). |
| Modular RAG | Composable modules with routing, fusion, and feedback | Implemented in part — intent-driven strategy selection routes to HYBRID by default; the fast quality gate provides a feedback signal; query decomposition is a routed module that fires only when the heuristic gate accepts it. |
ZOL Intelligent Search is a production-grade Advanced + Modular RAG system, with the Modular elements introduced for the cross-category contamination, list-completeness, and multi-hop failure modes specific to hospital information delivery.
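The cross-category contamination case can be made concrete with a sketch of an affinity rerank: each retrieval score is multiplied by a factor looked up from the pair (intent, content category). The intent names, categories, and multiplier values below are hypothetical and chosen only to show the mechanism.

```python
def affinity_rerank(
    hits: list[tuple[str, str, float]],
    intent: str,
    affinity: dict[tuple[str, str], float],
    default: float = 1.0,
) -> list[tuple[str, float]]:
    """Rescale (chunk_id, category, score) hits by the intent x category multiplier."""
    rescored = [
        (chunk_id, score * affinity.get((intent, category), default))
        for chunk_id, category, score in hits
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical multipliers: boost matching categories, damp unrelated ones.
affinity = {
    ("medical_department", "cardiology"): 1.2,
    ("medical_department", "facilities"): 0.5,
}
hits = [
    ("wheelchair-rental#1", "facilities", 0.80),
    ("cardiology-intake#3", "cardiology", 0.75),
]
reranked = affinity_rerank(hits, "medical_department", affinity)
print([chunk_id for chunk_id, _ in reranked])
```

In this toy case the wheelchair chunk has the higher raw score but drops below the cardiology chunk once the multiplier for the detected intent is applied.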
References
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
- Karpukhin, V., Oğuz, B., Min, S., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
- Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.
- Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. TACL 12.
- pgvector contributors. (2024). pgvector — vector similarity for Postgres.
- Anthropic. (2024). Introducing Contextual Retrieval. Reports a 49 % reduction in retrieval failures when contextual embeddings are combined with contextual BM25.
- Gao, Y., et al. (2024). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.