Query Processing Pipeline

The query processing pipeline is the central nervous system of the ZOL Intelligent Search. Following the retrieve-then-generate paradigm of Lewis et al. 2020, every user query traverses a carefully orchestrated sequence of stages, each designed to progressively refine understanding, retrieve relevant information, and generate a grounded, safe response.

This page vs. Context Retrieval Architecture

This page is the canonical end-to-end pipeline map (cache check → intent → retrieval → assembly → generation → quality gate). For a retrieval-focused companion view that numbers only the retrieval stages and goes deeper on the three channels and fusion, see Context Retrieval Architecture, which carries an explicit stage-number mapping back to this page.

Pipeline-Level Trade-offs

Decision	Chosen	Alternatives considered	Rejected because
Retrieval shape	Hybrid (dense pgvector + sparse BM25 + taxonomy)	Vector-only (@karpukhin2020dpr); BM25-only; ColBERT late-interaction reranker only @khattab2020colbert	Vector-only loses recall on Dutch medical terms ("cardioversie", "Dr. Vanderstraeten") that BM25 catches by exact match; BM25-only loses recall on cross-language and synonym queries that vectors capture; ColBERT-only requires re-encoding the entire corpus at every model change, which is operationally untenable. ColBERT survives as a feature-flagged reranker (ADR-0039), not the primary path.
Result fusion	Reciprocal Rank Fusion (RRF) with k=60	Weighted linear combination of similarity scores; learned-to-rank	Score scales differ across vector cosine, BM25 ts_rank, and graph priority; weighted linear fusion required hand-tuned coefficients that drifted with each embedding swap. RRF is score-agnostic (sums `1/(k + r + 1)` over rank-only) and stable across model migrations. See ADR-0020.
Intent classification	LLM (Tier 2 via the `structured_call` helper)	Regex / keyword classifier; small fine-tuned BERT	Regex misses novel phrasings, especially in Turkish, Arabic, or Romanian patient input; a fine-tuned BERT classifier requires labelled corpora we don't have at pilot scale. The `structured_call` helper provides JSON-schema-validated output with retries and a typed `StructuredCallError` fallback, replacing the previous `json.loads`-and-pray pattern.
Cross-category contamination prevention	Value Framework affinity rerank (Stage 5b)	Per-intent hard category filter; cross-encoder rerank only	A hard filter excludes correct chunks when the user's intent class is misclassified at the boundary (e.g., `condition_information` vs `treatment_or_exam_information`). A cross-encoder rerank is cheap to add but does not address the wheelchair-vs-cardiology category leak observed in the wheelchair regression — the cross-encoder ranks within retrieved chunks but cannot rebalance categories. The affinity rerank multiplies score by `intent × category` coefficients, providing a graded penalty that recovers when intent confidence is low.

Pipeline Overview

Pipeline Stages in Detail

Stage 1: Semantic Cache Check (PostgreSQL + pgvector)

Latency: 1–30ms | Purpose: Avoid redundant LLM calls | ADR: ADR-0031

Before invoking any AI model, the pipeline checks the two-tier semantic query cache stored in PostgreSQL (not Redis). The cache operates on LLM-reformulated queries — the intent classifier normalizes language, spelling, and word order before caching, which maximizes hit rates across languages.

Tier	Mechanism	Latency	What It Catches
Tier 1	SHA-256 hash of reformulated query text	~1ms	Identical reformulations — including cross-language queries that normalize to the same Dutch form
Tier 2	Cosine similarity via HNSW index on 1,536-dim `text-embedding-3-large` embeddings (ADR-0048, @openai2024embeddings; @pgvector_docs)	~30ms	Semantically equivalent queries with different wording (e.g., "Welke artsen werken op orthopedie?" vs "Welke dokters zijn er op de afdeling orthopedie?")

On a cache hit, the stored response text and citations are returned immediately, skipping all downstream stages (intent classification, retrieval, LLM generation, quality evaluation). Entries expire after 1 hour (configurable via admin Settings UI). See Storage Architecture for schema details.
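The two-tier lookup can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the production code: the `entries` list stands in for the PostgreSQL table, a linear scan replaces the HNSW index, and the `0.95` similarity threshold and all function names are hypothetical. Expiry and the Tier 2 guards are left out.

```python
import hashlib
import math
from typing import Optional, Sequence


def tier1_key(reformulated_query: str) -> str:
    """Tier 1 key: SHA-256 hex digest of the reformulated query text."""
    return hashlib.sha256(reformulated_query.encode("utf-8")).hexdigest()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity; returns 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cache_lookup(query: str, embedding: Sequence[float], entries: list,
                 threshold: float = 0.95) -> Optional[dict]:
    """Try the exact-hash tier first, then the nearest embedding (assumed threshold)."""
    key = tier1_key(query)
    for entry in entries:
        if entry["key"] == key:
            return entry  # Tier 1 hit: identical reformulation
    best = max(entries,
               key=lambda e: cosine_similarity(embedding, e["embedding"]),
               default=None)
    if best is not None and cosine_similarity(embedding, best["embedding"]) >= threshold:
        return best  # Tier 2 hit: semantically close reformulation
    return None
```

Checking the hash first keeps the common case cheap; the embedding comparison only runs when the exact reformulation has not been seen before.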

Tier 2 Safety Guards

Tier 2 employs two protective mechanisms against cross-topic cache contamination — a phenomenon where structurally similar queries with different medical entities return incorrect cached responses:

1. Intent Filter: Tier 2 only considers cache entries whose intent classification matches that of the incoming query; entries stored without an intent, and lookups that carry none, are not filtered. This prevents, for example, a doctor lookup response from being returned for a condition query. The SQL filter `AND (:intent IS NULL OR intent IS NULL OR intent = :intent)` is applied before the cosine similarity ranking.

2. Jaccard Keyword Guard: Even within the same intent, embedding similarity alone is insufficient. Queries like "Welke cardiologen werken bij ZOL?" and "Wie zijn de orthopedisten bij ZOL?" produce cosine similarity >0.97 because their sentence structure is identical — only the entity name differs. The Jaccard Keyword Guard prevents this by computing the Jaccard similarity coefficient (|A ∩ B| / |A ∪ B|) over the content words of both queries. Content words are extracted by filtering tokens ≥4 characters and removing Dutch stop words and generic medical terms ("artsen", "werken", "afdeling", etc.). If the Jaccard similarity falls below 0.5, the cache hit is rejected.

New Query	Cached Query	Content Words	Jaccard	Verdict
"Welke orthopedisten bij ZOL?"	"Welke cardiologen bij ZOL?"	{orthopedie} vs {cardiologie}	0.00	Rejected
"Hoeveel kost parkeren?"	"Wat kost het parkeren bij ZOL?"	{parkeren, kost} vs {parkeren, kost}	1.00	Accepted
"borst onderzoek ZOL"	"borstonderzoek bij ZOL"	{borstonderzoek} vs {borstonderzoek}	1.00	Accepted

This two-layer defense (intent + Jaccard) ensures that Tier 2 cache hits are both semantically close and topically aligned, eliminating cross-topic contamination without sacrificing cache effectiveness for legitimate reformulations.
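A minimal sketch of the Jaccard Keyword Guard described above. The stop-word set is a small illustrative subset (an assumption, as are the function names and the choice to reject when neither query has any content words); only the ≥4-character rule, the Jaccard formula, and the 0.5 threshold come from the text.

```python
import re

# Illustrative subset of Dutch stop words and generic medical terms (assumed).
STOP_WORDS = {"welke", "zijn", "hoeveel", "artsen", "werken", "afdeling"}


def content_words(query: str) -> set:
    """Lower-cased tokens of at least 4 characters that are not stop words."""
    tokens = re.findall(r"\w+", query.lower())
    return {t for t in tokens if len(t) >= 4 and t not in STOP_WORDS}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def passes_keyword_guard(new_query: str, cached_query: str,
                         threshold: float = 0.5) -> bool:
    """Accept the Tier 2 hit only if content-word overlap reaches the threshold."""
    return jaccard(content_words(new_query), content_words(cached_query)) >= threshold
```

With these assumptions, the cardiologist and orthopedist queries share no content words and are rejected, while the two parking queries reduce to the same word set and are accepted.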

Stage 2: Intent Classification

Latency: ~400 ms (combined classify-and-rewrite call when conversation history exists; not yet measured at p95 in pilot) | Model: Tier 2 (standard) via the structured_call helper | Purpose: Route, guard, and detect follow-ups

Intent classification is the most consequential stage in the pipeline. It determines both the retrieval strategy (which data sources to query) and whether the query is safe to process at all. The classifier also detects whether the query is a follow-up to a previous conversation turn, which affects downstream filtering and boosting behavior. The classifier maps each query to one of twelve intent types.

Implementation note — structured_call helper: backend/app/services/intent_classification_service.py invokes the LLM through structured_call(output_model=IntentClassificationOutput) (from app.llm.structured). The helper enforces JSON-schema validation on the response (intent enum, entities, confidence, is_followup), retries on validation failure, and raises a typed StructuredCallError only when the model still cannot satisfy the schema after retries. This ~190-LOC helper replaced Pydantic AI on 2026-05-12 (commit b8d8da67) after production telemetry showed Agent.run() added ~720 ms per call without buying any capability structured_call lacks; it preserves the validation contract at zero latency tax. See the Decision-Cost Rubric case study for the post-mortem.
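The validate-and-retry contract can be illustrated with the standard library alone. This is a sketch of the pattern, not the actual `structured_call` helper: the function names, the attempt count, the hand-written validator (standing in for JSON-schema validation), and the truncated intent set are all assumptions; only the `StructuredCallError` name and the validated fields come from the note above.

```python
import json
from typing import Any, Callable, Dict


class StructuredCallError(Exception):
    """Raised when no attempt produced schema-valid output."""


# Truncated, illustrative subset of the intent enum.
ALLOWED_INTENTS = {"doctor_lookup", "condition_information", "unknown"}


def validate_intent_payload(raw: str) -> Dict[str, Any]:
    """Parse the model output and check the four fields; raise ValueError if invalid."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("payload is not a JSON object")
    if data.get("intent") not in ALLOWED_INTENTS:
        raise ValueError("intent not in enum")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence is not a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence out of range")
    if not isinstance(data.get("is_followup"), bool):
        raise ValueError("is_followup is not a boolean")
    if not isinstance(data.get("entities"), dict):
        raise ValueError("entities is not an object")
    return data


def call_with_validation(call_model: Callable[[], str],
                         max_attempts: int = 3) -> Dict[str, Any]:
    """Call the model until the output validates, else raise the typed error."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_intent_payload(call_model())
        except ValueError as exc:
            last_error = exc  # retry on any validation failure
    raise StructuredCallError(
        f"no valid output after {max_attempts} attempts") from last_error
```

Callers then handle a single exception type instead of guarding every `json.loads` call individually.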

Intent Routing Strategy

Intent	Retrieval Strategy	Rationale
`doctor_lookup`	HYBRID	Doctor entities exist in the graph; vector search provides fallback when specific entities are missing
`department_or_service_lookup`	HYBRID	Department and service data is in the graph, supplemented by vector search for descriptions
`condition_information`	HYBRID	Conditions map to departments via HANDLES relationships in the graph; vector search retrieves brochure content
`treatment_or_exam_information`	HYBRID	Treatments and examinations have OFFERS/PERFORMS graph relationships; vector search retrieves procedural details
`navigation_or_practical_info`	HYBRID	Practical information is primarily textual but benefits from campus/facility graph context
`booking_or_contact`	HYBRID	May need both contact entities and procedural text
`ambiguous_symptom_description`	HYBRID	Symptoms map to conditions (vector) and departments (graph) via taxonomy alias resolution
`unknown`	HYBRID	Fallback uses both sources for maximum coverage
`out_of_scope_medical_advice`	BLOCKED	Never processed -- immediate safety response
`off_topic`	BLOCKED	Queries unrelated to ZOL hospital -- immediate redirection
`other_hospital`	BLOCKED	Queries about other hospitals -- polite refusal with ZOL contact info
`vague_input`	BLOCKED	Insufficient input to process -- request for clarification

HYBRID-Everywhere Strategy

All non-blocked intents default to the HYBRID retrieval strategy, which runs vector, BM25, and taxonomy search sequentially. This design decision reflects the empirical observation that taxonomy context improves response quality even for intents that were initially classified as vector-only (e.g., condition information benefits significantly from HANDLES relationship traversal). The retrieval strategy enum retains VECTOR_ONLY and GRAPH_ONLY values for future use, but the current production configuration maps all safe intents to HYBRID.

Query Reformulation: Canonical Templates

Every query — regardless of input language — is reformulated into a well-formed Dutch search query during intent classification. This reformulation serves three purposes:

Cross-language normalization: Queries in English, Turkish, Romanian, etc. are translated to Dutch clinical language
Cache effectiveness: Consistent reformulations maximize semantic cache hit rates (see ADR-0031)
Retrieval quality: Well-formed Dutch queries match the indexed Dutch medical content more precisely

To ensure reformulations are deterministic and consistent, the intent classification prompt includes canonical templates — fixed sentence patterns per intent type:

Intent	Canonical Template
`doctor_lookup` + department	"Welke artsen werken bij de afdeling {Department} van ZOL?"
`doctor_lookup` + name	"Wie is Dr. {Name} en op welke afdeling werkt Dr. {Name} bij ZOL?"
`department_lookup`	"Wat doet de afdeling {Department} bij ZOL en welke zorg biedt deze aan?"
`condition_info`	"Wat is {Condition} en welke behandelingen biedt ZOL aan?"
`treatment_info`	"Hoe verloopt {Treatment/Examination} bij ZOL?"
`booking_contact`	"Hoe maak ik een afspraak bij de afdeling {Department} van ZOL?"
`symptom_description`	"Welke afdelingen bij ZOL behandelen {symptom}?"

These templates enforce two critical rules:

Always "ZOL" (never "Ziekenhuis Oost-Limburg") — consistent naming
Always "afdeling X" (never "X-ische artsen") — consistent structure

Why canonical templates matter

Without templates, the LLM freely varies vocabulary and structure. For example, "who are the orthopedic doctors?" was reformulated as "Wie zijn de orthopedische artsen bij Ziekenhuis Oost-Limburg?" instead of "Welke artsen werken bij de afdeling Orthopedie van ZOL?" — producing only 0.746 cosine similarity to the Dutch equivalent, defeating the semantic cache and degrading retrieval quality (4 doctors returned vs. 12). Canonical templates eliminate this variance.
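The effect of a fixed template can be shown with plain string formatting. In the pipeline the templates live in the classification prompt and the LLM applies them; the dictionary keys, slot names, and `render_canonical` function below are hypothetical and only illustrate why differently phrased inputs end up as the same string.

```python
# Template strings are copied from the table above; the keying scheme is assumed.
CANONICAL_TEMPLATES = {
    ("doctor_lookup", "department"):
        "Welke artsen werken bij de afdeling {department} van ZOL?",
    ("booking_contact", "department"):
        "Hoe maak ik een afspraak bij de afdeling {department} van ZOL?",
    ("condition_info", "condition"):
        "Wat is {condition} en welke behandelingen biedt ZOL aan?",
}


def render_canonical(intent: str, slot: str, value: str) -> str:
    """Fill the single slot of the template selected by (intent, slot)."""
    return CANONICAL_TEMPLATES[(intent, slot)].format(**{slot: value})
```

Any query that resolves to `doctor_lookup` with the department Orthopedie yields one identical sentence, so the Tier 1 hash matches regardless of the input language.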

Stage 3: Query Rewriting and Follow-Up Detection

Latency: ~0ms extra (combined with intent classification) | Model: Tier 2 (standard) | Purpose: Conversational context resolution and follow-up detection

When conversation history exists, intent classification and query rewriting are performed in a single combined LLM call via classify_and_rewrite(), eliminating the need for a separate rewriting step. Without history, only intent classification runs (the reformulation from canonical templates still applies). Follow-up queries often contain anaphoric references that are meaningless in isolation. Query rewriting resolves these references:

Turn 1: "Welke dokters zijn er bij orthopedie?" (Which doctors are in orthopedics?)
Turn 2: "Wanneer heeft hij consultatie?" (When does he have consultation?)
Rewritten: "Wanneer heeft de orthopedisch arts consultatie bij ZOL?" (When does the orthopedic doctor have consultation at ZOL?)

Enhanced History with Citation Context

The conversation history passed to the LLM rewrite prompt now includes citation filenames from previous turns, formatted as "Onderwerp" (Topic). Citation filenames are cleaned (extension removed, separators replaced with spaces) to give the model natural topic context for resolving ambiguous references:

Vraag 1: Waar kan ik parkeren bij het ziekenhuis?
Antwoord 1: U kunt parkeren op campus Sint-Jan in parkeergarage P1...
Onderwerp 1: parking campus sint jan, bezoekersinfo zol
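A sketch of how one history turn like the one above could be formatted. The function names are hypothetical, and treating hyphens and underscores as the separators is an assumption; the text only states that the extension is removed and separators become spaces.

```python
import os
import re


def clean_citation_filename(filename: str) -> str:
    """Drop the extension and turn separators (assumed: '-' and '_') into spaces."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    return re.sub(r"[-_\s]+", " ", stem).strip()


def format_turn(index: int, question: str, answer: str,
                citation_filenames: list) -> str:
    """Render one turn as Vraag / Antwoord / Onderwerp lines."""
    lines = [f"Vraag {index}: {question}", f"Antwoord {index}: {answer}"]
    topics = [clean_citation_filename(name) for name in citation_filenames]
    if topics:  # omit the topic line when the turn had no citations
        lines.append(f"Onderwerp {index}: {', '.join(topics)}")
    return "\n".join(lines)
```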

Follow-Up Detection Heuristics

Follow-up detection operates at two levels:

LLM detection (primary): When the LLM classification succeeds with confidence >= 0.1, its is_followup flag is trusted.
Short-query heuristic (fallback): When the LLM fails (returns unknown with < 0.1 confidence), queries under 6 words with existing conversation history are automatically treated as follow-ups. This catches common conversational patterns like "en de kosten?" or "welke campus?" that are meaningless without context.
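The two levels combine into a small decision function. This is a sketch with hypothetical parameter names; the 0.1 confidence cut-off and the under-6-words rule are taken from the list above, and whitespace splitting as the word count is an assumption.

```python
def detect_followup(llm_confidence: float, llm_is_followup: bool,
                    query: str, has_history: bool) -> bool:
    """Trust the LLM flag when it is confident enough, else use the length heuristic."""
    if llm_confidence >= 0.1:
        return llm_is_followup  # primary path: LLM detection
    # Fallback: short queries with existing history are treated as follow-ups.
    return bool(has_history) and len(query.split()) < 6
```

For example, "en de kosten?" has three words, so with conversation history and a failed classification it is treated as a follow-up.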

Stage 4: Metadata Filtering

Latency: < 5ms | Purpose: Narrow search scope

The classified intent maps to metadata filters that constrain the retrieval search space. For example, a condition_information intent triggers a filter for documents in the "conditions" category, preventing the retrieval of irrelevant administrative content.

Auto-Filter Bypass for Follow-Up Queries

When a query is detected as a follow-up, intent-based category filtering is automatically skipped, because follow-up conversations frequently cross category boundaries: a patient may ask about a doctor (doctor_lookup) and then ask about preparation for a specific treatment (treatment_or_exam_information). Applying the second query's category filter would exclude the doctor-related content from the first turn that may still be relevant.

The filtering priority is:

Explicit categories from the request take precedence (always applied)
Follow-up queries skip category filtering entirely
New queries apply intent-to-category mapping when auto_filter is enabled
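The priority order above can be written as a short resolver. The function name, the `None` return value for "no category filter", and the example mapping are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def resolve_category_filter(explicit_categories: Optional[List[str]],
                            is_followup: bool,
                            auto_filter: bool,
                            intent: str,
                            intent_to_categories: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return the categories to filter on, or None when no filter applies."""
    if explicit_categories:
        return list(explicit_categories)  # 1. explicit request always wins
    if is_followup:
        return None                       # 2. follow-ups skip category filtering
    if auto_filter:
        return intent_to_categories.get(intent)  # 3. intent-to-category mapping
    return None
```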

Stage 5: Sequential Hybrid Retrieval

Latency: ~800ms | Purpose: Gather evidence from multiple sources

Sequential Execution

Vector, BM25, and taxonomy search execute sequentially due to asyncpg's single-session constraint. This adds ~50ms latency but guarantees that every search completes reliably.

Three retrieval mechanisms execute sequentially, each contributing complementary results:

Vector search (pgvector): Cosine similarity against 1,536-dimensional text-embedding-3-large embeddings (ADR-0048, @openai2024embeddings; the dense bi-encoder retrieval pattern follows @karpukhin2020dpr), returning the top-k most semantically similar document chunks
BM25 keyword search (PostgreSQL tsvector): Term matching via to_tsquery('simple', ...) with OR-joined query terms (e.g., "afspraak | zol | maken") and ts_rank scoring. Uses OR logic instead of plainto_tsquery's AND logic, which works better for natural language Dutch queries. Catches exact terms (e.g., "Dr. Vanderstraeten", "cardioversie") that vector search may rank lower. See ADR-007
Taxonomy search (PostgreSQL): Entity lookups against taxonomy tables, routed by LLM-extracted entities from the intent classification call (see ADR-0030)
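For the BM25 channel, the OR-joined query string can be built before it is handed to `to_tsquery('simple', ...)` as a bound parameter. This is a sketch with a hypothetical function name; the tokenization rule (lower-cased word characters, duplicates removed, no stop-word filtering) is an assumption.

```python
import re


def build_or_tsquery(query: str) -> str:
    """Join the query's distinct lower-cased terms with ' | ' (tsquery OR)."""
    seen = set()
    terms = []
    for token in re.findall(r"\w+", query.lower()):
        if token not in seen:  # keep first occurrence, preserve order
            seen.add(token)
            terms.append(token)
    return " | ".join(terms)
```

Restricting tokens to word characters also keeps tsquery operators such as `&`, `!`, or `:` out of the generated string.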

Results are merged using Reciprocal Rank Fusion (RRF) with k=60. For each document at rank r in a result list: score = 1/(k + r + 1). Documents appearing in multiple lists (vector, BM25) have their RRF scores summed. This score-agnostic fusion consistently outperforms weighted linear combination. See ADR-0020. Graph results are prepended to the merged list in priority order (typed node results first, then vector chunks, then semantic graph results).
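The fusion formula can be sketched directly from the description above. The function name and the tie-break on document ID are assumptions; ranks are 0-based, which is why the denominator is `k + r + 1`. Prepending graph results is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Sum 1 / (k + r + 1) per document over all lists; r is the 0-based rank."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for r, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + r + 1)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With a vector ranking `["a", "b", "c"]` and a BM25 ranking `["b", "d"]`, document `b` appears in both lists and its two contributions (1/62 and 1/61) add up, which places it ahead of `a` even though `a` ranked first in the vector list.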

Stage 6: Metadata Boosting

Latency: < 5ms | Purpose: Re-rank results by relevance signals

Retrieved results are re-scored using nine metadata boost signals applied multiplicatively by _apply_metadata_boosts. These leverage enriched document metadata populated during ingestion (see Document Ingestion Pipeline).

#	Boost Signal	Multiplier	Condition	Source field
1	Category match	×1.20	Intent category matches `document.category`	`document.category`
2	Recency	×1.15	`prefer_recent=True` and `content_freshness == "time_sensitive"` and < 7 days old	`document.content_freshness`, `document.processed_at`
3	Section header match	×1.10	Query term found in `chunk_metadata.section_header`	`chunk_metadata.section_header`
4	Entity type match	×1.10	`doc_metadata.entity_types` overlaps intent categories	`doc_metadata.entity_types`
5	Campus match	×1.10	Query tokens overlap with campus name tokens in `doc_metadata.campus`	`doc_metadata.campus`
6	Conversation context	×1.25	Follow-up only: document was cited in the previous turn	`context_document_ids` (from citation history)
7	Content keyword match	×1.05 – ×1.40	Discriminating query terms (≥4 chars) appear in chunk content; per-term increment: short (4-5 chars) +5%, medium (6-7 chars) +10%, long (8+ chars) +20%; capped at +40%	`chunk.content`
8	Authority dampening	×`authority_score`	`authority_score` < 1.0 only; no boost applied when ≥ 1.0	`document.authority_score`
9	URL tier boost	×1.15 / ×1.05	tier 1 URL paths: ×1.15; tier 2: ×1.05; tier 3 (default): ×1.0	`doc_metadata.url_tier`

Final score: boosted_score = min(similarity × product_of_all_multipliers, 1.0)
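As an illustration, signal #7 and the final clamp can be sketched as below. The function names are hypothetical, and simple substring matching plus treating every query term of at least 4 characters as "discriminating" are assumptions; the per-term increments, the +40% cap, and the final `min(..., 1.0)` follow the table and formula above.

```python
import math
from typing import Iterable


def content_keyword_multiplier(query_terms: Iterable[str], chunk_content: str) -> float:
    """Multiplier between 1.0 and 1.40 based on query terms found in the chunk."""
    text = chunk_content.lower()
    bonus = 0.0
    for term in {t.lower() for t in query_terms}:
        if len(term) < 4 or term not in text:
            continue
        if len(term) <= 5:
            bonus += 0.05   # short term (4-5 chars)
        elif len(term) <= 7:
            bonus += 0.10   # medium term (6-7 chars)
        else:
            bonus += 0.20   # long term (8+ chars)
    return 1.0 + min(bonus, 0.40)


def boosted_score(similarity: float, multipliers: Iterable[float]) -> float:
    """Apply all multipliers to the similarity and cap the result at 1.0."""
    return min(similarity * math.prod(multipliers), 1.0)
```

Because the signals multiply, a chunk that matches several conditions can quickly reach the 1.0 cap, at which point further boosts no longer change the ordering among capped results.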

The conversation context boost (×1.25) is among the strongest individual signals. It is only applied when the query is detected as a follow-up (see Stage 3). Document IDs are extracted from the last turn's citations and passed to _apply_metadata_boosts via context_document_ids. This ensures that when a patient asks a follow-up like "en de kosten?" (and the costs?), the documents about parking that were cited in the previous turn are prioritized over unrelated cost-related content.

Category-Aware Authority Boosting for Navigational Queries

For navigational intents (doctor_lookup, department_or_service_lookup, booking_or_contact, location_or_directions, navigation_or_practical_info), boost signal #1 applies an additional intent-aware authority differential on top of the standard category match. This was introduced to prevent informational brochures from outranking authoritative directory pages when a user is navigating to a department or looking up a doctor.

The differential is computed inside _boost_category before the standard ×1.20 base boost is returned:

Document category	Navigational multiplier	Combined with base (×1.20)
`Department`, `Doctor`, `Contact`, `Location`, `Appointment`	×1.50 (authority up)	×1.80 effective
`Brochure`, `News`, `Miscellaneous`	×0.70 (authority down)	×0.84 effective (or ×0.70 if no base match)
All other categories	×1.00 (no change)	×1.20 if base match, ×1.00 otherwise

Non-navigational intents are unaffected — only the standard ×1.20 category match applies.
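The differential in the table can be reproduced with a small function. This is a sketch, not the body of `_boost_category`: the function name and the `base_match` flag are assumptions, while the intent list, category sets, and multipliers are copied from this section.

```python
NAVIGATIONAL_INTENTS = {
    "doctor_lookup", "department_or_service_lookup", "booking_or_contact",
    "location_or_directions", "navigation_or_practical_info",
}
AUTHORITY_UP = {"Department", "Doctor", "Contact", "Location", "Appointment"}
AUTHORITY_DOWN = {"Brochure", "News", "Miscellaneous"}


def category_multiplier(intent: str, doc_category: str, base_match: bool) -> float:
    """Base category boost (1.20 on match) times the navigational differential."""
    base = 1.20 if base_match else 1.00
    if intent not in NAVIGATIONAL_INTENTS:
        return base  # non-navigational intents: standard boost only
    if doc_category in AUTHORITY_UP:
        return base * 1.50
    if doc_category in AUTHORITY_DOWN:
        return base * 0.70
    return base
```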

Stage 5b: Value Framework Affinity Rerank

Latency: ~2 ms in-process | Code: backend/app/services/value_framework/affinity.py:apply_intent_category_affinity | Telemetry: app.category_mismatch_telemetry (migration 066)

Stage 5b runs after the metadata-boost rescore and before context assembly. Its purpose is to prevent cross-category contamination — the failure mode where chunks from a wrong-category document outrank correct-category chunks because their similarity score was high in isolation. The motivating case is the wheelchair-vs-cardiology regression: a query about wheelchair access surfaced cardiology brochures because the cardiology chunks had high vector similarity to the word "rolstoel" via incidental co-occurrence in shared paragraphs, despite the user's intent being navigation_or_practical_info (in which clinical_info chunks should be down-weighted).

Algorithm

Each retrieved chunk is classified into one of six content categories by classify_chunk_category (cached lazily on the chunk dict):

Category	Examples
`practical`	Visiting hours, parking, route descriptions, building numbers
`appointments`	Booking instructions, contact phone numbers, online forms
`general`	Neutral hospital-level descriptions, mission statements
`legal_admin`	Patient rights, complaints procedure, identity documents
`clinical_info`	Condition descriptions, treatment details, recovery guidance
`regulatory`	Privacy policy, billing, insurance, data subject rights

The intent class (set by Stage 2) keys into DEFAULT_AFFINITY to retrieve a row of category multipliers. For example, intent navigation_or_practical_info carries:

practical:     1.30   (boost)
appointments:  1.05
general:       1.00
legal_admin:   0.85
clinical_info: 0.65   (penalty)
regulatory:    0.55   (penalty)

Each chunk's score is multiplied by the matching coefficient and clamped to [0, 1]:

new_score = max(0.0, min(1.0, original_score * affinity[intent_class][chunk.category]))

Chunks are then re-sorted by the new score. Chunks with no readable score pass through to the bottom of the list unchanged. Unknown intent classes produce a neutral pass-through (multiplier 1.0 for all categories) — the rerank never makes a result list worse than its pre-rerank state.
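A compact sketch of the rerank, assuming chunks are dictionaries with `score` and `category` keys. The function name `rerank_by_affinity` is hypothetical and the affinity table contains only the one row quoted above; copying each chunk instead of mutating it is a choice made for this example.

```python
from typing import Dict, List

# Only the row quoted in the text; other intents fall back to neutral 1.0.
DEFAULT_AFFINITY: Dict[str, Dict[str, float]] = {
    "navigation_or_practical_info": {
        "practical": 1.30, "appointments": 1.05, "general": 1.00,
        "legal_admin": 0.85, "clinical_info": 0.65, "regulatory": 0.55,
    },
}


def rerank_by_affinity(chunks: List[dict], intent_class: str) -> List[dict]:
    """Scale each score by the intent x category coefficient, clamp, and re-sort."""
    row = DEFAULT_AFFINITY.get(intent_class, {})  # unknown intent: neutral
    scored, unscored = [], []
    for chunk in chunks:
        score = chunk.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            unscored.append(chunk)  # no readable score: keep at the bottom
            continue
        coefficient = row.get(chunk.get("category"), 1.0)
        new_score = max(0.0, min(1.0, score * coefficient))
        scored.append({**chunk, "score": new_score})
    scored.sort(key=lambda c: c["score"], reverse=True)  # stable sort
    return scored + unscored
```

For a `navigation_or_practical_info` query, a cardiology chunk at 0.80 drops to 0.52 while a parking chunk at 0.70 rises to 0.91, which reverses their order.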

Telemetry

After the rerank, the pipeline writes one row to app.category_mismatch_telemetry per non-cached query (backend/alembic/versions/066_category_mismatch_telemetry.py):

Column	Meaning
`intent_class`	The classified intent string
`primary_category`	The dominant category among the top-K post-rerank chunks
`mismatch_rate`	Fraction of top-K chunks whose category is OFF the intent's preferred set
`chunks_total` / `chunks_off_category`	Denominator / numerator of the rate
`query_preview`	First 200 characters of the query (for operator review)

The CategoryMismatchTrend chart on the Operations / costs tab plots mismatch_rate over time per tenant; spikes indicate retrieval steering needs retuning.

Stage 5c: Synthetic Department-Doctor-List Injection

Latency: ~5 ms (1 indexed taxonomy query when triggered) | Code: backend/app/services/rag_service.py:_qs_maybe_inject_doctor_list

Stage 5c fires only when all three of the following hold:

The classified intent is DOCTOR_LOOKUP or DEPARTMENT_OR_SERVICE_LOOKUP
The user query contains a list-signal phrase matched by _LIST_SIGNAL_RE (e.g., alle, welke artsen, wie werkt er, list all, tous les médecins)
A department or specialty hint can be resolved from either the classifier-extracted entities (classification.entities.department / .service / .doctor) or from a regex over the rewritten query

When these hold, the stage queries the taxonomy for all doctors associated with the resolved department, builds a synthetic chunk listing them, and inserts it into the retrieved-chunks set before context assembly. This guarantees the LLM has the full roster available so the system prompt's "list all members" exception rule can fire faithfully.

The stage was introduced as a regression fix for the 2026-05-09 incident in which a 6-turn voice conversation about dermatologists capped at the same 2 names — the vector retrieval surfaced individual doctor brochure pages without the shared department roster. See the source comment block at _qs_maybe_inject_doctor_list for the original conversation reference.

When any of the three conditions is not met (e.g., the query is "Wie is Dr. X?", a single-doctor lookup), Stage 5c is a no-op and adds zero latency.
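The three-part trigger can be sketched as a single predicate. The regular expression below is an illustrative stand-in built from the example phrases, not the actual `_LIST_SIGNAL_RE`, and the function and parameter names are hypothetical.

```python
import re
from typing import Optional

# Illustrative pattern only; the production regex is not reproduced here.
LIST_SIGNAL_RE = re.compile(
    r"\b(?:alle|welke artsen|wie werkt er|list all|tous les médecins)\b",
    re.IGNORECASE,
)
LIST_INTENTS = {"doctor_lookup", "department_or_service_lookup"}


def should_inject_doctor_list(intent: str, query: str,
                              department_hint: Optional[str]) -> bool:
    """True only when intent, list-signal phrase, and department hint all hold."""
    return (
        intent in LIST_INTENTS
        and LIST_SIGNAL_RE.search(query) is not None
        and bool(department_hint)
    )
```

Since the predicate only inspects values already in memory, the taxonomy query is issued solely for requests that pass all three checks.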

Stage 6b: Context Assembly and Citation Building

Latency: ~50 ms | Purpose: Expand retrieved snippets into coherent text blocks and build aligned citations

The key insight from serious RAG architecture: "What you retrieve is not what the model reads." Retrieved chunks are 350-token fragments. The context assembly service transforms them into coherent reading material for the LLM. The 8,000-token budget and chunk-ordering choices are informed by Liu et al. 2024 — Lost in the Middle, which empirically demonstrated that LLMs under-attend to mid-context tokens; we keep the most-relevant document at the top, drop low-relevance trailing blocks first, and avoid placing critical evidence in the middle of a long context. See Context Assembly for full details.

Expand: For each retrieved chunk, fetch ±1 adjacent chunks from the same document (single batched DB query per document)
Deduplicate: Strip the ~70-token overlap between consecutive chunks to avoid redundant text
Group: Merge chunks from the same document into coherent blocks, ordered by chunk_index
Budget: Cap total context at 8,000 tokens, dropping lowest-relevance document blocks first (@liu2024lostinmiddle)
Build citations: Generate Citation objects from the post-assembly chunk order
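The Group and Budget steps can be sketched as follows. The chunk fields (`document_id`, `chunk_index`, `score`, `tokens`), the function names, and the use of the best chunk score as a block's relevance are assumptions; expansion, overlap stripping, and citation building are not shown.

```python
from collections import defaultdict
from typing import List


def group_by_document(chunks: List[dict]) -> List[dict]:
    """Merge chunks per document (ordered by chunk_index), most relevant block first."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["document_id"]].append(chunk)
    blocks = []
    for document_id, doc_chunks in grouped.items():
        doc_chunks.sort(key=lambda c: c["chunk_index"])
        blocks.append({
            "document_id": document_id,
            "chunks": doc_chunks,
            "relevance": max(c["score"] for c in doc_chunks),  # assumed rule
            "tokens": sum(c["tokens"] for c in doc_chunks),
        })
    blocks.sort(key=lambda b: b["relevance"], reverse=True)
    return blocks


def apply_token_budget(blocks: List[dict], max_tokens: int = 8000) -> List[dict]:
    """Drop the lowest-relevance blocks until the total fits the budget.

    Expects blocks sorted by relevance, highest first.
    """
    kept = list(blocks)
    total = sum(b["tokens"] for b in kept)
    while kept and total > max_tokens:
        total -= kept.pop()["tokens"]  # remove the trailing, least relevant block
    return kept
```

Dropping whole document blocks from the tail keeps the most relevant document at the top of the context and avoids leaving half of a document behind.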

Citation Ordering Fix

Citations are built after context assembly, not before. This is critical because context assembly reorders and groups chunks by document. If citations were built from the pre-assembly chunk order, the [1], [2] numbering in the LLM's response would not match the citation list shown to the user. The pipeline enforces the invariant: assemble_context() first, then _build_citations_from_chunks().

Stage 7: Response Generation

Latency: ~3,000ms | Model: Tier 2 via direct OpenAI (both standard and full mode) | Purpose: Synthesize a grounded response

The response generator receives the ranked retrieval results as context and generates a natural language response. The generation prompt enforces strict grounding: every claim must be traceable to a retrieved source. Responses are streamed via WebSocket, providing progressive rendering in the chat interface. Full mode uses the same Tier 2 model as standard mode, but via direct OpenAI API for lowest latency, with always-on reranking and higher max_tokens. See LLM Stack for details.

Stage 8: Hybrid Quality Evaluation

Latency: ~600ms blocking + async background | Purpose: Validate response quality

Quality evaluation operates in two phases:

Fast Quality Gate (~600 ms, blocking): Computes a weighted-average embedding cosine similarity: 0.7 * answer_context_similarity + 0.3 * question_answer_similarity. This blended semantic_similarity score must exceed 50% for the response to pass (QUALITY_THRESHOLD = 0.50 in backend/app/services/evaluation_service.py; raised from 0.40 to 0.50 on 2026-05-10 per Wave 2.C.1 empirical recalibration of the threshold).
Background Analytics (40-60s, non-blocking): Fires asynchronously after the response is delivered. Uses DeepEval metrics (Faithfulness, Answer Relevancy) for comprehensive quality monitoring. Results feed into Prometheus metrics for long-term tracking.
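The fast gate reduces to a weighted average and a comparison. The weights and threshold are those stated above; the function names are hypothetical, and the two similarity inputs are assumed to be precomputed cosine similarities in [0, 1].

```python
QUALITY_THRESHOLD = 0.50  # value stated in the text


def semantic_similarity(answer_context_similarity: float,
                        question_answer_similarity: float) -> float:
    """Blend the two cosine similarities with weights 0.7 and 0.3."""
    return 0.7 * answer_context_similarity + 0.3 * question_answer_similarity


def passes_quality_gate(answer_context_similarity: float,
                        question_answer_similarity: float) -> bool:
    """The blended score must exceed the threshold for the response to pass."""
    blended = semantic_similarity(answer_context_similarity, question_answer_similarity)
    return blended > QUALITY_THRESHOLD
```

For example, similarities of 0.6 (answer vs. context) and 0.4 (question vs. answer) blend to 0.54 and pass, while 0.5 and 0.4 blend to 0.47 and fail. The 0.7 weight means grounding in the retrieved context dominates the decision.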

Timing Breakdown

The Gantt below shows a representative query timing — it is not measured p50 / p95. Stage durations are estimates derived from the dominant cost on each stage; verifiable per-stage measurements live in app.pipeline_telemetry.duration_ms and are surfaced via GET /api/v1/admin/feedback/telemetry-stats (see Feedback Dashboard Metrics). Wave 2.D leaves the per-stage p50 / p95 numbers as not yet measured rather than fabricating values; the Gantt is a working estimate, not a contract.

Performance Insight

The total blocking latency is approximately 5.5 seconds at this representative breakdown. Intent classification and query rewriting are combined into a single LLM call (~400 ms). Vector, BM25, and taxonomy search execute sequentially (due to asyncpg's single-session constraint), adding ~800 ms total retrieval time. The Value Framework affinity rerank (Stage 5b) and the conditional doctor-list injection (Stage 5c) add ~2 ms and ≤5 ms respectively. Context assembly adds ~65 ms for chunk expansion and deduplication. The most expensive component remains response generation (~3 s), which uses streaming to mitigate perceived latency. Background analytics (40–60 s) run entirely asynchronously and never impact user experience. Real measurements at p50 / p95 are pending — see the System Overview's End-to-End Latency Budgets table for the verification path.

Follow-Up Suggestion Generation

Latency: ~200ms (async, non-blocking) | Model: Tier 1 (fast) | Purpose: Guide the user toward relevant next questions

After the response is streamed, the pipeline generates 3 contextual follow-up questions in Dutch that a patient or visitor would logically ask next. These appear as clickable suggestion pills below the response, reducing cognitive load and encouraging deeper exploration of the hospital's services.

Why a Separate Lightweight Model?

Follow-up suggestions are a UX convenience feature, not a safety-critical pipeline stage. Using a dedicated lightweight Tier 1 (fast) model instead of the main response model pays off on four criteria:

Criterion	Main Model (Tier 2 / Tier 3)	Follow-Up Model (Tier 1)
Cost	$0.40-2.00 / M input tokens	$0.10 / M input tokens
Latency	1-5 seconds	~200ms
Output format	Free-form text + reasoning	Strict JSON array (3 strings)
Failure impact	Critical (no response)	Graceful (empty suggestions)

The follow-up model is configured separately via rag_followup_model in the admin settings, defaulting to the Tier 1 (fast) model. This prevents reasoning models — which produce internal thinking tokens that break JSON parsing — from being used for structured output tasks.

Fault Tolerance

Follow-up generation is wrapped in a try/except — if the model returns unparseable JSON or the call fails entirely, the pipeline returns an empty suggestion list. The response is never delayed or degraded by suggestion failures.
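A sketch of the graceful-degradation parsing, assuming the model is asked for a JSON array of exactly three strings. The function name is hypothetical, and rejecting arrays of any other length is an assumption about how strictly the format is enforced.

```python
import json
from typing import List


def parse_followup_suggestions(raw_model_output: str) -> List[str]:
    """Return the three suggestions, or an empty list on any malformed output."""
    try:
        data = json.loads(raw_model_output)
    except (TypeError, ValueError):  # includes json.JSONDecodeError
        return []
    if (isinstance(data, list) and len(data) == 3
            and all(isinstance(item, str) for item in data)):
        return data
    return []
```

Returning an empty list instead of raising means the caller can render the response without suggestion pills and never needs its own error branch.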

Error Handling

Each pipeline stage includes fail-safe behavior:

Cache failure: Proceed without cache (log warning)
Intent classification failure: Default to HYBRID retrieval strategy
Retrieval failure: Return partial results from whichever source succeeded
Generation failure: Return a polite error message with general navigation links
Evaluation failure: Deliver the response but flag for manual review

Real-Time Pipeline Visibility

Every pipeline stage emits structured progress data via WebSocket, enabling the frontend to display real-time technical information during query execution. This is critical for academic evaluation, where every architectural decision must be demonstrable.

Progress Emissions

Each stage emits a StreamChunk with type="progress" containing a PipelineProgress object:

Stage	Emitted Data
UNDERSTANDING	Intent type, confidence score, retrieval strategy, query rewriting details
SEARCHING	Document count, search method (hybrid/vector-only), BM25 match count, canonical question matches
ANALYZING	Chunks retrieved vs. chunks after context expansion
GENERATING	Progress percentage
EVALUATING	Faithfulness %, relevancy %, overall score with A-F grade, quality gate pass/fail

Frontend Debug Panel

The admin interface renders these emissions as color-coded badges in the chat interface. For full documentation of the debug panel capabilities, including document detail view metadata and analytics reset, see the dedicated Frontend Debug Panel page.

Example: What an Examiner Sees

During a live query, the debug panel progressively shows:

Understanding: Medische vraag 92% HYBRID hybrid_bm25_vector
Searching: 8 documents BM25: 3 CQ: 5
Analyzing: 8 → 12 chunks (context expansion)
Generating: Progress bar filling
Evaluating: Score: 78% (B) Faithfulness: 82% Relevancy: 74%
