Query Processing Pipeline
The query processing pipeline is the central nervous system of the ZOL Intelligent Search. Following the retrieve-then-generate paradigm of Lewis et al. 2020, every user query traverses a carefully orchestrated sequence of stages, each designed to progressively refine understanding, retrieve relevant information, and generate a grounded, safe response.
This page is the canonical end-to-end pipeline map (cache check → intent → retrieval → assembly → generation → quality gate). For a retrieval-focused companion view that numbers only the retrieval stages and goes deeper on the three channels and fusion, see Context Retrieval Architecture, which carries an explicit stage-number mapping back to this page.
Pipeline-Level Trade-offs
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Retrieval shape | Hybrid (dense pgvector + sparse BM25 + taxonomy) | Vector-only (@karpukhin2020dpr); BM25-only; ColBERT late-interaction reranker only @khattab2020colbert | Vector-only loses recall on Dutch medical terms ("cardioversie", "Dr. Vanderstraeten") that BM25 catches by exact match; BM25-only loses recall on cross-language and synonym queries that vectors capture; ColBERT-only requires re-encoding the entire corpus at every model change, which is operationally untenable. ColBERT survives as a feature-flagged reranker (ADR-0039), not the primary path. |
| Result fusion | Reciprocal Rank Fusion (RRF) with k=60 | Weighted linear combination of similarity scores; learned-to-rank | Score scales differ across vector cosine, BM25 ts_rank, and graph priority; weighted linear fusion required hand-tuned coefficients that drifted with each embedding swap. RRF is score-agnostic (sums 1/(k + r + 1) over rank-only) and stable across model migrations. See ADR-0020. |
| Intent classification | LLM (Tier 2 via the structured_call helper) | Regex / keyword classifier; small fine-tuned BERT | Regex misses novel phrasings, especially in Turkish, Arabic, or Romanian patient input; a fine-tuned BERT classifier requires labelled corpora we don't have at pilot scale. The structured_call helper provides JSON-schema-validated output with retries and a typed StructuredCallError fallback, replacing the previous json.loads-and-pray pattern. |
| Cross-category contamination prevention | Value Framework affinity rerank (Stage 5b) | Per-intent hard category filter; cross-encoder rerank only | A hard filter excludes correct chunks when the user's intent class is misclassified at the boundary (e.g., condition_information vs treatment_or_exam_information). A cross-encoder rerank is cheap to add but does not address the wheelchair-vs-cardiology category leak observed in the wheelchair regression — the cross-encoder ranks within retrieved chunks but cannot rebalance categories. The affinity rerank multiplies score by intent × category coefficients, providing a graded penalty that recovers when intent confidence is low. |
Pipeline Overview
Pipeline Stages in Detail
Stage 1: Semantic Cache Check (PostgreSQL + pgvector)
Latency: 1–30ms | Purpose: Avoid redundant LLM calls | ADR: ADR-0031
Before invoking any AI model, the pipeline checks the two-tier semantic query cache stored in PostgreSQL (not Redis). The cache operates on LLM-reformulated queries — the intent classifier normalizes language, spelling, and word order before caching, which maximizes hit rates across languages.
| Tier | Mechanism | Latency | What It Catches |
|---|---|---|---|
| Tier 1 | SHA-256 hash of reformulated query text | ~1ms | Identical reformulations — including cross-language queries that normalize to the same Dutch form |
| Tier 2 | Cosine similarity via HNSW index on 1,536-dim text-embedding-3-large embeddings (ADR-0048, @openai2024embeddings; @pgvector_docs) | ~30ms | Semantically equivalent queries with different wording (e.g., "Welke artsen werken op orthopedie?" vs "Welke dokters zijn er op de afdeling orthopedie?") |
On a cache hit, the stored response text and citations are returned immediately, skipping all downstream stages (intent classification, retrieval, LLM generation, quality evaluation). Entries expire after 1 hour (configurable via admin Settings UI). See Storage Architecture for schema details.
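A minimal sketch of how a Tier 1 key could be derived from the reformulated query. The whitespace and case normalization and the function name `tier1_cache_key` are assumptions made for illustration; they are not taken from the codebase.

```python
import hashlib


def tier1_cache_key(reformulated_query: str) -> str:
    """Return a SHA-256 hex digest usable as an exact-match cache key."""
    # Assumption: collapse whitespace and lowercase so trivial variations collide.
    normalized = " ".join(reformulated_query.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because the key is computed on the reformulated text rather than the raw input, two queries in different languages that normalize to the same Dutch sentence would share one key.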
Tier 2 Safety Guards
Tier 2 employs two protective mechanisms against cross-topic cache contamination — a phenomenon where structurally similar queries with different medical entities return incorrect cached responses:
1. Intent Filter: Tier 2 only considers cache entries with the same intent classification as the incoming query. This prevents, for example, a doctor lookup response from being returned for a condition query. The SQL filter AND (:intent IS NULL OR intent IS NULL OR intent = :intent) is applied before the cosine similarity ranking. Note that the predicate is NULL-tolerant: an entry still qualifies when either the incoming query or the cached row has no intent recorded.
2. Jaccard Keyword Guard: Even within the same intent, embedding similarity alone is insufficient. Queries like "Welke cardiologen werken bij ZOL?" and "Wie zijn de orthopedisten bij ZOL?" produce cosine similarity >0.97 because their sentence structure is nearly identical and only the entity name differs. The guard computes the Jaccard similarity coefficient (|A ∩ B| / |A ∪ B|) over the content words of both queries. Content words are the tokens of at least 4 characters that remain after removing Dutch stop words and generic medical terms ("artsen", "werken", "afdeling", etc.). If the Jaccard similarity falls below 0.5, the cache hit is rejected.
| New Query | Cached Query | Content Words | Jaccard | Verdict |
|---|---|---|---|---|
| "Welke orthopedisten bij ZOL?" | "Welke cardiologen bij ZOL?" | {orthopedie} vs {cardiologie} | 0.00 | Rejected |
| "Hoeveel kost parkeren?" | "Wat kost het parkeren bij ZOL?" | {parkeren, kost} vs {parkeren, kost} | 1.00 | Accepted |
| "borst onderzoek ZOL" | "borstonderzoek bij ZOL" | {borstonderzoek} vs {borstonderzoek} | 1.00 | Accepted |
This two-layer defense (intent + Jaccard) is designed to keep Tier 2 cache hits both semantically close and topically aligned, blocking cross-topic contamination without sacrificing cache effectiveness for legitimate reformulations.
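The Jaccard Keyword Guard can be sketched as follows. The stop list is a hypothetical stand-in for the production list of Dutch stop words and generic medical terms, the helper names are illustrative, and treating two empty content-word sets as a rejection is an assumption.

```python
import re

# Hypothetical stop list; the production list of Dutch stop words and
# generic medical terms is not reproduced here.
STOP_WORDS = {"welke", "zijn", "hoeveel", "artsen", "werken", "afdeling"}
JACCARD_THRESHOLD = 0.5


def content_words(query: str) -> set[str]:
    """Keep lowercase tokens of at least 4 characters that are not stop words."""
    tokens = re.findall(r"\w+", query.lower())
    return {t for t in tokens if len(t) >= 4 and t not in STOP_WORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def accept_cache_hit(new_query: str, cached_query: str) -> bool:
    """Reject the Tier 2 hit when content-word overlap falls below the threshold."""
    overlap = jaccard(content_words(new_query), content_words(cached_query))
    return overlap >= JACCARD_THRESHOLD
```

With this stop list, the cardiology and orthopedics queries share no content words and the hit is rejected, while the two parking queries reduce to the same set and the hit is accepted.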
Stage 2: Intent Classification
Latency: ~400 ms (combined classify-and-rewrite call when conversation history exists; not yet measured at p95 in pilot) | Model: Tier 2 (standard) via the structured_call helper | Purpose: Route, guard, and detect follow-ups
Intent classification is the most consequential stage in the pipeline. It determines both the retrieval strategy (which data sources to query) and whether the query is safe to process at all. The classifier also detects whether the query is a follow-up to a previous conversation turn, which affects downstream filtering and boosting behavior. The classifier maps each query to one of twelve intent types.
Implementation note — structured_call helper: backend/app/services/intent_classification_service.py invokes the LLM through structured_call(output_model=IntentClassificationOutput) (from app.llm.structured). The helper enforces JSON-schema validation on the response (intent enum, entities, confidence, is_followup), retries on validation failure, and raises a typed StructuredCallError only when the model still cannot satisfy the schema after retries. This ~190-LOC helper replaced Pydantic AI on 2026-05-12 (commit b8d8da67) after production telemetry showed Agent.run() added ~720 ms per call without buying any capability structured_call lacks; it preserves the validation contract at zero latency tax. See the Decision-Cost Rubric case study for the post-mortem.
Intent Routing Strategy
| Intent | Retrieval Strategy | Rationale |
|---|---|---|
| doctor_lookup | HYBRID | Doctor entities exist in the graph; vector search provides fallback when specific entities are missing |
| department_or_service_lookup | HYBRID | Department and service data is in the graph, supplemented by vector search for descriptions |
| condition_information | HYBRID | Conditions map to departments via HANDLES relationships in the graph; vector search retrieves brochure content |
| treatment_or_exam_information | HYBRID | Treatments and examinations have OFFERS/PERFORMS graph relationships; vector search retrieves procedural details |
| navigation_or_practical_info | HYBRID | Practical information is primarily textual but benefits from campus/facility graph context |
| booking_or_contact | HYBRID | May need both contact entities and procedural text |
| ambiguous_symptom_description | HYBRID | Symptoms map to conditions (vector) and departments (graph) via taxonomy alias resolution |
| unknown | HYBRID | Fallback uses both sources for maximum coverage |
| out_of_scope_medical_advice | BLOCKED | Never processed -- immediate safety response |
| off_topic | BLOCKED | Queries unrelated to ZOL hospital -- immediate redirection |
| other_hospital | BLOCKED | Queries about other hospitals -- polite refusal with ZOL contact info |
| vague_input | BLOCKED | Insufficient input to process -- request for clarification |
All non-blocked intents default to the HYBRID retrieval strategy, which runs vector, BM25, and taxonomy search sequentially. This design decision reflects the empirical observation that taxonomy context improves response quality even for intents that were initially routed as vector-only (e.g., condition information benefits significantly from HANDLES relationship traversal). The retrieval strategy enum retains VECTOR_ONLY and GRAPH_ONLY values for future use, but the current production configuration maps all safe intents to HYBRID.
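The routing table reduces to a small lookup. The sketch below is illustrative: modelling BLOCKED as a member of the strategy enum, the enum string values, and the function name `strategy_for` are assumptions, while the intent names and the HYBRID default come from the table.

```python
from enum import Enum


class RetrievalStrategy(Enum):
    HYBRID = "hybrid"
    VECTOR_ONLY = "vector_only"  # retained for future use
    GRAPH_ONLY = "graph_only"    # retained for future use
    BLOCKED = "blocked"          # assumption: modelled here as a strategy value


BLOCKED_INTENTS = {
    "out_of_scope_medical_advice",
    "off_topic",
    "other_hospital",
    "vague_input",
}


def strategy_for(intent: str) -> RetrievalStrategy:
    """Blocked intents never reach retrieval; everything else maps to HYBRID."""
    if intent in BLOCKED_INTENTS:
        return RetrievalStrategy.BLOCKED
    return RetrievalStrategy.HYBRID
```

Keeping the blocked set explicit means a new safe intent gets HYBRID by default, whereas a new blocked intent has to be added deliberately.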
Query Reformulation: Canonical Templates
Every query — regardless of input language — is reformulated into a well-formed Dutch search query during intent classification. This reformulation serves three purposes:
- Cross-language normalization: Queries in English, Turkish, Romanian, etc. are translated to Dutch clinical language
- Cache effectiveness: Consistent reformulations maximize semantic cache hit rates (see ADR-0031)
- Retrieval quality: Well-formed Dutch queries match the indexed Dutch medical content more precisely
To ensure reformulations are deterministic and consistent, the intent classification prompt includes canonical templates — fixed sentence patterns per intent type:
| Intent | Canonical Template |
|---|---|
| doctor_lookup + department | "Welke artsen werken bij de afdeling {Department} van ZOL?" |
| doctor_lookup + name | "Wie is Dr. {Name} en op welke afdeling werkt Dr. {Name} bij ZOL?" |
| department_lookup | "Wat doet de afdeling {Department} bij ZOL en welke zorg biedt deze aan?" |
| condition_info | "Wat is {Condition} en welke behandelingen biedt ZOL aan?" |
| treatment_info | "Hoe verloopt {Treatment/Examination} bij ZOL?" |
| booking_contact | "Hoe maak ik een afspraak bij de afdeling {Department} van ZOL?" |
| symptom_description | "Welke afdelingen bij ZOL behandelen {symptom}?" |
These templates enforce two critical rules:
- Always "ZOL" (never "Ziekenhuis Oost-Limburg") — consistent naming
- Always "afdeling X" (never "X-ische artsen") — consistent structure
Without templates, the LLM freely varies vocabulary and structure. For example, "who are the orthopedic doctors?" was reformulated as "Wie zijn de orthopedische artsen bij Ziekenhuis Oost-Limburg?" instead of "Welke artsen werken bij de afdeling Orthopedie van ZOL?" — producing only 0.746 cosine similarity to the Dutch equivalent, defeating the semantic cache and degrading retrieval quality (4 doctors returned vs. 12). Canonical templates eliminate this variance.
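In the pipeline the templates live in the classification prompt and the LLM fills them in. The sketch below only illustrates why slot-based templates produce identical strings for equivalent inputs. The dictionary keys and the `render_template` helper are hypothetical, and only three of the templates are reproduced.

```python
# Template strings copied from the canonical template table; the dictionary
# keys are hypothetical labels chosen for this sketch.
CANONICAL_TEMPLATES = {
    "doctor_lookup_department": "Welke artsen werken bij de afdeling {department} van ZOL?",
    "booking_contact": "Hoe maak ik een afspraak bij de afdeling {department} van ZOL?",
    "condition_info": "Wat is {condition} en welke behandelingen biedt ZOL aan?",
}


def render_template(key: str, **slots: str) -> str:
    """Fill a canonical template; raises KeyError on an unknown key or missing slot."""
    return CANONICAL_TEMPLATES[key].format(**slots)
```

Once the only free variable is the slot value, "who are the orthopedic doctors?" and its Dutch equivalent both resolve to the same sentence, which is what the Tier 1 hash and the Tier 2 embedding comparison rely on.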
Stage 3: Query Rewriting and Follow-Up Detection
Latency: ~0ms extra (combined with intent classification) | Model: Tier 2 (standard) | Purpose: Conversational context resolution and follow-up detection
When conversation history exists, intent classification and query rewriting are performed in a single combined LLM call via classify_and_rewrite(), eliminating the need for a separate rewriting step. Without history, only intent classification runs (the reformulation from canonical templates still applies). Follow-up queries often contain anaphoric references that are meaningless in isolation. Query rewriting resolves these references:
- Turn 1: "Welke dokters zijn er bij orthopedie?" (Which doctors are in orthopedics?)
- Turn 2: "Wanneer heeft hij consultatie?" (When does he have consultation?)
- Rewritten: "Wanneer heeft de orthopedisch arts consultatie bij ZOL?" (When does the orthopedic doctor have consultation at ZOL?)
Enhanced History with Citation Context
The conversation history passed to the LLM rewrite prompt now includes citation filenames from previous turns, formatted as "Onderwerp" (Topic). Citation filenames are cleaned (extension removed, separators replaced with spaces) to give the model natural topic context for resolving ambiguous references:
Vraag 1: Waar kan ik parkeren bij het ziekenhuis?
Antwoord 1: U kunt parkeren op campus Sint-Jan in parkeergarage P1...
Onderwerp 1: parking campus sint jan, bezoekersinfo zol
Follow-Up Detection Heuristics
Follow-up detection operates at two levels:
- LLM detection (primary): When the LLM classification succeeds with confidence >= 0.1, its is_followup flag is trusted.
- Short-query heuristic (fallback): When the LLM fails (returns unknown with < 0.1 confidence), queries under 6 words with existing conversation history are automatically treated as follow-ups. This catches common conversational patterns like "en de kosten?" or "welke campus?" that are meaningless without context.
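The two detection levels can be combined in one function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and words are counted by whitespace splitting.

```python
def detect_followup(
    intent: str,
    confidence: float,
    llm_is_followup: bool,
    query: str,
    has_history: bool,
) -> bool:
    """Combine the LLM flag with the short-query fallback heuristic."""
    # Primary path: trust the classifier when it produced a usable result.
    if confidence >= 0.1:
        return llm_is_followup
    # Fallback path: the classifier failed (unknown, confidence below 0.1).
    if intent == "unknown" and has_history:
        return len(query.split()) < 6
    return False
```

The fallback deliberately requires existing history, so a short first message such as "welke campus?" is not misread as a follow-up.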
Stage 4: Metadata Filtering
Latency: < 5ms | Purpose: Narrow search scope
The classified intent maps to metadata filters that constrain the retrieval search space. For example, a condition_information intent triggers a filter for documents in the "conditions" category, preventing the retrieval of irrelevant administrative content.
Auto-Filter Bypass for Follow-Up Queries
When a query is detected as a follow-up, intent-based category filtering is automatically skipped. This is because follow-up conversations frequently cross category boundaries: a patient may ask about a doctor (doctor_lookup) and then ask about preparation for a specific treatment (treatment_or_exam_information). Applying the second query's category filter would exclude the doctor-related content from the first turn that may still be relevant.
The filtering priority is:
- Explicit categories from the request take precedence (always applied)
- Follow-up queries skip category filtering entirely
- New queries apply intent-to-category mapping when auto_filter is enabled
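The priority order above can be expressed as a short decision function. The function name, its parameters, and the contents of the mapping are assumptions for illustration; only the "conditions" category for condition queries is taken from this page.

```python
from typing import Optional

# Hypothetical intent-to-category mapping; only the "conditions" entry is
# taken from the text.
INTENT_CATEGORIES = {"condition_information": ["conditions"]}


def resolve_category_filter(
    explicit: Optional[list[str]],
    is_followup: bool,
    auto_filter: bool,
    intent: str,
) -> Optional[list[str]]:
    """Return the categories to filter on, or None for no category filter."""
    if explicit:      # 1. explicit request categories always win
        return explicit
    if is_followup:   # 2. follow-ups skip category filtering
        return None
    if auto_filter:   # 3. new queries use the intent mapping
        return INTENT_CATEGORIES.get(intent)
    return None
```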
Stage 5: Sequential Hybrid Retrieval
Latency: ~800ms | Purpose: Gather evidence from multiple sources
Vector, BM25, and taxonomy search execute sequentially due to asyncpg's single-session constraint. This adds ~50ms latency but guarantees that each search completes reliably.
Three retrieval mechanisms execute sequentially, each contributing complementary results:
- Vector search (pgvector): Cosine similarity against 1,536-dimensional text-embedding-3-large embeddings (ADR-0048, @openai2024embeddings; the dense bi-encoder retrieval pattern follows @karpukhin2020dpr), returning the top-k most semantically similar document chunks
- BM25 keyword search (PostgreSQL tsvector): Term matching via to_tsquery('simple', ...) with OR-joined query terms (e.g., "afspraak | zol | maken") and ts_rank scoring. Uses OR logic instead of plainto_tsquery's AND logic, which works better for natural language Dutch queries. Catches exact terms (e.g., "Dr. Vanderstraeten", "cardioversie") that vector search may rank lower. See ADR-007
- Taxonomy search (PostgreSQL): Entity lookups against taxonomy tables, routed by LLM-extracted entities from the intent classification call (see ADR-0030)
Results are merged using Reciprocal Rank Fusion (RRF) with k=60. For each document at rank r in a result list: score = 1/(k + r + 1). Documents appearing in multiple lists (vector, BM25) have their RRF scores summed. Because the fusion uses ranks only, it needs no per-channel score calibration and stays stable across embedding model migrations, which is where the weighted linear combination fell short. See ADR-0020. Graph results are prepended to the merged list in priority order (typed node results first, then vector chunks, then semantic graph results).
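A minimal sketch of the fusion step, assuming r is a zero-based rank (so the top hit of each list contributes 1/(k + 1)) and that each channel returns an ordered list of document IDs. The function name and the tie-breaking rule are illustrative.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Assumption: r is zero-based, so the top hit contributes 1 / (k + 1).
        for r, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + r + 1)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A document that appears in both the vector and the BM25 list accumulates two terms and therefore outranks a document that tops only one list, without either channel's raw score ever being compared.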
Stage 6: Metadata Boosting
Latency: < 5ms | Purpose: Re-rank results by relevance signals
Retrieved results are re-scored using nine metadata boost signals applied multiplicatively by _apply_metadata_boosts. These leverage enriched document metadata populated during ingestion (see Document Ingestion Pipeline).
| # | Boost Signal | Multiplier | Condition | Source field |
|---|---|---|---|---|
| 1 | Category match | ×1.20 | Intent category matches document.category | document.category |
| 2 | Recency | ×1.15 | prefer_recent=True and content_freshness == "time_sensitive" and < 7 days old | document.content_freshness, document.processed_at |
| 3 | Section header match | ×1.10 | Query term found in chunk_metadata.section_header | chunk_metadata.section_header |
| 4 | Entity type match | ×1.10 | doc_metadata.entity_types overlaps intent categories | doc_metadata.entity_types |
| 5 | Campus match | ×1.10 | Query tokens overlap with campus name tokens in doc_metadata.campus | doc_metadata.campus |
| 6 | Conversation context | ×1.25 | Follow-up only: document was cited in the previous turn | context_document_ids (from citation history) |
| 7 | Content keyword match | ×1.05 – ×1.40 | Discriminating query terms (≥4 chars) appear in chunk content; per-term increment: short (4-5 chars) +5%, medium (6-7 chars) +10%, long (8+ chars) +20%; capped at +40% | chunk.content |
| 8 | Authority dampening | ×authority_score | authority_score < 1.0 only; no boost applied when ≥ 1.0 | document.authority_score |
| 9 | URL tier boost | ×1.15 / ×1.05 | tier 1 URL paths: ×1.15; tier 2: ×1.05; tier 3 (default): ×1.0 | doc_metadata.url_tier |
Final score: boosted_score = min(similarity × product_of_all_multipliers, 1.0)
The conversation context boost is the largest fixed multiplier in the table (×1.25); only the variable content keyword match (up to ×1.40) can exceed it. It is only applied when the query is detected as a follow-up (see Stage 3). Document IDs are extracted from the last turn's citations and passed to _apply_metadata_boosts via context_document_ids. This ensures that when a patient asks a follow-up like "en de kosten?" (and the costs?), the documents about parking that were cited in the previous turn are prioritized over unrelated cost-related content.
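The multiplicative scoring and the graded keyword boost (signal #7) can be sketched as below. The function names are hypothetical, substring matching against the chunk text is an assumption, and the filtering that decides which query terms count as "discriminating" is left out.

```python
import math


def keyword_multiplier(query_terms: list[str], content: str) -> float:
    """Signal #7: per-term increments by term length, capped at +40%."""
    text = content.lower()
    bonus = 0.0
    for term in {t.lower() for t in query_terms}:
        if len(term) < 4 or term not in text:
            continue
        if len(term) <= 5:
            bonus += 0.05   # short term (4-5 chars)
        elif len(term) <= 7:
            bonus += 0.10   # medium term (6-7 chars)
        else:
            bonus += 0.20   # long term (8+ chars)
    return 1.0 + min(bonus, 0.40)


def boosted_score(similarity: float, multipliers: list[float]) -> float:
    """Multiply all boost signals into the similarity and cap the result at 1.0."""
    return min(similarity * math.prod(multipliers), 1.0)
```

For example, a chunk with similarity 0.5 that earns the category match (×1.20) and the conversation context boost (×1.25) ends at 0.75, while a chunk starting at 0.9 with the same boosts is capped at 1.0.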
Category-Aware Authority Boosting for Navigational Queries
For navigational intents (doctor_lookup, department_or_service_lookup, booking_or_contact, location_or_directions, navigation_or_practical_info), boost signal #1 applies an additional intent-aware authority differential on top of the standard category match. This was introduced to prevent informational brochures from outranking authoritative directory pages when a user is navigating to a department or looking up a doctor.
The differential is computed inside _boost_category before the standard ×1.20 base boost is returned:
| Document category | Navigational multiplier | Combined with base (×1.20) |
|---|---|---|
| Department, Doctor, Contact, Location, Appointment | ×1.50 (authority up) | ×1.80 effective |
| Brochure, News, Miscellaneous | ×0.70 (authority down) | ×0.84 effective (or ×0.70 if no base match) |
| All other categories | ×1.00 (no change) | ×1.20 if base match, ×1.00 otherwise |
Non-navigational intents are unaffected — only the standard ×1.20 category match applies.
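The differential can be sketched as a function that returns the effective multiplier for signal #1. This is not the body of _boost_category; the function name is hypothetical, and the behavior for authority-up categories without a base match (×1.50 on its own) is an assumption, since the table only lists the combined ×1.80 case.

```python
NAVIGATIONAL_INTENTS = {
    "doctor_lookup", "department_or_service_lookup", "booking_or_contact",
    "location_or_directions", "navigation_or_practical_info",
}
AUTHORITY_UP = {"Department", "Doctor", "Contact", "Location", "Appointment"}
AUTHORITY_DOWN = {"Brochure", "News", "Miscellaneous"}


def category_multiplier(intent: str, doc_category: str, base_match: bool) -> float:
    """Base category boost (x1.20 on match) times the navigational differential."""
    base = 1.20 if base_match else 1.00
    if intent not in NAVIGATIONAL_INTENTS:
        return base          # non-navigational intents: standard boost only
    if doc_category in AUTHORITY_UP:
        return base * 1.50   # directory-style pages are promoted
    if doc_category in AUTHORITY_DOWN:
        return base * 0.70   # brochures and news are demoted
    return base
```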
Stage 5b: Value Framework Affinity Rerank
Latency: ~2 ms in-process | Code: backend/app/services/value_framework/affinity.py:apply_intent_category_affinity | Telemetry: app.category_mismatch_telemetry (migration 066)
Stage 5b runs after the metadata-boost rescore and before context assembly. Its purpose is to prevent cross-category contamination — the failure mode where chunks from a wrong-category document outrank correct-category chunks because their similarity score was high in isolation. The motivating case is the wheelchair-vs-cardiology regression: a query about wheelchair access surfaced cardiology brochures because the cardiology chunks had high vector similarity to the word "rolstoel" via incidental co-occurrence in shared paragraphs, despite the user's intent being navigation_or_practical_info (in which clinical_info chunks should be down-weighted).
Algorithm
Each retrieved chunk is classified into one of six content categories by classify_chunk_category (cached lazily on the chunk dict):
| Category | Examples |
|---|---|
| practical | Visiting hours, parking, route descriptions, building numbers |
| appointments | Booking instructions, contact phone numbers, online forms |
| general | Neutral hospital-level descriptions, mission statements |
| legal_admin | Patient rights, complaints procedure, identity documents |
| clinical_info | Condition descriptions, treatment details, recovery guidance |
| regulatory | Privacy policy, billing, insurance, data subject rights |
The intent class (set by Stage 2) keys into DEFAULT_AFFINITY to retrieve a row of category multipliers. For example, intent navigation_or_practical_info carries:
practical: 1.30 (boost)
appointments: 1.05
general: 1.00
legal_admin: 0.85
clinical_info: 0.65 (penalty)
regulatory: 0.55 (penalty)
Each chunk's score is multiplied by the matching coefficient and clamped to [0, 1]:
new_score = max(0.0, min(1.0, original_score * affinity[intent_class][chunk.category]))
Chunks are then re-sorted by the new score. Chunks with no readable score pass through to the bottom of the list unchanged. Unknown intent classes produce a neutral pass-through (multiplier 1.0 for all categories) — the rerank never makes a result list worse than its pre-rerank state.
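The rerank can be sketched as follows. This is an illustration rather than the code in affinity.py: the function name is hypothetical, chunks are assumed to be dicts with `score` and `category` keys, and only the navigation_or_practical_info row of the affinity table is reproduced.

```python
# Only the navigation_or_practical_info row is taken from the text.
AFFINITY = {
    "navigation_or_practical_info": {
        "practical": 1.30, "appointments": 1.05, "general": 1.00,
        "legal_admin": 0.85, "clinical_info": 0.65, "regulatory": 0.55,
    }
}


def affinity_rerank(chunks: list[dict], intent: str) -> list[dict]:
    """Scale each chunk score by its intent/category coefficient and re-sort."""
    row = AFFINITY.get(intent, {})  # unknown intent: neutral pass-through
    scored, unscored = [], []
    for chunk in chunks:
        score = chunk.get("score")
        if not isinstance(score, (int, float)):
            unscored.append(chunk)  # no readable score: unchanged, at the bottom
            continue
        coeff = row.get(chunk.get("category"), 1.0)
        scored.append({**chunk, "score": max(0.0, min(1.0, score * coeff))})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored + unscored
```

In the wheelchair case, a cardiology chunk at 0.80 drops to 0.52 while a parking chunk at 0.60 rises to 0.78, so the practical content moves ahead of the clinical content.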
Telemetry
After the rerank, the pipeline writes one row to app.category_mismatch_telemetry per non-cached query (backend/alembic/versions/066_category_mismatch_telemetry.py):
| Column | Meaning |
|---|---|
| intent_class | The classified intent string |
| primary_category | The dominant category among the top-K post-rerank chunks |
| mismatch_rate | Fraction of top-K chunks whose category is OFF the intent's preferred set |
| chunks_total / chunks_off_category | Denominator/numerator of the rate |
| query_preview | First 200 characters of the query (for operator review) |
The CategoryMismatchTrend chart on the Operations / costs tab plots mismatch_rate over time per tenant; spikes indicate retrieval steering needs retuning.
Stage 5c: Synthetic Department-Doctor-List Injection
Latency: ~5 ms (1 indexed taxonomy query when triggered) | Code: backend/app/services/rag_service.py:_qs_maybe_inject_doctor_list
Stage 5c fires only when all three of the following hold:
- The classified intent is DOCTOR_LOOKUP or DEPARTMENT_OR_SERVICE_LOOKUP
- The user query contains a list-signal phrase matched by _LIST_SIGNAL_RE (e.g., alle, welke artsen, wie werkt er, list all, tous les médecins)
- A department or specialty hint can be resolved from either the classifier-extracted entities (classification.entities.department / .service / .doctor) or from a regex over the rewritten query
When these hold, the stage queries the taxonomy for all doctors associated with the resolved department, builds a synthetic chunk listing them, and inserts it into the retrieved-chunks set before context assembly. This guarantees the LLM has the full roster available so the system prompt's "list all members" exception rule can fire faithfully.
The stage was introduced as a regression fix for the 2026-05-09 incident in which a 6-turn voice conversation about dermatologists capped at the same 2 names — the vector retrieval surfaced individual doctor brochure pages without the shared department roster. See the source comment block at _qs_maybe_inject_doctor_list for the original conversation reference.
When any of the three conditions is not met (e.g., the query is "Wie is Dr. X?", a single-doctor lookup), Stage 5c is a no-op and adds zero latency.
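The trigger logic can be sketched as a single predicate. The regex below is a hypothetical stand-in for _LIST_SIGNAL_RE built only from the example phrases listed on this page, and the department hint is assumed to have been resolved upstream.

```python
import re
from typing import Optional

# Hypothetical stand-in for _LIST_SIGNAL_RE, built from the example phrases.
LIST_SIGNAL_RE = re.compile(
    r"\b(alle|welke artsen|wie werkt er|list all|tous les médecins)\b",
    re.IGNORECASE,
)
LIST_INTENTS = {"DOCTOR_LOOKUP", "DEPARTMENT_OR_SERVICE_LOOKUP"}


def should_inject_doctor_list(
    intent: str, query: str, department_hint: Optional[str]
) -> bool:
    """True only when intent, list-signal phrase, and department hint all hold."""
    return (
        intent in LIST_INTENTS
        and LIST_SIGNAL_RE.search(query) is not None
        and bool(department_hint)
    )
```

Checking the cheap conditions before issuing the taxonomy query is what keeps the stage free for single-doctor lookups.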
Stage 6b: Context Assembly and Citation Building
Latency: ~50 ms | Purpose: Expand retrieved snippets into coherent text blocks and build aligned citations
The key insight from serious RAG architecture: "What you retrieve is not what the model reads." Retrieved chunks are 350-token fragments. The context assembly service transforms them into coherent reading material for the LLM. The 8,000-token budget and chunk-ordering choices are informed by Liu et al. 2024 — Lost in the Middle, which empirically demonstrated that LLMs under-attend to mid-context tokens; we keep the most-relevant document at the top, drop low-relevance trailing blocks first, and avoid placing critical evidence in the middle of a long context. See Context Assembly for full details.
- Expand: For each retrieved chunk, fetch ±1 adjacent chunks from the same document (single batched DB query per document)
- Deduplicate: Strip the ~70-token overlap between consecutive chunks to avoid redundant text
- Group: Merge chunks from the same document into coherent blocks, ordered by chunk_index
- Budget: Cap total context at 8,000 tokens, dropping lowest-relevance document blocks first (@liu2024lostinmiddle)
- Build citations: Generate Citation objects from the post-assembly chunk order
Citations are built after context assembly, not before. This is critical because context assembly reorders and groups chunks by document. If citations were built from the pre-assembly chunk order, the [1], [2] numbering in the LLM's response would not match the citation list shown to the user. The pipeline enforces the invariant: assemble_context() first, then _build_citations_from_chunks().
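The budget step can be sketched as follows, assuming each assembled document block already carries a relevance score and a token count. The `DocumentBlock` type and the function name are hypothetical and do not describe the context assembly service's actual data model.

```python
from dataclasses import dataclass


@dataclass
class DocumentBlock:
    doc_id: str
    relevance: float
    tokens: int


def apply_token_budget(
    blocks: list[DocumentBlock], budget: int = 8000
) -> list[DocumentBlock]:
    """Drop the lowest-relevance blocks until the total fits the budget."""
    kept = sorted(blocks, key=lambda b: b.relevance, reverse=True)
    while kept and sum(b.tokens for b in kept) > budget:
        kept.pop()  # the last element is the least relevant remaining block
    return kept
```

The returned order keeps the most relevant document first, so citation numbers built from this list line up with what the model actually reads.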
Stage 7: Response Generation
Latency: ~3,000ms | Model: Tier 2 via direct OpenAI (both standard and full mode) | Purpose: Synthesize a grounded response
The response generator receives the ranked retrieval results as context and generates a natural language response. The generation prompt enforces strict grounding: every claim must be traceable to a retrieved source. Responses are streamed via WebSocket, providing progressive rendering in the chat interface. Full mode uses the same Tier 2 model as standard mode, but via direct OpenAI API for lowest latency, with always-on reranking and higher max_tokens. See LLM Stack for details.
Stage 8: Hybrid Quality Evaluation
Latency: ~600ms blocking + async background | Purpose: Validate response quality
Quality evaluation operates in two phases:
- Fast Quality Gate (~600 ms, blocking): Computes a weighted-average embedding cosine similarity: 0.7 * answer_context_similarity + 0.3 * question_answer_similarity. This blended semantic_similarity score must exceed 50% for the response to pass (QUALITY_THRESHOLD = 0.50 in backend/app/services/evaluation_service.py; raised from 0.40 to 0.50 on 2026-05-10 per Wave 2.C.1 empirical recalibration of the threshold).
- Background Analytics (40-60s, non-blocking): Fires asynchronously after the response is delivered. Uses DeepEval metrics (Faithfulness, Answer Relevancy) for comprehensive quality monitoring. Results feed into Prometheus metrics for long-term tracking.
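The fast gate's arithmetic can be sketched as below, taking the two cosine similarities as already computed. The function names are hypothetical, and the strict comparison follows the "must exceed" wording, which is an assumption about how the boundary case is handled.

```python
QUALITY_THRESHOLD = 0.50


def semantic_similarity(answer_context_sim: float, question_answer_sim: float) -> float:
    """Blend the two cosine similarities with the 0.7 / 0.3 weights."""
    return 0.7 * answer_context_sim + 0.3 * question_answer_sim


def passes_quality_gate(answer_context_sim: float, question_answer_sim: float) -> bool:
    """Strict comparison, following the 'must exceed' wording (assumption)."""
    return semantic_similarity(answer_context_sim, question_answer_sim) > QUALITY_THRESHOLD
```

The 0.7 weight on answer-to-context similarity means a well-grounded answer (0.6) passes even with a modest question match (0.4), giving 0.54, whereas the reverse combination gives 0.46 and fails.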
Timing Breakdown
The Gantt below shows a representative query timing — it is not measured p50 / p95. Stage durations are estimates derived from the dominant cost on each stage; verifiable per-stage measurements live in app.pipeline_telemetry.duration_ms and are surfaced via GET /api/v1/admin/feedback/telemetry-stats (see Feedback Dashboard Metrics). Wave 2.D leaves the per-stage p50 / p95 numbers as not yet measured rather than fabricating values; the Gantt is a working estimate, not a contract.
The total blocking latency is approximately 5.5 seconds at this representative breakdown. Intent classification and query rewriting are combined into a single LLM call (~400 ms). Vector, BM25, and taxonomy search execute sequentially (due to asyncpg's single-session constraint), adding ~800 ms total retrieval time. The Value Framework affinity rerank (Stage 5b) and the conditional doctor-list injection (Stage 5c) add ~2 ms and ≤5 ms respectively. Context assembly adds ~65 ms for chunk expansion and deduplication. The most expensive component remains response generation (~3 s), which uses streaming to mitigate perceived latency. Background analytics (40–60 s) run entirely asynchronously and never impact user experience. Real measurements at p50 / p95 are pending — see the System Overview's End-to-End Latency Budgets table for the verification path.
Follow-Up Suggestion Generation
Latency: ~200ms (async, non-blocking) | Model: Tier 1 (fast) | Purpose: Guide the user toward relevant next questions
After the response is streamed, the pipeline generates 3 contextual follow-up questions in Dutch that a patient or visitor would logically ask next. These appear as clickable suggestion pills below the response, reducing cognitive load and encouraging deeper exploration of the hospital's services.
Why a Separate Lightweight Model?
Follow-up suggestions are a UX convenience feature, not a safety-critical pipeline stage. A dedicated lightweight Tier 1 (fast) model compares favorably with the main response model on four criteria:
| Criterion | Main Model (Tier 2 / Tier 3) | Follow-Up Model (Tier 1) |
|---|---|---|
| Cost | $0.40-2.00 / M input tokens | $0.10 / M input tokens |
| Latency | 1-5 seconds | ~200ms |
| Output format | Free-form text + reasoning | Strict JSON array (3 strings) |
| Failure impact | Critical (no response) | Graceful (empty suggestions) |
The follow-up model is configured separately via rag_followup_model in the admin settings, defaulting to the Tier 1 (fast) model. This prevents reasoning models — which produce internal thinking tokens that break JSON parsing — from being used for structured output tasks.
Fault Tolerance
Follow-up generation is wrapped in a try/except — if the model returns unparseable JSON or the call fails entirely, the pipeline returns an empty suggestion list. The response is never delayed or degraded by suggestion failures.
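A minimal sketch of the defensive parsing described above, assuming the model is asked for a JSON array of strings. The function name and the truncation to three items are illustrative assumptions.

```python
import json


def parse_suggestions(raw: str, expected: int = 3) -> list[str]:
    """Return up to `expected` suggestion strings, or [] when the payload is unusable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []  # unparseable output degrades to "no suggestions"
    if not isinstance(data, list) or not all(isinstance(s, str) for s in data):
        return []  # valid JSON but not the expected array of strings
    return data[:expected]
```

Output from a reasoning model that prefixes its answer with thinking text fails the `json.loads` call and simply yields an empty list, so the chat response itself is unaffected.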
Error Handling
Each pipeline stage includes fail-safe behavior:
- Cache failure: Proceed without cache (log warning)
- Intent classification failure: Default to HYBRID retrieval strategy
- Retrieval failure: Return partial results from whichever source succeeded
- Generation failure: Return a polite error message with general navigation links
- Evaluation failure: Deliver the response but flag for manual review
Real-Time Pipeline Visibility
Every pipeline stage emits structured progress data via WebSocket, enabling the frontend to display real-time technical information during query execution. This is critical for academic evaluation, where every architectural decision must be demonstrable.
Progress Emissions
Each stage emits a StreamChunk with type="progress" containing a PipelineProgress object:
| Stage | Emitted Data |
|---|---|
| UNDERSTANDING | Intent type, confidence score, retrieval strategy, query rewriting details |
| SEARCHING | Document count, search method (hybrid/vector-only), BM25 match count, canonical question matches |
| ANALYZING | Chunks retrieved vs. chunks after context expansion |
| GENERATING | Progress percentage |
| EVALUATING | Faithfulness %, relevancy %, overall score with A-F grade, quality gate pass/fail |
Frontend Debug Panel
The admin interface renders these emissions as color-coded badges in the chat interface. For full documentation of the debug panel capabilities, including document detail view metadata and analytics reset, see the dedicated Frontend Debug Panel page.
Example: What an Examiner Sees
During a live query, the debug panel progressively shows:
- Understanding: Medische vraag 92%, HYBRID, hybrid_bm25_vector
- Searching: 8 documents, BM25: 3, CQ: 5
- Analyzing: 8 → 12 chunks (context expansion)
- Generating: Progress bar filling
- Evaluating: Score: 78% (B), Faithfulness: 82%, Relevancy: 74%