Hybrid Search Strategy

ZOL Intelligent Search uses a hybrid retrieval mix of three complementary signals (dense vector search, sparse BM25 lexical search, and typed-entity taxonomy lookup), fused with Reciprocal Rank Fusion and followed by two re-ranking stages (Value Framework affinity rerank, synthetic doctor-list injection) before the chunks reach the context-assembly stage. This page documents the rationale for each component, the published precedents that motivate the design, and the trade-off matrix behind the current design. See ADR-007 for the original decision; the embedding model has since migrated to text-embedding-3-large per ADR-0048.

Why no single retrieval signal is enough

Each retrieval signal has well-documented strengths and equally well-documented blind spots. The hybrid argument is straightforward: the union of the three covers what any one of them misses.

Vector (dense) search — bridges paraphrase, misses rare lexical surface forms

Dense retrieval — the dual-encoder pattern formalised by Karpukhin et al. 2020 — embeds query and corpus into a shared semantic space and scores by cosine similarity. It excels at paraphrase ("pijn in de borst" → "thoracale klachten") and cross-language matching, but it is weakest on:

  • Rare proper nouns: Dr. Vanderstraeten may have moderate similarity to a doctor brochure but should be an exact lexical match.
  • Domain-specific clinical terms: cardioversie and trombolyse appear verbatim in the source content; lexical match is more reliable than learned similarity.
  • Structured entity questions: "Which cardiologists work at Sint-Jan?" requires traversing typed relationships (Doctor → WORKS_IN → Department → LOCATED_AT → Campus). No amount of vector similarity reliably reconstructs that join.
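The cosine scoring that dense retrieval relies on can be sketched in a few lines. This is a minimal illustration, assuming toy three-dimensional vectors invented for the example; real embeddings come from the embedding model and are far larger, and the function name is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up values, not output of any real model).
query = [0.9, 0.1, 0.0]
paraphrase_chunk = [0.8, 0.2, 0.1]   # points in nearly the same direction
unrelated_chunk = [0.0, 0.1, 0.9]    # points in a different direction

print(cosine_similarity(query, paraphrase_chunk))
print(cosine_similarity(query, unrelated_chunk))
```

The score depends only on direction in the embedding space, which is why a paraphrase can score high while an exact surname that the model has rarely seen may not.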

BM25 (sparse) search — exact lexical match, no synonymy

BM25 (Okapi BM25) is the canonical probabilistic relevance scoring framework (Robertson & Zaragoza 2009), ranking documents by term frequency, inverse document frequency, and document-length normalisation. It catches the rare lexical cases that vector search misses — but it has no concept of synonymy:

  • "hoe bereid ik me voor?" (how do I prepare?) does not match a chunk titled voorbereiding (preparation) — same concept, no shared tokens.
  • "hartfilmpje" (colloquial Dutch for ECG) does not match content about elektrocardiogram.
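The synonymy blind spot follows directly from the scoring formula: a query term that does not occur in a document contributes nothing. The sketch below is a textbook Okapi BM25 scorer with a non-negative IDF variant, written for illustration only; it is not the ranking function PostgreSQL executes, and the whitespace-tokenised toy documents and the defaults k1 = 1.2, b = 0.75 are assumptions.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str],
    docs: list[list[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> list[float]:
    """Score every tokenised document against the tokenised query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # no shared token, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "voorbereiding op uw operatie".split(),
    "elektrocardiogram onderzoek van het hart".split(),
]
print(bm25_scores("hoe bereid ik me voor".split(), docs))
print(bm25_scores(["hartfilmpje"], docs))
print(bm25_scores(["elektrocardiogram"], docs))
```

Both lay-language queries score zero against every document, while the verbatim clinical term scores only against the document that contains it.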

Taxonomy (typed-entity) lookup — structured queries, no free text

The PostgreSQL taxonomy tables (departments, doctors, conditions, treatments, plus their typed *_relationships join tables) answer structured questions cheaply and exactly:

  • "Welke artsen werken bij Cardiologie?" → SQL join over doctor_departments, no embedding involved.
  • "Onder welke dienst valt Diabetes?" → SQL join over dept_handles_condition.

But the taxonomy carries no free text. A query like "hoe bereid ik me voor op een knieoperatie?" — whose answer lives in a brochure paragraph, not in any entity property — is invisible to taxonomy search.
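A structured lookup of this kind can be sketched with an in-memory SQLite database standing in for PostgreSQL. The column names, the sample rows, and the helper function are assumptions made for the example; only the table names departments, doctors, and doctor_departments come from the text above.

```python
import sqlite3

# In-memory stand-in for the taxonomy tables (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE doctors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE doctor_departments (
        doctor_id INTEGER REFERENCES doctors(id),
        department_id INTEGER REFERENCES departments(id)
    );
    INSERT INTO departments VALUES (1, 'Cardiologie'), (2, 'Orthopedie');
    INSERT INTO doctors VALUES (1, 'Dr. A'), (2, 'Dr. B'), (3, 'Dr. C');
    INSERT INTO doctor_departments VALUES (1, 1), (2, 1), (3, 2);
    """
)


def doctors_in_department(conn: sqlite3.Connection, department: str) -> list[str]:
    """Answer 'which doctors work in department X?' with a plain join."""
    rows = conn.execute(
        """
        SELECT d.name
        FROM doctors AS d
        JOIN doctor_departments AS dd ON dd.doctor_id = d.id
        JOIN departments AS dep ON dep.id = dd.department_id
        WHERE dep.name = ?
        ORDER BY d.name
        """,
        (department,),
    ).fetchall()
    return [name for (name,) in rows]


print(doctors_in_department(conn, "Cardiologie"))
```

The answer is exact and involves no embedding, but the same tables have nothing to return for a free-text question about preparing for surgery.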

Hybrid trade-off matrix

| Decision | Chosen | Alternatives considered | Rejected because |
| --- | --- | --- | --- |
| Number of retrieval signals | Three (vector + BM25 + taxonomy) | Vector-only (@karpukhin2020dpr); vector + BM25 only; vector + ColBERT only (@khattab2020colbert) | Vector-only fails the proper-noun and rare-clinical-term cases above; vector+BM25-without-taxonomy still requires the LLM to infer department-condition mappings from text, which produces inconsistent answers; ColBERT-only forces re-encoding the full corpus on every model swap, which is operationally untenable. ColBERT survives as a feature-flagged late-interaction reranker (see ADR-0039). |
| Fusion method | Reciprocal Rank Fusion (RRF) with k=60 (Cormack, Clarke & Büttcher 2009) | Weighted linear combination of similarity scores; learning-to-rank | Vector cosine, BM25 ts_rank, and graph priority scores live on incomparable scales. Weighted linear fusion required hand-tuned coefficients that drifted with each embedding swap (re-tuned at the BGE-M3 → text-embedding-3-large migration alone). RRF is score-agnostic (it depends only on rank position) and is empirically stable across model migrations. See ADR-0020. |
| Execution order | Sequential vector → BM25 → taxonomy on a single asyncpg session | Parallel via three sessions; parallel with a sync ORM | asyncpg's protocol does not support concurrent queries on the same session; opening three sessions per query has connection-pool consequences at our scale. The ~800 ms total cost of the sequential path is well below the LLM generation cost (~3 s) and preferable to a connection-pool spike under load. |
| Default strategy | HYBRID for all non-blocked intents | Per-intent dynamic routing (vector-only for some, hybrid for others) | Empirical: every intent benefits from at least the taxonomy hint, even when its primary signal is textual. The retrieval-strategy enum retains VECTOR_ONLY and GRAPH_ONLY for fallback, but production traffic routes to HYBRID for every safe intent. |

Architecture

Stage 5b and Stage 5c are always-on rerank stages that execute after metadata boosting on every non-cached query — see Query Pipeline §Stage 5b and §Stage 5c for the algorithms and rationale, and Reranking & Evaluation for the broader rerank pipeline.

Intent-driven routing

Intent classification (see Query Pipeline §Stage 2) maps each safe query to a retrieval strategy:

| Strategy | Used when | Sources executed |
| --- | --- | --- |
| HYBRID (default) | All eight non-blocked intents | Vector + BM25 + taxonomy (sequential) |
| VECTOR_ONLY | Fallback when taxonomy is unavailable | Vector + BM25 only |
| BLOCKED | out_of_scope_medical_advice, off_topic, other_hospital, vague_input | None; immediate safety response |

The four blocked intents short-circuit the retrieval stack entirely. They never reach Stage 5.
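The routing rule can be sketched as a small pure function. The class name, function name, and the taxonomy_available flag are hypothetical; the strategy names and the four blocked intent labels are taken from the text, and "doctor_lookup" in the usage lines is an invented intent label.

```python
from enum import Enum


class RetrievalStrategy(Enum):
    HYBRID = "hybrid"
    VECTOR_ONLY = "vector_only"
    GRAPH_ONLY = "graph_only"   # retained for fallback, unused in this sketch
    BLOCKED = "blocked"


BLOCKED_INTENTS = frozenset(
    {"out_of_scope_medical_advice", "off_topic", "other_hospital", "vague_input"}
)


def select_strategy(intent: str, taxonomy_available: bool = True) -> RetrievalStrategy:
    """Map a classified intent to a retrieval strategy."""
    if intent in BLOCKED_INTENTS:
        # Blocked intents short-circuit before any retrieval runs.
        return RetrievalStrategy.BLOCKED
    if not taxonomy_available:
        return RetrievalStrategy.VECTOR_ONLY
    return RetrievalStrategy.HYBRID


print(select_strategy("off_topic"))
print(select_strategy("doctor_lookup"))
print(select_strategy("doctor_lookup", taxonomy_available=False))
```

Checking the blocked set first keeps the safety response independent of whether any retrieval backend is healthy.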

BM25 search runs as a PostgreSQL full-text query over a tsvector column built during ingestion:

search_vector := to_tsvector('simple', title || ' ' || content || ' ' || canonical_questions)

Two design decisions warrant explanation.

Why 'simple' instead of 'dutch'

Dutch full-text dictionaries apply stemming — behandeling (treatment) and behandelingen (treatments) collapse to a single stem. Useful for general retrieval; a problem for clinical content where the morphology often signals which clinical concept is meant. Worse, terms like cardioversie and cardioversies are domain-specific enough that the patient often quotes them verbatim, and we want exact matches. The 'simple' configuration preserves surface forms; missing-synonym recall is recovered by the dense vector path, not by stemming.

Two-step tsvector generation (canonical questions)

Ingestion writes the initial search_vector from title || content. After the LLM-driven enrichment step generates 1–2 canonical Dutch questions per chunk (e.g., for a visiting-hours chunk: "Wat zijn de bezoekuren bij ZOL?"), those questions are appended to the tsvector in a background pass. A user query that matches the canonical question gets a BM25 boost even if the chunk content uses different phrasing. This is a partial implementation of the HyPE pattern (full HyPE would also embed the canonical questions as separate vectors; we don't, currently — see Context Retrieval).

Query syntax

Queries use OR-joined tokens via to_tsquery('simple', ...) rather than plainto_tsquery's implicit AND:

to_tsquery('simple', 'afspraak | zol | maken')

OR semantics produce wider recall in the BM25 lane; precision is recovered by rank ordering (a chunk that matches every term generally ranks higher than one matching only one), by RRF, and by the post-retrieval reranking stages.
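Building the OR-joined query string might look like the sketch below. The function name and the tokenisation rule (lowercase, word characters only, duplicates dropped) are assumptions; the resulting string is meant to be passed to to_tsquery('simple', ...) as a bound parameter rather than interpolated into SQL.

```python
import re


def build_or_tsquery(user_query: str) -> str:
    """Turn free text into an OR-joined tsquery string."""
    # Keep word characters only, so tsquery operators such as & | ! ( ) : are dropped.
    tokens = re.findall(r"\w+", user_query.lower())
    # Drop duplicates while preserving the original order.
    unique_tokens = list(dict.fromkeys(tokens))
    return " | ".join(unique_tokens)


print(build_or_tsquery("Afspraak ZOL maken?"))
```

An empty result (for example, a query made only of punctuation) should be handled by the caller before the string reaches the database.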

Reciprocal Rank Fusion (RRF)

RRF combines per-source rankings into a unified score using only rank position, ignoring raw relevance values. For a document d at 0-based rank r_i in result list i (a list that does not contain d contributes nothing):

$$ \text{rrf\_score}(d) = \sum_{i \in \text{lists}} \frac{1}{k + r_i + 1} $$

with the constant k = 60 (the value Cormack et al. found empirically near-optimal across TREC tracks). Pseudocode:

from collections import defaultdict

# Chunk and ChunkId are the pipeline's own types, defined elsewhere.
def reciprocal_rank_fusion(
    ranked_lists: list[list[Chunk]],
    k: int = 60,
) -> list[Chunk]:
    """Score-agnostic fusion: combine N ranked lists using rank position only."""
    scores: dict[ChunkId, float] = defaultdict(float)
    by_id: dict[ChunkId, Chunk] = {}
    for lst in ranked_lists:
        # enumerate() is 0-based, so the top hit of each list contributes 1 / (k + 1).
        for rank, chunk in enumerate(lst):
            scores[chunk.id] += 1.0 / (k + rank + 1)
            by_id[chunk.id] = chunk
    # Highest fused score first.
    return [by_id[cid] for cid, _ in sorted(scores.items(), key=lambda kv: -kv[1])]

A chunk that appears at rank 0 in vector and rank 5 in BM25 receives 1/(60+0+1) + 1/(60+5+1) = 1/61 + 1/66 ≈ 0.0315. A chunk that appears at rank 0 in vector but is absent from BM25 receives 1/61 ≈ 0.0164. The chunk that appears in both lists ranks higher, which is the desired multi-evidence promotion. See ADR-0020 for the original rationale.
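The two cases above can be checked with a self-contained variant of the fusion function that works on plain chunk IDs. The IDs and filler entries are invented for the example, and the two cases are evaluated as separate queries.

```python
from collections import defaultdict


def rrf_scores(ranked_ids: list[list[str]], k: int = 60) -> dict[str, float]:
    """Return the RRF score per chunk ID, using 0-based ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ids in ranked_ids:
        for rank, chunk_id in enumerate(ids):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return dict(scores)


# BM25 list with five filler hits, so that chunk "a" sits at rank 5.
bm25_hits = ["x0", "x1", "x2", "x3", "x4", "a"]

# Case 1: chunk "a" is rank 0 in vector and rank 5 in BM25.
in_both = rrf_scores([["a"], bm25_hits])["a"]
# Case 2: chunk "b" is rank 0 in vector and absent from BM25.
vector_only = rrf_scores([["b"], bm25_hits])["b"]

print(f"in both lists: {in_both:.4f}")
print(f"vector only:   {vector_only:.4f}")
```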

Why RRF beats weighted linear combination on this corpus

Weighted linear combination is the obvious alternative — score = α · vector_cos + β · bm25_rank + γ · graph_priority. Three reasons we don't use it:

  1. Score scales differ. Vector cosine sits in [0.30, 0.95] for our corpus; BM25 ts_rank sits in [0.001, 0.4]; graph priority is fixed in the 0.90–0.95 range. Weighting them requires per-corpus calibration.
  2. Migrations break the calibration. The BGE-M3 → text-embedding-3-large migration (ADR-0048) shifted the cosine distribution; weights tuned for BGE-M3 were no longer correct for text-embedding-3-large. RRF didn't flinch.
  3. The fusion benchmark literature consistently finds RRF competitive with or better than tuned linear combination across heterogeneous score sources (Cormack, Clarke & Büttcher 2009).
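The calibration problem in reasons 1 and 2 can be reproduced with a toy example. All scores below are invented for illustration and are not measurements from either embedding model; the two cosine dictionaries rank the documents identically and differ only in how spread out the values are.

```python
def ranking(scores: dict[str, float]) -> list[str]:
    """Order document IDs by descending score (ties broken by ID)."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def linear_fusion(vector: dict[str, float], bm25: dict[str, float],
                  alpha: float = 1.0, beta: float = 1.0) -> list[str]:
    """Weighted sum of raw scores; sensitive to each lane's score scale."""
    ids = set(vector) | set(bm25)
    fused = {i: alpha * vector.get(i, 0.0) + beta * bm25.get(i, 0.0) for i in ids}
    return ranking(fused)


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over ranked ID lists (0-based ranks)."""
    fused: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return ranking(fused)


bm25 = {"A": 0.30, "C": 0.06, "B": 0.05, "D": 0.01}
cosine_old = {"B": 0.90, "A": 0.50, "C": 0.40, "D": 0.35}  # wide spread
cosine_new = {"B": 0.62, "A": 0.58, "C": 0.55, "D": 0.50}  # same order, compressed

linear_old = linear_fusion(cosine_old, bm25)
linear_new = linear_fusion(cosine_new, bm25)
rrf_old = rrf([ranking(cosine_old), ranking(bm25)])
rrf_new = rrf([ranking(cosine_new), ranking(bm25)])

print("linear:", linear_old, "->", linear_new)
print("rrf:   ", rrf_old, "->", rrf_new)
```

With unchanged weights, the linear fusion's top result flips once the cosine distribution is compressed, while the RRF ordering is identical for both because the underlying rankings are identical.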

Graph result priority

Graph (taxonomy) results are merged into the fused ranking with priority ordering rather than via RRF:

  • Typed-entity matches (a doctor / department / condition resolved by ID) are placed first in the merged list and assigned a fixed similarity score in the 0.90–0.95 range. They reflect a high-precision match: the patient asked about Cardiologie, the taxonomy returned the row for Cardiologie.
  • Semantic-graph matches (relationship traversal results without an exact entity hit) follow vector results.

Graph results carry the structured properties (consultation hours, campus, phone number) as metadata, which can be appended to a vector result that references the same source URL — see Deduplication below.
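The priority ordering can be sketched as a simple concatenation. The Hit type, the lane labels, and the choice of 0.95 as the pinned score are assumptions for illustration; the text only states that the fixed score lies in the 0.90 to 0.95 range.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float
    lane: str  # "typed_entity", "fused", or "semantic_graph"


TYPED_ENTITY_SCORE = 0.95  # assumption: upper end of the documented range


def merge_graph_priority(
    typed_entity_hits: list[Hit],
    fused_hits: list[Hit],
    semantic_graph_hits: list[Hit],
) -> list[Hit]:
    """Typed-entity hits first (pinned score), then fused hits, then semantic-graph hits."""
    pinned = [replace(h, score=TYPED_ENTITY_SCORE) for h in typed_entity_hits]
    return pinned + list(fused_hits) + list(semantic_graph_hits)


merged = merge_graph_priority(
    [Hit("dept:cardiologie", 0.0, "typed_entity")],
    [Hit("chunk:17", 0.0325, "fused"), Hit("chunk:4", 0.0164, "fused")],
    [Hit("rel:cardiologie-campus", 0.0, "semantic_graph")],
)
print([h.chunk_id for h in merged])
```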

Query expansion via taxonomy resolution

Before retrieval runs, Taxonomy Query Enrichment resolves patient-friendly terms to canonical taxonomy entities and appends them to the query. Examples:

| Original | Resolved | Enriched (sent to vector + BM25) |
| --- | --- | --- |
| "hartfilmpje" | examination=ECG | "hartfilmpje (ECG)" |
| "suikerziekte" | condition=Diabetes Mellitus, dept=Endocrinologie | "suikerziekte (Diabetes Mellitus, Endocrinologie)" |
| "orthopedische consultatie" | dept=Orthopedie at campuses Sint-Jan and André Dumont | "orthopedische consultatie (Orthopedie, Sint-Jan, André Dumont)" |

This bridges the lay-vs-clinical vocabulary gap before the embedding is computed, so vector recall improves at the source rather than relying on RRF to rescue a near-miss.
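The enrichment step can be approximated with a dictionary lookup. The in-memory mapping and the function name are hypothetical stand-ins for the real taxonomy resolution, which is assumed to query the taxonomy tables; the entries mirror the examples in the table above.

```python
import re

# Hypothetical lay-term mapping; the real resolver is assumed to use the taxonomy tables.
LAY_TERMS: dict[str, list[str]] = {
    "hartfilmpje": ["ECG"],
    "suikerziekte": ["Diabetes Mellitus", "Endocrinologie"],
    "orthopedische consultatie": ["Orthopedie", "Sint-Jan", "André Dumont"],
}


def enrich_query(query: str, lay_terms: dict[str, list[str]] = LAY_TERMS) -> str:
    """Append canonical taxonomy names for every lay term found in the query."""
    resolved: list[str] = []
    for term, canonical_names in lay_terms.items():
        # Whole-word, case-insensitive match on the lay term.
        if re.search(rf"\b{re.escape(term)}\b", query, flags=re.IGNORECASE):
            resolved.extend(n for n in canonical_names if n not in resolved)
    if not resolved:
        return query  # nothing resolved: send the query unchanged
    return f"{query} ({', '.join(resolved)})"


print(enrich_query("hartfilmpje"))
print(enrich_query("suikerziekte"))
```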

Citation context for follow-up queries

For follow-up turns, the rewriter also extracts topic keywords from the previous turn's citations and threads them through as Onderwerp: (Subject:) hints:

Context: Waar kan ik parkeren bij het ziekenhuis?
Onderwerp: parking campus sint jan, bezoekersinfo zol
Question: en de kosten?

Without this, "en de kosten?" (and the costs?) could pull cost-related content from any topic; with the topic hint, retrieval is biased toward parking. See Query Pipeline §Stage 3.
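Assembling that three-line rewriter input might look like the following sketch. The function name and signature are hypothetical; only the Context / Onderwerp / Question layout is taken from the example above.

```python
def build_followup_input(
    previous_question: str,
    topic_keywords: list[str],
    question: str,
) -> str:
    """Format the previous turn, citation topics, and follow-up as rewriter input."""
    lines = [f"Context: {previous_question}"]
    if topic_keywords:
        # Topic hints extracted from the previous turn's citations.
        lines.append("Onderwerp: " + ", ".join(topic_keywords))
    lines.append(f"Question: {question}")
    return "\n".join(lines)


print(
    build_followup_input(
        "Waar kan ik parkeren bij het ziekenhuis?",
        ["parking campus sint jan", "bezoekersinfo zol"],
        "en de kosten?",
    )
)
```

When no citations are available, the sketch simply omits the Onderwerp line rather than emitting an empty hint.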

Deduplication

When vector and graph retrieval surface the same source URL — common for doctor and department pages — the merger:

  1. Keeps the highest relevance score across the two paths.
  2. Appends graph entity properties (consultation hours, campus, phone) onto the matching vector result.
  3. Emits a single result row carrying both the textual context and the structured metadata.

The downstream context-assembly step (chunk-overlap dedup) handles the intra-document overlap; the graph/vector dedup handles the inter-source overlap.
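The three merge steps can be sketched as follows. The Result type, its field names, and the example URLs are assumptions for illustration, not the pipeline's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    source_url: str
    score: float
    text: str = ""
    properties: dict[str, str] = field(default_factory=dict)


def dedupe_by_source_url(
    vector_results: list[Result],
    graph_results: list[Result],
) -> list[Result]:
    """Merge vector and graph results that point at the same source URL."""
    merged: dict[str, Result] = {}
    for r in vector_results:
        current = merged.get(r.source_url)
        if current is None or r.score > current.score:
            merged[r.source_url] = Result(r.source_url, r.score, r.text, dict(r.properties))
    for g in graph_results:
        existing = merged.get(g.source_url)
        if existing is None:
            merged[g.source_url] = Result(g.source_url, g.score, g.text, dict(g.properties))
        else:
            existing.score = max(existing.score, g.score)   # step 1: keep the highest score
            existing.properties.update(g.properties)        # step 2: append graph properties
    # Step 3: one row per URL, best score first.
    return sorted(merged.values(), key=lambda r: -r.score)


vector = [
    Result("https://example.org/cardiologie", 0.71, text="Cardiologie brochure"),
    Result("https://example.org/parking", 0.65, text="Parking info"),
]
graph = [Result("https://example.org/cardiologie", 0.95, properties={"campus": "Sint-Jan"})]
deduped = dedupe_by_source_url(vector, graph)
print([(r.source_url, r.score, r.properties) for r in deduped])
```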

Empirical evidence

The hybrid strategy was validated against the golden evaluation set during the 2026-03 sprint that established the current shape:

| Query type | Vector-only | Graph-only | Hybrid |
| --- | --- | --- | --- |
| Doctor lookup | Poor | Excellent | Excellent |
| Condition info | Excellent | Poor | Excellent |
| "Doctor X at campus Y" | Moderate | Good | Excellent |
| Symptom-to-department | Good | Moderate | Excellent |

Quantitative measurements at the time of the BGE-M3 → text-embedding-3-large migration (April 2026) confirmed that the hybrid path retained recall while replacing vector-only as the primary signal — see Retrieval Improvements Roadmap for the per-migration deltas. End-to-end pipeline latency at p50 / p95 against the post-migration shape is not yet measured; representative numbers in Pipeline Latency are estimates pending the production telemetry pull.

Information-retrieval theory in one paragraph

The shape we ship rests on three published claims: (1) dense bi-encoder retrieval bridges the vocabulary gap that pure lexical search cannot (@karpukhin2020dpr); (2) BM25 remains state-of-the-art for exact lexical matching of rare clinical terms (Robertson & Zaragoza 2009); and (3) score-agnostic rank fusion is competitive with or better than tuned linear combinations across heterogeneous score sources (Cormack, Clarke & Büttcher 2009). The fourth element, typed-entity taxonomy lookup as a third retrieval lane, is ZOL-specific and not from the published RAG canon; it answers the operational reality that a hospital website's structural questions ("which doctors / which department") deserve a structured-lookup path that does not depend on probabilistic retrieval at all.

References