
ADR-007: RAG Pipeline Enrichment

Date: 2026-02-07 | Status: Accepted

BM25 Hybrid Search, Context Assembly, Canonical Questions, and DeepEval Fix.

Context

Analysis of the RAGFlow article "Architecture for a Serious RAG System" (2025) revealed three major gaps in our retrieval-augmented generation (RAG; Lewis et al., 2020) query pipeline, plus a broken dependency:

  1. Vector-only retrieval -- Our search relies entirely on semantic similarity (pgvector). Exact keyword matches (e.g., "Dr. Vanderstraeten", "cardioversie") often score low on cosine similarity despite being the most relevant results.

  2. No context assembly -- Retrieved chunks are 350-token fragments. The LLM receives isolated snippets instead of coherent sections. The article's key insight: "What you retrieve is not what the model reads."

  3. No canonical questions -- Chunks about visiting hours do not contain the phrasing "Wat zijn de bezoekuren?" that users actually type.

  4. DeepEval not installed -- The evaluator catches the ImportError and returns None scores, which are then converted to 0.01 (1%). As a result, background quality evaluation has been silently broken.

Decision

1. Fix DeepEval

Add deepeval>=2.0.0 to pyproject.toml so background quality evaluation produces real scores.
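The failure mode described in the Context section (a missing package turning into a 1% score) can be made visible with a small guard. The sketch below is illustrative only: `package_available` and `normalize_score` are hypothetical names, not functions from our evaluator, and it assumes scores are floats in the range 0 to 1.

```python
import importlib.util
from typing import Optional


def package_available(name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def normalize_score(raw: Optional[float]) -> Optional[float]:
    """Clamp a score to [0, 1], but keep None as None.

    Keeping None distinct from a low number means "not evaluated"
    is never logged as if it were a real 1% score.
    """
    if raw is None:
        return None
    return max(0.0, min(1.0, raw))


if __name__ == "__main__":
    if not package_available("deepeval"):
        print("deepeval is not installed; quality scores will be missing")
```

A check like this could run once at startup, so a missing dependency shows up in logs rather than in the score distribution.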

2. BM25 via PostgreSQL tsvector

Add a search_vector column (TSVECTOR) with a GIN index to document_chunks. Hybrid scoring now uses Reciprocal Rank Fusion (RRF, k=60); see ADR-0020.

  • Uses 'simple' text configuration (no stemming -- better for Dutch medical terms)
  • Replaces the title_keywords boost (15%) which was a fragile proxy for keyword matching
  • No new infrastructure required
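As a rough illustration of the fusion step named above: RRF gives each chunk a score of 1/(k + rank) in every ranked list it appears in, then sums those scores. The sketch below assumes two ranked lists of chunk IDs (one from vector search, one from BM25). The function name `rrf_fuse` and the tie-breaking rule are illustrative assumptions; the scoring actually in use is specified in ADR-0020.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1 for the best result.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # ordered by cosine similarity
    bm25_hits = ["c", "a", "d"]    # ordered by keyword rank
    for chunk_id, score in rrf_fuse([vector_hits, bm25_hits]):
        print(chunk_id, round(score, 5))
```

Because RRF only looks at ranks, the vector and BM25 scores do not need to be on a comparable scale.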

3. Per-Chunk Canonical Questions

During ingestion, the Tier 2 (standard) model generates 1-2 Dutch questions per chunk. These are stored in chunk_metadata.canonical_questions and included in the tsvector for BM25 matching.
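One way to picture how the questions reach the keyword index is to build a single text per chunk and pass it to to_tsvector('simple', ...). The sketch below covers only the text-building step. `build_search_text` is a hypothetical helper, and it assumes chunk_metadata is a dict whose canonical_questions entry is a list of strings.

```python
from typing import Any, Dict


def build_search_text(content: str, chunk_metadata: Dict[str, Any]) -> str:
    """Combine chunk text and its canonical questions into one indexable string.

    The result is what would be handed to to_tsvector('simple', ...);
    a missing or empty canonical_questions entry leaves the content unchanged.
    """
    questions = chunk_metadata.get("canonical_questions") or []
    parts = [content.strip()]
    parts.extend(q.strip() for q in questions if q and q.strip())
    return "\n".join(parts)


if __name__ == "__main__":
    # Made-up chunk text, used only to show the shape of the output.
    text = build_search_text(
        "Bezoek is mogelijk in de namiddag.",
        {"canonical_questions": ["Wat zijn de bezoekuren?"]},
    )
    print(text)
```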

4. Context Assembly Service

A new ContextAssemblyService between retrieval and LLM generation:

  1. Expand: Fetch ±1 adjacent chunks per document (batched DB query)
  2. Deduplicate: Strip ~70-token overlap between consecutive chunks
  3. Group: Merge chunks by document, order by relevance
  4. Budget: Cap at 4,000 tokens (later increased to 8,000 tokens), drop lowest-relevance blocks first
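The four steps above can be sketched in a few functions. This is a simplified illustration, not the ContextAssemblyService itself: the `Chunk` type, the in-memory `store` dict standing in for the batched DB query, and the pre-tokenized `tokens` lists are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    index: int          # position of the chunk within its document
    tokens: List[str]   # pre-tokenized text (tokenizer is assumed)
    score: float = 0.0  # retrieval relevance; 0.0 for neighbours added by expansion


def expand(hits: List[Chunk], store: Dict[Tuple[str, int], Chunk]) -> List[Chunk]:
    """Step 1: add the chunks at index - 1 and index + 1 for every hit."""
    picked = {(c.doc_id, c.index): c for c in hits}
    for c in hits:
        for idx in (c.index - 1, c.index + 1):
            key = (c.doc_id, idx)
            if key in store and key not in picked:
                picked[key] = store[key]
    return list(picked.values())


def strip_overlap(prev: List[str], cur: List[str], max_overlap: int = 70) -> List[str]:
    """Step 2: drop the longest prefix of cur that repeats the end of prev."""
    limit = min(max_overlap, len(prev), len(cur))
    for n in range(limit, 0, -1):
        if prev[-n:] == cur[:n]:
            return cur[n:]
    return cur


def assemble(hits: List[Chunk], store: Dict[Tuple[str, int], Chunk],
             budget: int = 4000) -> List[Tuple[str, List[str]]]:
    """Steps 3 and 4: group by document, order by relevance, enforce the budget."""
    by_doc: Dict[str, List[Chunk]] = {}
    for c in expand(hits, store):
        by_doc.setdefault(c.doc_id, []).append(c)

    blocks = []
    for doc_id, chunks in by_doc.items():
        chunks.sort(key=lambda c: c.index)
        tokens = list(chunks[0].tokens)
        for prev, cur in zip(chunks, chunks[1:]):
            if cur.index == prev.index + 1:  # only consecutive chunks overlap
                tokens += strip_overlap(prev.tokens, cur.tokens)
            else:
                tokens += cur.tokens
        blocks.append((max(c.score for c in chunks), doc_id, tokens))

    blocks.sort(key=lambda b: b[0], reverse=True)
    while blocks and sum(len(b[2]) for b in blocks) > budget:
        blocks.pop()  # drop the lowest-relevance block first
    return [(doc_id, tokens) for _, doc_id, tokens in blocks]
```

In this sketch a block's relevance is the best score among its chunks, which is one possible reading of "order by relevance"; the budget loop removes whole blocks rather than truncating them.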

Proposed Pipeline

Query → Intent Classification → Hybrid Search (Vector + BM25) → Metadata Boosting
→ Context Assembly (expand + dedup + group) → LLM Generation → Evaluation

Consequences

Positive

  • Exact keyword matching via BM25 (names, medical terms)
  • Coherent context blocks instead of isolated fragments
  • Canonical questions bridge vocabulary gap between user queries and content
  • Working DeepEval quality metrics
  • No new infrastructure -- BM25 via PostgreSQL, context assembly is pure logic
  • All features individually configurable

Negative

  • BM25 adds ~20-50ms (mitigated by parallel execution)
  • Context assembly adds ~50-65ms
  • Canonical question generation adds ~1s per chunk during ingestion (non-blocking)
  • tsvector column adds ~10-20% to row size
  • Migration required for existing chunks

Alternatives Considered

Alternative | Why Rejected
Elasticsearch/Typesense | Additional infrastructure for ~550 chunks is overkill
Client-side BM25 (rank-bm25) | O(N) scan per query, does not scale
Larger chunks instead of context assembly | Reduces embedding precision
LLM query expansion | Adds latency to every query; canonical questions are amortized

Verification

  1. Query "cardioversie" → BM25 score > 0, exact match found
  2. Query "Dr. Vanderstraeten" → exact match via BM25
  3. 7 retrieved chunks → expanded to ~12, overlaps stripped, total ≤ 4,000 tokens (later increased to 8,000 tokens)
  4. Debug panel shows search_method, bm25_matches, chunk expansion stats
  5. DeepEval logs show real faithfulness/relevancy scores

References