ADR-007: RAG Pipeline Enrichment

Date: 2026-02-07 | Status: Accepted

BM25 Hybrid Search, Context Assembly, Canonical Questions, and DeepEval Fix.

Context

Analysis of the RAGFlow article "Architecture for a Serious RAG System" (2025) revealed three major gaps in our RAG query pipeline (Lewis et al., 2020), plus a broken dependency:

Vector-only retrieval -- Our search relies entirely on semantic similarity (pgvector). Exact keyword matches (e.g., "Dr. Vanderstraeten", "cardioversie") often score low on cosine similarity despite being the most relevant results.
No context assembly -- Retrieved chunks are 350-token fragments. The LLM receives isolated snippets instead of coherent sections. The article's key insight: "What you retrieve is not what the model reads."
No canonical questions -- Chunks about visiting hours do not contain the phrasing "Wat zijn de bezoekuren?" that users actually type.
DeepEval not installed -- The evaluator catches ImportError and returns None scores, which are then converted to 0.01 (1%). As a result, background quality evaluation has been silently broken.
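
The DeepEval failure mode in the last item can be reproduced in a few lines. This is a minimal sketch, not the project's evaluator: safe_score, to_stored_score, and missing_dependency_metric are hypothetical names, and only the ImportError handling and the 0.01 fallback are taken from the description above.

```python
def safe_score(compute):
    """Run a metric callable; swallow ImportError and return None (the problematic pattern)."""
    try:
        return compute()
    except ImportError:
        return None


def to_stored_score(score, fallback=0.01):
    """Convert a missing score to the 0.01 fallback described above."""
    return fallback if score is None else score


def missing_dependency_metric():
    # Stand-in for a metric whose import of deepeval fails.
    raise ImportError("No module named 'deepeval'")


stored = to_stored_score(safe_score(missing_dependency_metric))
print(stored)
```

Once stored, the fallback value cannot be told apart from a genuinely poor answer, which is how the breakage could go unnoticed.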

Decision

1. Fix DeepEval

Add deepeval>=2.0.0 to pyproject.toml so background quality evaluation produces real scores.

2. BM25 via PostgreSQL tsvector

Add a search_vector column (TSVECTOR) with a GIN index to document_chunks. Hybrid scoring now uses Reciprocal Rank Fusion (RRF, k=60); see ADR-0020.

Uses 'simple' text configuration (no stemming -- better for Dutch medical terms)
Replaces the title_keywords boost (15%), which was a fragile proxy for keyword matching
No new infrastructure required
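
To illustrate how RRF with k=60 combines the two result lists, here is a minimal sketch of the standard formula, score(d) = sum over rankings of 1 / (k + rank of d). The function name, the chunk IDs, and the tie-breaking rule are assumptions made for the example; the scoring actually used is specified in ADR-0020.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs into one list of (chunk_id, score) pairs."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical pgvector ranking
bm25_hits = ["c7", "c3", "c9"]    # hypothetical full-text ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([chunk_id for chunk_id, _ in fused])
```

Because RRF only looks at ranks, the cosine similarities and the full-text scores never have to be normalized onto a common scale.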

3. Per-Chunk Canonical Questions

During ingestion, the Tier 2 (standard) model generates 1-2 Dutch questions per chunk. These are stored in chunk_metadata.canonical_questions and included in the tsvector for BM25 matching.
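
One way to fold the canonical questions into the tsvector is to concatenate them with the chunk text before indexing. The sketch below assumes that approach; build_search_text, the id column, the named-parameter style, and the sample chunk text are hypothetical, while search_vector, document_chunks, canonical_questions, and the 'simple' configuration come from this ADR.

```python
# Parameterized statement for a PostgreSQL driver; the id column is an assumption.
UPDATE_SEARCH_VECTOR_SQL = (
    "UPDATE document_chunks "
    "SET search_vector = to_tsvector('simple', %(text)s) "
    "WHERE id = %(chunk_id)s"
)


def build_search_text(content, chunk_metadata):
    """Join the chunk text and its canonical questions into one indexable string."""
    questions = chunk_metadata.get("canonical_questions") or []
    return "\n".join([content, *questions])


text = build_search_text(
    "Bezoek is mogelijk van 14u tot 20u.",  # made-up chunk text
    {"canonical_questions": ["Wat zijn de bezoekuren?"]},
)
print(text)
```

With this layout, a query containing "bezoekuren" can match the chunk even though the chunk body itself never uses that word.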

4. Context Assembly Service

A new ContextAssemblyService between retrieval and LLM generation:

Expand: Fetch ±1 adjacent chunks per document (batched DB query)
Deduplicate: Strip ~70-token overlap between consecutive chunks
Group: Merge chunks by document, order by relevance
Budget: Cap at 4,000 tokens (later increased to 8,000 tokens), drop lowest-relevance blocks first
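
The four steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the ContextAssemblyService itself: chunks are pre-tokenized lists held in an in-memory dict instead of being fetched with a batched DB query, a document's relevance is taken to be its best hit score, and the names assemble and strip_overlap are hypothetical.

```python
def strip_overlap(prev_tokens, tokens, max_overlap=70):
    """Drop the longest prefix of `tokens` that repeats the tail of `prev_tokens`."""
    limit = min(max_overlap, len(prev_tokens), len(tokens))
    for size in range(limit, 0, -1):
        if prev_tokens[-size:] == tokens[:size]:
            return tokens[size:]
    return tokens


def assemble(hits, store, budget=4000, max_overlap=70):
    """hits: (doc_id, index, score) tuples; store: {(doc_id, index): token list}."""
    # Expand: add the +-1 neighbours of every hit that exist in the store.
    by_doc = {}
    for doc_id, index, score in hits:
        entry = by_doc.setdefault(doc_id, {"indices": set(), "score": 0.0})
        entry["score"] = max(entry["score"], score)
        for idx in (index - 1, index, index + 1):
            if (doc_id, idx) in store:
                entry["indices"].add(idx)

    # Group and deduplicate: one block per document, chunks in reading order.
    blocks = []
    for doc_id, entry in by_doc.items():
        tokens, prev_idx = [], None
        for idx in sorted(entry["indices"]):
            chunk = store[(doc_id, idx)]
            if prev_idx is not None and idx == prev_idx + 1:
                chunk = strip_overlap(store[(doc_id, prev_idx)], chunk, max_overlap)
            tokens.extend(chunk)
            prev_idx = idx
        blocks.append((entry["score"], doc_id, tokens))
    blocks.sort(key=lambda block: -block[0])  # most relevant document first

    # Budget: drop whole blocks, lowest relevance first (no truncation inside a block).
    while blocks and sum(len(block[2]) for block in blocks) > budget:
        blocks.pop()
    return blocks


store = {
    ("doc_a", 0): ["t1", "t2", "t3", "t4"],
    ("doc_a", 1): ["t3", "t4", "t5", "t6"],
    ("doc_a", 2): ["t5", "t6", "t7"],
    ("doc_b", 0): ["u1", "u2", "u3"],
}
hits = [("doc_a", 1, 0.9), ("doc_b", 0, 0.4)]
blocks = assemble(hits, store, budget=8)
print(blocks)
```

In this toy run the single hit in doc_a expands to three chunks, the two-token overlaps are stripped, and doc_b is dropped because the combined 10 tokens exceed the budget of 8.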

Proposed Pipeline

Query → Intent Classification → Hybrid Search (Vector + BM25) → Metadata Boosting
      → Context Assembly (expand + dedup + group) → LLM Generation → Evaluation

Consequences

Positive

Exact keyword matching via BM25 (names, medical terms)
Coherent context blocks instead of isolated fragments
Canonical questions bridge vocabulary gap between user queries and content
Working DeepEval quality metrics
No new infrastructure -- BM25 via PostgreSQL, context assembly is pure logic
All features individually configurable

Negative

BM25 adds ~20-50ms (mitigated by parallel execution)
Context assembly adds ~50-65ms
Canonical question generation adds ~1s per chunk during ingestion (non-blocking)
tsvector column adds ~10-20% to row size
Migration required for existing chunks
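
The parallel execution mentioned for BM25 could look like the sketch below, which assumes an asyncio-based service. Both search coroutines are stand-ins that sleep instead of querying PostgreSQL, and all function names and chunk IDs are hypothetical.

```python
import asyncio


async def vector_search(query):
    await asyncio.sleep(0.02)  # stand-in for the pgvector round trip
    return ["c3", "c1", "c7"]


async def bm25_search(query):
    await asyncio.sleep(0.02)  # stand-in for the tsvector round trip
    return ["c7", "c3", "c9"]


async def hybrid_candidates(query):
    # Both searches run concurrently, so the added latency is bounded by
    # the slower of the two calls instead of their sum.
    vector_hits, bm25_hits = await asyncio.gather(
        vector_search(query), bm25_search(query)
    )
    return vector_hits, bm25_hits


vector_hits, bm25_hits = asyncio.run(hybrid_candidates("cardioversie"))
print(vector_hits, bm25_hits)
```

asyncio.gather returns results in the order the awaitables were passed, so the two lists can be unpacked positionally.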

Alternatives Considered

Alternative	Why Rejected
Elasticsearch/Typesense	Additional infrastructure for ~550 chunks is overkill
Client-side BM25 (rank-bm25)	O(N) scan per query, does not scale
Larger chunks instead of context assembly	Reduces embedding precision
LLM query expansion	Adds latency to every query; canonical questions are amortized

Verification

Query "cardioversie" → BM25 score > 0, exact match found
Query "Dr. Vanderstraeten" → exact match via BM25
7 retrieved chunks → expanded to ~12, overlaps stripped, total ≤ 4,000 tokens (later increased to 8,000 tokens)
Debug panel shows search_method, bm25_matches, chunk expansion stats
DeepEval logs show real faithfulness/relevancy scores

References

RAGFlow. (2025). Architecture for a serious RAG system. https://www.ragflow.io/blog/rag-review-2025-from-rag-to-context
The PostgreSQL Global Development Group. (2024). Full text search. In PostgreSQL 16 Documentation. https://www.postgresql.org/docs/current/textsearch.html
