ADR-007: RAG Pipeline Enrichment
Date: 2026-02-07 | Status: Accepted
BM25 Hybrid Search, Context Assembly, Canonical Questions, and DeepEval Fix.
Context
Analysis of the RAGFlow article "Architecture for a Serious RAG System" (2025) revealed three major gaps in our RAG query pipeline (Lewis et al., 2020), plus a broken dependency:
- Vector-only retrieval -- Our search relies entirely on semantic similarity (pgvector). Exact keyword matches (e.g., "Dr. Vanderstraeten", "cardioversie") often score low on cosine similarity despite being the most relevant results.
- No context assembly -- Retrieved chunks are 350-token fragments. The LLM receives isolated snippets instead of coherent sections. The article's key insight: "What you retrieve is not what the model reads."
- No canonical questions -- Chunks about visiting hours do not contain the phrasing "Wat zijn de bezoekuren?" that users actually type.
- DeepEval not installed -- The evaluator catches ImportError and returns None scores, which are then converted to 0.01 (1%). Background quality evaluation has been silently broken.
Decision
1. Fix DeepEval
Add deepeval>=2.0.0 to pyproject.toml so background quality evaluation produces real scores.
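The silent failure described in the Context section can be reproduced in a few lines. This is a minimal sketch, not the project's evaluator: the function names `run_quality_eval` and `to_stored_score` and the `compute` callback are hypothetical, and only the `ImportError` handling and the 0.01 value come from this ADR.

```python
import importlib
from typing import Callable, Optional


def run_quality_eval(compute: Callable[[], float],
                     module_name: str = "deepeval") -> Optional[float]:
    """Return a quality score, or None when the evaluation library is missing."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Silent path: the caller sees None, never an exception.
        return None
    return compute()


def to_stored_score(score: Optional[float], floor: float = 0.01) -> float:
    """Convert a possibly missing score to the stored value (None becomes floor)."""
    return floor if score is None else max(score, floor)
```

With the package absent, every evaluation is stored as 0.01, which is indistinguishable from a genuinely poor answer. Declaring the dependency removes the `None` path.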
2. BM25 via PostgreSQL tsvector
Add a search_vector column (TSVECTOR) with GIN index to document_chunks. Hybrid scoring now uses Reciprocal Rank Fusion (RRF, k=60) — see ADR-0020.
- Uses the 'simple' text configuration (no stemming -- better for Dutch medical terms)
- Replaces the title_keywords boost (15%), which was a fragile proxy for keyword matching
- No new infrastructure required
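For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the standard formula with k=60: each chunk receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. The function name and the chunk IDs are illustrative assumptions; the actual implementation is covered by ADR-0020 and may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]],
                           k: int = 60) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked results from the two retrievers.
vector_ranking = ["c3", "c1", "c7"]
bm25_ranking = ["c1", "c9"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Here `c1` ranks first in the fused list because it appears in both rankings. RRF works on ranks rather than raw scores, so cosine similarity and BM25 scores need no normalization against each other.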
3. Per-Chunk Canonical Questions
During ingestion, the Tier 2 (standard) model generates 1-2 Dutch questions per chunk. These are stored in chunk_metadata.canonical_questions and included in the tsvector for BM25 matching.
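One way the questions could reach the BM25 index is to concatenate them with the chunk text before the tsvector is computed. The sketch below assumes `chunk_metadata` is a dict-like JSON object with a `canonical_questions` list, as named above; the helper `build_search_text` and the sample chunk text are hypothetical.

```python
from typing import Any, Dict, List


def build_search_text(content: str, chunk_metadata: Dict[str, Any]) -> str:
    """Join chunk text and its canonical questions for full-text indexing."""
    questions: List[str] = chunk_metadata.get("canonical_questions") or []
    parts = [content.strip()]
    # Skip empty or whitespace-only questions the model may have produced.
    parts += [q.strip() for q in questions if q and q.strip()]
    return "\n".join(parts)


# Illustrative chunk: the text never uses the word "bezoekuren".
search_text = build_search_text(
    "Bezoek is mogelijk tussen 14u en 20u.",
    {"canonical_questions": ["Wat zijn de bezoekuren?"]},
)
```

Because the 'simple' configuration does not stem, the question should use the exact word forms users are expected to type.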
4. Context Assembly Service
A new ContextAssemblyService between retrieval and LLM generation:
- Expand: Fetch ±1 adjacent chunks per document (batched DB query)
- Deduplicate: Strip ~70-token overlap between consecutive chunks
- Group: Merge chunks by document, order by relevance
- Budget: Cap at 4,000 tokens (later increased to 8,000 tokens), drop lowest-relevance blocks first
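The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the ContextAssemblyService itself: the `Chunk` dataclass, the in-memory `store` standing in for the batched DB query, pre-tokenized text, and exact-match overlap detection are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    index: int          # position of the chunk within its document
    tokens: List[str]   # pre-tokenized text (tokenizer is out of scope)
    score: float = 0.0  # retrieval relevance; 0.0 for expanded neighbors


def expand(hits: List[Chunk], store: Dict[Tuple[str, int], Chunk],
           window: int = 1) -> List[Chunk]:
    """Add the +/- window neighbors of each hit, keeping hits as-is."""
    selected = {(c.doc_id, c.index): c for c in hits}
    for hit in hits:
        for offset in range(-window, window + 1):
            key = (hit.doc_id, hit.index + offset)
            if key in store and key not in selected:
                selected[key] = store[key]
    return list(selected.values())


def strip_overlap(prev: List[str], curr: List[str]) -> List[str]:
    """Drop the longest prefix of curr that repeats the suffix of prev."""
    for size in range(min(len(prev), len(curr)), 0, -1):
        if prev[-size:] == curr[:size]:
            return curr[size:]
    return curr


def assemble(hits: List[Chunk], store: Dict[Tuple[str, int], Chunk],
             budget: int = 4000) -> List[Tuple[float, str, List[str]]]:
    """Return (score, doc_id, tokens) blocks, most relevant first."""
    by_doc: Dict[str, List[Chunk]] = {}
    for chunk in expand(hits, store):
        by_doc.setdefault(chunk.doc_id, []).append(chunk)
    blocks = []
    for doc_id, chunks in by_doc.items():
        chunks.sort(key=lambda c: c.index)
        tokens: List[str] = []
        prev = None
        for chunk in chunks:
            # Only consecutive chunks share an overlap worth stripping.
            adjacent = prev is not None and chunk.index == prev.index + 1
            tokens += strip_overlap(prev.tokens, chunk.tokens) if adjacent else chunk.tokens
            prev = chunk
        blocks.append((max(c.score for c in chunks), doc_id, tokens))
    blocks.sort(key=lambda b: b[0], reverse=True)
    # Budget: drop the lowest-relevance block until the total fits.
    while blocks and sum(len(b[2]) for b in blocks) > budget:
        blocks.pop()
    return blocks
```

In this sketch a block's relevance is the best score among its chunks, and a single block larger than the budget is dropped entirely. A production version would more likely truncate it.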
Proposed Pipeline
Query → Intent Classification → Hybrid Search (Vector + BM25) → Metadata Boosting
→ Context Assembly (expand + dedup + group) → LLM Generation → Evaluation
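The hybrid search stage is where the two retrievers can run side by side, which is the parallel-execution mitigation noted under Consequences. A minimal sketch with `asyncio.gather` follows; the `Retriever` type and `hybrid_retrieve` function are hypothetical names, and real retrievers would issue the pgvector and tsvector queries.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# A retriever takes the query text and returns ranked chunk IDs.
Retriever = Callable[[str], Awaitable[List[str]]]


async def hybrid_retrieve(query: str,
                          vector_search: Retriever,
                          bm25_search: Retriever) -> Tuple[List[str], List[str]]:
    """Run both retrievers concurrently and return their ranked chunk IDs."""
    vector_ids, bm25_ids = await asyncio.gather(
        vector_search(query),
        bm25_search(query),
    )
    return vector_ids, bm25_ids
```

When both calls are I/O-bound, the stage takes roughly as long as the slower retriever rather than the sum of the two.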
Consequences
Positive
- Exact keyword matching via BM25 (names, medical terms)
- Coherent context blocks instead of isolated fragments
- Canonical questions bridge vocabulary gap between user queries and content
- Working DeepEval quality metrics
- No new infrastructure -- BM25 via PostgreSQL, context assembly is pure logic
- All features individually configurable
Negative
- BM25 adds ~20-50ms (mitigated by parallel execution)
- Context assembly adds ~50-65ms
- Canonical question generation adds ~1s per chunk during ingestion (non-blocking)
- tsvector column adds ~10-20% to row size
- Migration required for existing chunks
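The migration for existing chunks could look roughly like the statements generated below. This is a sketch only: the helper `build_bm25_migration`, the `content` column name, and the index name are assumptions, and the backfill shown covers chunk text but not canonical questions.

```python
from typing import List


def build_bm25_migration(table: str = "document_chunks",
                         text_column: str = "content",
                         config: str = "simple") -> List[str]:
    """Return PostgreSQL statements that add, backfill, and index search_vector.

    Identifiers are interpolated directly, so pass trusted constants only.
    """
    index_name = f"idx_{table}_search_vector"
    return [
        # 1. Add the column; existing rows start with NULL.
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS search_vector tsvector",
        # 2. Backfill existing chunks with the non-stemming configuration.
        f"UPDATE {table} SET search_vector = "
        f"to_tsvector('{config}', coalesce({text_column}, ''))",
        # 3. GIN index so full-text matches avoid a sequential scan.
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING GIN (search_vector)",
    ]


statements = build_bm25_migration()
```

On a table of a few hundred chunks the backfill can run as a single UPDATE. A much larger table would more likely be backfilled in batches.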
Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Elasticsearch/Typesense | Additional infrastructure for ~550 chunks is overkill |
| Client-side BM25 (rank-bm25) | O(N) scan per query, does not scale |
| Larger chunks instead of context assembly | Reduces embedding precision |
| LLM query expansion | Adds latency to every query; canonical questions are amortized |
Verification
- Query "cardioversie" → BM25 score > 0, exact match found
- Query "Dr. Vanderstraeten" → exact match via BM25
- 7 retrieved chunks → expanded to ~12, overlaps stripped, total ≤ 4,000 tokens (later increased to 8,000 tokens)
- Debug panel shows search_method, bm25_matches, chunk expansion stats
- DeepEval logs show real faithfulness/relevancy scores
References
- RAGFlow. (2025). Architecture for a serious RAG system. https://www.ragflow.io/blog/rag-review-2025-from-rag-to-context
- The PostgreSQL Global Development Group. (2024). Full text search. In PostgreSQL 16 Documentation. https://www.postgresql.org/docs/current/textsearch.html