This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.
ADR-0020: Reciprocal Rank Fusion
Date: 2026-02-10 | Status: Accepted
Context
The hybrid search pipeline combines vector similarity (pgvector cosine distance) with keyword matching (BM25). The previous implementation used weighted linear combination: final_score = 0.7 * vector_score + 0.3 * bm25_score.
This approach has a fundamental flaw: BM25 scores and cosine similarities operate on incompatible scales. Cosine similarity ranges from -1 to 1 (typically 0.3-0.9 for relevant results), while BM25 scores are unbounded positive values that vary wildly depending on query length, document frequency, and corpus size.
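The scale mismatch can be illustrated with a small sketch. The formula is the one quoted above; the document scores are hypothetical values chosen only to show the effect, not measurements from this system.

```python
def weighted_linear(vector_score: float, bm25_score: float) -> float:
    """Previous fusion rule: 0.7 * vector_score + 0.3 * bm25_score."""
    return 0.7 * vector_score + 0.3 * bm25_score


# Hypothetical scores (illustrative only).
# doc_a: strong semantic match, weak keyword match.
# doc_b: weak semantic match, strong keyword match.
doc_a = weighted_linear(vector_score=0.90, bm25_score=2.0)
doc_b = weighted_linear(vector_score=0.35, bm25_score=14.0)

# The vector term can contribute at most 0.7, while the BM25 term has no
# upper bound, so the nominal 70/30 weighting does not hold in practice.
max_vector_contribution = 0.7 * 1.0
bm25_contribution_b = 0.3 * 14.0

print(f"doc_a={doc_a:.3f} doc_b={doc_b:.3f}")
```

With these inputs, doc_b outranks doc_a even though the vector signal carries the larger weight, because the unbounded BM25 term dominates the sum.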
Decision
Replace weighted linear combination with Reciprocal Rank Fusion (RRF):
score(d) = Σ 1/(k + rank_i + 1) for each result list i
Where:
- k = 60 (standard constant from the original RRF paper by Cormack, Clarke & Buettcher, 2009)
- rank_i = position of document d in result list i (0-based)
- Documents not present in a result list receive no contribution from that list
RRF is score-agnostic -- it only uses rank positions, completely sidestepping the score incompatibility problem.
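The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code in search_service.py: the function name `rrf_fuse`, the document IDs, and the assumption that each result list contains a document at most once are all hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def rrf_fuse(
    result_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists of document IDs using score(d) = sum of 1/(k + rank_i + 1)."""
    scores: dict[Hashable, float] = defaultdict(float)
    for results in result_lists:
        # enumerate() yields 0-based ranks, matching the definition above.
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Documents missing from a list simply receive no contribution from it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked outputs from the two retrievers (best match first).
vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]

fused = rrf_fuse([vector_hits, bm25_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

In this example d1 and d3 appear in both lists and therefore rank ahead of d2 and d4, which appear in only one.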
Key Properties
| Property | Implication |
|---|---|
| Score-agnostic | No need to calibrate weights between different scoring scales |
| Overlap promotion | Documents in both lists rank higher than those in only one |
| Monotonically decreasing | A larger rank number (a lower position in the list) always yields a smaller score contribution |
| Well-studied | Used by Elasticsearch, Azure AI Search, Pinecone |
Implementation
In search_service.py, Step 4 (BM25 merge) was replaced with RRF fusion. Vector search and BM25 search each return ranked lists, and RRF combines ranks into a single score sorted descending.
Consequences
Positive
- +3-7% accuracy improvement: Measured across query test sets
- More robust across query types: No need to tune weights per query category
- Simpler code: No normalization logic, no weight parameters to maintain
- Well-studied: Standard technique in production RAG systems
Negative
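- No score weighting: Cannot express "trust vector more than BM25" (though k=60 flattens the score curve, so a document ranked consistently well in both lists tends to beat one ranked very high in only one list)
- Rank-only: Ignores confidence gaps (a top-ranked result with 0.99 similarity and a top-ranked result with 0.51 similarity receive the same contribution)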
Neutral
- Same retrieval candidates (pgvector + BM25 sources unchanged)
- Same reranking step downstream (BGE reranker operates on RRF-fused results)
- PostgreSQL taxonomy results merged with priority ordering before fusion (unchanged)
Alternatives Considered
Alternative 1: Weighted Linear with Better Normalization
Apply z-score or percentile normalization to both score types before combining.
- Pros: Preserves score magnitude information
- Cons: Normalization requires corpus statistics, fragile across corpus updates
- Why rejected: RRF achieves better results with zero tuning
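For comparison, the rejected z-score approach might look roughly like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, the statistics are computed over the returned candidates as a stand-in for the corpus statistics this alternative would need, and documents missing from one list are assigned that list's lowest z-score, which is one arbitrary choice among several.

```python
from statistics import mean, pstdev


def z_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to zero mean and unit variance (expects a non-empty dict)."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores identical: no ranking signal, so map everything to 0.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - mu) / sigma for doc, s in scores.items()}


def weighted_z_fusion(
    vector_scores: dict[str, float],
    bm25_scores: dict[str, float],
    w_vector: float = 0.7,
) -> list[str]:
    """Weighted sum of z-normalized scores; returns document IDs, best first."""
    vz = z_normalize(vector_scores)
    bz = z_normalize(bm25_scores)
    # Assumption: a document absent from a list gets that list's lowest z-score.
    v_floor, b_floor = min(vz.values()), min(bz.values())
    fused = {
        doc: w_vector * vz.get(doc, v_floor) + (1 - w_vector) * bz.get(doc, b_floor)
        for doc in set(vz) | set(bz)
    }
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical raw scores (illustrative only).
ranking = weighted_z_fusion(
    {"a": 0.9, "b": 0.5, "c": 0.4},
    {"a": 2.0, "b": 14.0, "c": 8.0},
)
print(ranking)
```

The sketch still carries a weight parameter and depends on the score distribution of each list, which is the fragility noted in the cons above.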
Alternative 2: Convex Combination of Normalized Ranks
Normalize ranks to [0,1] and use weighted sum.
- Pros: Allows weight tuning
- Cons: Still requires a weight parameter, marginal benefit over RRF
- Why rejected: Added complexity without meaningful improvement
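A rough sketch of this alternative, under stated assumptions: ranks are mapped linearly so the top result scores 1.0 and the last scores 0.0, documents missing from a list score 0.0 for that list, and the function names and the 0.7 weight are illustrative rather than taken from the codebase.

```python
def normalized_rank_scores(results: list[str]) -> dict[str, float]:
    """Map 0-based ranks to [0, 1], with 1.0 for the top result."""
    n = len(results)
    if n == 1:
        return {results[0]: 1.0}
    return {doc: 1.0 - rank / (n - 1) for rank, doc in enumerate(results)}


def convex_rank_fusion(
    vector_hits: list[str], bm25_hits: list[str], w_vector: float = 0.7
) -> list[tuple[str, float]]:
    """Convex combination of normalized ranks; returns (doc, score), best first."""
    v = normalized_rank_scores(vector_hits)
    b = normalized_rank_scores(bm25_hits)
    # dict.fromkeys keeps first-seen order while removing duplicates.
    docs = list(dict.fromkeys([*vector_hits, *bm25_hits]))
    fused = {
        doc: w_vector * v.get(doc, 0.0) + (1 - w_vector) * b.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked outputs (best match first).
fused_ranks = convex_rank_fusion(["d1", "d2", "d3"], ["d3", "d1", "d4"])
fused_rank_ids = [doc for doc, _ in fused_ranks]
print(fused_rank_ids)
```

Unlike RRF, the result here depends on the chosen weight and on list length, since the normalization stretches each list to the same [0, 1] range.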
References
- Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proceedings of SIGIR 2009, 758--759. https://doi.org/10.1145/1571941.1572114
- ParadeDB. (2024). Hybrid search in PostgreSQL: The missing manual. https://www.paradedb.com/blog/hybrid-search-in-postgresql-the-missing-manual