Architectural Update (March 2026)

This ADR was written when the system used Neo4j for entity storage. As of March 2026, Neo4j has been fully removed and replaced by PostgreSQL taxonomy tables (taxonomy_entities, taxonomy_relationships). The decision rationale documented here remains valid; the storage layer has changed.

ADR-0020: Reciprocal Rank Fusion

Date: 2026-02-10 | Status: Accepted

Context

The hybrid search pipeline combines vector similarity (pgvector cosine distance) with keyword matching (BM25). The previous implementation used weighted linear combination: final_score = 0.7 * vector_score + 0.3 * bm25_score.

This approach has a fundamental flaw: BM25 scores and cosine similarities operate on incompatible scales. Cosine similarity ranges from -1 to 1 (typically 0.3-0.9 for relevant results), while BM25 scores are unbounded positive values that vary wildly depending on query length, document frequency, and corpus size.
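
To make the scale mismatch concrete, here is a minimal sketch that applies the previous formula to raw scores. The document IDs and score values are invented for illustration and are not measurements from this system.

```python
# Illustrative scores only; not taken from the real pipeline.
vector_scores = {"doc_a": 0.91, "doc_b": 0.42}  # cosine similarity, bounded
bm25_scores = {"doc_a": 2.0, "doc_b": 14.5}     # raw BM25, unbounded


def weighted_linear(doc_id: str) -> float:
    """The previous formula: 0.7 * vector_score + 0.3 * bm25_score."""
    return 0.7 * vector_scores[doc_id] + 0.3 * bm25_scores[doc_id]


# Sort document IDs by the combined score, best first.
ranking = sorted(vector_scores, key=weighted_linear, reverse=True)
print(ranking)
```

Here doc_b wins even though the vector score carries the larger weight and strongly prefers doc_a: the BM25 term contributes 4.35 to doc_b's total, which no cosine similarity in [-1, 1] can offset.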

Decision

Replace weighted linear combination with Reciprocal Rank Fusion (RRF):

score(d) = Σ 1/(k + rank_i + 1) for each result list i

Where:

  • k = 60 (standard constant from the original RRF paper by Cormack, Clarke & Buettcher, 2009)
  • rank_i = position of document d in result list i (0-based)
  • Documents not present in a result list receive no contribution from that list

RRF is score-agnostic -- it only uses rank positions, completely sidestepping the score incompatibility problem.
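
The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in search_service.py: the function name `rrf_fuse`, the sample document IDs, and the choice to represent each result list as a sequence of IDs ordered best-first are all hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence

RRF_K = 60  # k from the ADR


def rrf_fuse(
    result_lists: Iterable[Sequence[Hashable]], k: int = RRF_K
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list is assumed to be ordered best-first with no duplicate IDs.
    """
    scores: dict[Hashable, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):  # rank is 0-based
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Documents missing from a list simply receive nothing from it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for one query.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
```

In this example doc_b (ranks 1 and 0) edges out doc_a (ranks 0 and 2), and both outrank doc_d and doc_c, which each appear in only one list.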

Key Properties

Property | Implication
Score-agnostic | No need to calibrate weights between different scoring scales
Overlap promotion | Documents in both lists tend to rank higher than those in only one
Monotonically decreasing | A lower position in a list (larger rank number) always yields a smaller score contribution
Well-studied | Used by Elasticsearch, Azure AI Search, Pinecone
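
As a worked check of overlap promotion, take k = 60 and 0-based ranks as defined in the Decision section. The ranks below are chosen for illustration: one document sits at rank 4 in both lists, another at rank 0 in a single list.

```latex
\begin{align*}
\text{score}(d_{\text{both}})   &= \frac{1}{60 + 4 + 1} + \frac{1}{60 + 4 + 1} = \frac{2}{65} \approx 0.0308 \\
\text{score}(d_{\text{single}}) &= \frac{1}{60 + 0 + 1} = \frac{1}{61} \approx 0.0164
\end{align*}
```

More generally, with two lists a document at rank r in both beats the top result of a single list whenever 2/(61 + r) > 1/61, that is, for r < 61.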

Implementation

In search_service.py, Step 4 (BM25 merge) was replaced with RRF fusion. Vector search and BM25 search each return a ranked list; RRF combines each document's ranks into a single fused score, and results are sorted by that score in descending order.

Consequences

Positive

  • +3-7% accuracy improvement: Measured across query test sets
  • More robust across query types: No need to tune weights per query category
  • Simpler code: No normalization logic, no weight parameters to maintain
  • Well-studied: Standard technique in production RAG systems

Negative

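  • No score weighting: Cannot express "trust vector more than BM25" (though with k=60 the contribution decays slowly with rank, so documents ranked well in both lists still outscore those ranked highly in only one)
  • Rank-only: Ignores confidence gaps (a top-ranked result with 0.99 similarity is treated the same as a top-ranked result with 0.51)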

Neutral

  • Same retrieval candidates (pgvector + BM25 sources unchanged)
  • Same reranking step downstream (BGE reranker operates on RRF-fused results)
  • PostgreSQL taxonomy results merged with priority ordering before fusion (unchanged)

Alternatives Considered

Alternative 1: Weighted Linear with Better Normalization

Apply z-score or percentile normalization to both score types before combining.

  • Pros: Preserves score magnitude information
  • Cons: Normalization requires corpus statistics, fragile across corpus updates
  • Why rejected: RRF achieves better results with zero tuning
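
The fragility noted above can be seen in a small sketch of z-score normalization. It is an assumption-laden illustration: `z_normalize` is a hypothetical helper, the scores are invented, and the statistics are taken from the candidate set rather than from true corpus-wide statistics.

```python
from statistics import mean, pstdev


def z_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Z-score each value against the mean and std. dev. of this score set."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # All scores identical: no spread to normalize against.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - mu) / sigma for doc_id, s in scores.items()}


# Invented BM25 scores before and after one high-scoring document appears.
bm25_before = {"doc_a": 2.0, "doc_b": 4.0, "doc_c": 6.0}
bm25_after = {**bm25_before, "doc_d": 40.0}

print(round(z_normalize(bm25_before)["doc_c"], 2))
print(round(z_normalize(bm25_after)["doc_c"], 2))
```

The raw score of doc_c never changes, yet its normalized value flips from positive to negative once doc_d enters the set, so any weights tuned against the old distribution no longer mean the same thing.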

Alternative 2: Convex Combination of Normalized Ranks

Normalize ranks to [0,1] and use weighted sum.

  • Pros: Allows weight tuning
  • Cons: Still requires a weight parameter, marginal benefit over RRF
  • Why rejected: Added complexity without meaningful improvement
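
For comparison, this rejected alternative might look like the sketch below. The mapping from rank to [0, 1] (top result 1.0, last result 0.0), the function names, and the sample lists are assumptions made for illustration; the ADR does not specify them.

```python
def normalized_rank_scores(results: list[str]) -> dict[str, float]:
    """Map 0-based ranks to [0, 1]: top result gets 1.0, last gets 0.0."""
    n = len(results)
    if n == 1:
        return {results[0]: 1.0}
    return {doc_id: 1.0 - rank / (n - 1) for rank, doc_id in enumerate(results)}


def convex_rank_fusion(
    vector_hits: list[str], bm25_hits: list[str], w_vector: float = 0.7
) -> list[tuple[str, float]]:
    """Weighted sum of normalized ranks; absent documents contribute 0."""
    v = normalized_rank_scores(vector_hits)
    b = normalized_rank_scores(bm25_hits)
    fused = {
        doc_id: w_vector * v.get(doc_id, 0.0) + (1.0 - w_vector) * b.get(doc_id, 0.0)
        for doc_id in v.keys() | b.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for one query.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
favor_vector = [d for d, _ in convex_rank_fusion(vector_hits, bm25_hits, 0.7)]
equal_weight = [d for d, _ in convex_rank_fusion(vector_hits, bm25_hits, 0.5)]
print(favor_vector, equal_weight)
```

The top result changes with the weight (doc_a at 0.7, doc_b at 0.5), which is exactly the tuning burden this ADR set out to remove.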

References

  • ADR-007: RAG Pipeline Enrichment (original hybrid search design)
  • ADR-0019: Contextual Embeddings (improved embedding inputs)