
ADR-0034: Pipeline Latency Optimization

Status: Accepted (February 2026)

Context

The ZOL RAG pipeline (Lewis et al., 2020) targets a total response time under 7 seconds. The reranking stage, which uses a local BGE-reranker-v2-m3 cross-encoder, consumed ~1.5 seconds of this budget. Additionally, the default of 50 candidates for reranking was established during early development when retrieval quality was lower and broader candidate pools were needed for recall.

Benchmarking revealed two optimization opportunities:

  1. Reranker latency: Jina Reranker v2 API (~500ms) is 3x faster than local BGE cross-encoder inference (~1.5s) while producing comparable or superior ranking quality.
  2. Candidate count: A/B testing showed that 20 candidates produced equivalent MRR and NDCG@5 scores compared to 50 candidates, while halving reranking latency.
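For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how MRR and NDCG@5 are commonly computed from per-query relevance labels. It is not the benchmark harness used for this ADR. The function names are illustrative, the gain is assumed to be linear, and the ideal ordering is assumed to be taken over the judged result list only.

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average the reciprocal rank over all queries (one label list per query)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r) for r in runs) / len(runs)


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: Sequence[float], k: int = 5) -> float:
    """Normalize DCG@k by the DCG of the same labels in ideal (descending) order."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0
```

Comparing a 20-candidate run against a 50-candidate run then amounts to computing both metrics over the same query set for each configuration and checking that the differences fall within noise.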

Decision

  1. Switch primary reranker from local BGE-reranker-v2-m3 to Jina Reranker v2 API, retaining the local model as an automatic fallback.
  2. Reduce default candidates from 50 to 20 for normal mode (rag_rerank_candidates=20). Escalated mode ("Think Harder") retains 100 candidates with top-20 reranking.
  3. Make the reranker provider configurable via rag_reranker_provider setting ("jina" or "local").
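The decision can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: only the setting names rag_reranker_provider and rag_rerank_candidates come from this ADR. The RerankSettings class, the rerank_with_fallback function, the top_k parameter, and the two scorer callables (stand-ins for the Jina API client and the local BGE model) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# A scorer takes a query and candidate texts and returns one score per candidate.
Reranker = Callable[[str, Sequence[str]], List[float]]


@dataclass
class RerankSettings:
    rag_reranker_provider: str = "jina"  # "jina" or "local"
    rag_rerank_candidates: int = 20      # normal-mode default from this ADR


def rerank_with_fallback(
    query: str,
    candidates: Sequence[str],
    settings: RerankSettings,
    jina_rerank: Reranker,   # hypothetical wrapper around the remote API
    local_rerank: Reranker,  # hypothetical wrapper around the local cross-encoder
    top_k: int = 5,          # hypothetical; the ADR does not state the final cut-off
) -> List[str]:
    """Rerank the first N candidates, falling back to the local model on API failure."""
    pool = list(candidates[: settings.rag_rerank_candidates])
    if settings.rag_reranker_provider == "jina":
        try:
            scores = jina_rerank(query, pool)
        except Exception:
            # Any API failure (timeout, network error, bad response) triggers the fallback.
            scores = local_rerank(query, pool)
    elif settings.rag_reranker_provider == "local":
        scores = local_rerank(query, pool)
    else:
        raise ValueError(f"unknown reranker provider: {settings.rag_reranker_provider!r}")
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Passing the scorers in as callables keeps the fallback logic testable without network access. A production version would likely catch narrower exception types and log each fallback so that operators can see how often the API path fails.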

Consequences

Positive

  • ~1 second latency reduction per query in full mode (500ms vs 1.5s reranking)
  • Equivalent retrieval quality: Benchmarks confirmed no MRR/NDCG regression at 20 candidates
  • Graceful degradation: Automatic fallback to local BGE model if Jina API is unavailable
  • Configurable: Operators can switch back to local-only mode without code changes

Negative

  • External API dependency: Jina API introduces a network call (mitigated by local fallback)
  • Per-query cost: ~$0.001/query for Jina API (about $25/month at 25K queries/month, which is negligible)
  • Reduced candidate pool: Fewer candidates could miss long-tail relevant results (not observed in testing)
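The trade-off in the figures above can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted in this ADR. The function names are illustrative, and the results are estimates, not measurements.

```python
def monthly_api_cost(queries_per_month: int, cost_per_query_usd: float) -> float:
    """Estimated monthly spend on the reranking API."""
    return queries_per_month * cost_per_query_usd


def budget_share(stage_seconds: float, budget_seconds: float) -> float:
    """Fraction of the end-to-end latency budget consumed by one stage."""
    return stage_seconds / budget_seconds


# Approximate figures quoted in this ADR.
BUDGET_S = 7.0         # total response-time target
LOCAL_RERANK_S = 1.5   # local BGE cross-encoder
JINA_RERANK_S = 0.5    # Jina Reranker v2 API

cost = monthly_api_cost(25_000, 0.001)            # about 25 USD per month
saved = LOCAL_RERANK_S - JINA_RERANK_S            # about 1 second per query
before = budget_share(LOCAL_RERANK_S, BUDGET_S)   # roughly 21% of the budget
after = budget_share(JINA_RERANK_S, BUDGET_S)     # roughly 7% of the budget

print(f"cost ~${cost:.2f}/month, saved ~{saved:.1f}s, share {before:.0%} -> {after:.0%}")
```

On these figures, reranking drops from roughly a fifth of the 7-second budget to well under a tenth, at a cost of about $25 per month.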

References