Reranking

Retrieval finds candidate answers quickly; reranking decides their order before the language model ever sees them. Because the model is only shown the top handful of chunks, the order is not cosmetic — it determines which evidence grounds the answer. This page explains what reranking is, where it sits in the ZOL pipeline, the three rerank steps that run in sequence, and the empirical evaluation that fixed the current configuration.

1. What reranking is, and why it exists

Modern retrieval uses a two-stage retrieve-then-rerank design (Nogueira & Cho 2019). The two stages have deliberately opposite priorities:

| Stage | Optimises for | How | Cost |
| --- | --- | --- | --- |
| First-stage retrieval | Recall: don't miss the right chunk | Bi-encoder vector search + BM25, fused with RRF | Cheap; runs over the whole corpus |
| Reranking | Precision: put the best chunk first | Heavier models that score each candidate against the query | Expensive; runs over a small shortlist only |

The economic logic is a funnel: a cheap retriever casts a wide net so that the right chunk lands somewhere in ~20 candidates; an expensive reranker then studies that shortlist closely to get the ordering right. The expensive model never runs over the whole corpus, only over the shortlist.
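The first-stage fusion can be illustrated with a minimal sketch of Reciprocal Rank Fusion (RRF). The function names, the constant k=60 (a commonly used default) and the toy chunk ids are assumptions for illustration, not ZOL's actual implementation.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every chunk it contains.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def shortlist(vector_hits, bm25_hits, size=20):
    """Keep only the top `size` fused candidates for the reranker."""
    return rrf_fuse([vector_hits, bm25_hits])[:size]

# Toy rankings (hypothetical chunk ids), best hit first.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
fused = shortlist(vector_hits, bm25_hits)
print(fused)
```

Chunks found by both retrievers (c1, c3) rise above chunks found by only one, which is the behaviour that makes the shortlist a good input for the more expensive reranker.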

A weak reranker undermines everything upstream: even with perfect retrieval, poor ordering means the model receives second-best evidence and produces a second-best answer.

Why this is hard for Dutch medical content

The reranker must judge relevance in a language and domain that general-purpose models handle poorly:

  • Morphology — Dutch compounds such as hartchirurgie (heart-surgery) or bloeddrukverlagend (blood-pressure-lowering) only match well with models trained on Dutch or multilingual corpora.
  • Mixed vocabulary — clinical text blends Dutch, Latin and English (hypertensie, psoriasis, ECG).
  • Register gap — patients write colloquially (hoge bloeddruk) while content is clinical (Hypertensie).
  • Low resource — there is no Dutch equivalent of MS MARCO, so English reranker benchmarks do not transfer (Sections 5 and 6 show a model ranked #2 on the English leaderboard scoring last on our corpus).

2. Where reranking sits in the pipeline

Reranking is not a standalone service; it is the bridge between retrieval and context assembly inside the query pipeline. The pipeline is organised as numbered stages, and reranking spans the end of the retrieval stage and a dedicated step immediately after it.

The two facts that the old version of this page got wrong, now stated correctly:

  1. The cross-encoder runs first, the Value Framework runs last. In code, _qs_execute_retrieval_stage runs hybrid search then the cross-encoder rerank (Stage 5); the Value Framework affinity rerank (apply_intent_category_affinity, Stage 5b) fires afterwards and re-sorts. So relevance is scored first, then category-appropriateness has the final say.
  2. "Stage 5b" is just the pipeline's label for the Value Framework step — it is not a prerequisite the reader needs to know in advance; it simply means "the step right after retrieval-and-rerank."

3. The rerank stack: three sequential steps

ZOL extends the textbook two-stage shape into three rerank steps, applied in this order:

| Order | Step | Always on? | What it does |
| --- | --- | --- | --- |
| 1 | Cross-encoder relevance rerank | Yes (full / escalated mode) | Scores each (query, chunk) pair jointly for fine-grained relevance |
| 2 | ColBERT late-interaction | No; feature-flagged off | Token-level multi-vector matching for precision-critical queries |
| 3 | Value Framework category rerank | Yes (no flag) | Multiplies each score by an intent × content-category affinity, then re-sorts |

Step 1 — Cross-encoder relevance rerank

A first-stage retriever uses a bi-encoder: the query and each document are embedded separately, and similarity is a cheap vector comparison. A cross-encoder instead feeds the query and a candidate together into one model, so every word of the query can attend to every word of the document. This joint attention is far more accurate for fine-grained relevance — and far too slow to run over a whole corpus, which is exactly why it belongs in the rerank stage over a 20-candidate shortlist.

ZOL uses Jina Reranker v2 (API) as primary, with BAAI/bge-reranker-v2-m3 (local) as an automatic fallback when the API is unavailable (see Section 5 for why these specific models).
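The primary-plus-fallback arrangement can be sketched as follows. This is a minimal illustration, not the production code: the two scorers are hypothetical stand-ins (one simulates an unavailable API, the other uses simple token overlap instead of a real cross-encoder), and the function names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes the query and the candidate texts and returns one score each.
Scorer = Callable[[str, Sequence[str]], List[float]]

def rerank_with_fallback(query: str, chunks: Sequence[str],
                         primary: Scorer, fallback: Scorer,
                         top_k: int) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair; use the fallback if the primary fails."""
    try:
        scores = primary(query, chunks)
    except Exception:
        # Assumption: any primary failure (timeout, HTTP error) triggers the fallback.
        scores = fallback(query, chunks)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], scores[i]) for i in order[:top_k]]

def unavailable_api(query: str, chunks: Sequence[str]) -> List[float]:
    """Stand-in for an API reranker that is currently down."""
    raise ConnectionError("reranker API unavailable")

def overlap_scorer(query: str, chunks: Sequence[str]) -> List[float]:
    """Stand-in for a local model: fraction of query tokens found in the chunk."""
    q = set(query.lower().split())
    return [len(q & set(c.lower().split())) / len(q) for c in chunks]

chunks = [
    "Parkeren op campus",
    "Behandeling van hoge bloeddruk",
    "Bloeddruk meten thuis",
]
result = rerank_with_fallback("hoge bloeddruk behandeling", chunks,
                              unavailable_api, overlap_scorer, top_k=2)
print(result)
```

Passing the scorers in as callables keeps the ordering logic independent of which model produced the scores, so the same path serves both the API and the local model.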

Step 2 — ColBERT late-interaction (optional)

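A cross-encoder still compresses each document to a single relevance score. ColBERT (Khattab & Zaharia 2020) keeps one vector per token and scores by MaxSim: for each query token it takes the similarity to the best-matching document token, then sums over the query tokens, i.e. score(q, d) = Σ_i max_j (q_i · d_j), where q_i and d_j are the query and document token vectors.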

This preserves token-level distinctions a single vector hides — Dutch compounds (hartchirurgie vs hartritmestoornissen), doctor-name precision (Dr. Mullens vs Dr. Peeters), and multi-entity queries ("welke cardioloog op campus Sint-Jan doet echocardiografie?" has three entities ColBERT scores independently). Because storing per-token vectors for the whole corpus is prohibitive, ColBERT is used only as a reranker over the shortlist, and is disabled by default (colbert_reranking_enabled=False) — held in reserve for precision-critical use.
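The MaxSim scoring described above can be written in a few lines. The sketch below uses plain Python lists and hand-made two-dimensional "token vectors" purely for illustration; real ColBERT vectors come from a trained encoder and are much higher-dimensional.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector take its best
    match among the document token vectors, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy example: two query tokens pointing in orthogonal directions.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # matches only the first query token

print(maxsim(query_vecs, doc_a))
print(maxsim(query_vecs, doc_b))
```

Because every query token is matched separately, a document that covers only one of the query's entities scores lower than one that covers all of them.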

Step 3 — Value Framework category rerank

The cross-encoder answers "how relevant is this chunk?" The Value Framework answers a different question: "is this the right kind of content for this question?" It multiplies each chunk's score by an intent × content-category affinity and re-sorts, so a navigational question is answered from practical content rather than regulatory content even when both mention the same keyword. It is always on, runs last, and is the final arbiter of order. Full detail on the Value Framework page.
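A minimal sketch of the multiply-and-re-sort idea follows. The affinity values, the category labels ("practical", "regulatory") and the function name are hypothetical and chosen only to show the mechanism; the real matrix lives in the Value Framework.

```python
# Hypothetical intent x category affinities; pairs not listed default to 1.0.
AFFINITY = {
    ("navigation_or_practical_info", "practical"): 1.2,
    ("navigation_or_practical_info", "regulatory"): 0.6,
}

def apply_affinity_sketch(chunks, intent, affinity=AFFINITY, default=1.0):
    """Multiply each chunk's relevance score by its affinity, then re-sort.

    No chunk is dropped: a low affinity only pushes a chunk down the list.
    """
    rescored = [
        dict(c, score=c["score"] * affinity.get((intent, c["category"]), default))
        for c in chunks
    ]
    return sorted(rescored, key=lambda c: c["score"], reverse=True)

# After the cross-encoder, the regulatory chunk is slightly ahead.
chunks = [
    {"id": "reg-1", "category": "regulatory", "score": 0.80},
    {"id": "prac-1", "category": "practical", "score": 0.70},
]
final = apply_affinity_sketch(chunks, "navigation_or_practical_info")
print([c["id"] for c in final])
```

With these illustrative numbers the practical chunk overtakes the regulatory one (0.70 × 1.2 versus 0.80 × 0.6), while both remain in the candidate list.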

Two different "rerankers", one common confusion

The cross-encoder decides relevance (how well a chunk matches the query). The Value Framework decides appropriateness (whether the chunk's category fits the question's intent). They are complementary and sequential — relevance first, appropriateness last.

The full stack, end to end

Putting retrieval and all three steps together, four behaviours of this stack carry most of its real-world effect, including the two most people miss (pinned chunks and intent-aware top-K):

  • Pinned chunks bypass reranking. Graph results and keyword_rescue hits are concatenated after reranking (s["chunks"] = pinned + reranked), so a deterministic structured answer — e.g. a doctor→department fact — is never reordered away by a relevance model that cannot see the structure.
  • The output count is intent-aware, not fixed. Standard retrieval resolves top-K from the intent classifier (_resolve_top_k_for_intent): a navigation_or_practical_info query keeps ≈12 chunks, a doctor_lookup ≈6; escalated ("Think Harder") keeps a fixed wide budget of 20.
  • Reranking is fail-open. If the reranker errors, the exception is caught and the pre-rerank order is kept (logged, non-fatal) — an outage degrades ordering, it never drops the answer.
  • Breadth queries get a diversity cap. For navigation/department_lookup intents the reranker is asked for 2× top-K, then capped at max 2 chunks per document, so the final context spreads across separate brochures instead of being dominated by one general page.
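The pinned-chunk, fail-open and diversity-cap behaviours listed above can be sketched together. This is an illustrative outline under stated assumptions: the function names, the chunk dictionaries and the choice to apply top-K to the reranked part only (leaving pinned chunks outside the budget) are hypothetical, not taken from the pipeline code.

```python
import logging
from collections import Counter

logger = logging.getLogger("rerank_sketch")

def cap_per_document(chunks, max_per_doc=2):
    """Keep at most `max_per_doc` chunks from any single source document."""
    seen = Counter()
    kept = []
    for c in chunks:
        if seen[c["doc"]] < max_per_doc:
            kept.append(c)
            seen[c["doc"]] += 1
    return kept

def assemble(pinned, candidates, rerank_fn, top_k, breadth=False):
    """Rerank candidates fail-open, optionally cap per document, prepend pinned."""
    want = top_k * 2 if breadth else top_k  # breadth queries over-fetch 2x
    try:
        reranked = rerank_fn(candidates)[:want]
    except Exception as exc:
        # Fail-open: keep the pre-rerank order instead of dropping the answer.
        logger.warning("reranker failed, keeping pre-rerank order: %s", exc)
        reranked = candidates[:want]
    if breadth:
        reranked = cap_per_document(reranked)
    return pinned + reranked[:top_k]

def by_score(chunks):
    """Stand-in reranker: sort by a precomputed score."""
    return sorted(chunks, key=lambda c: c["score"], reverse=True)

def broken(chunks):
    raise RuntimeError("reranker down")

pinned = [{"id": "g1", "doc": "graph", "score": 1.0}]
candidates = [
    {"id": "a1", "doc": "A", "score": 0.7},
    {"id": "a2", "doc": "A", "score": 0.9},
    {"id": "a3", "doc": "A", "score": 0.8},
    {"id": "b1", "doc": "B", "score": 0.5},
]
diverse = assemble(pinned, candidates, by_score, top_k=3, breadth=True)
degraded = assemble(pinned, candidates, broken, top_k=2)
print([c["id"] for c in diverse], [c["id"] for c in degraded])
```

In the breadth case the third chunk from document A is replaced by the best chunk from document B; in the failure case the pinned chunk is followed by the candidates in their original retrieval order.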

4. Current configuration

| Parameter | Value | Rationale |
| --- | --- | --- |
| Primary model | Jina Reranker v2 (API) | Winner of the Section 5 Dutch-medical benchmark; ≈10× faster than local BGE |
| Fallback model | BAAI/bge-reranker-v2-m3 (local) | Multilingual cross-encoder, supports Dutch; runs when the Jina API is unavailable |
| Fallback architecture | XLM-RoBERTa, 24-layer | 560 M parameters |
| max_length | 512 tokens | Safety net beyond character truncation |
| Content truncation | 500 characters | 11.4× speedup vs full content (ADR-0034); medical pages frontload the key facts |
| Candidates | 20 (rag_rerank_candidates) | Section 5: 20 beats 40 on MRR/NDCG while halving latency |
| Score threshold | 0.0 | No minimum-score filtering |
| ColBERT | colbert_reranking_enabled=False | Feature-flagged; reserved for the precision-critical case |
| Value Framework | Always-on | Multiplies score by the intent × category matrix; no flag |
| Fallback inference | CPU (PyTorch) | ONNX/MPS evaluated, no benefit on Apple Silicon (Section 7) |
Note: BGE-M3 is the ColBERT model, not the embedding model

Two different BGE-M3 roles are easy to conflate. The primary embedding model for first-stage vector search is OpenAI text-embedding-3-large (1,536-dim), per ADR-0048 (@openai2024embeddings). BGE-M3 (1,024-dim) survives in the stack only because it natively supports the late-interaction multi-vector mode that Step 2 (ColBERT) needs. Older "first-stage retriever (BGE-M3 embeddings)" phrasing in the Section 5 benchmark describes the state at that benchmark's date; today the first-stage retriever is text-embedding-3-large.
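For orientation, the parameters in the configuration table above could be grouped as in the sketch below. The class and field names are assumptions made for this example (only rag_rerank_candidates and colbert_reranking_enabled appear in the source); the values are the ones listed in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RerankSettings:
    """Illustrative container for the values in the configuration table."""
    primary_model: str = "Jina Reranker v2 (API)"
    fallback_model: str = "BAAI/bge-reranker-v2-m3"
    max_length_tokens: int = 512
    content_truncation_chars: int = 500
    rag_rerank_candidates: int = 20
    score_threshold: float = 0.0
    colbert_reranking_enabled: bool = False

def prepare_candidates(chunk_texts, settings):
    """Limit the shortlist size and truncate each chunk before scoring."""
    shortlist = chunk_texts[: settings.rag_rerank_candidates]
    return [text[: settings.content_truncation_chars] for text in shortlist]

settings = RerankSettings()
prepared = prepare_candidates(["x" * 800] * 25, settings)
print(len(prepared), len(prepared[0]))
```

Character truncation happens before the model sees the text, which is why the table describes the 512-token max_length as a second safety net rather than the main limit.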

Key design trade-offs

| Decision | Chosen | Rejected alternatives, and why |
| --- | --- | --- |
| Cross-encoder model | Jina Reranker v2 API | Local BGE was 10× slower with worse MRR. Cohere Rerank 4 is #2 on the English leaderboard but scored last on our Dutch corpus (MRR@10 = 0.149); English benchmarks do not transfer. |
| Candidate count | 20 | 40/50/100 scored worse on MRR/NDCG: beyond 20, extra candidates add noise, not recoverable signal, and push latency past the perceptual threshold. |
| Value Framework as a separate last step | Always-on, after the cross-encoder | Baking the affinity matrix into the cross-encoder would require re-fine-tuning per tenant; a hard category filter would drop correct chunks on a boundary misclassification. A graded multiplier lets the cross-encoder rank on pure relevance first, then re-weights by appropriateness without ever excluding a chunk outright. |
| ColBERT as optional Step 2, not primary | Feature-flagged | ColBERT-as-retriever needs per-token vectors for the whole corpus (storage-prohibitive at 10k+ chunks). As a reranker it gives the late-interaction precision benefit only where it is worth the cost. |
| Truncation length | 500 chars (configurable) | 500 chars gives an 11.4× local speedup; 1,000 chars gave a tiny MAP gain (0.2804 vs 0.2751) at 4× latency. On the Jina API path, input length matters little. |

5. The Dutch-medical evaluation (why these choices)

The configuration above was not chosen from leaderboards — it was measured on a purpose-built Dutch evaluation set. We built 118 Dutch medical queries from the golden-question corpus, each with expected_source_urls, yielding 695 positive and 2,360 negative chunks as query/positive/negative triples (scripts/build_reranker_eval_set.py, scripts/benchmark_rerankers.py).

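| Model | Truncation | Candidates | MRR@10 | NDCG@10 | MAP | Avg latency |
| --- | --- | --- | --- | --- | --- | --- |
| Jina Reranker v2 | API | 20 | 0.4046 | 0.3005 | 0.3473 | ≈450 ms |
| Jina Reranker v2 | API | 40 | 0.3166 | 0.2118 | 0.2550 | 446 ms |
| BGE-reranker-v2-m3 | 1000 | 20 | 0.2406 | 0.1901 | 0.2804 | 4,596 ms |
| BGE-reranker-v2-m3 | 500 | 20 | 0.2247 | 0.1914 | 0.2751 | 1,503 ms |
| BGE-reranker-v2-m3 | 2000 | 20 | 0.2296 | 0.1805 | 0.2737 | 4,960 ms |
| Cohere Rerank 4 | API | 40 | 0.1493 | 0.1054 | | 892 ms |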

Jina API runs hit rate-limiting (429s) that inflated some latency measurements; ≈450 ms is the rate-limit-free figure. All BGE runs used the full 118-query set.
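The three quality metrics in the table above have standard definitions, sketched here for a single query with binary relevance (a chunk is either a positive or not). This is not the code of scripts/benchmark_rerankers.py, whose exact conventions are not shown on this page; the function names and toy ids are illustrative.

```python
import math

def reciprocal_rank(ranked, relevant, k=10):
    """1 / rank of the first relevant chunk within the top k, else 0."""
    for rank, cid in enumerate(ranked[:k], start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(ranked, relevant, k=10):
    """Binary-gain NDCG@k: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, cid in enumerate(ranked[:k], start=1) if cid in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant chunk appears."""
    hits, total = 0, 0.0
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean(values):
    """Average a per-query metric over the query set (gives MRR, mean NDCG, MAP)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# One toy query: positives p1, p2 ranked 2nd and 4th behind negatives.
ranked = ["n1", "p1", "n2", "p2"]
relevant = {"p1", "p2"}
print(reciprocal_rank(ranked, relevant), ndcg(ranked, relevant),
      average_precision(ranked, relevant))
```

MRR@10 rewards only the position of the first correct chunk, while NDCG@10 and MAP also credit the positions of the remaining positives, which is why the three columns can rank configurations slightly differently.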

What the data settled:

  1. Jina Reranker v2 wins on Dutch medical content — highest MRR@10 (0.4046) and NDCG@10 (0.3005), while ≈10× faster than local BGE.
  2. 20 candidates beats 40/50 at every truncation length — beyond 20, first-stage candidates add noise.
  3. English rankings do not transfer — Cohere Rerank 4 (English #2) scored last here. This is the single strongest argument for our evaluation-first approach.
  4. Switching to the API saved ≈4 s/query (4.6 s local → 0.45 s) — a transformative UX gain.

The production decision: Jina Reranker v2 API primary, local BGE-reranker-v2-m3 automatic fallback, candidates reduced 40 → 20.

Candidate count vs latency

| Candidates | Recall potential | Reranker latency (500-char, CPU) | Notes |
| --- | --- | --- | --- |
| 20 | Best on Dutch medical | ≈3.5 s local · ≈450 ms Jina | Current setting |
| 40 | +15–20% raw recall | ≈7.0 s | Worse MRR/NDCG than 20 |
| 50 | +18–22% | ≈8.5 s | Pre-optimisation setting |
| 100 | +25–30% | ≈17 s | Impractical for interactive use |

Nogueira et al. (2019) showed cross-encoder reranking of top-100 BM25 results lifting NDCG@10 by 15–20% on MS MARCO — but returns diminish fast beyond 50 candidates for domain-specific corpora where the first-stage retriever already has decent precision.

6. State-of-the-art context

Reranker leaderboard (Agentset, Feb 2026)

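| Rank | Model | ELO | Type | Multilingual |
| --- | --- | --- | --- | --- |
| 1 | Zerank 2 | 1638 | API | Yes |
| 2 | Cohere Rerank 4 Pro | 1629 | API | 100+ langs |
| 3 | Jina Reranker v2 | 1585 | API/Local | Yes |
| 5 | BGE Reranker v2.5 Gemma | 1498 | Local | Yes |
| 11 | BGE Reranker v2 m3 (our fallback) | 1327 | Local | Yes |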

Our local fallback ranks #11 of 12 — ≈300 ELO behind the leaders, which corresponds to roughly an 85% chance the higher-rated model wins a head-to-head on the leaderboard's (English) benchmarks. But the leaderboard evaluates English; our production primary is the API model (Jina, #3), and our own Dutch benchmark (Section 5) is what actually governs the choice. This is precisely why the Cohere result matters: leaderboard position is not a substitute for in-domain measurement.
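The "roughly 85%" figure follows from the standard Elo expected-score formula, assuming the leaderboard uses the conventional logistic curve with a 400-point scale (an assumption; the leaderboard's exact rating model is not stated here).

```python
def elo_win_probability(rating_gap):
    """Expected score of the higher-rated model for a given rating gap,
    using the standard Elo logistic curve with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

# Gap between the top model (1638) and the local fallback (1327).
gap = 1638 - 1327
print(round(elo_win_probability(300), 3))  # round 300-point gap
print(round(elo_win_probability(gap), 3))  # exact 311-point gap
```

A 300-point gap gives about 0.849 and the exact 311-point gap about 0.857, both consistent with the figure quoted above.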

Healthcare RAG literature

  • MEGA-RAG (2025) — graph traversal + dense retrieval + reranking; +18–20% over vanilla RAG on MedQA/PubMedQA.
  • MedGraphRAG (2025) — graph-guided retrieval cuts hallucination ≈40% vs pure vector search; uses no neural reranking.
  • MIRAGE (2024) (Xiong et al. 2024) — standardised medical-QA evaluation; baseline ms-marco-MiniLM reranking reaches NDCG@10 0.72, domain-adapted rerankers 0.78.

The consistent finding across this literature: the quality of the final re-scoring stage has outsized impact on answer quality — which is why ZOL invests three steps in it.

Dutch-specific gaps standard benchmarks miss

  1. Compound splitting: bloeddrukverlagend should match Hypertensie.
  2. Register mismatch: "pijn in mijn borst" must reach thoracale pijn / angina pectoris.
  3. Code-switching: Belgian patients mix Dutch and French (rendez-vous for afspraak).
  4. Abbreviations: NMR must match MRI, and CT must match computertomografie.

7. Inference-runtime notes (ADR-0034)

On Apple Silicon, alternative runtimes gave no benefit for the local fallback model:

  • ONNX Runtime — 2× slower than PyTorch (the CPU provider lacks ARM NEON optimisations; PyTorch uses Apple's Accelerate natively).
  • Apple MPS — no improvement (cross-encoder inference is memory-bound at batch 40; MPS overhead offsets GPU parallelism).

PyTorch CPU is already well-optimised here. On a Linux/CUDA production host, GPU inference would change this calculus — but that is a deployment decision, not a model one.

8. References


Related: Hybrid Search (the first stage that feeds this one) · Context Assembly (the next stage) · Value Framework (Step 3 in depth) · Query Pipeline (the full stage map).