Reranking
Retrieval finds candidate answers quickly; reranking decides their order before the language model ever sees them. Because the model is only shown the top handful of chunks, the order is not cosmetic — it determines which evidence grounds the answer. This page explains what reranking is, where it sits in the ZOL pipeline, the three rerank steps that run in sequence, and the empirical evaluation that fixed the current configuration.
1. What reranking is, and why it exists
Modern retrieval uses a two-stage retrieve-then-rerank design (Nogueira & Cho 2019). The two stages have deliberately opposite priorities:
| Stage | Optimises for | How | Cost |
|---|---|---|---|
| First-stage retrieval | Recall — don't miss the right chunk | Bi-encoder vector search + BM25, fused with RRF | Cheap; runs over the whole corpus |
| Reranking | Precision — put the best chunk first | Heavier models that score each candidate against the query | Expensive; runs over a small shortlist only |
The economic logic is a funnel: a cheap retriever casts a wide net so that the right chunk is very likely somewhere among ~20 candidates, and an expensive reranker then studies that shortlist closely to get the ordering right. The expensive model never runs over the whole corpus, only over the shortlist.
A weak reranker undermines everything upstream: even with perfect retrieval, poor ordering means the model receives second-best evidence and produces a second-best answer.
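The funnel described above can be sketched in a few lines. This is a minimal illustration, not ZOL's implementation: the function names are hypothetical, `k=60` is the commonly used RRF constant rather than a value confirmed on this page, and `expensive_score` stands in for whichever reranker is plugged in.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve_then_rerank(
    vector_ranking: List[str],
    bm25_ranking: List[str],
    expensive_score: Callable[[str], float],
    shortlist_size: int = 20,
) -> List[str]:
    """Stage 1: cheap fusion of wide rankings. Stage 2: costly scoring of the shortlist only."""
    shortlist = rrf_fuse([vector_ranking, bm25_ranking])[:shortlist_size]
    return sorted(shortlist, key=expensive_score, reverse=True)
```

The expensive scorer is called once per shortlisted chunk and never sees the rest of the corpus, which is the whole point of the two-stage shape.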
Why this is hard for Dutch medical content
The reranker must judge relevance in a language and domain that general-purpose models handle poorly:
- Morphology — Dutch compounds such as hartchirurgie (heart-surgery) or bloeddrukverlagend (blood-pressure-lowering) only match well with models trained on Dutch or multilingual corpora.
- Mixed vocabulary — clinical text blends Dutch, Latin and English (hypertensie, psoriasis, ECG).
- Register gap — patients write colloquially (hoge bloeddruk) while content is clinical (Hypertensie).
- Low resource — there is no Dutch equivalent of MS MARCO, so English reranker benchmarks do not transfer (Sections 5 and 6: a model ranked #2 on the English leaderboard scored last on our corpus).
2. Where reranking sits in the pipeline
Reranking is not a standalone service: it is the bridge between retrieval and context assembly inside the query pipeline. The pipeline is organised as numbered stages, and reranking spans the end of the retrieval stage and a dedicated step immediately after it.
The two facts that the old version of this page got wrong, now stated correctly:
- The cross-encoder runs first, the Value Framework runs last. In code, `_qs_execute_retrieval_stage` runs hybrid search and then the cross-encoder rerank (Stage 5); the Value Framework affinity rerank (`apply_intent_category_affinity`, Stage 5b) fires afterwards and re-sorts. So relevance is scored first, then category-appropriateness has the final say.
- "Stage 5b" is just the pipeline's label for the Value Framework step. It is not a prerequisite the reader needs to know in advance; it simply means "the step right after retrieval-and-rerank."
3. The rerank stack: three sequential steps
ZOL extends the textbook two-stage shape into three rerank steps, applied in this order:
| Order | Step | Always on? | What it does |
|---|---|---|---|
| 1 | Cross-encoder relevance rerank | Yes (full / escalated mode) | Scores each (query, chunk) pair jointly for fine-grained relevance |
| 2 | ColBERT late-interaction | No — feature-flagged off | Token-level multi-vector matching for precision-critical queries |
| 3 | Value Framework category rerank | Yes (no flag) | Multiplies each score by an intent × content-category affinity, then re-sorts |
Step 1 — Cross-encoder relevance rerank
A first-stage retriever uses a bi-encoder: the query and each document are embedded separately, and similarity is a cheap vector comparison. A cross-encoder instead feeds the query and a candidate together into one model, so every word of the query can attend to every word of the document. This joint attention is far more accurate for fine-grained relevance — and far too slow to run over a whole corpus, which is exactly why it belongs in the rerank stage over a 20-candidate shortlist.
ZOL uses Jina Reranker v2 (API) as primary, with BAAI/bge-reranker-v2-m3 (local) as an automatic fallback when the API is unavailable (see Section 5 for why these specific models).
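The primary-with-fallback arrangement can be sketched as below. This is an assumption-laden illustration: `rerank_with_fallback` and the `Scorer` type are hypothetical names, the two scorers are passed in as plain callables rather than real Jina or BGE clients, and treating any exception as "API unavailable" is a simplification.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes a query and candidate texts and returns one relevance score per text.
Scorer = Callable[[str, Sequence[str]], List[float]]


def rerank_with_fallback(
    query: str,
    chunks: Sequence[str],
    primary: Scorer,
    fallback: Scorer,
) -> List[Tuple[str, float]]:
    """Score (query, chunk) pairs with the primary scorer; use the fallback if it raises."""
    try:
        scores = primary(query, chunks)
    except Exception:
        # Hypothetical outage handling: any primary failure routes to the local model.
        scores = fallback(query, chunks)
    # Highest relevance first.
    return sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
```

Keeping both scorers behind the same callable shape means the caller does not need to know which model produced the ordering.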
Step 2 — ColBERT late-interaction (optional)
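A cross-encoder still compresses each document to a single relevance score. ColBERT (Khattab & Zaharia 2020) keeps one vector per token and scores by MaxSim: for each query token it takes the best-matching document token, then sums:

$$S(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathbf{q}_i \cdot \mathbf{d}_j$$

where $\mathbf{q}_i$ and $\mathbf{d}_j$ are the embeddings of the $i$-th query token and the $j$-th document token.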
This preserves token-level distinctions a single vector hides — Dutch compounds (hartchirurgie vs hartritmestoornissen), doctor-name precision (Dr. Mullens vs Dr. Peeters), and multi-entity queries ("welke cardioloog op campus Sint-Jan doet echocardiografie?" has three entities ColBERT scores independently). Because storing per-token vectors for the whole corpus is prohibitive, ColBERT is used only as a reranker over the shortlist, and is disabled by default (colbert_reranking_enabled=False) — held in reserve for precision-critical use.
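The MaxSim rule described above is small enough to write out directly. The sketch below uses plain Python lists as stand-ins for token embeddings; the toy two-dimensional vectors are illustrative only, and a real ColBERT model would supply normalised vectors of much higher dimension.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens; doc_b matches both exactly, doc_a only the first.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[1.0, 0.0], [0.0, 1.0]]
print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Because each query token is matched independently, a document that covers every entity in the query outscores one that matches a single entity very strongly.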
Step 3 — Value Framework category rerank
The cross-encoder answers "how relevant is this chunk?" The Value Framework answers a different question: "is this the right kind of content for this question?" It multiplies each chunk's score by an intent × content-category affinity and re-sorts, so a navigational question is answered from practical content rather than regulatory content even when both mention the same keyword. It is always on, runs last, and is the final arbiter of order. Full detail on the Value Framework page.
The cross-encoder decides relevance (how well a chunk matches the query). The Value Framework decides appropriateness (whether the chunk's category fits the question's intent). They are complementary and sequential — relevance first, appropriateness last.
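A minimal sketch of the multiply-then-re-sort step follows. It is not the production `apply_intent_category_affinity` function, whose signature is not shown on this page: the affinity values, the intent and category labels, the default weight for unknown pairs, and the chunk dictionary shape are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical affinity values; the real intent x category matrix lives on the Value Framework page.
AFFINITY: Dict[Tuple[str, str], float] = {
    ("navigation", "practical"): 1.0,
    ("navigation", "regulatory"): 0.4,
}
DEFAULT_AFFINITY = 1.0  # assumption: unknown (intent, category) pairs are left unweighted


def affinity_rerank(intent: str, chunks: List[dict]) -> List[dict]:
    """Multiply each chunk's relevance score by its intent x category affinity, then re-sort."""
    weighted = [
        {**c, "score": c["score"] * AFFINITY.get((intent, c["category"]), DEFAULT_AFFINITY)}
        for c in chunks
    ]
    return sorted(weighted, key=lambda c: c["score"], reverse=True)
```

Because the weight is a multiplier rather than a filter, a chunk in a low-affinity category moves down the list but is never removed from it.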
The full stack, end to end
The full stack combines retrieval with all three rerank steps, plus two behaviours most people miss: pinned chunks and intent-aware top-K.
Four behaviours of this stack carry most of its real-world effect:
- Pinned chunks bypass reranking. Graph results and `keyword_rescue` hits are concatenated after reranking (`s["chunks"] = pinned + reranked`), so a deterministic structured answer, e.g. a doctor→department fact, is never reordered away by a relevance model that cannot see the structure.
- The output count is intent-aware, not fixed. Standard retrieval resolves top-K from the intent classifier (`_resolve_top_k_for_intent`): a `navigation_or_practical_info` query keeps ≈12 chunks, a `doctor_lookup` ≈6; escalated ("Think Harder") keeps a fixed wide budget of 20.
- Reranking is fail-open. If the reranker errors, the exception is caught and the pre-rerank order is kept (logged, non-fatal): an outage degrades ordering but never drops the answer.
- Breadth queries get a diversity cap. For `navigation/department_lookup` intents the reranker is asked for 2× top-K, then capped at max 2 chunks per document, so the final context spreads across separate brochures instead of being dominated by one general page.
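The behaviours listed above can be combined into one short sketch. Everything here is hypothetical apart from the figures quoted on this page: `assemble`, `cap_per_document`, the intent labels in `BREADTH_INTENTS`, the chunk dictionary shape, and the default budget for intents not listed are assumptions, and the real logic lives in `_resolve_top_k_for_intent` and the retrieval stage.

```python
from collections import Counter
from typing import Callable, Dict, List

# Approximate budgets quoted above; the exact production values may differ.
TOP_K_BY_INTENT: Dict[str, int] = {"navigation_or_practical_info": 12, "doctor_lookup": 6}
# Assumed labels for the breadth intents that receive the diversity cap.
BREADTH_INTENTS = {"navigation_or_practical_info", "department_lookup"}


def cap_per_document(chunks: List[dict], max_per_doc: int = 2) -> List[dict]:
    """Keep at most max_per_doc chunks from any one source document, preserving order."""
    seen: Counter = Counter()
    kept = []
    for chunk in chunks:
        if seen[chunk["doc"]] < max_per_doc:
            seen[chunk["doc"]] += 1
            kept.append(chunk)
    return kept


def assemble(
    intent: str,
    pinned: List[dict],
    candidates: List[dict],
    rerank: Callable[[List[dict], int], List[dict]],
    default_top_k: int = 20,  # assumption: unlisted intents get the wide budget
) -> List[dict]:
    top_k = TOP_K_BY_INTENT.get(intent, default_top_k)
    breadth = intent in BREADTH_INTENTS
    ask_for = 2 * top_k if breadth else top_k
    try:
        reranked = rerank(candidates, ask_for)
    except Exception:
        # Fail-open: keep the pre-rerank order rather than dropping the answer.
        reranked = candidates[:ask_for]
    if breadth:
        reranked = cap_per_document(reranked)
    # Pinned chunks are prepended untouched, mirroring s["chunks"] = pinned + reranked.
    return pinned + reranked[:top_k]
```

Asking the reranker for twice the budget on breadth queries leaves enough headroom for the per-document cap to discard duplicates and still fill the final top-K.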
4. Current configuration
| Parameter | Value | Rationale |
|---|---|---|
| Primary model | Jina Reranker v2 (API) | Winner of the Section 5 Dutch-medical benchmark; ≈10× faster than local BGE |
| Fallback model | BAAI/bge-reranker-v2-m3 (local) | Multilingual cross-encoder, supports Dutch; runs when the Jina API is unavailable |
| Fallback architecture | XLM-RoBERTa, 24-layer | 560 M parameters |
| max_length | 512 tokens | Safety net beyond character truncation |
| Content truncation | 500 characters | 11.4× speedup vs full content (ADR-0034); medical pages frontload the key facts |
| Candidates | 20 (rag_rerank_candidates) | Section 5: 20 beats 40 on MRR/NDCG while halving latency |
| Score threshold | 0.0 | No minimum-score filtering |
| ColBERT | colbert_reranking_enabled=False | Feature-flagged; reserved for the precision-critical case |
| Value Framework | Always-on | Multiplies score by the intent × category matrix; no flag |
| Fallback inference | CPU (PyTorch) | ONNX/MPS evaluated, no benefit on Apple Silicon (Section 7) |
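For orientation, the parameters in the table above could be grouped as in the sketch below. Only `rag_rerank_candidates` and `colbert_reranking_enabled` are names taken from this page; the other field names, the `RerankConfig` class, and `prepare_candidates` are hypothetical and do not describe the actual settings module.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RerankConfig:
    # Names from this page.
    rag_rerank_candidates: int = 20
    colbert_reranking_enabled: bool = False
    # Hypothetical field names carrying the values from the table.
    content_truncation_chars: int = 500
    max_length_tokens: int = 512
    score_threshold: float = 0.0


def prepare_candidates(chunks: List[str], cfg: RerankConfig) -> List[str]:
    """Apply the shortlist size and the character truncation before scoring."""
    shortlist = chunks[: cfg.rag_rerank_candidates]
    return [text[: cfg.content_truncation_chars] for text in shortlist]
```

Character truncation happens before tokenisation, so the 512-token `max_length` only acts as a second guard for text that is unusually token-dense.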
Two different BGE-M3 roles are easy to conflate. The primary embedding model for first-stage vector search is OpenAI text-embedding-3-large (1,536-dim), per ADR-0048 (OpenAI 2024). BGE-M3 (1,024-dim) survives in the stack only because it natively supports the late-interaction multi-vector mode that Step 2 (ColBERT) needs. Older "first-stage retriever (BGE-M3 embeddings)" phrasing in the Section 5 benchmark describes the state at that benchmark's date; today the first-stage retriever is text-embedding-3-large.
Key design trade-offs
| Decision | Chosen | Rejected alternatives, and why |
|---|---|---|
| Cross-encoder model | Jina Reranker v2 API | Local BGE was 10× slower with worse MRR. Cohere Rerank 4 is #2 on the English leaderboard but scored last on our Dutch corpus (MRR@10 = 0.149) — English benchmarks do not transfer. |
| Candidate count | 20 | 40/50/100 scored worse on MRR/NDCG: beyond 20, extra candidates add noise, not recoverable signal, and push latency past the perceptual threshold. |
| Value Framework as a separate last step | Always-on, after the cross-encoder | Baking the affinity matrix into the cross-encoder would require re-fine-tuning per tenant; a hard category filter would drop correct chunks on a boundary misclassification. A graded multiplier lets the cross-encoder rank on pure relevance first, then re-weights by appropriateness without ever excluding a chunk outright. |
| ColBERT as optional Step 2, not primary | Feature-flagged | ColBERT-as-retriever needs per-token vectors for the whole corpus (storage-prohibitive at 10k+ chunks). As a reranker it gives the late-interaction precision benefit only where it is worth the cost. |
| Truncation length | 500 chars (configurable) | 500 chars gives an 11.4× local speedup over full content; 1,000 chars gave a tiny MAP gain (0.2804 vs 0.2751) at ≈3× the latency (4,596 ms vs 1,503 ms, Section 5). On the Jina API path, input length matters little. |
5. The Dutch-medical evaluation (why these choices)
The configuration above was not chosen from leaderboards — it was measured on a purpose-built Dutch evaluation set. We built 118 Dutch medical queries from the golden-question corpus, each with expected_source_urls, yielding 695 positive and 2,360 negative chunks as query/positive/negative triples (scripts/build_reranker_eval_set.py, scripts/benchmark_rerankers.py).
| Model | Truncation | Candidates | MRR@10 | NDCG@10 | MAP | Avg latency |
|---|---|---|---|---|---|---|
| Jina Reranker v2 | API | 20 | 0.4046 | 0.3005 | 0.3473 | ≈450 ms |
| Jina Reranker v2 | API | 40 | 0.3166 | 0.2118 | 0.2550 | 446 ms |
| BGE-reranker-v2-m3 | 1000 | 20 | 0.2406 | 0.1901 | 0.2804 | 4,596 ms |
| BGE-reranker-v2-m3 | 500 | 20 | 0.2247 | 0.1914 | 0.2751 | 1,503 ms |
| BGE-reranker-v2-m3 | 2000 | 20 | 0.2296 | 0.1805 | 0.2737 | 4,960 ms |
| Cohere Rerank 4 | API | 40 | 0.1493 | 0.1054 | — | 892 ms |
Jina API runs hit rate-limiting (429s) that inflated some latency measurements; ≈450 ms is the rate-limit-free figure. All BGE runs used the full 118-query set.
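For readers unfamiliar with the three metrics in the table above, the sketch below shows their textbook binary-relevance definitions. It is an illustration only: whether `scripts/benchmark_rerankers.py` uses exactly these normalisations is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked: Sequence[str], positives: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in positives:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], positives: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by DCG of the ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def average_precision(ranked: Sequence[str], positives: Set[str]) -> float:
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in positives:
            hits += 1
            total += hits / rank
    return total / len(positives) if positives else 0.0
```

Each function scores one query; the table's figures would be the mean over all queries in the evaluation set (MAP being the mean of `average_precision`).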
What the data settled:
- Jina Reranker v2 wins on Dutch medical content — highest MRR@10 (0.4046) and NDCG@10 (0.3005), while ≈10× faster than local BGE.
- 20 candidates beats 40/50 at every truncation length — beyond 20, first-stage candidates add noise.
- English rankings do not transfer — Cohere Rerank 4 (English #2) scored last here. This is the single strongest argument for our evaluation-first approach.
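- Switching to the API saved ≈4 s/query against local BGE at 1,000-character truncation (4.6 s → 0.45 s) and ≈1 s against the 500-character run (1.5 s → 0.45 s) — a large UX gain either way.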
The production decision: Jina Reranker v2 API primary, local BGE-reranker-v2-m3 automatic fallback, candidates reduced 40 → 20.
Candidate count vs latency
| Candidates | Recall potential | Reranker latency (500-char, CPU) | Notes |
|---|---|---|---|
| 20 | Best on Dutch medical | ≈3.5 s local · ≈450 ms Jina | Current setting |
| 40 | +15–20% raw recall | ≈7.0 s | Worse MRR/NDCG than 20 |
| 50 | +18–22% | ≈8.5 s | Pre-optimisation setting |
| 100 | +25–30% | ≈17 s | Impractical for interactive use |
Nogueira et al. (2019) showed cross-encoder reranking of top-100 BM25 results lifting NDCG@10 by 15–20% on MS MARCO — but returns diminish fast beyond 50 candidates for domain-specific corpora where the first-stage retriever already has decent precision.
6. State-of-the-art context
Reranker leaderboard (Agentset, Feb 2026)
| Rank | Model | ELO | Type | Multilingual |
|---|---|---|---|---|
| 1 | Zerank 2 | 1638 | API | Yes |
| 2 | Cohere Rerank 4 Pro | 1629 | API | 100+ langs |
| 3 | Jina Reranker v2 | 1585 | API/Local | Yes |
| 5 | BGE Reranker v2.5 Gemma | 1498 | Local | Yes |
| 11 | BGE Reranker v2 m3 (our fallback) | 1327 | Local | Yes |
Our local fallback ranks #11 of 12 — ≈300 ELO behind the leaders, which corresponds to roughly an 85% chance the higher-rated model wins a head-to-head on the leaderboard's (English) benchmarks. But the leaderboard evaluates English; our production primary is the API model (Jina, #3), and our own Dutch benchmark (Section 5) is what actually governs the choice. This is precisely why the Cohere result matters: leaderboard position is not a substitute for in-domain measurement.
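The 85% figure follows from the standard Elo expected-score formula. The sketch below assumes the leaderboard uses the conventional 400-point logistic scale, which this page does not state explicitly; the function name is illustrative.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B on the 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Leader (1638) against the local fallback (1327), ratings from the table above.
print(round(elo_win_probability(1638, 1327), 3))
```

A gap of exactly 300 points gives about 0.849, and the actual 311-point gap between the top model and the fallback gives about 0.857, consistent with "roughly 85%".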
Healthcare RAG literature
- MEGA-RAG (2025) — graph traversal + dense retrieval + reranking; +18–20% over vanilla RAG on MedQA/PubMedQA.
- MedGraphRAG (2025) — graph-guided retrieval cuts hallucination ≈40% vs pure vector search; uses no neural reranking.
- MIRAGE (2024) (Xiong et al. 2024) — standardised medical-QA evaluation; baseline `ms-marco-MiniLM` reranking reaches NDCG@10 0.72, domain-adapted rerankers 0.78.
The consistent finding across this literature: the quality of the final re-scoring stage has outsized impact on answer quality — which is why ZOL invests three steps in it.
Dutch-specific gaps standard benchmarks miss
- Compound splitting — bloeddrukverlagend should match Hypertensie.
- Register mismatch — "pijn in mijn borst" must reach thoracale pijn / angina pectoris.
- Code-switching — Belgian patients mix Dutch and French (rendez-vous for afspraak).
- Abbreviations — NMR → MRI, CT → computertomografie.
7. Inference-runtime notes (ADR-0034)
On Apple Silicon, alternative runtimes gave no benefit for the local fallback model:
- ONNX Runtime — 2× slower than PyTorch (the CPU provider lacks ARM NEON optimisations; PyTorch uses Apple's Accelerate natively).
- Apple MPS — no improvement (cross-encoder inference is memory-bound at batch 40; MPS overhead offsets GPU parallelism).
PyTorch CPU is already well-optimised here. On a Linux/CUDA production host, GPU inference would change this calculus — but that is a deployment decision, not a model one.
8. References
- Chen, J., et al. (2024). BGE M3-Embedding. arXiv:2402.03216.
- Khattab, O. & Zaharia, M. (2020). ColBERT: Late Interaction over BERT. SIGIR 2020.
- Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
- Nogueira, R., et al. (2019). Multi-Stage Document Ranking with BERT. arXiv:1910.14424.
- Santhanam, K., et al. (2022). ColBERTv2. NAACL 2022.
- Wu, P., et al. (2025). MedGraphRAG: Graph RAG for Medical Contexts.
- Xiong, G., et al. (2024). Benchmarking RAG for Medicine (MIRAGE). arXiv:2402.13178.
- Agentset Reranking Leaderboard. Hugging Face.
Related: Hybrid Search (the first stage that feeds this one) · Context Assembly (the next stage) · Value Framework (Step 3 in depth) · Query Pipeline (the full stage map).