Reranking

Retrieval finds candidate answers quickly; reranking decides their order before the language model ever sees them. Because the model is only shown the top handful of chunks, the order is not cosmetic — it determines which evidence grounds the answer. This page explains what reranking is, where it sits in the ZOL pipeline, the three rerank steps that run in sequence, and the empirical evaluation that fixed the current configuration.

1. What reranking is, and why it exists

Modern retrieval uses a two-stage retrieve-then-rerank design (Nogueira & Cho 2019). The two stages have deliberately opposite priorities:

Stage	Optimises for	How	Cost
First-stage retrieval	Recall — don't miss the right chunk	Bi-encoder vector search + BM25, fused with RRF	Cheap; runs over the whole corpus
Reranking	Precision — put the best chunk first	Heavier models that score each candidate against the query	Expensive; runs over a small shortlist only

The economic logic is a funnel: a cheap retriever casts a wide net so that the right chunk is somewhere in ~20 candidates, and an expensive reranker then studies that shortlist closely to get the ordering right. The expensive model never runs over the whole corpus, only over the shortlist.
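
As a rough illustration of the funnel, the sketch below scores every document with a cheap function and only the shortlist with an expensive one. The function name `retrieve_then_rerank` and both scorer callables are hypothetical stand-ins, not the pipeline's actual code.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (query, document) -> score


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    shortlist_size: int = 20,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage funnel: cheap scoring over everything, expensive scoring over the shortlist."""
    # Stage 1 (recall): the cheap scorer touches every document in the corpus.
    shortlist = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:shortlist_size]
    # Stage 2 (precision): the expensive scorer only ever sees the shortlist.
    rescored = [(doc, expensive_score(query, doc)) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]
```

With a corpus of 10,000 chunks and a shortlist of 20, the expensive scorer is called 20 times per query rather than 10,000.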

A weak reranker undermines everything upstream: even with perfect retrieval, poor ordering means the model receives second-best evidence and produces a second-best answer.

Why this is hard for Dutch medical content

The reranker must judge relevance in a language and domain that general-purpose models handle poorly:

Morphology: Dutch compounds such as hartchirurgie (heart-surgery) or bloeddrukverlagend (blood-pressure-lowering) only match well with models trained on Dutch or multilingual corpora.
Mixed vocabulary: clinical text blends Dutch, Latin and English (hypertensie, psoriasis, ECG).
Register gap: patients write colloquially (hoge bloeddruk) while content is clinical (Hypertensie).
Low resource: there is no Dutch equivalent of MS MARCO, so English reranker benchmarks do not transfer (Sections 5 and 6 show a model ranked #2 on the English leaderboard scoring last on our corpus).

2. Where reranking sits in the pipeline

Reranking is not a standalone service: it is the bridge between retrieval and context assembly inside the query pipeline. The pipeline is organised as numbered stages; reranking spans the end of the retrieval stage and a dedicated step immediately after it.

The two facts that the old version of this page got wrong, now stated correctly:

The cross-encoder runs first, the Value Framework runs last. In code, _qs_execute_retrieval_stage runs hybrid search then the cross-encoder rerank (Stage 5); the Value Framework affinity rerank (apply_intent_category_affinity, Stage 5b) fires afterwards and re-sorts. So relevance is scored first, then category-appropriateness has the final say.
"Stage 5b" is just the pipeline's label for the Value Framework step — it is not a prerequisite the reader needs to know in advance; it simply means "the step right after retrieval-and-rerank."

3. The rerank stack: three sequential steps

ZOL extends the textbook two-stage shape into three rerank steps, applied in this order:

Order	Step	Always on?	What it does
1	Cross-encoder relevance rerank	Yes (full / escalated mode)	Scores each (query, chunk) pair jointly for fine-grained relevance
2	ColBERT late-interaction	No — feature-flagged off	Token-level multi-vector matching for precision-critical queries
3	Value Framework category rerank	Yes (no flag)	Multiplies each score by an intent × content-category affinity, then re-sorts

Step 1 — Cross-encoder relevance rerank

A first-stage retriever uses a bi-encoder: the query and each document are embedded separately, and similarity is a cheap vector comparison. A cross-encoder instead feeds the query and a candidate together into one model, so every word of the query can attend to every word of the document. This joint attention is far more accurate for fine-grained relevance — and far too slow to run over a whole corpus, which is exactly why it belongs in the rerank stage over a 20-candidate shortlist.

ZOL uses Jina Reranker v2 (API) as primary, with BAAI/bge-reranker-v2-m3 (local) as an automatic fallback when the API is unavailable (see Section 5 for why these specific models).
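
The primary-plus-fallback arrangement can be pictured as a thin wrapper around two scorers. This is a minimal sketch under assumptions: `score_with_fallback`, `primary` and `fallback` are hypothetical names, and the real code may distinguish error types more finely.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)

# Hypothetical scorer shape: (query, chunk texts) -> one relevance score per chunk.
Scorer = Callable[[str, Sequence[str]], List[float]]


def score_with_fallback(
    query: str, chunks: Sequence[str], primary: Scorer, fallback: Scorer
) -> List[float]:
    """Try the primary (API) scorer; on any failure, use the local fallback."""
    try:
        return primary(query, chunks)
    except Exception as exc:  # e.g. a timeout or an HTTP 429/5xx from the API
        logger.warning("primary reranker unavailable (%s); using local fallback", exc)
        return fallback(query, chunks)
```

The caller sees the same list of scores either way, so the rest of the pipeline does not need to know which model produced them.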

Step 2 — ColBERT late-interaction (optional)

A cross-encoder still compresses each document to a single relevance score. ColBERT (Khattab & Zaharia 2020) keeps one vector per token and scores by MaxSim: for each query token it takes the best-matching document token, then sums those maxima over all query tokens.
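
Written out, with $q_i$ the embedding of query token $i$ and $d_j$ the embedding of document token $j$ (symbols chosen here for illustration, following the definition in the ColBERT paper), the MaxSim score is:

```latex
S(q, d) \;=\; \sum_{i=1}^{|q|} \; \max_{1 \le j \le |d|} \; q_i \cdot d_j
```

Because ColBERT normalises its token embeddings to unit length, each dot product $q_i \cdot d_j$ is a cosine similarity.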

This preserves token-level distinctions a single vector hides — Dutch compounds (hartchirurgie vs hartritmestoornissen), doctor-name precision (Dr. Mullens vs Dr. Peeters), and multi-entity queries ("welke cardioloog op campus Sint-Jan doet echocardiografie?" has three entities ColBERT scores independently). Because storing per-token vectors for the whole corpus is prohibitive, ColBERT is used only as a reranker over the shortlist, and is disabled by default (colbert_reranking_enabled=False) — held in reserve for precision-critical use.
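
A toy version of MaxSim scoring makes the token-level behaviour concrete. The sketch below uses plain Python lists as stand-in token embeddings; the function names and the 2-dimensional vectors are illustrative assumptions, not the production ColBERT code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens pointing in different directions (toy embeddings).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token

score_a = maxsim(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim(query, doc_b)  # 1.0 + 0.0 = 1.0
```

Each query token contributes its own best match, so a document that covers only one of the query's entities scores visibly lower than one that covers both.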

Step 3 — Value Framework category rerank

The cross-encoder answers "how relevant is this chunk?" The Value Framework answers a different question: "is this the right kind of content for this question?" It multiplies each chunk's score by an intent × content-category affinity and re-sorts, so a navigational question is answered from practical content rather than regulatory content even when both mention the same keyword. It is always on, runs last, and is the final arbiter of order. Full detail on the Value Framework page.
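
The multiply-and-re-sort mechanic can be sketched in a few lines. Everything specific here is an assumption for illustration: the affinity values, the category labels `practical` and `regulatory`, the neutral default of 1.0, and the name `apply_affinity_sketch` (the production function is `apply_intent_category_affinity`, whose internals are not shown on this page).

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    category: str
    score: float  # relevance score from the cross-encoder


# Hypothetical affinity matrix: intent -> content category -> multiplier.
AFFINITY: Dict[str, Dict[str, float]] = {
    "navigation_or_practical_info": {"practical": 1.2, "regulatory": 0.6},
}
DEFAULT_AFFINITY = 1.0  # assumed neutral weight for unlisted pairs


def apply_affinity_sketch(intent: str, chunks: List[ScoredChunk]) -> List[ScoredChunk]:
    """Multiply each score by its intent x category affinity, then re-sort."""
    row = AFFINITY.get(intent, {})
    reweighted = [
        replace(c, score=c.score * row.get(c.category, DEFAULT_AFFINITY))
        for c in chunks
    ]
    return sorted(reweighted, key=lambda c: c.score, reverse=True)
```

With these illustrative weights, a regulatory chunk scored 0.80 drops to 0.48 and a practical chunk scored 0.70 rises to 0.84, so the practical chunk moves ahead while the regulatory chunk stays in the list.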

Two different "rerankers", one common confusion

The cross-encoder decides relevance (how well a chunk matches the query). The Value Framework decides appropriateness (whether the chunk's category fits the question's intent). They are complementary and sequential — relevance first, appropriateness last.

The full stack, end to end

Putting retrieval and all three steps together, four behaviours of this stack carry most of its real-world effect, including the two most people miss (pinned chunks and intent-aware top-K):

Pinned chunks bypass reranking. Graph results and keyword_rescue hits are concatenated after reranking (s["chunks"] = pinned + reranked), so a deterministic structured answer — e.g. a doctor→department fact — is never reordered away by a relevance model that cannot see the structure.
The output count is intent-aware, not fixed. Standard retrieval resolves top-K from the intent classifier (_resolve_top_k_for_intent): a navigation_or_practical_info query keeps ≈12 chunks, a doctor_lookup ≈6; escalated ("Think Harder") keeps a fixed wide budget of 20.
Reranking is fail-open. If the reranker errors, the exception is caught and the pre-rerank order is kept (logged, non-fatal) — an outage degrades ordering, it never drops the answer.
Breadth queries get a diversity cap. For navigation/department_lookup intents the reranker is asked for 2× top-K, then capped at max 2 chunks per document, so the final context spreads across separate brochures instead of being dominated by one general page.
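
The behaviours in the list above can be combined into one small sketch. It is written under several assumptions that the source does not confirm: the chunk shape, the `rerank` callable's signature, the default top-K for unlisted intents, the exact intent labels treated as breadth queries, and the choice that pinned chunks do not count against top-K.

```python
import logging
from collections import Counter
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

Chunk = Dict[str, str]  # hypothetical shape: {"id": ..., "doc_id": ...}
Reranker = Callable[[List[Chunk], int], List[Chunk]]  # (candidates, n) -> best n

TOP_K_BY_INTENT = {"navigation_or_practical_info": 12, "doctor_lookup": 6}
DEFAULT_TOP_K = 8      # hypothetical default for intents not listed above
ESCALATED_TOP_K = 20   # fixed wide budget in escalated mode
BREADTH_INTENTS = {"navigation_or_practical_info", "department_lookup"}  # assumed labels


def cap_per_document(chunks: List[Chunk], max_per_doc: int = 2) -> List[Chunk]:
    """Keep at most max_per_doc chunks per source document, preserving order."""
    seen: Counter = Counter()
    kept = []
    for chunk in chunks:
        if seen[chunk["doc_id"]] < max_per_doc:
            kept.append(chunk)
            seen[chunk["doc_id"]] += 1
    return kept


def assemble_context(intent: str, pinned: List[Chunk], candidates: List[Chunk],
                     rerank: Reranker, escalated: bool = False) -> List[Chunk]:
    top_k = ESCALATED_TOP_K if escalated else TOP_K_BY_INTENT.get(intent, DEFAULT_TOP_K)
    breadth = intent in BREADTH_INTENTS
    requested = 2 * top_k if breadth else top_k  # over-fetch so the cap has room
    try:
        ordered = rerank(candidates, requested)
    except Exception:
        # Fail-open: log and keep the pre-rerank order.
        logger.exception("rerank failed; keeping pre-rerank order")
        ordered = candidates[:requested]
    if breadth:
        ordered = cap_per_document(ordered)
    # Pinned chunks are never passed to the reranker and always come first.
    return pinned + ordered[:top_k]
```

Because pinned chunks are prepended after reranking, no relevance score can push them out of the context.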

4. Current configuration

Parameter	Value	Rationale
Primary model	Jina Reranker v2 (API)	Winner of the Section 5 Dutch-medical benchmark; ≈10× faster than local BGE
Fallback model	`BAAI/bge-reranker-v2-m3` (local)	Multilingual cross-encoder, supports Dutch; runs when the Jina API is unavailable
Fallback architecture	XLM-RoBERTa, 24-layer	560 M parameters
max_length	512 tokens	Safety net beyond character truncation
Content truncation	500 characters	11.4× speedup vs full content (ADR-0034); medical pages frontload the key facts
Candidates	20 (`rag_rerank_candidates`)	Section 5: 20 beats 40 on MRR/NDCG while halving latency
Score threshold	0.0	No minimum-score filtering
ColBERT	`colbert_reranking_enabled=False`	Feature-flagged; reserved for the precision-critical case
Value Framework	Always-on	Multiplies score by the intent × category matrix; no flag
Fallback inference	CPU (PyTorch)	ONNX/MPS evaluated, no benefit on Apple Silicon (Section 7)
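
For orientation, the table's numeric settings could be carried in a small settings object like the one below. Only `rag_rerank_candidates` and `colbert_reranking_enabled` are names taken from this page; the other field names, the class name and the helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RerankSettings:
    rag_rerank_candidates: int = 20          # name from the table above
    colbert_reranking_enabled: bool = False  # name from the table above
    max_length_tokens: int = 512             # hypothetical field name
    content_truncation_chars: int = 500      # hypothetical field name
    score_threshold: float = 0.0             # hypothetical field name


def prepare_rerank_inputs(chunk_texts: List[str], settings: RerankSettings) -> List[str]:
    """Limit the shortlist size and truncate each chunk before scoring."""
    shortlist = chunk_texts[: settings.rag_rerank_candidates]
    return [text[: settings.content_truncation_chars] for text in shortlist]
```

Character truncation happens before tokenisation, which is why the 512-token `max_length` acts only as a safety net.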

Note: BGE-M3 is the ColBERT model, not the embedding model

Two different BGE-M3 roles are easy to conflate. The primary embedding model for first-stage vector search is OpenAI text-embedding-3-large (1,536-dim), per ADR-0048 (OpenAI 2024). BGE-M3 (1,024-dim) survives in the stack only because it natively supports the late-interaction multi-vector mode that Step 2 (ColBERT) needs. Older "first-stage retriever (BGE-M3 embeddings)" phrasing in the Section 5 benchmark describes the state at that benchmark's date; today the first-stage retriever is text-embedding-3-large.

Key design trade-offs

Decision	Chosen	Rejected alternatives, and why
Cross-encoder model	Jina Reranker v2 API	Local BGE was 10× slower with worse MRR. Cohere Rerank 4 is #2 on the English leaderboard but scored last on our Dutch corpus (MRR@10 = 0.149) — English benchmarks do not transfer.
Candidate count	20	40/50/100 scored worse on MRR/NDCG: beyond 20, extra candidates add noise, not recoverable signal, and push latency past the perceptual threshold.
Value Framework as a separate last step	Always-on, after the cross-encoder	Baking the affinity matrix into the cross-encoder would require re-fine-tuning per tenant; a hard category filter would drop correct chunks on a boundary misclassification. A graded multiplier lets the cross-encoder rank on pure relevance first, then re-weights by appropriateness without ever excluding a chunk outright.
ColBERT as optional Step 2, not primary	Feature-flagged	ColBERT-as-retriever needs per-token vectors for the whole corpus (storage-prohibitive at 10k+ chunks). As a reranker it gives the late-interaction precision benefit only where it is worth the cost.
Truncation length	500 chars (configurable)	500 chars gives an 11.4× local speedup over full content; 1,000 chars gave a tiny MAP gain (0.2804 vs 0.2751) at ≈3× the latency (4,596 ms vs 1,503 ms). On the Jina API path, input length matters little.

5. The Dutch-medical evaluation (why these choices)

The configuration above was not chosen from leaderboards — it was measured on a purpose-built Dutch evaluation set. We built 118 Dutch medical queries from the golden-question corpus, each with expected_source_urls, yielding 695 positive and 2,360 negative chunks as query/positive/negative triples (scripts/build_reranker_eval_set.py, scripts/benchmark_rerankers.py).

Model	Truncation	Candidates	MRR@10	NDCG@10	MAP	Avg latency
Jina Reranker v2	API	20	0.4046	0.3005	0.3473	≈450 ms
Jina Reranker v2	API	40	0.3166	0.2118	0.2550	446 ms
BGE-reranker-v2-m3	1000	20	0.2406	0.1901	0.2804	4,596 ms
BGE-reranker-v2-m3	500	20	0.2247	0.1914	0.2751	1,503 ms
BGE-reranker-v2-m3	2000	20	0.2296	0.1805	0.2737	4,960 ms
Cohere Rerank 4	API	40	0.1493	0.1054	—	892 ms

Jina API runs hit rate-limiting (429s) that inflated some latency measurements; ≈450 ms is the rate-limit-free figure. All BGE runs used the full 118-query set.
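
For readers unfamiliar with the three metrics in the table, the sketch below shows their standard per-query definitions over a ranked list of binary relevance labels (1 = the chunk's URL is among the expected sources). It is an illustration, not necessarily how scripts/benchmark_rerankers.py computes them; in particular it assumes every positive for a query appears somewhere in the ranked list.

```python
import math
from typing import Sequence


def reciprocal_rank_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """1 / rank of the first relevant item within the top k, or 0 if none."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance: Sequence[int], k: int = 10) -> float:
    """Discounted cumulative gain at k, normalised by the ideal ordering."""
    def dcg(labels: Sequence[int]) -> float:
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@rank taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# One query whose relevant chunks landed at ranks 2 and 4.
ranked = [0, 1, 0, 1]
rr = reciprocal_rank_at_k(ranked)   # 1/2
ap = average_precision(ranked)      # (1/2 + 2/4) / 2 = 0.5
ndcg = ndcg_at_k(ranked)
```

MRR@10, NDCG@10 and MAP are then the means of these per-query values over all queries in the set.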

What the data settled:

Jina Reranker v2 wins on Dutch medical content — highest MRR@10 (0.4046) and NDCG@10 (0.3005), while ≈10× faster than local BGE.
20 candidates beats 40/50 at every truncation length — beyond 20, first-stage candidates add noise.
English rankings do not transfer — Cohere Rerank 4 (English #2) scored last here. This is the single strongest argument for our evaluation-first approach.
Switching to the API saved ≈4 s/query (4.6 s local → 0.45 s) — a transformative UX gain.

The production decision: Jina Reranker v2 API primary, local BGE-reranker-v2-m3 automatic fallback, candidates reduced 40 → 20.

Candidate count vs latency

Candidates	Recall potential	Reranker latency (500-char, CPU)	Notes
20	Best on Dutch medical	≈3.5 s local · ≈450 ms Jina	Current setting
40	+15–20% raw recall	≈7.0 s	Worse MRR/NDCG than 20
50	+18–22%	≈8.5 s	Pre-optimisation setting
100	+25–30%	≈17 s	Impractical for interactive use

Nogueira et al. (2019) showed cross-encoder reranking of top-100 BM25 results lifting NDCG@10 by 15–20% on MS MARCO — but returns diminish fast beyond 50 candidates for domain-specific corpora where the first-stage retriever already has decent precision.

6. State-of-the-art context

Reranker leaderboard (Agentset, Feb 2026)

Rank	Model	ELO	Type	Multilingual
1	Zerank 2	1638	API	Yes
2	Cohere Rerank 4 Pro	1629	API	100+ langs
3	Jina Reranker v2	1585	API/Local	Yes
5	BGE Reranker v2.5 Gemma	1498	Local	Yes
11	BGE Reranker v2 m3 (our fallback)	1327	Local	Yes

Our local fallback ranks #11 of 12 — ≈300 ELO behind the leaders, which corresponds to roughly an 85% chance the higher-rated model wins a head-to-head on the leaderboard's (English) benchmarks. But the leaderboard evaluates English; our production primary is the API model (Jina, #3), and our own Dutch benchmark (Section 5) is what actually governs the choice. This is precisely why the Cohere result matters: leaderboard position is not a substitute for in-domain measurement.
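
The "roughly 85%" figure follows from the standard Elo expected-score formula. The short check below assumes the leaderboard uses the conventional 400-point logistic scale, which this page does not state explicitly.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Leader (1638) against our local fallback (1327): a gap of 311 points.
p_leader_wins = elo_win_probability(1638, 1327)
print(f"{p_leader_wins:.3f}")  # prints 0.857
```

A 311-point gap therefore corresponds to the higher-rated model being preferred in about six of every seven head-to-head comparisons.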

Healthcare RAG literature

MEGA-RAG (2025) — graph traversal + dense retrieval + reranking; +18–20% over vanilla RAG on MedQA/PubMedQA.
MedGraphRAG (2025) — graph-guided retrieval cuts hallucination ≈40% vs pure vector search; uses no neural reranking.
MIRAGE (Xiong et al. 2024): standardised medical-QA evaluation; baseline ms-marco-MiniLM reranking reaches NDCG@10 0.72, domain-adapted rerankers 0.78.

The consistent finding across this literature: the quality of the final re-scoring stage has outsized impact on answer quality — which is why ZOL invests three steps in it.

Dutch-specific gaps standard benchmarks miss

Compound splitting — bloeddrukverlagend should match Hypertensie.
Register mismatch — "pijn in mijn borst" must reach thoracale pijn / angina pectoris.
Code-switching — Belgian patients mix Dutch and French (rendez-vous for afspraak).
Abbreviations — NMR → MRI, CT → computertomografie.

7. Inference-runtime notes (ADR-0034)

On Apple Silicon, alternative runtimes gave no benefit for the local fallback model:

ONNX Runtime — 2× slower than PyTorch (the CPU provider lacks ARM NEON optimisations; PyTorch uses Apple's Accelerate natively).
Apple MPS — no improvement (cross-encoder inference is memory-bound at batch 40; MPS overhead offsets GPU parallelism).

PyTorch CPU is already well-optimised here. On a Linux/CUDA production host, GPU inference would change this calculus — but that is a deployment decision, not a model one.

8. References

Chen, J., et al. (2024). BGE M3-Embedding. arXiv:2402.03216.
Khattab, O. & Zaharia, M. (2020). ColBERT: Late Interaction over BERT. SIGIR 2020.
Nogueira, R. & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085.
Nogueira, R., et al. (2019). Multi-Stage Document Ranking with BERT. arXiv:1910.14424.
Santhanam, K., et al. (2022). ColBERTv2. NAACL 2022.
Wu, P., et al. (2025). MedGraphRAG: Graph RAG for Medical Contexts.
Xiong, G., et al. (2024). Benchmarking RAG for Medicine (MIRAGE). arXiv:2402.13178.
Agentset Reranking Leaderboard. Hugging Face.

Related: Hybrid Search (the first stage that feeds this one) · Context Assembly (the next stage) · Value Framework (Step 3 in depth) · Query Pipeline (the full stage map).
