ColBERT Enablement Study — 2026-02-28 17:14 UTC

Two-condition paired experiment measuring the impact of ColBERT multi-vector reranking (BGE-M3 MaxSim) as a secondary reranker after Jina cross-encoder.

Study Design

Condition	Feature Flag	Pipeline Position
A. baseline	`colbert_reranking_enabled=false`	Jina cross-encoder only
B. colbert-on	`colbert_reranking_enabled=true`	Jina → ColBERT MaxSim refinement

When ColBERT is ON, the primary Jina cross-encoder returns the full candidate set (instead of top-k), and ColBERT refines the ranking using token-level MaxSim.
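The token-level MaxSim step described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names (`normalize`, `maxsim_score`, `rerank`) are hypothetical, and the token embeddings are assumed to arrive as plain lists of floats, one vector per token. The score sums the best match per query token, as in the original ColBERT formulation; some implementations average over query tokens instead, which does not change the ranking for a fixed query.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """For each query token, take its best cosine match among the document
    tokens, then sum those maxima over all query tokens."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def rerank(query_tokens, candidates):
    """candidates: list of (doc_id, doc_token_vectors) pairs.
    Returns doc ids ordered by descending MaxSim score."""
    scored = [(maxsim_score(query_tokens, toks), doc_id) for doc_id, toks in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

In this sketch `candidates` would be the full candidate set passed on by the cross-encoder, so the final order is determined entirely by the MaxSim scores.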

Primary Outcome: Pass Rate

Condition	Pass Rate	Passed	Total
Baseline	100.0%	3	3
ColBERT ON	100.0%	3	3
Delta	+0.0pp	+0

McNemar's Test

	ColBERT pass	ColBERT fail
Baseline pass	3	0 (regressed)
Baseline fail	0 (improved)	—

Test type: exact binomial (b+c=0)
Statistic: exact test
p-value: 1.0000 (NOT significant at alpha=0.05)
Odds ratio: undefined (nan; no discordant pairs, b=c=0)
Post-hoc power: 0.000

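The difference is not statistically significant (p=1.0000): no questions improved and none regressed. With only 3 questions and no discordant pairs (b+c=0), the test has no power to detect an effect, so this result neither supports nor rules out an effect of ColBERT on pass rate.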
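For reference, the exact binomial form of McNemar's test can be sketched as below. This is an illustrative version under stated assumptions, not the code in run_colbert_study.py: the function name `mcnemar_exact` is hypothetical, `b` is the count of questions that passed at baseline and failed with ColBERT (regressed), and `c` is the count that failed at baseline and passed with ColBERT (improved).

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact binomial p-value for McNemar's test.

    Under the null hypothesis each discordant pair is equally likely to fall
    in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no information about direction.
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With `b=0, c=0` the function returns 1.0, matching the reported p-value. Under this formulation at least six discordant pairs, all in the same direction, are needed before p drops below 0.05, so a 3-question study cannot reach significance regardless of outcome.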

Secondary Outcomes

Entity Recall

Baseline: ER: 1.000 [1.000, 1.000] (n=3)
ColBERT ON: ER: 1.000 [1.000, 1.000] (n=3)
Delta: +0.0000

NDCG@5

Bootstrap test: delta=+0.000 (p=1.0000, not significant)

MRR

Bootstrap test: delta=+0.000 (p=1.0000, not significant)
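As a reminder of what the two ranking metrics measure, the sketch below computes NDCG@k and MRR from per-question relevance lists. It is a generic textbook formulation, not necessarily the one used in this study: the function names are hypothetical, relevance is assumed to be given in ranked order (position 1 first), gain is assumed linear in the relevance label, and the ideal ordering is taken over the same retrieved list.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists):
    """Average reciprocal rank across questions."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)
```

A delta of +0.000 on both metrics means the two conditions produced the same per-question scores on average; it does not by itself show that the document orderings were identical.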

Response Time

Baseline: 29098ms [14817, 50717]
ColBERT ON: 12907ms [6852, 20330]
Delta: -16191ms
Cohen's d: 1.191 (large)

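ColBERT ON was 16,191 ms faster on average (Cohen's d = 1.191, large), so there is no evidence that ColBERT adds latency. With n=3 and overlapping intervals, however, the gap is more plausibly run-to-run variance and should not be read as a real speedup from adding a reranking stage.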
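The effect size above can be illustrated with a small sketch. The report does not state which variant of Cohen's d or which sign convention was used, so the pooled-standard-deviation form below is an assumption, and the names `cohens_d` and `effect_label` are hypothetical. The labels follow the thresholds listed under Statistical Methodology; a value of exactly 0.8 is treated as large here.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """(mean(a) - mean(b)) divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def effect_label(d):
    """Map |d| to the conventional Cohen (1988) descriptors."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"
```

With three observations per condition, the standard deviations in the denominator are themselves poorly estimated, so even a "large" d carries wide uncertainty.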

Per-Category Analysis

Category	Baseline	ColBERT ON	Delta
doctor_department	3/3 (100%)	3/3 (100%)	+0.0pp

Regression Analysis

Improved: none
Regressed: none

Statistical Methodology

McNemar's test for paired binary pass/fail outcomes: exact binomial for b+c < 25 (Dietterich 1998), continuity-corrected chi-squared otherwise (Edwards 1948)
Wilcoxon signed-rank test for paired continuous metrics (entity recall, response time)
Bootstrap CIs: 10,000 resamples, percentile method, seed=42 (Efron & Tibshirani 1993)
Cohen's d effect sizes: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large (Cohen 1988)
Significance level: alpha = 0.05, no multiple comparison correction (pre-registered primary + secondary outcomes)
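The percentile bootstrap listed above can be sketched as follows. This is a minimal stdlib version, assuming the interval is taken on the mean at the 95% level (the report does not state the level explicitly); the function name `bootstrap_ci` and its parameters are hypothetical, with defaults chosen to mirror the 10,000 resamples and seed=42 mentioned in the methodology.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

When every observation is identical, as with the three entity-recall values of 1.000, every resample has the same mean and the interval collapses to a single point, which is why the report shows [1.000, 1.000].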

References

McNemar, Q. (1947). Psychometrika, 12(2), 153-157.
Edwards, A. L. (1948). Psychometrika, 13(3), 185-187.
Dietterich, T. G. (1998). Neural Computation, 10(7), 1895-1923.
Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.

System Context

Git branch: master
Git commit: de115ca
LLM model: openai/o4-mini
Embedding model: bge-m3
Questions: 3
DeepEval: disabled (entity-recall only)

Generated by run_colbert_study.py
