ColBERT Enablement Study — 2026-02-28 17:14 UTC
Two-condition paired experiment measuring the impact of ColBERT multi-vector reranking (BGE-M3 MaxSim) as a secondary reranker after Jina cross-encoder.
Study Design
| Condition | Feature Flag | Pipeline Position |
|---|---|---|
| A. baseline | colbert_reranking_enabled=false | Jina cross-encoder only |
| B. colbert-on | colbert_reranking_enabled=true | Jina → ColBERT MaxSim refinement |
When ColBERT is ON, the primary Jina cross-encoder returns the full candidate set (instead of top-k), and ColBERT refines the ranking using token-level MaxSim.
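The token-level MaxSim refinement described above can be sketched as follows. This is a minimal pure-Python illustration, not the pipeline's actual code: the function names `maxsim_score` and `colbert_refine` and the toy embeddings are assumptions made for the example. It also assumes cosine similarity and a sum over query tokens; some implementations average instead, which does not change the ranking for a fixed query.

```python
import math

def _unit(vec):
    """Scale a vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [_unit(v) for v in query_vecs]
    d = [_unit(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

def colbert_refine(query_vecs, candidates):
    """Reorder (doc_id, doc_vecs) pairs by descending MaxSim score."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 2-dimensional token embeddings, made up for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("doc_partial", [[1.0, 0.0]]),              # matches one query token
    ("doc_full", [[1.0, 0.0], [0.0, 1.0]]),     # matches both query tokens
]
ranking = colbert_refine(query, candidates)
print(ranking)
```

In this toy case the document that covers both query tokens is ranked first, which is the behaviour the refinement step relies on.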
Primary Outcome: Pass Rate
| Condition | Pass Rate | Passed | Total |
|---|---|---|---|
| Baseline | 100.0% | 3 | 3 |
| ColBERT ON | 100.0% | 3 | 3 |
| Delta | +0.0pp | +0 | |
McNemar's Test
| | ColBERT pass | ColBERT fail |
|---|---|---|
| Baseline pass | 3 | 0 (regressed) |
| Baseline fail | 0 (improved) | — |
- Test type: exact binomial (b+c=0)
- Statistic: none (the exact test yields a p-value directly)
- p-value: 1.0000 (NOT significant at alpha=0.05)
- Odds ratio: undefined (nan; there are no discordant pairs, b=c=0)
- Post-hoc power: 0.000
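The difference is not statistically significant (p=1.0000): 0 questions improved and 0 regressed. With only 3 questions and no discordant pairs, the test has no power to detect an effect, so there is insufficient evidence that ColBERT affects pass rate in either direction.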
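For reference, the exact binomial form of McNemar's test can be sketched in a few lines. This is an illustrative reimplementation, not the code in run_colbert_study.py; the function name `mcnemar_exact` is an assumption. Here `b` counts questions that passed under baseline and failed with ColBERT, and `c` counts the reverse.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value over the discordant pairs b and c."""
    n = b + c
    if n == 0:
        # No discordant pairs: the data carry no information about a difference.
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_observed = mcnemar_exact(0, 0)      # the table above
p_best_case = mcnemar_exact(0, 3)     # all 3 questions improving
print(p_observed, p_best_case)
```

Under this formulation, even the most extreme outcome possible with 3 questions (all three discordant in the same direction) gives p=0.25, so a sample of this size cannot reach significance at alpha=0.05. At least 6 discordant pairs, all in one direction, are needed (p=0.03125).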
Secondary Outcomes
Entity Recall
- Baseline: ER: 1.000 [1.000, 1.000] (n=3)
- ColBERT ON: ER: 1.000 [1.000, 1.000] (n=3)
- Delta: +0.0000
NDCG@5
- Bootstrap test: delta=+0.000 (p=1.0000, not significant)
MRR
- Bootstrap test: delta=+0.000 (p=1.0000, not significant)
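As a reminder of what the two ranking metrics measure, here is a minimal sketch of NDCG@5 and reciprocal rank for a single query. The function names and the relevance lists are hypothetical, and the ideal ordering is computed from the retrieved list only, which is a common simplification and may differ from how the study script derives it. MRR is the mean of the reciprocal rank across queries.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=5):
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# Hypothetical relevance labels in ranked order (1 = relevant, 0 = not).
perfect = [1, 0, 0, 0, 0]
demoted = [0, 1, 0, 0, 0]
print(ndcg_at_k(perfect), ndcg_at_k(demoted))
print(reciprocal_rank(perfect), reciprocal_rank(demoted))
```

A delta of +0.000 on both metrics is what one would expect if both conditions place the relevant documents at the same ranks for every question.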
Response Time
- Baseline: 29098ms [14817, 50717]
- ColBERT ON: 12907ms [6852, 20330]
- Delta: -16191ms
- Cohen's d: 1.191 (large)
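There is no evidence that ColBERT adds latency: response time was 16,191 ms lower with ColBERT ON in this run. With n=3 and overlapping intervals, however, the difference should not be read as a reliable speedup.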
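The effect-size figure above can be reproduced in outline with the sketch below. It assumes the pooled-standard-deviation form of Cohen's d; the report does not state which variant the study script uses, and the latency values here are invented for illustration rather than taken from the run. The labels follow the thresholds listed under Statistical Methodology, with boundary values assigned to the larger category as an assumption.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of two groups."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def effect_label(d):
    """Map |d| to the conventional negligible/small/medium/large labels."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

# Hypothetical per-question latencies in ms, NOT the study's raw data.
baseline_ms = [1000.0, 2000.0, 3000.0]
colbert_ms = [2000.0, 3000.0, 4000.0]
d = cohens_d(baseline_ms, colbert_ms)
print(d, effect_label(d))
```

With only three observations per condition, the standard deviations are estimated from very little data, so a "large" label here says more about the sample than about the pipeline.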
Per-Category Analysis
| Category | Baseline | ColBERT ON | Delta |
|---|---|---|---|
| doctor_department | 3/3 (100%) | 3/3 (100%) | +0.0pp |
Regression Analysis
- Improved: none
- Regressed: none
Statistical Methodology
- McNemar's test for paired binary pass/fail outcomes: exact binomial for b+c < 25 (Dietterich 1998), continuity-corrected chi-squared otherwise (Edwards 1948)
- Wilcoxon signed-rank test for paired continuous metrics (entity recall, response time)
- Bootstrap CIs: 10,000 resamples, percentile method, seed=42 (Efron & Tibshirani 1993)
- Cohen's d effect sizes: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large (Cohen 1988)
- Significance level: alpha = 0.05, no multiple comparison correction (pre-registered primary + secondary outcomes)
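The percentile bootstrap listed above can be sketched as follows, using the stated settings (10,000 resamples, seed=42). This is an illustrative stdlib-only version with a hypothetical function name `bootstrap_ci`; the study script may use a different random number generator, so its exact interval endpoints would not necessarily match.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return means[lo_idx], means[hi_idx]

# Entity recall was 1.000 for all three questions in both conditions.
recall_ci = bootstrap_ci([1.0, 1.0, 1.0])
print(recall_ci)
```

When every observation is identical, every resample has the same mean, so the interval collapses to a single point. That is why entity recall is reported as [1.000, 1.000]; the zero width reflects the lack of variation among three questions, not high precision.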
References
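- McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153-157.
- Edwards, A. L. (1948). Note on the "correction for continuity" in testing the significance of the difference between correlated proportions. Psychometrika, 13(3), 185-187.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.
- Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.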
System Context
- Git branch: master
- Git commit: de115ca
- LLM model: openai/o4-mini
- Embedding model: bge-m3
- Questions: 3
- DeepEval: disabled (entity-recall only)
Generated by run_colbert_study.py