ColBERT Enablement Study — 2026-02-28 17:14 UTC

Two-condition paired experiment measuring the impact of ColBERT multi-vector reranking (BGE-M3 MaxSim) as a secondary reranker after the Jina cross-encoder.

Study Design

| Condition | Feature Flag | Pipeline Position |
|---|---|---|
| A. baseline | colbert_reranking_enabled=false | Jina cross-encoder only |
| B. colbert-on | colbert_reranking_enabled=true | Jina → ColBERT MaxSim refinement |

When ColBERT is ON, the primary Jina cross-encoder returns the full candidate set (instead of top-k), and ColBERT refines the ranking using token-level MaxSim.
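The token-level MaxSim refinement described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names `maxsim_score` and `rerank` are hypothetical, and it assumes each query and candidate is already encoded as a matrix of per-token vectors (for example by BGE-M3's multi-vector output). It sums the per-query-token maxima as in the original ColBERT formulation; some implementations divide by the number of query tokens instead, which does not change the ranking for a fixed query.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the cosine
    similarity of its best-matching document token, then sum the maxima."""
    # L2-normalize rows so that dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

def rerank(query_vecs, candidates):
    """candidates: list of (doc_id, doc_vecs). Returns doc_ids, best first."""
    scored = [(maxsim_score(query_vecs, vecs), doc_id) for doc_id, vecs in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]

# Toy example with 2-dimensional token vectors (hypothetical data).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # matches both query tokens
doc_b = np.array([[1.0, 0.0]])              # matches only the first
print(rerank(query, [("B", doc_b), ("A", doc_a)]))
```

Because every query token is compared against every candidate token, the cost grows with the size of the candidate set, which matters here since the ColBERT-on condition passes the full candidate set through rather than the top-k.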

Primary Outcome: Pass Rate

| Condition | Pass Rate | Passed | Total |
|---|---|---|---|
| Baseline | 100.0% | 3 | 3 |
| ColBERT ON | 100.0% | 3 | 3 |
| Delta | +0.0pp | +0 | |

McNemar's Test

| | ColBERT pass | ColBERT fail |
|---|---|---|
| Baseline pass | 3 | 0 (regressed) |
| Baseline fail | 0 (improved) | 0 |
  • Test type: exact binomial (b+c=0)
  • Statistic: exact test
  • p-value: 1.0000 (NOT significant at alpha=0.05)
  • Odds ratio: nan (undefined, since b = c = 0)
  • Post-hoc power: 0.000

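The difference is not statistically significant (p=1.0000). 0 questions improved and 0 regressed, so there are no discordant pairs and the test provides no evidence that ColBERT affects pass rate. Both conditions are at ceiling (3/3), which leaves no room to detect an improvement on this question set.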
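The exact binomial form of McNemar's test can be sketched in a few lines. This is an illustrative version, not the code in run_colbert_study.py: the function names are hypothetical, and the odds-ratio direction (improved over regressed) is an assumption, since the report does not state which way it is defined.

```python
from math import comb, inf, nan

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial McNemar p-value.

    b = baseline pass / ColBERT fail (regressed)
    c = baseline fail / ColBERT pass (improved)
    Under the null, each discordant pair is equally likely to go either way.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the data carry no information
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def discordant_odds_ratio(b: int, c: int) -> float:
    """Improved / regressed (direction is an assumption); nan when both are 0."""
    if b == 0:
        return nan if c == 0 else inf
    return c / b

print(mcnemar_exact(0, 0))          # this study: b = c = 0
print(discordant_odds_ratio(0, 0))  # undefined
print(mcnemar_exact(0, 3))          # best case achievable with 3 questions
```

The last call is worth noting: even if all 3 questions had changed in the same direction, the two-sided p-value would be 2 × 0.5³ = 0.25, so a 3-question set cannot reach alpha = 0.05 under this test regardless of outcome.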

Secondary Outcomes

Entity Recall

  • Baseline: 1.000 [1.000, 1.000] (n=3)
  • ColBERT ON: 1.000 [1.000, 1.000] (n=3)
  • Delta: +0.0000

NDCG@5

  • Bootstrap test: delta=+0.000 (p=1.0000, not significant)

MRR

  • Bootstrap test: delta=+0.000 (p=1.0000, not significant)
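For reference, the two ranking metrics above are commonly computed as in the sketch below. The report does not show the study's own definitions, so several choices here are assumptions: binary relevance labels, linear gain, a log2 discount, and an ideal ranking built from the same retrieved list. The function names are illustrative.

```python
import math

def mrr(relevance_lists):
    """Mean reciprocal rank over queries.
    relevance_lists: one list per query of 0/1 labels in ranked order."""
    total = 0.0
    for rels in relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(relevance_lists)

def dcg_at_k(rels, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k=5):
    """DCG normalized by the DCG of the same labels in ideal order."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels: relevant document at rank 2 vs. rank 1.
print(mrr([[0, 1, 0], [1, 0, 0]]))
print(ndcg_at_k([0, 1, 0, 0, 0]))
```

Both metrics depend only on where the relevant items sit in the ranking, so a delta of +0.000 is what one would expect if the ColBERT stage left the positions of the relevant items unchanged on these questions.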

Response Time

  • Baseline: 29098ms [14817, 50717]
  • ColBERT ON: 12907ms [6852, 20330]
  • Delta: -16191ms
  • Cohen's d: 1.191 (large)

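ColBERT ON was not slower than baseline in this run: its point estimate is 16191 ms lower. With n=3 and wide, overlapping intervals, this shows only that no added latency was observed; it is not evidence that ColBERT itself reduces response time.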
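The effect size and its label can be reproduced with a small helper. The report does not say which variant of Cohen's d the script uses, so this sketch assumes the pooled-standard-deviation form for two groups; a paired variant (mean difference divided by the SD of the differences) would give a different number on the same data. The sample timings are made up for illustration, and treating |d| of exactly 0.8 as "large" is also an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def effect_label(d):
    """Thresholds as listed under Statistical Methodology (Cohen 1988)."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

# Hypothetical per-question timings in seconds, not the study's raw data.
baseline = [2.0, 4.0, 6.0]
colbert_on = [1.0, 3.0, 5.0]
d = cohens_d(baseline, colbert_on)
print(round(d, 3), effect_label(d))
print(effect_label(1.191))
```

With three observations per condition, the standard deviations themselves are poorly estimated, so a "large" d here is a description of this sample rather than a stable estimate.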

Per-Category Analysis

| Category | Baseline | ColBERT ON | Delta |
|---|---|---|---|
| doctor_department | 3/3 (100%) | 3/3 (100%) | +0.0pp |

Regression Analysis

  • Improved: none
  • Regressed: none

Statistical Methodology

  • McNemar's test for paired binary pass/fail outcomes: exact binomial for b+c < 25 (Dietterich 1998), continuity-corrected chi-squared otherwise (Edwards 1948)
  • Wilcoxon signed-rank test for paired continuous metrics (entity recall, response time)
  • Bootstrap CIs: 10,000 resamples, percentile method, seed=42 (Efron & Tibshirani 1993)
  • Cohen's d effect sizes: |d| < 0.2 negligible, 0.2-0.5 small, 0.5-0.8 medium, > 0.8 large (Cohen 1988)
  • Significance level: alpha = 0.05, no multiple comparison correction (pre-registered primary + secondary outcomes)
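The bootstrap interval settings listed above (10,000 resamples, percentile method, seed=42) correspond to a procedure like the following sketch. It assumes the statistic being bootstrapped is the sample mean and uses NumPy's `default_rng`; the function name `bootstrap_ci` is hypothetical and the study script may differ in both respects.

```python
import numpy as np

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean (assumed statistic).
    Returns (point_estimate, lower, upper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw all resamples at once: each row is one resample with replacement.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    lower, upper = np.percentile(
        resampled_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return float(values.mean()), float(lower), float(upper)

# Entity recall of 1.0 on all 3 questions gives a degenerate interval.
print(bootstrap_ci([1.0, 1.0, 1.0]))
```

When every observation is identical, every resample has the same mean, which is why the entity-recall intervals in this report collapse to [1.000, 1.000]. That reflects zero variance in the sample, not certainty about the population value.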

References

  • McNemar, Q. (1947). Psychometrika, 12(2), 153-157.
  • Edwards, A. L. (1948). Psychometrika, 13(3), 185-187.
  • Dietterich, T. G. (1998). Neural Computation, 10(7), 1895-1923.
  • Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.

System Context

  • Git branch: master
  • Git commit: de115ca
  • LLM model: openai/o4-mini
  • Embedding model: bge-m3
  • Questions: 3
  • DeepEval: disabled (entity-recall only)

Generated by run_colbert_study.py