ADR-0033: BGE-M3 Embedding Migration

Superseded by ADR-0048 (2026-04-30)

This decision record describes the adoption of BGE-M3 in February 2026. The embedding model has since been migrated to OpenAI text-embedding-3-large — see ADR-0048. The body below is preserved verbatim as the historical decision record; do not configure a new system from this page. For the current embedding stack, read the embedding-model decision and the storage architecture page.

Status: Superseded by ADR-0048 (2026-04-30) — original status: Accepted (February 2026)

Supersedes: ADR-0005

Context

The ZOL Intelligent Search system relied on nomic-embed-text (768-dim) for all embedding operations. While functional, this model had three significant limitations:

  1. Unknown Dutch quality: nomic-embed-text was not benchmarked on MTEB-NL (the Dutch embedding benchmark), making its Dutch retrieval quality unmeasured.
  2. Semantic cache contamination: The A/B experiment (report) revealed that structurally similar Dutch medical queries (e.g., "Welke artsen werken bij Cardiologie?" vs "Welke artsen werken bij Orthopedie?") produced dangerously similar embeddings (cosine >0.97), causing cache false positives.
  3. Limited multilingual support: English-primary training data resulted in weak cross-lingual embedding similarity for non-Dutch queries (for example, from Turkish- and Arabic-speaking patients).
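
To illustrate the cache contamination in item 2, here is a minimal sketch of a threshold-based semantic cache lookup. The 0.97 threshold is the cache value recorded in this ADR; the function names, the `>=` comparison, and the four-dimensional toy vectors are assumptions made for illustration and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, threshold=0.97):
    """Treat the cached answer as reusable when similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy stand-ins for two structurally similar queries that differ only in
# the department name (hypothetical values, not real embeddings).
cardiology_query = [0.90, 0.40, 0.10, 0.05]
orthopedics_query = [0.90, 0.40, 0.05, 0.10]

similarity = cosine_similarity(cardiology_query, orthopedics_query)
false_positive = is_cache_hit(cardiology_query, orthopedics_query)
print(f"similarity={similarity:.4f}, cache hit={false_positive}")
```

In this toy case the two vectors differ only in their small components, so the similarity clears the threshold and the cache would serve the cardiology answer to the orthopedics query.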

BGE-M3 (Chen et al., 2024) was identified as the strongest candidate based on:

  • MTEB-NL benchmark score of 60.0 (retrieval), providing measured Dutch quality
  • 1024 dimensions (vs 768), offering richer representations
  • 100+ language support with XLM-RoBERTa architecture
  • Same 8,192-token context window as nomic-embed-text
  • Available on Ollama for zero-cost local inference

An alternative candidate, multilingual-e5-large-instruct (MTEB-NL: 61.4), was rejected because its 512-token context window was judged insufficient for our medical content chunks, which average ~350 tokens with contextual enrichment.

Decision

Migrate from nomic-embed-text (768-dim) to BGE-M3 (1024-dim) as the embedding model for all operations: document ingestion, query embedding, quality gate evaluation, and semantic cache.

Migration Steps Executed

  1. Updated config: embedding_model="bge-m3", embedding_dimensions=1024
  2. Database migration: Altered pgvector column from vector(768) to vector(1024)
  3. Re-embedded all documents with enriched text (contextual retrieval format)
  4. Rebuilt semantic cache entries
  5. Re-evaluated similarity thresholds; both were kept unchanged (quality gate: 0.40; cache: 0.97)
  6. Ran golden evaluation to validate retrieval quality
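
Steps 1 to 3 can be sketched as follows. This is a hedged illustration, not the migration script that was actually run. The config keys come from step 1; the table name `document_chunks`, the column name `embedding`, and both helper functions are hypothetical. The sketch drops and re-adds the column because existing 768-dimension values cannot be stored in a `vector(1024)` column, and every row is re-embedded anyway; the original migration may have altered the column differently.

```python
# Settings from step 1 of this ADR.
EMBEDDING_CONFIG = {
    "embedding_model": "bge-m3",
    "embedding_dimensions": 1024,
}


def build_migration_sql(table="document_chunks", column="embedding", new_dim=1024):
    """Return SQL statements that replace the vector column with a wider one.

    Table and column names are hypothetical. Dropping the column also drops
    any index built on it, so indexes must be recreated after re-embedding.
    """
    return [
        f"ALTER TABLE {table} DROP COLUMN IF EXISTS {column};",
        f"ALTER TABLE {table} ADD COLUMN {column} vector({new_dim});",
    ]


def validate_embedding(vector, expected_dim=EMBEDDING_CONFIG["embedding_dimensions"]):
    """Reject vectors of the wrong size before they are written to the database."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vector


for statement in build_migration_sql():
    print(statement)
```

A dimension check of this kind catches a stale 768-dimension vector from the old model before it reaches the database during re-embedding.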

Consequences

Positive

  • Measurable Dutch quality: MTEB-NL score of 60.0 replaces "unknown" baseline
  • Better cache discrimination: Higher-dimensional embeddings produce more distinctive vectors for structurally similar queries
  • Improved multilingual support: Superior cross-lingual retrieval quality
  • Future ColBERT option: BGE-M3 supports dense + sparse + ColBERT retrieval modes

Negative

  • 33% more storage per vector (1024 vs 768 dimensions)
  • Full re-indexing required during migration (temporary downtime)
  • Slightly higher embedding latency (~15% increase for local inference)
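
The 33% storage figure can be checked with a short calculation. The sketch assumes pgvector's single-precision `vector` type, which its documentation describes as taking 4 bytes per dimension plus 8 bytes of overhead per vector; the corpus size of 100,000 vectors is a hypothetical number chosen only to make the arithmetic concrete.

```python
def vector_bytes(dimensions):
    """Approximate storage for one pgvector `vector` value: 4 bytes per dimension + 8."""
    return 4 * dimensions + 8


old_bytes = vector_bytes(768)   # nomic-embed-text
new_bytes = vector_bytes(1024)  # BGE-M3

# Relative increase in dimensions and in stored bytes per vector.
dimension_increase = 1024 / 768 - 1
storage_increase = new_bytes / old_bytes - 1

# Hypothetical corpus size, for illustration only.
num_vectors = 100_000
extra_bytes = (new_bytes - old_bytes) * num_vectors

print(f"per vector: {old_bytes} -> {new_bytes} bytes")
print(f"dimension increase: {dimension_increase:.1%}")
print(f"storage increase:   {storage_increase:.1%}")
print(f"extra storage for {num_vectors} vectors: {extra_bytes / 2**20:.1f} MiB")
```

Index structures add their own overhead on top of the raw column data, so the total on-disk growth may differ from the per-vector figure.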

References

  • Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
  • Muennighoff, N., et al. (2023). MTEB: Massive text embedding benchmark. Proceedings of EACL 2023, 2014–2037. https://huggingface.co/spaces/mteb/leaderboard