ADR-0048: Migrate Embeddings from Ollama bge-m3 to OpenAI text-embedding-3-large
Master record: docs/ADR/0048-openai-embeddings-migration.md. The master is canonical; this Docusaurus rendering is for in-site navigation.
Date: 2026-04-30
Status: Accepted; supersedes ADR-0033 (BGE-M3 via Ollama)
Deciders: Tsunami-max (engineering)
Context
ADR-0033 (2026-02-18) selected bge-m3 (1024-dim) hosted via Ollama as the embedding provider for the ZOL RAG system, replacing nomic-embed-text from ADR-0005. The choice was driven by Dutch-language quality on MTEB-NL (Chen et al., 2024) and a data-sovereignty preference for keeping embeddings on-premise.
In the period since, several constraints shifted:
- Voice channel went live. Time-to-first-sound on voice turns is the primary user-perceived latency metric. bge-m3 is a 566 M-parameter F16 model running on CPU inside Ollama; first-call cold start is 1.7–2.1 s, and Ollama serializes concurrent requests, so the speculative+canonical parallel-embedding path in rag.retrieval_mixin paid 5–8 s wall-clock on voice turns. Three-filler ladders ("Even kijken…", "Ik ben nog aan het zoeken…", "Het duurt wat langer…") fired routinely as a result.
- Public-website context, low PII. The chatbot serves a hospital public website. Question text contains symptom keywords but no patient identifiers, lab values, or medical-record content. The data-sovereignty argument that motivated bge-m3 was overweighted for this surface area; embedding the query string "Wat zijn de bezoekuren bij cardiologie?" via OpenAI does not constitute a PHI/GDPR transfer.
- Silent compose-override drift surfaced a broken state. The backend/.env file had been updated to EMBEDDING_PROVIDER=openai / EMBEDDING_MODEL=text-embedding-3-large at some prior point, but docker/docker-compose.yml retained an environment: block that hardcoded EMBEDDING_PROVIDER: ollama and EMBEDDING_MODEL: bge-m3. Compose's environment: block overrides env_file:, so the running container generated 1024-dim Ollama queries against a 1536-dim corpus. Vector retrieval failed silently and the FAQ matcher returned generic "verschilt per afdeling" answers for any department-scoped query. The corpus has been at 1536 dim (text-embedding-3-large) for the entire window; the dev compose path was the regression.
The split state (corpus already migrated, query-side still on Ollama via a forgotten override) is itself the strongest evidence the migration was intended and incomplete rather than a deliberate dual-provider design.
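The precedence rule behind this drift can be modelled in a few lines. This is a minimal sketch, not Compose itself: resolve_container_env is a hypothetical helper that models only the one rule that matters here, namely that values under environment: win over values loaded through env_file:.

```python
def resolve_container_env(
    env_file: dict[str, str], environment: dict[str, str]
) -> dict[str, str]:
    """Model Compose precedence: environment: entries override env_file: entries."""
    resolved = dict(env_file)
    resolved.update(environment)
    return resolved


# Values described in this ADR for backend/.env.
env_file = {
    "EMBEDDING_PROVIDER": "openai",
    "EMBEDDING_MODEL": "text-embedding-3-large",
    "EMBEDDING_DIMENSIONS": "1536",
}
# The forgotten override block in docker/docker-compose.yml.
stale_override = {"EMBEDDING_PROVIDER": "ollama", "EMBEDDING_MODEL": "bge-m3"}

drifted = resolve_container_env(env_file, stale_override)
fixed = resolve_container_env(env_file, {})

print(drifted["EMBEDDING_PROVIDER"], drifted["EMBEDDING_MODEL"])
print(fixed["EMBEDDING_PROVIDER"], fixed["EMBEDDING_MODEL"])
```

With the override present, the container sees the Ollama provider even though the .env file says OpenAI; with it removed, the .env values pass through unchanged.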
Decision
Standardise on OpenAI text-embedding-3-large at 1536 dimensions for both ingest-time corpus embedding and query-time embedding. Remove Ollama from the runtime entirely.
Concrete changes:
- docker/docker-compose.yml: drop the EMBEDDING_PROVIDER, EMBEDDING_MODEL, and OLLAMA_BASE_URL overrides in the backend environment: block; let env_file: ../backend/.env cascade authoritatively. Drop the ollama and ollama-init service blocks, the ollama depends_on entry on the backend, and the ollama_data volume.
- backend/.env: already canonical (EMBEDDING_PROVIDER=openai, EMBEDDING_MODEL=text-embedding-3-large, EMBEDDING_DIMENSIONS=1536).
- backend/app/services/embedding_service.py: routes by _provider; the OpenAI branch already exists and passed a query verification against the live container after restart (211 ms first call, 150 ms second call, 1536-dim output). No code change required.
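Routing by provider, combined with a dimension check, could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of embedding_service.py: EmbeddingRouter, its constructor arguments, and the stub backends are all hypothetical, and the stubs only mimic the vector sizes quoted in this ADR rather than calling OpenAI or Ollama.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[str], Sequence[float]]


class EmbeddingRouter:
    """Hypothetical provider switch; not the actual embedding_service.py."""

    def __init__(
        self, provider: str, expected_dim: int, backends: dict[str, EmbedFn]
    ) -> None:
        if provider not in backends:
            raise ValueError(f"unknown embedding provider: {provider!r}")
        self._provider = provider
        self._expected_dim = expected_dim
        self._embed = backends[provider]

    def embed_query(self, text: str) -> list[float]:
        vector = list(self._embed(text))
        # Fail loudly on a dimension mismatch instead of searching with a bad vector.
        if len(vector) != self._expected_dim:
            raise ValueError(
                f"{self._provider} returned {len(vector)} dims, "
                f"expected {self._expected_dim}"
            )
        return vector


# Stub backends that only mimic the output sizes quoted in this ADR.
backends: dict[str, EmbedFn] = {
    "openai": lambda text: [0.0] * 1536,
    "ollama": lambda text: [0.0] * 1024,
}

router = EmbeddingRouter("openai", expected_dim=1536, backends=backends)
vector = router.embed_query("Wat zijn de bezoekuren bij cardiologie?")
print(len(vector))
```

A check of this kind would have turned the 1024-versus-1536 mismatch into an immediate error rather than a silent retrieval failure.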
Consequences
Positive
- Roughly 8–39× faster embeddings on voice turns, depending on which ends of the measured ranges are compared (Ollama 1.7–5.8 s → OpenAI 150–211 ms). Removes the largest fixed cost on every voice query.
- Retrieval actually works on the dev path. The 1024/1536 dim mismatch silently broke vector search; with the provider unified on text-embedding-3-large, department-scoped queries return department-specific content instead of falling through to FAQ generics.
- No serialization queue. OpenAI handles concurrent requests; the speculative+canonical parallel-embedding path in rag.retrieval_mixin actually runs in parallel now instead of serialising behind Ollama's single slot.
- Smaller compose stack. The zol-ollama and zol-ollama-init containers are gone: fewer moving parts, lower local memory footprint (~1.4 GB freed).
- Closes the env-drift class-of-bug. Removing the override means backend/.env is now the single source of truth for embedding config.
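The difference between a single-slot backend and a concurrent one can be illustrated with asyncio. This is a toy sketch under stated assumptions: fake_embed and its 50 ms latency are made up, and the asyncio.Lock merely stands in for a backend that processes one request at a time; none of this is taken from rag.retrieval_mixin.

```python
import asyncio
import time


async def fake_embed(text: str, latency_s: float = 0.05) -> list[float]:
    """Stand-in for a network embedding call; the latency is invented."""
    await asyncio.sleep(latency_s)
    return [float(len(text))]


async def embed_serialized(texts: list[str], slot: asyncio.Lock) -> list[list[float]]:
    """Requests queue behind a single slot, as with a one-at-a-time backend."""

    async def one(text: str) -> list[float]:
        async with slot:
            return await fake_embed(text)

    return await asyncio.gather(*(one(t) for t in texts))


async def embed_parallel(texts: list[str]) -> list[list[float]]:
    """Requests are in flight at the same time."""
    return await asyncio.gather(*(fake_embed(t) for t in texts))


async def main() -> tuple[float, float]:
    texts = ["speculative query", "canonical query"]

    start = time.perf_counter()
    await embed_serialized(texts, asyncio.Lock())
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    await embed_parallel(texts)
    parallel_s = time.perf_counter() - start
    return serial_s, parallel_s


serial_s, parallel_s = asyncio.run(main())
print(f"serialized: {serial_s:.3f} s, parallel: {parallel_s:.3f} s")
```

With two requests, the serialized path takes about twice the single-call latency while the parallel path takes about one call's worth.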
Negative
- External-API dependency on OpenAI. Network outage or OpenAI outage degrades search to keyword/BM25 fallback. Mitigated by: embedding service raises clear errors that surface as a graceful degradation in the RAG pipeline's existing exception path.
- Per-query cost. text-embedding-3-large at 1536 dim is $0.13 per 1 M tokens. At ~25 K queries/month × ~50 tokens average, that's ~1.25 M tokens/month ≈ $0.16/month (about $2/year). Trivial.
- GDPR posture is slightly looser. Query strings now leave the EU when they are sent to OpenAI. Justified above for public-website + non-PHI query content. If a regulator pushes back, or the surface ever expands to include identifying patient input, revisit with Azure OpenAI EU.
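The degradation path for the external-API dependency might be sketched as follows. All names here (search_with_fallback, the callables passed in, the logger name) are hypothetical and do not describe the RAG pipeline's actual exception path; the sketch only shows the shape of "try embeddings, fall back to keyword search on failure".

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("rag.fallback_sketch")


def search_with_fallback(
    query: str,
    embed: Callable[[str], Sequence[float]],
    vector_search: Callable[[Sequence[float]], list[str]],
    keyword_search: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Return (mode, hits); mode records which retrieval path answered."""
    try:
        vector = embed(query)
    except Exception as exc:  # a real service would catch narrower provider errors
        logger.warning("embedding failed, using keyword search: %s", exc)
        return "keyword", keyword_search(query)
    return "vector", vector_search(vector)


def failing_embed(text: str) -> Sequence[float]:
    """Simulates a network or provider outage."""
    raise ConnectionError("embedding API unreachable")


mode, hits = search_with_fallback(
    "bezoekuren cardiologie",
    embed=failing_embed,
    vector_search=lambda vector: ["vector hit"],
    keyword_search=lambda query: ["keyword hit"],
)
print(mode, hits)
```

Returning the mode alongside the hits makes it easy to log or alert on how often the fallback path is taken.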
Neutral
- Corpus does not need re-embedding. Already at 1536-dim text-embedding-3-large (3 761 chunks verified via vector_dims(embedding) aggregation in app.document_chunks).
- backend/app/services/embedding_service.py retains both code paths. The if self._provider == "ollama" branches stay in source so a future on-prem deployment can flip the env back. Code cleanup deferred to a future ADR if/when the multi-provider abstraction proves unused for ≥ 6 months.
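A corpus dimension check of the kind mentioned above could be scripted roughly like this. The sketch assumes app.document_chunks is a schema-qualified pgvector table with an embedding column; the exact query and the corpus_chunk_count helper are illustrative, not the verification that was actually run. vector_dims is the pgvector function that returns a vector's dimensionality.

```python
from typing import Iterable

# Assumed query shape; run it with whatever PostgreSQL client the project uses.
DIMS_QUERY = """
SELECT vector_dims(embedding) AS dims, count(*) AS chunks
FROM app.document_chunks
GROUP BY 1
"""


def corpus_chunk_count(rows: Iterable[tuple[int, int]], expected_dim: int) -> int:
    """Validate (dims, chunks) rows and return the chunk count at expected_dim."""
    counts = {int(dims): int(chunks) for dims, chunks in rows}
    if set(counts) != {expected_dim}:
        raise ValueError(
            f"corpus dimensions {sorted(counts)} do not match {expected_dim}"
        )
    return counts[expected_dim]


# Rows as the query would return them for the state described in this ADR.
chunks = corpus_chunk_count([(1536, 3761)], expected_dim=1536)
print(chunks)
```

A mixed result, for example rows for both 1024 and 1536, raises instead of returning a count, which flags a partially migrated corpus.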
Alternatives Considered
Alternative 1: Keep bge-m3, fix the cold-start
- Pros: Preserves data-sovereignty posture from ADR-0033. No external-API dependency.
- Cons: Even hot, bge-m3 on CPU is 150–200 ms per call, the same as OpenAI. Cold-start tax remains because keep_alive doesn't survive container restarts. Serialization tax remains because Ollama's OLLAMA_NUM_PARALLEL=1 setting is needed to avoid OOM on the dev box.
- Why rejected: Solves zero of the three constraints. The data-sovereignty argument is overweighted for non-PHI public-website query content (see Context #2).
Alternative 2: Azure OpenAI EU endpoint
- Pros: Same model family with stricter EU data-residency guarantees and Microsoft's GDPR-aligned DPA. Future-proofs the surface if patient-identifying input ever enters scope.
- Cons: Adds Azure tenant + key management to ops. ~2 × the cost of direct OpenAI. Latency from Azure West Europe is comparable to OpenAI Frankfurt.
- Why rejected (for now, not forever): Unnecessary for the current public-website + non-PHI surface area. Documented as the natural upgrade path if scope expands.
Alternative 3: text-embedding-3-small @ 1536-dim
- Pros: ~6.5 × cheaper ($0.02 per 1 M tokens vs $0.13). Slightly faster.
- Cons: text-embedding-3-large outperforms text-embedding-3-small on multilingual retrieval (MTEB) by ~3–5 percentage points. The corpus is already embedded with -large; switching to -small means re-embedding all 3 761 chunks (≈ 10 min, ≈ $0.10) AND accepting a small quality regression.
- Why rejected: Cost is not a constraint at this scale. Quality matters for the hospital surface: every false-negative on a department-scoped query results in a fallback to a generic FAQ answer the caller doesn't want.
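The cost comparison can be checked with a few lines of arithmetic. The prices and the query volume below are the figures quoted in this ADR; the helper name monthly_cost_usd is made up for the example.

```python
# Prices in USD per 1 M tokens, as quoted in this ADR.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}
QUERIES_PER_MONTH = 25_000
AVG_TOKENS_PER_QUERY = 50


def monthly_cost_usd(model: str) -> float:
    """Query-side embedding cost per month for the given model."""
    tokens = QUERIES_PER_MONTH * AVG_TOKENS_PER_QUERY  # 1.25 M tokens
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]


large = monthly_cost_usd("text-embedding-3-large")
small = monthly_cost_usd("text-embedding-3-small")
price_ratio = (
    PRICE_PER_M_TOKENS["text-embedding-3-large"]
    / PRICE_PER_M_TOKENS["text-embedding-3-small"]
)

print(f"large: ${large:.4f}/month, small: ${small:.4f}/month")
print(f"price ratio: {price_ratio:.1f}x")
```

At this volume the absolute difference is a few cents per month, which is why cost does not drive the choice between the two models.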
References
- ADR-0033 (superseded by this ADR): bge-m3 via Ollama
- ADR-0005 (superseded by ADR-0033): nomic-embed-text
- Dev-loop incident 2026-04-30: voice-channel cardiologie query reproducer that surfaced the dim-mismatch silent retrieval failure. Compose-override drift is the same class of bug as the earlier OPENROUTER_API_KEY drift.
- OpenAI 2024 Embeddings Announcement: model spec, dimensionality, pricing.
- Chen et al., 2024: BGE-M3, retained as historical reference for the predecessor model.