ADR-0048: Migrate Embeddings from Ollama bge-m3 to OpenAI text-embedding-3-large

Master record: docs/ADR/0048-openai-embeddings-migration.md. The master is canonical; this Docusaurus rendering is for in-site navigation.

Date: 2026-04-30
Status: Accepted (supersedes ADR-0033, BGE-M3 via Ollama)
Deciders: Tsunami-max (engineering)

Context

ADR-0033 (2026-02-18) selected bge-m3 (1024-dim) hosted via Ollama as the embedding provider for the ZOL RAG system, replacing nomic-embed-text from ADR-0005. The choice was driven by Dutch-language quality on MTEB-NL (Chen et al., 2024) and a data-sovereignty preference for keeping embeddings on-premise.

In the period since, several constraints shifted:

  1. Voice channel went live. Time-to-first-sound on voice turns is the primary user-perceived latency metric. bge-m3 is a 566 M-parameter F16 model running on CPU inside Ollama; first-call cold start is 1.7–2.1 s, and Ollama serializes concurrent requests, so the speculative+canonical parallel-embedding path in rag.retrieval_mixin paid 5–8 s wall-clock on voice turns. Three-filler ladders ("Even kijken…", "Ik ben nog aan het zoeken…", "Het duurt wat langer…") fired routinely as a result.
  2. Public-website context, low PII. The chatbot serves a hospital public website. Question text contains symptom keywords but no patient identifiers, lab values, or medical-record content. The data-sovereignty argument that motivated bge-m3 was overweighted for this surface area; embedding the query string "Wat zijn de bezoekuren bij cardiologie?" via OpenAI does not constitute a PHI/GDPR transfer.
  3. Silent compose-override drift surfaced a broken state. The backend/.env file had been updated to EMBEDDING_PROVIDER=openai / EMBEDDING_MODEL=text-embedding-3-large at some prior point, but docker/docker-compose.yml retained an environment: block that hardcoded EMBEDDING_PROVIDER: ollama and EMBEDDING_MODEL: bge-m3. Compose's environment: overrides env_file:, so the running container generated 1024-dim Ollama queries against a 1536-dim corpus — vector retrieval failed silently and the FAQ matcher returned generic "verschilt per afdeling" answers for any department-scoped query. The corpus has been at 1536 dim (text-embedding-3-large) for the entire window; the dev compose path was the regression.

The split state (corpus already migrated, query-side still on Ollama via a forgotten override) is itself the strongest evidence the migration was intended and incomplete rather than a deliberate dual-provider design.
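
The precedence behind item 3 can be modelled in a few lines. This is a minimal sketch, assuming both sources are plain key-value mappings; `effective_env` and `find_shadowed` are hypothetical helpers for illustration, not part of the project or of Docker Compose.

```python
def effective_env(env_file: dict[str, str], environment: dict[str, str]) -> dict[str, str]:
    """Model Compose precedence: keys under `environment:` win over `env_file:`."""
    merged = dict(env_file)
    merged.update(environment)
    return merged


def find_shadowed(env_file: dict[str, str], environment: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return keys whose env_file value is replaced by a different override."""
    return {
        key: (env_file[key], environment[key])
        for key in env_file.keys() & environment.keys()
        if env_file[key] != environment[key]
    }


# Values taken from the ADR text: backend/.env versus the compose override.
env_file = {
    "EMBEDDING_PROVIDER": "openai",
    "EMBEDDING_MODEL": "text-embedding-3-large",
    "EMBEDDING_DIMENSIONS": "1536",
}
compose_environment = {
    "EMBEDDING_PROVIDER": "ollama",
    "EMBEDDING_MODEL": "bge-m3",
}

running = effective_env(env_file, compose_environment)
shadowed = find_shadowed(env_file, compose_environment)
```

Here `running` ends up with the Ollama provider and model even though backend/.env names OpenAI, and `shadowed` lists exactly the two keys that drifted. A check along these lines could run in CI to flag overrides that disagree with the env file.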

Decision

Standardise on OpenAI text-embedding-3-large at 1536 dimensions (reduced from the model's native 3072) for both ingest-time corpus embedding and query-time embedding. Remove Ollama from the runtime entirely.

Concrete changes:

  • docker/docker-compose.yml: drop the EMBEDDING_PROVIDER, EMBEDDING_MODEL, and OLLAMA_BASE_URL overrides in the backend environment: block; let env_file: ../backend/.env cascade authoritatively. Drop the ollama and ollama-init service blocks, the ollama depends_on entry on the backend, and the ollama_data volume.
  • backend/.env: already canonical (EMBEDDING_PROVIDER=openai, EMBEDDING_MODEL=text-embedding-3-large, EMBEDDING_DIMENSIONS=1536).
  • backend/app/services/embedding_service.py: routes by _provider; the OpenAI branch already exists and handled the live container query verification post-restart (211 ms first call, 150 ms second call, 1536-dim output). No code change required.
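
Provider routing with a dimension check can be sketched as follows. This is an illustration only: the real embedding_service.py is not shown in this ADR, so the class below, its constructor arguments, and the fake backends are all assumptions. It shows how a mismatched vector could be rejected at the service boundary instead of failing silently in retrieval.

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


class EmbeddingService:
    """Hypothetical stand-in for a service that routes by provider name."""

    def __init__(self, provider: str, dimensions: int, backends: dict[str, Embedder]):
        if provider not in backends:
            raise ValueError(f"unknown embedding provider: {provider!r}")
        self._provider = provider
        self._dimensions = dimensions
        self._backends = backends

    def embed(self, text: str) -> list[float]:
        vector = self._backends[self._provider](text)
        # Fail loudly rather than let a mismatched vector reach the index.
        if len(vector) != self._dimensions:
            raise ValueError(
                f"{self._provider} returned {len(vector)} dims, "
                f"expected {self._dimensions}"
            )
        return vector


# Fake backends so the sketch runs offline; real ones would call the providers.
fake_backends = {
    "openai": lambda text: [0.0] * 1536,
    "ollama": lambda text: [0.0] * 1024,
}

service = EmbeddingService("openai", 1536, fake_backends)
vector = service.embed("Wat zijn de bezoekuren bij cardiologie?")
```

With the provider set to "ollama" and the expected dimension left at 1536, `embed` raises a `ValueError`, which is the loud failure the drifted dev container lacked.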

Consequences

Positive

  • Up to ~30× faster embeddings on voice turns (Ollama 1.7–5.8 s → OpenAI 150–211 ms). Removes the largest fixed cost on every voice query.
  • Retrieval actually works on the dev path. The 1024/1536 dim mismatch silently broke vector search; with provider unified to text-embedding-3-large, department-scoped queries return department-specific content instead of falling through to FAQ generics.
  • No serialization queue. OpenAI handles concurrent requests; the speculative+canonical parallel-embedding path in rag.retrieval_mixin actually runs in parallel now instead of serialising behind Ollama's single slot.
  • Smaller compose stack. zol-ollama and zol-ollama-init containers are gone — fewer moving parts, lower local memory footprint (~1.4 GB freed).
  • Closes the env-drift class-of-bug. Removing the override means backend/.env is now the single source of truth for embedding config.
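
The difference between a single-slot backend and a concurrent one can be illustrated with a small simulation. This is a sketch under stated assumptions: `FakeEmbedder` and `embed_both` are hypothetical, the 10 ms sleep stands in for real latency, and nothing here reproduces the actual code in rag.retrieval_mixin.

```python
import asyncio
from typing import Optional


class FakeEmbedder:
    """Simulated backend that records how many calls are in flight at once."""

    def __init__(self, max_parallel: Optional[int] = None):
        # max_parallel=1 models a single-slot server; None models no limit.
        self._slots = (
            asyncio.Semaphore(max_parallel) if max_parallel is not None else None
        )
        self.in_flight = 0
        self.peak_in_flight = 0

    async def embed(self, text: str) -> str:
        if self._slots is None:
            return await self._run(text)
        async with self._slots:
            return await self._run(text)

    async def _run(self, text: str) -> str:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # stand-in for network or model latency
        self.in_flight -= 1
        return f"vec({text})"


async def embed_both(embedder: FakeEmbedder, speculative: str, canonical: str):
    """Issue the speculative and canonical embeddings concurrently."""
    return await asyncio.gather(
        embedder.embed(speculative), embedder.embed(canonical)
    )


async def demo() -> tuple[int, int]:
    # Embedders are created inside the running loop to keep the semaphore bound to it.
    single_slot = FakeEmbedder(max_parallel=1)
    unlimited = FakeEmbedder()
    await embed_both(single_slot, "speculative query", "canonical query")
    await embed_both(unlimited, "speculative query", "canonical query")
    return single_slot.peak_in_flight, unlimited.peak_in_flight


serialized_peak, parallel_peak = asyncio.run(demo())
```

Both runs use the same `asyncio.gather` call; only the backend's capacity differs. With one slot the peak concurrency stays at 1, so the second embedding waits for the first, while the unlimited backend reaches a peak of 2.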

Negative

  • External-API dependency on OpenAI. A network outage or an OpenAI outage degrades search to the keyword/BM25 fallback. Mitigation: the embedding service raises explicit errors, which the RAG pipeline's existing exception path turns into graceful degradation.
  • Per-query cost. text-embedding-3-large at 1536 dim is $0.13 per 1 M tokens. At ~25 K queries/month × ~50 tokens average, that's ~1.25 M tokens/month ≈ $0.16/month, or roughly $2/year. Trivial.
  • GDPR posture is slightly looser. Query strings now leave the EU via OpenAI; Azure OpenAI EU is the fallback if a regulator pushes back. Justified above for public-website + non-PHI query content. If the surface ever expands to include identifying patient input, revisit with Azure OpenAI EU.
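
The per-query cost estimate can be reproduced from the three inputs quoted in the cost bullet. A minimal sketch; the constant names are illustrative and the inputs are the ADR's own approximations.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.13  # text-embedding-3-large, as quoted in this ADR
QUERIES_PER_MONTH = 25_000           # approximate query volume
TOKENS_PER_QUERY = 50                # approximate average query length

tokens_per_month = QUERIES_PER_MONTH * TOKENS_PER_QUERY
cost_per_month = tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD
cost_per_year = cost_per_month * 12

print(f"{tokens_per_month:,} tokens/month")
print(f"${cost_per_month:.2f}/month, ${cost_per_year:.2f}/year")
```

With these inputs the volume is 1.25 M tokens per month, which costs about $0.16 per month or about $1.95 per year. The conclusion that cost is negligible at this scale holds either way.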

Neutral

  • Corpus does not need re-embedding. Already at 1536-dim text-embedding-3-large (3 761 chunks verified via vector_dims(embedding) aggregation in app.document_chunks).
  • backend/app/services/embedding_service.py retains both code paths. The if self._provider == "ollama" branches stay in source so a future on-prem deployment can flip the env back. Code cleanup deferred to a future ADR if/when the multi-provider abstraction proves unused for ≥ 6 months.
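
The corpus check mentioned above can be sketched as a query plus a small validation step. Assumptions: the database uses the pgvector extension (which provides `vector_dims`), app.document_chunks is a schema-qualified table with an `embedding` column, and `corpus_is_uniform` is a hypothetical helper. The query is shown as a string and not executed here.

```python
# Histogram of embedding dimensionality across the corpus (pgvector assumed).
DIMS_HISTOGRAM_SQL = """
SELECT vector_dims(embedding) AS dims, COUNT(*) AS chunks
FROM app.document_chunks
GROUP BY vector_dims(embedding)
ORDER BY dims;
"""


def corpus_is_uniform(histogram: list[tuple[int, int]], expected_dims: int) -> bool:
    """True when every chunk has exactly the expected dimensionality."""
    return len(histogram) == 1 and histogram[0][0] == expected_dims


# Rows shaped like the query result: (dims, chunk count).
verified = corpus_is_uniform([(1536, 3761)], expected_dims=1536)
mixed = corpus_is_uniform([(1024, 10), (1536, 3751)], expected_dims=1536)
```

A single row of (1536, 3761) matches the state reported in this ADR. Any second row would mean part of the corpus was embedded with a different model or dimension and needs re-embedding before retrieval can be trusted.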

Alternatives Considered

Alternative 1: Keep bge-m3, fix the cold-start

  • Pros: Preserves data-sovereignty posture from ADR-0033. No external-API dependency.
  • Cons: Even hot, bge-m3 on CPU is 150–200 ms per call — same as OpenAI. Cold-start tax remains because keep_alive doesn't survive container restarts. Serialization tax remains because Ollama's OLLAMA_NUM_PARALLEL=1 setting is needed to avoid OOM on the dev box.
  • Why rejected: Solves zero of the three constraints. The data-sovereignty argument is overweighted for non-PHI public-website query content (see Context #2).

Alternative 2: Azure OpenAI EU endpoint

  • Pros: Same model family with stricter EU data-residency guarantees and Microsoft's GDPR-aligned DPA. Future-proofs the surface if patient-identifying input ever enters scope.
  • Cons: Adds Azure tenant + key management to ops. ~2 × the cost of direct OpenAI. Latency from Azure West Europe is comparable to OpenAI Frankfurt.
  • Why rejected (for now, not forever): Unnecessary for the current public-website + non-PHI surface area. Documented as the natural upgrade path if scope expands.

Alternative 3: text-embedding-3-small @ 1536-dim

  • Pros: ~6.5 × cheaper ($0.02 per 1 M tokens vs $0.13). Slightly faster.
  • Cons: text-embedding-3-large outperforms text-embedding-3-small on multilingual retrieval (MTEB) by ~3–5 percentage points. The corpus is already embedded with -large; switching to -small means re-embedding all 3 761 chunks (≈ 10 min, ≈ $0.10) AND accepting a small quality regression.
  • Why rejected: Cost is not a constraint at this scale. Quality matters for the hospital surface — every false-negative on a department-scoped query results in a fallback to a generic FAQ answer the caller doesn't want.

References

  • ADR-0033 (superseded by this ADR): bge-m3 via Ollama
  • ADR-0005 (superseded by ADR-0033): nomic-embed-text
  • Dev-loop incident 2026-04-30: voice-channel cardiologie query reproducer that surfaced the dim-mismatch silent retrieval failure. Compose-override drift is the same class of bug as the earlier OPENROUTER_API_KEY drift.
  • OpenAI 2024 Embeddings Announcement — model spec, dimensionality, pricing.
  • Chen et al., 2024 — BGE-M3 retained as historical reference for the predecessor model.