Embedding Models and Vector Search

Embedding models transform text into dense vector representations in which semantically similar content is geometrically close. The ZOL system uses OpenAI text-embedding-3-large, accessed via the OpenAI API and truncated from its native 3,072 dimensions to 1,536-dimensional dense vectors. See ADR-0048 for the migration rationale and @openai2024embeddings for the model announcement; the underlying dense bi-encoder retrieval pattern follows the academic precedent established by @karpukhin2020dpr. Semantically similar Dutch phrases such as "hartfalen behandeling" and "therapie voor het hart" produce vectors with high cosine similarity, even without shared words.

Trade-offs

Decision	Chosen	Alternatives considered	Rejected because
Embedding provider	OpenAI API (`text-embedding-3-large`, 1,536-dim truncated from native 3,072)	Local Ollama with BGE-M3 (1,024-dim, @chen2024bgem3); local nomic-embed-text (768-dim); local mxbai-embed-large (1,024-dim, English-only)	mxbai-embed-large failed Dutch outright. nomic-embed-text had no published Dutch benchmark and produced moderate retrieval quality. BGE-M3 served well from Feb–Apr 2026 (MTEB-NL 60.0) but trailed text-embedding-3-large on the same benchmark (~64.6) and required Ollama operational overhead. text-embedding-3-large costs ~$0.20/month at our 25 K query volume — operationally negligible — and removed the Ollama dependency entirely. See ADR-0048.
Dimensionality	1,536 (truncated from native 3,072)	Native 3,072; 768 (smaller variant)	pgvector's HNSW index has a 2,000-dim hard limit; native 3,072 cannot be indexed without lossy quantisation. The 1,536-dim truncation is OpenAI's documented Matryoshka-representation approach: cosine geometry is preserved by simple truncation. The smaller variants traded too much Dutch quality.
Index type	HNSW (Malkov & Yashunin 2018)	IVFFlat; brute-force scan; FAISS (@johnson2017faiss)	IVFFlat degrades on dynamic inserts without periodic reindexing — unacceptable for a corpus that grows nightly. Brute-force is O(N) per query and fine at 10 K chunks but stops scaling at the 50 K horizon. FAISS would mean a separate operational system on top of pgvector; we kept everything in pgvector for the one-database, one-backup story.
Token counter	tiktoken `cl100k_base`	Hugging Face `transformers` tokeniser; word count	Embedder is OpenAI; cl100k_base is OpenAI's tokeniser, so token counts are exact rather than approximate. Word count is 10–20 % off for Dutch compounds. The Hugging Face tokeniser is correct but heavier as a runtime dependency for what is just a chunk-size guard.
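
The Dimensionality row above relies on Matryoshka-style truncation. As a minimal sketch of what that means (not the ZOL implementation): keep the first 1,536 components of a longer vector and rescale to unit length so that dot products remain cosine similarities. The function name and the toy input vector are illustrative assumptions. OpenAI's embeddings API can also return shortened vectors directly through its `dimensions` parameter, in which case no manual step is needed.

```python
import math


def truncate_and_normalize(vec, dims=1536):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in head]


# Toy stand-in for a native 3,072-dim embedding (not real model output).
native = [math.sin(i + 1) for i in range(3072)]
short = truncate_and_normalize(native)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), short_norm)
```

Re-normalising after a manual cut matters because the truncated prefix of a unit vector is generally shorter than unit length.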

The Embedding Model Selection Journey

Selecting the right embedding model for the ZOL system required careful evaluation against three criteria: Dutch language quality, dimensional efficiency, and operational cost. The journey went through four earlier models (three initial attempts, then BGE-M3) before arriving at the current production model.

Attempt 1: OpenAI text-embedding-3-small

The initial implementation used OpenAI's text-embedding-3-small model:

Dimensions: 1,536
Quality: Excellent for English, good for Dutch
Cost: $0.02 per million tokens (API-based)
Privacy: All content sent to OpenAI's API

While the quality was acceptable, the reliance on an external API created two concerns: ongoing cost at scale (25,000 monthly queries, plus ingestion) and data egress (all hospital content leaving the infrastructure for embedding).

Attempt 2: mxbai-embed-large (Local)

To address the API dependency, the team evaluated mxbai-embed-large, a high-quality open-source model that could run locally via Ollama:

Dimensions: 1,024
Quality: Excellent for English
Cost: Zero (local inference)
Problem: No Dutch language support

Testing revealed that mxbai-embed-large produced poor embeddings for Dutch medical text. Semantically related Dutch phrases were not mapped to nearby vectors, making retrieval unreliable. This model was quickly eliminated.

Attempt 3: nomic-embed-text (Previously Selected)

The third model evaluated was nomic-embed-text, which satisfied the local-inference and multilingual criteria:

Dimensions: 768
Quality: Moderate multilingual support (~20 languages including Dutch)
Context window: 8,192 tokens
Cost: Zero (local inference via Ollama)
Problem: No Dutch-specific benchmark score (MTEB-NL unavailable at time of selection)

nomic-embed-text served as the production model from initial deployment through February 2026 (documented in ADR-0005). While it performed adequately, the lack of a Dutch benchmark score made it difficult to assess true retrieval quality for Dutch medical content.

Previous: bge-m3 (Feb–Apr 2026)

For two months in early 2026 the production model was BAAI/bge-m3 at 1,024 dimensions, selected after the MTEB-NL benchmark became available (September 2025) and replacing nomic-embed-text. It served as the production embedding model from February 2026 (ADR-0033) until the migration to OpenAI in April 2026 (ADR-0048).

Dimensions: 1,024
Quality: Strong multilingual support (100+ languages), MTEB-NL retrieval score 60.0
Context window: 8,192 tokens
Cost: Zero (local inference via Ollama)
Privacy: All processing on-premise

BGE-M3 still survives in the stack as the ColBERT reranker model (feature-flagged), since BGE-M3 natively supports the late-interaction multi-vector mode required by ColBERT. See Reranking & Evaluation and the academic basis in Khattab & Zaharia 2020.

Current: text-embedding-3-large (OpenAI, ADR-0048)

The current embedding model is OpenAI text-embedding-3-large:

Dimensions: 1,536 (truncated from native 3,072 to fit pgvector's HNSW 2,000-dim limit)
Quality: Strong multilingual support; MTEB-NL retrieval score ~64.6 (above BGE-M3's 60.0)
Cost: $0.13 per million tokens (75% prompt-cache discount; ~$0.20/month at 25,000 monthly queries)
Privacy: Content is sent to OpenAI's API (a trade-off accepted in exchange for stronger quality and operational simplicity, per ADR-0048)

The migration rationale (ADR-0048): superior MTEB-NL Dutch retrieval, removal of the Ollama operational dependency, and predictable latency under load. See OpenAI 2024 for the canonical announcement.
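
As a hedged sketch of how such vectors might be requested (the ZOL client code is not shown in this document), the helper below calls the OpenAI Python SDK's `embeddings.create` with the `dimensions` parameter set to 1,536. The function name `embed_texts` and the way the client is passed in are assumptions made for illustration.

```python
def embed_texts(client, texts, model="text-embedding-3-large", dims=1536):
    """Return one `dims`-dimensional vector per input text, in input order."""
    response = client.embeddings.create(model=model, input=texts, dimensions=dims)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Example usage (requires the `openai` package and an API key):
# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# vectors = embed_texts(client, ["hartfalen behandeling", "therapie voor het hart"])
```

Passing the client as an argument keeps the helper easy to exercise with a stub in tests.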

Why This Matters

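The production system has used three embedding models in succession: nomic-embed-text (768d) → BGE-M3 (1024d) → text-embedding-3-large (3072d native, truncated to 1536d for HNSW). The current model has the highest MTEB-NL retrieval score of the models benchmarked (~64.6, versus 60.0 for BGE-M3), which is the best available proxy for retrieval quality on Dutch medical content.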

Model Comparison

Model	Dims	MTEB-NL	Dutch	Provider	Cost	Status
OpenAI text-embedding-3-small	1,536	N/A	Good	OpenAI API	$0.02/M tokens	Rejected (Attempt 1)
mxbai-embed-large	1,024	N/A	Poor	Ollama	Free	Rejected (Attempt 2)
nomic-embed-text	768	N/A	Moderate	Ollama	Free	Replaced Feb 2026
bge-m3	1,024	60.0	Strong	Ollama	Free	Replaced Apr 2026 (ADR-0048); now ColBERT-only
text-embedding-3-large (current)	1,536	~64.6	Strong	OpenAI API	$0.13/M tokens	Current (ADR-0048)

Contextual Embeddings

The ZOL system implements Anthropic's Contextual Retrieval technique. Rather than embedding raw chunk text, each chunk is embedded as enriched text combining three components:

Chunk context: An LLM-generated summary of the surrounding document context
Canonical questions: LLM-generated questions that this chunk can answer
Raw chunk text: The original content

This enrichment ensures that even isolated chunks retain document-level context in their embedding vectors. Anthropic's research shows this reduces the top-20-chunk retrieval failure rate by 35% compared to naive embedding, and by 49% when combined with BM25 keyword search.

The enrichment data is generated once during ingestion and stored in chunk_metadata. Both the ingestion pipeline and the re-embedding script use the same _build_enriched_text() function to ensure consistency.
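
The real `_build_enriched_text()` is not reproduced in this document. The sketch below is a hypothetical stand-in that only shows the idea of concatenating the three components in the order listed above; the function name, parameter names, and separators are assumptions.

```python
def build_enriched_text_sketch(chunk_text, context="", questions=()):
    """Combine chunk context, canonical questions, and raw text into one string."""
    parts = []
    if context and context.strip():
        parts.append(context.strip())
    cleaned = [q.strip() for q in questions if q and q.strip()]
    if cleaned:
        parts.append("\n".join(cleaned))
    parts.append(chunk_text.strip())
    # Blank lines separate the components; empty components are skipped.
    return "\n\n".join(parts)


enriched = build_enriched_text_sketch(
    "Hartfalen wordt behandeld met medicatie en leefstijladvies.",
    context="Brochure over hartfalen, hoofdstuk behandeling.",
    questions=["Hoe wordt hartfalen behandeld?"],
)
print(enriched)
```

Routing both ingestion and re-embedding through a single function of this kind is what keeps the stored vectors comparable after a model change.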

Vector Similarity Search

At query time, the user's question is embedded using the same text-embedding-3-large model, producing a 1,536-dimensional query vector. This vector is then compared against all stored document chunk vectors using cosine similarity:

$$ \text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} $$

Cosine similarity measures the angle between two vectors and ranges from -1 (opposite directions) to +1 (same direction). For unit-length vectors (which OpenAI's embedding API returns), the denominator is 1, so cosine similarity reduces to the dot product, which is cheaper to compute.
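
A small worked example, using toy 3-dimensional vectors rather than real embeddings, shows the formula and its agreement with the dot product once both vectors are scaled to unit length. The helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

cos_ab = cosine_similarity(a, b)               # 24 / (5 * 5) = 0.96
dot_unit = dot(normalize(a), normalize(b))     # same value, no norms needed
print(cos_ab, dot_unit)
```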

Approximate Nearest Neighbors

With ~10,400 document chunks (May 2026 production corpus), an exact brute-force comparison against every vector is still workable, but its cost grows linearly with corpus size and would stop scaling at the 50 K-chunk horizon. The system therefore uses an Approximate Nearest Neighbor (ANN) index, which trades a small amount of accuracy for much faster search; see Johnson et al. 2017 for the academic basis of billion-scale ANN search and pgvector for our specific implementation.
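
For reference, the exact search that an ANN index approximates can be sketched in a few lines. This is a baseline illustration with toy 2-dimensional unit vectors, not code from the ZOL system; it scores every stored vector, which is why its cost is O(N) per query.

```python
import heapq


def top_k_exact(query, chunks, k=5):
    """Exact nearest neighbours by dot product.

    `chunks` is an iterable of (chunk_id, unit_vector) pairs; with unit
    vectors the dot product equals cosine similarity.
    """
    scored = (
        (sum(q * x for q, x in zip(query, vec)), chunk_id)
        for chunk_id, vec in chunks
    )
    # nlargest keeps only k candidates in memory while scanning all N.
    return heapq.nlargest(k, scored)


toy_chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.6, 0.8])]
results = top_k_exact([1.0, 0.0], toy_chunks, k=2)
print(results)
```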

HNSW: How It Works

The Hierarchical Navigable Small World (HNSW) algorithm, proposed by Malkov and Yashunin (2018), builds a multi-layered graph where:

The bottom layer contains all vectors, connected to their nearest neighbors
Higher layers contain progressively fewer vectors, acting as "express lanes"
Search starts at the top layer and navigates down, narrowing the search space at each level
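
Why higher layers hold progressively fewer vectors follows from how the HNSW paper assigns each inserted element a maximum layer: an exponentially decaying random level, floor(-ln(u) * mL) with mL = 1/ln(M). The simulation below is a sketch of that rule only, with M = 16 chosen for illustration; it does not build a graph, and pgvector's internals may differ in detail.

```python
import math
import random
from collections import Counter


def sample_level(m, rng):
    """Draw the top layer for one new element, following the HNSW level rule."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(0)
levels = Counter(sample_level(16, rng) for _ in range(100_000))

# An element with top level L is present on every layer from 0 up to L.
layer_sizes = [
    sum(count for level, count in levels.items() if level >= layer)
    for layer in range(max(levels) + 1)
]
print(layer_sizes)
```

With M = 16, each layer is expected to hold roughly one sixteenth of the layer below it, which is what makes the upper layers work as express lanes.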

Why HNSW over IVFFlat?

pgvector supports two index types. HNSW was selected because ZOL's content is continuously updated: new brochures are published, website content changes, and doctors join or leave.

Property	HNSW	IVFFlat
Dynamic inserts	Graceful (no reindex)	Degrades without reindex
Build time	Slower initial build	Faster initial build
Query accuracy	Higher (recall ~99%)	Lower without tuning
Memory	Higher	Lower

For a system where content freshness directly impacts user experience, HNSW's ability to maintain search quality without operational maintenance (reindexing) was the deciding factor.
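
A minimal sketch of what an HNSW index and a cosine-distance query look like in pgvector, held as SQL strings for use with a PostgreSQL driver. The table name `chunks`, the column names, and the index name are hypothetical; the `m` and `ef_construction` values shown are pgvector's defaults, not values confirmed for the ZOL deployment.

```python
# SQL for pgvector; `chunks`, `id`, and `embedding` are assumed names.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

# `<=>` is pgvector's cosine-distance operator; similarity = 1 - distance.
SEARCH_SQL = """
SELECT id, 1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(vec):
    """Format a Python sequence as pgvector's text representation, e.g. [1.0,0.5]."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


params = {"query": to_vector_literal([1, 0.5, 0.25]), "k": 5}
print(params["query"])
```

New rows inserted into the table are added to the HNSW graph as they arrive, which is the no-reindex property the comparison above turns on.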

Token Counting Approximation

The chunking service uses tiktoken's cl100k_base tokenizer to count tokens for chunk size control. Because text-embedding-3-large is also an OpenAI model and uses cl100k_base internally, the counts are exact. Historically, when the embedder was BGE-M3 (Feb–Apr 2026), the cl100k_base count was only approximate (~10–15% variance), which was acceptable because the gap between the chunk size target (350 tokens) and the maximum (450 tokens) is wide enough to absorb the variance. The Ollama fallback path (e.g. llama3.2:3b) retains that approximate behavior.
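
The margin argument can be checked with simple arithmetic: a chunk counted at the 350-token target, undercounted by 15%, is at most 350 × 1.15 = 402.5 tokens, still under the 450-token maximum. The sketch below encodes that check; the function names are illustrative, and `count_tokens_cl100k` assumes the third-party `tiktoken` package is installed.

```python
TARGET_TOKENS = 350
MAX_TOKENS = 450


def worst_case_tokens(counted, variance):
    """Upper bound on the true token count if the counter undercounts by `variance`."""
    return counted * (1.0 + variance)


def within_limit(counted, variance=0.0, maximum=MAX_TOKENS):
    return worst_case_tokens(counted, variance) <= maximum


def count_tokens_cl100k(text):
    """Token count with tiktoken's cl100k_base encoding."""
    import tiktoken  # third-party; imported lazily so the checks above run without it

    return len(tiktoken.get_encoding("cl100k_base").encode(text))


print(worst_case_tokens(TARGET_TOKENS, 0.15))   # worst case for a target-size chunk
print(within_limit(TARGET_TOKENS, 0.15))        # fits under the maximum
print(MAX_TOKENS / TARGET_TOKENS - 1.0)         # largest variance the margin absorbs
```

The last line shows that the 350/450 gap tolerates a variance of roughly 28%, comfortably above the 10–15% observed with a non-OpenAI embedder.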

Embedding Pipeline Summary

References

Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
Banar, N., & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding benchmark and models for Dutch. arXiv preprint, arXiv:2509.12340. https://arxiv.org/abs/2509.12340
Chen, J., et al. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
Karpukhin, V., et al. (2020). Dense passage retrieval for open-domain question answering. Proceedings of EMNLP 2020, 6769–6781. https://arxiv.org/abs/2004.04906
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. Proceedings of EMNLP 2019, 3982–3992. https://arxiv.org/abs/1908.10084
