Embedding Models and Vector Search
Embedding models transform text into dense vector representations in which semantically similar content is geometrically close. The ZOL system uses OpenAI text-embedding-3-large, accessed via the OpenAI API and stored as 1,536-dimensional dense vectors (truncated from the model's native 3,072). See ADR-0048 for the migration rationale and @openai2024embeddings for the model announcement; the underlying dense bi-encoder retrieval pattern follows the academic precedent established by @karpukhin2020dpr. Semantically similar Dutch phrases like "hartfalen behandeling" and "therapie voor het hart" produce vectors with high cosine similarity, even without shared words.
Trade-offs
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Embedding provider | OpenAI API (text-embedding-3-large, 1,536-dim truncated from native 3,072) | Local Ollama with BGE-M3 (1,024-dim, @chen2024bgem3); local nomic-embed-text (768-dim); local mxbai-embed-large (1,024-dim, English-only) | mxbai-embed-large failed Dutch outright. nomic-embed-text had no published Dutch benchmark and produced moderate retrieval quality. BGE-M3 served well from Feb–Apr 2026 (MTEB-NL 60.0) but trailed text-embedding-3-large on the same benchmark (~64.6) and required Ollama operational overhead. text-embedding-3-large costs ~$0.20/month at our 25 K query volume — operationally negligible — and removed the Ollama dependency entirely. See ADR-0048. |
| Dimensionality | 1,536 (truncated from native 3,072) | Native 3,072; 768 (smaller variant) | pgvector's HNSW index has a 2,000-dim hard limit; native 3,072 cannot be indexed without lossy quantisation. The 1,536-dim truncation is OpenAI's documented Matryoshka-representation approach: the leading dimensions carry most of the signal, so cosine geometry is largely preserved once the truncated vector is re-normalised to unit length. The smaller variants traded too much Dutch quality. |
| Index type | HNSW (Malkov & Yashunin 2018) | IVFFlat; brute-force scan; FAISS (@johnson2017faiss) | IVFFlat degrades on dynamic inserts without periodic reindexing — unacceptable for a corpus that grows nightly. Brute-force is O(N) per query and fine at 10 K chunks but stops scaling at the 50 K horizon. FAISS would mean a separate operational system on top of pgvector; we kept everything in pgvector for the one-database, one-backup story. |
| Token counter | tiktoken cl100k_base | Hugging Face transformers tokeniser; word count | Embedder is OpenAI; cl100k_base is OpenAI's tokeniser, so token counts are exact rather than approximate. Word count is 10–20 % off for Dutch compounds. The Hugging Face tokeniser is correct but heavier as a runtime dependency for what is just a chunk-size guard. |
The Embedding Model Selection Journey
Selecting the right embedding model for the ZOL system required careful evaluation against three criteria: Dutch language quality, dimensional efficiency, and operational cost. The journey passed through four earlier models (three initial attempts, then BGE-M3) before arriving at the current production model.
Attempt 1: OpenAI text-embedding-3-small
The initial implementation used OpenAI's text-embedding-3-small model:
- Dimensions: 1,536
- Quality: Excellent for English, good for Dutch
- Cost: $0.02 per million tokens (API-based)
- Privacy: All content sent to OpenAI's API
While the quality was acceptable, the reliance on an external API created two concerns: ongoing cost at scale (25,000 monthly queries, plus ingestion) and data egress (all hospital content leaving the infrastructure for embedding).
Attempt 2: mxbai-embed-large (Local)
To address the API dependency, the team evaluated mxbai-embed-large, a high-quality open-source model that could run locally via Ollama:
- Dimensions: 1,024
- Quality: Excellent for English
- Cost: Zero (local inference)
- Problem: No Dutch language support
Testing revealed that mxbai-embed-large produced poor embeddings for Dutch medical text. Semantically related Dutch phrases were not mapped to nearby vectors, making retrieval unreliable. This model was quickly eliminated.
Attempt 3: nomic-embed-text (Previously Selected)
The third model evaluated was nomic-embed-text, which satisfied the local-inference and multilingual criteria:
- Dimensions: 768
- Quality: Moderate multilingual support (~20 languages including Dutch)
- Context window: 8,192 tokens
- Cost: Zero (local inference via Ollama)
- Problem: No Dutch-specific benchmark score (MTEB-NL unavailable at time of selection)
nomic-embed-text served as the production model from initial deployment through February 2026 (documented in ADR-0005). While it performed adequately, the lack of a Dutch benchmark score made it difficult to assess true retrieval quality for Dutch medical content.
Previous: bge-m3 (Feb–Apr 2026)
For two months in early 2026 the production model was BAAI/bge-m3 at 1,024 dimensions, selected after the MTEB-NL benchmark became available (September 2025) and replacing nomic-embed-text. It served as the production embedding model from February 2026 (ADR-0033) until the migration to OpenAI in April 2026 (ADR-0048).
- Dimensions: 1,024
- Quality: Strong multilingual support (100+ languages), MTEB-NL retrieval score 60.0
- Context window: 8,192 tokens
- Cost: Zero (local inference via Ollama)
- Privacy: All processing on-premise
BGE-M3 still survives in the stack as the ColBERT reranker model (feature-flagged), since BGE-M3 natively supports the late-interaction multi-vector mode required by ColBERT. See Reranking & Evaluation and the academic basis in Khattab & Zaharia 2020.
Current: text-embedding-3-large (OpenAI, ADR-0048)
The current embedding model is OpenAI text-embedding-3-large:
- Dimensions: 1,536 (truncated from native 3,072 to fit pgvector's HNSW 2,000-dim limit)
- Quality: Strong multilingual support; MTEB-NL retrieval score ~64.6 (above BGE-M3's 60.0)
- Cost: $0.13 per million tokens (75% prompt-cache discount; ~$0.20/month at 25,000 monthly queries)
- Privacy: Content sent to OpenAI's API (compensated by stronger quality and operational simplicity per ADR-0048)
The migration rationale (ADR-0048): superior MTEB-NL Dutch retrieval, removal of the Ollama operational dependency, and predictable latency under load. See OpenAI 2024 for the canonical announcement.
The system has migrated through three embedding models: nomic-embed-text (768d) → BGE-M3 (1024d) → text-embedding-3-large (3072d native, truncated to 1536d for HNSW). The current model provides the highest retrieval quality for Dutch medical content.
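The truncation step can be illustrated with a minimal sketch. It assumes client-side truncation of a native 3,072-dim vector; the OpenAI embeddings endpoint also accepts a `dimensions` request parameter that returns shortened vectors directly, and the source does not say which route ZOL takes. The function name and the toy input are hypothetical.

```python
import math

def truncate_and_normalize(vector, dims=1536):
    """Keep the first `dims` components, then rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy stand-in for a native 3,072-dim embedding (values are arbitrary).
native = [math.sin(i + 1) for i in range(3072)]
stored = truncate_and_normalize(native)

print(len(stored))  # number of dimensions that would be written to pgvector
```

Re-normalising after truncation matters because a shortened vector is no longer unit length, and the dot-product shortcut for cosine similarity only holds for unit vectors.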
Model Comparison
| Model | Dims | MTEB-NL | Dutch | Provider | Cost | Status |
|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | N/A | Good | OpenAI API | $0.02/M tokens | Rejected (Attempt 1) |
| mxbai-embed-large | 1,024 | N/A | Poor | Ollama | Free | Rejected (Attempt 2) |
| nomic-embed-text | 768 | N/A | Moderate | Ollama | Free | Replaced Feb 2026 |
| bge-m3 | 1,024 | 60.0 | Strong | Ollama | Free | Replaced Apr 2026 (ADR-0048); now ColBERT-only |
| text-embedding-3-large (current) | 1,536 | ~64.6 | Strong | OpenAI API | $0.13/M tokens | Current (ADR-0048) |
Contextual Embeddings
The ZOL system implements Anthropic's Contextual Retrieval technique. Rather than embedding raw chunk text, each chunk is embedded as enriched text combining three components:
- Chunk context: An LLM-generated summary of the surrounding document context
- Canonical questions: LLM-generated questions that this chunk can answer
- Raw chunk text: The original content
This enrichment ensures that even isolated chunks retain document-level context in their embedding vectors. Anthropic's research reports that contextual embeddings reduce the top-20-chunk retrieval failure rate by 35% compared to naive embedding, and by 49% when combined with contextual BM25 keyword search.
The enrichment data is generated once during ingestion and stored in chunk_metadata. Both the ingestion pipeline and the re-embedding script use the same _build_enriched_text() function to ensure consistency.
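As an illustration of the enrichment step, the sketch below joins the three components into one embedding input. It is not the project's `_build_enriched_text()`: the function name, parameter names, component order, and separators are all assumptions made for the example.

```python
def build_enriched_text(chunk_text, chunk_context="", canonical_questions=()):
    """Join context, canonical questions, and raw text into one embedding input.

    Hypothetical layout: empty components are skipped so that a chunk without
    enrichment metadata still embeds as its raw text.
    """
    parts = []
    if chunk_context.strip():
        parts.append(chunk_context.strip())
    questions = [q.strip() for q in canonical_questions if q.strip()]
    if questions:
        parts.append("\n".join(questions))
    parts.append(chunk_text.strip())
    return "\n\n".join(parts)

enriched = build_enriched_text(
    chunk_text="Neem uw medicatie elke ochtend in.",
    chunk_context="Brochure over de behandeling van hartfalen.",
    canonical_questions=["Wanneer moet ik mijn medicatie innemen?"],
)
print(enriched)
```

Routing both ingestion and re-embedding through a single function of this kind is what keeps stored vectors comparable: any change to the layout changes every embedding, so it has to happen in one place.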
Vector Similarity Search
At query time, the user's question is embedded using the same text-embedding-3-large model, producing a 1,536-dimensional query vector. This vector is then compared against the stored document chunk vectors using cosine similarity:
$$ \text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$
Cosine similarity is the cosine of the angle between two vectors, ranging from -1 (opposite directions) to +1 (same direction), and it ignores vector length. For normalized vectors (which OpenAI's embedding API returns), both norms equal 1, so the expression reduces to the dot product, making computation efficient.
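The equivalence between cosine similarity and the dot product for unit vectors can be checked with a small sketch. The two-dimensional vectors are toy values chosen for easy arithmetic, not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Rescale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 4.0])   # [0.6, 0.8]
b = normalize([4.0, 3.0])   # [0.8, 0.6]
dot = sum(x * y for x, y in zip(a, b))

# For unit vectors the plain dot product and the full formula agree.
print(dot, cosine_similarity(a, b))
```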
Approximate Nearest Neighbors
With ~10,400 document chunks (May 2026 production corpus), an exact brute-force comparison against every vector is still feasible, but its cost grows linearly with the corpus and stops scaling toward the 50 K-chunk horizon. The system therefore uses an Approximate Nearest Neighbor (ANN) index that trades a small amount of accuracy for much faster search; see Johnson et al. 2017 for the academic basis of billion-scale ANN search and pgvector for our specific implementation.
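For reference, the exact baseline that an ANN index replaces can be sketched as a full scan. The sketch assumes unit-length vectors, so the dot product equals cosine similarity; the chunk ids and vectors are toy data.

```python
import heapq

def exact_top_k(query, chunks, k=3):
    """Score every stored vector against the query: O(N) work per query."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), chunk_id)
        for chunk_id, vec in chunks.items()
    )
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]

chunks = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top = exact_top_k([0.6, 0.8], chunks, k=2)
print(top)
```

Every query touches every stored vector here, which is why the cost tracks corpus size directly; an ANN index avoids visiting most of them.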
HNSW: How It Works
The Hierarchical Navigable Small World (HNSW) algorithm, proposed by Malkov and Yashunin (2018), builds a multi-layered graph where:
- The bottom layer contains all vectors, connected to their nearest neighbors
- Higher layers contain progressively fewer vectors, acting as "express lanes"
- Search starts at the top layer and navigates down, narrowing the search space at each level
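The top-down navigation described above can be sketched with a hand-built three-layer graph. This is a toy illustration of the search pattern only: it does not build the graph the way HNSW does, and real HNSW keeps a candidate list (controlled by `ef_search`) on the bottom layer rather than a single greedy walker. All names and the graph itself are hypothetical.

```python
from math import dist

def greedy_layer_search(points, neighbors, entry, query):
    """Move to the neighbour closest to the query until none improves."""
    current = entry
    while True:
        best = min(
            neighbors.get(current, []),
            key=lambda n: dist(points[n], query),
            default=current,
        )
        if dist(points[best], query) < dist(points[current], query):
            current = best
        else:
            return current

def layered_search(points, layers, entry, query):
    """layers[0] is the sparsest (top) layer; layers[-1] holds every node."""
    current = entry
    for neighbors in layers:
        # The best node found on one layer is the entry point for the next.
        current = greedy_layer_search(points, neighbors, current, query)
    return current

# Eight points on a line; higher layers link fewer, more distant nodes.
points = {i: (float(i), 0.0) for i in range(8)}
top = {0: [4], 4: [0]}
middle = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}
bottom = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

nearest = layered_search(points, [top, middle, bottom], entry=0, query=(5.2, 0.0))
print(nearest)
```

In this example the walker jumps from node 0 to node 4 on the top layer, to node 6 on the middle layer, and settles on node 5 on the bottom layer, visiting far fewer nodes than a walk along the bottom layer alone.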
Why HNSW over IVFFlat?
pgvector supports two index types. HNSW was selected because ZOL's content is continuously updated -- new brochures are published, website content changes, and doctors join or leave:
| Property | HNSW | IVFFlat |
|---|---|---|
| Dynamic inserts | Graceful (no reindex) | Degrades without reindex |
| Build time | Slower initial build | Faster initial build |
| Query accuracy | Higher (recall ~99%) | Lower without tuning |
| Memory | Higher | Lower |
For a system where content freshness directly impacts user experience, HNSW's ability to maintain search quality without operational maintenance (reindexing) was the deciding factor.
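A minimal sketch of what the index and query could look like in pgvector follows. The table name `chunks`, the columns `id` and `embedding`, and the index name are assumptions; `hnsw`, `vector_cosine_ops`, and the `<=>` cosine-distance operator are pgvector's own, and the `m` and `ef_construction` values shown are pgvector's defaults rather than ZOL's confirmed settings. The SQL is held in Python strings so it can be passed to a driver with named parameters.

```python
# Hypothetical schema: chunks(id, embedding vector(1536)).
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

# `<=>` returns cosine distance, so similarity is 1 minus that value.
SEARCH_SQL = """
SELECT id, 1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def to_vector_literal(values):
    """Format a Python sequence as pgvector's text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

params = {"query": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
print(params["query"])
```

Because the `ORDER BY` uses the same operator the index was built for, the planner can answer the query from the HNSW graph, and newly inserted rows are added to the graph without a rebuild.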
Token Counting Approximation
The chunking service uses Tiktoken's cl100k_base tokenizer to count tokens for chunk size control. With text-embedding-3-large (also an OpenAI model), cl100k_base is the same tokenizer the embedder uses internally, so token counts are exact. Historically, when the embedder was BGE-M3 (Feb–Apr 2026), the cl100k_base count was approximate (~10-15% variance) but still acceptable since chunk size targets (350 tokens) and maximums (450 tokens) have sufficient margin to absorb the variance. The Ollama fallback path (e.g. llama3.2:3b) retains that approximation behavior.
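The margin argument can be made concrete with a short sketch. The target (350) and maximum (450) come from the text above and the 15% figure is the upper end of the stated variance; the function and constant names are hypothetical, and `count_tokens` needs the `tiktoken` package installed to run.

```python
TARGET_TOKENS = 350
MAX_TOKENS = 450

def worst_case_tokens(counted, variance):
    """Upper bound on the true count if the counter may be off by `variance`."""
    return counted * (1 + variance)

def count_tokens(text):
    """Count cl100k_base tokens; exact for OpenAI embedders that use it."""
    import tiktoken  # imported lazily so the arithmetic below runs without it
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

# A chunk counted at the 350-token target, with the counter off by 15%.
headroom = MAX_TOKENS - worst_case_tokens(TARGET_TOKENS, 0.15)
print(round(headroom, 1))
```

A chunk sized to the target therefore stays under the maximum even at the pessimistic end of the variance, which is why an approximate count was tolerable for the non-OpenAI embedder.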
Embedding Pipeline Summary
References
- Anthropic. (2024). Introducing contextual retrieval. https://www.anthropic.com/news/contextual-retrieval
- Banar, N., & Lotfi, E. (2025). MTEB-NL and E5-NL: Embedding benchmark and models for Dutch. arXiv preprint, arXiv:2509.12340. https://arxiv.org/abs/2509.12340
- Chen, J., et al. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint, arXiv:2402.03216. https://arxiv.org/abs/2402.03216
- Karpukhin, V., et al. (2020). Dense passage retrieval for open-domain question answering. Proceedings of EMNLP 2020, 6769--6781. https://arxiv.org/abs/2004.04906
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824--836. https://doi.org/10.1109/TPAMI.2018.2889473
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. Proceedings of EMNLP 2019, 3982--3992. https://arxiv.org/abs/1908.10084