ADR-0001: Text Chunking Strategy
Status: Accepted
The tokenizer discussion below refers to both cl100k_base (OpenAI) and BGE-M3's BERT-style tokenizer. The chunking decision itself is unchanged. For orientation, production has used three embedding models in sequence (nomic-embed-text → BGE-M3 → text-embedding-3-large). The current model is OpenAI text-embedding-3-large (1536 dim, hosted) per ADR-0048. It is tokenized with cl100k_base, the same encoding used for chunk sizing, so the token counts behind the chunk-size targets are exact for the current model rather than estimates.
Context
The RAG pipeline (Lewis et al., 2020) requires documents to be split into chunks suitable for embedding. The chunking strategy directly impacts retrieval quality:
- Too large: Chunks contain mixed topics, diluting the semantic signal. A chunk about both knee surgery preparation and post-operative rehabilitation will match queries about either topic with mediocre relevance for both.
- Too small: Chunks lose contextual coherence. A chunk containing only "Neem uw identiteitskaart mee" (Bring your identity card) lacks the context to indicate this relates to knee surgery preparation.
- Boundary-unaware: Splitting mid-paragraph or mid-section creates chunks that begin or end in semantically incomplete states.
The hospital content is predominantly structured Markdown (from web pages) and structured PDFs (brochures with headings and sections).
Decision
Implement a Tiktoken-based, Markdown-aware text chunking strategy with the following configuration:
| Parameter | Value |
|---|---|
| Target chunk size | 350 tokens |
| Maximum chunk size | 450 tokens |
| Overlap | 70 tokens |
| Tokenizer | Tiktoken (cl100k_base) |
| Split awareness | Markdown heading hierarchy |
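As an illustration only, the parameters above could be held in a small validated configuration object. The class name `ChunkingConfig`, its field names, and the validation rules are assumptions made for this sketch, not the project's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical container for the parameters in the table above."""

    target_tokens: int = 350
    max_tokens: int = 450
    overlap_tokens: int = 70
    encoding_name: str = "cl100k_base"
    split_on_headings: bool = True

    def __post_init__(self) -> None:
        # Assumed sanity rules: the target may not exceed the maximum, and the
        # overlap must be smaller than the target so each chunk adds new text.
        if not 0 < self.target_tokens <= self.max_tokens:
            raise ValueError("target_tokens must be positive and <= max_tokens")
        if not 0 <= self.overlap_tokens < self.target_tokens:
            raise ValueError("overlap_tokens must be >= 0 and < target_tokens")


DEFAULT_CONFIG = ChunkingConfig()
```

Rejecting inconsistent values at construction time means a tuning mistake (for example a target above the maximum) fails at startup rather than producing malformed chunks during ingestion.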
Splitting Algorithm
- First pass: Split at Markdown H1 boundaries
- Second pass: If any section exceeds 450 tokens, split at H2 boundaries
- Third pass: If still oversized, split at paragraph boundaries
- Overlap: Each chunk includes the last 70 tokens of the previous chunk for context continuity
- Metadata: Each chunk inherits the heading hierarchy from its position in the document
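The passes above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: all names (`WhitespaceTokenizer`, `Chunk`, `split_on_heading`, `pack_paragraphs`, `chunk_markdown`) are hypothetical, the whitespace tokenizer is a stand-in so the example runs without dependencies, and two behaviours are assumptions the ADR does not specify: paragraphs are greedily regrouped up to the target size, and the overlap is added on top of each piece (so a chunk can reach the maximum plus 70 tokens).

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6}) +(.*)$", re.MULTILINE)


class WhitespaceTokenizer:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word.

    It mirrors the encode/decode shape of a tiktoken Encoding, so the object
    returned by tiktoken.get_encoding("cl100k_base") could be passed instead.
    """

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


@dataclass
class Chunk:
    text: str
    headings: List[str] = field(default_factory=list)


def split_on_heading(text: str, level: int) -> List[str]:
    """Split before each heading with exactly `level` leading '#' characters."""
    parts = re.split(rf"^(?=#{{{level}}} )", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def pack_paragraphs(text: str, target: int, tok) -> List[str]:
    """Third pass: split at blank lines, then greedily regroup up to `target`.

    The regrouping is an assumption; a single paragraph longer than the
    maximum is left whole.
    """
    pieces: List[str] = []
    current: List[str] = []
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        if current and len(tok.encode("\n\n".join(current + [para]))) > target:
            pieces.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces


def chunk_markdown(text, tok, target=350, max_tokens=450, overlap=70) -> List[Chunk]:
    # Passes 1-3: descend to a finer boundary only while a piece is oversized.
    pieces: List[str] = []
    for h1 in split_on_heading(text, 1):
        if len(tok.encode(h1)) <= max_tokens:
            pieces.append(h1)
            continue
        for h2 in split_on_heading(h1, 2):
            if len(tok.encode(h2)) <= max_tokens:
                pieces.append(h2)
            else:
                pieces.extend(pack_paragraphs(h2, target, tok))

    chunks: List[Chunk] = []
    path: Dict[int, str] = {}  # heading level -> most recent title at that level
    previous = ""
    for piece in pieces:
        snapshot = None
        for match in HEADING.finditer(piece):
            level = len(match.group(1))
            path = {lvl: title for lvl, title in path.items() if lvl < level}
            path[level] = match.group(2).strip()
            if snapshot is None:  # hierarchy as of the piece's first heading
                snapshot = [path[lvl] for lvl in sorted(path)]
        if snapshot is None:  # no heading in this piece: inherit the running path
            snapshot = [path[lvl] for lvl in sorted(path)]
        body = piece
        if overlap > 0 and previous:
            # Prepend the last `overlap` tokens of the previous (un-overlapped) piece.
            tail = tok.decode(tok.encode(previous)[-overlap:])
            body = tail + "\n\n" + piece
        chunks.append(Chunk(text=body, headings=snapshot))
        previous = piece
    return chunks
```

Calling `chunk_markdown(document, WhitespaceTokenizer())` returns the chunks in document order. The sketch treats any line starting with `#` characters as a heading, so a `#` comment inside a fenced code block would be misread as a section boundary; a real implementation would need to handle that case.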
Why Tiktoken?
Chunk sizes can only be controlled exactly when the tokenizer used for counting matches the embedding model's own tokenizer. Tiktoken's cl100k_base encoding matches the OpenAI-family tokenization used during the initial development phase. When the embedding model changed to nomic-embed-text (ADR-0005) and later to BGE-M3 (ADR-0033), the match was no longer exact, but Tiktoken remained sufficiently accurate for chunk size estimation.
The BGE-M3 model uses a BERT-style tokenizer, not OpenAI's cl100k_base tokenizer. In practice, cl100k_base token counts fall within ~10-15% of the equivalent BERT tokenization. This is a deliberate trade-off: adding a BERT tokenizer dependency solely for token counting would increase complexity without a meaningful improvement in chunk quality. The existing chunk size target (350 tokens) and maximum (450 tokens) include sufficient headroom to absorb this variance.
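To make the headroom argument concrete, the following sketch computes how large a chunk could be in the embedding model's own tokens. It assumes the worst case, in which cl100k_base undercounts by the full 10-15%; the function name `worst_case_tokens` is hypothetical.

```python
import math


def worst_case_tokens(measured_tokens: int, variance_percent: int) -> int:
    """Upper bound on the model's token count if the estimate is low by variance_percent."""
    return math.ceil(measured_tokens * (100 + variance_percent) / 100)


for label, measured in (("target", 350), ("maximum", 450)):
    low = worst_case_tokens(measured, 10)
    high = worst_case_tokens(measured, 15)
    print(f"{label}: {measured} cl100k_base tokens -> up to {low}-{high} model tokens")
```

Under that assumption, a chunk at the 350-token target corresponds to at most 385-403 model tokens, and a chunk at the 450-token maximum to at most 495-518.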
Consequences
Positive
- Markdown-aware splitting preserves the topical coherence of hospital content
- The 350-token target produces chunks large enough for context but small enough for precise retrieval
- 70-token overlap reduces the risk of losing context when a sentence or idea straddles a chunk boundary
- Configurable parameters enable tuning without code changes
Negative
- Markdown parsing adds complexity to the ingestion pipeline
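- Non-Markdown content (plain-text PDFs) falls back to paragraph-only splitting
- The 70-token overlap increases storage by approximately 20% (70 duplicated tokens per 350-token chunk)
- Chunk sizes are approximate: splits follow structural boundaries, and token counts vary with the content and, for non-OpenAI embedding models, with the tokenizer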
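The storage figure can be checked with a short calculation. The sketch below assumes the 70 overlap tokens are added on top of 350 tokens of new content, which reproduces the 20% figure; if the 350-token target instead included the overlap, each chunk would carry only 280 new tokens and the overhead would be 25%. The function name `overlap_overhead` is hypothetical.

```python
def overlap_overhead(overlap_tokens: int, new_tokens_per_chunk: int) -> float:
    """Fraction of extra tokens stored because of the overlap."""
    return overlap_tokens / new_tokens_per_chunk


# Overlap added on top of a 350-token chunk of new content (assumed reading).
on_top = overlap_overhead(70, 350)

# Alternative reading: the 350-token target already includes the overlap.
included = overlap_overhead(70, 350 - 70)

print(f"overlap on top of target: {on_top:.0%}")
print(f"overlap inside target:    {included:.0%}")
```

The first chunk of each document has no predecessor and therefore no overlap, so the real overhead is slightly below either figure.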