ADR-0001: Text Chunking Strategy

Status: Accepted

Embedding-model context

The discussion of tokenizer compatibility below references both cl100k_base (OpenAI) and BGE-M3's BERT-style tokenizer. The chunking decision itself is unchanged, but for orientation: production has used three embedding models in sequence (nomic-embed-text → BGE-M3 → text-embedding-3-large). The current model is OpenAI text-embedding-3-large (1536 dim, hosted) per ADR-0048. Its tokenizer is cl100k_base, the same encoding used for chunk sizing, so the chunk-size targets apply to it without approximation.

Context

The RAG pipeline (Lewis et al., 2020) requires documents to be split into chunks suitable for embedding. The chunking strategy directly impacts retrieval quality:

Too large: Chunks contain mixed topics, diluting the semantic signal. A chunk about both knee surgery preparation and post-operative rehabilitation will match queries about either topic with mediocre relevance for both.
Too small: Chunks lose contextual coherence. A chunk containing only "Neem uw identiteitskaart mee" (Bring your identity card) lacks the context to indicate this relates to knee surgery preparation.
Boundary-unaware: Splitting mid-paragraph or mid-section creates chunks that begin or end in semantically incomplete states.

The hospital content is predominantly structured Markdown (from web pages) and structured PDFs (brochures with headings and sections).

Decision

Implement a Tiktoken-based, Markdown-aware text chunking strategy with the following configuration:

Parameter	Value
Target chunk size	350 tokens
Maximum chunk size	450 tokens
Overlap	70 tokens
Tokenizer	Tiktoken (cl100k_base)
Split awareness	Markdown heading hierarchy
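
As an illustration, the parameters in the table could be captured in a small settings object. This is a minimal sketch, not the project's actual configuration code: the class name ChunkingConfig, its field names, and the validation rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical container for the parameters in the table above."""

    target_tokens: int = 350            # target chunk size
    max_tokens: int = 450               # maximum chunk size
    overlap_tokens: int = 70            # tokens repeated from the previous chunk
    encoding_name: str = "cl100k_base"  # Tiktoken encoding used for counting
    split_on: str = "markdown_headings"

    def __post_init__(self) -> None:
        # Assumed sanity rule: overlap < target <= max.
        if not 0 <= self.overlap_tokens < self.target_tokens <= self.max_tokens:
            raise ValueError("expected 0 <= overlap < target <= max")


config = ChunkingConfig()
```

Keeping the values in one validated object is one way to make tuning possible without touching the splitting logic.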

Splitting Algorithm

First pass: Split at Markdown H1 boundaries
Second pass: If any section exceeds 450 tokens, split at H2 boundaries
Third pass: If still oversized, split at paragraph boundaries
Overlap: Each chunk includes the last 70 tokens of the previous chunk for context continuity
Metadata: Each chunk inherits the heading hierarchy from its position in the document
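
The passes above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the production implementation: all function names are hypothetical, token counting uses a whitespace stand-in so the example runs without dependencies (with Tiktoken installed, the count would be len(tiktoken.get_encoding("cl100k_base").encode(text))), and the third pass is assumed to pack paragraphs greedily up to the 350-token target.

```python
import re
from typing import Dict, List, Optional

TARGET_TOKENS = 350   # target chunk size
MAX_TOKENS = 450      # maximum chunk size
OVERLAP_TOKENS = 70   # overlap taken from the previous chunk


def count_tokens(text: str) -> int:
    # Stand-in (assumption): whitespace-separated words instead of cl100k_base tokens.
    return len(text.split())


def tail_tokens(text: str, n: int) -> str:
    # Stand-in (assumption) for decoding the last n cl100k_base tokens.
    return " ".join(text.split()[-n:])


def split_at_heading(text: str, level: int) -> List[str]:
    # Split before each heading of exactly this level ("# " or "## ").
    marker = "#" * level
    parts = re.split(rf"(?m)^(?={marker} )", text)
    return [p.strip() for p in parts if p.strip()]


def heading_of(section: str, level: int) -> Optional[str]:
    # Return the heading title if the section starts with a heading of this level.
    first_line = section.splitlines()[0]
    prefix = "#" * level + " "
    return first_line[len(prefix):].strip() if first_line.startswith(prefix) else None


def pack_paragraphs(text: str) -> List[str]:
    # Merge paragraphs until the next one would push the chunk past the target.
    # A single paragraph longer than MAX_TOKENS is left as is in this sketch.
    chunks, current = [], ""
    for para in filter(None, (p.strip() for p in re.split(r"\n\s*\n", text))):
        candidate = f"{current}\n\n{para}" if current else para
        if current and count_tokens(candidate) > TARGET_TOKENS:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_markdown(text: str) -> List[Dict]:
    pieces = []  # (chunk body, heading path) pairs
    for h1 in split_at_heading(text, 1):                   # first pass: H1
        h1_title = heading_of(h1, 1)
        if count_tokens(h1) <= MAX_TOKENS:
            pieces.append((h1, [h1_title]))
            continue
        for h2 in split_at_heading(h1, 2):                 # second pass: H2
            path = [h1_title, heading_of(h2, 2)]
            fits = count_tokens(h2) <= MAX_TOKENS
            parts = [h2] if fits else pack_paragraphs(h2)  # third pass: paragraphs
            pieces.extend((part, path) for part in parts)

    chunks = []
    for i, (body, path) in enumerate(pieces):
        # Prepend the tail of the previous chunk body for context continuity.
        overlap = tail_tokens(pieces[i - 1][0], OVERLAP_TOKENS) if i > 0 else ""
        chunks.append({
            "text": f"{overlap}\n\n{body}" if overlap else body,
            "headings": [h for h in path if h],
        })
    return chunks
```

The sketch ignores edge cases such as heading-only sections and single paragraphs above the maximum, and it adds the overlap on top of the chunk body rather than counting it against the 450-token limit. Whether the real pipeline does the same is not stated here.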

Why Tiktoken?

Ideally, the tokenizer matches the embedding model's tokenization so that chunk sizes can be controlled accurately. Tiktoken's cl100k_base encoding is compatible with the OpenAI-family tokenization used during the initial development phase. When the embedding model changed to nomic-embed-text (ADR-0005) and later to BGE-M3 (ADR-0033), Tiktoken remained sufficiently accurate for chunk size estimation.

Tiktoken as BERT Approximation

The BGE-M3 model uses a BERT-style tokenizer, not the cl100k_base tokenizer from OpenAI. In practice, cl100k_base token counts differ from the equivalent BERT-style token counts by roughly 10-15%. Accepting this gap is a deliberate trade-off: adding a BERT tokenizer dependency solely for token counting would increase complexity without meaningful improvement in chunk quality. The existing chunk size target (350 tokens) and maximum (450 tokens) include sufficient headroom to absorb this variance.
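
To make the variance concrete, the following sketch computes the range of BERT-style token counts implied by a given cl100k_base count. It assumes the "~10-15%" figure means the BERT-style count lies within plus or minus 15% of the cl100k_base count. The function name and the symmetric interpretation are assumptions for illustration.

```python
from typing import Tuple


def bert_token_range(cl100k_count: int, variance_pct: int = 15) -> Tuple[int, int]:
    """Return (low, high) bounds for the BERT-style count, assuming +/- variance_pct."""
    # Integer arithmetic avoids floating-point rounding surprises.
    low = cl100k_count * (100 - variance_pct) // 100       # rounded down
    high = -(-cl100k_count * (100 + variance_pct) // 100)  # rounded up
    return low, high


target_range = bert_token_range(350)   # (297, 403)
maximum_range = bert_token_range(450)  # (382, 518)
```

Under this assumption, a chunk measured at the 450-token maximum could be about 518 tokens for a BERT-style tokenizer in the worst case.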

Consequences

Positive

Markdown-aware splitting preserves the topical coherence of hospital content
The 350-token target produces chunks large enough for context but small enough for precise retrieval
70-token overlap prevents information loss at chunk boundaries
Configurable parameters enable tuning without code changes

Negative

Markdown parsing adds complexity to the ingestion pipeline
Non-Markdown content (plain text PDFs) falls back to paragraph-only splitting
The 70-token overlap increases storage by approximately 20% (70 overlap tokens per 350-token chunk)
Chunk size is approximate (the actual token count of a chunk varies with its content)
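
The storage figure can be checked with a short calculation. This sketch assumes every chunk body is exactly the 350-token target and that the overlap is stored on top of the body for every chunk except the first. The function name is hypothetical.

```python
from fractions import Fraction


def overlap_overhead(n_chunks: int, target: int = 350, overlap: int = 70) -> Fraction:
    """Extra stored tokens relative to the non-overlapping total (assumed model)."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Every chunk except the first carries `overlap` extra tokens.
    return Fraction(overlap * (n_chunks - 1), target * n_chunks)


single = overlap_overhead(1)      # 0: a lone chunk has no overlap
ten = overlap_overhead(10)        # 9/50, i.e. 18%
hundred = overlap_overhead(100)   # 99/500, i.e. 19.8%
```

The overhead approaches 70 / 350 = 20% as documents grow, so short documents incur somewhat less than the stated figure.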
