ADR-0001: Text Chunking Strategy

Status: Accepted

Embedding-model context

The discussion of tokenizer compatibility below references both cl100k_base (OpenAI) and BGE-M3's BERT-style tokenizer. The chunking decision itself is unchanged, but for orientation: production has migrated through three embedding models (nomic-embed-text → BGE-M3 → text-embedding-3-large). The current model is OpenAI text-embedding-3-large (1536 dim, hosted) per ADR-0048; its tokenizer is OpenAI's cl100k_base, so the chunk-size targets fit it cleanly.

Context

The RAG pipeline (Lewis et al., 2020) requires documents to be split into chunks suitable for embedding. The chunking strategy directly impacts retrieval quality:

  • Too large: Chunks contain mixed topics, diluting the semantic signal. A chunk about both knee surgery preparation and post-operative rehabilitation will match queries about either topic with mediocre relevance for both.
  • Too small: Chunks lose contextual coherence. A chunk containing only "Neem uw identiteitskaart mee" (Bring your identity card) lacks the context to indicate this relates to knee surgery preparation.
  • Boundary-unaware: Splitting mid-paragraph or mid-section creates chunks that begin or end in semantically incomplete states.

The hospital content is predominantly structured Markdown (from web pages) and structured PDFs (brochures with headings and sections).

Decision

Implement a Tiktoken-based, Markdown-aware text chunking strategy with the following configuration:

Parameter            Value
Target chunk size    350 tokens
Maximum chunk size   450 tokens
Overlap              70 tokens
Tokenizer            Tiktoken (cl100k_base)
Split awareness      Markdown heading hierarchy
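
As an illustration of how these parameters could be kept tunable without code changes, here is a minimal sketch that loads them from environment variables. The class name, field names, and the CHUNK_* variable names are assumptions made for this example and are not taken from the pipeline itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ChunkingConfig:
    """Chunking parameters from the table above (field names are illustrative)."""
    target_tokens: int = 350
    max_tokens: int = 450
    overlap_tokens: int = 70
    encoding_name: str = "cl100k_base"

    def __post_init__(self) -> None:
        # Reject combinations that cannot produce sensible chunks.
        if not 0 <= self.overlap_tokens < self.target_tokens <= self.max_tokens:
            raise ValueError("expected 0 <= overlap < target <= max")


def config_from_env(env: Mapping[str, str] = os.environ) -> ChunkingConfig:
    """Build a config from hypothetical CHUNK_* variables, using the ADR defaults."""
    defaults = ChunkingConfig()
    return ChunkingConfig(
        target_tokens=int(env.get("CHUNK_TARGET_TOKENS", defaults.target_tokens)),
        max_tokens=int(env.get("CHUNK_MAX_TOKENS", defaults.max_tokens)),
        overlap_tokens=int(env.get("CHUNK_OVERLAP_TOKENS", defaults.overlap_tokens)),
        encoding_name=env.get("CHUNK_ENCODING", defaults.encoding_name),
    )
```

Validating at construction time means a mistyped value (for example an overlap larger than the target) fails at startup rather than producing malformed chunks during ingestion.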

Splitting Algorithm

  1. First pass: Split at Markdown H1 boundaries
  2. Second pass: If any section exceeds 450 tokens, split at H2 boundaries
  3. Third pass: If still oversized, split at paragraph boundaries
  4. Overlap: Each chunk includes the last 70 tokens of the previous chunk for context continuity
  5. Metadata: Each chunk inherits the heading hierarchy from its position in the document
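
The five steps above can be sketched roughly as follows. This is a simplified illustration, not the pipeline's actual code: all function names are hypothetical, the WhitespaceTokenizer is a stand-in so the example runs without third-party packages, the paragraph pass is assumed to pack paragraphs greedily up to the target size, and the overlap is assumed to be added on top of the chunk body. Headings inside fenced code blocks and single paragraphs larger than the maximum are not handled.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6}) (.+)$")


class WhitespaceTokenizer:
    """Stand-in tokenizer (one token per word); any object offering
    encode(text) and decode(tokens) can be passed instead."""

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, tokens: List[str]) -> str:
        return " ".join(tokens)


def split_sections(text: str, level: int) -> List[str]:
    """Split before every heading of exactly this level (level 1 -> '# ')."""
    pattern = re.compile(r"^(?=" + "#" * level + r" )", re.MULTILINE)
    return [part.strip() for part in pattern.split(text) if part.strip()]


def pack_paragraphs(text: str, target: int, tok) -> List[str]:
    """Pass 3: split at blank lines, packing paragraphs up to the target."""
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", text.strip()):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(tok.encode(candidate)) > target:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_document(text: str, tok, target: int, maximum: int, level: int = 1) -> List[str]:
    """Passes 1-3: H1 sections, then H2 and paragraphs for oversized ones."""
    if level > 2:
        return pack_paragraphs(text, target, tok)
    pieces = []
    for section in split_sections(text, level):
        if len(tok.encode(section)) <= maximum:
            pieces.append(section)
        else:
            pieces.extend(split_document(section, tok, target, maximum, level + 1))
    return pieces


def chunk_markdown(text: str, tok, target: int = 350, maximum: int = 450,
                   overlap: int = 70) -> List[Dict]:
    """Steps 4-5: prepend the previous chunk's tail and attach heading paths."""
    pieces = split_document(text, tok, target, maximum)
    chunks, stack = [], []  # stack holds (level, title) of open headings
    for index, piece in enumerate(pieces):
        for line_no, line in enumerate(piece.splitlines()):
            match = HEADING.match(line)
            if match:
                level = len(match.group(1))
                while stack and stack[-1][0] >= level:
                    stack.pop()
                stack.append((level, match.group(2).strip()))
            if line_no == 0:
                # Hierarchy in effect where this chunk starts.
                headings = [title for _, title in stack]
        tail = tok.decode(tok.encode(pieces[index - 1])[-overlap:]) if index else ""
        chunks.append({"text": f"{tail}\n\n{piece}" if tail else piece,
                       "headings": headings})
    return chunks
```

Calling `chunk_markdown(markdown_text, WhitespaceTokenizer())` returns a list of dictionaries with `text` and `headings` keys. A Tiktoken encoding object exposes `encode` and `decode` methods of the same shape, so it could be passed in place of the stand-in.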

Why Tiktoken?

The tokenizer should match the embedding model's tokenization as closely as practical so that chunk sizes can be controlled accurately. Tiktoken's cl100k_base encoding matches the OpenAI-family tokenization used during the initial development phase. When the embedding model changed to nomic-embed-text (ADR-0005) and later to BGE-M3 (ADR-0033), Tiktoken was no longer an exact match but remained sufficiently accurate for chunk-size estimation.
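
For orientation, counting tokens with Tiktoken might look like the sketch below. The helper names `tiktoken_counter` and `fits` are hypothetical; `tiktoken.get_encoding` and `Encoding.encode` are the library calls assumed here, and the import is deferred so the module loads even where the package is not installed.

```python
from typing import Callable


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return a function that counts tokens with a Tiktoken encoding."""
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() encodes strings such as "<|endoftext|>" as plain
    # text instead of raising, which suits arbitrary document content.
    return lambda text: len(encoding.encode(text, disallowed_special=()))


def fits(text: str, count: Callable[[str], int], maximum: int = 450) -> bool:
    """True when the text is within the maximum chunk size."""
    return count(text) <= maximum
```

Typical use would be `count = tiktoken_counter()` once at startup, then `fits(section_text, count)` for each candidate section. Passing the counter as a plain callable keeps the size check independent of which tokenizer is in use.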

Tiktoken as BERT Approximation

The BGE-M3 model uses a BERT-style tokenizer, not OpenAI's cl100k_base tokenizer. In practice, cl100k_base token counts fall within ~10-15% of the equivalent BERT token counts. This is a deliberate trade-off: adding a BERT tokenizer dependency solely for token counting would increase complexity without a meaningful improvement in chunk quality. The existing chunk-size target (350 tokens) and maximum (450 tokens) include sufficient headroom to absorb this variance.
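
To make the headroom concrete, the short calculation below shows the range a BERT-style count could take, assuming the ~15% figure is a symmetric bound relative to the cl100k_base count. That interpretation and the function name are assumptions for illustration.

```python
from typing import Tuple


def bert_token_range(cl100k_tokens: int, variance_pct: int = 15) -> Tuple[int, int]:
    """Lowest and highest BERT-style count if it deviates by at most
    variance_pct percent from the cl100k_base count (integer arithmetic)."""
    low = cl100k_tokens * (100 - variance_pct) // 100          # rounded down
    high = -(-cl100k_tokens * (100 + variance_pct) // 100)     # rounded up
    return low, high


print(bert_token_range(350))  # (297, 403)
print(bert_token_range(450))  # (382, 518)
```

Under this reading, a chunk at the 350-token target stays below the 450-token maximum even at the upper bound, while a chunk at the maximum could reach roughly 518 BERT-style tokens. The latter only matters if the embedding model's input limit is close to that figure.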

Consequences

Positive

  • Markdown-aware splitting preserves the topical coherence of hospital content
  • The 350-token target produces chunks large enough for context but small enough for precise retrieval
  • The 70-token overlap reduces information loss at chunk boundaries
  • Configurable parameters enable tuning without code changes

Negative

  • Markdown parsing adds complexity to the ingestion pipeline
  • Non-Markdown content (plain text PDFs) falls back to paragraph-only splitting
  • The 70-token overlap increases storage by approximately 20% (70 overlap tokens per 350-token chunk)
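  • Chunk size is approximate: the token count of each chunk varies with the content, because splits follow heading and paragraph boundaries rather than exact token counts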
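
The storage figure above can be checked with a small calculation. Whether the 70 overlap tokens are added on top of the 350-token body or counted inside it is not stated, so both readings are shown; the function names are illustrative.

```python
def overhead_added_on_top(target: int = 350, overlap: int = 70) -> float:
    """Overlap prepended to a full-size body: extra tokens / body tokens."""
    return overlap / target


def overhead_counted_inside(target: int = 350, overlap: int = 70) -> float:
    """Overlap counted within the target: extra tokens / new tokens per chunk."""
    return overlap / (target - overlap)


print(f"{overhead_added_on_top():.0%}")    # 20%
print(f"{overhead_counted_inside():.0%}")  # 25%
```

The approximately 20% figure corresponds to the first reading, in which a target-size chunk plus its overlap comes to 420 tokens and still fits under the 450-token maximum.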

Trade-off Visualization