LLM Stack

The ZOL Intelligent Search employs a multi-model strategy where each task is assigned to the most cost-effective model capable of performing it at the required quality level. Within the RAG architecture (Lewis et al., 2020), this approach -- sometimes called "model routing" or "tiered inference" -- ensures that expensive, high-capability models are reserved for tasks that genuinely require them.

Structured-output reliability across the stack is enforced by the structured_call helper (app.llm.structured, ~190 LOC over the raw AsyncOpenAI client). It wraps eight call sites, including intent classification and query decomposition, and enforces JSON-schema validation with retries before raising a typed StructuredCallError. This replaced the older failure mode in which the first malformed output failed silently. (A Pydantic AI Agent pattern was trialed across these sites on 2026-05-09 but removed 2026-05-12, commit b8d8da67, after telemetry showed it added ~720 ms per call; see Decision-Cost Rubric.)
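The validate-and-retry behaviour described above can be sketched as follows. This is a minimal illustration, not the actual app.llm.structured code: the signature, the call_model callable, and the required_keys check (a stand-in for full JSON-schema validation) are all assumptions.

```python
import asyncio
import json
from typing import Awaitable, Callable


class StructuredCallError(Exception):
    """Raised when no attempt produced output of the expected shape."""


async def structured_call(
    call_model: Callable[[], Awaitable[str]],  # hypothetical: returns the raw model reply
    required_keys: set[str],                   # simplified stand-in for a JSON schema
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the JSON reply, and retry on malformed output."""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = await call_model()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not an object"
            continue
        missing = required_keys - set(data)
        if missing:
            last_error = f"missing keys: {sorted(missing)}"
            continue
        return data
    # Every attempt failed: surface a typed error instead of failing silently.
    raise StructuredCallError(
        f"gave up after {max_attempts} attempts ({last_error})"
    )
```

A caller would wrap its own model request in call_model and handle StructuredCallError explicitly, so a malformed reply can no longer pass unnoticed.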

Model Allocation

Model Tier Mapping

The system uses a tier-based model architecture where each tier represents a capability level, not a specific model. Actual model identifiers are configured in backend/app/config.py and can change independently of the tier architecture.

| Tier | Role | Characteristics |
| --- | --- | --- |
| Tier 1 | Fast classification | Lowest cost, fastest inference |
| Tier 2 | Standard generation | Balanced cost/quality |
| Tier 3 | Flagship generation | High quality, higher cost |
| Escalation | Enhanced reasoning (Think Harder) | Used for escalated queries when users signal dissatisfaction |

Decoupled from Model Versions

This documentation uses tier references (Tier 1, Tier 2, etc.) rather than specific model names. When the underlying models are upgraded, only the configuration file needs to change -- not the documentation or architecture.
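A tier-to-model indirection of this kind might look like the sketch below. The enum, the mapping, and the model identifiers are placeholders invented for illustration; the real values are whatever backend/app/config.py defines.

```python
from enum import Enum


class Tier(str, Enum):
    FAST = "tier_1"
    STANDARD = "tier_2"
    FLAGSHIP = "tier_3"
    ESCALATION = "escalation"


# Placeholder identifiers only, not the models actually deployed.
TIER_MODELS: dict[Tier, str] = {
    Tier.FAST: "example-fast-model",
    Tier.STANDARD: "example-standard-model",
    Tier.FLAGSHIP: "example-flagship-model",
    Tier.ESCALATION: "example-reasoning-model",
}


def resolve_model(tier: Tier) -> str:
    """Look up the concrete model identifier currently assigned to a tier."""
    return TIER_MODELS[tier]
```

Call sites ask for a tier rather than a model name, so a model upgrade becomes a one-line change to the mapping.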

Model Profiles

Tier 2 (Standard) -- The Workhorse

Role: Fast, cheap tasks that require language understanding but not generation quality

The Tier 2 (standard) model handles all fast inference tasks where speed and cost matter more than generation sophistication:

| Task | Why Tier 2 | Latency | Cost Impact |
| --- | --- | --- | --- |
| Intent classification | Categorical decision, not creative generation; temperature=0.1 for near-deterministic output | ~400ms | ~$0.50/month |
| Query rewriting | Combined with intent classification in a single LLM call when conversation history exists | ~0ms extra | Included above |
| Background evaluation | DeepEval scoring via OpenAI; async so latency is irrelevant | ~2s | ~$0.30/month |

Tier 2 (Standard) -- Graph Entity Validation & Page Summaries

Role: Post-extraction quality gate for knowledge graph entities; page summary generation for contextual retrieval

After regex extraction identifies candidate entities, the Tier 2 model validates each entity and relationship through a single LLM call per page (temperature=0.1 for near-deterministic validation). This same call generates a 2-3 sentence Dutch page summary used for contextual retrieval — Anthropic's research demonstrates that prepending document context to chunks reduces retrieval failure rates by 49% when combined with hybrid search.

| Task | Why Tier 2 | Latency | Cost Impact |
| --- | --- | --- | --- |
| Entity validation | Rejects fake names ("Borstkas"), boilerplate hubs, wrong entity types; temp=0.1 | Async (ingestion) | ~$1-2/full run |
| Page summary generation | Dutch summary for contextual retrieval at query time | Same LLM call | Included above |
| DeepEval direct | Direct OpenAI model reference for DeepEval framework | Async | Included in eval |

A cross-page entity cache ensures that once an entity is rejected or renamed on one page, the same decision is applied instantly on all subsequent pages without additional LLM calls. This reduces total LLM calls by 10-25%. See ADR-0014 for the full architecture.
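The caching idea can be illustrated with a small sketch. It is simplified to one validation per entity, whereas the system described here batches validation per page; the class names, the cache key, and the validate callable are assumptions, not the ADR-0014 implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class EntityDecision:
    accepted: bool
    canonical_name: Optional[str] = None  # set when the validator renames the entity


@dataclass
class EntityDecisionCache:
    # Hypothetical stand-in for the LLM validation call.
    validate: Callable[[str, str], EntityDecision]
    _decisions: dict = field(default_factory=dict)
    llm_calls: int = 0

    def decide(self, entity_type: str, name: str) -> EntityDecision:
        """Reuse an earlier decision for this entity, or ask the validator once."""
        key = (entity_type, name.casefold())
        if key not in self._decisions:
            self.llm_calls += 1
            self._decisions[key] = self.validate(entity_type, name)
        return self._decisions[key]
```

Once "Borstkas" has been rejected as a doctor name on one page, later pages get the same answer from the dictionary and llm_calls does not increase.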

Tier 2 (Standard) -- Query-Time Entity Extraction (ADR-0030)

Role: Structured medical entity extraction from user queries for knowledge graph routing

The intent classification LLM call was extended to output structured medical entities alongside intent and rewritten query. This replaces the previous dictionary-gated graph routing with LLM semantic understanding -- the same LLM call that classifies intent now also extracts {condition, department, doctor, treatment, examination, campus, service} in a single inference pass.

| Task | Why Tier 2 | Latency | Cost Impact |
| --- | --- | --- | --- |
| Entity extraction from queries | Combined with intent classification; zero extra latency | ~0ms extra | Included in IC call |

See ADR-0030 for the rationale — dictionary gating required 16+ iterations of alias additions, was monolingual, and silently blocked valid queries.

Tier 2 (Standard) -- LLM-as-Judge Safety Validation

Role: Defense-in-depth post-generation safety check (enabled by default)

When enabled via safety_llm_validation_enabled, the Tier 2 model evaluates whether the generated response contains subtle medical advice that regex patterns cannot catch. This is a non-blocking, async validation — if the LLM judge call fails, the response passes through (the regex layer still catches critical patterns). Note that the separate guardrails check (Llama Guard 3) operates in fail-closed mode when enabled.
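The fail-open behaviour of the judge can be sketched like this. The function name, the judge callable, and the timeout are hypothetical. The sketch also awaits the verdict for clarity, whereas the validation described above runs without blocking the response.

```python
import asyncio
from typing import Awaitable, Callable


async def judge_or_pass(
    response_text: str,
    judge: Callable[[str], Awaitable[bool]],  # hypothetical: True means "safe"
    timeout_s: float = 2.0,                   # assumed value, not from the source
) -> bool:
    """Return True when the response may be shown to the user.

    Fail-open: any judge error or timeout lets the response through, on the
    assumption that the regex layer has already blocked critical patterns.
    """
    try:
        return await asyncio.wait_for(judge(response_text), timeout=timeout_s)
    except Exception:
        return True
```

A fail-closed check, such as the guardrails layer mentioned above, would return False in the except branch instead.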

| Task | Why Tier 2 | Latency | Cost Impact |
| --- | --- | --- | --- |
| LLM safety judge | Catches paraphrased medical advice that regex misses | ~500ms (async) | ~$0.20/month |

Togglable at runtime via the Settings API for demos.

Tier 2 / Tier 3 -- The Generator

Role: High-quality Dutch language response generation

In standard mode, the Tier 2 (standard) model handles response generation via OpenRouter. In full mode (rag_full_mode=True, the default), the system uses the same Tier 2 model but routes through the direct OpenAI API for lowest latency, enables always-on reranking, and increases max_tokens. Full mode does not upgrade to Tier 3; that tier is reserved exclusively for escalated (Think Harder) queries. See ADR-0024 for the feature flag design.

| Aspect | Standard Mode | Full Mode (default) |
| --- | --- | --- |
| Model | Tier 2 (via OpenRouter) | Tier 2 (via direct OpenAI) |
| Temperature | 0.1 | 0.1 (configurable via rag_full_mode_temperature) |
| Max tokens | 1,000 | 1,500 (configurable via rag_full_mode_max_tokens) |
| Reranking | Off | Always-on (Jina Reranker v2, 20 → top-15 candidates) |
| Latency | ~3 seconds (streamed) | ~4.5 seconds (reranking + generation) |
| Cost per query | ~$0.0003 | ~$0.0015 |

text-embedding-3-large -- The Embedder

Role: Convert text to 1,536-dimensional vectors for semantic search

Embeddings are produced by OpenAI text-embedding-3-large (1,536-dimensional dense vectors, truncated from the model's native 3,072 dimensions to fit pgvector's HNSW 2,000-dim limit). See ADR-0048 for the migration rationale, @openai2024embeddings for the model announcement, and @karpukhin2020dpr for the foundational dense bi-encoder retrieval pattern. The model handles:

  • Query embedding at search time
  • Document chunk embedding at ingestion time
  • Response and context embedding for quality gate evaluation
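Truncating a 3,072-dimensional vector to 1,536 dimensions amounts to keeping the leading components and rescaling to unit length, as in the sketch below. The helper is illustrative only. The OpenAI embeddings endpoint also accepts a dimensions parameter that shortens text-embedding-3 vectors server-side, and the source does not say which route this system takes.

```python
import math


def truncate_embedding(vector: list[float], dim: int = 1536) -> list[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim <= 0 or dim > len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}, got {dim}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalised
    return [x / norm for x in head]
```

Renormalising matters because cosine and inner-product comparisons assume unit-length vectors, and a truncated vector is generally shorter than the original.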

Migration history. The system has gone through three embedding models: nomic-embed-text (768-dim, Ollama) → BGE-M3 (1024-dim, Ollama, ADR-0033, Feb 2026) → text-embedding-3-large (1,536-dim, OpenAI, ADR-0048, April 2026). The migration to OpenAI traded the zero-cost Ollama local-inference path for stronger Dutch retrieval (MTEB-NL ~64.6 vs BGE-M3's 60.0) and the removal of an operational dependency. BGE-M3 still survives in the stack as the optional ColBERT reranker model — see Reranking & Evaluation. Cost: ~$0.13/1M tokens (75% prompt-cache discount), ~$0.20/month at 25,000 monthly queries.

Jina Reranker v2 -- The Reranker

Role: Deep semantic reranking for all queries in full mode

In full mode (default), the Jina Reranker v2 API reranks all queries — not just escalated searches. This was enabled by ADR-0024 after benchmarks showed that the quality improvement justifies the latency cost for the demo and production deployment. BGE-reranker-v2-m3 serves as a local fallback if the Jina API is unavailable.

| Aspect | Standard Mode | Full Mode (default) |
| --- | --- | --- |
| Task | Not used | Rerank 20 candidates → top 15 |
| When | Never | Every query |
| Latency | 0ms | ~500ms (API) / ~1.5s (local fallback) |
| Cost | Zero | ~$0.001/query (Jina API) |
| Quality | Cosine similarity ranking only | Cross-encoder deep relevance scoring |

In escalated (Think Harder) mode, the reranker processes 100 candidates instead of 20, providing even broader coverage.
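The rerank-then-cut step, including the switch to a local scorer when the API call fails, can be sketched as below. The Scorer signature and both scorer arguments are hypothetical stand-ins; no real Jina or BGE client API is shown.

```python
from typing import Callable, Sequence

# Hypothetical scorer: returns one relevance score per candidate, same order.
Scorer = Callable[[str, Sequence[str]], list[float]]


def rerank(
    query: str,
    candidates: Sequence[str],
    primary: Scorer,    # stand-in for the hosted reranker API
    fallback: Scorer,   # stand-in for the local reranker
    top_k: int = 15,
) -> list[str]:
    """Score all candidates and keep the top_k highest-scoring ones."""
    try:
        scores = primary(query, candidates)
    except Exception:
        # Hosted reranker unavailable: score locally instead.
        scores = fallback(query, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_k]]
```

Under this sketch, escalated mode would pass 100 candidates instead of 20 with no other change.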

Tier 1 (Fast) -- The Lightweight Specialist

Role: Ultra-fast, ultra-cheap structured JSON output for non-critical tasks

The Tier 1 (fast) model handles tasks that require minimal language understanding and produce short, structured output. It is specifically used for follow-up suggestion generation — producing 3 contextual follow-up questions as a JSON array after each response.

| Task | Why Tier 1 | Latency | Cost Impact |
| --- | --- | --- | --- |
| Follow-up suggestions | JSON array output, no reasoning needed; structured follow-ups derived from response text | ~200ms | ~$0.10/month |

Why a dedicated Tier 1 model was reintroduced

The Tier 1 model was previously removed from the pipeline due to poor performance on Dutch medical content (ADR-0014). It is now reintroduced exclusively for follow-up suggestion generation — a task that requires extracting topics from an already-generated response, not understanding medical terminology. The rag_followup_model config is separate from the main rag_llm_model, ensuring that the follow-up task never accidentally uses a reasoning model (which would break JSON parsing due to hidden thinking tokens).

Regex -- Deterministic Pattern Matching

Multiple tasks in the pipeline use compiled regex patterns instead of LLM calls:

| Task | Why Regex, Not LLM |
| --- | --- |
| Ingestion entity extraction | Dutch medical patterns are well-defined; regex is faster, free, deterministic |
| PII detection | Email, phone, BSN patterns are regular expressions by nature |
| Safety regex validation | Medical advice patterns in Dutch can be reliably captured with regex |

Query-Time Entity Extraction

At query time, entity extraction uses the LLM (combined with intent classification) rather than regex. This is because user queries are colloquial, multilingual, and unpredictable — unlike ingestion content which follows well-defined Dutch medical patterns. See ADR-0030.

Regex + LLM Hybrid (Ingestion)

Ingestion-time entity extraction uses a hybrid approach: regex provides a fast, deterministic, cost-free baseline that extracts 95%+ of entities correctly. The Tier 2 model then validates the results, catching semantic errors that regex fundamentally cannot (e.g., "Borstkas" being a body part, not a doctor name). The same LLM call also generates page summaries for contextual retrieval. See Entity Extraction and ADR-0014.

Query-time entity extraction uses the LLM directly (combined with intent classification) because user queries are colloquial and multilingual. See ADR-0030.

OpenRouter as LLM Gateway

LLM calls use a fallback chain with a circuit breaker for resilience: direct OpenAI API (primary) → local Ollama model (emergency fallback). All LLM calls go to the direct OpenAI API first; the local Ollama model serves as the emergency fallback when the OpenAI API is unreachable.
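A primary-plus-fallback chain guarded by a circuit breaker might be structured like the sketch below. The class, the thresholds, and the primary/fallback callables are assumptions made for illustration, not the project's actual resilience code.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value
        reset_after_s: float = 30.0,     # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._reset_after_s:
            # Half-open: permit one trial call; a single failure re-opens.
            self._opened_at = None
            self._failures = self._threshold - 1
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def complete(
    prompt: str,
    primary: Callable[[str], str],    # stand-in for the direct OpenAI call
    fallback: Callable[[str], str],   # stand-in for the local Ollama call
    breaker: CircuitBreaker,
) -> str:
    """Try the primary provider unless the breaker is open, else fall back."""
    if breaker.allow():
        try:
            result = primary(prompt)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback(prompt)
```

While the breaker is open, requests skip the primary call entirely, so users do not wait on a timeout for every query during an outage.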

Why OpenRouter?

| Benefit | Description |
| --- | --- |
| Provider flexibility | Switch between OpenAI, Anthropic, Google models without code changes |
| Unified billing | Single API key and billing for all cloud models |
| Fallback routing | Automatic failover to alternative providers on outage |
| Rate limit management | OpenRouter handles per-provider rate limits transparently |

Cost-Per-Query Analysis

In full mode (the default), the dominant cost component is response generation via the Tier 2 model at $0.0015/query. Combined with intent classification ($0.00003), quality evaluation ($0.00002), and follow-up suggestions ($0.00001), the total cost per query is approximately $0.0015-0.002. At 25,000 monthly queries, this translates to ~$38-50/month — still highly economical for a production RAG system. Prompt caching (75% input token discount) reduces effective costs further. Per-query costs are tracked automatically by the cost tracking service.
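The arithmetic behind these figures can be reproduced directly from the per-component costs quoted above. The dictionary keys are labels chosen for this example.

```python
# Per-query cost components in USD, as quoted in the paragraph above.
COMPONENT_COST_USD = {
    "response_generation": 0.0015,
    "intent_classification": 0.00003,
    "quality_evaluation": 0.00002,
    "followup_suggestions": 0.00001,
}
MONTHLY_QUERIES = 25_000

per_query = sum(COMPONENT_COST_USD.values())
monthly_low = per_query * MONTHLY_QUERIES
monthly_high = 0.002 * MONTHLY_QUERIES  # upper end of the quoted per-query range

print(f"per query:    ${per_query:.5f}")
print(f"monthly low:  ${monthly_low:.2f}")
print(f"monthly high: ${monthly_high:.2f}")
```

The listed components sum to about $0.00156 per query, or roughly $39 per month at 25,000 queries; the upper end of the quoted range, $0.002 per query, gives $50 per month.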

Temperature Standardization

All LLM calls use deliberately chosen temperature settings based on their task type:

| Task Category | Temperature | Rationale |
| --- | --- | --- |
| Classification (intent, entity validation) | 0.1 | Near-deterministic output; minimal variance improves robustness against edge-case inputs while maintaining reproducibility |
| Validation (graph entities, page summaries) | 0.1 | Consistency is critical; entity decisions must be reproducible across runs |
| RAG response generation | 0.1 | Low creativity keeps responses tightly grounded in source material while allowing natural Dutch phrasing (configurable via rag_full_mode_temperature) |
| Follow-up suggestions | 0.7 | Higher creativity to generate diverse, interesting follow-up questions |

Setting classification and validation tasks to temperature=0.1 minimizes non-determinism that previously caused inconsistent intent classifications and entity validation decisions across identical inputs. All temperature defaults are configurable via config.py (intent_classification_temperature, rag_full_mode_temperature, rag_llm_temperature).

Prompt Caching

OpenAI provides automatic prompt caching on prompts longer than 1,024 tokens, which applies to several tasks in the pipeline:

| Task | Prompt Size | Caching Applies | Discount |
| --- | --- | --- | --- |
| Response generation (Tier 2 / Tier 3) | ~2,000-4,000 tokens (system prompt + context) | Yes | 75% input token discount |
| Entity validation (Tier 2) | ~1,500-2,500 tokens (system prompt + extraction results) | Yes | 75% input token discount |
| Intent classification + entity extraction (Tier 2) | ~500-800 tokens | No (below threshold) | -- |

Prompt caching is automatic and requires no code changes -- OpenAI caches the prompt prefix when it exceeds 1,024 tokens. Subsequent requests that share the same prefix receive a 75% discount on input tokens. This is particularly impactful during ingestion runs where the entity validation system prompt is identical across hundreds of pages.
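The effect of the discount on a single request can be estimated with a helper like the one below. The function and the price argument are illustrative. The 75% figure is the one quoted in this section, and actual per-token prices depend on the configured model.

```python
def effective_input_cost(
    input_tokens: int,
    cached_tokens: int,
    price_per_million_usd: float,   # placeholder: depends on the configured model
    cache_discount: float = 0.75,   # discount quoted in this section
) -> float:
    """Input-token cost in USD when part of the prompt was served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    billable = uncached + cached_tokens * (1.0 - cache_discount)
    return billable * price_per_million_usd / 1_000_000
```

Because only a shared prefix is cached, the static system prompt should come first and the per-request content (page text, retrieved chunks) last, which maximises the cached share.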

Cost Tracking Per Ingestion Run

The CostTracker service provides per-ingestion-run cost breakdowns, enabling visibility into how much each document ingestion costs:

  • Per-run tracking: Each ingestion job gets a unique cost breakdown showing LLM calls for entity validation, canonical question generation, and page summary generation
  • Per-model breakdown: Costs are attributed to the specific model used (Tier 2 for validation and canonical questions, Tier 2 or Tier 3 for generation tasks)
  • Token accounting: Input tokens, output tokens, and cached input tokens are tracked separately to measure prompt caching effectiveness
  • Ingestion vs. query costs: The tracker distinguishes between query-time costs (per-user) and ingestion-time costs (per-document), providing a complete cost picture

This visibility helps identify cost anomalies early and validates that the multi-model strategy is delivering the expected savings.
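Per-run, per-model token accounting along these lines could be sketched as follows. RunCostTracker and its methods are hypothetical names for illustration and do not reflect the real CostTracker interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0


class RunCostTracker:
    """Accumulates token usage per (run, model) pair."""

    def __init__(self) -> None:
        self._usage: dict[tuple[str, str], Usage] = defaultdict(Usage)

    def record(
        self,
        run_id: str,
        model: str,
        input_tokens: int,
        cached_input_tokens: int,
        output_tokens: int,
    ) -> None:
        usage = self._usage[(run_id, model)]
        usage.input_tokens += input_tokens
        usage.cached_input_tokens += cached_input_tokens
        usage.output_tokens += output_tokens

    def breakdown(self, run_id: str) -> dict[str, Usage]:
        """Usage per model for one ingestion run."""
        return {
            model: usage
            for (rid, model), usage in self._usage.items()
            if rid == run_id
        }

    def cache_hit_ratio(self, run_id: str) -> float:
        """Share of a run's input tokens that were served from the prompt cache."""
        usages = list(self.breakdown(run_id).values())
        total = sum(u.input_tokens for u in usages)
        cached = sum(u.cached_input_tokens for u in usages)
        return cached / total if total else 0.0
```

Tracking cached input tokens separately is what makes it possible to check whether prompt caching is actually taking effect during an ingestion run.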