LLM Stack

The ZOL Intelligent Search employs a multi-model strategy where each task is assigned to the most cost-effective model capable of performing it at the required quality level. Within the RAG architecture (Lewis et al., 2020), this approach -- sometimes called "model routing" or "tiered inference" -- ensures that expensive, high-capability models are reserved for tasks that genuinely require them.

Structured-output reliability across the stack is handled by the structured_call helper (app.llm.structured, ~190 LOC over the raw AsyncOpenAI client). It wraps eight call sites, including intent classification and query decomposition, and applies JSON-schema validation with retries before raising a typed StructuredCallError. This replaced the older failure mode in which the first malformed output failed silently. (A Pydantic AI Agent pattern was trialed across these sites on 2026-05-09 but removed 2026-05-12, commit b8d8da67, after telemetry showed it added ~720 ms per call; see Decision-Cost Rubric.)
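The validate-and-retry behaviour described above can be sketched in a few lines. This is a minimal, synchronous illustration and not the actual app.llm.structured code: the `produce` callable, the `required_keys` check (standing in for full JSON-schema validation), and the `max_attempts` parameter are all assumptions made for the example.

```python
import json
from typing import Callable


class StructuredCallError(Exception):
    """Raised when no attempt produced output of the expected shape."""


def structured_call(
    produce: Callable[[], str],
    required_keys: set,
    max_attempts: int = 3,
) -> dict:
    """Call `produce` until it returns a JSON object containing `required_keys`.

    `produce` is a hypothetical stand-in for one LLM request; the key check
    is a simplified stand-in for JSON-schema validation.
    """
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = produce()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(parsed, dict):
            last_error = "top-level value is not an object"
            continue
        missing = required_keys - set(parsed)
        if missing:
            last_error = f"missing keys: {sorted(missing)}"
            continue
        return parsed
    # Every attempt failed: surface a typed error instead of failing silently.
    raise StructuredCallError(
        f"gave up after {max_attempts} attempts: {last_error}"
    )
```

The point of the typed exception is that callers can distinguish "the model never produced valid output" from network or programming errors and handle each differently.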

Model Allocation

Model Tier Mapping

The system uses a tier-based model architecture where each tier represents a capability level, not a specific model. Actual model identifiers are configured in backend/app/config.py and can change independently of the tier architecture.

Tier	Role	Characteristics
Tier 1	Fast classification	Lowest cost, fastest inference
Tier 2	Standard generation	Balanced cost/quality
Tier 3	Flagship generation	High quality, higher cost
Escalation	Enhanced reasoning (Think Harder)	Used for escalated queries when users signal dissatisfaction

Decoupled from Model Versions

This documentation uses tier references (Tier 1, Tier 2, etc.) rather than specific model names. When the underlying models are upgraded, only the configuration file needs to change -- not the documentation or architecture.
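As an illustration of this decoupling, the sketch below maps tiers to model identifiers in one place. The class names, field names, and model identifiers are placeholders invented for the example; the real values live in backend/app/config.py and may be structured differently.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST = "tier_1"
    STANDARD = "tier_2"
    FLAGSHIP = "tier_3"
    ESCALATION = "escalation"


@dataclass(frozen=True)
class TierConfig:
    """Maps capability tiers to concrete model identifiers."""

    tier_models: dict

    def model_for(self, tier: Tier) -> str:
        # Callers ask for a tier; only this mapping knows the model name.
        return self.tier_models[tier]


# Placeholder identifiers, not the models actually deployed.
EXAMPLE_CONFIG = TierConfig(
    tier_models={
        Tier.FAST: "example-fast-model",
        Tier.STANDARD: "example-standard-model",
        Tier.FLAGSHIP: "example-flagship-model",
        Tier.ESCALATION: "example-reasoning-model",
    }
)
```

Upgrading a model then means editing one entry in the mapping, while every call site keeps requesting `Tier.STANDARD` or `Tier.FAST`.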

Model Profiles

Tier 2 (Standard) -- The Workhorse

Role: Fast, cheap tasks that require language understanding but not generation quality

The Tier 2 (standard) model handles all fast inference tasks where speed and cost matter more than generation sophistication:

Task	Why Tier 2	Latency	Cost Impact
Intent classification	Categorical decision, not creative generation; temperature=0.1 for near-deterministic output	~400ms	~$0.50/month
Query rewriting	Combined with intent classification in a single LLM call when conversation history exists	~0ms extra	Included above
Background evaluation	DeepEval scoring via OpenAI; async so latency is irrelevant	~2s	~$0.30/month

Tier 2 (Standard) -- Graph Entity Validation & Page Summaries

Role: Post-extraction quality gate for knowledge graph entities; page summary generation for contextual retrieval

After regex extraction identifies candidate entities, the Tier 2 model validates each entity and relationship through a single LLM call per page (temperature=0.1 for near-deterministic validation). This same call generates a 2-3 sentence Dutch page summary used for contextual retrieval — Anthropic's research demonstrates that prepending document context to chunks reduces retrieval failure rates by 49% when combined with hybrid search.

Task	Why Tier 2	Latency	Cost Impact
Entity validation	Rejects fake names ("Borstkas"), boilerplate hubs, wrong entity types; temp=0.1	Async (ingestion)	~$1-2/full run
Page summary generation	Dutch summary for contextual retrieval at query time	Same LLM call	Included above
DeepEval direct	Direct OpenAI model reference for DeepEval framework	Async	Included in eval

A cross-page entity cache ensures that once an entity is rejected or renamed on one page, the same decision is applied instantly on all subsequent pages without additional LLM calls. This reduces total LLM calls by 10-25%. See ADR-0014 for the full architecture.

Tier 2 (Standard) -- Query-Time Entity Extraction (ADR-0030)

Role: Structured medical entity extraction from user queries for knowledge graph routing

The intent classification LLM call was extended to output structured medical entities alongside intent and rewritten query. This replaces the previous dictionary-gated graph routing with LLM semantic understanding -- the same LLM call that classifies intent now also extracts {condition, department, doctor, treatment, examination, campus, service} in a single inference pass.

Task	Why Tier 2	Latency	Cost Impact
Entity extraction from queries	Combined with intent classification; zero extra latency	~0ms extra	Included in IC call

See ADR-0030 for the rationale — dictionary gating required 16+ iterations of alias additions, was monolingual, and silently blocked valid queries.

Tier 2 (Standard) -- LLM-as-Judge Safety Validation

Role: Defense-in-depth post-generation safety check (enabled by default)

When enabled via safety_llm_validation_enabled, the Tier 2 model evaluates whether the generated response contains subtle medical advice that regex patterns cannot catch. This is a non-blocking, async validation — if the LLM judge call fails, the response passes through (the regex layer still catches critical patterns). Note that the separate guardrails check (Llama Guard 3) operates in fail-closed mode when enabled.

Task	Why Tier 2	Latency	Cost Impact
LLM safety judge	Catches paraphrased medical advice that regex misses	~500ms (async)	~$0.20/month

Togglable at runtime via the Settings API for demos.
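The fail-open behaviour of the LLM judge can be sketched as follows. The function name, the `judge` callable, and the timeout value are assumptions for illustration; the sketch only shows the pattern of letting a response through when the judge itself fails.

```python
import asyncio
from typing import Awaitable, Callable


async def passes_safety_judge(
    judge: Callable[[str], Awaitable[bool]],
    response: str,
    timeout_s: float = 2.0,
) -> bool:
    """Return True when the response may pass.

    `judge` is a hypothetical async call returning True for a safe response.
    Any judge failure or timeout is treated as a pass (fail-open), on the
    assumption that the regex layer has already caught critical patterns.
    """
    try:
        return await asyncio.wait_for(judge(response), timeout=timeout_s)
    except Exception:
        return True
```

A fail-closed check, like the guardrails check mentioned above, would return False in the `except` branch instead.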

Tier 2 / Tier 3 -- The Generator

Role: High-quality Dutch language response generation

In standard mode, the Tier 2 (standard) model handles response generation via OpenRouter. In full mode (rag_full_mode=True, the default), the system uses the same Tier 2 model but routes through the direct OpenAI API for lowest latency, enables always-on reranking, and increases max_tokens. Full mode does not upgrade to Tier 3; that tier is reserved exclusively for escalated (Think Harder) queries. See ADR-0024 for the feature flag design.

Aspect	Standard Mode	Full Mode (default)
Model	Tier 2 (via OpenRouter)	Tier 2 (via direct OpenAI)
Temperature	0.1	0.1 (configurable via `rag_full_mode_temperature`)
Max tokens	1,000	1,500 (configurable via `rag_full_mode_max_tokens`)
Reranking	Off	Always-on (Jina Reranker v2, 20 → top-15 candidates)
Latency	~3 seconds (streamed)	~4.5 seconds (reranking + generation)
Cost per query	~$0.0003	~$0.0015

text-embedding-3-large -- The Embedder

Role: Convert text to 1,536-dimensional vectors for semantic search

Embeddings are produced by OpenAI text-embedding-3-large (1,536-dimensional dense vectors, truncated from the model's native 3,072 dimensions to fit pgvector's HNSW 2,000-dim limit). See ADR-0048 for the migration rationale, @openai2024embeddings for the model announcement, and @karpukhin2020dpr for the foundational dense bi-encoder retrieval pattern. The model handles:

Query embedding at search time
Document chunk embedding at ingestion time
Response and context embedding for quality gate evaluation
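The dimension reduction mentioned above can be illustrated with a small helper. The OpenAI embeddings API accepts a `dimensions` parameter for the text-embedding-3 models that returns shortened vectors directly; whether this system uses that parameter or truncates client-side is not stated here, so the function below is only a sketch of the client-side variant, and its name is invented for the example.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dims: int = 1536) -> List[float]:
    """Keep the first `dims` components and rescale to unit length.

    Re-normalising matters because a truncated vector is no longer
    unit-length, which would skew cosine and inner-product comparisons.
    """
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]
```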

Migration history. The system has gone through three embedding models: nomic-embed-text (768-dim, Ollama) → BGE-M3 (1024-dim, Ollama, ADR-0033, Feb 2026) → text-embedding-3-large (1,536-dim, OpenAI, ADR-0048, April 2026). The migration to OpenAI traded the zero-cost Ollama local-inference path for stronger Dutch retrieval (MTEB-NL ~64.6 vs BGE-M3's 60.0) and the removal of an operational dependency. BGE-M3 still survives in the stack as the optional ColBERT reranker model — see Reranking & Evaluation. Cost: ~$0.13/1M tokens (75% prompt-cache discount), ~$0.20/month at 25,000 monthly queries.

Jina Reranker v2 -- The Reranker

Role: Deep semantic reranking for all queries in full mode

In full mode (default), the Jina Reranker v2 API reranks all queries — not just escalated searches. This was enabled by ADR-0024 after benchmarks showed that the quality improvement justifies the latency cost for the demo and production deployment. BGE-reranker-v2-m3 serves as a local fallback if the Jina API is unavailable.

Aspect	Standard Mode	Full Mode (default)
Task	Not used	Rerank 20 candidates → top 15
When	Never	Every query
Latency	0ms	~500ms (API) / ~1.5s (local fallback)
Cost	Zero	~$0.001/query (Jina API)
Quality	Cosine similarity ranking only	Cross-encoder deep relevance scoring

In escalated (Think Harder) mode, the reranker processes 100 candidates instead of 20, providing even broader coverage.
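The rerank-with-fallback flow can be sketched generically. Both scorers below are hypothetical callables standing in for the Jina API and the local model; the sketch makes no claim about either library's real interface.

```python
from typing import Callable, List, Sequence

# A scorer returns one relevance score per candidate, in candidate order.
Scorer = Callable[[str, Sequence[str]], List[float]]


def rerank(
    query: str,
    candidates: Sequence[str],
    primary: Scorer,
    fallback: Scorer,
    top_k: int = 15,
) -> List[str]:
    """Score candidates with `primary`, falling back if it raises."""
    try:
        scores = primary(query, candidates)
    except Exception:
        scores = fallback(query, candidates)
    ranked = sorted(
        zip(scores, candidates), key=lambda pair: pair[0], reverse=True
    )
    return [doc for _, doc in ranked[:top_k]]
```

With the figures above, full mode would pass 20 candidates with `top_k=15`, and escalated mode would pass 100 candidates.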

Tier 1 (Fast) -- The Lightweight Specialist

Role: Ultra-fast, ultra-cheap structured JSON output for non-critical tasks

The Tier 1 (fast) model handles tasks that require minimal language understanding and produce short, structured output. It is specifically used for follow-up suggestion generation — producing 3 contextual follow-up questions as a JSON array after each response.

Task	Why Tier 1	Latency	Cost Impact
Follow-up suggestions	JSON array output, no reasoning needed; structured follow-ups derived from response text	~200ms	~$0.10/month

Why a dedicated Tier 1 model was reintroduced

The Tier 1 model was previously removed from the pipeline due to poor performance on Dutch medical content (ADR-0014). It is now reintroduced exclusively for follow-up suggestion generation — a task that requires extracting topics from an already-generated response, not understanding medical terminology. The rag_followup_model config is separate from the main rag_llm_model, ensuring that the follow-up task never accidentally uses a reasoning model (which would break JSON parsing due to hidden thinking tokens).

Regex -- Deterministic Pattern Matching

Multiple tasks in the pipeline use compiled regex patterns instead of LLM calls:

Task	Why Regex, Not LLM
Ingestion entity extraction	Dutch medical patterns are well-defined; regex is faster, free, deterministic
PII detection	Email, phone, BSN patterns are regular expressions by nature
Safety regex validation	Medical advice patterns in Dutch can be reliably captured with regex

Query-Time Entity Extraction

At query time, entity extraction uses the LLM (combined with intent classification) rather than regex. This is because user queries are colloquial, multilingual, and unpredictable — unlike ingestion content which follows well-defined Dutch medical patterns. See ADR-0030.

Regex + LLM Hybrid (Ingestion)

Ingestion-time entity extraction uses a hybrid approach: regex provides a fast, deterministic, cost-free baseline that extracts 95%+ of entities correctly. The Tier 2 model then validates the results, catching semantic errors that regex fundamentally cannot (e.g., "Borstkas" being a body part, not a doctor name). The same LLM call also generates page summaries for contextual retrieval. See Entity Extraction and ADR-0014.
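The two-stage idea can be sketched as a regex pass followed by a validation pass. The pattern below is a deliberately simple, hypothetical example for doctor names, and `is_valid` stands in for the Tier 2 validation call; neither reflects the production patterns or prompts.

```python
import re
from typing import Callable, List

# Hypothetical pattern: "Dr." or "dr." followed by a capitalised word.
DOCTOR_PATTERN = re.compile(r"\b[Dd]r\.\s+([A-Z][\w'-]+)")


def extract_doctors(text: str, is_valid: Callable[[str], bool]) -> List[str]:
    """Regex proposes candidates; the validator removes semantic errors."""
    seen = set()
    accepted = []
    for name in DOCTOR_PATTERN.findall(text):
        if name in seen:
            continue
        seen.add(name)
        if is_valid(name):
            accepted.append(name)
    return accepted
```

The regex stage is cheap and deterministic, so the validator only ever sees a short candidate list rather than the full page text.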

Query-time entity extraction uses the LLM directly (combined with intent classification) because user queries are colloquial and multilingual. See ADR-0030.

OpenRouter as LLM Gateway

LLM calls use a fallback chain with a circuit breaker for resilience: direct OpenAI API (primary) → local Ollama model (emergency fallback). All LLM calls go to the direct OpenAI API first; the local Ollama model takes over only when the OpenAI API is unreachable.
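A primary-then-fallback call guarded by a circuit breaker can be sketched as below. The class, its thresholds, and the `primary` and `fallback` callables are assumptions for illustration and do not describe the project's actual breaker.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures and allows a trial call after a pause."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: permit one trial once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    breaker: CircuitBreaker,
    prompt: str,
) -> str:
    """Try the primary provider unless the breaker is open."""
    if breaker.allow():
        try:
            result = primary(prompt)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return result
    return fallback(prompt)
```

While the breaker is open, requests skip the primary provider entirely, which avoids paying a timeout on every query during an outage.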

Why OpenRouter?

Benefit	Description
Provider flexibility	Switch between OpenAI, Anthropic, Google models without code changes
Unified billing	Single API key and billing for all cloud models
Fallback routing	Automatic failover to alternative providers on outage
Rate limit management	OpenRouter handles per-provider rate limits transparently

Cost-Per-Query Analysis

In full mode (the default), the dominant cost component is response generation via the Tier 2 model at ~$0.0015/query. Combined with intent classification (~$0.00003), quality evaluation (~$0.00002), and follow-up suggestions (~$0.00001), the total cost per query is approximately $0.0015-0.002. At 25,000 monthly queries, this translates to ~$38-50/month, which is still highly economical for a production RAG system. Prompt caching (75% input token discount) reduces effective costs further. Per-query costs are tracked automatically by the cost tracking service.
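The arithmetic behind these figures can be checked with a short calculation. The component values are the approximate per-query costs quoted above; the dictionary keys and function names are invented for the example.

```python
from typing import Dict

# Approximate per-query costs in USD, as quoted in the text above.
PER_QUERY_COSTS_USD: Dict[str, float] = {
    "response_generation": 0.0015,
    "intent_classification": 0.00003,
    "quality_evaluation": 0.00002,
    "followup_suggestions": 0.00001,
}


def per_query_cost(costs: Dict[str, float]) -> float:
    """Sum the cost components of a single query."""
    return sum(costs.values())


def monthly_cost(costs: Dict[str, float], monthly_queries: int) -> float:
    """Scale the per-query cost to a monthly query volume."""
    return per_query_cost(costs) * monthly_queries
```

The four listed components sum to about $0.00156 per query, or about $39 at 25,000 queries per month, which sits at the lower end of the quoted range; the upper end corresponds to $0.002 × 25,000 = $50.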

Temperature Standardization

All LLM calls use deliberately chosen temperature settings based on their task type:

Task Category	Temperature	Rationale
Classification (intent, entity validation)	0.1	Near-deterministic output; minimal variance improves robustness against edge-case inputs while maintaining reproducibility
Validation (graph entities, page summaries)	0.1	Consistency is critical; entity decisions must be reproducible across runs
RAG response generation	0.1	Low creativity keeps responses tightly grounded in source material while allowing natural Dutch phrasing (configurable via `rag_full_mode_temperature`)
Follow-up suggestions	0.7	Higher creativity to generate diverse, interesting follow-up questions

Setting classification and validation tasks to temperature=0.1 minimizes non-determinism that previously caused inconsistent intent classifications and entity validation decisions across identical inputs. All temperature defaults are configurable via config.py (intent_classification_temperature, rag_full_mode_temperature, rag_llm_temperature).

Prompt Caching

OpenAI provides automatic prompt caching for prompts of 1,024 tokens or more, which applies to several tasks in the pipeline:

Task	Prompt Size	Caching Applies	Discount
Response generation (Tier 2 / Tier 3)	~2,000-4,000 tokens (system prompt + context)	Yes	75% input token discount
Entity validation (Tier 2)	~1,500-2,500 tokens (system prompt + extraction results)	Yes	75% input token discount
Intent classification + entity extraction (Tier 2)	~500-800 tokens	No (below threshold)	--

Prompt caching is automatic and requires no code changes -- OpenAI caches the prompt prefix once the prompt reaches 1,024 tokens. Subsequent requests that share the same prefix receive a 75% discount on the cached input tokens. This is particularly impactful during ingestion runs where the entity validation system prompt is identical across hundreds of pages.
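The effect of the discount on input cost can be sketched as a small calculation. The 1,024-token threshold and the 75% discount come from the text above; the function name and the price used in the usage note are hypothetical.

```python
CACHE_THRESHOLD_TOKENS = 1024  # prompts shorter than this are never cached
CACHE_DISCOUNT = 0.75          # discount on cached input tokens, per the text


def effective_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million_usd: float,
) -> float:
    """Input cost in USD for one request, with cached tokens discounted."""
    if prompt_tokens < CACHE_THRESHOLD_TOKENS:
        cached_tokens = 0
    cached = min(cached_tokens, prompt_tokens)
    uncached = prompt_tokens - cached
    billable = uncached + cached * (1.0 - CACHE_DISCOUNT)
    return billable * price_per_million_usd / 1_000_000
```

For example, at a hypothetical price of $1.00 per million input tokens, a 2,000-token prompt with 1,500 cached tokens is billed as 500 + 375 = 875 tokens, while a 600-token intent-classification prompt is billed in full.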

Cost Tracking Per Ingestion Run

The CostTracker service provides per-ingestion-run cost breakdowns, enabling visibility into how much each document ingestion costs:

Per-run tracking: Each ingestion job gets a unique cost breakdown showing LLM calls for entity validation, canonical question generation, and page summary generation
Per-model breakdown: Costs are attributed to the specific model used (Tier 2 for validation and canonical questions, Tier 2 or Tier 3 for generation tasks)
Token accounting: Input tokens, output tokens, and cached input tokens are tracked separately to measure prompt caching effectiveness
Ingestion vs. query costs: The tracker distinguishes between query-time costs (per-user) and ingestion-time costs (per-document), providing a complete cost picture

This visibility helps identify cost anomalies early and validates that the multi-model strategy is delivering the expected savings.
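The kind of bookkeeping described above can be sketched with a small in-memory tracker. This is not the CostTracker service itself: the class, method names, and the convention that `input_tokens` counts only uncached input are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class Usage:
    input_tokens: int = 0         # uncached input tokens (assumed convention)
    cached_input_tokens: int = 0
    output_tokens: int = 0


class CostTrackerSketch:
    """Accumulates token usage per (run, model) pair."""

    def __init__(self) -> None:
        self._usage = defaultdict(Usage)

    def record(
        self,
        run_id: str,
        model: str,
        input_tokens: int,
        cached_input_tokens: int,
        output_tokens: int,
    ) -> None:
        usage = self._usage[(run_id, model)]
        usage.input_tokens += input_tokens
        usage.cached_input_tokens += cached_input_tokens
        usage.output_tokens += output_tokens

    def breakdown(self, run_id: str) -> Dict[str, Usage]:
        """Per-model usage for one ingestion run."""
        return {
            model: usage
            for (rid, model), usage in self._usage.items()
            if rid == run_id
        }

    def cache_hit_ratio(self, run_id: str) -> float:
        """Share of input tokens served from the prompt cache."""
        usages = self.breakdown(run_id).values()
        cached = sum(u.cached_input_tokens for u in usages)
        total = cached + sum(u.input_tokens for u in usages)
        return cached / total if total else 0.0
```

Tracking cached tokens separately is what makes it possible to verify that prompt caching is actually taking effect during a run.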
