LLM Stack
The ZOL Intelligent Search employs a multi-model strategy where each task is assigned to the most cost-effective model capable of performing it at the required quality level. Within the RAG architecture (Lewis et al., 2020), this approach -- sometimes called "model routing" or "tiered inference" -- ensures that expensive, high-capability models are reserved for tasks that genuinely require them.
Structured-output reliability across the stack is enforced by the structured_call helper (app.llm.structured, ~190 LOC over the raw AsyncOpenAI client), which wraps eight call sites — including intent classification and query decomposition — enforcing JSON-schema validation with retries before raising a typed StructuredCallError. This replaced the older "first-malformed-output silently fails" failure mode. (A Pydantic AI Agent pattern was trialed across these sites on 2026-05-09 but removed 2026-05-12, commit b8d8da67, after telemetry showed it added ~720 ms per call; see Decision-Cost Rubric.)
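The validate-then-retry behaviour described above can be sketched as follows. This is a minimal illustration, not the actual app.llm.structured code: the `call_model` callable, the `required` key/type map (a stand-in for real JSON-schema validation), and the `max_attempts` default are all assumptions made for the example.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, Optional


class StructuredCallError(Exception):
    """Raised when no attempt produced valid structured output."""


def validate(payload: Dict[str, Any], required: Dict[str, type]) -> None:
    # Stand-in for JSON-schema validation: checks required keys and types only.
    for key, expected in required.items():
        if key not in payload:
            raise ValueError(f"missing key: {key}")
        if not isinstance(payload[key], expected):
            raise ValueError(f"wrong type for key: {key}")


async def structured_call(
    call_model: Callable[[], Awaitable[str]],  # hypothetical: returns raw model text
    required: Dict[str, type],
    max_attempts: int = 3,  # assumed default
) -> Dict[str, Any]:
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = await call_model()
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            validate(payload, required)
            return payload
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            last_error = exc
    raise StructuredCallError(
        f"no valid output after {max_attempts} attempts"
    ) from last_error


def make_fake_model(responses):
    # Test double that replays canned responses in order.
    it = iter(responses)

    async def call() -> str:
        return next(it)

    return call


result = asyncio.run(
    structured_call(make_fake_model(["not json", '{"intent": "search"}']), {"intent": str})
)
```

The first malformed response triggers a retry rather than failing silently, and a caller only ever sees either a validated dict or a typed error.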
Model Allocation
Model Tier Mapping
The system uses a tier-based model architecture where each tier represents a capability level, not a specific model. Actual model identifiers are configured in backend/app/config.py and can change independently of the tier architecture.
| Tier | Role | Characteristics |
|---|---|---|
| Tier 1 | Fast classification | Lowest cost, fastest inference |
| Tier 2 | Standard generation | Balanced cost/quality |
| Tier 3 | Flagship generation | High quality, higher cost |
| Escalation | Enhanced reasoning (Think Harder) | Used for escalated queries when users signal dissatisfaction |
This documentation uses tier references (Tier 1, Tier 2, etc.) rather than specific model names. When the underlying models are upgraded, only the configuration file needs to change -- not the documentation or architecture.
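As an illustration of how tiers can stay decoupled from model identifiers, the sketch below maps each tier to a configurable model id. The enum, the dataclass, and the placeholder model ids are hypothetical; the real mapping lives in backend/app/config.py and is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(str, Enum):
    TIER_1 = "tier_1"
    TIER_2 = "tier_2"
    TIER_3 = "tier_3"
    ESCALATION = "escalation"


@dataclass(frozen=True)
class TierConfig:
    role: str
    model_id: str  # placeholder values below, not the real identifiers


# Swapping a model means editing one entry here; callers only know the tier.
TIER_MODELS = {
    Tier.TIER_1: TierConfig("Fast classification", "placeholder-fast-model"),
    Tier.TIER_2: TierConfig("Standard generation", "placeholder-standard-model"),
    Tier.TIER_3: TierConfig("Flagship generation", "placeholder-flagship-model"),
    Tier.ESCALATION: TierConfig("Enhanced reasoning", "placeholder-reasoning-model"),
}


def model_for(tier: Tier) -> str:
    """Resolve a tier to the currently configured model id."""
    return TIER_MODELS[tier].model_id
```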
Model Profiles
Tier 2 (Standard) -- The Workhorse
Role: Fast, cheap tasks that require language understanding but not generation quality
The Tier 2 (standard) model handles all fast inference tasks where speed and cost matter more than generation sophistication:
| Task | Why Tier 2 | Latency | Cost Impact |
|---|---|---|---|
| Intent classification | Categorical decision, not creative generation; temperature=0.1 for near-deterministic output | ~400ms | ~$0.50/month |
| Query rewriting | Combined with intent classification in a single LLM call when conversation history exists | ~0ms extra | Included above |
| Background evaluation | DeepEval scoring via OpenAI; async so latency is irrelevant | ~2s | ~$0.30/month |
Tier 2 (Standard) -- Graph Entity Validation & Page Summaries
Role: Post-extraction quality gate for knowledge graph entities; page summary generation for contextual retrieval
After regex extraction identifies candidate entities, the Tier 2 model validates each entity and relationship through a single LLM call per page (temperature=0.1 for near-deterministic validation). This same call generates a 2-3 sentence Dutch page summary used for contextual retrieval — Anthropic's research demonstrates that prepending document context to chunks reduces retrieval failure rates by 49% when combined with hybrid search.
| Task | Why Tier 2 | Latency | Cost Impact |
|---|---|---|---|
| Entity validation | Rejects fake names ("Borstkas"), boilerplate hubs, wrong entity types; temp=0.1 | Async (ingestion) | ~$1-2/full run |
| Page summary generation | Dutch summary for contextual retrieval at query time | Same LLM call | Included above |
| DeepEval direct | Direct OpenAI model reference for DeepEval framework | Async | Included in eval |
A cross-page entity cache ensures that once an entity is rejected or renamed on one page, the same decision is applied instantly on all subsequent pages without additional LLM calls. This reduces total LLM calls by 10-25%. See ADR-0014 for the full architecture.
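A minimal sketch of the cross-page cache idea, under the assumption that a page whose candidates are all already decided can skip its LLM call entirely. The function names and the "accept"/"reject" verdict strings are hypothetical; ADR-0014 describes the real design.

```python
from typing import Callable, Dict, List

Validator = Callable[[List[str]], Dict[str, str]]  # hypothetical LLM wrapper


def validate_page(
    candidates: List[str],
    cache: Dict[str, str],
    validate_with_llm: Validator,
) -> Dict[str, str]:
    """Return a verdict per candidate, calling the LLM only for unseen entities."""
    unknown = [name for name in candidates if name not in cache]
    if unknown:
        # One call per page, restricted to entities without a cached decision.
        cache.update(validate_with_llm(unknown))
    return {name: cache[name] for name in candidates}


calls: List[List[str]] = []


def fake_validator(names: List[str]) -> Dict[str, str]:
    # Test double: rejects the known false positive, accepts everything else.
    calls.append(list(names))
    return {n: ("reject" if n == "Borstkas" else "accept") for n in names}


cache: Dict[str, str] = {}
page1 = validate_page(["Borstkas", "Cardiologie"], cache, fake_validator)
page2 = validate_page(["Borstkas"], cache, fake_validator)  # served from cache
```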
Tier 2 (Standard) -- Query-Time Entity Extraction (ADR-0030)
Role: Structured medical entity extraction from user queries for knowledge graph routing
The intent classification LLM call was extended to output structured medical entities alongside intent and rewritten query. This replaces the previous dictionary-gated graph routing with LLM semantic understanding -- the same LLM call that classifies intent now also extracts {condition, department, doctor, treatment, examination, campus, service} in a single inference pass.
| Task | Why Tier 2 | Latency | Cost Impact |
|---|---|---|---|
| Entity extraction from queries | Combined with intent classification; zero extra latency | ~0ms extra | Included in IC call |
See ADR-0030 for the rationale — dictionary gating required 16+ iterations of alias additions, was monolingual, and silently blocked valid queries.
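The combined output could be parsed along these lines. The seven entity keys come from the text above; the JSON field names (`intent`, `rewritten_query`, `entities`), the list-valued entities, and the example intent label are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List

ENTITY_KEYS = (
    "condition", "department", "doctor", "treatment",
    "examination", "campus", "service",
)


@dataclass
class IntentResult:
    intent: str
    rewritten_query: str
    entities: Dict[str, List[str]] = field(default_factory=dict)


def parse_intent_response(raw: str) -> IntentResult:
    """Parse one combined classification + extraction response."""
    data = json.loads(raw)
    raw_entities = data.get("entities", {})
    # Normalise so every entity type is present, even when empty.
    entities = {key: list(raw_entities.get(key, [])) for key in ENTITY_KEYS}
    return IntentResult(
        intent=data["intent"],
        rewritten_query=data["rewritten_query"],
        entities=entities,
    )


example = json.dumps({
    "intent": "doctor_search",  # hypothetical label
    "rewritten_query": "cardioloog campus Sint-Jan",
    "entities": {"department": ["cardiologie"], "campus": ["Sint-Jan"]},
})
parsed = parse_intent_response(example)
```

Because every key is always present, downstream graph routing can check `parsed.entities["doctor"]` without guarding against missing fields.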
Tier 2 (Standard) -- LLM-as-Judge Safety Validation
Role: Defense-in-depth post-generation safety check (enabled by default)
When enabled via safety_llm_validation_enabled, the Tier 2 model evaluates whether the generated response contains subtle medical advice that regex patterns cannot catch. The validation is asynchronous and non-blocking, and it fails open: if the LLM judge call fails, the response passes through (the regex layer still catches critical patterns). By contrast, the separate guardrails check (Llama Guard 3) operates in fail-closed mode when enabled.
| Task | Why Tier 2 | Latency | Cost Impact |
|---|---|---|---|
| LLM safety judge | Catches paraphrased medical advice that regex misses | ~500ms (async) | ~$0.20/month |
The judge can be toggled at runtime via the Settings API, for example during demos.
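The difference between the two failure policies can be shown with a small synchronous sketch. The function names and the `check` callable are hypothetical, and the real judge runs asynchronously; only the error-handling direction is illustrated.

```python
from typing import Callable

Check = Callable[[str], bool]  # returns True when the response is safe to show


def judge_fail_open(check: Check, response: str) -> bool:
    """LLM-judge policy: an error in the judge lets the response through."""
    try:
        return check(response)
    except Exception:
        return True


def guard_fail_closed(check: Check, response: str) -> bool:
    """Guardrail policy: an error in the check blocks the response."""
    try:
        return check(response)
    except Exception:
        return False


def unavailable_check(response: str) -> bool:
    # Test double simulating an unreachable safety service.
    raise RuntimeError("safety service unreachable")


open_result = judge_fail_open(unavailable_check, "antwoord")
closed_result = guard_fail_closed(unavailable_check, "antwoord")
```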
Tier 2 / Tier 3 -- The Generator
Role: High-quality Dutch language response generation
In standard mode, the Tier 2 (standard) model handles response generation via OpenRouter. In full mode (rag_full_mode=True, the default), the system uses the same Tier 2 model but routes through the direct OpenAI API for lowest latency, enables always-on reranking, and increases max_tokens. Full mode does not upgrade to Tier 3; that tier is reserved exclusively for escalated (Think Harder) queries. See ADR-0024 for the feature flag design.
| Aspect | Standard Mode | Full Mode (default) |
|---|---|---|
| Model | Tier 2 (via OpenRouter) | Tier 2 (via direct OpenAI) |
| Temperature | 0.1 | 0.1 (configurable via rag_full_mode_temperature) |
| Max tokens | 1,000 | 1,500 (configurable via rag_full_mode_max_tokens) |
| Reranking | Off | Always-on (Jina Reranker v2, 20 → top-15 candidates) |
| Latency | ~3 seconds (streamed) | ~4.5 seconds (reranking + generation) |
| Cost per query | ~$0.0003 | ~$0.0015 |
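The table can be read as a single function from the feature flag to generation settings. In the sketch below the flag and override names (rag_full_mode, rag_full_mode_temperature, rag_full_mode_max_tokens) and the numeric values are taken from this section; the dataclass and the provider labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    provider: str      # hypothetical labels, not real config values
    temperature: float
    max_tokens: int
    rerank: bool


def generation_settings(
    rag_full_mode: bool = True,
    rag_full_mode_temperature: float = 0.1,
    rag_full_mode_max_tokens: int = 1500,
) -> GenerationSettings:
    """Pick generation settings for standard or full mode (same Tier 2 model)."""
    if rag_full_mode:
        return GenerationSettings(
            provider="openai-direct",
            temperature=rag_full_mode_temperature,
            max_tokens=rag_full_mode_max_tokens,
            rerank=True,
        )
    return GenerationSettings(
        provider="openrouter", temperature=0.1, max_tokens=1000, rerank=False
    )
```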
text-embedding-3-large -- The Embedder
Role: Convert text to 1,536-dimensional vectors for semantic search
Embeddings are produced by OpenAI text-embedding-3-large (1,536-dimensional dense vectors, truncated from the model's native 3,072 dimensions to fit pgvector's HNSW 2,000-dim limit). See ADR-0048 for the migration rationale, @openai2024embeddings for the model announcement, and @karpukhin2020dpr for the foundational dense bi-encoder retrieval pattern. The model handles:
- Query embedding at search time
- Document chunk embedding at ingestion time
- Response and context embedding for quality gate evaluation
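To illustrate the truncation from 3,072 to 1,536 dimensions: a shortened vector should be re-normalised to unit length so that cosine and inner-product scores stay comparable. The OpenAI embeddings endpoint also accepts a `dimensions` parameter for the text-embedding-3 models that returns shortened vectors directly; whether this project uses that parameter or truncates manually is not stated here, so the helper below is only a sketch of the manual route.

```python
import math
from typing import List


def truncate_and_normalize(vector: List[float], dims: int = 1536) -> List[float]:
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


# Dummy 3,072-dimensional vector standing in for a native embedding.
native = [1.0] * 3072
shortened = truncate_and_normalize(native)
```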
Migration history. The system has gone through three embedding models: nomic-embed-text (768-dim, Ollama) → BGE-M3 (1024-dim, Ollama, ADR-0033, Feb 2026) → text-embedding-3-large (1,536-dim, OpenAI, ADR-0048, April 2026). The migration to OpenAI traded the zero-cost Ollama local-inference path for stronger Dutch retrieval (MTEB-NL ~64.6 vs BGE-M3's 60.0) and the removal of an operational dependency. BGE-M3 still survives in the stack as the optional ColBERT reranker model — see Reranking & Evaluation. Cost: ~$0.13/1M tokens (75% prompt-cache discount), ~$0.20/month at 25,000 monthly queries.
Jina Reranker v2 -- The Reranker
Role: Deep semantic reranking for all queries in full mode
In full mode (default), the Jina Reranker v2 API reranks all queries — not just escalated searches. This was enabled by ADR-0024 after benchmarks showed that the quality improvement justifies the latency cost for the demo and production deployment. BGE-reranker-v2-m3 serves as a local fallback if the Jina API is unavailable.
| Aspect | Standard Mode | Full Mode (default) |
|---|---|---|
| Task | Not used | Rerank 20 candidates → top 15 |
| When | Never | Every query |
| Latency | 0ms | ~500ms (API) / ~1.5s (local fallback) |
| Cost | Zero | ~$0.001/query (Jina API) |
| Quality | Cosine similarity ranking only | Cross-encoder deep relevance scoring |
In escalated (Think Harder) mode, the reranker processes 100 candidates instead of 20, providing even broader coverage.
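The rerank-with-fallback flow might look like the sketch below. The `primary` and `fallback` scorers are hypothetical callables standing in for the Jina API and the local model; no real client calls are shown.

```python
from typing import Callable, List

Scorer = Callable[[str, List[str]], List[float]]  # one relevance score per candidate


def rerank(
    query: str,
    candidates: List[str],
    primary: Scorer,
    fallback: Scorer,
    top_k: int = 15,
) -> List[str]:
    """Score candidates with the primary reranker, falling back on any error."""
    try:
        scores = primary(query, candidates)
    except Exception:
        scores = fallback(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]


def unreachable_api(query: str, candidates: List[str]) -> List[float]:
    # Test double simulating an API outage.
    raise ConnectionError("reranker API unavailable")


def local_scorer(query: str, candidates: List[str]) -> List[float]:
    # Test double: later chunks score higher.
    return [float(i) for i in range(len(candidates))]


chunks = [f"chunk-{i}" for i in range(20)]
top = rerank("vraag", chunks, unreachable_api, local_scorer)
```

For escalated queries the same function would simply receive 100 candidates instead of 20.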
Tier 1 (Fast) -- The Lightweight Specialist
Role: Ultra-fast, ultra-cheap structured JSON output for non-critical tasks
The Tier 1 (fast) model handles tasks that require minimal language understanding and produce short, structured output. It is specifically used for follow-up suggestion generation — producing 3 contextual follow-up questions as a JSON array after each response.
| Task | Why Tier 1 | Latency | Cost Impact |
|---|---|---|---|
| Follow-up suggestions | JSON array output, no reasoning needed; structured follow-ups derived from response text | ~200ms | ~$0.10/month |
The Tier 1 model was previously removed from the pipeline due to poor performance on Dutch medical content (ADR-0014). It is now reintroduced exclusively for follow-up suggestion generation — a task that requires extracting topics from an already-generated response, not understanding medical terminology. The rag_followup_model config is separate from the main rag_llm_model, ensuring that the follow-up task never accidentally uses a reasoning model (which would break JSON parsing due to hidden thinking tokens).
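Because follow-up suggestions are non-critical, a defensive parser can degrade to an empty list instead of raising. The sketch below assumes that behaviour; the function name and the drop-on-failure policy are illustrative, not taken from the codebase.

```python
import json
from typing import List


def parse_followups(raw: str, expected: int = 3) -> List[str]:
    """Parse a JSON array of follow-up questions; return [] on any malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # For example, output polluted by non-JSON text before the array.
        return []
    if not isinstance(data, list):
        return []
    questions = [q.strip() for q in data if isinstance(q, str) and q.strip()]
    return questions[:expected]


good = parse_followups('["Wat zijn de openingsuren?", "Waar kan ik parkeren?", "Hoe maak ik een afspraak?"]')
bad = parse_followups('Let me think... ["Wat zijn de openingsuren?"]')
```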
Regex -- Deterministic Pattern Matching
Multiple tasks in the pipeline use compiled regex patterns instead of LLM calls:
| Task | Why Regex, Not LLM |
|---|---|
| Ingestion entity extraction | Dutch medical patterns are well-defined; regex is faster, free, deterministic |
| PII detection | Email, phone, BSN patterns are regular expressions by nature |
| Safety regex validation | Medical advice patterns in Dutch can be reliably captured with regex |
At query time, entity extraction uses the LLM (combined with intent classification) rather than regex. This is because user queries are colloquial, multilingual, and unpredictable — unlike ingestion content which follows well-defined Dutch medical patterns. See ADR-0030.
Ingestion-time entity extraction uses a hybrid approach: regex provides a fast, deterministic, cost-free baseline that extracts 95%+ of entities correctly. The Tier 2 model then validates the results, catching semantic errors that regex fundamentally cannot (e.g., "Borstkas" being a body part, not a doctor name). The same LLM call also generates page summaries for contextual retrieval. See Entity Extraction and ADR-0014.
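As a small illustration of the regex layer, the patterns below extract a doctor-name candidate and redact an e-mail address. Both patterns are simplified examples written for this sketch, not the project's compiled patterns, and the sample sentence is invented.

```python
import re
from typing import List

# Illustrative patterns only; production patterns are more extensive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DOCTOR_RE = re.compile(r"\b[Dd]r\.\s+([A-Z][\w-]+(?:\s+[A-Z][\w-]+)*)")


def extract_doctor_candidates(text: str) -> List[str]:
    """Return capitalised names following a 'Dr.' title (candidates, not verified)."""
    return DOCTOR_RE.findall(text)


def redact_emails(text: str) -> str:
    """Replace e-mail addresses with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


sample = "Afspraak bij Dr. Jan Peeters via info@example.org"
candidates = extract_doctor_candidates(sample)
redacted = redact_emails(sample)
```

A pattern like this would accept any capitalised word after "Dr.", which is exactly the kind of false positive the Tier 2 validation step is there to reject.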
OpenRouter as LLM Gateway
LLM calls use a fallback chain with a circuit breaker for resilience: direct OpenAI API (primary) → local Ollama model (emergency fallback). In the default full mode, all LLM calls go to the direct OpenAI API; the local Ollama model serves as the emergency fallback when that API is unreachable.
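A circuit breaker in front of the primary provider can be sketched as follows. The class, its thresholds, and the `primary`/`fallback` callables are hypothetical and cover only the primary and emergency-fallback stages named above.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Opens after consecutive failures and allows a retry after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._reset_after_s:
            # Cool-down elapsed: close the breaker and let a trial call through.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


def complete(prompt: str, primary: Callable[[str], str],
             fallback: Callable[[str], str], breaker: CircuitBreaker) -> str:
    """Try the primary provider unless the breaker is open, else use the fallback."""
    if breaker.allow():
        try:
            result = primary(prompt)
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
    return fallback(prompt)


attempts: List[str] = []


def broken_primary(prompt: str) -> str:
    attempts.append(prompt)
    raise ConnectionError("primary unreachable")


def local_fallback(prompt: str) -> str:
    return f"local:{prompt}"


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=60.0)
answers = [complete("q", broken_primary, local_fallback, breaker) for _ in range(3)]
```

After two failures the breaker opens, so the third request goes straight to the fallback without paying the primary's timeout.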
Why OpenRouter?
| Benefit | Description |
|---|---|
| Provider flexibility | Switch between OpenAI, Anthropic, Google models without code changes |
| Unified billing | Single API key and billing for all cloud models |
| Fallback routing | Automatic failover to alternative providers on outage |
| Rate limit management | OpenRouter handles per-provider rate limits transparently |
Cost-Per-Query Analysis
In full mode (the default), the dominant cost component is response generation via the Tier 2 model at $0.0015/query. Combined with intent classification ($0.00003), quality evaluation ($0.00002), and follow-up suggestions ($0.00001), the total cost per query is approximately $0.0015-0.002. At 25,000 monthly queries, this translates to ~$38-50/month — still highly economical for a production RAG system. Prompt caching (75% input token discount) reduces effective costs further. Per-query costs are tracked automatically by the cost tracking service.
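The arithmetic behind these figures, using only the per-component costs and the query volume quoted above (the dictionary keys are labels chosen for this example):

```python
# Per-query cost components in USD, as quoted in this section.
COMPONENT_COST_USD = {
    "response_generation": 0.0015,
    "intent_classification": 0.00003,
    "quality_evaluation": 0.00002,
    "followup_suggestions": 0.00001,
}
MONTHLY_QUERIES = 25_000

per_query = sum(COMPONENT_COST_USD.values())   # about 0.00156 USD
monthly = per_query * MONTHLY_QUERIES           # about 39 USD

# Bounds of the quoted per-query range, scaled to a month.
monthly_low = 0.0015 * MONTHLY_QUERIES          # 37.5 USD
monthly_high = 0.002 * MONTHLY_QUERIES          # 50 USD
```

The summed components land near the lower end of the quoted range, and response generation accounts for roughly 96% of the per-query total.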
Temperature Standardization
All LLM calls use deliberately chosen temperature settings based on their task type:
| Task Category | Temperature | Rationale |
|---|---|---|
| Classification (intent, entity validation) | 0.1 | Near-deterministic output; minimal variance improves robustness against edge-case inputs while maintaining reproducibility |
| Validation (graph entities, page summaries) | 0.1 | Consistency is critical; entity decisions must be reproducible across runs |
| RAG response generation | 0.1 | Low creativity keeps responses tightly grounded in source material while allowing natural Dutch phrasing (configurable via rag_full_mode_temperature) |
| Follow-up suggestions | 0.7 | Higher creativity to generate diverse, interesting follow-up questions |
Setting classification and validation tasks to temperature=0.1 minimizes non-determinism that previously caused inconsistent intent classifications and entity validation decisions across identical inputs. All temperature defaults are configurable via config.py (intent_classification_temperature, rag_full_mode_temperature, rag_llm_temperature).
Prompt Caching
OpenAI provides automatic prompt caching on prompts longer than 1,024 tokens, which applies to several tasks in the pipeline:
| Task | Prompt Size | Caching Applies | Discount |
|---|---|---|---|
| Response generation (Tier 2 / Tier 3) | ~2,000-4,000 tokens (system prompt + context) | Yes | 75% input token discount |
| Entity validation (Tier 2) | ~1,500-2,500 tokens (system prompt + extraction results) | Yes | 75% input token discount |
| Intent classification + entity extraction (Tier 2) | ~500-800 tokens | No (below threshold) | -- |
Prompt caching is automatic and requires no code changes -- OpenAI caches the prompt prefix when it exceeds 1,024 tokens. Subsequent requests that share the same prefix receive a 75% discount on input tokens. This is particularly impactful during ingestion runs where the entity validation system prompt is identical across hundreds of pages.
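The effect of the discount on billed input can be worked through with a short helper. The 75% discount and the 1,024-token threshold are the figures quoted in this section; the function names and the example token counts are assumptions for illustration.

```python
def caching_applies(prompt_tokens: int, threshold: int = 1024) -> bool:
    """Prompts below the threshold are never cached."""
    return prompt_tokens >= threshold


def effective_input_tokens(prompt_tokens: int, cached_tokens: int,
                           discount: float = 0.75) -> float:
    """Input tokens billed at full-price equivalent after the cache discount."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return uncached + cached_tokens * (1.0 - discount)


# Example: a 2,000-token entity-validation prompt whose 1,536-token prefix is cached.
billed = effective_input_tokens(2000, 1536)   # 464 + 384 = 848
saving = 1.0 - billed / 2000                   # 0.576
```

Only the shared prefix earns the discount, so the overall saving on a prompt is always smaller than 75% unless the whole prompt is cached.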
Cost Tracking Per Ingestion Run
The CostTracker service provides per-ingestion-run cost breakdowns, enabling visibility into how much each document ingestion costs:
- Per-run tracking: Each ingestion job gets a unique cost breakdown showing LLM calls for entity validation, canonical question generation, and page summary generation
- Per-model breakdown: Costs are attributed to the specific model used (Tier 2 for validation and canonical questions, Tier 2 or Tier 3 for generation tasks)
- Token accounting: Input tokens, output tokens, and cached input tokens are tracked separately to measure prompt caching effectiveness
- Ingestion vs. query costs: The tracker distinguishes between query-time costs (per-user) and ingestion-time costs (per-document), providing a complete cost picture
This visibility helps identify cost anomalies early and validates that the multi-model strategy is delivering the expected savings.
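A per-run tracker with separate token counters could be shaped like the sketch below. The class and method names are hypothetical and do not mirror the actual CostTracker API; cached input tokens are assumed to be a subset of input tokens.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Tuple


@dataclass
class TokenUsage:
    input_tokens: int = 0
    cached_input_tokens: int = 0  # assumed to be included in input_tokens
    output_tokens: int = 0


class RunCostTracker:
    """Accumulates token usage per (tier, task) for one ingestion run."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.usage: DefaultDict[Tuple[str, str], TokenUsage] = defaultdict(TokenUsage)

    def record(self, tier: str, task: str, input_tokens: int,
               cached_input_tokens: int, output_tokens: int) -> None:
        entry = self.usage[(tier, task)]
        entry.input_tokens += input_tokens
        entry.cached_input_tokens += cached_input_tokens
        entry.output_tokens += output_tokens

    def cache_hit_ratio(self) -> float:
        """Share of input tokens served from the prompt cache across the run."""
        total = sum(u.input_tokens for u in self.usage.values())
        cached = sum(u.cached_input_tokens for u in self.usage.values())
        return cached / total if total else 0.0


tracker = RunCostTracker("run-001")
tracker.record("tier_2", "entity_validation", 2000, 1536, 150)
tracker.record("tier_2", "entity_validation", 2000, 1536, 120)
tracker.record("tier_2", "canonical_questions", 500, 0, 80)
ratio = tracker.cache_hit_ratio()
```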