System Architecture

Prerequisites

This page describes the architecture of an already-functional RAG system and assumes familiarity with the fundamentals. If you are new to the system, start with Core Concepts, and keep the Glossary open for any unfamiliar term.

The ZOL Intelligent Search system implements retrieval-augmented generation (Lewis et al. 2020) over hospital corpora, with two production input channels, a web chat interface and a voice (telephony) interface, that share the same retrieval backend. The architecture is layered: each layer has a distinct responsibility and communicates with adjacent layers through well-defined interfaces. This separation makes substitutions tractable: when we migrated the embedding model from BGE-M3 to OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings), the change touched a single layer (Embedding Generation) without cascading through retrieval, generation, or evaluation.

Architectural Trade-offs

Three foundational decisions define the system shape; each is captured in an ADR with the alternatives that were considered and rejected.

Decision	Chosen	Alternatives considered	Rejected because
Vector store	PostgreSQL + pgvector (@pgvector_docs)	Dedicated vector DB (Pinecone, Weaviate, Qdrant); FAISS in-process (@johnson2017faiss)	Pinecone/Weaviate add a second operational system, separate access-control plane, and synchronisation surface against the relational tenant + taxonomy data; FAISS lacks ACID transactions and persistent storage at our 10K-chunk scale. pgvector keeps embeddings, taxonomy, and tenant metadata in one ACID database with one backup story.
Knowledge representation	PostgreSQL taxonomy tables (ADR-0053; supersedes Neo4j adoption)	Neo4j Graph Data Science (@neo4j_gds_manual); embed graph structure into chunk text	Neo4j added an operational service for queries that were already expressible as SQL joins over typed entity tables, and the GDS algorithms (PageRank, community detection) were never load-bearing in retrieval. Embedding the graph into chunks lost the structured-lookup property used by the doctor / department / specialty paths.
LLM call boundary	`structured_call` thin helper (~190 LOC) over the raw OpenAI client	Pydantic AI `Agent[None, OutputModel]` (@pydantic_ai_docs); raw `json.loads`; LangChain wrappers	Pydantic AI was adopted 2026-05-09 across 8 call sites then removed 2026-05-12 (commit `b8d8da67`) after production telemetry showed it added ~720 ms per call. The thin helper preserves schema validation + retries while paying zero latency tax. See the Decision-Cost Rubric case study for the full post-mortem; this is the load-bearing argument behind the v2.3 Brainstorm Gate.

Further per-decision rationale lives in each ADR; this page summarises them so a reader can orient before reading the deeper component pages.

The layered design ensures that changes in one layer — such as swapping an embedding model or adding a new safety filter — do not cascade through the entire system.

Architecture Layers

Diagram conventions

OAI is the OpenAI direct API endpoint used for embeddings and LLM generation since the migration in ADR-0048. The previous OR box (OpenRouter) is intentionally absent: OpenRouter remains a configurable override (rag_llm_provider=openrouter) but is not the default path; see LLM Stack for the legacy and current routing.

The diagram above shows the web chat channel only. The voice channel attaches at the same QP (Query Pipeline) node via VoiceLLMOrchestrator → RAGService.query_stream(channel="voice"), but adds upstream stages — Twilio SIP gateway → LiveKit room → voice_agent worker (Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS) → WebSocket /ws/public-query — and a downstream answer-shaping layer that strips inline [N] citation markers and enforces ≤2 sentences. The voice-specific sequence diagram lives in Voice Architecture.

Layer Responsibilities

Presentation Layer

The frontend is a React + TypeScript single-page application that provides a conversational chat interface. It communicates with the backend over a REST API (authentication, document management) and a WebSocket connection (query streaming). The WebSocket connection delivers real-time progress updates as the query traverses the pipeline, reporting stages such as "Understanding your question...", "Searching knowledge base...", and "Generating response...".

API Layer

FastAPI serves as the API gateway, chosen for its native async support, automatic OpenAPI documentation, and Pydantic-based request validation. The API layer enforces security before any business logic executes:

Middleware	Purpose	Configuration
Keycloak OIDC	Identity verification via external IdP	JWT tokens, Keycloak realm
CSRF Protection	Cross-site request forgery prevention	starlette-csrf
Rate Limiting	Abuse prevention	10/min login, 30/min queries
CORS	Cross-origin security	Configured per environment
Observability	Structured request/response logging	structlog (JSON in production, colored console in dev)
Graceful Shutdown	Request draining on SIGTERM	Returns 503 for new requests during shutdown
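
The per-route limits in the table can be illustrated with a sliding-window counter. This is a minimal in-memory sketch, not the production limiter: the class name, the `client_id` key, and the injected clock are assumptions made for illustration, and the real rate-limit state lives in Redis.

```python
import time
from collections import defaultdict, deque

# Requests allowed per 60-second window, taken from the middleware table.
LIMITS = {"login": 10, "query": 30}


class SlidingWindowLimiter:
    """Hypothetical in-memory stand-in for a Redis-backed limiter."""

    def __init__(self, window_s: float = 60.0, clock=time.monotonic):
        self._window = window_s
        self._clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, route: str) -> bool:
        now = self._clock()
        hits = self._hits[(client_id, route)]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= LIMITS[route]:
            return False
        hits.append(now)
        return True
```

Keying on `(client_id, route)` keeps the login and query budgets independent, so exhausting one does not block the other.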

Security Layer

The security layer implements the safety-first principle that permeates this architecture. Every query passes through intent classification before any retrieval or generation occurs. Queries classified as out_of_scope_medical_advice are blocked immediately, never reaching the retrieval pipeline. Post-generation safety validation provides a second check, and the quality gate ensures response confidence meets minimum thresholds.

Critical Design Principle

The security layer operates on a deny-by-default model. A query must actively pass through each safety checkpoint to receive a response. Any single layer can halt the pipeline and return a safe fallback message.
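
The deny-by-default rule can be sketched as a chain of checkpoints in which anything other than an explicit pass yields the fallback. The names below (`Verdict`, `run_checkpoints`, the fallback wording, the `visiting_hours` intent, and the 0.5 confidence threshold) are illustrative assumptions, not the system's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical fallback text; the real message is not given on this page.
SAFE_FALLBACK = "I can't help with that request. Please contact the hospital directly."


@dataclass
class Verdict:
    passed: bool
    reason: str = ""


Checkpoint = Callable[[dict], Verdict]


def intent_gate(state: dict) -> Verdict:
    intent = state.get("intent")
    if intent is None:
        # No classification is not a pass: deny by default.
        return Verdict(False, "unclassified")
    if intent == "out_of_scope_medical_advice":
        return Verdict(False, "blocked_intent")
    return Verdict(True)


def quality_gate(state: dict, min_confidence: float = 0.5) -> Verdict:
    # The threshold is an assumed value for illustration.
    if state.get("confidence", 0.0) < min_confidence:
        return Verdict(False, "low_confidence")
    return Verdict(True)


def run_checkpoints(state: dict, checkpoints: list[Checkpoint]) -> Optional[str]:
    """Return None when every checkpoint passes, otherwise the safe fallback."""
    for check in checkpoints:
        try:
            verdict = check(state)
        except Exception:
            # A crashing checkpoint halts the pipeline instead of letting the query through.
            return SAFE_FALLBACK
        if not verdict.passed:
            return SAFE_FALLBACK
    return None
```

Treating a missing intent and a raised exception as failures is what makes the chain deny-by-default rather than allow-by-default.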

Service Layer

The service layer contains the core intelligence of the system. The Query Pipeline Orchestrator coordinates the sequence of operations following the retrieve-then-generate paradigm established by Lewis et al. (2020): query rewriting for conversational context; vector, BM25, and taxonomy retrieval, run sequentially because of the asyncpg single-session constraint; a Value Framework affinity rerank to prevent cross-category contamination; response generation with source grounding; and hybrid quality evaluation. Each service is independently testable and replaceable.

Value Framework intent-category affinity rerank (backend/app/services/value_framework/affinity.py) executes between retrieval and context assembly on every non-cached query (Stage 5b). It multiplies each chunk's relevance score by an intent_class × content_category affinity coefficient, demoting chunks whose tagged category is mismatched with the classified intent. The mechanism is hospital-agnostic — it reads chunk text + intent class only, never tenant-specific facts — and per-turn outputs are written to app.category_mismatch_telemetry (migration 066) for the Operations dashboard. See Query Pipeline §Stage 5b for the algorithm and rationale (the wheelchair-vs-cardiology cross-contamination regression).
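
The multiplication described above can be sketched as follows. The affinity table, its coefficients, the default for unlisted pairs, and the intent and category names are all invented for illustration; the real values live in `affinity.py` and are not reproduced here.

```python
from dataclasses import dataclass, replace

# Hypothetical coefficients: (intent_class, content_category) -> multiplier.
AFFINITY = {
    ("department_info", "cardiology"): 1.0,
    ("department_info", "facilities"): 0.4,
}
DEFAULT_AFFINITY = 0.7  # assumed value for pairs missing from the table


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    category: str
    score: float


def affinity_rerank(chunks: list[Chunk], intent_class: str) -> list[Chunk]:
    """Scale each relevance score by its affinity coefficient, then re-sort."""
    rescored = [
        replace(c, score=c.score * AFFINITY.get((intent_class, c.category), DEFAULT_AFFINITY))
        for c in chunks
    ]
    return sorted(rescored, key=lambda c: c.score, reverse=True)
```

With these assumed numbers, a facilities chunk that scored 0.9 in retrieval drops to 0.36 for a `department_info` query and ranks below a cardiology chunk that scored 0.8.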

structured_call thin helper as the LLM-call boundary: 8 LLM call sites — intent classification, query decomposition, feedback investigation, feedback digest, adversarial eval, diagnostic runner, voice turn evaluator, conversation classifier — invoke the model through a 190-LOC structured_call helper over the raw AsyncOpenAI client. The helper enforces JSON-schema validation on responses, retries on validation failure, and raises a typed fallback exception only when the model still cannot satisfy the schema. This shape replaced Pydantic AI on 2026-05-12 (commit b8d8da67) after production telemetry showed Pydantic AI's Agent.run() added ~720 ms per call; the thin helper preserves the validation contract while paying zero latency tax. The removal incident is canonised in Decision-Cost Rubric as the load-bearing case study behind the methodology v2.3 Brainstorm Gate.
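
The validate-retry-raise contract can be sketched without any SDK dependency. This is not the production helper: the real one is asynchronous and wraps `AsyncOpenAI`, whereas this sketch takes the model call as an injected function, replaces JSON-schema validation with a required-keys-and-types check, and feeds the validation error back into the prompt, which is an assumption about retry behaviour.

```python
import json
from typing import Callable


class StructuredCallError(Exception):
    """Raised when no attempt produced schema-valid output."""


def validate(payload: object, required: dict[str, type]) -> dict:
    # Minimal stand-in for JSON-schema validation: required keys and their types.
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    for key, expected in required.items():
        if not isinstance(payload.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return payload


def structured_call(
    call_model: Callable[[str], str],
    prompt: str,
    required: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(json.loads(raw), required)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Assumed behaviour: tell the model what was wrong before retrying.
            prompt = f"{prompt}\n\nPrevious output was invalid: {exc}. Return valid JSON."
    raise StructuredCallError(str(last_error))
```

Callers handle a single typed exception, and that exception is raised only after every retry has failed.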

Data Layer

Six logical data stores serve distinct purposes; several share a backing technology:

Store	Technology	Purpose
Vector Store	PostgreSQL + pgvector	Semantic similarity search over document chunks
Taxonomy	PostgreSQL	Structured entity relationships and graph queries
Semantic Query Cache	PostgreSQL + pgvector	Two-tier query result cache (hash + embedding similarity), see ADR-0031
Intent Classification Cache	Memory (LRU) or Redis	Per-`(tenant, query, language)` cache of intent classification results — skips the ~2,300 ms LLM call on repeat queries. Backend is runtime-selectable via `INTENT_CACHE_BACKEND`. See ADR-0054
Sessions / Rate-limit	Redis	Session management, rate limiting, token blacklist
Object Storage	MinIO	Raw document storage (markdown, HTML)
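
The in-memory variant of the intent classification cache can be pictured as an LRU map keyed by tenant, query, and language. The class below is a hypothetical sketch: the key format, the whitespace and case normalisation, and the capacity are assumptions, and the Redis backend selected by `INTENT_CACHE_BACKEND` is not shown.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class IntentCache:
    """Hypothetical LRU cache keyed by (tenant, query, language)."""

    def __init__(self, max_entries: int = 1024):
        self._max = max_entries
        self._data: OrderedDict[str, str] = OrderedDict()

    @staticmethod
    def key(tenant_id: str, query: str, language: str) -> str:
        # Assumed normalisation: lower-case and collapse whitespace before hashing.
        normalised = " ".join(query.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        return f"intent:{tenant_id}:{language}:{digest}"

    def get(self, tenant_id: str, query: str, language: str) -> Optional[str]:
        k = self.key(tenant_id, query, language)
        if k not in self._data:
            return None
        self._data.move_to_end(k)  # mark as most recently used
        return self._data[k]

    def put(self, tenant_id: str, query: str, language: str, intent: str) -> None:
        k = self.key(tenant_id, query, language)
        self._data[k] = intent
        self._data.move_to_end(k)
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry
```

Because the tenant is part of the key, one hospital's classification is never served to another hospital's identical query.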

External Services

The system integrates a single external LLM provider:

OpenAI provides the LLM models and embeddings directly. All API calls use the OpenAI direct endpoint (Tier 2 for intent classification, entity extraction, evaluation, and canonical questions; Tier 2 or Tier 3 for generation in full mode). Embeddings always use OpenAI text-embedding-3-large per ADR-0048, @openai2024embeddings.

Ollama is configured but not deployed

The LLM_FALLBACK_CHAIN setting in backend/.env.example lists Ollama as a final-tier emergency fallback, and LLMProviderFactory._create_ollama_client exists in code. In practice Ollama is not deployed on pilot and the fallback path has never been validated end-to-end — only mock unit tests exist. The path should be treated as vestigial until either deployed and tested, or removed.

Technology Rationale

The technology choices reflect three guiding principles:

Right tool for the right job -- pgvector for vectors (@pgvector_docs), PostgreSQL taxonomy tables for entity relationships, Redis for ephemeral state
Cloud-first -- Embeddings and LLM calls use OpenAI direct API (single LLM provider in production; see the note in External Services for the unused Ollama fallback configuration)
Standards-based -- FastAPI (OpenAPI), PostgreSQL (SQL standard), WebSocket (RFC 6455) ensure interoperability and long-term maintainability

Deployment Architecture

All infrastructure components run as Docker containers orchestrated via Docker Compose, enabling reproducible local development and straightforward deployment. PostgreSQL serves double duty as both the vector store (pgvector) and the entity taxonomy store. Keycloak provides OIDC-based authentication with JWT tokens. The frontend and backend are the only components that run outside of Docker during development, with hot-reloading enabled for rapid iteration.

Channels — Web and Voice

The system serves two production input channels that share the retrieval backend but diverge at the LLM call:

Channel	Entry path	Orchestrator	TTS / response shape
Web (chat)	HTTPS / WebSocket from React UI	`RAGService` (`backend/app/services/rag_service.py`)	Streamed Markdown + citation list
Voice (telephony)	Twilio Elastic SIP → LiveKit SIP → LiveKit Agents	`VoiceLLMOrchestrator` (`backend/app/services/voice/voice_llm_orchestrator.py`)	ElevenLabs Multilingual v2 TTS @elevenlabs_multilingual_v2; answer-shaped for spoken delivery (no inline `[N]` citation markers)

Routing happens at request time on the channel field of QueryRequest (backend/app/models/schemas.py:216, typed as Literal["web", "whatsapp", "voice"]). The voice path uses a deliberately thin pipeline (regex pre-filter → FAQ tool → RAG fallback) per ADR-0049 / thin-voice-architecture; the previously attempted 8-stage agentic orchestrator was removed on 2026-05-02 (commit 158d793) after the cache-hit rate failed to justify its complexity. Speech-to-text uses Deepgram Nova-3 @deepgram_nova3 with the language locked at first utterance per ADR-0052 / voice-language-locking.
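
Routing on the channel field can be sketched as a lookup from channel to orchestrator name. The dataclass below is a simplified stand-in for the real `QueryRequest` schema, and `select_orchestrator` is a hypothetical function. Only the two production channels are mapped, because this page does not describe how `whatsapp` is routed.

```python
from dataclasses import dataclass
from typing import Literal

Channel = Literal["web", "whatsapp", "voice"]

# Orchestrator names come from the channel table; the mapping itself is illustrative.
ORCHESTRATORS = {
    "web": "RAGService",
    "voice": "VoiceLLMOrchestrator",
}


@dataclass
class QueryRequest:
    query: str
    channel: Channel


def select_orchestrator(request: QueryRequest) -> str:
    try:
        return ORCHESTRATORS[request.channel]
    except KeyError:
        # Unmapped channels fail loudly instead of silently taking the web path.
        raise ValueError(f"no orchestrator mapped for channel {request.channel!r}") from None
```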

The voice and web channels share:

the same retrieval pipeline (vector + BM25 + taxonomy)
the same Value Framework affinity rerank (Stage 5b)
the same safety / disclaimer policy (the medical-advice deny-by-default barrier sits before retrieval on both channels)
the same evaluation surface (app.pipeline_telemetry, app.category_mismatch_telemetry)

What differs is the answer-shaping layer (backend/app/services/voice/voice_answer_shaper.py), which strips Markdown, fits the response within a target token window for TTS latency, and routes citation context through a separate metadata field instead of inline [N] markers (backend/app/services/voice/voice_faq_tool.py).
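
A rough picture of answer shaping, assuming simple regular expressions are enough: inline `[N]` markers are collected into a separate list, Markdown punctuation is dropped, and the text is cut to a sentence cap that stands in for the target token window. The function name and patterns are illustrative and are not taken from `voice_answer_shaper.py`.

```python
import re

CITATION = re.compile(r"\s*\[(\d+)\]")
MARKDOWN = re.compile(r"[*_`#>]+")  # crude: removes emphasis, code, heading and quote markers
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def shape_for_voice(answer: str, max_sentences: int = 2) -> tuple[str, list[int]]:
    """Return (spoken_text, citation_numbers) for a Markdown answer."""
    # Keep citation numbers as metadata instead of speaking them.
    citations = sorted({int(n) for n in CITATION.findall(answer)})
    text = CITATION.sub("", answer)
    text = MARKDOWN.sub("", text)
    text = " ".join(text.split())
    sentences = SENTENCE_END.split(text)
    return " ".join(sentences[:max_sentences]), citations
```

A production shaper would need to handle links, lists, and abbreviations that contain periods; this sketch ignores all three.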

Multi-Tenancy

The system supports per-hospital configuration via DB-driven settings. All hospital-specific behaviour — crawl rules, boilerplate selectors, URL patterns, LLM prompt identity — is stored in PostgreSQL and loaded at runtime. No code changes are required to add a new hospital.

Two configuration planes coexist:

Web / RAG plane: app.site_crawl_configs (crawl behaviour) + app.golden_pages (high-value extraction targets) + PromptContext (parameterized prompt identity, injected per-request from the hospital's YAML config).
Voice plane: backend/app/services/voice/tenant_overlays/_yaml/<slug>.yaml — per-tenant FAQ entries, STT phonetic-recovery rules ("afwrak" → "after-care"), and DB-driven answer renderers loaded via get_overlay(slug).

All taxonomy tables, documents, and Redis keys are scoped by tenant_id. See Multi-Tenancy & Hospital-Agnostic Architecture for the full design, onboarding flow, and auto-linker details.

Observability

Structured Logging

The backend uses structlog for structured logging with environment-aware output:

Environment	Format	Purpose
Development	Colored console with key-value pairs	Developer readability
Production	JSON lines (one object per log entry)	Machine-parseable for log aggregation

All log entries include request correlation IDs, timestamps, and structured metadata. The observability middleware logs request/response pairs with timing information.

Health Endpoints

Endpoint	Auth	Purpose
`GET /health`	Public	Basic liveness check (database, Redis, MinIO)
`GET /health/ready`	Public	Deep readiness check including LLM circuit breaker state, PostgreSQL, Redis, and MinIO connectivity

The /health/ready endpoint is designed for orchestrator readiness probes. It reports the LLM circuit breaker state (closed, open, half-open), enabling automated detection of LLM API outages without requiring authenticated access.
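
The three breaker states and their place in a readiness payload can be sketched as below. The thresholds, the field names in the returned dictionary, and the choice to report not-ready while the breaker is open are assumptions made for illustration; the actual response shape of `/health/ready` is not specified on this page.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Hypothetical breaker exposing closed / open / half-open states."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_s
        self._clock = clock  # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_after:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None


def readiness(breaker: CircuitBreaker, dependencies: dict[str, bool]) -> dict:
    # Assumption: an open breaker marks the service as not ready.
    ready = all(dependencies.values()) and breaker.state != "open"
    return {"ready": ready, "llm_circuit_breaker": breaker.state, "dependencies": dependencies}
```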

Graceful Shutdown

The application implements graceful shutdown with request draining:

On SIGTERM, a shutdown flag is set
New requests receive HTTP 503 (Service Unavailable) with a Retry-After header
In-flight requests are allowed to complete within the configured timeout
Health endpoints (/health, /health/ready) remain available during shutdown for orchestrator polling

Configure via uvicorn's --timeout-graceful-shutdown flag (recommended: 30 seconds).
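
The draining behaviour can be sketched as a plain ASGI middleware, with no framework dependency. The class name, the `shutting_down` attribute, and the five-second `Retry-After` value are assumptions. How the flag is set on SIGTERM (for example from a shutdown hook) is left out.

```python
HEALTH_PATHS = {"/health", "/health/ready"}


class DrainMiddleware:
    """Hypothetical ASGI middleware that rejects new requests while draining."""

    def __init__(self, app, retry_after_s: int = 5):
        self.app = app
        self.shutting_down = False  # assumed to be flipped by a SIGTERM or shutdown hook
        self._retry_after = str(retry_after_s).encode("ascii")

    async def __call__(self, scope, receive, send):
        reject = (
            scope["type"] == "http"
            and self.shutting_down
            and scope["path"] not in HEALTH_PATHS  # keep probes answering during drain
        )
        if not reject:
            await self.app(scope, receive, send)
            return
        await send({
            "type": "http.response.start",
            "status": 503,
            "headers": [(b"retry-after", self._retry_after),
                        (b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"shutting down"})
```

Requests that entered the wrapped application before the flag flipped are unaffected, so they complete within whatever timeout the server allows.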

End-to-End Latency Budgets

End-to-end latency depends on cache state and channel. The figures below apply to the web channel; voice-channel time to first audio and turn latency are reported in Voice Architecture once measured.

Stage	Web — measured p50	Web — measured p95	Source
Cache hit (Tier 1 — SHA-256)	~1 ms	~3 ms	`query_cache_service.py`; verifiable via `app.pipeline_telemetry`
Cache hit (Tier 2 — pgvector HNSW)	~30 ms	not yet measured	pgvector HNSW operator characteristics; verifiable in pilot
Intent classification (`structured_call` helper)	~2.4 s p50 (LLM-dominated)	not yet measured at p95	`pipeline_telemetry.intent_classification`; post pydantic-ai removal latency dropped ~720 ms per call
Hybrid retrieval (vector + BM25 + taxonomy, sequential)	not yet measured	not yet measured	sequential-execution constraint per asyncpg single-session policy
Context assembly (incl. ±1 chunk expansion + dedup)	~50 ms	not yet measured	`context_assembly_service.py`
Response generation (Tier 2 streaming)	~3 s	not yet measured	dominant cost on cache miss; mitigated by streaming
Fast quality gate	~600 ms	not yet measured	embedding-similarity blend; runs before stream close
Background analytics (DeepEval)	40–60 s	n/a — non-blocking	runs after response delivered

Verification path: each stage emits pipeline_telemetry.duration_ms per conversation_id. The Operations dashboard renders the percentile_cont(0.95) over the last period_days window via GET /api/v1/admin/feedback/telemetry-stats (see Feedback Dashboard Metrics). Wave 2.D leaves the missing stage-level p95 numbers as not yet measured rather than estimating; treat the 5.5-second blocking total in Query Pipeline §Timing Breakdown as a working estimate, not a contract.
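
For readers who want to sanity-check a dashboard figure by hand, the interpolation that PostgreSQL's `percentile_cont` performs can be reproduced in a few lines. The function below is an illustrative re-implementation and the sample durations are made up; it is not code from the telemetry service.

```python
def percentile_cont(values: list[float], fraction: float) -> float:
    """Continuous percentile with linear interpolation between neighbours."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])
```

With five samples of 100, 200, 300, 400, and 500 ms, the p95 position is 0.95 × 4 = 3.8, so the result lies 80% of the way from 400 to 500, giving 480 ms. With so few samples the p95 is driven almost entirely by the slowest request, which is one reason to leave a stage unmeasured until enough telemetry exists.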

GDPR Compliance Endpoints

Endpoint	Method	Purpose
`DELETE /api/v1/gdpr/users/{id}/data`	DELETE	GDPR Art. 17 right-to-erasure: deletes all user data (conversations, messages, analytics, audit logs)

The GDPR deletion endpoint requires admin authentication and returns a summary of deleted records across all data categories.

Prompt Versioning

All LLM prompts are versioned via a PROMPT_VERSION constant (e.g., 2026.04.1). This version is logged with every evaluation result, enabling correlation between prompt changes and quality regressions.

Architectural Changes Since 2026-04-09

The previous last_verified snapshot of this page predates several material architectural changes; the table below summarises them so a reader of an older PDF / cached copy knows what's new.

Date	Change	Reference
2026-04-30	Embedding model migrated BGE-M3 → OpenAI `text-embedding-3-large` (1024-dim → 1536-dim, Ollama → OpenAI)	ADR-0048
2026-05-02	Legacy 8-stage agentic voice orchestrator removed; thin pipeline (regex pre-filter → FAQ → RAG) becomes production	ADR-0049 / thin-voice-architecture; commit `158d793`
2026-05-02	Neo4j Graph Data Science removed; entity relationships consolidated into PostgreSQL taxonomy tables	ADR-0053
2026-05-09 → 2026-05-12	Pydantic AI adopted (8 call sites) then removed after telemetry showed +720 ms/call; replaced with `structured_call` thin helper (~190 LOC)	Decision-Cost Rubric case study; commit `b8d8da67`
2026-05-22	LLM-first agentic voice pipeline accepted: native OpenAI streaming-with-tools, single call per tool-decision iteration	ADR-0053-llm-first-agentic-voice; commit `15c596b5`
2026-05-23	Voice quality refit: Rule 4.5 (no repeated clarifications) + temperature 0.3 → 0.0 + 80-term STT phonetic-recovery sweep + Rule 6.5 (procedure explanations) + tier-1 session rate limit + voice ops infrastructure (trace/replay/SLO)	Voice — Architecture; commits `7900b5e0`, `2c514cc1`, `0242202e`, `3bda7f00`
2026-05-09	Value Framework affinity rerank productised at Stage 5b; `app.category_mismatch_telemetry` (migration 066) and `app.diagnostic_feedback` (migration 067) added	Query Pipeline §Stage 5b
2026-05-04	Voice tenant-overlay system (`backend/app/services/voice/tenant_overlays/`) shipped; per-tenant FAQ + STT recovery + answer renderers	Multi-Tenancy
2026-04-22	Nightly auto-ingest live on pilot (`INGEST_MODE=auto`, daily 03:00 UTC); `IngestRun` audit records (migration 061)	Document Ingestion §Scheduled Ingestion
