System Architecture
This page describes the architecture of an already-functional RAG system and assumes familiarity with the fundamentals. If you are new to the system, start with Core Concepts, and keep the Glossary open for any unfamiliar term.
The ZOL Intelligent Search system implements retrieval-augmented generation (Lewis et al., 2020) over hospital corpora, with two production input channels, a web chat interface and a voice (telephony) interface, that share the same retrieval backend. The architecture is layered: each layer has a distinct responsibility and communicates with adjacent layers through well-defined interfaces. This separation makes substitutions tractable; when we migrated the embedding model from BGE-M3 to OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings), the change touched a single layer (Embedding Generation) without cascading through retrieval, generation, or evaluation.
Architectural Trade-offs
Three foundational decisions define the system shape; each is captured in an ADR with the alternatives that were considered and rejected.
| Decision | Chosen | Alternatives considered | Rejected because |
|---|---|---|---|
| Vector store | PostgreSQL + pgvector (@pgvector_docs) | Dedicated vector DB (Pinecone, Weaviate, Qdrant); FAISS in-process (@johnson2017faiss) | Pinecone/Weaviate add a second operational system, separate access-control plane, and synchronisation surface against the relational tenant + taxonomy data; FAISS lacks ACID transactions and persistent storage at our 10K-chunk scale. pgvector keeps embeddings, taxonomy, and tenant metadata in one ACID database with one backup story. |
| Knowledge representation | PostgreSQL taxonomy tables (ADR-0053; supersedes Neo4j adoption) | Neo4j Graph Data Science (@neo4j_gds_manual); embed graph structure into chunk text | Neo4j added an operational service for queries that were already expressible as SQL joins over typed entity tables, and the GDS algorithms (PageRank, community detection) were never load-bearing in retrieval. Embedding the graph into chunks lost the structured-lookup property used by the doctor / department / specialty paths. |
| LLM call boundary | structured_call thin helper (~190 LOC) over the raw OpenAI client | Pydantic AI Agent[None, OutputModel] (@pydantic_ai_docs); raw json.loads; LangChain wrappers | Pydantic AI was adopted 2026-05-09 across 8 call sites then removed 2026-05-12 (commit b8d8da67) after production telemetry showed it added ~720 ms per call. The thin helper preserves schema validation + retries while paying zero latency tax. See the Decision-Cost Rubric case study for the full post-mortem; this is the load-bearing argument behind the v2.3 Brainstorm Gate. |
Further per-decision rationale lives in each ADR; this page summarises them so a reader can orient before reading the deeper component pages.
The layered design ensures that changes in one layer — such as swapping an embedding model or adding a new safety filter — do not cascade through the entire system.
Architecture Layers
OAI is the OpenAI direct API endpoint used for embeddings and LLM generation since the migration in ADR-0048. The previous OR box (OpenRouter) is intentionally absent: OpenRouter remains a configurable override (rag_llm_provider=openrouter) but is not the default path; see LLM Stack for the legacy / current routing.
The diagram above shows the web chat channel only. The voice channel attaches at the same QP (Query Pipeline) node via VoiceLLMOrchestrator → RAGService.query_stream(channel="voice"), but adds upstream stages — Twilio SIP gateway → LiveKit room → voice_agent worker (Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS) → WebSocket /ws/public-query — and a downstream answer-shaping layer that strips inline [N] citation markers and enforces ≤2 sentences. The voice-specific sequence diagram lives in Voice Architecture.
Layer Responsibilities
Presentation Layer
The frontend is a React + TypeScript single-page application that provides a conversational chat interface. It communicates with the backend via both REST API (authentication, document management) and WebSocket (query streaming). The WebSocket connection enables real-time progress updates as the query traverses the pipeline, reporting stages such as "Understanding your question...", "Searching knowledge base...", and "Generating response...".
API Layer
FastAPI serves as the API gateway, chosen for its native async support, automatic OpenAPI documentation, and Pydantic-based request validation. The API layer enforces security before any business logic executes:
| Middleware | Purpose | Configuration |
|---|---|---|
| Keycloak OIDC | Identity verification via external IdP | JWT tokens, Keycloak realm |
| CSRF Protection | Cross-site request forgery prevention | starlette-csrf |
| Rate Limiting | Abuse prevention | 10/min login, 30/min queries |
| CORS | Cross-origin security | Configured per environment |
| Observability | Structured request/response logging | structlog (JSON in production, colored console in dev) |
| Graceful Shutdown | Request draining on SIGTERM | Returns 503 for new requests during shutdown |
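
The per-minute limits in the table can be pictured as a sliding-window counter. The sketch below is a minimal in-memory illustration, not the middleware the backend uses; the class name `SlidingWindowLimiter`, the route labels, and the injectable clock are all assumptions made for the example, and a multi-worker deployment would keep the counters in Redis rather than in process memory.

```python
import time
from collections import defaultdict, deque

# Limits from the middleware table: requests allowed per 60-second window.
LIMITS = {"login": 10, "query": 30}
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    """In-memory sliding-window limiter (hypothetical; illustration only)."""

    def __init__(self, limits=LIMITS, window=WINDOW_SECONDS, clock=time.monotonic):
        self._limits = limits
        self._window = window
        self._clock = clock
        self._hits = defaultdict(deque)  # (client, route) -> request timestamps

    def allow(self, client_id: str, route: str) -> bool:
        now = self._clock()
        hits = self._hits[(client_id, route)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limits[route]:
            return False
        hits.append(now)
        return True
```

Passing the clock in as a parameter keeps the limiter deterministic under test; a caller that gets `False` back would answer with HTTP 429.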
Security Layer
The security layer implements the safety-first principle that permeates this architecture. Every query passes through intent classification before any retrieval or generation occurs. Queries classified as out_of_scope_medical_advice are blocked immediately, never reaching the retrieval pipeline. Post-generation safety validation provides a second check, and the quality gate ensures response confidence meets minimum thresholds.
The security layer operates on a deny-by-default model. A query must actively pass through each safety checkpoint to receive a response. Any single layer can halt the pipeline and return a safe fallback message.
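The deny-by-default flow can be sketched as a chain of checkpoints in which anything other than an explicit pass halts the pipeline. This is an illustrative sketch only: `Verdict`, `run_pipeline`, the fallback text, and the keyword heuristic standing in for the LLM intent classifier are assumptions, and treating a checkpoint that raises as a denial is one reading of "must actively pass".

```python
from dataclasses import dataclass
from typing import Callable, Optional

SAFE_FALLBACK = "I can't help with that request. Please contact the hospital directly."


@dataclass
class Verdict:
    passed: bool
    reason: str = ""


# A checkpoint inspects the query and, once one exists, the draft answer.
Checkpoint = Callable[[str, Optional[str]], Verdict]


def intent_gate(query: str, answer: Optional[str]) -> Verdict:
    # Stand-in for the LLM intent classifier: a keyword heuristic, illustration only.
    if "dosage" in query.lower():
        return Verdict(False, "out_of_scope_medical_advice")
    return Verdict(True)


def _passes(check: Checkpoint, query: str, answer: Optional[str]) -> bool:
    try:
        return check(query, answer).passed is True
    except Exception:
        return False  # a checkpoint that errors counts as a denial


def run_pipeline(query, pre_checks, generate, post_checks) -> str:
    for check in pre_checks:  # runs before any retrieval or generation
        if not _passes(check, query, None):
            return SAFE_FALLBACK
    answer = generate(query)
    for check in post_checks:  # post-generation safety and quality gate
        if not _passes(check, query, answer):
            return SAFE_FALLBACK
    return answer
```

Because `generate` is only reached after every pre-check has passed, a blocked query never touches retrieval, which mirrors the ordering described above.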
Service Layer
The service layer contains the core intelligence of the system. The Query Pipeline Orchestrator coordinates the sequence of operations following the retrieve-then-generate paradigm established by Lewis et al. (2020): query rewriting for conversational context; vector, BM25, and taxonomy retrieval, run sequentially because of the asyncpg single-session constraint; a Value Framework affinity rerank to prevent cross-category contamination; response generation with source grounding; and hybrid quality evaluation. Each service is independently testable and replaceable.
Value Framework intent-category affinity rerank (backend/app/services/value_framework/affinity.py) executes between retrieval and context assembly on every non-cached query (Stage 5b). It multiplies each chunk's relevance score by an intent_class × content_category affinity coefficient, demoting chunks whose tagged category is mismatched with the classified intent. The mechanism is hospital-agnostic — it reads chunk text + intent class only, never tenant-specific facts — and per-turn outputs are written to app.category_mismatch_telemetry (migration 066) for the Operations dashboard. See Query Pipeline §Stage 5b for the algorithm and rationale (the wheelchair-vs-cardiology cross-contamination regression).
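The multiplication step can be sketched as follows. The coefficient values, the intent and category labels, and the neutral default of 1.0 for unlisted pairs are assumptions made for illustration; the real table and algorithm live in affinity.py and Query Pipeline §Stage 5b.

```python
from dataclasses import dataclass, replace

# Hypothetical coefficients: (intent_class, content_category) -> multiplier.
AFFINITY = {
    ("find_doctor", "cardiology"): 1.0,
    ("find_doctor", "accessibility"): 0.4,
    ("practical_info", "accessibility"): 1.0,
    ("practical_info", "cardiology"): 0.5,
}
DEFAULT_AFFINITY = 1.0  # assumption: unknown pairs keep their original score


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    category: str
    score: float


def affinity_rerank(chunks, intent_class):
    """Scale each score by its affinity coefficient, then sort best-first."""
    rescored = [
        replace(c, score=c.score * AFFINITY.get((intent_class, c.category), DEFAULT_AFFINITY))
        for c in chunks
    ]
    return sorted(rescored, key=lambda c: c.score, reverse=True)
```

With these example numbers, a wheelchair-access chunk that retrieved at 0.9 drops to 0.36 for a `find_doctor` intent and falls behind a cardiology chunk at 0.8, which is the kind of demotion the rerank is meant to produce.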
structured_call thin helper as the LLM-call boundary: 8 LLM call sites — intent classification, query decomposition, feedback investigation, feedback digest, adversarial eval, diagnostic runner, voice turn evaluator, conversation classifier — invoke the model through a 190-LOC structured_call helper over the raw AsyncOpenAI client. The helper enforces JSON-schema validation on responses, retries on validation failure, and raises a typed fallback exception only when the model still cannot satisfy the schema. This shape replaced Pydantic AI on 2026-05-12 (commit b8d8da67) after production telemetry showed Pydantic AI's Agent.run() added ~720 ms per call; the thin helper preserves the validation contract while paying zero latency tax. The removal incident is canonised in Decision-Cost Rubric as the load-bearing case study behind the methodology v2.3 Brainstorm Gate.
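The validate-retry-raise contract described above might look roughly like the sketch below. It is not the project's helper: the signature, the `StructuredCallError` name, and the hand-written `validate_intent` check are assumptions, and the model call is passed in as a plain async callable so the example runs without the OpenAI client.

```python
import asyncio
import json
from typing import Awaitable, Callable


class StructuredCallError(Exception):
    """Typed fallback raised when no attempt produced schema-valid JSON."""


async def structured_call(
    complete: Callable[[str], Awaitable[str]],  # stand-in for the AsyncOpenAI call
    prompt: str,
    validate: Callable[[dict], dict],  # raises ValueError on schema mismatch
    max_attempts: int = 3,
) -> dict:
    last_error = None
    for _ in range(max_attempts):
        raw = await complete(prompt)
        try:
            return validate(json.loads(raw))
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    raise StructuredCallError(f"no valid response after {max_attempts} attempts") from last_error


def validate_intent(payload: dict) -> dict:
    # Hypothetical schema: an object carrying a string "intent" field.
    if not isinstance(payload, dict) or not isinstance(payload.get("intent"), str):
        raise ValueError("expected an object with a string 'intent' field")
    return payload
```

The loop adds no work on the happy path beyond one `json.loads` and one validation call, which is consistent with the low-overhead goal stated above.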
Data Layer
Six specialized data stores serve distinct purposes:
| Store | Technology | Purpose |
|---|---|---|
| Vector Store | PostgreSQL + pgvector | Semantic similarity search over document chunks |
| Taxonomy | PostgreSQL | Structured entity relationships and graph queries |
| Semantic Query Cache | PostgreSQL + pgvector | Two-tier query result cache (hash + embedding similarity), see ADR-0031 |
| Intent Classification Cache | Memory (LRU) or Redis | Per-(tenant, query, language) cache of intent classification results — skips the ~2,300 ms LLM call on repeat queries. Backend is runtime-selectable via INTENT_CACHE_BACKEND. See ADR-0054 |
| Sessions / Rate-limit | Redis | Session management, rate limiting, token blacklist |
| Object Storage | MinIO | Raw document storage (markdown, HTML) |
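
The two-tier lookup of the Semantic Query Cache (exact hash first, embedding similarity second) can be illustrated in memory. This sketch swaps the pgvector HNSW search for a brute-force cosine scan; the 0.95 threshold, the query normalisation, the tenant-scoped key, and the class name are assumptions, not values taken from ADR-0031.

```python
import hashlib
import math


def _normalise(query: str) -> str:
    return " ".join(query.lower().split())


def _cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, embed, threshold: float = 0.95):
        self._embed = embed          # callable: str -> list[float]
        self._threshold = threshold  # hypothetical similarity cut-off
        self._exact = {}             # tier 1: sha256(tenant, query) -> answer
        self._semantic = []          # tier 2: (tenant_id, embedding, answer)

    @staticmethod
    def _key(tenant_id: str, query: str) -> str:
        return hashlib.sha256(f"{tenant_id}\x00{_normalise(query)}".encode()).hexdigest()

    def put(self, tenant_id: str, query: str, answer: str) -> None:
        self._exact[self._key(tenant_id, query)] = answer
        self._semantic.append((tenant_id, self._embed(query), answer))

    def get(self, tenant_id: str, query: str):
        hit = self._exact.get(self._key(tenant_id, query))
        if hit is not None:
            return hit, "tier1"
        vec = self._embed(query)  # only paid for when the hash lookup misses
        candidates = [(_cosine(vec, emb), ans) for t, emb, ans in self._semantic if t == tenant_id]
        score, answer = max(candidates, key=lambda pair: pair[0], default=(0.0, None))
        return (answer, "tier2") if score >= self._threshold else (None, "miss")
```

Checking the hash first means a repeated query never pays for an embedding call; only a tier-1 miss proceeds to the similarity search.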
External Services
The system integrates a single external LLM provider:
- OpenAI provides the LLM models and embeddings directly. All API calls use the OpenAI direct endpoint (Tier 2 for intent classification, entity extraction, evaluation, and canonical questions; Tier 2 or Tier 3 for generation in full mode). Embeddings always use OpenAI text-embedding-3-large per ADR-0048, @openai2024embeddings.
The LLM_FALLBACK_CHAIN setting in backend/.env.example lists Ollama as a final-tier emergency fallback, and LLMProviderFactory._create_ollama_client exists in code. In practice Ollama is not deployed on pilot and the fallback path has never been validated end-to-end — only mock unit tests exist. The path should be treated as vestigial until either deployed and tested, or removed.
Technology Rationale
The technology choices reflect three guiding principles:
- Right tool for the right job -- pgvector for vectors (@pgvector_docs), PostgreSQL taxonomy tables for entity relationships, Redis for ephemeral state
- Cloud-first -- Embeddings and LLM calls use OpenAI direct API (single LLM provider in production; see the note in External Services for the unused Ollama fallback configuration)
- Standards-based -- FastAPI (OpenAPI), PostgreSQL (SQL standard), WebSocket (RFC 6455) ensure interoperability and long-term maintainability
Deployment Architecture
All infrastructure components run as Docker containers orchestrated via Docker Compose, enabling reproducible local development and straightforward deployment. PostgreSQL serves double duty as both the vector store (pgvector) and the entity taxonomy store. Keycloak provides OIDC-based authentication with JWT tokens. The frontend and backend are the only components that run outside of Docker during development, with hot-reloading enabled for rapid iteration.
Channels — Web and Voice
The system serves two production input channels that share the retrieval backend but diverge at the LLM call:
| Channel | Entry path | Orchestrator | TTS / response shape |
|---|---|---|---|
| Web (chat) | HTTPS / WebSocket from React UI | RAGService (backend/app/services/rag_service.py) | Streamed Markdown + citation list |
| Voice (telephony) | Twilio Elastic SIP → LiveKit SIP → LiveKit Agents | VoiceLLMOrchestrator (backend/app/services/voice/voice_llm_orchestrator.py) | ElevenLabs Multilingual v2 TTS @elevenlabs_multilingual_v2; answer-shaped for spoken delivery (no inline [N] citation markers) |
Routing happens at request time on the channel field of QueryRequest (backend/app/models/schemas.py:216, typed as Literal["web", "whatsapp", "voice"]). The voice path uses a deliberately thin pipeline (regex pre-filter → FAQ tool → RAG fallback) per ADR-0049 / thin-voice-architecture; the previously attempted 8-stage agentic orchestrator was removed on 2026-05-02 (commit 158d793) after the cache-hit rate failed to justify its complexity. Speech-to-text uses Deepgram Nova-3 @deepgram_nova3 with the language locked at first utterance per ADR-0052 / voice-language-locking.
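The thin voice pipeline's three steps can be sketched as a short fall-through. Everything concrete here is assumed for illustration: the pre-filter patterns, the `voice_turn` and `route` names, and the convention of returning the name of the stage that answered. The FAQ tool and RAG service are passed in as plain callables.

```python
import re
from typing import Callable, Literal, Optional, Tuple

Channel = Literal["web", "whatsapp", "voice"]

# Hypothetical pre-filter: utterances that can be answered without retrieval.
PREFILTER = [
    (re.compile(r"^\s*(hello|hi|good (morning|afternoon))\b", re.I), "Hello, how can I help you?"),
    (re.compile(r"\b(bye|goodbye)\b", re.I), "Goodbye."),
]


def voice_turn(
    utterance: str,
    faq_lookup: Callable[[str], Optional[str]],
    rag_query: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (answer, stage), where stage names the step that answered."""
    for pattern, reply in PREFILTER:
        if pattern.search(utterance):
            return reply, "prefilter"
    faq_answer = faq_lookup(utterance)
    if faq_answer is not None:
        return faq_answer, "faq"
    return rag_query(utterance), "rag"  # full retrieval only as the last resort


def route(channel: Channel, text: str, faq_lookup, rag_query) -> Tuple[str, str]:
    if channel == "voice":
        return voice_turn(text, faq_lookup, rag_query)
    return rag_query(text), "rag"
```

Ordering the steps from cheapest to most expensive keeps most spoken turns off the LLM path, which matters more on a phone call than in chat.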
The voice and web channels share:
- the same retrieval pipeline (vector + BM25 + taxonomy)
- the same Value Framework affinity rerank (Stage 5b)
- the same safety / disclaimer policy (the medical-advice deny-by-default barrier sits before retrieval on both channels)
- the same evaluation surface (app.pipeline_telemetry, app.category_mismatch_telemetry)
What differs is the answer shaping layer (backend/app/services/voice/voice_answer_shaper.py) which strips Markdown, fits the response within a target token window for TTS latency, and routes citation context through a separate metadata field instead of inline [N] markers (backend/app/services/voice/voice_faq_tool.py).
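A rough sketch of the shaping step is shown below, assuming a simple regex approach. The function name, the regexes, and the two-sentence cap are illustrative choices, and the cap stands in for the token-window fitting that voice_answer_shaper.py performs. Citation numbers are returned separately instead of being spoken.

```python
import re

_CITATION = re.compile(r"\s*\[\d+\]")
_MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
_MD_MARKUP = re.compile(r"(\*\*|__|\*|`|^#+\s*|^\s*[-*]\s+)", re.M)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def shape_for_voice(markdown_answer: str, max_sentences: int = 2):
    """Return (spoken_text, citation_numbers) for a Markdown answer."""
    citations = sorted({int(n) for n in re.findall(r"\[(\d+)\]", markdown_answer)})
    text = _MD_LINK.sub(r"\1", markdown_answer)  # keep link text, drop the URL
    text = _CITATION.sub("", text)               # remove inline [N] markers
    text = _MD_MARKUP.sub("", text)              # strip emphasis, headings, bullets
    text = " ".join(text.split())
    sentences = _SENTENCE_END.split(text)
    return " ".join(sentences[:max_sentences]), citations
```

For example, `shape_for_voice("**Visiting hours** are 14:00 to 20:00 [1]. Parking is behind building B [2]. See the map.")` yields the first two sentences without markup, plus `[1, 2]` as metadata.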
Multi-Tenancy
The system supports per-hospital configuration via DB-driven settings. All hospital-specific behaviour — crawl rules, boilerplate selectors, URL patterns, LLM prompt identity — is stored in PostgreSQL and loaded at runtime. No code changes are required to add a new hospital.
Two configuration planes coexist:
- Web / RAG plane: app.site_crawl_configs (crawl behaviour) + app.golden_pages (high-value extraction targets) + PromptContext (parameterized prompt identity, injected per-request from the hospital's YAML config).
- Voice plane: backend/app/services/voice/tenant_overlays/_yaml/<slug>.yaml, holding per-tenant FAQ entries, STT phonetic-recovery rules ("afwrak" → "after-care"), and DB-driven answer renderers loaded via get_overlay(slug).
All taxonomy tables, documents, and Redis keys are scoped by tenant_id. See Multi-Tenancy & Hospital-Agnostic Architecture for the full design, onboarding flow, and auto-linker details.
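How a per-tenant overlay might apply its STT phonetic-recovery rules is sketched below. The overlay schema, the in-memory `_OVERLAYS` registry standing in for the parsed YAML files, the `zol` slug, and this simplified `get_overlay` are all assumptions made for the example; only the "afwrak" → "after-care" rule comes from the text above.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TenantOverlay:
    slug: str
    phonetic_recovery: dict = field(default_factory=dict)  # misheard -> intended
    faq: dict = field(default_factory=dict)


# Stand-in for the parsed <slug>.yaml files; the schema is an assumption.
_OVERLAYS = {
    "zol": TenantOverlay(slug="zol", phonetic_recovery={"afwrak": "after-care"}),
}


def get_overlay(slug: str) -> TenantOverlay:
    # Unknown tenants get an empty overlay, never another tenant's rules.
    return _OVERLAYS.get(slug, TenantOverlay(slug=slug))


def recover_transcript(slug: str, transcript: str) -> str:
    """Replace known mis-transcriptions, matching whole words case-insensitively."""
    for misheard, intended in get_overlay(slug).phonetic_recovery.items():
        pattern = rf"\b{re.escape(misheard)}\b"
        transcript = re.sub(pattern, lambda _m, r=intended: r, transcript, flags=re.I)
    return transcript
```

Keeping the rules in tenant data rather than in code is what lets a new hospital add its own vocabulary without a deployment.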
Observability
Structured Logging
The backend uses structlog for structured logging with environment-aware output:
| Environment | Format | Purpose |
|---|---|---|
| Development | Colored console with key-value pairs | Developer readability |
| Production | JSON lines (one object per log entry) | Machine-parseable for log aggregation |
All log entries include request correlation IDs, timestamps, and structured metadata. The observability middleware logs request/response pairs with timing information.
Health Endpoints
| Endpoint | Auth | Purpose |
|---|---|---|
| GET /health | Public | Basic liveness check (database, Redis, MinIO) |
| GET /health/ready | Public | Deep readiness check including LLM circuit breaker state, PostgreSQL, Redis, and MinIO connectivity |
The /health/ready endpoint is designed for orchestrator readiness probes. It reports the LLM circuit breaker state (closed, open, half-open), enabling automated detection of LLM API outages without requiring authenticated access.
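The three breaker states and their effect on readiness can be sketched as follows. The failure threshold, the cool-down period, the class and function names, and the choice to report half-open as still ready are assumptions for illustration, not the backend's actual policy.

```python
import time


class CircuitBreaker:
    """Three-state breaker; thresholds are hypothetical."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_after:
            return "half-open"  # cool-down elapsed: a trial call may go through
        return "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self.state == "half-open" or self._failures >= self._threshold:
            self._opened_at = self._clock()  # (re)open and restart the cool-down


def readiness(breaker: CircuitBreaker, dependencies: dict) -> dict:
    """Combine dependency checks with the breaker state into one payload."""
    checks = dict(dependencies, llm_circuit_breaker=breaker.state)
    ready = all(dependencies.values()) and breaker.state != "open"
    return {"ready": ready, "checks": checks}
```

Exposing the state string in the payload lets a probe distinguish an LLM outage from a database outage without credentials.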
Graceful Shutdown
The application implements graceful shutdown with request draining:
- On SIGTERM, a shutdown flag is set
- New requests receive HTTP 503 (Service Unavailable) with a Retry-After header
- In-flight requests are allowed to complete within the configured timeout
- Health endpoints (/health, /health/ready) remain available during shutdown for orchestrator polling
Configure via uvicorn's --timeout-graceful-shutdown flag (recommended: 30 seconds).
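The request-draining behaviour can be sketched as a small pure-ASGI middleware. The class name `ShutdownGate`, the `begin_shutdown` hook, and the 30-second Retry-After value are assumptions; how the flag is flipped when SIGTERM arrives depends on the server setup and is left out here.

```python
class ShutdownGate:
    """Pure-ASGI middleware sketch: reject new HTTP requests once draining starts."""

    HEALTH_PATHS = {"/health", "/health/ready"}

    def __init__(self, app, retry_after_seconds: int = 30):
        self.app = app
        self.retry_after = str(retry_after_seconds).encode()
        self.shutting_down = False  # set to True once SIGTERM has been received

    def begin_shutdown(self, *_args) -> None:
        self.shutting_down = True

    async def __call__(self, scope, receive, send):
        rejecting = (
            self.shutting_down
            and scope["type"] == "http"
            and scope["path"] not in self.HEALTH_PATHS
        )
        if not rejecting:
            # Normal traffic, health probes, and non-HTTP scopes pass through.
            await self.app(scope, receive, send)
            return
        await send({
            "type": "http.response.start",
            "status": 503,
            "headers": [(b"retry-after", self.retry_after), (b"content-type", b"text/plain")],
        })
        await send({"type": "http.response.body", "body": b"Service shutting down"})
```

Requests already inside `self.app` when the flag flips are unaffected, so in-flight work drains naturally while new arrivals are turned away.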
End-to-End Latency Budgets
End-to-end latency depends on cache state and channel. The web channel uses the figures below; voice-channel time-to-first-audio and turn latency will be reported in Voice Architecture once measured.
| Stage | Web — measured p50 | Web — measured p95 | Source |
|---|---|---|---|
| Cache hit (Tier 1 — SHA-256) | ~1 ms | ~3 ms | query_cache_service.py; verifiable via app.pipeline_telemetry |
| Cache hit (Tier 2 — pgvector HNSW) | ~30 ms | not yet measured | pgvector HNSW operator characteristics; verifiable in pilot |
| Intent classification (structured_call helper) | ~2.4 s (LLM-dominated) | not yet measured | pipeline_telemetry.intent_classification; latency dropped ~720 ms per call after the pydantic-ai removal |
| Hybrid retrieval (vector + BM25 + taxonomy, sequential) | not yet measured | not yet measured | sequential-execution constraint per asyncpg single-session policy |
| Context assembly (incl. ±1 chunk expansion + dedup) | ~50 ms | not yet measured | context_assembly_service.py |
| Response generation (Tier 2 streaming) | ~3 s | not yet measured | dominant cost on cache miss; mitigated by streaming |
| Fast quality gate | ~600 ms | not yet measured | embedding-similarity blend; runs before stream close |
| Background analytics (DeepEval) | 40–60 s | n/a — non-blocking | runs after response delivered |
Verification path: each stage emits pipeline_telemetry.duration_ms per conversation_id. The Operations dashboard renders the percentile_cont(0.95) over the last period_days window via GET /api/v1/admin/feedback/telemetry-stats (see Feedback Dashboard Metrics). Wave 2.D leaves the missing stage-level p95 numbers as not yet measured rather than estimating; treat the 5.5-second blocking total in Query Pipeline §Timing Breakdown as a working estimate, not a contract.
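For readers checking the dashboard numbers by hand, the sketch below reproduces the linear-interpolation percentile that PostgreSQL's percentile_cont computes, applied per stage. The `(stage, duration_ms)` row shape and the function names are assumptions about how exported telemetry might look.

```python
import math


def percentile_cont(values, fraction: float) -> float:
    """Linear-interpolation percentile, as PostgreSQL's percentile_cont defines it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional rank
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def stage_percentiles(rows):
    """rows: iterable of (stage, duration_ms) -> {stage: {"p50": ..., "p95": ...}}"""
    by_stage = {}
    for stage, duration_ms in rows:
        by_stage.setdefault(stage, []).append(duration_ms)
    return {
        stage: {"p50": percentile_cont(d, 0.5), "p95": percentile_cont(d, 0.95)}
        for stage, d in by_stage.items()
    }
```

With five samples of 10, 20, 30, 40, and 50 ms, the p95 rank falls at position 3.8, so the result interpolates 80% of the way from 40 to 50 and lands at about 48 ms. With so few samples the p95 is dominated by the slowest calls, which is one reason to leave a stage as not yet measured until the window holds enough data.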
GDPR Compliance Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/gdpr/users/{id}/data | DELETE | GDPR Art. 17 right-to-erasure: deletes all user data (conversations, messages, analytics, audit logs) |
The GDPR deletion endpoint requires admin authentication and returns a summary of deleted records across all data categories.
Prompt Versioning
All LLM prompts are versioned via a PROMPT_VERSION constant (e.g., 2026.04.1). This version is logged with every evaluation result, enabling correlation between prompt changes and quality regressions.
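A minimal sketch of how logging the version with each result enables that correlation: group scores by prompt version and compare. The record shape, the function names, and the second version string used in the usage note are hypothetical.

```python
from collections import defaultdict
from statistics import mean

PROMPT_VERSION = "2026.04.1"  # example value from the text


def record_evaluation(log: list, conversation_id: str, score: float,
                      version: str = PROMPT_VERSION) -> None:
    # Every evaluation result carries the prompt version that produced it.
    log.append({"conversation_id": conversation_id, "score": score, "prompt_version": version})


def mean_score_by_version(log: list) -> dict:
    scores = defaultdict(list)
    for entry in log:
        scores[entry["prompt_version"]].append(entry["score"])
    return {version: mean(values) for version, values in scores.items()}
```

If results logged under a later version such as "2026.05.1" average noticeably lower than those under "2026.04.1", the prompt change is the first place to look.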
Architectural Changes Since 2026-04-09
The previous last_verified snapshot of this page predates several material architectural changes; the table below summarises them so a reader of an older PDF / cached copy knows what's new.
| Date | Change | Reference |
|---|---|---|
| 2026-04-30 | Embedding model migrated BGE-M3 → OpenAI text-embedding-3-large (1024-dim → 1536-dim, Ollama → OpenAI) | ADR-0048 |
| 2026-05-02 | Legacy 8-stage agentic voice orchestrator removed; thin pipeline (regex pre-filter → FAQ → RAG) becomes production | ADR-0049 / thin-voice-architecture; commit 158d793 |
| 2026-05-02 | Neo4j Graph Data Science removed; entity relationships consolidated into PostgreSQL taxonomy tables | ADR-0053 |
| 2026-05-09 → 2026-05-12 | Pydantic AI adopted (8 call sites) then removed after telemetry showed +720 ms/call; replaced with structured_call thin helper (~190 LOC) | Decision-Cost Rubric case study; commit b8d8da67 |
| 2026-05-22 | LLM-first agentic voice pipeline accepted: native OpenAI streaming-with-tools, single call per tool-decision iteration | ADR-0053-llm-first-agentic-voice; commit 15c596b5 |
| 2026-05-23 | Voice quality refit: Rule 4.5 (no repeated clarifications) + temperature 0.3 → 0.0 + 80-term STT phonetic-recovery sweep + Rule 6.5 (procedure explanations) + tier-1 session rate limit + voice ops infrastructure (trace/replay/SLO) | Voice — Architecture; commits 7900b5e0, 2c514cc1, 0242202e, 3bda7f00 |
| 2026-05-09 | Value Framework affinity rerank productised at Stage 5b; app.category_mismatch_telemetry (migration 066) and app.diagnostic_feedback (migration 067) added | Query Pipeline §Stage 5b |
| 2026-05-04 | Voice tenant-overlay system (backend/app/services/voice/tenant_overlays/) shipped; per-tenant FAQ + STT recovery + answer renderers | Multi-Tenancy |
| 2026-04-22 | Nightly auto-ingest live on pilot (INGEST_MODE=auto, daily 03:00 UTC); IngestRun audit records (migration 061) | Document Ingestion §Scheduled Ingestion |