System Architecture

Prerequisites

This page describes the architecture of an already-functional RAG system and assumes familiarity with the fundamentals. If you are new to the system, start with Core Concepts, and keep the Glossary open for any unfamiliar term.

The ZOL Intelligent Search system implements retrieval-augmented generation (Lewis et al. 2020) over hospital corpora, with two production input channels, a web chat interface and a voice (telephony) interface, that share the same retrieval backend. The architecture is layered: each layer has a distinct responsibility and communicates with adjacent layers through well-defined interfaces. This separation makes substitutions tractable; when we migrated the embedding model from BGE-M3 to OpenAI text-embedding-3-large (see ADR-0048, @openai2024embeddings), the change touched a single layer (Embedding Generation) without cascading through retrieval, generation, or evaluation.

Architectural Trade-offs

Three foundational decisions define the system shape; each is captured in an ADR with the alternatives that were considered and rejected.

| Decision | Chosen | Alternatives considered | Rejected because |
| --- | --- | --- | --- |
| Vector store | PostgreSQL + pgvector (@pgvector_docs) | Dedicated vector DB (Pinecone, Weaviate, Qdrant); FAISS in-process (@johnson2017faiss) | Pinecone/Weaviate add a second operational system, separate access-control plane, and synchronisation surface against the relational tenant + taxonomy data; FAISS lacks ACID transactions and persistent storage at our 10K-chunk scale. pgvector keeps embeddings, taxonomy, and tenant metadata in one ACID database with one backup story. |
| Knowledge representation | PostgreSQL taxonomy tables (ADR-0053; supersedes Neo4j adoption) | Neo4j Graph Data Science (@neo4j_gds_manual); embed graph structure into chunk text | Neo4j added an operational service for queries that were already expressible as SQL joins over typed entity tables, and the GDS algorithms (PageRank, community detection) were never load-bearing in retrieval. Embedding the graph into chunks lost the structured-lookup property used by the doctor / department / specialty paths. |
| LLM call boundary | structured_call thin helper (~190 LOC) over the raw OpenAI client | Pydantic AI Agent[None, OutputModel] (@pydantic_ai_docs); raw json.loads; LangChain wrappers | Pydantic AI was adopted 2026-05-09 across 8 call sites then removed 2026-05-12 (commit b8d8da67) after production telemetry showed it added ~720 ms per call. The thin helper preserves schema validation + retries while paying zero latency tax. See the Decision-Cost Rubric case study for the full post-mortem; this is the load-bearing argument behind the v2.3 Brainstorm Gate. |

Further per-decision rationale lives in each ADR; this page summarises them so a reader can orient before reading the deeper component pages.

The layered design ensures that changes in one layer — such as swapping an embedding model or adding a new safety filter — do not cascade through the entire system.

Architecture Layers

Diagram conventions

OAI is the OpenAI direct API endpoint used for embeddings and LLM generation after the migration in ADR-0048. The previous OR box (OpenRouter) is intentionally absent: OpenRouter remains a configurable override (rag_llm_provider=openrouter) but is not the default path; see LLM Stack for the legacy / current routing.

The diagram above shows the web chat channel only. The voice channel attaches at the same QP (Query Pipeline) node via VoiceLLMOrchestrator → RAGService.query_stream(channel="voice"), but adds upstream stages: Twilio SIP gateway → LiveKit room → voice_agent worker (Deepgram Nova-3 STT, ElevenLabs Multilingual v2 TTS) → WebSocket /ws/public-query. It also adds a downstream answer-shaping layer that strips inline [N] citation markers and enforces ≤2 sentences. The voice-specific sequence diagram lives in Voice Architecture.

Layer Responsibilities

Presentation Layer

The frontend is a React + TypeScript single-page application that provides a conversational chat interface. It communicates with the backend via both REST API (authentication, document management) and WebSocket (query streaming). The WebSocket connection enables real-time progress updates as the query traverses the pipeline, reporting stages such as "Understanding your question...", "Searching knowledge base...", and "Generating response...".

API Layer

FastAPI serves as the API gateway, chosen for its native async support, automatic OpenAPI documentation, and Pydantic-based request validation. The API layer enforces security before any business logic executes:

| Middleware | Purpose | Configuration |
| --- | --- | --- |
| Keycloak OIDC | Identity verification via external IdP | JWT tokens, Keycloak realm |
| CSRF Protection | Cross-site request forgery prevention | starlette-csrf |
| Rate Limiting | Abuse prevention | 10/min login, 30/min queries |
| CORS | Cross-origin security | Configured per environment |
| Observability | Structured request/response logging | structlog (JSON in production, colored console in dev) |
| Graceful Shutdown | Request draining on SIGTERM | Returns 503 for new requests during shutdown |

Security Layer

The security layer implements the safety-first principle that permeates this architecture. Every query passes through intent classification before any retrieval or generation occurs. Queries classified as out_of_scope_medical_advice are blocked immediately, never reaching the retrieval pipeline. Post-generation safety validation provides a second check, and the quality gate ensures response confidence meets minimum thresholds.

Critical Design Principle

The security layer operates on a deny-by-default model. A query must actively pass through each safety checkpoint to receive a response. Any single layer can halt the pipeline and return a safe fallback message.
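The deny-by-default model can be illustrated with a minimal sketch. Everything here is hypothetical: the `Verdict` type, the gate functions, the fallback wording, and the `min_confidence` default are illustrative names, not the production code. In the real pipeline the intent gate runs before retrieval and the quality gate after generation; the sketch collapses them into one loop to show only the control flow.

```python
from dataclasses import dataclass
from typing import Callable

SAFE_FALLBACK = "Sorry, I cannot help with that request."  # hypothetical wording


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str = ""


# A checkpoint inspects the pipeline state and returns a Verdict.
Checkpoint = Callable[[dict], Verdict]


def intent_gate(state: dict) -> Verdict:
    intent = state.get("intent")
    if intent is None:
        # No classification result means no permission to continue.
        return Verdict(False, "intent not classified")
    if intent == "out_of_scope_medical_advice":
        return Verdict(False, "medical advice is out of scope")
    return Verdict(True)


def quality_gate(state: dict) -> Verdict:
    confidence = state.get("confidence")
    # A missing score counts as a failure, not as a pass.
    if confidence is None or confidence < state.get("min_confidence", 0.5):
        return Verdict(False, "confidence below threshold")
    return Verdict(True)


def run_checkpoints(state: dict, checkpoints: list[Checkpoint]) -> str:
    for check in checkpoints:
        try:
            verdict = check(state)
        except Exception:
            # A crashing checkpoint denies rather than silently passing.
            verdict = Verdict(False, "checkpoint raised")
        if not verdict.allowed:
            return SAFE_FALLBACK
    return state.get("answer", SAFE_FALLBACK)
```

The point of the design is that every failure mode (blocked intent, missing data, an exception inside a checkpoint) resolves to the same safe fallback, so a response is only returned when each gate has explicitly allowed it.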

Service Layer

The service layer contains the core intelligence of the system. The Query Pipeline Orchestrator coordinates the sequence of operations following the retrieve-then-generate paradigm established by Lewis et al. (2020): query rewriting for conversational context; vector, BM25, and taxonomy retrieval, run sequentially because of the asyncpg single-session constraint; a Value Framework affinity rerank to prevent cross-category contamination; response generation with source grounding; and hybrid quality evaluation. Each service is independently testable and replaceable.

Value Framework intent-category affinity rerank (backend/app/services/value_framework/affinity.py) executes between retrieval and context assembly on every non-cached query (Stage 5b). It multiplies each chunk's relevance score by an intent_class × content_category affinity coefficient, demoting chunks whose tagged category is mismatched with the classified intent. The mechanism is hospital-agnostic — it reads chunk text + intent class only, never tenant-specific facts — and per-turn outputs are written to app.category_mismatch_telemetry (migration 066) for the Operations dashboard. See Query Pipeline §Stage 5b for the algorithm and rationale (the wheelchair-vs-cardiology cross-contamination regression).
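As a rough illustration of the multiplication described above, the sketch below rescales each chunk's score by an intent × category coefficient and re-sorts. The coefficient values, the intent and category names, and the default of 1.0 for unknown pairs are assumptions for the example; the actual table and algorithm are documented in Query Pipeline §Stage 5b.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    category: str
    score: float


# Hypothetical coefficients keyed by (intent_class, content_category).
AFFINITY = {
    ("find_doctor", "cardiology"): 1.0,
    ("find_doctor", "mobility_aids"): 0.3,
}
DEFAULT_AFFINITY = 1.0  # assumption: unknown pairs keep their original score


def affinity_rerank(chunks: list[Chunk], intent_class: str) -> list[Chunk]:
    rescored = [
        replace(
            chunk,
            score=chunk.score
            * AFFINITY.get((intent_class, chunk.category), DEFAULT_AFFINITY),
        )
        for chunk in chunks
    ]
    # Highest adjusted score first; mismatched categories sink.
    return sorted(rescored, key=lambda chunk: chunk.score, reverse=True)


retrieved = [
    Chunk("wheelchair-rental", "mobility_aids", 0.90),
    Chunk("cardiology-team", "cardiology", 0.80),
]
ranked = affinity_rerank(retrieved, "find_doctor")
```

In this example the wheelchair chunk starts with the higher raw score but is demoted below the cardiology chunk once the mismatch coefficient is applied.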

structured_call thin helper as the LLM-call boundary: eight LLM call sites (intent classification, query decomposition, feedback investigation, feedback digest, adversarial eval, diagnostic runner, voice turn evaluator, conversation classifier) invoke the model through a ~190-LOC structured_call helper over the raw AsyncOpenAI client. The helper enforces JSON-schema validation on responses, retries on validation failure, and raises a typed fallback exception only when the model still cannot satisfy the schema. This shape replaced Pydantic AI on 2026-05-12 (commit b8d8da67) after production telemetry showed Pydantic AI's Agent.run() added ~720 ms per call; the thin helper preserves the validation contract without that added latency. The removal incident is canonised in Decision-Cost Rubric as the load-bearing case study behind the methodology v2.3 Brainstorm Gate.
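The validate-retry-raise contract can be sketched in a few lines. This is not the production helper: the signature, the `StructuredCallError` name, the `complete` callable (a stand-in for the AsyncOpenAI call), and the simplified key/type check in place of real JSON-schema validation are all assumptions made so the example runs without network access.

```python
import asyncio
import json
from typing import Awaitable, Callable


class StructuredCallError(Exception):
    """Raised when no attempt produced a schema-valid response."""


def validate(payload: dict, required: dict[str, type]) -> None:
    # Minimal stand-in for JSON-schema validation: required keys and types.
    for key, expected in required.items():
        if not isinstance(payload.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")


async def structured_call(
    complete: Callable[[str], Awaitable[str]],
    prompt: str,
    required: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = await complete(prompt)
        try:
            payload = json.loads(raw)
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            validate(payload, required)
            return payload
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
    raise StructuredCallError(
        f"no valid response after {max_attempts} attempts"
    ) from last_error


async def _demo() -> dict:
    # Fake model: the first reply is malformed, the second is valid.
    replies = iter(["not json", '{"intent": "find_doctor", "confidence": 0.92}'])

    async def fake_complete(prompt: str) -> str:
        return next(replies)

    return await structured_call(
        fake_complete, "classify: ...", {"intent": str, "confidence": float}
    )


result = asyncio.run(_demo())
```

The first malformed reply triggers a retry and the second reply is returned as a validated dict. Because the helper is a plain coroutine around one client call, it adds no framework layer between the caller and the model.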

Data Layer

Six specialized data stores serve distinct purposes:

| Store | Technology | Purpose |
| --- | --- | --- |
| Vector Store | PostgreSQL + pgvector | Semantic similarity search over document chunks |
| Taxonomy | PostgreSQL | Structured entity relationships and graph queries |
| Semantic Query Cache | PostgreSQL + pgvector | Two-tier query result cache (hash + embedding similarity), see ADR-0031 |
| Intent Classification Cache | Memory (LRU) or Redis | Per-(tenant, query, language) cache of intent classification results; skips the ~2,300 ms LLM call on repeat queries. Backend is runtime-selectable via INTENT_CACHE_BACKEND. See ADR-0054 |
| Sessions / Rate-limit | Redis | Session management, rate limiting, token blacklist |
| Object Storage | MinIO | Raw document storage (markdown, HTML) |
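The two-tier lookup used by the Semantic Query Cache can be sketched as follows: an exact SHA-256 match first, then a nearest-embedding match above a similarity threshold. This is an in-memory stand-in, not the pgvector-backed implementation; the class name, the query normalisation, the tenant-prefixed key, and the 0.95 threshold are assumptions.

```python
import hashlib
import math


def _key(tenant_id: str, query: str) -> str:
    # Assumption: case and whitespace are normalised before hashing.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{tenant_id}:{normalized}".encode()).hexdigest()


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    def __init__(self, similarity_threshold: float = 0.95) -> None:
        self.threshold = similarity_threshold  # hypothetical value
        self._exact: dict[str, str] = {}
        self._semantic: list[tuple[str, list[float], str]] = []

    def put(self, tenant_id: str, query: str, embedding: list[float], answer: str) -> None:
        self._exact[_key(tenant_id, query)] = answer
        self._semantic.append((tenant_id, embedding, answer))

    def get(self, tenant_id: str, query: str, embedding: list[float]):
        # Tier 1: exact hash match, no vector maths needed.
        hit = self._exact.get(_key(tenant_id, query))
        if hit is not None:
            return ("tier1", hit)
        # Tier 2: most similar cached embedding for the same tenant.
        best_score, best_answer = 0.0, None
        for cached_tenant, cached_embedding, answer in self._semantic:
            if cached_tenant != tenant_id:
                continue
            score = _cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self.threshold:
            return ("tier2", best_answer)
        return None


cache = TwoTierCache()
cache.put("tenant-a", "What are the visiting hours?", [1.0, 0.0], "cached answer")
```

A rephrased question misses tier 1 but can still hit tier 2 if its embedding is close enough, while a lookup from a different tenant never sees another tenant's entries.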

External Services

The system integrates a single external LLM provider:

  • OpenAI provides the LLM models and embeddings directly. All API calls use the OpenAI direct endpoint (Tier 2 for intent classification, entity extraction, evaluation, and canonical questions; Tier 2 or Tier 3 for generation in full mode). Embeddings always use OpenAI text-embedding-3-large per ADR-0048, @openai2024embeddings.
Ollama is configured but not deployed

The LLM_FALLBACK_CHAIN setting in backend/.env.example lists Ollama as a final-tier emergency fallback, and LLMProviderFactory._create_ollama_client exists in code. In practice Ollama is not deployed on pilot and the fallback path has never been validated end-to-end — only mock unit tests exist. The path should be treated as vestigial until either deployed and tested, or removed.

Technology Rationale

The technology choices reflect three guiding principles:

  1. Right tool for the right job -- pgvector for vectors (@pgvector_docs), PostgreSQL taxonomy tables for entity relationships, Redis for ephemeral state
  2. Cloud-first -- Embeddings and LLM calls use OpenAI direct API (single LLM provider in production; see the note in External Services for the unused Ollama fallback configuration)
  3. Standards-based -- FastAPI (OpenAPI), PostgreSQL (SQL standard), WebSocket (RFC 6455) ensure interoperability and long-term maintainability

Deployment Architecture

All infrastructure components run as Docker containers orchestrated via Docker Compose, enabling reproducible local development and straightforward deployment. PostgreSQL serves double duty as both the vector store (pgvector) and the entity taxonomy store. Keycloak provides OIDC-based authentication with JWT tokens. The frontend and backend are the only components that run outside of Docker during development, with hot-reloading enabled for rapid iteration.

Channels — Web and Voice

The system serves two production input channels that share the retrieval backend but diverge at the LLM call:

| Channel | Entry path | Orchestrator | TTS / response shape |
| --- | --- | --- | --- |
| Web (chat) | HTTPS / WebSocket from React UI | RAGService (backend/app/services/rag_service.py) | Streamed Markdown + citation list |
| Voice (telephony) | Twilio Elastic SIP → LiveKit SIP → LiveKit Agents | VoiceLLMOrchestrator (backend/app/services/voice/voice_llm_orchestrator.py) | ElevenLabs Multilingual v2 TTS @elevenlabs_multilingual_v2; answer-shaped for spoken delivery (no inline [N] citation markers) |

Routing happens at request time on the channel field of QueryRequest (backend/app/models/schemas.py:216, Literal["web", "whatsapp", "voice"]). The voice path uses a deliberately thin pipeline (regex pre-filter → FAQ tool → RAG fallback) per ADR-0049 / thin-voice-architecture; the previously attempted 8-stage agentic orchestrator was removed on 2026-05-02 (commit 158d793) after the cache-hit rate failed to justify its complexity. Speech-to-text uses Deepgram Nova-3 @deepgram_nova3 with the language locked at first utterance per ADR-0052 / voice-language-locking.
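The routing and the thin voice pipeline can be sketched as a chain of cheap checks that fall through to RAG. The pre-filter pattern, the FAQ entry, the canned replies, and the function names below are hypothetical; only the `Literal` channel values and the stage order (regex pre-filter → FAQ tool → RAG fallback) come from this page.

```python
import re
from typing import Literal, get_args

Channel = Literal["web", "whatsapp", "voice"]

# Hypothetical pre-filter: trivial utterances answered without any LLM call.
PREFILTER = re.compile(r"^\s*(goodbye|bye|thank you|thanks)\W*$", re.IGNORECASE)
# Hypothetical FAQ table: keyword to canned answer.
FAQ = {"visiting hours": "Visiting hours are listed on the hospital website."}


def rag_answer(query: str) -> str:
    # Stand-in for the shared retrieval and generation backend.
    return f"[RAG] {query}"


def voice_pipeline(query: str) -> tuple[str, str]:
    if PREFILTER.match(query):
        return ("prefilter", "You're welcome. Goodbye.")
    lowered = query.lower()
    for keyword, answer in FAQ.items():
        if keyword in lowered:
            return ("faq", answer)
    return ("rag", rag_answer(query))


def route(channel: str, query: str) -> tuple[str, str]:
    if channel not in get_args(Channel):
        raise ValueError(f"unsupported channel: {channel!r}")
    if channel == "voice":
        return voice_pipeline(query)
    return ("rag", rag_answer(query))
```

Each stage returns as soon as it can answer, so the expensive RAG call is reached only when the cheaper stages have nothing to offer.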

The voice and web channels share:

  • the same retrieval pipeline (vector + BM25 + taxonomy)
  • the same Value Framework affinity rerank (Stage 5b)
  • the same safety / disclaimer policy (the medical-advice deny-by-default barrier sits before retrieval on both channels)
  • the same evaluation surface (app.pipeline_telemetry, app.category_mismatch_telemetry)

What differs is the answer-shaping layer (backend/app/services/voice/voice_answer_shaper.py), which strips Markdown, fits the response within a target token window for TTS latency, and routes citation context through a separate metadata field instead of inline [N] markers (backend/app/services/voice/voice_faq_tool.py).
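A minimal sketch of this kind of answer shaping is shown below. It is not the contents of voice_answer_shaper.py: the function name, the regular expressions, and the use of a two-sentence cap (taken from the Diagram conventions note) in place of a token window are assumptions, and the Markdown stripping is deliberately naive.

```python
import re

CITATION = re.compile(r"\s*\[(\d+)\]")
MARKDOWN = re.compile(r"[*_`#]+")  # naive: also removes underscores in identifiers
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def shape_for_voice(text: str, max_sentences: int = 2) -> tuple[str, list[int]]:
    # Collect citation numbers so they can travel as metadata instead of speech.
    citations = [int(n) for n in CITATION.findall(text)]
    spoken = CITATION.sub("", text)
    spoken = MARKDOWN.sub("", spoken)
    spoken = " ".join(spoken.split())
    sentences = SENTENCE_BREAK.split(spoken)
    return " ".join(sentences[:max_sentences]), citations


spoken, cited = shape_for_voice(
    "**Visiting hours** are 14:00 to 20:00 [1]. Parking is free [2]. Ask at reception."
)
```

Here `spoken` keeps only the first two sentences with the markers and emphasis removed, and `cited` carries the citation numbers separately.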

Multi-Tenancy

The system supports per-hospital configuration via DB-driven settings. All hospital-specific behaviour — crawl rules, boilerplate selectors, URL patterns, LLM prompt identity — is stored in PostgreSQL and loaded at runtime. No code changes are required to add a new hospital.

Two configuration planes coexist:

  • Web / RAG plane: app.site_crawl_configs (crawl behaviour) + app.golden_pages (high-value extraction targets) + PromptContext (parameterized prompt identity, injected per-request from the hospital's YAML config).
  • Voice plane: backend/app/services/voice/tenant_overlays/_yaml/<slug>.yaml — per-tenant FAQ entries, STT phonetic-recovery rules ("afwrak" → "after-care"), and DB-driven answer renderers loaded via get_overlay(slug).

All taxonomy tables, documents, and Redis keys are scoped by tenant_id. See Multi-Tenancy & Hospital-Agnostic Architecture for the full design, onboarding flow, and auto-linker details.
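To make the voice overlay and tenant scoping concrete, here is a small sketch. The `Overlay` dataclass, the in-memory registry standing in for the parsed YAML files, the `tenant-a` slug, and the `tenant:<id>:...` key format are all hypothetical; only the `get_overlay(slug)` name and the "afwrak" → "after-care" rule come from this page, and the real loader may behave differently.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Overlay:
    slug: str
    # Misheard token mapped to the intended term, applied to STT transcripts.
    stt_recovery: dict[str, str] = field(default_factory=dict)


# Stand-in for the parsed <slug>.yaml files.
_OVERLAYS = {
    "tenant-a": Overlay(slug="tenant-a", stt_recovery={"afwrak": "after-care"}),
}


def get_overlay(slug: str) -> Overlay:
    # Assumption: an unknown tenant gets an empty overlay rather than an error.
    return _OVERLAYS.get(slug, Overlay(slug=slug))


def recover_transcript(slug: str, transcript: str) -> str:
    for heard, intended in get_overlay(slug).stt_recovery.items():
        pattern = rf"\b{re.escape(heard)}\b"
        transcript = re.sub(pattern, intended, transcript, flags=re.IGNORECASE)
    return transcript


def tenant_key(tenant_id: str, *parts: str) -> str:
    # Hypothetical key layout that keeps every Redis key tenant-scoped.
    return ":".join(["tenant", tenant_id, *parts])
```

Because the rules live in per-tenant data, adding a hospital means adding an overlay entry, with no change to `recover_transcript` itself.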

Observability

Structured Logging

The backend uses structlog for structured logging with environment-aware output:

| Environment | Format | Purpose |
| --- | --- | --- |
| Development | Colored console with key-value pairs | Developer readability |
| Production | JSON lines (one object per log entry) | Machine-parseable for log aggregation |

All log entries include request correlation IDs, timestamps, and structured metadata. The observability middleware logs request/response pairs with timing information.

Health Endpoints

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| GET /health | Public | Basic liveness check (database, Redis, MinIO) |
| GET /health/ready | Public | Deep readiness check including LLM circuit breaker state, PostgreSQL, Redis, and MinIO connectivity |

The /health/ready endpoint is designed for orchestrator readiness probes. It reports the LLM circuit breaker state (closed, open, half-open), enabling automated detection of LLM API outages without requiring authenticated access.
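The three breaker states can be illustrated with a generic circuit-breaker sketch. This is the textbook pattern, not the system's implementation: the thresholds, the method names, and the shape of the readiness payload are assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            # Cool-down elapsed: allow a trial call through.
            return "half-open"
        return "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many failures, (re)opens the breaker.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def readiness(breaker: CircuitBreaker) -> dict:
    # Hypothetical payload shape for a readiness probe.
    return {
        "status": "degraded" if breaker.state == "open" else "ready",
        "llm_circuit_breaker": breaker.state,
    }
```

Exposing the state in the probe response lets an orchestrator distinguish "the process is up" from "the process is up but the LLM dependency is failing".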

Graceful Shutdown

The application implements graceful shutdown with request draining:

  1. On SIGTERM, a shutdown flag is set
  2. New requests receive HTTP 503 (Service Unavailable) with a Retry-After header
  3. In-flight requests are allowed to complete within the configured timeout
  4. Health endpoints (/health, /health/ready) remain available during shutdown for orchestrator polling

Configure via uvicorn's --timeout-graceful-shutdown flag (recommended: 30 seconds).
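The draining behaviour in steps 1 to 4 can be sketched as a plain ASGI middleware with no framework dependency. The class name, the way the flag is flipped, the response body, and the `/api/v1/query` path used in the demo are assumptions; the production middleware may be structured differently.

```python
import asyncio

HEALTH_PATHS = {"/health", "/health/ready"}


class ShutdownGate:
    """ASGI middleware that rejects new HTTP requests once shutdown begins."""

    def __init__(self, app, retry_after_seconds: int = 30) -> None:
        self.app = app
        self.retry_after = str(retry_after_seconds).encode()
        self.shutting_down = False  # flipped by the shutdown hook on SIGTERM

    async def __call__(self, scope, receive, send) -> None:
        rejecting = (
            scope["type"] == "http"
            and self.shutting_down
            and scope["path"] not in HEALTH_PATHS
        )
        if rejecting:
            await send({
                "type": "http.response.start",
                "status": 503,
                "headers": [
                    (b"retry-after", self.retry_after),
                    (b"content-type", b"text/plain"),
                ],
            })
            await send({"type": "http.response.body", "body": b"Shutting down"})
            return
        # Requests already past this point keep running until they finish.
        await self.app(scope, receive, send)


async def _ok_app(scope, receive, send) -> None:
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _status(gate: ShutdownGate, path: str) -> int:
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    await gate({"type": "http", "path": path}, receive, send)
    return sent[0]["status"]


gate = ShutdownGate(_ok_app)
before = asyncio.run(_status(gate, "/api/v1/query"))
gate.shutting_down = True
during = asyncio.run(_status(gate, "/api/v1/query"))
health = asyncio.run(_status(gate, "/health"))
```

In the demo, the same request path returns 200 before the flag is set and 503 afterwards, while the health path stays reachable so the orchestrator can keep polling.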

End-to-End Latency Budgets

End-to-end latency depends on cache state and channel. The web channel uses the figures below; for the voice channel, time-to-first-audio and turn latency are reported in Voice Architecture when measured.

| Stage | Web, measured p50 | Web, measured p95 | Source |
| --- | --- | --- | --- |
| Cache hit (Tier 1, SHA-256) | ~1 ms | ~3 ms | query_cache_service.py; verifiable via app.pipeline_telemetry |
| Cache hit (Tier 2, pgvector HNSW) | ~30 ms | not yet measured | pgvector HNSW operator characteristics; verifiable in pilot |
| Intent classification (structured_call helper) | ~2.4 s p50 (LLM-dominated) | not yet measured at p95 | pipeline_telemetry.intent_classification; post pydantic-ai removal latency dropped ~720 ms per call |
| Hybrid retrieval (vector + BM25 + taxonomy, sequential) | not yet measured | not yet measured | sequential-execution constraint per asyncpg single-session policy |
| Context assembly (incl. ±1 chunk expansion + dedup) | ~50 ms | not yet measured | context_assembly_service.py |
| Response generation (Tier 2 streaming) | ~3 s | not yet measured | dominant cost on cache miss; mitigated by streaming |
| Fast quality gate | ~600 ms | not yet measured | embedding-similarity blend; runs before stream close |
| Background analytics (DeepEval) | 40–60 s | n/a (non-blocking) | runs after response delivered |

Verification path: each stage emits pipeline_telemetry.duration_ms per conversation_id. The Operations dashboard renders the percentile_cont(0.95) over the last period_days window via GET /api/v1/admin/feedback/telemetry-stats (see Feedback Dashboard Metrics). Wave 2.D leaves the missing stage-level p95 numbers as not yet measured rather than estimating; treat the 5.5-second blocking total in Query Pipeline §Timing Breakdown as a working estimate, not a contract.
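For readers who want to sanity-check a dashboard figure by hand, the sketch below reproduces the linear-interpolation behaviour of PostgreSQL's percentile_cont in plain Python. The sample durations are made up for illustration and are not pilot measurements.

```python
import math


def percentile_cont(values: list[float], fraction: float):
    """Continuous percentile with linear interpolation between neighbours."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if not values:
        return None  # PostgreSQL returns NULL for an empty input set
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Hypothetical duration_ms samples for one stage.
durations_ms = [40, 45, 50, 55, 60, 70, 80, 90, 120, 400]
p50 = percentile_cont(durations_ms, 0.50)
p95 = percentile_cont(durations_ms, 0.95)
```

With these ten samples the p50 is 65 ms while the p95 is about 274 ms: a single slow outlier barely moves the median but dominates the tail, which is why the missing p95 values cannot be estimated from the p50 column.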

GDPR Compliance Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| DELETE /api/v1/gdpr/users/{id}/data | DELETE | GDPR Art. 17 right-to-erasure: deletes all user data (conversations, messages, analytics, audit logs) |

The GDPR deletion endpoint requires admin authentication and returns a summary of deleted records across all data categories.

Prompt Versioning

All LLM prompts are versioned via a PROMPT_VERSION constant (e.g., 2026.04.1). This version is logged with every evaluation result, enabling correlation between prompt changes and quality regressions.

Architectural Changes Since 2026-04-09

The previous last_verified snapshot of this page predates several material architectural changes; the table below summarises them so a reader of an older PDF / cached copy knows what's new.

| Date | Change | Reference |
| --- | --- | --- |
| 2026-04-30 | Embedding model migrated BGE-M3 → OpenAI text-embedding-3-large (1024-dim → 1536-dim, Ollama → OpenAI) | ADR-0048 |
| 2026-05-02 | Legacy 8-stage agentic voice orchestrator removed; thin pipeline (regex pre-filter → FAQ → RAG) becomes production | ADR-0049 / thin-voice-architecture; commit 158d793 |
| 2026-05-02 | Neo4j Graph Data Science removed; entity relationships consolidated into PostgreSQL taxonomy tables | ADR-0053 |
| 2026-05-09 → 2026-05-12 | Pydantic AI adopted (8 call sites) then removed after telemetry showed +720 ms/call; replaced with structured_call thin helper (~190 LOC) | Decision-Cost Rubric case study; commit b8d8da67 |
| 2026-05-22 | LLM-first agentic voice pipeline accepted: native OpenAI streaming-with-tools, single call per tool-decision iteration | ADR-0053-llm-first-agentic-voice; commit 15c596b5 |
| 2026-05-23 | Voice quality refit: Rule 4.5 (no repeated clarifications) + temperature 0.3 → 0.0 + 80-term STT phonetic-recovery sweep + Rule 6.5 (procedure explanations) + tier-1 session rate limit + voice ops infrastructure (trace/replay/SLO) | Voice Architecture; commits 7900b5e0, 2c514cc1, 0242202e, 3bda7f00 |
| 2026-05-09 | Value Framework affinity rerank productised at Stage 5b; app.category_mismatch_telemetry (migration 066) and app.diagnostic_feedback (migration 067) added | Query Pipeline §Stage 5b |
| 2026-05-04 | Voice tenant-overlay system (backend/app/services/voice/tenant_overlays/) shipped; per-tenant FAQ + STT recovery + answer renderers | Multi-Tenancy |
| 2026-04-22 | Nightly auto-ingest live on pilot (INGEST_MODE=auto, daily 03:00 UTC); IngestRun audit records (migration 061) | Document Ingestion §Scheduled Ingestion |