Deployment Overview

The ZOL RAG system deploys as two units on a single Linux server: infrastructure (six containerized services) and application (one image rebuilt per release). Embeddings are generated by the OpenAI hosted API (text-embedding-3-large, 1536 dimensions); there is no on-premise embedding container in production.
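As a rough illustration of what a hosted embedding call looks like, the sketch below builds the request body and sends it with curl. The model's default output size is 3072 dimensions, so a 1536-dimension vector would be requested through the API's `dimensions` parameter. That the application sets it this way is an assumption, as are the `OPENAI_API_KEY` variable name and the sample text.

```shell
# Build the JSON body for one embedding request.
# "dimensions": 1536 shortens the vector from the model default of 3072
# (assumed to match how the application calls the API).
# No JSON escaping is done: keep the sample text free of quotes and backslashes.
embedding_payload() {
  printf '{"model":"text-embedding-3-large","input":"%s","dimensions":1536}' "$1"
}

# Only call the API when a key is present (OPENAI_API_KEY is an assumed name).
if [ -n "${OPENAI_API_KEY:-}" ]; then
  curl -sS https://api.openai.com/v1/embeddings \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(embedding_payload "hypertension treatment guidelines")"
fi
```

The response carries the vector under `data[0].embedding`; its length should equal the requested dimension count.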

Checklist

Read this page to understand the architecture
Verify you have the prerequisites
Follow the guides in order: Server Setup → Infrastructure → Application → Data Seeding → SSL & DNS → User Management → Monitoring

Architecture

Internet
  │
  ▼
┌──────────────────────────────────┐
│  nginx :80/:443                  │  ← Static files (React frontend)
│  uvicorn :8000 (x4 workers)     │  ← FastAPI backend + reranker model
│  supervisord                     │  ← Process manager
└──────────┬───────────────────────┘
           │  zol-network (Docker bridge)
  ┌────────┼────────────────────────────────────┐
  │        ▼                                    │
  │  ┌──────────┐ ┌───────┐ ┌────────────────┐  │
  │  │PostgreSQL│ │ Redis │ │   Keycloak     │  │
  │  │ :5432    │ │ :6379 │ │   :8080        │  │
  │  │ pgvector │ │ cache │ │   OIDC IdP     │  │
  │  └──────────┘ └───────┘ └────────────────┘  │
  │  ┌──────────┐ ┌────────────┐ ┌─────────┐   │
  │  │  MinIO   │ │ Prometheus │ │ Grafana │   │
  │  │  :9000   │ │  :9090     │ │  :3000  │   │
  │  │ S3 docs  │ └────────────┘ └─────────┘   │
  │  └──────────┘                               │
  └─────────────────────────────────────────────┘
           │                        │
           ▼                        ▼
  OpenAI / OpenRouter API      OpenAI Embeddings API
       (LLM generation)         text-embedding-3-large
                                (1536 dim, ADR-0048)

Only ports 80 and 443 are exposed to the internet. All infrastructure ports (including Keycloak at 8080 and Grafana at 3000) are bound to 127.0.0.1 and accessible only via SSH tunnel.
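Because the infrastructure ports listen only on the server's loopback interface, reaching Grafana or Keycloak from a workstation means forwarding them over SSH. The helper below is a minimal sketch that prints a matching `ssh` command; the `deploy` user is hypothetical, and the host is the example domain used on this page.

```shell
# Print an ssh command that forwards the given loopback-only ports
# from the server to the same ports on the local machine.
tunnel_cmd() {
  local host="$1"
  shift
  local cmd="ssh -N"
  local port
  for port in "$@"; do
    cmd="$cmd -L ${port}:127.0.0.1:${port}"
  done
  printf '%s %s\n' "$cmd" "$host"
}

# Grafana (3000) and Keycloak (8080); "deploy" is a hypothetical user name.
tunnel_cmd deploy@search.zol.be 3000 8080
```

While the printed command is running, Grafana would be reachable at http://localhost:3000 and Keycloak at http://localhost:8080 on the workstation.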

Three-File Compose Strategy

The deployment uses a layered Docker Compose pattern:

File	What's Inside	When It Changes
`docker/docker-compose.infra.yml`	PostgreSQL, Redis, MinIO, Keycloak, Prometheus, Grafana	Rarely (version bumps)
`docker/docker-compose.app.yml`	FastAPI + React + nginx + reranker model	Every code release
`docker/docker-compose.ssl.yml`	Nginx SSL overlay (port 443, certificates)	Certificate renewal

During a normal release, only the application image is rebuilt and restarted. Infrastructure services persist data on Docker volumes.

Deploy Command

cd /opt/zol-rag && docker compose --env-file .env.prod \
  -f docker/docker-compose.infra.yml \
  -f docker/docker-compose.app.yml \
  -f docker/docker-compose.ssl.yml up -d
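For a routine release, the same file set can be reused while targeting only the application service. The sketch below prints such a command rather than running it. The service name `app` is hypothetical; check `docker/docker-compose.app.yml` for the real one. `--no-deps` and `--build` are standard `docker compose up` options.

```shell
# Print (rather than run) a command that rebuilds and restarts one service.
# --no-deps leaves the infrastructure containers untouched.
release_cmd() {
  local service="$1"
  local cmd="docker compose --env-file .env.prod"
  local layer
  for layer in infra app ssl; do
    cmd="$cmd -f docker/docker-compose.${layer}.yml"
  done
  printf '%s up -d --no-deps --build %s\n' "$cmd" "$service"
}

release_cmd app   # "app" is a hypothetical service name
```

Printing first makes it easy to review the command before executing it from /opt/zol-rag.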

Prerequisites

Hardware

Resource	Minimum	Recommended	Why
CPU	4 cores	8 cores	Reranker inference + uvicorn workers
RAM	16 GB	32 GB	PostgreSQL + Keycloak + app are memory-hungry
Disk	100 GB SSD	250 GB NVMe	pgvector indexes, MinIO docs
Network	100 Mbps	1 Gbps	LLM API latency is the bottleneck
GPU	Not required	NVIDIA T4	Speeds up embedding from 300ms to 50ms
OS	Ubuntu 22.04 / Debian 12	Ubuntu 24.04	Docker support required

What You Need Before Starting

SSH access to the server (root or sudo)
A domain name pointing to the server IP (e.g., search.zol.be)
OpenAI API key (generation and embeddings); an OpenRouter API key (from https://openrouter.ai) only if the optional legacy fallback is enabled
The git repository URL
SNOMED CT Belgian Edition RF2 package (for terminology features)

External Dependencies

Service	Purpose	Cost Estimate
OpenAI APIs (direct)	Generation (gpt-4.1-mini / gpt-4.1) + embeddings (`text-embedding-3-large`)	~$0.01–0.05 per query (LLM dominant); embeddings ≈ $0.16/year at 25 K queries/mo (negligible)
OpenRouter (deprecated, optional override)	Legacy LLM fallback path; flag retained per ADR-0048	Pay-as-you-go
Jina API (optional)	Reranker fallback (local reranker is default)	Free tier available

Budget approximately $50-100/month for pilot traffic (~25,000 queries/month).
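The per-query estimate and the monthly budget can be cross-checked with a little arithmetic. The sketch below uses only figures from this section. At the listed $0.01–0.05 per query, 25,000 queries come to $250–1,250 per month, so the $50–100 budget corresponds to an effective cost of about $0.002–0.004 per query. One possible explanation is that cached answers never reach the LLM, but that is an assumption and not something this page states.

```shell
QUERIES_PER_MONTH=25000

# Monthly LLM spend in whole dollars for a per-query cost given in cents.
monthly_usd() {
  echo $(( QUERIES_PER_MONTH * $1 / 100 ))
}

# Effective per-query cost in dollars implied by a monthly budget in dollars.
implied_usd_per_query() {
  awk -v budget="$1" -v q="$QUERIES_PER_MONTH" 'BEGIN { printf "%.3f\n", budget / q }'
}

echo "at 1 cent/query:  \$$(monthly_usd 1)/month"
echo "at 5 cents/query: \$$(monthly_usd 5)/month"
echo "\$50 budget:  \$$(implied_usd_per_query 50) per query"
echo "\$100 budget: \$$(implied_usd_per_query 100) per query"
```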

Memory Budget

Service	Reserved	Max Limit	Notes
PostgreSQL	2 GB	4 GB	Embeddings, vector search, taxonomy, Keycloak DB
Keycloak	512 MB	1 GB	OIDC identity provider, realm management
Redis	512 MB	1 GB	Cache (bounded by LRU policy)
MinIO	256 MB	512 MB	Document storage
App (nginx + uvicorn)	3 GB	6 GB	4 workers + reranker model (~400 MB)
Prometheus + Grafana	384 MB	768 MB	Metrics (30-day retention)
OS + Docker overhead	2 GB	4 GB	Kernel, daemon, logging
Total	~9 GB	~17 GB	16 GB min, 32 GB comfortable
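The totals row can be re-derived by summing the per-service figures. The sketch below does so in MB, treating 1 GB as 1024 MB (an assumption about how the table was rounded), and prints the sums in GB.

```shell
# Reserved and max limits in MB, in table order:
# PostgreSQL, Keycloak, Redis, MinIO, App, Prometheus + Grafana, OS + Docker
RESERVED_MB="2048 512 512 256 3072 384 2048"
LIMIT_MB="4096 1024 1024 512 6144 768 4096"

# Sum a space-separated list of integers.
sum_mb() {
  local total=0
  local n
  for n in $1; do
    total=$(( total + n ))
  done
  echo "$total"
}

# Convert MB to GB with three decimals.
to_gb() {
  awk -v mb="$1" 'BEGIN { printf "%.3f\n", mb / 1024 }'
}

echo "reserved: $(to_gb "$(sum_mb "$RESERVED_MB")") GB"
echo "limit:    $(to_gb "$(sum_mb "$LIMIT_MB")") GB"
```

The sums come to 8.625 GB reserved and 17.250 GB at the limits. A 16 GB host therefore covers the reserved total with headroom, but not the case where every service reaches its limit at once.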

Deployment Order

Follow these guides in sequence:

Server Setup — Docker, clone, secrets, firewall
Infrastructure — Start 6 services, verify health
Application — Build image, migrations, start app
Data Seeding — SNOMED, crawl, taxonomy extraction
SSL & DNS — TLS certificates, domain config
User Management — Keycloak users, roles
Monitoring — Grafana dashboards, health checks

For ongoing operations:

Updates — Code releases, rollback
Troubleshooting — Common issues, debug commands
Scripts Reference — All deployment scripts

Architectural Evolution

The deployment architecture has changed substantially since the initial design.

Neo4j was removed in March 2026 after all entity relationships were migrated to PostgreSQL taxonomy tables (taxonomy_entities and taxonomy_relationships) backed by @pgvector_docs, which reduced the service count and simplified operations; see ADR-0053 (master record).

Keycloak was added as the OIDC identity provider (@openid_connect_core_1_0, @rfc6749_oauth2, @rfc7519_jwt), replacing the legacy cookie-based authentication to meet @gdpr_regulation Article 25 and @ai_act_regulation audit-trail requirements.

The on-premise Ollama embedding container was retired entirely in April 2026 in favour of OpenAI's hosted text-embedding-3-large API (@openai2024embeddings, ADR-0048). Embedding latency dropped from 1.7–5.8 s (cold-start and serialization overhead) to 150–211 ms per call, and the change freed ≈1.4 GB of RAM and removed two compose services.

The single docker-compose.yml was refactored into a three-file overlay pattern for cleaner separation between infrastructure, application, and SSL concerns.

Multi-tenant separation across the deployment follows the SaaS patterns of @bezemer2010multitenant; operational SLO and tail-latency reporting follow @beyer2016sre; compliance baselines are anchored in @iso27001_2022 and @iso27018_2019.
