Monitoring

Access Grafana dashboards, Prometheus metrics, and application logs to monitor system health and performance.

Checklist

  • Access Grafana at http://YOUR_SERVER_IP:3000
  • Verify dashboards are loading
  • Understand the health endpoint
  • Know how to view application logs
  • Set up alert rules (optional)

Grafana Dashboards

Open http://YOUR_SERVER_IP:3000 in a browser.

Login credentials:

  • Username: admin
  • Password: the GRAFANA_ADMIN_PASSWORD from .env.prod

Available Dashboards

Eight dashboards ship with the deployment under Dashboards → Browse. The System Overview is an executive index with clickable links into the specialist dashboards; the SLO Status dashboard is the stakeholder view.

| Dashboard | UID | Purpose |
| --- | --- | --- |
| ZOL RAG - System Overview | zol-rag-system-overview | Executive index: 8 stat panels (status, request rate, p95, error %, LLM spend today, ingest status, voice TTFT p95, refusal rate) with clickable links to the specialist dashboards |
| ZOL RAG - Pipeline Overview | zol-rag-pipeline-overview | RAG behavioural metrics: intent distribution, stage-latency breakdown, safety refusal rate, graph injections |
| ZOL RAG - Infrastructure Health | zol-rag-infrastructure-health | HTTP plumbing, process resources, vector search latency, Python GC |
| ZOL RAG - LLM & Cost Tracking | zol-rag-llm-cost-tracking | Authoritative Postgres-backed daily / weekly / monthly cost panels, plus Prometheus since-restart token + cost counters |
| ZOL RAG - Voice Channel | zol-rag-voice-channel | Voice TTFT p50 / p95 / p99, safety escalations by reason, LLM-judge per-dimension scores, speculative-STT hit rate + latency saved |
| ZOL RAG - Ingest Pipeline | zol-rag-ingest-pipeline | 100% Postgres-backed: latest-run status + history, crawl corpus state, failure-class distribution, failed-URL table |
| ZOL RAG - Safety & Compliance | zol-rag-safety-compliance | Stakeholder view tied to the ZERO medical-advice incidents SLO: refusals + voice escalations, refusal-rate %, citation-attached %, CRAG decisions |
| ZOL RAG - SLO Status | zol-rag-slo-dashboard | Six headline SLO stats (availability, 5xx rate, RAG p95, voice TTFT p95, LLM error rate, medical-advice incidents) with red/yellow/green thresholds + error-budget panels |

Dashboards are provisioned automatically from grafana/dashboards/.

Postgres-Backed Panels

Several panels on the LLM & Cost Tracking, Ingest Pipeline, and Safety & Compliance dashboards do not use Prometheus — they query the application database directly through the Grafana postgres datasource. This is intentional: Prometheus counters reset on container restart (you lose yesterday's cumulative spend), while Postgres tables like app.analytics_events, app.ingest_runs, and app.crawled_urls are restart-safe and authoritative for cost reporting, ingest history, and audit numbers.

The Postgres datasource is provisioned from grafana/datasources/prometheus.yml (yes, the filename is prometheus.yml but it declares both datasources). Key gotcha: the database: zol_rag key MUST live under jsonData:, not at the top level of the datasource block — see operations/telemetry-and-runbooks.md for the debug story.
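
To make the jsonData gotcha concrete, here is a minimal Python sketch that checks already-parsed datasource blocks for a misplaced database key. The function name, the sample dicts, and the assumption that the Postgres datasource is declared with type: postgres are illustrative, not taken from the shipped file; loading the YAML itself is left out.

```python
def misplaced_database_keys(datasources, postgres_type="postgres"):
    """Return the names of Postgres datasources whose `database` key sits at
    the top level of the block instead of under `jsonData`."""
    offenders = []
    for ds in datasources:
        if ds.get("type") != postgres_type:
            continue  # other datasources (e.g. Prometheus) are not affected
        json_data = ds.get("jsonData") or {}
        if "database" in ds and "database" not in json_data:
            offenders.append(ds.get("name", "<unnamed>"))
    return offenders


# Hypothetical parsed datasource blocks, for illustration only.
BROKEN = {"name": "Postgres", "type": "postgres", "database": "zol_rag"}
FIXED = {"name": "Postgres", "type": "postgres", "jsonData": {"database": "zol_rag"}}
```

Run against the parsed contents of grafana/datasources/prometheus.yml, a non-empty result would point at the block to fix.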

Alerting

Six Prometheus alert rules ship in grafana/provisioning/alerting/zol-rag-alerts.yml under group zol-rag-core:

| Rule | Severity | Condition |
| --- | --- | --- |
| BackendDown | critical | up == 0 for 1m |
| HighErrorRate | critical | 5xx ratio > 1% for 5m |
| LLMCostBurnRate | warning | burn > $5/hr for 10m |
| SafetyRefusalSpike | warning | 5x the 1h baseline |
| VoiceTTFTHigh | warning | voice TTFT p95 > 2000ms for 10m |
| LLMCircuitOpen | critical | LLM error rate > 20% for 5m (proxy; no dedicated circuit-state gauge exists yet) |
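
As an illustration of what the HighErrorRate threshold means, the following Python sketch computes a 5xx ratio from per-status request counts and compares it with 1%. It is not the shipped PromQL rule: the function names and input shape are hypothetical, and the "for 5m" hold duration is not modelled.

```python
def five_xx_ratio(requests_by_status):
    """Fraction of requests whose HTTP status code is in the 5xx range."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0  # no traffic: treat as no errors
    errors = sum(
        count
        for status, count in requests_by_status.items()
        if 500 <= int(status) <= 599
    )
    return errors / total


def high_error_rate(requests_by_status, threshold=0.01):
    """True when the 5xx ratio is strictly above the threshold (1% by default)."""
    return five_xx_ratio(requests_by_status) > threshold
```

With 10 server errors out of 100 requests the check fires; with 10 out of 1000 it sits exactly on the threshold and does not.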

Before pilot deploy: contact-points.yml and notification-policies.yml in the same directory ship as templates. Production ops MUST set the real email recipients and/or Slack webhook URL before any alert can route. The volume mounts in docker/docker-compose.infra.yml and docker/docker-compose.yml pick the files up on Grafana container restart.

Health Endpoints

The application exposes two health check endpoints:

Basic Health (/health)

curl -s http://localhost:80/health | python3 -m json.tool
{
  "status": "healthy",
  "version": "0.1.0",
  "components": {
    "database": "healthy",
    "redis": "healthy",
    "minio": "healthy"
  }
}

| Status | Meaning |
| --- | --- |
| healthy | All components operational |
| degraded | Some components have issues but service is available |
| unhealthy | Critical component failure |

Docker health check runs curl -f http://localhost:80/health every 30 seconds.
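
For scripted checks, the /health response can be reduced to the components that need attention. The sketch below assumes the response shape shown above; the function name and the degraded sample body are hypothetical.

```python
import json


def failing_components(health_body):
    """Return {component: state} for every component not reporting 'healthy'."""
    payload = json.loads(health_body)
    return {
        name: state
        for name, state in payload.get("components", {}).items()
        if state != "healthy"
    }


# Hypothetical degraded response, for illustration only.
DEGRADED_SAMPLE = (
    '{"status": "degraded", "version": "0.1.0", '
    '"components": {"database": "healthy", "redis": "unhealthy", "minio": "healthy"}}'
)
```

The body could come from the curl command shown earlier; an empty result means every listed component reported healthy.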

Deep Readiness (/health/ready)

The deep health check includes LLM circuit breaker state, making it suitable for orchestrator readiness probes:

curl -s http://localhost:80/health/ready | python3 -m json.tool
{
  "status": "healthy",
  "version": "0.1.0",
  "components": {
    "database": "healthy",
    "redis": "healthy",
    "minio": "healthy",
    "llm_circuit": "closed"
  }
}

| LLM Circuit State | Meaning |
| --- | --- |
| closed | LLM API is reachable and functioning normally |
| open | LLM API is unreachable; requests are failing over to fallback |
| half-open | Circuit breaker is testing whether the LLM API has recovered |

Use /health/ready for Kubernetes/Docker readiness probes to detect LLM outages.
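
One way a probe script might turn the /health/ready body into a ready / not-ready decision is sketched below. Treating only a closed circuit as ready is an assumption made for this example; the page does not say how half-open should be handled, and the names here are hypothetical.

```python
import json

# Assumption: open and half-open both count as "not ready".
READY_CIRCUIT_STATES = {"closed"}


def is_ready(ready_body):
    """True when overall status is healthy and the LLM circuit is closed."""
    payload = json.loads(ready_body)
    components = payload.get("components", {})
    return (
        payload.get("status") == "healthy"
        and components.get("llm_circuit") in READY_CIRCUIT_STATES
    )
```

A stricter or looser policy only requires changing READY_CIRCUIT_STATES, for example adding "half-open" if traffic should keep flowing while the breaker recovers.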

Prometheus Metrics

The backend exposes metrics at /metrics (Prometheus text format). Headline metrics:

| Metric | Type | Description |
| --- | --- | --- |
| zol_rag_requests_total | Counter | Total HTTP requests by method, path, status |
| zol_rag_request_latency_seconds | Histogram | Request latency distribution |
| zol_rag_query_latency_seconds | Histogram | RAG query processing time, segmented by channel (web / voice_sip / voice_browser) |
| zol_rag_queries_total | Counter | Total RAG queries by intent and channel |
| zol_rag_llm_requests_total | Counter | LLM API calls by model, status |
| zol_rag_llm_cost_usd_total | Counter | Cumulative LLM spend per model (Prometheus side; resets on restart, see Postgres-Backed Panels above) |
| zol_rag_safety_refusals_total | Counter | Safety-blocked answers by reason |
| rag_query_ttft_ms | Histogram | Voice time-to-first-token (the headline voice SLO) |
| rag_voice_safety_escalations_total | Counter | Voice-channel safety escalations by reason |

Voice-channel metrics use the rag_* prefix; application metrics use the zol_rag_* prefix. Most counter and histogram metrics carry a channel label so dashboards can split web traffic from voice traffic.

For the full metric catalog including labels and semantics, see operations/telemetry-and-runbooks.md. Prometheus scrapes the backend every 15 seconds.
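
For a quick look at /metrics without opening Grafana, the text format can be parsed with the standard library. The sketch below sums one counter by a chosen label; the sample payload, its values, and the opening_hours intent are invented for illustration, and the parser only handles the simple name{labels} value lines of the text exposition format.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",...} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(metrics_text, metric, label):
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


# Hypothetical scrape output, for illustration only.
SAMPLE_METRICS = """\
# HELP zol_rag_queries_total Total RAG queries
# TYPE zol_rag_queries_total counter
zol_rag_queries_total{intent="doctor_lookup",channel="web"} 40
zol_rag_queries_total{intent="doctor_lookup",channel="voice_sip"} 5
zol_rag_queries_total{intent="opening_hours",channel="web"} 12
"""
```

Calling sum_by_label(SAMPLE_METRICS, "zol_rag_queries_total", "channel") groups the three samples into web and voice_sip totals, which is the same split the dashboards make with the channel label.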

Structured Logging

The backend uses structlog for structured logging. Log format is environment-aware:

| Environment | Format | Example |
| --- | --- | --- |
| Development | Colored console with key-value pairs | [info] query processed intent=doctor_lookup latency=1.2s |
| Production | JSON lines (one object per log entry) | {"event":"query processed","intent":"doctor_lookup","latency":1.2} |

JSON log output in production is compatible with standard log aggregation tools (ELK, Loki, CloudWatch).
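
Because production logs are JSON lines, they can also be filtered with a few lines of standard-library Python. This sketch assumes the event / intent / latency fields from the example row above, with latency in seconds; the function names are hypothetical. Lines that are not JSON objects are skipped, since container output may mix in other formats.

```python
import json


def parse_json_logs(lines):
    """Yield one dict per JSON log line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object, e.g. a plain-text access-log line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed line
        if isinstance(entry, dict):
            yield entry


def slow_queries(lines, min_latency=1.0):
    """Return log entries whose numeric `latency` is at least min_latency."""
    return [
        entry
        for entry in parse_json_logs(lines)
        if isinstance(entry.get("latency"), (int, float))
        and entry["latency"] >= min_latency
    ]
```

Fed the lines of an exported log file, slow_queries returns the entries worth a closer look.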

Viewing Logs

# Application logs (backend + nginx)
docker logs zol-app --tail 100 -f

# All infrastructure logs
docker compose -f docker/docker-compose.infra.yml logs --tail 50

# Specific service logs
docker logs zol-postgres --tail 50
docker logs zol-keycloak --tail 50
docker logs zol-redis --tail 50

# Export logs for analysis
docker logs zol-app --since 2h > /tmp/app-logs.txt 2>&1

Log Rotation

All containers use Docker's json-file log driver with rotation:

logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "5"

Maximum log storage per container: 50 MB (5 files × 10 MB each).

Additional Infrastructure Alerts (Not Yet Provisioned)

The Alerting section above describes the six application-level rules that ship in grafana/provisioning/alerting/zol-rag-alerts.yml. The following infrastructure-level alerts are recommended but not yet provisioned — add them to the YAML in a follow-up pass when ops capacity allows.

| Alert | Condition | Severity | Why it's worth adding |
| --- | --- | --- | --- |
| DiskSpaceLow | Available disk on pilot < 10% | Warning | PostgreSQL + MinIO + Prometheus all fail noisily when the host volume fills |
| PostgresConnExhausted | Active connections > 80% of max_connections | Warning | pgvector is tolerant but the rest of the app deadlocks under exhaustion |
| EmbeddingLatencyHigh | histogram_quantile(0.95, rate(zol_rag_embedding_latency_seconds_bucket[5m])) > 2 | Warning | Embeddings now run against the OpenAI API (Ollama retired April 2026, ADR-0048); a sustained p95 spike signals OpenAI slowness/errors-with-retries and lags ingest + voice retrieval. (zol_rag_embedding_latency_seconds is the only exported embedding metric; there is no error counter.) |
| RedisOOM | Memory usage > 90% of maxmemory | Warning | Cache flapping causes cascade LLM cost spike |
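
The EmbeddingLatencyHigh condition relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation. The Python sketch below is a simplified re-implementation of that idea for classic histograms, meant only to show how a p95 is derived from buckets; the bucket values in the sample are invented, and edge cases such as negative bounds are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count)
    pairs; the list must include the +Inf bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative buckets (seconds, count), for illustration only.
SAMPLE_BUCKETS = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 99), (math.inf, 100)]
```

Because the estimate is interpolated within a bucket, its precision depends on where the bucket boundaries sit relative to the 2-second threshold.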

Container Health Overview

Quick check of all containers:

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
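
To flag problem containers automatically, the status column can be parsed in a script. The sketch below assumes the output of docker ps --format '{{.Names}}\t{{.Status}}' (tab-separated, no table header) and treats anything that is not "Up", or that reports "(unhealthy)", as needing attention; the function name and sample output are hypothetical.

```python
def containers_needing_attention(ps_output):
    """Return {name: status} for containers that are not up or are unhealthy."""
    flagged = {}
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        status = status.strip()
        looks_fine = status.startswith("Up") and "(unhealthy)" not in status
        if not looks_fine:
            flagged[name.strip()] = status
    return flagged


# Hypothetical `docker ps` output, for illustration only.
SAMPLE_PS = (
    "zol-app\tUp 2 hours (healthy)\n"
    "zol-postgres\tUp 2 hours (unhealthy)\n"
    "zol-redis\tRestarting (1) 4 seconds ago\n"
)
```

Containers without a health check show a plain "Up ..." status and are treated as fine here, which may or may not match what an operator wants.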
