Telemetry & Operational Runbooks
This page is the single reference for all telemetry emitted by the ZOL RAG application plus the operational runbooks needed to inspect it and act on what it surfaces. It complements deployment/monitoring.md — that page covers infrastructure-level health (containers, DB, Redis); this page covers application-level telemetry (RAG pipeline metrics, structured logs, evaluation traces, and how to operate the Grafana/Prometheus stack).
If you are reviewing the pilot for the first time, read this in order:
- Where telemetry lives (next section) — orient yourself to the four telemetry surfaces.
- Operating Grafana — get a dashboard open in 5 minutes.
- Prometheus metrics catalog — what every metric means and when it matters.
- Structured-log event catalog — grep-able tags you can use without Grafana.
- Runbooks — recipes for common operational tasks.
Where telemetry lives — the four surfaces
| Surface | What it carries | How to access |
|---|---|---|
| Prometheus metrics (/metrics) | Counters, gauges, histograms: quantitative time-series | Grafana dashboards, or curl /metrics |
| Structured logs (stdout JSON in prod, key=value in dev) | Per-event records with structlog context: intent, latency, chunks=N, [CLUSTER2] tags | docker logs zol-app, or log aggregator (ELK / Loki / CloudWatch) |
| Evaluation traces (s["timing"] dict on the chat orchestrator) | Per-request introspection: which intent, how many chunks, did the LLM bail to no-info, did the disclaimer fire | Captured in DB tables pipeline_telemetry + golden-eval runs |
| Audit DB tables | pipeline_telemetry, feedback_events, session_feedback, ingest_runs | PostgreSQL — see "Querying audit tables" below |
Each surface answers a different question:
- Metrics → "is the system healthy now, and how does it trend?"
- Logs → "what happened on that specific request?"
- Eval traces → "did the pipeline make the right decisions?"
- Audit tables → "what did users tell us afterwards?"
Operating Grafana — five-minute orientation
Accessing dashboards
Grafana on the pilot is bound to 127.0.0.1:3000 (loopback only — never exposed publicly), so any laptop session needs an SSH tunnel + the rotating admin password. The repo ships a helper that does both in one command and auto-tears-down on Ctrl+C.
# From your laptop — opens Grafana in your browser with credentials printed
./scripts/observability.sh
The script:
- Verifies SSH key reachability (and tells you what ssh-add command to run if it can't connect non-interactively; see the pilot SSH note)
- Checks the local port is free (override with GRAFANA_LOCAL_PORT=3001)
- Auto-fetches the current GRAFANA_ADMIN_PASSWORD from /opt/zol-rag/.env.prod
- Opens an SSH tunnel localhost:3000 → pilot 127.0.0.1:3000
- Launches the default browser at http://localhost:3000
- Prints username + password to the terminal so you can paste them in
- Cleans up the tunnel on Ctrl+C
Manual fallback
If you'd rather drive the tunnel yourself (e.g., on Windows, or when the script's auto-browser misbehaves):
ssh -L 3000:localhost:3000 deploy@88.99.184.57
# Then open http://localhost:3000 in your browser
ssh deploy@88.99.184.57 'grep GRAFANA_ADMIN_PASSWORD /opt/zol-rag/.env.prod'
Provisioned dashboards
Eight dashboards ship with the deployment under Dashboards → Browse. The System Overview is an executive index that links into the specialist dashboards; the SLO Status dashboard is the stakeholder view tied to error budgets.
| Dashboard | Top panels you'll likely care about |
|---|---|
| ZOL RAG - System Overview | 8 stat panels (backend up, request rate, p95 latency, error %, LLM spend today, ingest status, voice TTFT p95, refusal rate) with clickable links to the specialist dashboards below |
| ZOL RAG - Pipeline Overview | Intent distribution, stage-latency breakdown (intent / retrieval / rerank / llm / safety), safety-refusal rate, graph injections, no-info bail rate |
| ZOL RAG - Infrastructure Health | HTTP request rate + 5xx by route, process CPU + RSS, vector search latency, Python GC pause time |
| ZOL RAG - LLM & Cost Tracking | Authoritative Postgres-backed daily / weekly / monthly USD by model, plus Prometheus since-restart tokens-in / tokens-out / cost counters |
| ZOL RAG - Voice Channel | Voice TTFT p50 / p95 / p99, safety escalations by reason, LLM-judge per-dimension scores (faithfulness / relevance / safety / fluency), speculative-STT hit rate + latency saved |
| ZOL RAG - Safety & Compliance | Refusals + voice escalations over time, refusal-rate %, citation-attached % (from Postgres), CRAG decisions, channel split |
| ZOL RAG - Ingest Pipeline | Latest-run status + duration, run-history table, crawl corpus state, failure-class distribution, failed-URL table (all Postgres-backed — no Prometheus side) |
| ZOL RAG - SLO Status | Six headline SLO stats (availability, 5xx rate, RAG p95, voice TTFT p95, LLM error rate, medical-advice incidents) with red/yellow/green thresholds; error-budget panels for 5xx and LLM |
Dashboards are version-controlled in grafana/dashboards/. To edit: change them in Grafana, then export the JSON and check it into the repo. Do NOT edit the JSON files directly — Grafana's export format has many computed fields and hand-editing produces malformed dashboards that fail provisioning.
Postgres-backed panels — the dual-backend pattern
Several panels on the LLM & Cost Tracking, Ingest Pipeline, and Safety & Compliance dashboards query the application database directly through the Grafana postgres datasource instead of Prometheus. This is intentional:
- Prometheus side: counters and histograms scraped every 15s; great for rates, p95s, and "what's happening right now." Resets on container restart, so you lose yesterday's cumulative spend.
- Postgres side: app.analytics_events, app.ingest_runs, app.crawled_urls; restart-safe and authoritative. Used wherever the panel needs historical accuracy (daily / weekly / monthly cost, ingest run-history, citation-attached %) rather than fresh trend data.
The same dashboard often has both: the cost dashboard shows "since-restart" Prometheus counters next to "month-to-date" Postgres-backed cost so you get both the live signal and the auditable number.
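To make the "resets on container restart" point concrete, here is a minimal Python sketch of how a counter's increase is derived from raw samples when a restart happens in the middle. It follows the simplified reset rule Prometheus applies to counters (any drop is treated as a restart from zero) without the extrapolation that rate() and increase() add. The sample values are invented.

```python
def counter_increase(samples):
    """Total increase over consecutive counter samples, treating any drop as a restart."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A counter only goes up; a lower value means the process restarted from zero,
        # so everything counted since the restart is the new sample itself.
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical zol_rag_llm_cost_usd_total samples with one container restart after 3.0.
spend = [0.0, 1.5, 3.0, 0.2, 0.9]

print("raw counter now:", spend[-1])                     # what a since-restart stat panel shows
print("reset-aware increase:", counter_increase(spend))  # what was spent over the whole window
```

The raw value after the restart understates spend for the window, which is why the auditable cost panels read from Postgres instead.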
When Grafana looks broken
The most common failure modes and their fixes:
| Symptom | Probable cause | Fix |
|---|---|---|
| "No data" on all panels | Prometheus can't reach the backend /metrics endpoint | docker exec zol-prometheus wget -O- http://zol-app:80/metrics \| head. If this fails, check that the backend is up and that the zol-app container alias resolves on the Docker network |
| Some panels green, some empty | Metric label drift after a code change (e.g., we renamed an intent value but the dashboard hard-codes the old name) | Click the panel → Inspect → Query Inspector. The PromQL {intent="..."} filter shows the value the panel expects; cross-reference with current UserIntent enum |
| Dashboard provisioning fails on container start | Malformed JSON in grafana/dashboards/*.json | docker logs zol-grafana \| grep -i error, then find the offending file and re-export it from a working Grafana instance |
| Grafana login redirects to a 502 page | Grafana container OOM-killed (rare; uses ~80MB) | docker compose -f docker-compose.app.yml restart grafana |
| Postgres-backed panel shows "No data" but curl of the datasource API returns rows | database: zol_rag key in the wrong place in the datasource yaml | See callout below |
Datasource provisioning pitfall — database: under jsonData:
The Grafana Postgres datasource provisioning yaml at grafana/datasources/prometheus.yml requires database: zol_rag to live under jsonData:, NOT at the top level of the datasource block. When it sits at the top level, the /api/ds/query proxy tolerates it (so a curl smoke test against the datasource returns data and looks fine), but the panel-render path silently fails — the panels show "No data" with no server-side error log.
The actual error only surfaces in the browser console: "You do not currently have a default database configured for this data source. Postgres requires a default database with which to connect." This burned six wrong-direction debug iterations chasing "No data" on cost panels.
Rule for next time: when a Grafana panel shows "No data" while a manual curl against the same datasource API returns data, the bug is almost certainly in the datasource yaml — and the diagnostic is in the browser console, not Grafana server logs. Open DevTools first.
The correct shape:
datasources:
  - name: postgres
    type: postgres
    url: zol-postgres:5432
    user: zolrag
    secureJsonData:
      password: ${POSTGRES_PASSWORD}
    jsonData:
      database: zol_rag  # MUST live here, not at the top level
      sslmode: disable
      postgresVersion: 1600
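Because the misplacement fails silently, a small pre-deploy check can catch it. The following is a sketch only: the function name is hypothetical, and it operates on a datasource block that has already been parsed into a dict (for example with a YAML loader of your choice), so it needs nothing beyond the standard library.

```python
def check_postgres_datasource(ds: dict) -> list[str]:
    """Return human-readable problems with where `database` sits in one datasource block."""
    problems = []
    json_data = ds.get("jsonData") or {}
    if "database" in ds:
        problems.append("database is at the top level; move it under jsonData")
    if not json_data.get("database"):
        problems.append("jsonData.database is missing")
    return problems


# The correct shape from above, and the broken shape described in the callout.
good = {"name": "postgres", "type": "postgres",
        "jsonData": {"database": "zol_rag", "sslmode": "disable"}}
bad = {"name": "postgres", "type": "postgres", "database": "zol_rag",
       "jsonData": {"sslmode": "disable"}}

print(check_postgres_datasource(good))  # no problems reported
print(check_postgres_datasource(bad))   # both problems reported
```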
Prometheus directly
Prometheus on the pilot is docker-network-only — it has no host port binding (docker inspect zol-prometheus shows "9090/tcp":null), so ssh -L 9090:localhost:9090 lands on a closed port. There are three correct ways to query it:
1. From your laptop via the observability helper (recommended)
./scripts/observability.sh --prom-query 'rate(zol_rag_queries_total[5m])'
./scripts/observability.sh --prom-query 'histogram_quantile(0.95, rate(zol_rag_query_latency_seconds_bucket[5m]))'
This SSH-execs docker exec zol-prometheus wget … against the pilot's Prometheus and pretty-prints the JSON with jq if available.
2. From the pilot host directly
ssh deploy@88.99.184.57
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=zol_rag_queries_total'
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=rate(zol_rag_queries_total[5m])'
3. Through Grafana → Explore (best for iterative exploration)
Once you've launched ./scripts/observability.sh, Grafana's Explore tab (compass icon in the left rail) gives you a full PromQL query editor with autocomplete and time-range controls, backed by the same Prometheus datasource the dashboards use. This is usually the fastest path for ad-hoc work.
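The /api/v1/query calls above return Prometheus's standard JSON envelope. If jq is not available, a short Python sketch like the one below can flatten an instant-vector result into label/value pairs. The function name and the sample payload (including the intent value "faq") are illustrative assumptions; only the envelope shape comes from the Prometheus HTTP API.

```python
import json


def summarize_instant_query(raw: str) -> dict:
    """Map each series' label set to its float value for an instant-vector result."""
    body = json.loads(raw)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    summary = {}
    for series in data["result"]:
        labels = ",".join(
            f"{k}={v}" for k, v in sorted(series["metric"].items()) if k != "__name__"
        )
        _timestamp, value = series["value"]  # the sample value arrives as a string
        summary[labels] = float(value)
    return summary


# Hypothetical response for: zol_rag_queries_total
sample = """{"status":"success","data":{"resultType":"vector","result":[
 {"metric":{"__name__":"zol_rag_queries_total","intent":"faq","status":"success"},"value":[1715600000,"42"]},
 {"metric":{"__name__":"zol_rag_queries_total","intent":"faq","status":"error"},"value":[1715600000,"3"]}]}}"""

print(summarize_instant_query(sample))
```

Range queries (matrix results) have a different shape and would need their own handling.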
Alerting
Grafana ships with six Prometheus alert rules provisioned from grafana/provisioning/alerting/:
| File | Purpose |
|---|---|
| zol-rag-alerts.yml | The six rules below, grouped as zol-rag-core |
| contact-points.yml | Email + Slack webhook destinations (template only; ops must edit before pilot deploy) |
| notification-policies.yml | Routing tree: which severities go where (template only; ops must edit before pilot deploy) |
The six rules
| Rule | Severity | Condition | Why it exists |
|---|---|---|---|
| BackendDown | critical | up == 0 for 1m | Prometheus can't scrape the backend at all; the system is down |
| HighErrorRate | critical | 5xx ratio > 1% for 5m | Sustained server errors above the SLO error budget |
| LLMCostBurnRate | warning | spend > $5/hr for 10m | Catches runaway model usage (loops, prompt regressions, abuse) before the day's bill blows up |
| SafetyRefusalSpike | warning | refusal rate 5x the 1h baseline | Sudden refusal spike usually means a prompt regression made answers fail the safety check, not that traffic suddenly went malicious |
| VoiceTTFTHigh | warning | voice TTFT p95 > 2000ms for 10m | Voice UX falls apart above 2s TTFT |
| LLMCircuitOpen | critical | LLM error rate > 20% for 5m | Proxy alert; see caveat below |
LLMCircuitOpen proxy caveat
The application's LLM circuit breaker has internal closed / open / half-open state but does not currently expose a dedicated llm_circuit_state gauge. Until that gauge exists, LLMCircuitOpen is a proxy — it fires on sustained high LLM error rate, which is what would cause the breaker to open in the first place. False positives are possible (e.g., a model returning malformed JSON for a small subset of queries can spike the error rate without actually tripping the breaker). When it fires, confirm against /health/ready (which does expose the live circuit state) before treating it as a confirmed circuit-open event.
What ops MUST change before deploy
contact-points.yml ships with placeholder email addresses and a placeholder Slack webhook. Until these are replaced with the production destinations, no alert can actually route — Grafana evaluates them, logs the firing state, and silently drops the notification. The pre-deploy checklist:
- Edit contact-points.yml: set the on-call email distribution list and the Slack webhook URL (or remove the Slack receiver if you're email-only).
- Edit notification-policies.yml: confirm the severity-to-receiver routing matches your team's preferences (e.g., critical → both email + Slack with 0s group_wait; warning → Slack only with 5m group_wait).
- Run docker compose -f docker/docker-compose.infra.yml -f docker/docker-compose.app.yml restart grafana. The volume mounts in both compose files pick the updated YAML up on restart.
- Trigger a test alert via Grafana UI (Alerting → Contact points → Test) to confirm the destination actually receives it before pilot traffic depends on it.
Prometheus metrics catalog
All metrics are emitted from backend/app/api/metrics.py and surfaced via the /metrics endpoint. Recording helpers live in backend/app/infrastructure/metrics.py (the indirection avoids a circular import).
HTTP layer
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| zol_rag_requests_total | Counter | method, endpoint, status_code | Throughput and error rate per route |
| zol_rag_request_latency_seconds | Histogram | method, endpoint | p50 / p95 / p99 latency per route |
| zol_rag_websocket_connections_active | Gauge | (none) | Currently-connected WS clients |
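The p50 / p95 / p99 figures derived from these histograms are estimates, not exact percentiles. As a rough illustration, the sketch below reproduces the linear interpolation that PromQL's histogram_quantile performs over cumulative buckets. It is a simplified stand-in written for this page, and the bucket bounds and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf; report the last finite bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented latency buckets in seconds: 100 requests in total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

With these invented buckets the p95 falls inside the 0.5 to 1.0 s bucket, at roughly 0.81 s. The practical consequence is that the precision of a dashboard percentile is bounded by the bucket widths configured on the histogram.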
RAG pipeline
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| zol_rag_queries_total | Counter | status (success/error), intent | Query volume by intent class; useful for tracking whether institutional_treatment_info and doctor_schedule_query are firing as expected after Cluster 1/3 |
| zol_rag_query_latency_seconds | Histogram | intent | End-to-end RAG latency per intent class |
| zol_rag_pipeline_stage_seconds | Histogram | stage (intent / retrieval / rerank / llm / safety) | Stage-level breakdown; use it to find the slowest step |
| zol_rag_safety_refusals_total | Counter | reason (regex / llm_judge / guardrail / etc.) | Safety-blocked answer rate; track per-reason to spot prompt regressions |
| zol_rag_cache_hits_total / zol_rag_cache_misses_total | Counter | cache_type (intent / semantic / faq) | Cache effectiveness; a sub-50% hit rate is a deploy-day surprise |
LLM cost & reliability
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| zol_rag_llm_tokens_total | Counter | model, direction (input/output) | Token consumption per model |
| zol_rag_llm_cost_usd_total | Counter | model | Cumulative USD spend per model |
| zol_rag_llm_requests_total | Counter | model, status | Model usage + error rate |
| zol_rag_llm_fallbacks_total | Counter | from_model, to_model, reason | Circuit-breaker activations; every spike here is an investigation |
Vector / graph / reranker
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| zol_rag_embedding_latency_seconds | Histogram | provider | Embedding-API call time |
| zol_rag_vector_search_latency_seconds | Histogram | (none) | pgvector search time |
| zol_rag_reranker_latency_seconds | Histogram | reranker | Reranker model call time |
| zol_rag_graph_injections_total | Counter | strategy | How often graph-augmented retrieval was used |
| zol_rag_crag_decisions_total | Counter | decision (correct / incorrect / ambiguous) | Corrective RAG outcomes |
Evaluation & corpus health
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
| evaluation_scores | Histogram | metric_type (faithfulness / relevancy / context_precision / context_recall) | DeepEval LLM-judge scores from golden runs |
| document_count | Gauge | tenant_id | Live corpus size per tenant |
Structured-log event catalog
Production logs are JSON; development uses colored key=value. Both have the same field set — query the same way regardless of environment.
Application tags emitted by the chat orchestrator
These are grep-able tags surfaced specifically for the Cluster 1–3 fixes shipped 2026-05-13. They are the primary way to audit whether each fix is actually firing on prod traffic.
| Tag | Where emitted | What it means |
|---|---|---|
| [INTENT] Institutional treatment-info pattern detected | intent_classification_service.py:detect_institutional_treatment_query | Cluster 1 fix fired: Q37/Q38/Q42-class query routed to institutional intent |
| [INTENT] Prepended medical-content disclaimer | safety_mixin.py:_qs_apply_intent_disclaimer | Disclaimer prepended on an institutional answer (Cluster 1) |
| [CLUSTER2] no_info_with_chunks chunks=<N> | safety_mixin.py:_qs_log_no_info_with_chunks_warning | Regression: LLM bailed to no-info despite N retrieved chunks. WARNING-level. Triage by reading the chunk contents from the matching request id |
| [doctor_schedule_tool] no completed doc found | doctor_schedule_tool.py:lookup_doctor_schedule | Cluster 3 tool fired but the doctor wasn't found; falls back to the LLM path |
| [SAFETY] Medical advice pattern detected | intent_classification_service.py:detect_medical_advice_query | Q41-class personal-symptom triage caught by the pre-LLM regex |
Searching production logs
# All Cluster 2 telemetry warnings from the last 24h
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep CLUSTER2'
# Every institutional-info query routed today
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep "Institutional treatment-info pattern detected"'
# Every safety refusal today, grouped by reason
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep "\[SAFETY\]" | awk -F"\\(" "{print \$2}" | sort | uniq -c'
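Since production logs are JSON, the same searches can be done on parsed records rather than raw text, which makes grouping easier. The sketch below counts matching records per log level. It assumes the message lives under an "event" key and the level under "level" (the usual structlog defaults, not confirmed for this application); the function name and sample lines are illustrative.

```python
import json
from collections import Counter


def count_events(lines, needle):
    """Count JSON log records whose message contains `needle`, grouped by log level."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks or startup banners
        if not isinstance(record, dict):
            continue
        # "event" and "level" are assumed key names; adjust to the real log schema.
        if needle in str(record.get("event", "")):
            counts[record.get("level", "unknown")] += 1
    return counts


# Invented log lines in the assumed shape.
sample = [
    '{"event": "[CLUSTER2] no_info_with_chunks chunks=4", "level": "warning", "request_id": "abc123-def456"}',
    '{"event": "[INTENT] Prepended medical-content disclaimer", "level": "info", "request_id": "abc123-def456"}',
    "Traceback (most recent call last):",
]

print(count_events(sample, "[CLUSTER2]"))
```

To run it against real traffic, feed it sys.stdin and pipe the docker logs output from the commands above into the script.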
Eval-trace fields (in s["timing"])
The chat orchestrator records per-request introspection into a dict that gets persisted to app.pipeline_telemetry. These are the fields the golden-eval harness reads.
| Field | Type | Notes |
|---|---|---|
| intent | string | The classified intent (member of UserIntent enum) |
| intent_classification_ms | float | Time spent in the intent classifier |
| retrieval_ms | float | Time spent retrieving chunks (vector + graph) |
| llm_ms | float | Time spent in the response-generation LLM call |
| safety_ms | float | Time spent in regex + LLM-judge safety validation |
| safety_regex_violations | int | Number of regex safety violations on the response |
| safety_llm_violations | int | Number of LLM-judge violations |
| retrieval_chunks_returned | int | Cluster 2: chunk count before LLM call |
| answer_says_no_info | bool | Cluster 2: detector flagged the final answer as a Class C no-info template |
When the eval harness sees retrieval_chunks_returned > 0 AND answer_says_no_info == true, that's the Q5-class regression: the prompt-side fix should have prevented it, and the telemetry catches the cases where it didn't.
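Expressed as code, the regression condition is a two-field predicate over the timing dict. This is a minimal sketch with hypothetical function names, not the harness's actual check; it only uses the two field names from the table above.

```python
def is_no_info_regression(timing: dict) -> bool:
    """True when chunks were retrieved but the answer still fell back to no-info."""
    chunks = timing.get("retrieval_chunks_returned") or 0  # treat missing/None as zero
    return chunks > 0 and bool(timing.get("answer_says_no_info"))


def regression_rate(timings: list) -> float:
    """Share of requests that match the regression condition (0.0 for an empty list)."""
    if not timings:
        return 0.0
    return sum(is_no_info_regression(t) for t in timings) / len(timings)


# Invented timing dicts for illustration.
runs = [
    {"intent": "institutional_treatment_info", "retrieval_chunks_returned": 5, "answer_says_no_info": True},
    {"intent": "institutional_treatment_info", "retrieval_chunks_returned": 5, "answer_says_no_info": False},
    {"intent": "doctor_schedule_query", "retrieval_chunks_returned": 0, "answer_says_no_info": True},
    {"intent": "doctor_schedule_query", "retrieval_chunks_returned": 3, "answer_says_no_info": False},
]

print([is_no_info_regression(t) for t in runs])
print(regression_rate(runs))
```

Note that a no-info answer with zero retrieved chunks is not counted: that is a retrieval gap, not the prompt-side regression this telemetry targets.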
Runbooks
Runbook 1 — Backfill consultation_schedule on pilot (Cluster 3)
Background. ADR-0058 Layer C added an automatic schedule extractor at ingest time, but doctor profiles ingested before the extractor shipped have a NULL metadata->>'consultation_schedule'. The query_doctor_schedule tool (introduced in Cluster 3) falls back to the LLM-reads-markdown path when this key is NULL. Running the backfill populates the key for all existing doctor profiles and unlocks the tool's structured-JSON answer path for Q25/Q27.
Script. backend/scripts/backfill_consultation_schedule.py (idempotent — safe to run repeatedly).
Steps (on pilot host).
ssh deploy@88.99.184.57
cd /opt/zol-rag
# Sanity check before: how many doctor profiles will the script touch?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT COUNT(*)
FROM app.documents
WHERE status='completed'
AND (metadata->>'consultation_schedule' IS NULL
OR metadata->>'consultation_schedule' = 'null')
AND EXISTS (
SELECT 1 FROM app.document_chunks c
WHERE c.document_id = documents.id
AND c.content LIKE '%| MA | Di | WO%'
);
"
# Run the backfill (idempotent — re-runnable if it fails midway)
docker exec zol-app python -m scripts.backfill_consultation_schedule
# Sanity check after: how many doctor profiles now have the JSON?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT COUNT(*)
FROM app.documents
WHERE status='completed'
AND metadata->>'consultation_schedule' IS NOT NULL
AND metadata->>'consultation_schedule' != 'null';
"
Expected outcome. Before-count should drop to zero (or near-zero — any non-zero remainder is profiles that don't have the schedule table; these are normal).
Verification. Q25 ("Is er raadpleging voor Dr. Matthias Dupont op woensdag?") should now answer "Ja, Dr. Dupont houdt 2-wekelijks raadpleging op woensdagvoormiddag" — exercise it on the pilot chat UI.
Runbook 2 — Find a specific request in the logs
When a user reports a bad answer, you need the structured-log record for that specific request. The chat path emits the request ID as the request_id field on every log line.
# 1. User reports "I asked about laadpalen at 14:32 and got no info"
# 2. Find the request id (the stem "laadpal" matches both laadpaal and laadpalen)
ssh deploy@88.99.184.57 'docker logs --since 1h zol-app 2>&1 | grep -i laadpal | head'
# 3. Replay the full request lifecycle by request_id
REQ_ID=abc123-def456
ssh deploy@88.99.184.57 "docker logs zol-app 2>&1 | grep $REQ_ID"
# 4. Inspect the persisted telemetry row
ssh deploy@88.99.184.57 'docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT intent, retrieval_chunks_returned, answer_says_no_info,
safety_regex_violations, total_ms
FROM app.pipeline_telemetry WHERE request_id = '"'"'abc123-def456'"'"';"'
Runbook 3 — Querying audit DB tables directly
For deeper analysis than logs can give, query the audit tables.
-- Top 10 intents by volume in the last 24h
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '24 hours'
GROUP BY intent
ORDER BY n DESC LIMIT 10;
-- Cluster 2 regression check — how often are we bailing to no-info
-- despite having chunks?
SELECT
COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info)
AS regression_count,
COUNT(*) AS total_queries,
(COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info))::float
/ NULLIF(COUNT(*), 0) AS regression_rate
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days';
-- Cluster 1 firing rate — what fraction of queries hit the new intent?
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days'
AND intent IN ('institutional_treatment_info', 'doctor_schedule_query',
'out_of_scope_medical_advice')
GROUP BY intent;
-- Negative-feedback events with full request context
SELECT f.id, f.created_at, f.rating, f.user_comment,
pt.intent, pt.retrieval_chunks_returned, pt.answer_says_no_info
FROM app.session_feedback f
LEFT JOIN app.pipeline_telemetry pt ON pt.request_id = f.request_id
WHERE f.rating IN ('negative', 'disputed')
AND f.created_at > now() - interval '7 days'
ORDER BY f.created_at DESC LIMIT 50;
Runbook 4 — Add a new metric
You want to track a new value on prod traffic. The flow:
- Define the Prometheus metric in backend/app/api/metrics.py:
    MY_NEW_METRIC = Counter(
        "my_new_metric_total",
        "Description of what this counts",
        ["label1", "label2"],
    )
- Add a recording helper in backend/app/infrastructure/metrics.py:
    def record_my_new_event(label1: str, label2: str) -> None:
        _m().MY_NEW_METRIC.labels(label1=label1, label2=label2).inc()
- Call from your code via safe_record:
    from app.infrastructure.metrics import safe_record, record_my_new_event
    safe_record(record_my_new_event, "value1", "value2")
- Deploy. The metric appears at /metrics immediately. Add a panel to the relevant Grafana dashboard. Check it into git via the dashboard export → JSON pipeline.
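The body of safe_record is not shown on this page. The name suggests a wrapper that keeps telemetry failures away from the request path, and the sketch below illustrates that pattern under that assumption. It is deliberately named safe_record_sketch to make clear it is not the real helper, and the two recorder functions are stand-ins.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_record_sketch(recorder: Callable[..., None], *args: Any, **kwargs: Any) -> None:
    """Call a metrics helper but never let a telemetry failure reach the caller."""
    try:
        recorder(*args, **kwargs)
    except Exception:
        # Log and move on: a broken metric must not fail a user-facing request.
        logger.warning(
            "metric recording failed: %s",
            getattr(recorder, "__name__", recorder),
            exc_info=True,
        )


recorded = []


def record_ok(label1: str, label2: str) -> None:
    """Stand-in for a working recording helper."""
    recorded.append((label1, label2))


def record_broken(label1: str, label2: str) -> None:
    """Stand-in for a helper that fails, e.g. because of a label mismatch."""
    raise RuntimeError("registry unavailable")


safe_record_sketch(record_ok, "value1", "value2")
safe_record_sketch(record_broken, "value1", "value2")  # logged, not raised
print(recorded)
```

The design trade-off of this pattern is that a misconfigured metric fails quietly, so it is worth confirming at /metrics that a new metric actually increments after the first deploy.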
Runbook 5 — Re-run the 50-Q MedChat vs ZOL benchmark
After a deploy, validate the comparison-report-v5 cluster fixes still hold:
ssh deploy@88.99.184.57
cd /opt/zol-rag
# Run the benchmark against pilot via the new --pilot-golden flag
docker exec zol-app python -m tests.evaluation.run_evaluation \
--pilot-golden \
--base-url https://test.medchat.health \
--output /tmp/benchmark-$(date +%Y%m%d-%H%M).json
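# Inspect the summary of the most recent run (the glob has to expand inside the container, hence sh -c)
docker exec zol-app sh -c 'cat "$(ls -t /tmp/benchmark-*.json | head -1)"' | python -m json.tool | head -40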
The harness loads questions from the DB-backed app.golden_questions table (302 seed rows + any /add-to-golden-derived feedback rows). Compare wins / avg / per-question scores against the prior run to confirm the cluster fixes landed without regression.
Reading-list pointers
- backend/app/api/metrics.py - Prometheus metric object definitions
- backend/app/infrastructure/metrics.py - recording helpers + safe_record
- backend/app/services/rag/safety_mixin.py - Cluster 1+2 telemetry hooks
- backend/app/services/doctor_schedule_tool.py - Cluster 3 tool + intent integration
- grafana/dashboards/ - versioned dashboard JSON
- grafana/datasources/ - Prometheus + Loki provisioning
- deployment/monitoring.md - infrastructure-level monitoring (this page's companion)