
Telemetry & Operational Runbooks

This page is the single reference for all telemetry emitted by the ZOL RAG application plus the operational runbooks needed to inspect it and act on what it surfaces. It complements deployment/monitoring.md — that page covers infrastructure-level health (containers, DB, Redis); this page covers application-level telemetry (RAG pipeline metrics, structured logs, evaluation traces, and how to operate the Grafana/Prometheus stack).

If you are reviewing the pilot for the first time, read this in order:

  1. Where telemetry lives (next section) — orient yourself to the four telemetry surfaces.
  2. Operating Grafana — get a dashboard open in 5 minutes.
  3. Prometheus metrics catalog — what every metric means and when it matters.
  4. Structured-log event catalog — grep-able tags you can use without Grafana.
  5. Runbooks — recipes for common operational tasks.

Where telemetry lives — the four surfaces

| Surface | What it carries | How to access |
| --- | --- | --- |
| Prometheus metrics (/metrics) | Counters, gauges, histograms: quantitative time series | Grafana dashboards, or curl /metrics |
| Structured logs (stdout JSON in prod, key=value in dev) | Per-event records with structlog context: intent, latency, chunks=N, [CLUSTER2] tags | docker logs zol-app, or log aggregator (ELK / Loki / CloudWatch) |
| Evaluation traces (s["timing"] dict on the chat orchestrator) | Per-request introspection: which intent, how many chunks, did the LLM bail to no-info, did the disclaimer fire | Captured in DB tables pipeline_telemetry + golden-eval runs |
| Audit DB tables | pipeline_telemetry, feedback_events, session_feedback, ingest_runs | PostgreSQL; see "Querying audit tables" below |

Each surface answers a different question:

  • Metrics → "is the system healthy now, and how does it trend?"
  • Logs → "what happened on that specific request?"
  • Eval traces → "did the pipeline make the right decisions?"
  • Audit tables → "what did users tell us afterwards?"

Operating Grafana — five-minute orientation

Accessing dashboards

Grafana on the pilot is bound to 127.0.0.1:3000 (loopback only — never exposed publicly), so any laptop session needs an SSH tunnel + the rotating admin password. The repo ships a helper that does both in one command and auto-tears-down on Ctrl+C.

# From your laptop — opens Grafana in your browser with credentials printed
./scripts/observability.sh

The script:

  • Verifies SSH key reachability (and tells you what ssh-add command to run if it can't connect non-interactively — see the pilot SSH note)
  • Checks the local port is free (override with GRAFANA_LOCAL_PORT=3001)
  • Auto-fetches the current GRAFANA_ADMIN_PASSWORD from /opt/zol-rag/.env.prod
  • Opens an SSH tunnel localhost:3000 → pilot 127.0.0.1:3000
  • Launches the default browser at http://localhost:3000
  • Prints username + password to the terminal so you can paste them in
  • Cleans up the tunnel on Ctrl+C

Manual fallback

If you'd rather drive the tunnel yourself (e.g., on Windows, or when the script's auto-browser misbehaves):

ssh -L 3000:localhost:3000 deploy@88.99.184.57
# Then open http://localhost:3000 in your browser
ssh deploy@88.99.184.57 'grep GRAFANA_ADMIN_PASSWORD /opt/zol-rag/.env.prod'

Provisioned dashboards

Eight dashboards ship with the deployment under Dashboards → Browse. The System Overview is an executive index that links into the specialist dashboards; the SLO Status dashboard is the stakeholder view tied to error budgets.

| Dashboard | Top panels you'll likely care about |
| --- | --- |
| ZOL RAG - System Overview | 8 stat panels (backend up, request rate, p95 latency, error %, LLM spend today, ingest status, voice TTFT p95, refusal rate) with clickable links to the specialist dashboards below |
| ZOL RAG - Pipeline Overview | Intent distribution, stage-latency breakdown (intent / retrieval / rerank / llm / safety), safety-refusal rate, graph injections, no-info bail rate |
| ZOL RAG - Infrastructure Health | HTTP request rate + 5xx by route, process CPU + RSS, vector search latency, Python GC pause time |
| ZOL RAG - LLM & Cost Tracking | Authoritative Postgres-backed daily / weekly / monthly USD by model, plus Prometheus since-restart tokens-in / tokens-out / cost counters |
| ZOL RAG - Voice Channel | Voice TTFT p50 / p95 / p99, safety escalations by reason, LLM-judge per-dimension scores (faithfulness / relevance / safety / fluency), speculative-STT hit rate + latency saved |
| ZOL RAG - Safety & Compliance | Refusals + voice escalations over time, refusal-rate %, citation-attached % (from Postgres), CRAG decisions, channel split |
| ZOL RAG - Ingest Pipeline | Latest-run status + duration, run-history table, crawl corpus state, failure-class distribution, failed-URL table (all Postgres-backed; no Prometheus side) |
| ZOL RAG - SLO Status | Six headline SLO stats (availability, 5xx rate, RAG p95, voice TTFT p95, LLM error rate, medical-advice incidents) with red/yellow/green thresholds; error-budget panels for 5xx and LLM |

Dashboards are version-controlled in grafana/dashboards/. To edit: change them in Grafana, then export the JSON and check it into the repo. Do NOT edit the JSON files directly — Grafana's export format has many computed fields and hand-editing produces malformed dashboards that fail provisioning.

Postgres-backed panels — the dual-backend pattern

Several panels on the LLM & Cost Tracking, Ingest Pipeline, and Safety & Compliance dashboards query the application database directly through the Grafana postgres datasource instead of Prometheus. This is intentional:

  • Prometheus side: counters and histograms scraped every 15s; great for rates, p95s, and "what's happening right now." Resets on container restart, so you lose yesterday's cumulative spend.
  • Postgres side: app.analytics_events, app.ingest_runs, app.crawled_urls; restart-safe and authoritative. Used wherever the panel needs historical accuracy (daily / weekly / monthly cost, ingest run-history, citation-attached %) rather than fresh trend data.

The same dashboard often has both: the cost dashboard shows "since-restart" Prometheus counters next to "month-to-date" Postgres-backed cost so you get both the live signal and the auditable number.

When Grafana looks broken

The most common failure modes and their fixes:

| Symptom | Probable cause | Fix |
| --- | --- | --- |
| "No data" on all panels | Prometheus can't reach the backend /metrics endpoint | docker exec zol-prometheus wget -O- http://zol-app:80/metrics \| head. If this fails, check the backend is up and the zol-app container alias resolves on the Docker network |
| Some panels green, some empty | Metric label drift after a code change (e.g., we renamed an intent value but the dashboard hard-codes the old name) | Click the panel → Inspect → Query Inspector. The PromQL {intent="..."} filter shows the value the panel expects; cross-reference with current UserIntent enum |
| Dashboard provisioning fails on container start | Malformed JSON in grafana/dashboards/*.json | docker logs zol-grafana \| grep -i error. Find the offending file and re-export from a working Grafana instance |
| Grafana login redirects to a 502 page | Grafana container OOM-killed (rare; uses ~80MB) | docker compose -f docker-compose.app.yml restart grafana |
| Postgres-backed panel shows "No data" but curl of the datasource API returns rows | database: zol_rag key in the wrong place in the datasource yaml | See callout below |

Datasource provisioning pitfall — database: under jsonData:

The Grafana Postgres datasource provisioning yaml at grafana/datasources/prometheus.yml requires database: zol_rag to live under jsonData:, NOT at the top level of the datasource block. When it sits at the top level, the /api/ds/query proxy tolerates it (so a curl smoke test against the datasource returns data and looks fine), but the panel-render path silently fails — the panels show "No data" with no server-side error log.

The actual error only surfaces in the browser console: "You do not currently have a default database configured for this data source. Postgres requires a default database with which to connect." This burned six wrong-direction debug iterations chasing "No data" on cost panels.

Rule for next time: when a Grafana panel shows "No data" while a manual curl against the same datasource API returns data, the bug is almost certainly in the datasource yaml — and the diagnostic is in the browser console, not Grafana server logs. Open DevTools first.

The correct shape:

datasources:
  - name: postgres
    type: postgres
    url: zol-postgres:5432
    user: zolrag
    secureJsonData:
      password: ${POSTGRES_PASSWORD}
    jsonData:
      database: zol_rag  # MUST live here, not at the top level
      sslmode: disable
      postgresVersion: 1600
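
The placement rule above can also be checked mechanically before a deploy. The following is a minimal sketch, not part of the repo: it assumes the provisioning file has already been parsed into a dict (with whichever YAML loader you prefer), and the function name misplaced_database_keys is hypothetical.

```python
from typing import Any


def misplaced_database_keys(config: dict[str, Any]) -> list[str]:
    """Return names of Postgres datasources with no `database` key under jsonData."""
    offenders = []
    for ds in config.get("datasources", []):
        if ds.get("type") != "postgres":
            continue  # other datasource types do not need a default database here
        json_data = ds.get("jsonData") or {}
        if "database" not in json_data:
            offenders.append(ds.get("name", "<unnamed>"))
    return offenders


# Parsed form of the failing shape: `database` sits at the top level of the block.
broken = {"datasources": [{"name": "postgres", "type": "postgres",
                           "database": "zol_rag",
                           "jsonData": {"sslmode": "disable"}}]}
# Parsed form of the correct shape shown above.
fixed = {"datasources": [{"name": "postgres", "type": "postgres",
                          "jsonData": {"database": "zol_rag", "sslmode": "disable"}}]}

print(misplaced_database_keys(broken))  # ['postgres']
print(misplaced_database_keys(fixed))   # []
```

A non-empty result means the panels would likely hit the "No data" symptom even though a curl against the datasource API succeeds.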

Prometheus directly

Prometheus on the pilot is docker-network-only — it has no host port binding (docker inspect zol-prometheus shows "9090/tcp":null), so ssh -L 9090:localhost:9090 lands on a closed port. There are three correct ways to query it:

1. From your laptop via the observability helper (recommended)

./scripts/observability.sh --prom-query 'rate(zol_rag_queries_total[5m])'
./scripts/observability.sh --prom-query 'histogram_quantile(0.95, rate(zol_rag_query_latency_seconds_bucket[5m]))'

This SSH-execs docker exec zol-prometheus wget … against the pilot's Prometheus and pretty-prints the JSON with jq if available.

2. From the pilot host directly

ssh deploy@88.99.184.57
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=zol_rag_queries_total'
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=rate(zol_rag_queries_total[5m])'

3. Through Grafana → Explore (best for iterative exploration)

Once you've launched ./scripts/observability.sh, Grafana's Explore tab (compass icon in the left rail) gives you a full PromQL query editor with autocomplete and time-range controls, backed by the same Prometheus datasource the dashboards use. This is usually the fastest path for ad-hoc work.
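
For options 1 and 2, the response is the standard Prometheus HTTP API JSON, which is easy to post-process in a script. Below is a minimal sketch using only the standard library. The helper names instant_query_url and parse_instant_vector are hypothetical, and the parser only handles instant-vector results (the shape returned for queries such as the rate() examples above), not scalars or range queries.

```python
import json
from urllib.parse import urlencode


def instant_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build an /api/v1/query URL with the PromQL expression safely encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(body: str) -> dict[tuple, float]:
    """Map each series' sorted label pairs to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # Prometheus sends the value as a string
        series[labels] = float(value)
    return series


# Illustrative response body; the numbers are made up.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"intent": "doctor_schedule_query", "status": "success"},
         "value": [1700000000, "0.25"]},
    ]},
})

print(instant_query_url("rate(zol_rag_queries_total[5m])"))
print(parse_instant_vector(sample))
```

Encoding the expression with urlencode avoids the shell-quoting problems that brackets and parentheses in PromQL tend to cause.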


Alerting

Grafana ships with six Prometheus alert rules provisioned from grafana/provisioning/alerting/:

| File | Purpose |
| --- | --- |
| zol-rag-alerts.yml | The six rules below, grouped as zol-rag-core |
| contact-points.yml | Email + Slack webhook destinations. Template only, ops must edit before pilot deploy |
| notification-policies.yml | Routing tree (which severities go where). Template only, ops must edit before pilot deploy |

The six rules

| Rule | Severity | Condition | Why it exists |
| --- | --- | --- | --- |
| BackendDown | critical | up == 0 for 1m | Prometheus can't scrape the backend at all; the system is down |
| HighErrorRate | critical | 5xx ratio > 1% for 5m | Sustained server errors above the SLO error budget |
| LLMCostBurnRate | warning | spend > $5/hr for 10m | Catches runaway model usage (loops, prompt regressions, abuse) before the day's bill blows up |
| SafetyRefusalSpike | warning | refusal rate 5x the 1h baseline | Sudden refusal spike usually means a prompt regression made answers fail the safety check, not that traffic suddenly went malicious |
| VoiceTTFTHigh | warning | voice TTFT p95 > 2000ms for 10m | Voice UX falls apart above 2s TTFT |
| LLMCircuitOpen | critical | LLM error rate > 20% for 5m | Proxy alert; see caveat below |
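
The "for 5m" and "for 10m" qualifiers mean the condition has to hold continuously across the window before the rule fires; a single evaluation below the threshold resets it. The sketch below illustrates that behaviour for HighErrorRate under stated assumptions: a 1-minute evaluation interval (not confirmed by this page), made-up request counts, and hypothetical helper names. It is not how Grafana implements pending periods internally.

```python
def error_ratio(errors: float, total: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def sustained(samples: list[float], threshold: float, required: int) -> bool:
    """True when the most recent `required` samples all exceed `threshold`."""
    if len(samples) < required:
        return False
    return all(value > threshold for value in samples[-required:])


# (5xx count, total requests) per evaluation, oldest first.
# Five evaluations at the assumed 1m interval stand in for "for 5m".
window = [(0, 400), (6, 400), (5, 400), (8, 400), (9, 400), (7, 400)]
ratios = [error_ratio(errors, total) for errors, total in window]

print(sustained(ratios, threshold=0.01, required=5))      # True: last five all above 1%
print(sustained(ratios[:5], threshold=0.01, required=5))  # False: the healthy first sample is still in the window
```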

LLMCircuitOpen proxy caveat

The application's LLM circuit breaker has internal closed / open / half-open state but does not currently expose a dedicated llm_circuit_state gauge. Until that gauge exists, LLMCircuitOpen is a proxy — it fires on sustained high LLM error rate, which is what would cause the breaker to open in the first place. False positives are possible (e.g., a model returning malformed JSON for a small subset of queries can spike the error rate without actually tripping the breaker). When it fires, confirm against /health/ready (which does expose the live circuit state) before treating it as a confirmed circuit-open event.

What ops MUST change before deploy

contact-points.yml ships with placeholder email addresses and a placeholder Slack webhook. Until these are replaced with the production destinations, no alert can actually route: Grafana evaluates the rules, logs the firing state, and silently drops the notification. The pre-deploy checklist:

  1. Edit contact-points.yml — set the on-call email distribution list and the Slack webhook URL (or remove the Slack receiver if you're email-only).
  2. Edit notification-policies.yml — confirm the severity-to-receiver routing matches your team's preferences (e.g., critical → both email + Slack with 0s group_wait; warning → Slack only with 5m group_wait).
  3. docker compose -f docker/docker-compose.infra.yml -f docker/docker-compose.app.yml restart grafana — the volume mounts in both compose files pick the updated YAML up on restart.
  4. Trigger a test alert via Grafana UI (Alerting → Contact points → Test) to confirm the destination actually receives it before pilot traffic depends on it.

Prometheus metrics catalog

All metrics are defined in backend/app/api/metrics.py and exposed via the /metrics endpoint. Recording helpers live in backend/app/infrastructure/metrics.py (the indirection avoids a circular import).

HTTP layer

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| zol_rag_requests_total | Counter | method, endpoint, status_code | Throughput and error rate per route |
| zol_rag_request_latency_seconds | Histogram | method, endpoint | p50 / p95 / p99 latency per route |
| zol_rag_websocket_connections_active | Gauge | (none) | Currently-connected WS clients |

RAG pipeline

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| zol_rag_queries_total | Counter | status (success/error), intent | Query volume by intent class; useful for tracking whether institutional_treatment_info and doctor_schedule_query are firing as expected after Cluster 1/3 |
| zol_rag_query_latency_seconds | Histogram | intent | End-to-end RAG latency per intent class |
| zol_rag_pipeline_stage_seconds | Histogram | stage (intent / retrieval / rerank / llm / safety) | Stage-level breakdown; find the slowest step |
| zol_rag_safety_refusals_total | Counter | reason (regex / llm_judge / guardrail / etc.) | Safety-blocked answer rate; track per-reason to spot prompt regressions |
| zol_rag_cache_hits_total / zol_rag_cache_misses_total | Counter | cache_type (intent / semantic / faq) | Cache effectiveness; sub-50% hit rate is a deploy-day surprise |

LLM cost & reliability

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| zol_rag_llm_tokens_total | Counter | model, direction (input/output) | Token consumption per model |
| zol_rag_llm_cost_usd_total | Counter | model | Cumulative USD spend per model |
| zol_rag_llm_requests_total | Counter | model, status | Model usage + error rate |
| zol_rag_llm_fallbacks_total | Counter | from_model, to_model, reason | Circuit-breaker activations; every spike here is an investigation |

Vector / graph / reranker

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| zol_rag_embedding_latency_seconds | Histogram | provider | Embedding-API call time |
| zol_rag_vector_search_latency_seconds | Histogram | (none) | pgvector search time |
| zol_rag_reranker_latency_seconds | Histogram | reranker | Reranker model call time |
| zol_rag_graph_injections_total | Counter | strategy | How often graph-augmented retrieval was used |
| zol_rag_crag_decisions_total | Counter | decision (correct / incorrect / ambiguous) | Corrective RAG outcomes |

Evaluation & corpus health

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| evaluation_scores | Histogram | metric_type (faithfulness / relevancy / context_precision / context_recall) | DeepEval LLM-judge scores from golden runs |
| document_count | Gauge | tenant_id | Live corpus size per tenant |

Structured-log event catalog

Production logs are JSON; development uses colored key=value. Both have the same field set — query the same way regardless of environment.

Application tags emitted by the chat orchestrator

These are grep-able tags surfaced specifically for the Cluster 1–3 fixes shipped 2026-05-13. They are the primary way to audit whether each fix is actually firing on prod traffic.

| Tag | Where emitted | What it means |
| --- | --- | --- |
| [INTENT] Institutional treatment-info pattern detected | intent_classification_service.py:detect_institutional_treatment_query | Cluster 1 fix fired: Q37/Q38/Q42-class query routed to institutional intent |
| [INTENT] Prepended medical-content disclaimer | safety_mixin.py:_qs_apply_intent_disclaimer | Disclaimer prepended on an institutional answer (Cluster 1) |
| [CLUSTER2] no_info_with_chunks chunks=<N> | safety_mixin.py:_qs_log_no_info_with_chunks_warning | Regression: LLM bailed to no-info despite N retrieved chunks. WARNING-level. Triage by reading the chunk contents from the matching request id |
| [doctor_schedule_tool] no completed doc found | doctor_schedule_tool.py:lookup_doctor_schedule | Cluster 3 tool fired but the doctor wasn't found; fallback to LLM path |
| [SAFETY] Medical advice pattern detected | intent_classification_service.py:detect_medical_advice_query | Q41-class personal-symptom triage caught by the pre-LLM regex |

Searching production logs

# All Cluster 2 telemetry warnings from the last 24h
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep CLUSTER2'

# Every institutional-info query routed today
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep "Institutional treatment-info pattern detected"'

# Every safety refusal today, grouped by reason
ssh deploy@88.99.184.57 'docker logs --since 24h zol-app 2>&1 | grep "\[SAFETY\]" | awk -F"\\(" "{print \$2}" | sort | uniq -c'
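
Because production logs are JSON, the same searches can be done on parsed fields instead of raw text, for example by piping docker logs into a small script. The following is a minimal sketch, assuming each record carries the message under an event key (structlog's default) plus level and request_id keys. Those field names and the sample lines are assumptions, so check one real log line before relying on them.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def matching_records(lines: Iterable[str], needle: str, field: str = "event") -> Iterator[dict]:
    """Yield parsed JSON log records whose `field` contains `needle`."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip tracebacks and other non-JSON output
        if isinstance(record, dict) and needle in str(record.get(field, "")):
            yield record


# Illustrative lines only; real records will carry more fields.
sample_lines = [
    '{"event": "[CLUSTER2] no_info_with_chunks chunks=4", "level": "warning", "request_id": "r1"}',
    '{"event": "[INTENT] Prepended medical-content disclaimer", "level": "info", "request_id": "r2"}',
    "Traceback (most recent call last):",
    '{"event": "[CLUSTER2] no_info_with_chunks chunks=2", "level": "warning", "request_id": "r3"}',
]

hits = list(matching_records(sample_lines, "[CLUSTER2]"))
print([record["request_id"] for record in hits])    # ['r1', 'r3']
print(Counter(record["level"] for record in hits))  # Counter({'warning': 2})
```

To run it against real logs, replace sample_lines with sys.stdin and pipe the docker logs output into the script.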

Eval-trace fields (in s["timing"])

The chat orchestrator records per-request introspection into a dict that gets persisted to app.pipeline_telemetry. These are the fields the golden-eval harness reads.

| Field | Type | Notes |
| --- | --- | --- |
| intent | string | The classified intent (member of UserIntent enum) |
| intent_classification_ms | float | Time spent in the intent classifier |
| retrieval_ms | float | Time spent retrieving chunks (vector + graph) |
| llm_ms | float | Time spent in the response-generation LLM call |
| safety_ms | float | Time spent in regex + LLM-judge safety validation |
| safety_regex_violations | int | Number of regex safety violations on the response |
| safety_llm_violations | int | Number of LLM-judge violations |
| retrieval_chunks_returned | int | Cluster 2: chunk count before LLM call |
| answer_says_no_info | bool | Cluster 2: detector flagged the final answer as a Class C no-info template |

When the eval harness sees retrieval_chunks_returned > 0 AND answer_says_no_info == true, that is the Q5-class regression. The prompt-side fix should have prevented it; the telemetry catches the cases where it did not.
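
The regression condition can be written as a small predicate over the timing dict. This is a minimal sketch using the two field names from the table above; the function names and sample rows are hypothetical, and the eval harness may implement the check differently.

```python
from typing import Any, Mapping, Optional, Sequence


def is_no_info_regression(timing: Mapping[str, Any]) -> bool:
    """True when chunks were retrieved but the final answer is still a no-info template."""
    chunks = timing.get("retrieval_chunks_returned") or 0
    return chunks > 0 and bool(timing.get("answer_says_no_info"))


def regression_rate(rows: Sequence[Mapping[str, Any]]) -> Optional[float]:
    """Share of rows that are regressions; None when there are no rows to divide by."""
    if not rows:
        return None
    return sum(is_no_info_regression(row) for row in rows) / len(rows)


# Made-up telemetry rows for illustration.
rows = [
    {"intent": "institutional_treatment_info", "retrieval_chunks_returned": 5, "answer_says_no_info": True},
    {"intent": "doctor_schedule_query", "retrieval_chunks_returned": 3, "answer_says_no_info": False},
    {"intent": "institutional_treatment_info", "retrieval_chunks_returned": 0, "answer_says_no_info": True},
    {"intent": "doctor_schedule_query", "retrieval_chunks_returned": 4, "answer_says_no_info": False},
]

print(is_no_info_regression(rows[0]))  # True: five chunks, yet a no-info answer
print(is_no_info_regression(rows[2]))  # False: nothing was retrieved, so no-info is legitimate
print(regression_rate(rows))           # 0.25
```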


Runbooks

Runbook 1 — Backfill consultation_schedule on pilot (Cluster 3)

Background. ADR-0058 Layer C added an automatic schedule extractor at ingest time, but doctor profiles ingested before the extractor shipped have a NULL metadata->>'consultation_schedule'. The query_doctor_schedule tool (introduced in Cluster 3) falls back to the LLM-reads-markdown path when this JSON key is NULL. Running the backfill populates the key for all existing doctor profiles and unlocks the tool's structured-JSON answer path for Q25/Q27.

Script. backend/scripts/backfill_consultation_schedule.py (idempotent — safe to run repeatedly).

Steps (on pilot host).

ssh deploy@88.99.184.57
cd /opt/zol-rag

# Sanity check before: how many doctor profiles will the script touch?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT COUNT(*)
FROM app.documents
WHERE status='completed'
AND (metadata->>'consultation_schedule' IS NULL
OR metadata->>'consultation_schedule' = 'null')
AND EXISTS (
SELECT 1 FROM app.document_chunks c
WHERE c.document_id = documents.id
AND c.content LIKE '%| MA | Di | WO%'
);
"

# Run the backfill (idempotent — re-runnable if it fails midway)
docker exec zol-app python -m scripts.backfill_consultation_schedule

# Sanity check after: how many doctor profiles now have the JSON?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT COUNT(*)
FROM app.documents
WHERE status='completed'
AND metadata->>'consultation_schedule' IS NOT NULL
AND metadata->>'consultation_schedule' != 'null';
"

Expected outcome. Re-running the "before" query should now return zero or close to it. Any non-zero remainder is profiles that don't have the schedule table; these are normal.

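Verification. Q25 ("Is er raadpleging voor Dr. Matthias Dupont op woensdag?", in English "Does Dr. Matthias Dupont hold consultations on Wednesday?") should now answer "Ja, Dr. Dupont houdt 2-wekelijks raadpleging op woensdagvoormiddag" ("Yes, Dr. Dupont holds consultations every two weeks on Wednesday morning"). Exercise it on the pilot chat UI.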


Runbook 2 — Find a specific request in the logs

When a user reports a bad answer, you need the structured-log record for that specific request. The chat path emits the request ID as a request_id field on every log line.

# 1. User reports "I asked about laadpalen at 14:32 and got no info"
# 2. Find the request id
ssh deploy@88.99.184.57 'docker logs --since 1h zol-app 2>&1 | grep -i laadpaal | head'

# 3. Replay the full request lifecycle by request_id
REQ_ID=abc123-def456
ssh deploy@88.99.184.57 "docker logs zol-app 2>&1 | grep $REQ_ID"

# 4. Inspect the persisted telemetry row
ssh deploy@88.99.184.57 'docker exec zol-postgres psql -U zolrag -d zol_rag -c "
SELECT intent, retrieval_chunks_returned, answer_says_no_info,
safety_regex_violations, total_ms
FROM app.pipeline_telemetry WHERE request_id = '"'"'abc123-def456'"'"';"'
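
The nested quoting in step 4 is easy to get wrong by hand. A minimal sketch of generating the remote command with shlex instead is shown below. The function name is hypothetical, and the character check on the request id is an assumption about what ids look like (letters, digits, hyphens), added because the id is interpolated into SQL.

```python
import shlex


def remote_telemetry_command(request_id: str) -> str:
    """Build the remote command string that looks up one request's telemetry row."""
    # The id ends up inside a SQL literal, so accept only characters an id should contain.
    if not request_id or not all(c.isalnum() or c == "-" for c in request_id):
        raise ValueError(f"unexpected request id: {request_id!r}")
    sql = (
        "SELECT intent, retrieval_chunks_returned, answer_says_no_info, "
        "safety_regex_violations, total_ms "
        f"FROM app.pipeline_telemetry WHERE request_id = '{request_id}';"
    )
    docker_args = ["docker", "exec", "zol-postgres", "psql",
                   "-U", "zolrag", "-d", "zol_rag", "-c", sql]
    # shlex.join quotes each argument so the remote shell sees them intact.
    return shlex.join(docker_args)


command = remote_telemetry_command("abc123-def456")
print(command)
```

The resulting string can be passed as the single remote argument, for example subprocess.run(["ssh", "deploy@88.99.184.57", command]), which avoids a second layer of local shell quoting.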

Runbook 3 — Querying audit DB tables directly

For deeper analysis than logs can give, query the audit tables.

-- Top 10 intents by volume in the last 24h
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '24 hours'
GROUP BY intent
ORDER BY n DESC LIMIT 10;

-- Cluster 2 regression check — how often are we bailing to no-info
-- despite having chunks?
SELECT
COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info)
AS regression_count,
COUNT(*) AS total_queries,
(COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info))::float
/ NULLIF(COUNT(*), 0) AS regression_rate
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days';

-- Cluster 1 firing rate — what fraction of queries hit the new intent?
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days'
AND intent IN ('institutional_treatment_info', 'doctor_schedule_query',
'out_of_scope_medical_advice')
GROUP BY intent;

-- Negative-feedback events with full request context
SELECT f.id, f.created_at, f.rating, f.user_comment,
pt.intent, pt.retrieval_chunks_returned, pt.answer_says_no_info
FROM app.session_feedback f
LEFT JOIN app.pipeline_telemetry pt ON pt.request_id = f.request_id
WHERE f.rating IN ('negative', 'disputed')
AND f.created_at > now() - interval '7 days'
ORDER BY f.created_at DESC LIMIT 50;

Runbook 4 — Add a new metric

You want to track a new value on prod traffic. The flow:

  1. Define the Prometheus metric in backend/app/api/metrics.py:

    MY_NEW_METRIC = Counter(
        "my_new_metric_total",
        "Description of what this counts",
        ["label1", "label2"],
    )
  2. Add a recording helper in backend/app/infrastructure/metrics.py:

    def record_my_new_event(label1: str, label2: str) -> None:
        _m().MY_NEW_METRIC.labels(label1=label1, label2=label2).inc()
  3. Call from your code via safe_record:

    from app.infrastructure.metrics import safe_record, record_my_new_event
    safe_record(record_my_new_event, "value1", "value2")
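  4. Deploy. The metric is registered at /metrics as soon as the new code is running. With the standard prometheus_client behaviour, a labelled counter only shows sample lines after its first recorded event, so trigger one before concluding it is missing. Add a panel to the relevant Grafana dashboard. Check it into git via the dashboard export → JSON pipeline.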
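
This page does not show the body of safe_record. One plausible shape, given its name and how step 3 uses it, is a wrapper that never lets a telemetry failure break the request path. The sketch below is an illustration under that assumption, not the code in backend/app/infrastructure/metrics.py; the boolean return value and the two sample recorders are hypothetical.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def safe_record(recorder: Callable[..., None], *args: Any, **kwargs: Any) -> bool:
    """Call a recording helper; log and swallow any failure. Returns True on success."""
    try:
        recorder(*args, **kwargs)
        return True
    except Exception:
        name = getattr(recorder, "__name__", repr(recorder))
        logger.warning("metric recording failed: %s", name, exc_info=True)
        return False


calls = []


def record_ok(label1: str, label2: str) -> None:
    """Stand-in for a working recording helper."""
    calls.append((label1, label2))


def record_broken(label1: str, label2: str) -> None:
    """Stand-in for a helper that fails, e.g. on a label mismatch."""
    raise RuntimeError("label mismatch")


print(safe_record(record_ok, "value1", "value2"))      # True
print(safe_record(record_broken, "value1", "value2"))  # False, and the caller carries on
```

The design point is that a mistake in a metric definition costs you a data point and a warning in the logs, not a failed user request.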


Runbook 5 — Re-run the 50-Q MedChat vs ZOL benchmark

After a deploy, validate the comparison-report-v5 cluster fixes still hold:

ssh deploy@88.99.184.57
cd /opt/zol-rag

# Run the benchmark against pilot via the new --pilot-golden flag
docker exec zol-app python -m tests.evaluation.run_evaluation \
--pilot-golden \
--base-url https://test.medchat.health \
--output /tmp/benchmark-$(date +%Y%m%d-%H%M).json

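# Inspect the summary of the newest run.
# The glob has to expand inside the container, where the file lives, not on the host.
docker exec zol-app sh -c 'cat "$(ls -t /tmp/benchmark-*.json | head -n 1)"' | python -m json.tool | head -40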

The harness loads questions from the DB-backed app.golden_questions table (302 seed rows + any /add-to-golden-derived feedback rows). Compare wins / avg / per-question scores against the prior run to confirm the cluster fixes landed without regression.
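
Comparing against the prior run can be scripted once the per-question scores are extracted from both output files. The benchmark JSON layout is not documented on this page, so the sketch below assumes you have already reduced each run to a mapping of question id to score. The function name, the tolerance, and the sample scores are all hypothetical.

```python
def score_regressions(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Questions present in both runs whose score dropped by more than `tolerance`."""
    return {
        qid: (previous[qid], current[qid])
        for qid in sorted(previous.keys() & current.keys())
        if previous[qid] - current[qid] > tolerance
    }


# Made-up scores for illustration.
previous = {"Q5": 0.9, "Q25": 0.8, "Q37": 0.7}
current = {"Q5": 0.6, "Q25": 0.8, "Q37": 0.75, "Q41": 0.9}

print(score_regressions(previous, current))  # {'Q5': (0.9, 0.6)}
```

Questions that exist in only one run (Q41 here) are ignored, so newly added golden rows do not show up as regressions.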


Reading-list pointers

  • backend/app/api/metrics.py — Prometheus metric object definitions
  • backend/app/infrastructure/metrics.py — recording helpers + safe_record
  • backend/app/services/rag/safety_mixin.py — Cluster 1+2 telemetry hooks
  • backend/app/services/doctor_schedule_tool.py — Cluster 3 tool + intent integration
  • grafana/dashboards/ — versioned dashboard JSON
  • grafana/datasources/ — Prometheus + Loki provisioning
  • deployment/monitoring.md — infrastructure-level monitoring (this page's companion)