Telemetry & Operational Runbooks

This page is the single reference for all telemetry emitted by the ZOL RAG application plus the operational runbooks needed to inspect it and act on what it surfaces. It complements deployment/monitoring.md — that page covers infrastructure-level health (containers, DB, Redis); this page covers application-level telemetry (RAG pipeline metrics, structured logs, evaluation traces, and how to operate the Grafana/Prometheus stack).

If you are reviewing the pilot for the first time, read this in order:

1. Where telemetry lives (next section): orient yourself to the four telemetry surfaces.
2. Operating Grafana: get a dashboard open in 5 minutes.
3. Prometheus metrics catalog: what every metric means and when it matters.
4. Structured-log event catalog: grep-able tags you can use without Grafana.
5. Runbooks: recipes for common operational tasks.

Where telemetry lives — the four surfaces

Surface	What it carries	How to access
Prometheus metrics (`/metrics`)	Counters, gauges, histograms — quantitative time-series	Grafana dashboards, or `curl /metrics`
Structured logs (stdout JSON in prod, key=value in dev)	Per-event records with structlog context: `intent`, `latency`, `chunks=N`, `[CLUSTER2]` tags	`docker logs zol-app`, or log aggregator (ELK / Loki / CloudWatch)
Evaluation traces (`s["timing"]` dict on the chat orchestrator)	Per-request introspection: which intent, how many chunks, did the LLM bail to no-info, did the disclaimer fire	Captured in DB tables `pipeline_telemetry` + golden-eval runs
Audit DB tables	`pipeline_telemetry`, `feedback_events`, `session_feedback`, `ingest_runs`	PostgreSQL — see "Querying audit tables" below

Each surface answers a different question:

Metrics → "is the system healthy now, and how does it trend?"
Logs → "what happened on that specific request?"
Eval traces → "did the pipeline make the right decisions?"
Audit tables → "what did users tell us afterwards?"

Operating Grafana — five-minute orientation

Accessing dashboards

Grafana on the pilot is bound to 127.0.0.1:3000 (loopback only — never exposed publicly), so any laptop session needs an SSH tunnel + the rotating admin password. The repo ships a helper that does both in one command and auto-tears-down on Ctrl+C.

# From your laptop — opens Grafana in your browser with credentials printed
./scripts/observability.sh

The script:

Verifies SSH key reachability (and tells you what ssh-add command to run if it can't connect non-interactively — see the pilot SSH note)
Checks the local port is free (override with GRAFANA_LOCAL_PORT=3001)
Auto-fetches the current GRAFANA_ADMIN_PASSWORD from <ENV_FILE>
Opens an SSH tunnel localhost:3000 → pilot 127.0.0.1:3000
Launches the default browser at http://localhost:3000
Prints username + password to the terminal so you can paste them in
Cleans up the tunnel on Ctrl+C
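
The port check and tunnel steps in the list above can be sketched in Python. This is an illustration of the idea only, not the contents of scripts/observability.sh: the function names are hypothetical, while GRAFANA_LOCAL_PORT and the 127.0.0.1:3000 binding come from this page.

```python
import socket


def resolve_local_port(env):
    """Pick the local tunnel port, honouring the GRAFANA_LOCAL_PORT override."""
    return int(env.get("GRAFANA_LOCAL_PORT", "3000"))


def local_port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port, so ssh -L could claim it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def tunnel_command(target, local_port, remote_port=3000):
    """Build the ssh argv that forwards localhost:local_port to the pilot's loopback Grafana."""
    # -N: forward ports only, do not run a remote command.
    return ["ssh", "-N", "-L", f"{local_port}:127.0.0.1:{remote_port}", target]
```

A caller would pass `os.environ` to `resolve_local_port`, bail out with a hint about GRAFANA_LOCAL_PORT when `local_port_is_free` returns False, and otherwise hand the argv to `subprocess.Popen` so the tunnel can be terminated on Ctrl+C.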

Manual fallback

If you'd rather drive the tunnel yourself (e.g., on Windows, or when the script's auto-browser misbehaves):

ssh -L 3000:127.0.0.1:3000 <DEPLOY_USER>@<PILOT_HOST>
# Then open http://localhost:3000 in your browser
ssh <DEPLOY_USER>@<PILOT_HOST> 'grep GRAFANA_ADMIN_PASSWORD <ENV_FILE>'

Provisioned dashboards

Eight dashboards ship with the deployment under Dashboards → Browse. The System Overview is an executive index that links into the specialist dashboards; the SLO Status dashboard is the stakeholder view tied to error budgets.

Dashboard	Top panels you'll likely care about
ZOL RAG - System Overview	8 stat panels (backend up, request rate, p95 latency, error %, LLM spend today, ingest status, voice TTFT p95, refusal rate) with clickable links to the specialist dashboards below
ZOL RAG - Pipeline Overview	Intent distribution, stage-latency breakdown (intent / retrieval / rerank / llm / safety), safety-refusal rate, graph injections, no-info bail rate
ZOL RAG - Infrastructure Health	HTTP request rate + 5xx by route, process CPU + RSS, vector search latency, Python GC pause time
ZOL RAG - LLM & Cost Tracking	Authoritative Postgres-backed daily / weekly / monthly USD by model, plus Prometheus since-restart tokens-in / tokens-out / cost counters
ZOL RAG - Voice Channel	Voice TTFT p50 / p95 / p99, safety escalations by reason, LLM-judge per-dimension scores (faithfulness / relevance / safety / fluency), speculative-STT hit rate + latency saved
ZOL RAG - Safety & Compliance	Refusals + voice escalations over time, refusal-rate %, citation-attached % (from Postgres), CRAG decisions, channel split
ZOL RAG - Ingest Pipeline	Latest-run status + duration, run-history table, crawl corpus state, failure-class distribution, failed-URL table (all Postgres-backed — no Prometheus side)
ZOL RAG - SLO Status	Six headline SLO stats (availability, 5xx rate, RAG p95, voice TTFT p95, LLM error rate, medical-advice incidents) with red/yellow/green thresholds; error-budget panels for 5xx and LLM

Dashboards are version-controlled in grafana/dashboards/. To edit: change them in Grafana, then export the JSON and check it into the repo. Do NOT edit the JSON files directly — Grafana's export format has many computed fields and hand-editing produces malformed dashboards that fail provisioning.

Postgres-backed panels — the dual-backend pattern

Several panels on the LLM & Cost Tracking, Ingest Pipeline, and Safety & Compliance dashboards query the application database directly through the Grafana postgres datasource instead of Prometheus. This is intentional:

Prometheus side — counters and histograms scraped every 15s; great for rates, p95s, and "what's happening right now." Resets on container restart, so you lose yesterday's cumulative spend.
Postgres side — app.analytics_events, app.ingest_runs, app.crawled_urls; restart-safe and authoritative. Used wherever the panel needs historical accuracy (daily / weekly / monthly cost, ingest run-history, citation-attached %) rather than fresh trend data.

The same dashboard often has both: the cost dashboard shows "since-restart" Prometheus counters next to "month-to-date" Postgres-backed cost so you get both the live signal and the auditable number.

When Grafana looks broken

The most common failure modes and their fixes:

Symptom	Probable cause	Fix
"No data" on all panels	Prometheus can't reach the backend `/metrics` endpoint	`docker exec zol-prometheus wget -O- http://zol-app:80/metrics | head`; if this fails, check that the backend is up and that the `zol-app` container alias resolves on the Docker network
Some panels green, some empty	Metric label drift after a code change (e.g., we renamed an `intent` value but the dashboard hard-codes the old name)	Click the panel → Inspect → Query Inspector. The PromQL `{intent="..."}` filter shows the value the panel expects; cross-reference with current `UserIntent` enum
Dashboard provisioning fails on container start	Malformed JSON in `grafana/dashboards/*.json`	`docker logs zol-grafana 2>&1 | grep -i error`; find the offending file and re-export from a working Grafana instance
Grafana login redirects to a 502 page	Grafana container OOM-killed (rare; uses ~80MB)	`docker compose -f docker-compose.app.yml restart grafana`
Postgres-backed panel shows "No data" but `curl` of the datasource API returns rows	`database: zol_rag` key in the wrong place in the datasource yaml	See callout below

Datasource provisioning pitfall — `database:` under `jsonData:`

The Grafana Postgres datasource provisioning yaml at grafana/datasources/prometheus.yml requires database: zol_rag to live under jsonData:, NOT at the top level of the datasource block. When it sits at the top level, the /api/ds/query proxy tolerates it (so a curl smoke test against the datasource returns data and looks fine), but the panel-render path silently fails — the panels show "No data" with no server-side error log.

The actual error only surfaces in the browser console: "You do not currently have a default database configured for this data source. Postgres requires a default database with which to connect." This burned six wrong-direction debug iterations chasing "No data" on cost panels.

Rule for next time: when a Grafana panel shows "No data" while a manual curl against the same datasource API returns data, the bug is almost certainly in the datasource yaml — and the diagnostic is in the browser console, not Grafana server logs. Open DevTools first.

The correct shape:

datasources:
  - name: postgres
    type: postgres
    url: zol-postgres:5432
    user: zolrag
    secureJsonData:
      password: ${POSTGRES_PASSWORD}
    jsonData:
      database: zol_rag        # MUST live here, not at the top level
      sslmode: disable
      postgresVersion: 1600
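
Because the failure is silent on the server side, a pre-deploy lint can catch the misplaced key before anyone opens a dashboard. The sketch below is a hypothetical helper, not something that ships in the repo; it assumes the provisioning file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`, which is an assumption about available tooling).

```python
def misplaced_database_keys(provisioning):
    """Return the names of postgres datasources whose `database` is not under jsonData."""
    offenders = []
    for ds in provisioning.get("datasources", []):
        if ds.get("type") != "postgres":
            continue  # the rule only applies to the Postgres datasource
        json_data = ds.get("jsonData") or {}
        if "database" not in json_data:
            offenders.append(ds.get("name", "<unnamed>"))
    return offenders


# The broken shape described above: `database` sits at the top level.
broken = {
    "datasources": [
        {"name": "postgres", "type": "postgres", "database": "zol_rag",
         "jsonData": {"sslmode": "disable"}},
    ]
}
print(misplaced_database_keys(broken))
```

An empty list means every Postgres datasource carries `database` where the panel-render path looks for it.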

Prometheus directly

Prometheus on the pilot is docker-network-only — it has no host port binding (docker inspect zol-prometheus shows "9090/tcp":null), so ssh -L 9090:localhost:9090 lands on a closed port. There are three correct ways to query it:

1. From your laptop via the observability helper (recommended)

./scripts/observability.sh --prom-query 'rate(zol_rag_queries_total[5m])'
./scripts/observability.sh --prom-query 'histogram_quantile(0.95, rate(zol_rag_query_latency_seconds_bucket[5m]))'

This SSH-execs docker exec zol-prometheus wget … against the pilot's Prometheus and pretty-prints the JSON with jq if available.

2. From the pilot host directly

ssh <DEPLOY_USER>@<PILOT_HOST>
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=zol_rag_queries_total'
docker exec zol-prometheus wget -qO- 'http://localhost:9090/api/v1/query?query=rate(zol_rag_queries_total[5m])'

3. Through Grafana → Explore (best for iterative exploration)

Once you've launched ./scripts/observability.sh, Grafana's Explore tab (compass icon in the left rail) gives you a full PromQL query editor with autocomplete and time-range controls, backed by the same Prometheus datasource the dashboards use. This is usually the fastest path for ad-hoc work.
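
For scripted checks, the JSON returned by the query API in options 1 and 2 is easy to consume. The following is a minimal sketch with hypothetical helper names: it percent-encodes a PromQL expression for Prometheus' `/api/v1/query` endpoint and unpacks an instant-vector response. The base URL is the in-container address used above.

```python
import json
from urllib.parse import urlencode


def instant_query_url(promql, base="http://localhost:9090"):
    """Build an instant-query URL with the expression percent-encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def vector_samples(body):
    """Extract (labels, value) pairs from an instant-vector API response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [unix_ts, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]
```

Encoding matters mostly for expressions containing brackets, quotes, or spaces, which some shells and HTTP clients mangle when pasted raw into a URL.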

Alerting

Grafana ships with six Prometheus alert rules provisioned from grafana/provisioning/alerting/:

File	Purpose
`zol-rag-alerts.yml`	The six rules below, grouped as `zol-rag-core`
`contact-points.yml`	Email + Slack webhook destinations — template only, ops must edit before pilot deploy
`notification-policies.yml`	Routing tree (which severities go where) — template only, ops must edit before pilot deploy

The six rules

Rule	Severity	Condition	Why it exists
`BackendDown`	critical	`up == 0` for 1m	Prometheus can't scrape the backend at all — the system is down
`HighErrorRate`	critical	5xx ratio > 1% for 5m	Sustained server errors above the SLO error budget
`LLMCostBurnRate`	warning	spend > $5/hr for 10m	Catches runaway model usage (loops, prompt regressions, abuse) before the day's bill blows up
`SafetyRefusalSpike`	warning	refusal rate 5x the 1h baseline	Sudden refusal spike usually means a prompt regression made answers fail the safety check, not that traffic suddenly went malicious
`VoiceTTFTHigh`	warning	voice TTFT p95 > 2000ms for 10m	Voice UX falls apart above 2s TTFT
`LLMCircuitOpen`	critical	LLM error rate > 20% for 5m	Proxy alert — see caveat below
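
To make the conditions in the table concrete, here is a simplified Python sketch of three of them. It is not the provisioned rule logic (that lives in zol-rag-alerts.yml as PromQL); the function names, and the strict `>` comparison in the spike check, are assumptions for illustration.

```python
def error_ratio(status_counts):
    """Share of requests with a 5xx status code; 0.0 when there was no traffic."""
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code.startswith("5"))
    return errors / total if total else 0.0


def refusal_spike(current_rate, baseline_1h, factor=5.0):
    """SafetyRefusalSpike-style check: current refusal rate versus the 1h baseline."""
    return baseline_1h > 0 and current_rate > factor * baseline_1h


def breached_for(evaluations, required):
    """Model of the "for 5m" clause: the last `required` evaluations must all breach."""
    return required > 0 and len(evaluations) >= required and all(evaluations[-required:])


window = {"200": 980, "500": 15, "503": 5}
print(error_ratio(window) > 0.01)  # HighErrorRate threshold from the table
```

The `breached_for` helper captures why a single bad scrape does not page anyone: the condition has to hold across every evaluation in the pending window before the alert fires.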

`LLMCircuitOpen` proxy caveat

The application's LLM circuit breaker has internal closed / open / half-open state but does not currently expose a dedicated llm_circuit_state gauge. Until that gauge exists, LLMCircuitOpen is a proxy — it fires on sustained high LLM error rate, which is what would cause the breaker to open in the first place. False positives are possible (e.g., a model returning malformed JSON for a small subset of queries can spike the error rate without actually tripping the breaker). When it fires, confirm against /health/ready (which does expose the live circuit state) before treating it as a confirmed circuit-open event.

What ops MUST change before deploy

contact-points.yml ships with placeholder email addresses and a placeholder Slack webhook. Until these are replaced with the production destinations, no alert can actually route: Grafana evaluates the rules, logs the firing state, and silently drops the notification. The pre-deploy checklist:

Edit contact-points.yml — set the on-call email distribution list and the Slack webhook URL (or remove the Slack receiver if you're email-only).
Edit notification-policies.yml — confirm the severity-to-receiver routing matches your team's preferences (e.g., critical → both email + Slack with 0s group_wait; warning → Slack only with 5m group_wait).
docker compose -f docker/docker-compose.infra.yml -f docker/docker-compose.app.yml restart grafana — the volume mounts in both compose files pick the updated YAML up on restart.
Trigger a test alert via Grafana UI (Alerting → Contact points → Test) to confirm the destination actually receives it before pilot traffic depends on it.

Prometheus metrics catalog

All metrics are defined in backend/app/api/metrics.py and exposed via the /metrics endpoint. Recording helpers live in backend/app/infrastructure/metrics.py (the indirection avoids a circular import).

HTTP layer

Metric	Type	Labels	What it tells you
`zol_rag_requests_total`	Counter	`method`, `endpoint`, `status_code`	Throughput and error rate per route
`zol_rag_request_latency_seconds`	Histogram	`method`, `endpoint`	p50 / p95 / p99 latency per route
`zol_rag_websocket_connections_active`	Gauge	(none)	Currently-connected WS clients

RAG pipeline

Metric	Type	Labels	What it tells you
`zol_rag_queries_total`	Counter	`status` (success/error), `intent`	Query volume by intent class — useful for tracking whether `institutional_treatment_info` and `doctor_schedule_query` are firing as expected after Cluster 1/3
`zol_rag_query_latency_seconds`	Histogram	`intent`	End-to-end RAG latency per intent class
`zol_rag_pipeline_stage_seconds`	Histogram	`stage` (intent / retrieval / rerank / llm / safety)	Stage-level breakdown — find the slowest step
`zol_rag_safety_refusals_total`	Counter	`reason` (regex / llm_judge / guardrail / etc.)	Safety-blocked answer rate — track per-reason to spot prompt regressions
`zol_rag_cache_hits_total` / `zol_rag_cache_misses_total`	Counter	`cache_type` (intent / semantic / faq)	Cache effectiveness — sub-50% hit rate is a deploy-day surprise
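
The p50 / p95 / p99 figures read from these histograms are estimates, not exact percentiles: they are interpolated from cumulative bucket counts. The sketch below is a simplified Python rendering of that interpolation, in the spirit of PromQL's `histogram_quantile` on classic histograms. The bucket bounds in the example are invented for illustration and are not the application's configured buckets.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Target lies beyond the last finite bucket: report that bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: 50 requests <= 0.5s, 90 <= 1s, 98 <= 2s, 100 total.
latency = [(0.5, 50), (1.0, 90), (2.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, latency))
```

The practical consequence is that quantile accuracy depends on bucket placement: a p95 that falls in a wide bucket can be off by a large fraction of that bucket's width.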

LLM cost & reliability

Metric	Type	Labels	What it tells you
`zol_rag_llm_tokens_total`	Counter	`model`, `direction` (input/output)	Token consumption per model
`zol_rag_llm_cost_usd_total`	Counter	`model`	Cumulative USD spend per model
`zol_rag_llm_requests_total`	Counter	`model`, `status`	Model usage + error rate
`zol_rag_llm_fallbacks_total`	Counter	`from_model`, `to_model`, `reason`	Circuit-breaker activations — every spike here is an investigation

Vector / graph / reranker

Metric	Type	Labels	What it tells you
`zol_rag_embedding_latency_seconds`	Histogram	`provider`	Embedding-API call time
`zol_rag_vector_search_latency_seconds`	Histogram	(none)	pgvector search time
`zol_rag_reranker_latency_seconds`	Histogram	`reranker`	Reranker model call time
`zol_rag_graph_injections_total`	Counter	`strategy`	How often graph-augmented retrieval was used
`zol_rag_crag_decisions_total`	Counter	`decision` (correct / incorrect / ambiguous)	Corrective RAG outcomes

Evaluation & corpus health

Metric	Type	Labels	What it tells you
`evaluation_scores`	Histogram	`metric_type` (faithfulness / relevancy / context_precision / context_recall)	DeepEval LLM-judge scores from golden runs
`document_count`	Gauge	`tenant_id`	Live corpus size per tenant

Structured-log event catalog

Production logs are JSON; development uses colored key=value. Both have the same field set — query the same way regardless of environment.
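
Since both renderings carry the same fields, a small normaliser lets one script handle either environment. The following is a sketch under assumptions: the exact dev rendering (ANSI colour codes around tokens, shell-style quoting of values with spaces) is inferred from the description above rather than taken from the logging configuration, and `parse_log_line` is a hypothetical helper.

```python
import json
import re
import shlex

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # colour escape sequences used by console renderers


def parse_log_line(line):
    """Parse one log line in either the JSON (prod) or key=value (dev) rendering."""
    line = ANSI.sub("", line).strip()
    if line.startswith("{"):
        return {key: str(value) for key, value in json.loads(line).items()}
    try:
        tokens = shlex.split(line)
    except ValueError:
        # Unbalanced quotes (for example an apostrophe in the message): split naively.
        tokens = line.split()
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields
```

Values are returned as strings in both cases so downstream comparisons behave the same regardless of which renderer produced the line.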

Application tags emitted by the chat orchestrator

These are grep-able tags surfaced specifically for the Cluster 1–3 fixes shipped 2026-05-13. They are the primary way to audit whether each fix is actually firing on prod traffic.

Tag	Where emitted	What it means
`[INTENT] Institutional treatment-info pattern detected`	`intent_classification_service.py:detect_institutional_treatment_query`	Cluster 1 fix fired — Q37/Q38/Q42-class query routed to institutional intent
`[INTENT] Prepended medical-content disclaimer`	`safety_mixin.py:_qs_apply_intent_disclaimer`	Disclaimer prepended on an institutional answer (Cluster 1)
`[CLUSTER2] no_info_with_chunks chunks=<N>`	`safety_mixin.py:_qs_log_no_info_with_chunks_warning`	Regression — LLM bailed to no-info despite N retrieved chunks. WARNING-level. Triage by reading the chunk contents from the matching request id
`[doctor_schedule_tool] no completed doc found`	`doctor_schedule_tool.py:lookup_doctor_schedule`	Cluster 3 tool fired but the doctor wasn't found — fallback to LLM path
`[SAFETY] Medical advice pattern detected`	`intent_classification_service.py:detect_medical_advice_query`	Q41-class personal-symptom triage caught by the pre-LLM regex

Searching production logs

# All Cluster 2 telemetry warnings from the last 24h
ssh <DEPLOY_USER>@<PILOT_HOST> 'docker logs --since 24h zol-app 2>&1 | grep CLUSTER2'

# Every institutional-info query routed today
ssh <DEPLOY_USER>@<PILOT_HOST> 'docker logs --since 24h zol-app 2>&1 | grep "Institutional treatment-info pattern detected"'

# Every safety refusal today, grouped by reason
ssh <DEPLOY_USER>@<PILOT_HOST> 'docker logs --since 24h zol-app 2>&1 | grep "\[SAFETY\]" | awk -F"\\(" "{print \$2}" | sort | uniq -c'
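
When grep output gets long, it can help to summarise it. This sketch counts `[CLUSTER2]` warnings by the number of retrieved chunks, using the tag format from the table above; `chunk_histogram` is a hypothetical helper name.

```python
import re
from collections import Counter

CLUSTER2 = re.compile(r"\[CLUSTER2\] no_info_with_chunks chunks=(\d+)")


def chunk_histogram(lines):
    """Count [CLUSTER2] warnings, keyed by how many chunks had been retrieved."""
    counts = Counter()
    for line in lines:
        match = CLUSTER2.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Saved as a script that calls `chunk_histogram(sys.stdin)`, it can sit at the end of the first grep pipeline above. A cluster of warnings at high chunk counts suggests the prompt is ignoring good context, while warnings at one or two chunks point more towards thin retrieval.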

Eval-trace fields (in s["timing"])

The chat orchestrator records per-request introspection into a dict that gets persisted to app.pipeline_telemetry. These are the fields the golden-eval harness reads.

Field	Type	Notes
`intent`	string	The classified intent (member of `UserIntent` enum)
`intent_classification_ms`	float	Time spent in the intent classifier
`retrieval_ms`	float	Time spent retrieving chunks (vector + graph)
`llm_ms`	float	Time spent in the response-generation LLM call
`safety_ms`	float	Time spent in regex + LLM-judge safety validation
`safety_regex_violations`	int	Number of regex safety violations on the response
`safety_llm_violations`	int	Number of LLM-judge violations
`retrieval_chunks_returned`	int	Cluster 2 — chunk count before LLM call
`answer_says_no_info`	bool	Cluster 2 — detector flagged the final answer as a Class C no-info template

When the eval harness sees retrieval_chunks_returned > 0 AND answer_says_no_info == true, that's the Q5-class regression: the prompt-side fix should have prevented it, and the telemetry catches the cases where it didn't.
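
The regression condition can be written as a small predicate over the timing dict. This is a sketch using the field names from the table above; the function names are hypothetical and this is not the harness's actual code.

```python
from typing import Optional


def is_no_info_regression(timing: dict) -> bool:
    """True when chunks were retrieved but the answer still fell back to no-info."""
    has_chunks = timing.get("retrieval_chunks_returned", 0) > 0
    return has_chunks and bool(timing.get("answer_says_no_info", False))


def regression_rate(timings: list) -> Optional[float]:
    """Fraction of requests that hit the regression; None when there is no data."""
    if not timings:
        return None
    return sum(is_no_info_regression(t) for t in timings) / len(timings)
```

A no-info answer with zero retrieved chunks is deliberately not counted: in that case the model had nothing to work with and declining is the correct behaviour.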

Runbooks

Runbook 1 — Backfill consultation_schedule on pilot (Cluster 3)

Background. ADR-0058 Layer C added an automatic schedule extractor at ingest time, but doctor profiles ingested before the extractor shipped have a NULL metadata->>'consultation_schedule'. The query_doctor_schedule tool (introduced in Cluster 3) falls back to the LLM-reads-markdown path when this key is NULL. Running the backfill populates the key for all existing doctor profiles and unlocks the tool's structured-JSON answer path for Q25/Q27.

Script. backend/scripts/backfill_consultation_schedule.py (idempotent — safe to run repeatedly).

Steps (on pilot host).

ssh <DEPLOY_USER>@<PILOT_HOST>
cd /opt/zol-rag

# Sanity check before: how many doctor profiles will the script touch?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
  SELECT COUNT(*)
  FROM app.documents
  WHERE status='completed'
    AND (metadata->>'consultation_schedule' IS NULL
         OR metadata->>'consultation_schedule' = 'null')
    AND EXISTS (
      SELECT 1 FROM app.document_chunks c
      WHERE c.document_id = documents.id
        AND c.content LIKE '%| MA | Di | WO%'
    );
"

# Run the backfill (idempotent — re-runnable if it fails midway)
docker exec zol-app python -m scripts.backfill_consultation_schedule

# Sanity check after: how many doctor profiles now have the JSON?
docker exec zol-postgres psql -U zolrag -d zol_rag -c "
  SELECT COUNT(*)
  FROM app.documents
  WHERE status='completed'
    AND metadata->>'consultation_schedule' IS NOT NULL
    AND metadata->>'consultation_schedule' != 'null';
"

Expected outcome. Re-running the "before" query should now return zero or close to it. Any remainder is profiles that don't have the schedule table, which is normal.

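Verification. Q25 ("Is er raadpleging voor Dr. Matthias Dupont op woensdag?", in English: "Does Dr. Matthias Dupont hold consultations on Wednesday?") should now answer "Ja, Dr. Dupont houdt 2-wekelijks raadpleging op woensdagvoormiddag" ("Yes, Dr. Dupont holds consultations every two weeks on Wednesday mornings"). Exercise it on the pilot chat UI.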

Runbook 2 — Find a specific request in the logs

When a user reports a bad answer, you need the structured-log record for that specific request. The chat path emits the request ID as a request_id field on every log line.

# 1. User reports "I asked about laadpalen at 14:32 and got no info"
# 2. Find the request id ("laadpal" matches both "laadpaal" and "laadpalen")
ssh <DEPLOY_USER>@<PILOT_HOST> 'docker logs --since 1h zol-app 2>&1 | grep -i laadpal | head'

# 3. Replay the full request lifecycle by request_id
REQ_ID=abc123-def456
ssh <DEPLOY_USER>@<PILOT_HOST> "docker logs zol-app 2>&1 | grep $REQ_ID"

# 4. Inspect the persisted telemetry row
ssh <DEPLOY_USER>@<PILOT_HOST> 'docker exec zol-postgres psql -U zolrag -d zol_rag -c "
  SELECT intent, retrieval_chunks_returned, answer_says_no_info,
         safety_regex_violations, total_ms
  FROM app.pipeline_telemetry WHERE request_id = '"'"'abc123-def456'"'"';"'
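
The nested quoting in step 4 is easy to get wrong by hand. One way to sidestep it, sketched here with a hypothetical helper, is to let Python's `shlex` module do the quoting for the remote shell. The container, user, and table names are the ones used above; the validation rule for request ids (letters, digits, hyphens) is an assumption.

```python
import shlex


def telemetry_lookup_command(target, request_id):
    """Build the ssh argv for the psql lookup, with quoting handled by shlex."""
    # Refuse anything that is not a plain id, since it is interpolated into SQL.
    if not request_id.replace("-", "").isalnum():
        raise ValueError("unexpected characters in request id")
    sql = (
        "SELECT intent, retrieval_chunks_returned, answer_says_no_info, "
        "safety_regex_violations, total_ms "
        f"FROM app.pipeline_telemetry WHERE request_id = '{request_id}';"
    )
    remote = ["docker", "exec", "zol-postgres", "psql",
              "-U", "zolrag", "-d", "zol_rag", "-c", sql]
    # ssh hands its argument to the remote shell, so quote once for that shell.
    return ["ssh", target, shlex.join(remote)]
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids a second layer of local shell quoting.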

Runbook 3 — Querying audit DB tables directly

For deeper analysis than logs can give, query the audit tables.

-- Top 10 intents by volume in the last 24h
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '24 hours'
GROUP BY intent
ORDER BY n DESC LIMIT 10;

-- Cluster 2 regression check — how often are we bailing to no-info
-- despite having chunks?
SELECT
  COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info)
    AS regression_count,
  COUNT(*) AS total_queries,
  (COUNT(*) FILTER (WHERE retrieval_chunks_returned > 0 AND answer_says_no_info))::float
    / NULLIF(COUNT(*), 0) AS regression_rate
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days';

-- Cluster 1 firing rate: how many queries hit each of the new intents?
SELECT intent, COUNT(*) AS n
FROM app.pipeline_telemetry
WHERE created_at > now() - interval '7 days'
  AND intent IN ('institutional_treatment_info', 'doctor_schedule_query',
                 'out_of_scope_medical_advice')
GROUP BY intent;

-- Negative-feedback events with full request context
SELECT f.id, f.created_at, f.rating, f.user_comment,
       pt.intent, pt.retrieval_chunks_returned, pt.answer_says_no_info
FROM app.session_feedback f
LEFT JOIN app.pipeline_telemetry pt ON pt.request_id = f.request_id
WHERE f.rating IN ('negative', 'disputed')
  AND f.created_at > now() - interval '7 days'
ORDER BY f.created_at DESC LIMIT 50;

Runbook 4 — Add a new metric

You want to track a new value on prod traffic. The flow:

Define the Prometheus metric in backend/app/api/metrics.py:

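from prometheus_client import Counter  # skip if metrics.py already imports it

# Counter: a monotonically increasing value, one series per label combination
MY_NEW_METRIC = Counter(
    "my_new_metric_total",
    "Description of what this counts",
    ["label1", "label2"],
)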

Add a recording helper in backend/app/infrastructure/metrics.py:

def record_my_new_event(label1: str, label2: str) -> None:
    _m().MY_NEW_METRIC.labels(label1=label1, label2=label2).inc()

Call from your code via safe_record:

from app.infrastructure.metrics import safe_record, record_my_new_event
safe_record(record_my_new_event, "value1", "value2")

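Deploy. The metric's HELP and TYPE lines appear at /metrics as soon as the new code is running; with prometheus_client, each labelled series only shows up after its first recorded event. Add a panel to the relevant Grafana dashboard. Check it into git via the dashboard export → JSON pipeline.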
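
The reason to route calls through safe_record is that telemetry should never take down a request. The real implementation in backend/app/infrastructure/metrics.py is not reproduced on this page; the sketch below is only a guess at its contract, inferred from the name and the call shape above.

```python
import logging

logger = logging.getLogger("metrics")


def safe_record(recorder, *args, **kwargs):
    """Call a recording helper and swallow any failure (hypothetical sketch)."""
    try:
        recorder(*args, **kwargs)
    except Exception:  # deliberately broad: telemetry must stay best-effort
        name = getattr(recorder, "__name__", repr(recorder))
        logger.warning("metric recording failed: %s", name, exc_info=True)


def record_broken_event(label1):
    # Stand-in for a helper that fails, e.g. because of a label mismatch.
    raise RuntimeError("label mismatch")


safe_record(record_broken_event, "value1")  # logs a warning, does not raise
```

Under this design a mistyped label name shows up as a warning in the logs and a flat line in Grafana rather than as a 500 on the chat endpoint.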

Runbook 5 — Re-run the 50-Q MedChat vs ZOL benchmark

After a deploy, validate the comparison-report-v5 cluster fixes still hold:

ssh <DEPLOY_USER>@<PILOT_HOST>
cd /opt/zol-rag

# Run the benchmark against pilot via the new --pilot-golden flag
docker exec zol-app python -m tests.evaluation.run_evaluation \
  --pilot-golden \
  --base-url https://test.medchat.health \
  --output /tmp/benchmark-$(date +%Y%m%d-%H%M).json

# Inspect the summary of the most recent run (the glob has to expand inside the container)
docker exec zol-app sh -c 'python -m json.tool "$(ls -t /tmp/benchmark-*.json | head -1)"' | head -40

The harness loads questions from the DB-backed app.golden_questions table (302 seed rows + any /add-to-golden-derived feedback rows). Compare wins / avg / per-question scores against the prior run to confirm the cluster fixes landed without regression.
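
Comparing two runs by eye gets tedious with this many questions. The sketch below flags per-question drops between two runs. The output schema of the benchmark JSON is not documented on this page, so the input shape here (a mapping of question id to score) and the 0.05 tolerance are purely hypothetical; adapt the extraction to whatever the harness actually writes.

```python
def score_regressions(previous, current, tolerance=0.05):
    """Return {question_id: delta} for questions whose score fell by more than `tolerance`."""
    drops = {}
    for question_id, old_score in previous.items():
        new_score = current.get(question_id)
        if new_score is not None and old_score - new_score > tolerance:
            drops[question_id] = round(new_score - old_score, 4)
    return drops


# Hypothetical scores for two runs.
before = {"Q5": 0.9, "Q25": 0.8, "Q37": 0.7}
after = {"Q5": 0.6, "Q25": 0.82, "Q37": 0.68}
print(score_regressions(before, after))
```

Questions missing from the newer run are skipped rather than reported, since a changed question set is a different problem from a score regression.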

Reading-list pointers

backend/app/api/metrics.py — Prometheus metric object definitions
backend/app/infrastructure/metrics.py — recording helpers + safe_record
backend/app/services/rag/safety_mixin.py — Cluster 1+2 telemetry hooks
backend/app/services/doctor_schedule_tool.py — Cluster 3 tool + intent integration
grafana/dashboards/ — versioned dashboard JSON
grafana/datasources/ — Prometheus + Loki provisioning
deployment/monitoring.md — infrastructure-level monitoring (this page's companion)
