Morning Check

A five-minute daily routine for the ZOL RAG on-call. Run through these five steps once at the start of your shift and you will know whether the system is healthy, whether yesterday's spend was sane, whether the nightly ingest fired, and whether anything is silently waiting for you to acknowledge it.

If the routine takes longer than five minutes, something is wrong: drop into the relevant dashboard and investigate.


1. Open the SLO dashboard first

From your laptop:

./scripts/observability.sh

When Grafana opens, click Dashboards → Browse → ZOL RAG - SLO Status (uid zol-rag-slo-dashboard). This is your single source of truth for "is the product working right now". The dashboard renders six headline stats colored red / yellow / green against the SLO thresholds.

Glance at all six. Green everywhere = move on. Anything red = act on the table below before going to step 2.

| If you see... | Means... | Do... |
| --- | --- | --- |
| API Availability red (<99% over 24h) | Backend is flapping | Open Infrastructure Health → Service Health panel. Then SSH to pilot and run docker logs zol-app --tail 200 to see what's restarting. |
| HTTP 5xx red (>=1% over 5m) | Endpoint regression | Open Pipeline Overview → Response Time by Intent. Look for one intent spiking while others are flat; that's your suspect. |
| RAG p95 latency red (>=5s over 5m) | LLM latency degradation | Open LLM & Cost Tracking → Cost Burn Rate. If burn is also up, OpenAI itself is slow; if burn is flat, the slowdown is in our pipeline. |
| Voice TTFT p95 red (>=2000ms over 5m) | Voice channel slow | Open Voice Channel dashboard and scan all four rows. Replay a recent call via the transcript-replay tool to reproduce. |
| LLM Error Rate red (>=2% over 5m) | OpenAI errors or circuit issues | Open LLM & Cost Tracking → Cost Burn Rate and look for gaps (means calls stopped going out). Then curl /health/ready and inspect the llm_circuit field. |
| Medical-Advice Incidents >= 1 | LOAD-BEARING SLO BREACH | Open Safety & Compliance dashboard, identify the specific event_id from analytics_events, and notify the clinical lead immediately. This is the one SLO that is never allowed to be non-zero. |

The medical-advice row is the only red that always means stop-everything. The other five reds normally let you finish the morning check before diving in, but use judgement: a 5xx storm at 50% is not "finish the routine first".
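
The thresholds in the table can be written down as a small check, which is useful if you ever script the glance. This is a minimal sketch, assuming the six headline values have already been read off the dashboard; the name `red_slos` and the dictionary keys are hypothetical, and only the threshold numbers come from the table above.

```python
# Thresholds copied from the SLO table. Key names are illustrative,
# not the real metric names used by the dashboard.
SLO_RULES = {
    "api_availability_pct": lambda v: v < 99.0,    # red below 99% over 24h
    "http_5xx_pct": lambda v: v >= 1.0,            # red at or above 1% over 5m
    "rag_p95_latency_s": lambda v: v >= 5.0,       # red at or above 5s over 5m
    "voice_ttft_p95_ms": lambda v: v >= 2000.0,    # red at or above 2000ms over 5m
    "llm_error_rate_pct": lambda v: v >= 2.0,      # red at or above 2% over 5m
    "medical_advice_incidents": lambda v: v >= 1,  # any incident is a breach
}


def red_slos(stats):
    """Return the names of red stats, with the medical-advice SLO first."""
    red = [
        name
        for name, is_red in SLO_RULES.items()
        if name in stats and is_red(stats[name])
    ]
    # Stable sort: the stop-everything SLO moves to the front, the rest keep order.
    red.sort(key=lambda name: name != "medical_advice_incidents")
    return red
```

An empty list means every supplied stat is within its threshold. Yellow states are not modelled here because the table only defines the red boundaries.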


2. Check yesterday's cost

Open ZOL RAG - LLM & Cost Tracking (uid zol-rag-llm-cost-tracking). The top section is labeled Authoritative Cost Tracking and is Postgres-backed (not Prometheus aggregates), so the numbers here are the real bill.

Sanity baselines for current pilot traffic:

| Stat | Normal range |
| --- | --- |
| Cost Today | <$5 |
| Cost Last 7 Days | <$25 |
| Daily Cost Trend | flat-ish line, no step changes |

If today's spend is already 3× yesterday's at the same hour, look at the Cost by Channel panel. The two common causes:

  • Voice channel runaway (one stuck call burning tokens in a tight loop) — open Voice Channel dashboard and check active sessions.
  • Someone fired a mass eval against the pilot — check who's running benchmarks today before assuming it's a bug.
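
The cost baselines and the 3× rule reduce to a few comparisons. A minimal sketch, assuming you have the three dollar figures in hand; `cost_flags` and its flag strings are hypothetical names, and the limits are the pilot baselines quoted above.

```python
def cost_flags(cost_today, cost_last_7_days, yesterday_same_hour):
    """Return a list of cost warnings; an empty list means spend looks sane.

    All arguments are dollar amounts. yesterday_same_hour is yesterday's
    cumulative spend at the current hour of day.
    """
    flags = []
    if cost_today >= 5.0:            # baseline: Cost Today < $5
        flags.append("cost_today_above_baseline")
    if cost_last_7_days >= 25.0:     # baseline: Cost Last 7 Days < $25
        flags.append("cost_7d_above_baseline")
    # Skip the ratio check when yesterday was zero, to avoid flagging on no data.
    if yesterday_same_hour > 0 and cost_today >= 3 * yesterday_same_hour:
        flags.append("spend_3x_yesterday")
    return flags
```

A `spend_3x_yesterday` flag is the cue to open the Cost by Channel panel; the other two flags only say that a baseline was exceeded, not why.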

3. Verify nightly ingest succeeded

Open ZOL RAG - Ingest Pipeline (uid zol-rag-ingest-pipeline). The top row, Last Ingest Run, tells you everything you need:

| Stat | What you want to see |
| --- | --- |
| Status | OK (green) |
| Hours Since Last Successful Run | between 6 and 27 (the 03:00 UTC nightly is what you're looking at) |
| Failed count | <10 (small numbers are normal, large numbers need triage) |

If failed count is in the hundreds, scroll to the Failure Diagnostics row and check which failure_class is spiking (e.g., DEAD_EMPTY_CONTENT, HTTP_404, TIMEOUT). One class dominating points at a single root cause.
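
Whether one class dominates can be checked mechanically once you have the failure_class value of each failed item. A minimal sketch; `dominant_failure_class` is a hypothetical helper, and the 50% cut-off for "dominating" is an assumption, not a documented threshold.

```python
from collections import Counter


def dominant_failure_class(failure_classes, share=0.5):
    """Return (class_name, fraction) if one class makes up at least
    `share` of all failures, otherwise None.

    failure_classes is a list with one failure_class string per failed item.
    """
    if not failure_classes:
        return None
    name, count = Counter(failure_classes).most_common(1)[0]
    fraction = count / len(failure_classes)
    return (name, fraction) if fraction >= share else None
```

A `None` result with a large failed count suggests several unrelated problems rather than one root cause.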

If the nightly is missing entirely (Hours Since Last Successful Run >27), the most likely cause is that someone set INGEST_MODE=manual for a deploy window and forgot to flip it back:

ssh deploy@88.99.184.57 'grep INGEST_MODE /opt/zol-rag/.env.prod'

The command should print INGEST_MODE=auto. If it prints manual or off, that is why the nightly did not fire; coordinate with whoever set it before flipping it back.
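
A plain grep also matches commented-out lines, so the output can show more than one INGEST_MODE entry. The sketch below picks out the effective value from dotenv-style text. `ingest_mode` is a hypothetical helper, and it assumes the last uncommented assignment wins, which depends on how the application actually loads the file.

```python
def ingest_mode(env_text):
    """Return the last uncommented INGEST_MODE value in dotenv-style text,
    or None if the key is not set."""
    mode = None
    for line in env_text.splitlines():
        line = line.strip()
        # Ignore comments and anything that is not a KEY=value assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "INGEST_MODE":
            mode = value.strip().strip("'\"")  # drop optional surrounding quotes
    return mode
```

Feed it the grep output or the whole file contents; anything other than `auto` explains a missing nightly.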


4. Spot-check Safety

Open ZOL RAG - Safety & Compliance (uid zol-rag-safety-compliance). The headline row is the ZERO target — medical-advice incidents must be zero (you already checked this in step 1, but the safety dashboard gives you the supporting metrics).

What to look at:

| Metric | Normal | Concerning |
| --- | --- | --- |
| Refusals today | any non-zero (filter is working) | spike >10× yesterday's value at same hour |
| Voice safety escalations today | any non-zero (escalation flow is working) | sudden spike, or zero when refusals are spiking (escalation may be broken) |
| Refusal rate | <2% | >5% sustained means over-refusal regression |
| Citation-attached rate | >90% | <70% means the citation pipeline is broken |

If citation-attached rate dropped below 70%, the most likely culprit is rag_service.py — grep its logs for citation_renumber or bronnen_dedupe errors. The dedup/renumber logic has been the source of three citation regressions historically.
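
The "Concerning" column of the table can be expressed as a single check. A minimal sketch with hypothetical names (`safety_concerns` and its flag strings); it looks at one snapshot only, so it cannot tell whether a refusal rate above 5% is sustained, and it does not cover a sudden spike in escalations.

```python
def safety_concerns(refusals_today, refusals_yesterday, escalations_today,
                    refusal_rate_pct, citation_rate_pct):
    """Return a list of concerns derived from the safety table thresholds.

    refusals_yesterday is yesterday's refusal count at the same hour.
    """
    concerns = []
    # Spike: more than 10x yesterday's value at the same hour.
    spiking = refusals_yesterday > 0 and refusals_today > 10 * refusals_yesterday
    if spiking:
        concerns.append("refusal_spike")
        # Zero escalations during a refusal spike hints at a broken escalation flow.
        if escalations_today == 0:
            concerns.append("escalation_possibly_broken")
    if refusal_rate_pct > 5.0:
        concerns.append("over_refusal")
    if citation_rate_pct < 70.0:
        concerns.append("citation_pipeline")
    return concerns
```

Values between the Normal and Concerning columns (for example a citation-attached rate of 80%) return no flag here and are left to judgement.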


5. Acknowledge any active alerts

Go to Grafana → Alerting → Alert rules and filter to Firing. The six rules that auto-page are:

  • BackendDown
  • HighErrorRate
  • LLMCostBurnRate (triggers above $5/hr)
  • SafetyRefusalSpike
  • VoiceTTFTHigh
  • LLMCircuitOpen

Each firing rule has annotations (summary, description) that explain its condition. For each one:

  1. Either ACK it (silence with a reason in the silence form), or
  2. Escalate to the appropriate channel.

Do not leave alerts unacknowledged. The next on-call has no way to tell whether a firing alert means "Claude is on it" or "nobody has noticed yet" — an ACK with a reason resolves that ambiguity.
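
The rule "every firing alert has either an ACK with a reason or an escalation" can be checked from two lists. A minimal sketch, assuming you have copied the firing rule names and the silence reasons out of Grafana by hand; `needs_attention` and its input shapes are hypothetical and do not reflect any Grafana API.

```python
# The six auto-paging rules listed above.
AUTO_PAGE_RULES = [
    "BackendDown",
    "HighErrorRate",
    "LLMCostBurnRate",
    "SafetyRefusalSpike",
    "VoiceTTFTHigh",
    "LLMCircuitOpen",
]


def needs_attention(firing, silence_reasons):
    """Return firing rules that have no silence, or a silence with a blank
    reason. Auto-paging rules are listed first.

    firing: list of rule names currently firing.
    silence_reasons: dict mapping rule name to the reason typed in the silence form.
    """
    pending = [rule for rule in firing if not silence_reasons.get(rule, "").strip()]
    # Stable sort keeps the original order within each group.
    return sorted(pending, key=lambda rule: rule not in AUTO_PAGE_RULES)
```

A silence with an empty reason is treated as unacknowledged on purpose, since it leaves the next on-call with the same ambiguity as no silence at all.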


What if everything is green?

That's the goal. Spend the remaining 2 minutes glancing at trend lines on the ZOL RAG - Pipeline Overview dashboard (uid zol-rag-pipeline-overview) to spot anything weird that doesn't yet trip an SLO:

  • One intent suddenly dominating the mix (something changed in user behaviour, or the classifier is biased)
  • p95 latency drifting up over a week (slow regression, no single deploy to blame)
  • Refusal rate slowly climbing (prompt drift, or a content gap)
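
Slow drift is easier to judge as a number than by eye. The sketch below fits a least-squares line through one value per day and returns its slope; `daily_slope` is a hypothetical helper, and how steep a slope deserves a ticket is a judgement call that this runbook does not define.

```python
def daily_slope(values):
    """Least-squares slope of values against day index 0..n-1.

    values holds one reading per day, oldest first (for example daily p95
    latency in seconds). The result is in units per day.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A slope near zero means a flat week; a steady positive slope on p95 latency or refusal rate is the kind of regression that no single deploy explains.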

Then close the tab. The system is healthy.


Tools

| Tool | Purpose |
| --- | --- |
| ./scripts/observability.sh | Opens Grafana with SSH tunnel and admin credentials prefilled |
| ./scripts/observability.sh --prom-query '<expr>' | Ad-hoc PromQL query against the pilot's Prometheus |
| Deployment Overview | Full deployment runbook: what to do when you need to ship a change |
| Telemetry & Runbooks | Full metric catalog, structured-log catalog, and longer-form runbooks |