Feedback Dashboard Metrics
The admin dashboards aggregate user satisfaction and operational quality signals across two pages:
- FeedbackDashboardPage: feedback metrics (P95 latency, Think Harder funnel, satisfaction trends, content flagging).
- Analytics.tsx: three tabs (queries, activity, costs); the costs tab (Owner role) carries the operational metrics added with migrations 066 and 067.
Three features were added to the existing feedback dashboard: a P95 Latency Comparison card, a Think Harder Funnel visualisation, and a Flag for Review action on individual feedback records. Two operations metrics charts (Category Mismatch Trend, Diagnostic Accuracy Trend) live on the Analytics costs tab.
What the Dashboard Already Had
Before these additions the dashboard provided:
- Satisfaction summary cards (positive %, negative %, category distribution)
- ADR-0008 metric cards: Think Harder count, improvement acceptance rate, override count, nano/timeout triggers
- Satisfaction trend line chart
- Intent distribution chart
- Feedback list with AI investigation and rating override actions
- Add to golden evaluation set action
P95 Latency Comparison
Purpose
The P95 Latency Comparison card exposes end-to-end pipeline latency at the 95th percentile, split by whether the query was escalated through Think Harder. This makes the cost of escalation visible in production terms.
Backend — GET /admin/feedback/telemetry-stats
GET /api/v1/admin/feedback/telemetry-stats?period_days=30
Authorization: Bearer <admin token>
The endpoint is defined in backend/app/api/admin_feedback.py and requires require_admin. It accepts a single query parameter:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| period_days | int | 30 | 1–365 | Look-back window |
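A minimal sketch of how a client might build this request. The helper name `build_telemetry_request`, the host, and the token are all hypothetical; only the path, the `period_days` parameter, and its 1–365 range come from the text above. The request is constructed but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_telemetry_request(base_url: str, token: str, period_days: int = 30) -> Request:
    """Build (but do not send) a GET request for the telemetry-stats endpoint."""
    # Mirror the documented range so an invalid value fails before the round trip.
    if not 1 <= period_days <= 365:
        raise ValueError("period_days must be between 1 and 365")
    query = urlencode({"period_days": period_days})
    url = f"{base_url}/api/v1/admin/feedback/telemetry-stats?{query}"
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


# Hypothetical host and token, for illustration only.
request = build_telemetry_request("https://example.invalid", "ADMIN_TOKEN", period_days=7)
```

Validating the range on the client is a convenience only; the server-side check remains the authority.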
SQL approach
The query uses a CTE to sum duration_ms across all pipeline stages per conversation_id, then joins to app.feedback_events to determine whether any think_harder event was recorded for that conversation. percentile_cont(0.95) with a FILTER clause computes separate P95 values for normal and escalated groups in a single pass:
WITH query_latency AS (
    SELECT
        pt.conversation_id,
        SUM(pt.duration_ms) AS total_ms,
        -- BOOL_OR yields NULL when the LEFT JOIN finds no think_harder row;
        -- COALESCE maps that to FALSE so the NOT is_escalated filters match.
        COALESCE(BOOL_OR(fe.feedback_type = 'think_harder'), FALSE) AS is_escalated
    FROM app.pipeline_telemetry pt
    LEFT JOIN app.feedback_events fe
        ON fe.conversation_id = pt.conversation_id
        AND fe.feedback_type = 'think_harder'
    WHERE pt.created_at >= :cutoff
    GROUP BY pt.conversation_id
)
SELECT
    COUNT(*)::int AS total_queries,
    COUNT(*) FILTER (WHERE NOT is_escalated)::int AS normal_queries,
    COUNT(*) FILTER (WHERE is_escalated)::int AS escalated_queries,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms)
        FILTER (WHERE NOT is_escalated) AS p95_normal,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms)
        FILTER (WHERE is_escalated) AS p95_escalated,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95_overall
FROM query_latency
A second query computes per-stage average latency, returned in stage_avg_ms.
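To make the aggregation concrete, here is a small Python sketch of what `percentile_cont(0.95)` with a `FILTER` clause computes. PostgreSQL's `percentile_cont` interpolates linearly between adjacent ordered values and returns NULL for an empty group. The function below mirrors that behaviour under those assumptions, and the sample rows are invented for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def percentile_cont(values: Sequence[float], fraction: float) -> Optional[float]:
    """Continuous percentile with linear interpolation; None for an empty group."""
    if not values:
        return None
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional rank
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Hypothetical per-conversation totals: (total_ms, is_escalated).
rows: List[Tuple[float, bool]] = [
    (1000.0, False),
    (2000.0, False),
    (3000.0, False),
    (9000.0, True),
]

# Equivalent of the two FILTER clauses: partition first, then aggregate.
normal = [ms for ms, escalated in rows if not escalated]
escalated_only = [ms for ms, escalated in rows if escalated]

p95_normal = percentile_cont(normal, 0.95)
p95_escalated = percentile_cont(escalated_only, 0.95)
p95_empty = percentile_cont([], 0.95)  # mirrors the NULL returned for an empty group
```

With three normal values the fractional rank is 1.9, so the result sits nine tenths of the way from 2000 to 3000.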
Response model
{
  "period_days": 30,
  "total_queries": 1842,
  "normal_queries": 1756,
  "escalated_queries": 86,
  "p95_normal_ms": 6840.0,
  "p95_escalated_ms": 13250.0,
  "p95_overall_ms": 7100.0,
  "stage_avg_ms": {
    "generation": 4210.3,
    "reranking": 1840.1,
    "retrieval": 620.5,
    "intent": 210.0
  }
}
All latency fields are nullable — they return null when no data exists for that group within the period.
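A consumer has to treat the three P95 fields as optional. The sketch below models the payload with a plain dataclass and formats a missing value as "n/a". The class name `TelemetryStatsSketch` and the helper `format_latency` are hypothetical; the backend's actual response model is not shown in this document.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TelemetryStatsSketch:
    """Illustrative mirror of the response fields; not the backend's real model."""
    period_days: int
    total_queries: int
    normal_queries: int
    escalated_queries: int
    p95_normal_ms: Optional[float] = None
    p95_escalated_ms: Optional[float] = None
    p95_overall_ms: Optional[float] = None
    stage_avg_ms: Dict[str, float] = field(default_factory=dict)


def format_latency(value_ms: Optional[float]) -> str:
    """Render milliseconds as seconds, or a placeholder when no data exists."""
    if value_ms is None:
        return "n/a"
    return f"{value_ms / 1000:.2f} s"


# Invented payload for a period with no escalated conversations.
raw = (
    '{"period_days": 7, "total_queries": 12, "normal_queries": 12, '
    '"escalated_queries": 0, "p95_normal_ms": 6840.0, "p95_escalated_ms": null, '
    '"p95_overall_ms": 6840.0, "stage_avg_ms": {"generation": 4210.3}}'
)
stats = TelemetryStatsSketch(**json.loads(raw))
normal_label = format_latency(stats.p95_normal_ms)
escalated_label = format_latency(stats.p95_escalated_ms)
```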
UI card layout
The card renders three stat tiles side by side: P95 Normal, P95 Escalated, and P95 Overall. When both normal and escalated values are available, the overall tile shows a relative difference label. The label turns amber when the escalated P95 exceeds 1.5× the normal P95, and green otherwise.
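The colour rule can be sketched as follows. The 1.5× threshold comes from the description above; the label wording, the function name, and the use of Python (the dashboard itself is a TypeScript page) are assumptions made for illustration.

```python
from typing import Optional, Tuple

AMBER_RATIO = 1.5  # threshold stated in the card description


def relative_difference_label(
    p95_normal_ms: Optional[float],
    p95_escalated_ms: Optional[float],
) -> Optional[Tuple[str, str]]:
    """Return (label text, colour), or None when either value is unavailable."""
    if p95_normal_ms is None or p95_escalated_ms is None or p95_normal_ms <= 0:
        return None
    difference_pct = (p95_escalated_ms - p95_normal_ms) / p95_normal_ms * 100
    colour = "amber" if p95_escalated_ms > AMBER_RATIO * p95_normal_ms else "green"
    # Hypothetical label format; the real card's wording is not documented here.
    return (f"{difference_pct:+.0f}% vs normal", colour)


slow_escalation = relative_difference_label(6840.0, 13250.0)
mild_escalation = relative_difference_label(6840.0, 8000.0)
missing_data = relative_difference_label(6840.0, None)
```

Using the sample response values (6840 ms normal, 13250 ms escalated), the escalated P95 is well above 1.5× the normal P95, so the label is amber.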
Data flow diagram
Think Harder Funnel
Purpose
The funnel visualises the full conversion path from user dissatisfaction to accepted improvement, giving content managers an at-a-glance view of how effectively Think Harder recovers poor search experiences.
Data source
The funnel reads directly from the existing /admin/feedback/summary endpoint — no new backend endpoint was required. The three counts used are already part of FeedbackSummary:
| Field | Meaning |
|---|---|
| negative_count | Total thumbs-down feedback events in the period |
| think_harder_count | Escalations triggered from negative feedback |
| improvement_accepted_count | Escalated responses accepted by users |
Conversion rate calculation
The funnel renders two inline conversion rates:
- Escalation rate: think_harder_count / negative_count × 100 (shown in blue under the Escalated column)
- Acceptance rate: improvement_accepted_count / think_harder_count × 100 (shown in green under the Accepted column)
Both rates are only rendered when the denominator is non-zero.
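The two rates and the zero-denominator guard can be written as one helper. This is a sketch with invented counts; `conversion_rate` is a hypothetical name, and returning `None` stands in for "do not render the label".

```python
from typing import Optional


def conversion_rate(numerator: int, denominator: int) -> Optional[float]:
    """Percentage conversion, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator * 100


# Invented counts shaped like the FeedbackSummary fields.
negative_count = 200
think_harder_count = 86
improvement_accepted_count = 43

escalation_rate = conversion_rate(think_harder_count, negative_count)
acceptance_rate = conversion_rate(improvement_accepted_count, think_harder_count)

# With no negative feedback in the period there is nothing to divide by.
empty_period_rate = conversion_rate(0, 0)
```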
Visual layout
[ Negative ]  →  [ Escalated ]  →  [ Accepted ]
     N              M (X%)            K (Y%)
The three columns are equal-width. Arrows between columns are decorative HTML entities. The percentage labels sit below the count in a smaller contrasting colour.
Flag for Review
Purpose
Admins can mark individual feedback records for follow-up by content teams without leaving the dashboard. The flag is stored as a JSONB metadata field on the feedback record so it survives schema changes and requires no migration.
Backend — POST /admin/feedback/{id}/flag
POST /api/v1/admin/feedback/{feedback_id}/flag
Authorization: Bearer <admin token>
The endpoint requires require_admin and accepts no request body. It performs an in-place JSONB merge on app.session_feedback:
UPDATE app.session_feedback
SET metadata = COALESCE(metadata, '{}'::jsonb) || CAST(:flag_data AS jsonb)
WHERE id = CAST(:feedback_id AS uuid)
RETURNING id::text
The flag_data value is {"flagged": true}. Using CAST(:param AS jsonb) (not ::jsonb) ensures compatibility with asyncpg parameter binding.
A 404 is raised if the feedback record does not exist.
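The effect of `COALESCE(metadata, '{}'::jsonb) || flag_data` can be modelled with dictionaries: the JSONB `||` operator merges top-level keys, with the right-hand side winning on conflicts, and the `COALESCE` keeps a NULL column from swallowing the result. The sketch below is a Python analogy with invented metadata, not the endpoint's code.

```python
import json
from typing import Any, Dict, Optional


def merge_flag(metadata: Optional[Dict[str, Any]], flag_data: Dict[str, Any]) -> Dict[str, Any]:
    """Shallow merge that treats missing metadata as an empty object."""
    return {**(metadata or {}), **flag_data}


flag_data = {"flagged": True}

# The bound parameter is a JSON string, which the query casts to jsonb.
flag_param = json.dumps(flag_data)

from_null = merge_flag(None, flag_data)
# Hypothetical existing key, to show that unrelated metadata survives the merge.
from_existing = merge_flag({"source": "widget"}, flag_data)
# Flagging twice leaves the record unchanged.
flagged_again = merge_flag(from_existing, flag_data)
```

Because the merge is shallow and overwrites only the `flagged` key, repeating the request is harmless.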
Response model
{
"status": "flagged",
"feedback_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
}
UI behaviour
Each row in the feedback list has an onFlagForReview callback. Clicking the button:
- Sets flaggingId to the feedback record's ID (disabling the button while in-flight)
- Calls feedbackAdminService.flagForReview(id)
- Shows a toast.success('Feedback flagged for review') on success
- Shows a toast.error('Failed to flag feedback') on failure
- Clears flaggingId in the finally block
The flag state is not reflected back in the list UI after the action — it is a fire-and-forget signal for content review workflows.
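The click-handler steps listed above follow a try/except/finally shape. The real handler lives in the TypeScript frontend; the version below restates the same control flow in Python with stand-in state and fake service functions, all of which are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


class FlagUiState:
    """Stand-in for the component state the steps describe."""

    def __init__(self) -> None:
        self.flagging_id: Optional[str] = None
        self.toasts: List[Tuple[str, str]] = []


async def on_flag_for_review(
    state: FlagUiState,
    feedback_id: str,
    flag_for_review: Callable[[str], Awaitable[None]],
) -> None:
    state.flagging_id = feedback_id  # the button is disabled while this is set
    try:
        await flag_for_review(feedback_id)
        state.toasts.append(("success", "Feedback flagged for review"))
    except Exception:
        state.toasts.append(("error", "Failed to flag feedback"))
    finally:
        state.flagging_id = None  # always re-enable the button


async def fake_success(feedback_id: str) -> None:
    return None


async def fake_failure(feedback_id: str) -> None:
    raise RuntimeError("simulated network failure")


state = FlagUiState()
asyncio.run(on_flag_for_review(state, "3fa85f64-5717-4562-b3fc-2c963f66afa6", fake_success))
asyncio.run(on_flag_for_review(state, "3fa85f64-5717-4562-b3fc-2c963f66afa6", fake_failure))
```

Clearing the ID in `finally` is what guarantees the button is usable again after either outcome.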
API Reference
GET /admin/feedback/telemetry-stats
| Method | GET |
| Path | /api/v1/admin/feedback/telemetry-stats |
| Auth | Admin role required |
| Query params | period_days (int, default 30, range 1–365) |
| Response | TelemetryStats |
| Source | backend/app/api/admin_feedback.py |
TelemetryStats schema
| Field | Type | Description |
|---|---|---|
| period_days | int | The requested period |
| total_queries | int | Total distinct conversations in period |
| normal_queries | int | Conversations without Think Harder |
| escalated_queries | int | Conversations with at least one Think Harder event |
| p95_normal_ms | float \| null | P95 total latency for normal queries (ms) |
| p95_escalated_ms | float \| null | P95 total latency for escalated queries (ms) |
| p95_overall_ms | float \| null | P95 total latency across all queries (ms) |
| stage_avg_ms | object | Map of stage name to average duration (ms) |
POST /admin/feedback/{feedback_id}/flag
| Method | POST |
| Path | /api/v1/admin/feedback/{feedback_id}/flag |
| Auth | Admin role required |
| Path param | feedback_id (UUID) |
| Request body | None |
| Response | FlagResponse |
| Source | backend/app/api/admin_feedback.py |
FlagResponse schema
| Field | Type | Description |
|---|---|---|
| status | string | Always "flagged" on success |
| feedback_id | string | UUID of the updated record |
Operations Metrics (Costs Tab)
The Analytics page (frontend/src/pages/Analytics.tsx) carries three tabs: queries, activity, and costs (Owner role only). The costs tab adds two operational quality charts that surface system-quality drift independently of explicit user feedback. Both were added on 2026-05-09 alongside the Value Framework rollout.
Category Mismatch Trend
| Component | frontend/src/components/Analytics/CategoryMismatchTrend.tsx |
| Backend endpoint | GET /api/v1/admin/ops/category-mismatch (backend/app/api/admin_ops.py:342) |
| Data source | app.category_mismatch_telemetry (alembic migration 066) |
| What it plots | Per-day mean of mismatch_rate — the fraction of top-K chunks whose tagged content category was OFF the intent's preferred set after the Stage 5b affinity rerank. |
| Why it matters | A persistent rise in mismatch_rate signals retrieval-steering drift: either the affinity coefficients need retuning, or the corpus has accumulated content the intent classifier doesn't model. The chart is the operator's first line of detection for the wheelchair-vs-cardiology class of regression. |
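As an illustration of the plotted quantity, the sketch below computes a per-query mismatch rate and then the per-day mean. The category names, the sample rows, and the choice to return 0.0 for an empty chunk list are assumptions; the actual telemetry schema is defined by migration 066 and is not reproduced here.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def mismatch_rate(chunk_categories: Sequence[str], preferred: Set[str]) -> float:
    """Fraction of the top-K chunks whose category is outside the preferred set."""
    if not chunk_categories:
        return 0.0  # assumption: an empty result set counts as no mismatch
    off_target = sum(1 for category in chunk_categories if category not in preferred)
    return off_target / len(chunk_categories)


def daily_mean_mismatch(rows: Iterable[Tuple[date, float]]) -> Dict[date, float]:
    """Group per-query rates by day and average them, as the chart does."""
    by_day: Dict[date, List[float]] = defaultdict(list)
    for day, rate in rows:
        by_day[day].append(rate)
    return {day: mean(rates) for day, rates in sorted(by_day.items())}


# Hypothetical categories echoing the wheelchair-vs-cardiology example.
single_query = mismatch_rate(
    ["mobility", "mobility", "cardiology", "mobility"], preferred={"mobility"}
)

day_one, day_two = date(2026, 5, 9), date(2026, 5, 10)
trend = daily_mean_mismatch([(day_one, 0.25), (day_one, 0.75), (day_two, 0.0)])
```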
Diagnostic Accuracy Trend
| Component | frontend/src/components/Analytics/DiagnosticAccuracyTrend.tsx |
| Backend endpoint | GET /api/v1/admin/ops/diagnostic-accuracy (backend/app/api/admin_ops.py:448) |
| Data source | app.diagnostic_feedback (alembic migration 067) |
| What it plots | Per-day operator agreement rate — the share of verdict='agree' rows out of all rated v2 diagnostic-investigation outputs in the period. |
| Why it matters | Closes the calibration loop on the diagnostic v2 investigation runner. The v1 surface had no operator feedback channel, so drift was invisible until the next golden-eval pass. The v2 surface emits one investigation row per voice turn and lets operators rate them as agree / partial / disagree; the chart aggregates those verdicts into a daily accuracy time series. |
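The daily agreement rate is a grouped ratio: rows with `verdict='agree'` divided by all rated rows for that day, so `partial` and `disagree` both count against it. The sketch below shows that aggregation over invented (day, verdict) pairs; the function name is hypothetical and the real query is not shown in this document.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def daily_agreement_rate(rows: Iterable[Tuple[date, str]]) -> Dict[date, float]:
    """Share of 'agree' verdicts among all rated rows, per day."""
    counts: Dict[date, List[int]] = defaultdict(lambda: [0, 0])  # [agree, total]
    for day, verdict in rows:
        counts[day][1] += 1
        if verdict == "agree":
            counts[day][0] += 1
    return {day: agree / total for day, (agree, total) in sorted(counts.items())}


day_one, day_two = date(2026, 5, 9), date(2026, 5, 10)
ratings = [
    (day_one, "agree"),
    (day_one, "partial"),
    (day_one, "agree"),
    (day_one, "disagree"),
    (day_two, "agree"),
]
accuracy = daily_agreement_rate(ratings)
```

Days with no rated rows simply do not appear in the result, which avoids a division by zero.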
Voice diagnostic feedback workflow
Why P95 (not mean)
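The latency cards use the 95th percentile rather than the mean because the mean says little about tail behaviour. A single 30-second outlier in a thousand-query dataset shifts the mean by roughly 30 ms and leaves the 95th percentile unchanged, so neither number is dominated by a one-off extreme. The difference appears when slowness is sustained: once more than 1 in 20 requests is slow, the P95 reports that slowness directly, while the mean dilutes it across the fast majority. Operators care about the tail (the worst 1-in-20 user experience) far more than the mean. This convention follows the SRE practitioner literature; see Beyer et al. 2016 (chapter 4, "Service Level Objectives") for the canonical statement of why latency SLOs are written at the tail.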
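The outlier arithmetic can be checked with the standard library. The latencies below are invented: a thousand queries between 1 and 2 seconds, with the slowest one replaced by a 30-second outlier. Because the replaced query already took about 2 seconds, the mean moves by about 28 ms rather than a full 30 ms.

```python
from statistics import mean, quantiles
from typing import List


def p95(values: List[float]) -> float:
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20)[-1]


# Invented data: 1000 queries spread evenly from 1000 ms to 1999 ms.
baseline = [1000.0 + i for i in range(1000)]

# Replace the slowest query with a single 30-second outlier.
with_outlier = baseline[:-1] + [30000.0]

mean_shift_ms = mean(with_outlier) - mean(baseline)
p95_shift_ms = p95(with_outlier) - p95(baseline)
```

The outlier sits above the 95th-percentile cut point in both datasets, so the values that determine the P95 are identical and the shift is exactly zero.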
Related
- ADR-0008: User Feedback and Think Harder (rationale for the Think Harder escalation design)
- Query Processing Pipeline (pipeline stages recorded in app.pipeline_telemetry)
- Storage Architecture §Operational & Analytics Tables (table schemas for category_mismatch_telemetry and diagnostic_feedback)