Feedback Dashboard Metrics

The admin dashboards aggregate user satisfaction and operational quality signals across two pages:

FeedbackDashboardPage — feedback metrics: P95 latency, Think Harder funnel, satisfaction trends, content flagging.
Analytics.tsx — three tabs (queries, activity, costs); the costs tab (Owner role) carries the operational metrics added with migrations 066 and 067.

Three features were added to the existing feedback dashboard: a P95 Latency Comparison card, a Think Harder Funnel visualisation, and a Flag for Review action on individual feedback records. Two operations metrics charts (Category Mismatch Trend, Diagnostic Accuracy Trend) live on the Analytics costs tab.

What the Dashboard Already Had

Before these additions the dashboard provided:

Satisfaction summary cards (positive %, negative %, category distribution)
ADR-0008 metric cards: Think Harder count, improvement acceptance rate, override count, nano/timeout triggers
Satisfaction trend line chart
Intent distribution chart
Feedback list with AI investigation and rating override actions
Add to golden evaluation set action

P95 Latency Comparison

Purpose

The P95 Latency Comparison card exposes end-to-end pipeline latency at the 95th percentile, split by whether the query was escalated through Think Harder. This makes the cost of escalation visible in production terms.

Backend — `GET /admin/feedback/telemetry-stats`

GET /api/v1/admin/feedback/telemetry-stats?period_days=30
Authorization: Bearer <admin token>

The endpoint is defined in backend/app/api/admin_feedback.py and requires require_admin. It accepts a single query parameter:

Parameter	Type	Default	Range	Description
`period_days`	int	30	1–365	Look-back window
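
As a rough client-side illustration of the parameter contract in the table, the sketch below builds the request URL and rejects out-of-range values before any call is made. The host in `BASE_URL` and both helper names are hypothetical; only the path, the `period_days` range, and the Bearer header come from this page.

```python
from urllib.parse import urlencode

# Hypothetical host; substitute the real deployment.
BASE_URL = "https://example.invalid/api/v1"


def telemetry_stats_url(period_days: int = 30) -> str:
    """Build the telemetry-stats URL, enforcing the documented 1-365 range."""
    if not 1 <= period_days <= 365:
        raise ValueError("period_days must be between 1 and 365")
    query = urlencode({"period_days": period_days})
    return f"{BASE_URL}/admin/feedback/telemetry-stats?{query}"


def auth_headers(admin_token: str) -> dict[str, str]:
    """Bearer header in the form shown in the request example."""
    return {"Authorization": f"Bearer {admin_token}"}
```

Validating on the client is optional, since the server presumably enforces the same range; it only saves a round trip for obviously invalid input.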

SQL approach

The query uses a CTE to sum duration_ms across all pipeline stages per conversation_id, then joins to app.feedback_events to determine whether any think_harder event was recorded for that conversation. percentile_cont(0.95) with a FILTER clause computes separate P95 values for normal and escalated groups in a single pass:

WITH query_latency AS (
    SELECT
        pt.conversation_id,
        -- End-to-end latency: sum of every stage recorded for the conversation
        SUM(pt.duration_ms) AS total_ms,
        -- BOOL_OR yields NULL when the LEFT JOIN finds no think_harder row;
        -- COALESCE maps that to FALSE so NOT is_escalated matches normal queries
        COALESCE(BOOL_OR(fe.feedback_type = 'think_harder'), FALSE) AS is_escalated
    FROM app.pipeline_telemetry pt
    -- Caveat: several think_harder events on one conversation repeat its
    -- telemetry rows in this join and inflate total_ms
    LEFT JOIN app.feedback_events fe
        ON fe.conversation_id = pt.conversation_id
        AND fe.feedback_type = 'think_harder'
    WHERE pt.created_at >= :cutoff
    GROUP BY pt.conversation_id
)
SELECT
    COUNT(*)::int AS total_queries,
    COUNT(*) FILTER (WHERE NOT is_escalated)::int AS normal_queries,
    COUNT(*) FILTER (WHERE is_escalated)::int AS escalated_queries,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms)
        FILTER (WHERE NOT is_escalated) AS p95_normal,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms)
        FILTER (WHERE is_escalated) AS p95_escalated,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY total_ms) AS p95_overall
FROM query_latency

A second query computes per-stage average latency, returned in stage_avg_ms.
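
To make the P95 query's semantics concrete, the sketch below reproduces in plain Python what `percentile_cont(0.95)` with a `FILTER` clause computes: a linearly interpolated percentile over one group, and `None` (SQL NULL) for an empty group. The sample rows are invented and the function is an illustration, not the backend code.

```python
import math
from typing import Optional


def percentile_cont(values: list[float], fraction: float) -> Optional[float]:
    """Linearly interpolated percentile, mirroring PostgreSQL's percentile_cont.

    Returns None for an empty input, as the SQL aggregate returns NULL.
    """
    if not values:
        return None
    ordered = sorted(values)
    rank = fraction * (len(ordered) - 1)  # zero-based fractional row position
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


# Invented (total_ms, is_escalated) pairs, one per conversation.
rows = [(1000.0, False), (2000.0, False), (3000.0, False),
        (9000.0, True), (11000.0, True)]

# Each list comprehension plays the role of one FILTER (WHERE ...) clause.
p95_normal = percentile_cont([ms for ms, esc in rows if not esc], 0.95)
p95_escalated = percentile_cont([ms for ms, esc in rows if esc], 0.95)
p95_overall = percentile_cont([ms for ms, _ in rows], 0.95)
```

Because the value is interpolated between neighbouring samples, a reported P95 need not equal any single observed latency, which is why the response fields are floats.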

Response model

{
  "period_days": 30,
  "total_queries": 1842,
  "normal_queries": 1756,
  "escalated_queries": 86,
  "p95_normal_ms": 6840.0,
  "p95_escalated_ms": 13250.0,
  "p95_overall_ms": 7100.0,
  "stage_avg_ms": {
    "generation": 4210.3,
    "reranking": 1840.1,
    "retrieval": 620.5,
    "intent": 210.0
  }
}

All latency fields are nullable — they return null when no data exists for that group within the period.

UI card layout

The card renders three stat tiles side by side: P95 Normal, P95 Escalated, and P95 Overall. When both normal and escalated values are available, the overall tile shows a relative difference label. The label turns amber when the escalated P95 exceeds 1.5× the normal P95, and green otherwise.
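
The colour rule above can be sketched as a small pure function. The 1.5× threshold comes from this page; the label wording and the function name are assumptions, since the exact text of the relative difference label is not documented here.

```python
from typing import Optional

AMBER_RATIO = 1.5  # threshold stated above: escalated P95 above 1.5x normal P95


def p95_difference_label(
    p95_normal_ms: Optional[float],
    p95_escalated_ms: Optional[float],
) -> Optional[tuple[str, str]]:
    """Return (text, colour), or None when either value is unavailable."""
    if p95_normal_ms is None or p95_escalated_ms is None or p95_normal_ms <= 0:
        return None
    # Assumed wording: percentage change of escalated relative to normal.
    change_pct = (p95_escalated_ms - p95_normal_ms) / p95_normal_ms * 100
    colour = "amber" if p95_escalated_ms > AMBER_RATIO * p95_normal_ms else "green"
    return f"{change_pct:+.0f}% vs normal", colour
```

With the values from the sample response (6840.0 and 13250.0), the escalated P95 is roughly 1.94× the normal P95, so this sketch would return an amber label.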

Data flow diagram

Think Harder Funnel

Purpose

The funnel visualises the full conversion path from user dissatisfaction to accepted improvement, giving content managers a single-glance view of how effectively Think Harder recovers poor search experiences.

Data source

The funnel reads directly from the existing /admin/feedback/summary endpoint — no new backend endpoint was required. The three counts used are already part of FeedbackSummary:

Field	Meaning
`negative_count`	Total thumbs-down feedback events in the period
`think_harder_count`	Escalations triggered from negative feedback
`improvement_accepted_count`	Escalated responses accepted by users

Conversion rate calculation

The funnel renders two inline conversion rates:

Escalation rate: think_harder_count / negative_count × 100 (shown in blue under the Escalated column)
Acceptance rate: improvement_accepted_count / think_harder_count × 100 (shown in green under the Accepted column)

Both rates are only rendered when the denominator is non-zero.
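
The two rates and the zero-denominator guard can be sketched as follows. The field names match `FeedbackSummary`; the helper names and the sample counts in the usage note are illustrative only.

```python
from typing import Optional


def rate_pct(numerator: int, denominator: int) -> Optional[float]:
    """Percentage, or None when the denominator is zero (rate not rendered)."""
    if denominator == 0:
        return None
    return numerator / denominator * 100


def funnel_rates(
    negative_count: int,
    think_harder_count: int,
    improvement_accepted_count: int,
) -> dict[str, Optional[float]]:
    """Escalation and acceptance rates as defined for the funnel."""
    return {
        "escalation_rate": rate_pct(think_harder_count, negative_count),
        "acceptance_rate": rate_pct(improvement_accepted_count, think_harder_count),
    }
```

For example, 200 negative events, 86 escalations, and 43 accepted improvements give an escalation rate of 43% and an acceptance rate of 50%.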

Visual layout

[ Negative ]  →  [ Escalated ]  →  [ Accepted ]
    N               M  (X%)          K  (Y%)

The three columns are equal-width. Arrows between columns are decorative HTML entities. The percentage labels sit below the count in a smaller contrasting colour.

Flag for Review

Purpose

Admins can mark individual feedback records for follow-up by content teams without leaving the dashboard. The flag is stored as a JSONB metadata field on the feedback record so it survives schema changes and requires no migration.

Backend — `POST /admin/feedback/{id}/flag`

POST /api/v1/admin/feedback/{feedback_id}/flag
Authorization: Bearer <admin token>

The endpoint requires require_admin and accepts no request body. It performs an in-place JSONB merge on app.session_feedback:

UPDATE app.session_feedback
SET metadata = COALESCE(metadata, '{}'::jsonb) || CAST(:flag_data AS jsonb)
WHERE id = CAST(:feedback_id AS uuid)
RETURNING id::text

The flag_data value is {"flagged": true}. Using CAST(:param AS jsonb) (not ::jsonb) ensures compatibility with asyncpg parameter binding.

A 404 is raised if the feedback record does not exist.
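
The effect of the `COALESCE(...) || ...` expression on a JSON object can be modelled in Python as a shallow dictionary merge. This is a sketch of the semantics only; the function name is hypothetical and no database is involved.

```python
from typing import Any, Optional

FLAG_DATA = {"flagged": True}  # the flag_data value described above


def merge_flag(
    metadata: Optional[dict[str, Any]],
    flag_data: dict[str, Any],
) -> dict[str, Any]:
    """Model of COALESCE(metadata, '{}'::jsonb) || flag_data for JSON objects."""
    base = metadata if metadata is not None else {}  # COALESCE(metadata, '{}')
    return {**base, **flag_data}  # ||: top-level keys from the right side win
```

Existing metadata keys are preserved, a NULL metadata column becomes `{"flagged": true}`, and flagging the same record twice leaves it unchanged.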

Response model

{
  "status": "flagged",
  "feedback_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"
}

UI behaviour

Each row in the feedback list has an onFlagForReview callback. Clicking the button:

Sets flaggingId to the feedback record's ID (disabling the button while in-flight)
Calls feedbackAdminService.flagForReview(id)
Shows a toast.success('Feedback flagged for review') on success
Shows a toast.error('Failed to flag feedback') on failure
Clears flaggingId in the finally block

The flag state is not reflected back in the list UI after the action — it is a fire-and-forget signal for content review workflows.
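
The click sequence above follows a set-state, call, notify, always-reset pattern. The sketch below models it in Python purely to show the control flow; the real handler lives in the TypeScript component, and the `FlagController` class, its constructor arguments, and the toast callback shape are all hypothetical.

```python
from typing import Callable, Optional


class FlagController:
    """Python model of the click handler's control flow."""

    def __init__(
        self,
        flag_for_review: Callable[[str], None],
        toast: Callable[[str, str], None],
    ) -> None:
        self.flag_for_review = flag_for_review  # stands in for feedbackAdminService.flagForReview
        self.toast = toast                      # stands in for toast.success / toast.error
        self.flagging_id: Optional[str] = None  # non-None disables that row's button

    def on_flag_for_review(self, feedback_id: str) -> None:
        self.flagging_id = feedback_id
        try:
            self.flag_for_review(feedback_id)
            self.toast("success", "Feedback flagged for review")
        except Exception:
            self.toast("error", "Failed to flag feedback")
        finally:
            # Runs on both paths, so the button is always re-enabled.
            self.flagging_id = None
```

Resetting the in-flight ID in `finally` is what keeps a failed request from leaving the button permanently disabled.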

API Reference

GET /admin/feedback/telemetry-stats


Method	GET
Path	`/api/v1/admin/feedback/telemetry-stats`
Auth	Admin role required
Query params	`period_days` (int, default 30, range 1–365)
Response	`TelemetryStats`
Source	`backend/app/api/admin_feedback.py`

TelemetryStats schema

Field	Type	Description
`period_days`	int	The requested period
`total_queries`	int	Total distinct conversations in period
`normal_queries`	int	Conversations without Think Harder
`escalated_queries`	int	Conversations with at least one Think Harder event
`p95_normal_ms`	float | null	P95 total latency for normal queries (ms)
`p95_escalated_ms`	float | null	P95 total latency for escalated queries (ms)
`p95_overall_ms`	float | null	P95 total latency across all queries (ms)
`stage_avg_ms`	object	Map of stage name to average duration (ms)

POST /admin/feedback/{feedback_id}/flag


Method	POST
Path	`/api/v1/admin/feedback/{feedback_id}/flag`
Auth	Admin role required
Path param	`feedback_id` (UUID)
Request body	None
Response	`FlagResponse`
Source	`backend/app/api/admin_feedback.py`

FlagResponse schema

Field	Type	Description
`status`	string	Always `"flagged"` on success
`feedback_id`	string	UUID of the updated record

Operations Metrics (Costs Tab)

The Analytics page (frontend/src/pages/Analytics.tsx) carries three tabs: queries, activity, and costs (Owner role only). The costs tab adds two operational quality charts that surface system-quality drift independently of explicit user feedback. Both were added on 2026-05-09 alongside the Value Framework rollout.

Category Mismatch Trend


Component	`frontend/src/components/Analytics/CategoryMismatchTrend.tsx`
Backend endpoint	`GET /api/v1/admin/ops/category-mismatch` (`backend/app/api/admin_ops.py:346`)
Data source	`app.category_mismatch_telemetry` (alembic migration 066)
What it plots	Per-day mean of `mismatch_rate` — the fraction of top-K chunks whose tagged content category was OFF the intent's preferred set after the Stage 5b affinity rerank.
Why it matters	A persistent rise in mismatch_rate signals retrieval-steering drift: either the affinity coefficients need retuning, or the corpus has accumulated content the intent classifier doesn't model. The chart is the operator's first line of detection for the wheelchair-vs-cardiology class of regression.
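
As an illustration of the plotted quantity, the sketch below computes `mismatch_rate` for one query and then the per-day mean. The category names, the row shape, and the choice to return 0.0 for an empty top-K are assumptions; the actual telemetry schema is defined by migration 066.

```python
from collections import defaultdict
from statistics import mean


def mismatch_rate(chunk_categories: list[str], preferred: set[str]) -> float:
    """Fraction of top-K chunks whose category is outside the preferred set."""
    if not chunk_categories:
        return 0.0  # assumption: an empty result set counts as no mismatch
    off = sum(1 for category in chunk_categories if category not in preferred)
    return off / len(chunk_categories)


def daily_mean(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Per-day mean of mismatch_rate from (day, mismatch_rate) rows."""
    by_day: dict[str, list[float]] = defaultdict(list)
    for day, rate in rows:
        by_day[day].append(rate)
    return {day: mean(rates) for day, rates in sorted(by_day.items())}
```

A query whose top four chunks include one off-category chunk has a `mismatch_rate` of 0.25; the chart would show the mean of such values for each day.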

Diagnostic Accuracy Trend


Component	`frontend/src/components/Analytics/DiagnosticAccuracyTrend.tsx`
Backend endpoint	`GET /api/v1/admin/ops/diagnostic-accuracy` (`backend/app/api/admin_ops.py:456`)
Data source	`app.diagnostic_feedback` (alembic migration 067)
What it plots	Per-day operator agreement rate — the share of `verdict='agree'` rows out of all rated v2 diagnostic-investigation outputs in the period.
Why it matters	Closes the calibration loop on the diagnostic v2 investigation runner. The v1 surface had no operator feedback channel, so drift was invisible until the next golden-eval pass. The v2 surface emits one investigation row per voice turn and lets operators rate them as `agree` / `partial` / `disagree`; the chart aggregates those verdicts into a daily accuracy time series.
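
The daily agreement rate described above can be sketched as follows. Only `agree` counts towards the numerator, so `partial` and `disagree` both lower the rate. The `(day, verdict)` row shape is an assumption for illustration; the real columns are defined by migration 067.

```python
from collections import Counter, defaultdict


def daily_agreement_rate(rows: list[tuple[str, str]]) -> dict[str, float]:
    """Share of 'agree' verdicts per day.

    rows holds (day, verdict) pairs with verdict in agree / partial / disagree.
    Days with no rated rows simply do not appear in the result.
    """
    counts: dict[str, Counter] = defaultdict(Counter)
    for day, verdict in rows:
        counts[day][verdict] += 1
    return {
        day: verdicts["agree"] / sum(verdicts.values())
        for day, verdicts in sorted(counts.items())
    }
```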

Voice diagnostic feedback workflow

Why P95 (not mean)

The latency cards use the 95th percentile rather than the mean because the mean says little about tail behaviour. A single 30-second outlier in a thousand-query dataset shifts the mean by about 30 ms yet leaves the 95th percentile unchanged, so the mean reacts to isolated extremes without showing how slow the slowest requests typically are. Operators care far more about the tail (the worst 1-in-20 user experience) than about the mean. This convention follows the SRE practitioner literature; see Beyer et al. 2016 (chapter 4, "Service Level Objectives") for the canonical statement of why latency SLOs are written at the tail.
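
The outlier arithmetic can be checked with a few lines of Python. The baseline latencies are invented (1,000 samples spread evenly from 1000 ms to 1999 ms); replacing the slowest sample with a 30-second outlier moves the mean by about 28 ms and leaves the 95th percentile where it was.

```python
from statistics import mean, quantiles


def p95(samples: list[float]) -> float:
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    # method="inclusive" interpolates the same way as PostgreSQL's percentile_cont.
    return quantiles(samples, n=20, method="inclusive")[-1]


baseline = [1000.0 + i for i in range(1000)]  # invented: 1000..1999 ms
with_outlier = baseline[:-1] + [30_000.0]     # slowest query becomes 30 s

mean_shift = mean(with_outlier) - mean(baseline)  # about 28 ms
p95_shift = p95(with_outlier) - p95(baseline)     # 0.0: rank 950 of 1000 is untouched
```

The P95 only starts to move once more than 5% of requests are slow, which is the behaviour a tail-latency target is meant to capture.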

Related

ADR-0008: User Feedback and Think Harder (rationale for the Think Harder escalation design)
Query Processing Pipeline (pipeline stages recorded in app.pipeline_telemetry)
Storage Architecture §Operational & Analytics Tables (table schemas for category_mismatch_telemetry and diagnostic_feedback)
