Annotator Tool & Evaluation Dashboard Roadmap

1. Motivation

The current golden evaluation set (v4.0, 261 questions across 20 categories) was constructed through a combination of automated generation and developer review. This approach achieves broad coverage, but it lacks the human validation loop that distinguishes a research-grade evaluation benchmark from a developer test suite.

Academic evaluation methodology (Voorhees, 2002; Sanderson, 2010) emphasises the importance of inter-annotator agreement and expert validation to ensure question quality, ground truth correctness, and scoring reliability. In the current workflow the developer both writes the question and validates the answer, which introduces systematic bias: the person who designed the system also judges whether it works correctly.

An annotator tool with an evaluation dashboard would enable:

Domain expert validation: Hospital staff, medical librarians, and content managers validate that ground truth answers are clinically and informationally correct
Inter-annotator agreement: Multiple annotators independently score the same questions to measure consistency (Cohen's kappa for two annotators, Fleiss' kappa for three or more)
Continuous improvement: Annotators can create new questions based on real user search patterns observed in analytics
Regression monitoring: A dashboard tracking evaluation results over time allows early detection of quality degradation after code changes
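
The agreement statistic mentioned above can be computed directly from two annotators' labels. The following is a minimal sketch of Cohen's kappa (Cohen, 1960), assuming each annotator assigns one nominal label such as "pass" or "fail" per question; the function name and sample labels are illustrative and not part of the existing tooling.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels for eight questions (not real annotation data).
annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 6 of 8 questions (0.75), but chance agreement is already 0.53, so kappa is noticeably lower than the raw agreement rate.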

2. Current State

Aspect	Current	Target
Question creation	Developer-authored	Expert-validated via annotator tool
Ground truth	Developer-verified	Multi-annotator validated
Evaluation trigger	Manual CLI (`run_evaluation.py`)	On-demand from dashboard
Result tracking	JSON files in git	Time-series database with visualisation
Question versioning	Git commits	Structured version control with diff tracking
Quality assurance	Self-review	Inter-annotator agreement metrics

3. Annotator Tool Requirements

3.1 Core Functionality

The annotator tool operates independently of the main software system: it is a standalone application with its own deployment, authentication, and data management. This decoupling ensures that the evaluation infrastructure does not depend on the system under test.

Question management:

Create new golden evaluation questions from scratch with guided templates
Modify existing golden questions (edit question text, ground truth, expected entities, categories)
Validation workflows to ensure question quality before inclusion in the benchmark
Version control for tracking question iterations with full audit trail
Permission management for annotator access levels (viewer, annotator, reviewer, admin)

Output format:

File-based system compatible with the existing `golden_questions.json` schema
Standardised schema for evaluation questions with metadata
Multiple export formats (JSON, CSV, XLSX) for compatibility with external tools
Metadata inclusion for traceability (annotator ID, timestamp, validation status, confidence score)
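
To make the output requirements above concrete, here is a sketch of one question record with traceability metadata, exported as JSON. Every field name, the status values, and the sample record are assumptions for illustration; the actual `golden_questions.json` schema is not shown in this document and may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class GoldenQuestion:
    # All field names are hypothetical; align them with the real schema.
    question_id: str
    question: str
    ground_truth: str
    expected_entities: list[str]
    category: str
    annotator_id: str
    validation_status: str = "draft"  # assumed values: draft, in_review, approved
    confidence: float = 0.0           # annotator confidence, 0.0 to 1.0
    version: int = 1
    updated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def export_json(questions):
    """Serialise questions to JSON, keeping non-ASCII (Dutch) text readable."""
    return json.dumps([asdict(q) for q in questions], ensure_ascii=False, indent=2)


sample = GoldenQuestion(
    question_id="q-0001",
    question="Wat zijn de bezoektijden van de afdeling cardiologie?",
    ground_truth="<answer validated by a domain expert>",
    expected_entities=["cardiologie", "bezoektijden"],
    category="visiting_hours",
    annotator_id="ann-007",
    validation_status="approved",
    confidence=0.9,
)
print(export_json([sample]))
```

Keeping the metadata on the record itself means the CSV and XLSX exports can be derived from the same `asdict` output.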

3.2 Annotation Workflow

3.3 Question Quality Metrics

Each golden question should be scored on:

Dimension	Description	Measurement
Clarity	Is the question unambiguous to a native Dutch speaker?	Annotator rating (1-5)
Representativeness	Does this reflect a real user query pattern?	Analytics correlation
Ground truth accuracy	Is the expected answer clinically/informationally correct?	Domain expert validation
Discriminative power	Does this question differentiate good systems from bad?	Pass/fail variance across system versions
Entity precision	Are the expected entities correct and complete?	Inter-annotator agreement
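
The "pass/fail variance across system versions" measurement for discriminative power can be sketched as follows. This assumes each question has one boolean outcome per system version; for such binary data the variance is p(1 - p), where p is the pass fraction. The question IDs and outcome history are made up for illustration.

```python
def pass_fail_variance(outcomes):
    """Variance of binary pass/fail outcomes: p * (1 - p)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    p = sum(outcomes) / len(outcomes)  # True counts as 1, False as 0
    return p * (1 - p)


# Hypothetical outcomes for three questions across four system versions.
history = {
    "q-001": [True, True, True, True],      # always passes
    "q-002": [False, True, False, True],    # flips between versions
    "q-003": [False, False, False, True],   # passes only on the latest version
}
scores = {qid: pass_fail_variance(runs) for qid, runs in history.items()}
# Zero variance means the question never separated one version from another.
non_discriminating = [qid for qid, v in scores.items() if v == 0.0]
print(scores, non_discriminating)
```

A question with zero variance is not necessarily a bad question (it may guard a safety property), so this score is better used to flag candidates for review than to remove questions automatically.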

4. Evaluation Dashboard Specifications

4.1 Core Visualisations

Performance overview:

Pass rate trend over time (line chart with confidence bands)
Entity recall distribution (histogram per category)
Category-level heatmap showing pass/fail rates across evaluation runs
Response time distribution and P95/P99 latency tracking
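
One way to obtain the confidence bands for the pass rate trend is a Wilson score interval per evaluation run. This is a sketch of one reasonable choice, not a method prescribed by this roadmap; the function name and the example counts are assumptions.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical run: 235 of 261 questions passed.
low, high = wilson_interval(235, 261)
print(f"pass rate {235 / 261:.3f}, band [{low:.3f}, {high:.3f}]")
```

The Wilson interval behaves better than the plain normal approximation when a category has few questions or a pass rate near 0 or 1, which is likely for per-category charts.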

Drill-down capabilities:

Click any category to see individual question results
Compare two evaluation runs side-by-side (diff view)
Filter by date range, category, difficulty, annotator, question version
Export evaluation reports for stakeholder communication
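
The side-by-side comparison of two evaluation runs reduces to classifying each question by how its status changed. The sketch below assumes each run is available as a mapping from question ID to a pass/fail boolean; that result shape and the bucket names are hypothetical and not the existing output format of `run_evaluation.py`.

```python
def compare_runs(baseline, candidate):
    """Classify questions by how their pass/fail status changed between two runs."""
    diff = {"regressed": [], "fixed": [], "added": [], "removed": []}
    for qid in sorted(baseline.keys() | candidate.keys()):
        if qid not in baseline:
            diff["added"].append(qid)        # new question in the candidate run
        elif qid not in candidate:
            diff["removed"].append(qid)      # question dropped since the baseline
        elif baseline[qid] and not candidate[qid]:
            diff["regressed"].append(qid)    # passed before, fails now
        elif not baseline[qid] and candidate[qid]:
            diff["fixed"].append(qid)        # failed before, passes now
    return diff


# Illustrative runs keyed by hypothetical question IDs.
run_a = {"q-001": True, "q-002": True, "q-003": False, "q-004": True}
run_b = {"q-001": True, "q-002": False, "q-003": True, "q-005": True}
run_diff = compare_runs(run_a, run_b)
print(run_diff)
```

Questions whose status is unchanged are deliberately omitted so that the diff view highlights only what needs attention.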

Real-time assessment:

Near real-time updates as individual evaluations complete
Live status monitoring for running evaluation batches
Automated refresh with WebSocket push notifications

4.2 Evaluation Lifecycle

4.3 User Roles

Role	Capabilities
Viewer	View dashboard, export reports
Annotator	Create/edit questions, trigger evaluations
Reviewer	Approve/reject questions, adjudicate disagreements
Admin	Manage users, configure evaluation parameters, delete questions
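
The role table can be expressed as a small capability check. Note the assumption in this sketch: roles are treated as cumulative (each role also holds the capabilities of the roles listed above it), which the table does not state explicitly. The capability identifiers are hypothetical names derived from the table.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    ANNOTATOR = 2
    REVIEWER = 3
    ADMIN = 4


# Capabilities each role adds, following the table; identifiers are hypothetical.
ROLE_CAPABILITIES = {
    Role.VIEWER: {"view_dashboard", "export_reports"},
    Role.ANNOTATOR: {"create_question", "edit_question", "trigger_evaluation"},
    Role.REVIEWER: {"approve_question", "reject_question", "adjudicate_disagreement"},
    Role.ADMIN: {"manage_users", "configure_evaluation", "delete_question"},
}


def capabilities(role):
    """All capabilities of a role, assuming higher roles inherit lower ones."""
    granted = set()
    for r in Role:
        if r <= role:
            granted |= ROLE_CAPABILITIES[r]
    return granted


def can(role, capability):
    return capability in capabilities(role)


print(can(Role.ANNOTATOR, "view_dashboard"), can(Role.ANNOTATOR, "delete_question"))
```

If roles should not be cumulative, `capabilities` can simply return `ROLE_CAPABILITIES[role]`, and the table would then need to list shared capabilities for every role.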

5. Multi-Tenant Considerations

The annotator tool and evaluation dashboard must support the multi-tenant architecture planned for the RAG system (see Multi-Tenancy Roadmap):

Tenant isolation: Each hospital customer has a separate golden question set reflecting their specific departments, doctors, and services
Shared question templates: Common question patterns (safety, adversarial, multilingual) can be shared across tenants as templates
Per-tenant evaluation: Evaluations run against the tenant's specific RAG instance and knowledge graph
Cross-tenant benchmarking: Aggregate metrics (pass rate, entity recall) can be compared across tenants without exposing question content
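
Cross-tenant benchmarking without exposing question content can be approached by reducing each tenant's results to aggregates before they leave the tenant boundary. The sketch below assumes a per-question result record with the fields shown; the record type, field names, and tenant IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    tenant_id: str
    question_text: str       # tenant-private; never copied into the summary
    passed: bool
    entities_expected: int
    entities_found: int


def tenant_summaries(results):
    """Aggregate per-tenant pass rate and entity recall, dropping question text."""
    totals = {}
    for r in results:
        t = totals.setdefault(
            r.tenant_id, {"n": 0, "passed": 0, "expected": 0, "found": 0}
        )
        t["n"] += 1
        t["passed"] += int(r.passed)
        t["expected"] += r.entities_expected
        t["found"] += r.entities_found
    return {
        tenant: {
            "questions": t["n"],
            "pass_rate": t["passed"] / t["n"],
            # Recall is undefined when no entities were expected.
            "entity_recall": t["found"] / t["expected"] if t["expected"] else None,
        }
        for tenant, t in totals.items()
    }


# Illustrative results for two hypothetical tenants.
results = [
    QuestionResult("hospital-a", "private question 1", True, 2, 2),
    QuestionResult("hospital-a", "private question 2", False, 2, 1),
    QuestionResult("hospital-b", "private question 3", True, 3, 3),
]
summaries = tenant_summaries(results)
print(summaries)
```

Aggregates over very small question sets can still leak information about individual questions, so a minimum set size per tenant may be worth enforcing before a summary is shared.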

6. Integration Architecture

Key design decisions:

Annotator tool is file-based and operates independently from the RAG system
Evaluation engine is the existing `run_evaluation.py`, extended with API triggers
Dashboard uses a time-series database for historical tracking
All components communicate via standardised APIs and file formats

7. Implementation Priority

Phase	Scope	Effort	Value
Phase 1: Enhanced CLI	Add `--compare` flag to `run_evaluation.py` for run-to-run diff	S (2-4h)	Foundation for tracking
Phase 2: Result persistence	Store evaluation results in PostgreSQL with timestamps	M (8-12h)	Historical tracking
Phase 3: Basic dashboard	Read-only dashboard showing pass rate trends and category breakdown	L (16-24h)	Stakeholder visibility
Phase 4: Annotator tool MVP	Standalone question editor with validation and export	L (24-40h)	Expert validation
Phase 5: Full dashboard	Real-time evaluation, comparison views, role-based access	XL (40-60h)	Production-grade
Phase 6: Multi-tenant	Per-tenant question sets, cross-tenant benchmarking	XL (40-60h)	SaaS readiness
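
As an illustration of Phase 2, the sketch below stores per-question results with a timestamp and reads back a pass rate per run. It uses Python's built-in `sqlite3` module as a stand-in so that the example is self-contained; the roadmap targets PostgreSQL, and the table name, column names, and timestamps here are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per question per evaluation run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS evaluation_result (
    run_id      TEXT NOT NULL,
    question_id TEXT NOT NULL,
    passed      INTEGER NOT NULL,
    recorded_at TEXT NOT NULL,
    PRIMARY KEY (run_id, question_id)
)
"""


def store_run(conn, run_id, results, recorded_at=None):
    """Insert one run's pass/fail results inside a single transaction."""
    recorded_at = recorded_at or datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO evaluation_result VALUES (?, ?, ?, ?)",
            [(run_id, qid, int(ok), recorded_at) for qid, ok in results.items()],
        )


def pass_rate_by_run(conn):
    """Pass rate per run, ordered by when the run was recorded."""
    rows = conn.execute(
        "SELECT run_id, AVG(passed) FROM evaluation_result "
        "GROUP BY run_id ORDER BY MIN(recorded_at)"
    )
    return [(run_id, rate) for run_id, rate in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Illustrative runs with made-up timestamps.
store_run(conn, "run-1", {"q-001": True, "q-002": False}, "2024-01-01T00:00:00+00:00")
store_run(conn, "run-2", {"q-001": True, "q-002": True}, "2024-01-02T00:00:00+00:00")
trend = pass_rate_by_run(conn)
print(trend)
```

The same table shape carries over to PostgreSQL with a `TIMESTAMPTZ` column for `recorded_at` and a `BOOLEAN` for `passed`, and it gives the Phase 3 dashboard a single query for the pass rate trend.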

8. References

Voorhees, E. M. (2002). The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370.
Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
