Annotator Tool & Evaluation Dashboard Roadmap
1. Motivation
The current golden evaluation set (v4.0, 261 questions) was constructed programmatically through a combination of automated generation and developer review. While this approach achieves broad coverage (20 categories), it lacks the human validation loop that distinguishes a research-grade evaluation benchmark from a developer test suite.
Academic evaluation methodology (Voorhees, 2002; Sanderson, 2010) emphasises the importance of inter-annotator agreement and expert validation to ensure question quality, ground truth correctness, and scoring reliability. In the current workflow a developer both writes each question and validates its answer. This introduces systematic bias, because the same person who designed the system judges whether it works correctly.
An annotator tool with an evaluation dashboard would enable:
- Domain expert validation: Hospital staff, medical librarians, and content managers validate that ground truth answers are clinically and informationally correct
- Inter-annotator agreement: Multiple annotators independently score the same questions to measure consistency (Cohen's kappa, Fleiss' kappa)
- Continuous improvement: Annotators can create new questions based on real user search patterns observed in analytics
- Regression monitoring: A dashboard tracking evaluation results over time allows immediate detection of quality degradation after code changes
2. Current State
| Aspect | Current | Target |
|---|---|---|
| Question creation | Developer-authored | Expert-validated via annotator tool |
| Ground truth | Developer-verified | Multi-annotator validated |
| Evaluation trigger | Manual CLI (run_evaluation.py) | On-demand from dashboard |
| Result tracking | JSON files in git | Time-series database with visualisation |
| Question versioning | Git commits | Structured version control with diff tracking |
| Quality assurance | Self-review | Inter-annotator agreement metrics |
3. Annotator Tool Requirements
3.1 Core Functionality
The annotator tool operates independently of the main software system: it is a standalone application with its own deployment, authentication, and data management. This decoupling ensures that evaluation infrastructure does not depend on the system under test.
Question management:
- Create new golden evaluation questions from scratch with guided templates
- Modify existing golden questions (edit question text, ground truth, expected entities, categories)
- Validation workflows to ensure question quality before inclusion in the benchmark
- Version control for tracking question iterations with full audit trail
- Permission management for annotator access levels (viewer, annotator, reviewer, admin)
Output format:
- File-based system compatible with the existing golden_questions.json schema
- Standardised schema for evaluation questions with metadata
- Multiple export formats (JSON, CSV, XLSX) for compatibility with external tools
- Metadata inclusion for traceability (annotator ID, timestamp, validation status, confidence score)
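To make the output requirements concrete, the following is a minimal sketch of one question record and a JSON export. The existing schema is not reproduced in this document, so every field name here (question_id, expected_entities, validation_status, and so on), the allowed status values, and the confidence range are assumptions chosen for illustration only.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class GoldenQuestion:
    """One golden question plus traceability metadata (all field names are assumed)."""
    question_id: str
    question: str
    ground_truth: str
    category: str
    expected_entities: list[str] = field(default_factory=list)
    # Traceability metadata listed in the output requirements
    annotator_id: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    validation_status: str = "draft"  # assumed values: draft, validated, rejected
    confidence_score: float = 0.0     # assumed range: 0.0 to 1.0


def export_json(questions: list[GoldenQuestion]) -> str:
    """Serialise questions to a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(q) for q in questions], ensure_ascii=False, indent=2)
```

The CSV and XLSX exports could be derived from the same records; they are left out here because flattening list-valued fields such as expected_entities needs a convention that the requirements do not yet specify.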
3.2 Annotation Workflow
3.3 Question Quality Metrics
Each golden question should be scored on:
| Dimension | Description | Measurement |
|---|---|---|
| Clarity | Is the question unambiguous to a native Dutch speaker? | Annotator rating (1-5) |
| Representativeness | Does this reflect a real user query pattern? | Analytics correlation |
| Ground truth accuracy | Is the expected answer clinically/informationally correct? | Domain expert validation |
| Discriminative power | Does this question differentiate good systems from bad? | Pass/fail variance across system versions |
| Entity precision | Are the expected entities correct and complete? | Inter-annotator agreement |
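
For the inter-annotator agreement measurement, a minimal sketch of Cohen's kappa for two annotators is shown below. It assumes each annotator assigns one nominal label per question (for example "pass" or "fail"); the function name and the label values are illustrative, and settings with more than two annotators would need Fleiss' kappa instead.

```python
from collections import Counter


def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items, in item order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 6/8 observed, 34/64 by chance
```

In this made-up example the annotators agree on 6 of 8 items, but because both label "pass" most of the time, chance alone accounts for about 53% agreement, so kappa is 7/15 (roughly 0.47) rather than 0.75.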
4. Evaluation Dashboard Specifications
4.1 Core Visualisations
Performance overview:
- Pass rate trend over time (line chart with confidence bands)
- Entity recall distribution (histogram per category)
- Category-level heatmap showing pass/fail rates across evaluation runs
- Response time distribution and P95/P99 latency tracking
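The confidence bands and latency percentiles above could be computed along the following lines. This is a sketch under stated assumptions: the Wilson score interval is one common choice for a band around a pass rate, and nearest-rank is one of several percentile definitions; the document does not prescribe either, and both function names are hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

A dashboard would call wilson_interval once per evaluation run to draw the band around the trend line, and percentile(latencies, 95) and percentile(latencies, 99) per run for the latency panel.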
Drill-down capabilities:
- Click any category to see individual question results
- Compare two evaluation runs side-by-side (diff view)
- Filter by date range, category, difficulty, annotator, question version
- Export evaluation reports for stakeholder communication
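The side-by-side comparison of two runs reduces to a set difference over per-question outcomes. The sketch below assumes each run can be represented as a mapping from question ID to a pass/fail boolean; the actual result format of run_evaluation.py is not shown in this document, so that representation and the name diff_runs are assumptions.

```python
def diff_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Classify question IDs by how their outcome changed between two runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before, fails now
        "regressions": sorted(q for q in shared if baseline[q] and not candidate[q]),
        # Failed before, passes now
        "fixes": sorted(q for q in shared if not baseline[q] and candidate[q]),
        # Present only in the newer run
        "added": sorted(candidate.keys() - baseline.keys()),
        # Present only in the older run
        "removed": sorted(baseline.keys() - candidate.keys()),
    }
```

Reporting added and removed questions separately matters here because the question set itself is versioned: a change in pass rate between runs may come from a changed benchmark rather than a changed system.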
Real-time assessment:
- Near real-time updates as individual evaluations complete
- Live status monitoring for running evaluation batches
- Automated refresh with WebSocket push notifications
4.2 Evaluation Lifecycle
4.3 User Roles
| Role | Capabilities |
|---|---|
| Viewer | View dashboard, export reports |
| Annotator | Create/edit questions, trigger evaluations |
| Reviewer | Approve/reject questions, adjudicate disagreements |
| Admin | Manage users, configure evaluation parameters, delete questions |
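
The role table can be encoded as a simple capability lookup. In this sketch the capability strings are hypothetical names derived from the table, and roles are treated exactly as listed, without inheritance; whether a reviewer or admin should also hold the capabilities of lower roles is not stated above and would need to be decided.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    ANNOTATOR = "annotator"
    REVIEWER = "reviewer"
    ADMIN = "admin"


# Capability names are illustrative; each set mirrors one row of the role table.
CAPABILITIES: dict[Role, frozenset[str]] = {
    Role.VIEWER: frozenset({"view_dashboard", "export_reports"}),
    Role.ANNOTATOR: frozenset(
        {"create_question", "edit_question", "trigger_evaluation"}
    ),
    Role.REVIEWER: frozenset(
        {"approve_question", "reject_question", "adjudicate_disagreement"}
    ),
    Role.ADMIN: frozenset({"manage_users", "configure_evaluation", "delete_question"}),
}


def can(role: Role, capability: str) -> bool:
    """Return True if the role is explicitly granted the capability."""
    return capability in CAPABILITIES[role]
```

Keeping the mapping as data rather than scattered conditionals makes it straightforward to switch to cumulative roles later by taking the union of the sets.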
5. Multi-Tenant Considerations
The annotator tool and evaluation dashboard must support the multi-tenant architecture planned for the RAG system (see Multi-Tenancy Roadmap):
- Tenant isolation: Each hospital customer has a separate golden question set reflecting their specific departments, doctors, and services
- Shared question templates: Common question patterns (safety, adversarial, multilingual) can be shared across tenants as templates
- Per-tenant evaluation: Evaluations run against the tenant's specific RAG instance and knowledge graph
- Cross-tenant benchmarking: Aggregate metrics (pass rate, entity recall) can be compared across tenants without exposing question content
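Cross-tenant benchmarking without exposing question content can be approached by aggregating per tenant and emitting only the aggregates. The sketch below assumes a hypothetical per-question result record with the fields shown; the key point is that question_text is read as input but never copied into the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionResult:
    """One evaluated question for one tenant (field names are assumed)."""
    tenant_id: str
    question_text: str
    passed: bool
    entity_recall: float


def tenant_benchmarks(results: list[QuestionResult]) -> dict[str, dict[str, float]]:
    """Aggregate pass rate and mean entity recall per tenant, omitting question text."""
    grouped: dict[str, list[QuestionResult]] = {}
    for r in results:
        grouped.setdefault(r.tenant_id, []).append(r)
    return {
        tenant: {
            "question_count": float(len(rows)),
            "pass_rate": sum(r.passed for r in rows) / len(rows),
            "mean_entity_recall": sum(r.entity_recall for r in rows) / len(rows),
        }
        for tenant, rows in grouped.items()
    }
```

With very small question sets an aggregate can still reveal individual outcomes, so a production version would probably also need a minimum question count before a tenant appears in a cross-tenant comparison.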
6. Integration Architecture
Key design decisions:
- Annotator tool is file-based and operates independently from the RAG system
- Evaluation engine is the existing run_evaluation.py enhanced with API triggers
- Dashboard uses a time-series database for historical tracking
- All components communicate via standardised APIs and file formats
7. Implementation Priority
| Phase | Scope | Effort | Value |
|---|---|---|---|
| Phase 1: Enhanced CLI | Add --compare flag to run_evaluation.py for run-to-run diff | S (2-4h) | Foundation for tracking |
| Phase 2: Result persistence | Store evaluation results in PostgreSQL with timestamps | M (8-12h) | Historical tracking |
| Phase 3: Basic dashboard | Read-only dashboard showing pass rate trends and category breakdown | L (16-24h) | Stakeholder visibility |
| Phase 4: Annotator tool MVP | Standalone question editor with validation and export | L (24-40h) | Expert validation |
| Phase 5: Full dashboard | Real-time evaluation, comparison views, role-based access | XL (40-60h) | Production-grade |
| Phase 6: Multi-tenant | Per-tenant question sets, cross-tenant benchmarking | XL (40-60h) | SaaS readiness |
8. References
- Voorhees, E. M. (2002). The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370.
- Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
- Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.