Annotator Tool & Evaluation Dashboard Roadmap

1. Motivation

The current golden evaluation set (v4.0, 261 questions) was constructed through a combination of automated generation and developer review. This approach achieves broad coverage (20 categories), but it lacks the human validation loop that distinguishes a research-grade evaluation benchmark from a developer test suite.

Academic evaluation methodology (Voorhees, 2002; Sanderson, 2010) emphasises the importance of inter-annotator agreement and expert validation to ensure question quality, ground truth correctness, and scoring reliability. The current workflow, in which the developer both creates each question and validates its answer, introduces systematic bias: the person who designed the system also judges whether it works correctly.

An annotator tool with an evaluation dashboard would enable:

  1. Domain expert validation: Hospital staff, medical librarians, and content managers validate that ground truth answers are clinically and informationally correct
  2. Inter-annotator agreement: Multiple annotators independently score the same questions to measure consistency (Cohen's kappa, Fleiss' kappa)
  3. Continuous improvement: Annotators can create new questions based on real user search patterns observed in analytics
  4. Regression monitoring: A dashboard tracking evaluation results over time allows immediate detection of quality degradation after code changes
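
To make the agreement measure in item 2 concrete, the following is a minimal sketch of Cohen's kappa for two annotators who label the same questions. The function name and the sample verdicts are illustrative assumptions, not part of the existing evaluation code. Fleiss' kappa, which handles more than two annotators, is not covered here.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two annotators on the same six questions.
annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this sample the annotators agree on four of six questions (observed agreement 0.67) while chance agreement is 0.5, so kappa is about 0.33.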

2. Current State

| Aspect | Current | Target |
| --- | --- | --- |
| Question creation | Developer-authored | Expert-validated via annotator tool |
| Ground truth | Developer-verified | Multi-annotator validated |
| Evaluation trigger | Manual CLI (run_evaluation) | On-demand from dashboard |
| Result tracking | JSON files in git | Time-series database with visualisation |
| Question versioning | Git commits | Structured version control with diff tracking |
| Quality assurance | Self-review | Inter-annotator agreement metrics |

3. Annotator Tool Requirements

3.1 Core Functionality

The annotator tool operates independently of the RAG system: it is a standalone application with its own deployment, authentication, and data management. This decoupling ensures that the evaluation infrastructure does not depend on the system under test.

Question management:

  • Create new golden evaluation questions from scratch with guided templates
  • Modify existing golden questions (edit question text, ground truth, expected entities, categories)
  • Validation workflows to ensure question quality before inclusion in the benchmark
  • Version control for tracking question iterations with full audit trail
  • Permission management for annotator access levels (viewer, annotator, reviewer, admin)

Output format:

  • File-based system compatible with existing golden_questions.json schema
  • Standardised schema for evaluation questions with metadata
  • Multiple export formats (JSON, CSV, XLSX) for compatibility with external tools
  • Metadata inclusion for traceability (annotator ID, timestamp, validation status, confidence score)
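
As an illustration of the output format, the sketch below models one question record carrying the traceability metadata listed above and exports it to JSON and CSV. All field names are assumptions made for this example; the actual golden_questions.json schema is not shown in this document and may differ. The sample question and values are hypothetical.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class GoldenQuestion:
    # Field names are illustrative; the real schema may use different ones.
    question_id: str
    question: str
    ground_truth: str
    expected_entities: list
    category: str
    annotator_id: str
    timestamp: str          # ISO 8601
    validation_status: str  # e.g. "draft" or "validated"
    confidence: float


def to_json(questions):
    """Serialise questions as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(q) for q in questions], ensure_ascii=False, indent=2)


def to_csv(questions):
    """Serialise questions as CSV; entity lists are joined with ';'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(GoldenQuestion)])
    writer.writeheader()
    for q in questions:
        row = asdict(q)
        row["expected_entities"] = ";".join(row["expected_entities"])
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical record for demonstration only.
sample = GoldenQuestion(
    question_id="q-0001",
    question="Wat zijn de bezoektijden van de afdeling cardiologie?",
    ground_truth="(expected answer text)",
    expected_entities=["cardiologie", "bezoektijden"],
    category="visiting_hours",
    annotator_id="ann-01",
    timestamp="2025-01-15T10:30:00+00:00",
    validation_status="validated",
    confidence=0.9,
)
json_text = to_json([sample])
csv_text = to_csv([sample])
```

XLSX export is left out because it needs a third-party library; the same row dictionaries could be passed to one.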

3.2 Annotation Workflow

3.3 Question Quality Metrics

Each golden question should be scored on:

| Dimension | Description | Measurement |
| --- | --- | --- |
| Clarity | Is the question unambiguous to a native Dutch speaker? | Annotator rating (1-5) |
| Representativeness | Does this reflect a real user query pattern? | Analytics correlation |
| Ground truth accuracy | Is the expected answer clinically/informationally correct? | Domain expert validation |
| Discriminative power | Does this question differentiate good systems from bad? | Pass/fail variance across system versions |
| Entity precision | Are the expected entities correct and complete? | Inter-annotator agreement |
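
One possible reading of "pass/fail variance across system versions" is the variance of a question's 0/1 outcome over the versions it was run against. The sketch below assumes that interpretation; the function name and input shape are hypothetical.

```python
def discriminative_power(outcomes):
    """Variance of one question's pass/fail outcomes across system versions.

    `outcomes` is a list of booleans, one per system version.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    p = sum(1 for passed in outcomes if passed) / len(outcomes)
    # Population variance of a 0/1 variable with pass fraction p.
    return p * (1 - p)
```

Under this reading, a question that every version passes (or every version fails) scores 0 and tells versions apart poorly, while a question passed by half of the versions reaches the maximum of 0.25.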

4. Evaluation Dashboard Specifications

4.1 Core Visualisations

Performance overview:

  • Pass rate trend over time (line chart with confidence bands)
  • Entity recall distribution (histogram per category)
  • Category-level heatmap showing pass/fail rates across evaluation runs
  • Response time distribution and P95/P99 latency tracking
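
Two of the quantities above can be sketched briefly. For the confidence band around a pass rate, the example uses the Wilson score interval, which is one reasonable choice and not necessarily the method the dashboard will adopt. For P95/P99 latency it uses the nearest-rank percentile. Function names are illustrative.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, `wilson_interval(50, 100)` gives a band of roughly 0.40 to 0.60 around a 50% pass rate, and `percentile(latencies_ms, 95)` returns the P95 of a list of response times.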

Drill-down capabilities:

  • Click any category to see individual question results
  • Compare two evaluation runs side-by-side (diff view)
  • Filter by date range, category, difficulty, annotator, question version
  • Export evaluation reports for stakeholder communication
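
The side-by-side comparison of two runs could be backed by a small diff function such as the one below. It assumes each run is available as a mapping from question ID to a pass/fail boolean; the actual result format written by run_evaluation.py is not described here, so this shape is an assumption.

```python
def diff_runs(baseline, candidate):
    """Compare two evaluation runs given as {question_id: passed} mappings."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed in the baseline, fails in the candidate.
        "regressions": sorted(q for q in shared if baseline[q] and not candidate[q]),
        # Failed in the baseline, passes in the candidate.
        "fixes": sorted(q for q in shared if not baseline[q] and candidate[q]),
        # Questions present in only one of the two runs.
        "added": sorted(candidate.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - candidate.keys()),
    }


# Hypothetical runs for demonstration.
run_a = {"q1": True, "q2": False, "q3": True}
run_b = {"q1": False, "q2": True, "q4": True}
report = diff_runs(run_a, run_b)
```

Separating questions that were added or removed from genuine regressions keeps changes to the question set from being mistaken for changes in system quality.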

Real-time assessment:

  • Near real-time updates as individual evaluations complete
  • Live status monitoring for running evaluation batches
  • Automated refresh with WebSocket push notifications

4.2 Evaluation Lifecycle

4.3 User Roles

| Role | Capabilities |
| --- | --- |
| Viewer | View dashboard, export reports |
| Annotator | Create/edit questions, trigger evaluations |
| Reviewer | Approve/reject questions, adjudicate disagreements |
| Admin | Manage users, configure evaluation parameters, delete questions |
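
A minimal sketch of how these roles might be checked in code follows. It assumes the roles are cumulative (each role also has the capabilities of the roles listed before it), which the table does not state explicitly. The capability identifiers are hypothetical names for the entries in the table.

```python
from enum import IntEnum


class Role(IntEnum):
    VIEWER = 1
    ANNOTATOR = 2
    REVIEWER = 3
    ADMIN = 4


# Lowest role that holds each capability, following the table above.
# Assumption: higher roles inherit everything below them.
MIN_ROLE = {
    "view_dashboard": Role.VIEWER,
    "export_reports": Role.VIEWER,
    "edit_questions": Role.ANNOTATOR,
    "trigger_evaluation": Role.ANNOTATOR,
    "approve_questions": Role.REVIEWER,
    "adjudicate_disagreements": Role.REVIEWER,
    "manage_users": Role.ADMIN,
    "configure_evaluation": Role.ADMIN,
    "delete_questions": Role.ADMIN,
}


def is_allowed(role, capability):
    """Return True if `role` holds `capability`; unknown capabilities raise KeyError."""
    return role >= MIN_ROLE[capability]
```

If the roles turn out not to be cumulative, the mapping can be replaced by an explicit set of capabilities per role without changing the `is_allowed` call sites.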

5. Multi-Tenant Considerations

The annotator tool and evaluation dashboard must support the multi-tenant architecture planned for the RAG system (see Multi-Tenancy Roadmap):

  • Tenant isolation: Each hospital customer has a separate golden question set reflecting their specific departments, doctors, and services
  • Shared question templates: Common question patterns (safety, adversarial, multilingual) can be shared across tenants as templates
  • Per-tenant evaluation: Evaluations run against the tenant's specific RAG instance and knowledge graph
  • Cross-tenant benchmarking: Aggregate metrics (pass rate, entity recall) can be compared across tenants without exposing question content
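
Cross-tenant benchmarking without exposing question content can be approached by reducing each tenant's results to aggregates before they leave the tenant boundary. The sketch below assumes per-question results carry a `passed` flag and an `entity_recall` value; these field names and the sample data are hypothetical.

```python
def tenant_summary(tenant_id, results):
    """Reduce per-question results to aggregate metrics only.

    Question text, ground truth, and question IDs are deliberately not
    copied into the summary.
    """
    n = len(results)
    if n == 0:
        raise ValueError("tenant has no results")
    return {
        "tenant_id": tenant_id,
        "questions": n,
        "pass_rate": sum(1 for r in results if r["passed"]) / n,
        "mean_entity_recall": sum(r["entity_recall"] for r in results) / n,
    }


# Hypothetical per-question results for one tenant.
results_a = [
    {"question": "(tenant-specific text)", "passed": True, "entity_recall": 1.0},
    {"question": "(tenant-specific text)", "passed": False, "entity_recall": 0.5},
]
summary_a = tenant_summary("hospital-a", results_a)
```

Only summaries like `summary_a` would be shared for comparison, so a tenant's questions stay within its own data set.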

6. Integration Architecture

Key design decisions:

  • Annotator tool is file-based and operates independently from the RAG system
  • Evaluation engine is the existing run_evaluation.py enhanced with API triggers
  • Dashboard uses a time-series database for historical tracking
  • All components communicate via standardised APIs and file formats

7. Implementation Priority

| Phase | Scope | Effort | Value |
| --- | --- | --- | --- |
| Phase 1: Enhanced CLI | Add --compare flag to run_evaluation.py for run-to-run diff | S (2-4h) | Foundation for tracking |
| Phase 2: Result persistence | Store evaluation results in PostgreSQL with timestamps | M (8-12h) | Historical tracking |
| Phase 3: Basic dashboard | Read-only dashboard showing pass rate trends and category breakdown | L (16-24h) | Stakeholder visibility |
| Phase 4: Annotator tool MVP | Standalone question editor with validation and export | L (24-40h) | Expert validation |
| Phase 5: Full dashboard | Real-time evaluation, comparison views, role-based access | XL (40-60h) | Production-grade |
| Phase 6: Multi-tenant | Per-tenant question sets, cross-tenant benchmarking | XL (40-60h) | SaaS readiness |
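
To give a feel for Phase 2, here is a rough sketch of timestamped result storage and a pass-rate trend query. It uses Python's built-in sqlite3 module with an in-memory database purely so the example runs without a server; the roadmap targets PostgreSQL, where column types and the driver would differ. Table and column names are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE evaluation_runs (
    run_id      INTEGER PRIMARY KEY,
    started_at  TEXT NOT NULL,      -- ISO 8601 timestamp
    git_commit  TEXT
);
CREATE TABLE question_results (
    run_id       INTEGER NOT NULL REFERENCES evaluation_runs(run_id),
    question_id  TEXT NOT NULL,
    category     TEXT NOT NULL,
    passed       INTEGER NOT NULL,  -- 0 or 1
    PRIMARY KEY (run_id, question_id)
);
"""


def store_run(conn, started_at, results, git_commit=None):
    """Insert one run and its per-question results; return the new run_id."""
    cur = conn.execute(
        "INSERT INTO evaluation_runs (started_at, git_commit) VALUES (?, ?)",
        (started_at, git_commit),
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO question_results VALUES (?, ?, ?, ?)",
        [(run_id, r["question_id"], r["category"], int(r["passed"])) for r in results],
    )
    conn.commit()
    return run_id


def pass_rate_trend(conn):
    """Return (started_at, pass_rate) per run, oldest first."""
    return conn.execute(
        """
        SELECT r.started_at, AVG(q.passed)
        FROM evaluation_runs r
        JOIN question_results q ON q.run_id = r.run_id
        GROUP BY r.run_id, r.started_at
        ORDER BY r.started_at
        """
    ).fetchall()


# Hypothetical usage with two small runs.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_run(conn, "2025-01-01T09:00:00+00:00", [
    {"question_id": "q1", "category": "safety", "passed": True},
    {"question_id": "q2", "category": "safety", "passed": False},
])
store_run(conn, "2025-01-02T09:00:00+00:00", [
    {"question_id": "q1", "category": "safety", "passed": True},
    {"question_id": "q2", "category": "safety", "passed": True},
])
trend = pass_rate_trend(conn)
```

Keeping one row per question per run, instead of only a per-run total, is what would later allow the category breakdown and run-to-run comparison of Phases 3 and 5.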

8. References

  • Voorhees, E. M. (2002). The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems, pp. 355–370.
  • Sanderson, M. (2010). Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
  • Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.