Evaluation Report — 2026-02-19 13:15 UTC
Label: graph-on-post-fix
Summary
| Metric | Value |
|---|---|
| Pass rate | 99.3% (145/146) |
| Failed | 1 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.937 |
| Avg NDCG@5 | 0.023 |
| Avg MRR | 0.016 |
| Avg Precision@5 | 0.013 |
| Avg Recall@5 | 0.035 |
| Avg response time | 10359 ms |
| Total eval duration | 1659.0 s |
| Safety refusal accuracy | 100.0% |
Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines `expected_source_urls` at a coarse level (e.g. `/cardiologie`), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. End-to-end answer quality is better reflected by entity recall and pass rate.
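The effect described in the note can be illustrated with a minimal sketch. It assumes the scorer compares URL paths and contrasts exact matching with a more lenient prefix match. The function names (`is_hit`, `mrr`, `precision_at_k`) and the `example.org` URLs are hypothetical and are not taken from `run_evaluation.py`.

```python
from urllib.parse import urlparse


def _path(url: str) -> str:
    """Return the URL path without a trailing slash."""
    return urlparse(url).path.rstrip("/")


def is_hit(retrieved: str, expected: str, prefix: bool) -> bool:
    """Exact path match, or (if prefix=True) any sub-page of `expected`."""
    r, e = _path(retrieved), _path(expected)
    if r == e:
        return True
    return prefix and r.startswith(e + "/")


def mrr(retrieved: list[str], expected: list[str], prefix: bool = False) -> float:
    """Reciprocal rank of the first retrieved URL matching any expected URL."""
    for rank, url in enumerate(retrieved, start=1):
        if any(is_hit(url, exp, prefix) for exp in expected):
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved: list[str], expected: list[str],
                   k: int = 5, prefix: bool = False) -> float:
    """Fraction of the top-k slots filled by a matching URL."""
    hits = sum(any(is_hit(u, e, prefix) for e in expected) for u in retrieved[:k])
    return hits / k


# Hypothetical retrieval result for a question whose golden URL is /cardiologie.
retrieved = [
    "https://example.org/nieuws/item",
    "https://example.org/cardiologie/team",
    "https://example.org/files/brochure.pdf",
]
expected = ["/cardiologie"]

print(mrr(retrieved, expected))               # exact matching: 0.0
print(mrr(retrieved, expected, prefix=True))  # prefix matching: 0.5
```

Under exact matching the sub-page at rank 2 counts as a miss, which is how correct retrievals can still score near zero. Prefix matching is only one possible relaxation, and it would still miss PDF brochures stored under an unrelated path.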
System Configuration
Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.
Git Context
| Property | Value |
|---|---|
| Branch | demo-animations-update |
| Commit | ae38583 |
| Message | feat: add A/B test mode for Knowledge Graph value assessment |
LLM Models
| Role | Model |
|---|---|
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |
Retrieval Parameters
| Parameter | Value |
|---|---|
| Full mode (always-on reranking) | ON |
| Rerank candidates | 20 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 8000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
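The report only states the two weights (BM25 0.3, vector 0.7), not how the scores are combined. The sketch below assumes one common approach: min-max normalise each score list, then take a weighted sum. The helper names `min_max` and `fuse` and the sample scores are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; a constant list maps to 1.0 for every entry."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25: dict[str, float], vector: dict[str, float],
         bm25_weight: float = 0.3, vector_weight: float = 0.7) -> list[tuple[str, float]]:
    """Weighted sum of normalised scores; a document missing from one list scores 0 there."""
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc_id: bm25_weight * b.get(doc_id, 0.0) + vector_weight * v.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only: "a" wins on keywords, "b" wins on semantics.
ranking = fuse({"a": 10.0, "b": 2.0}, {"a": 0.2, "b": 0.8, "c": 0.5})
print([doc_id for doc_id, _ in ranking])  # ['b', 'c', 'a']
```

With these weights a strong semantic match outranks a strong keyword match. Other fusion schemes, such as reciprocal rank fusion, would order the candidates differently.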
Feature Flags
These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.
| Feature | Status | Impact |
|---|---|---|
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | ON | Real-time token delivery |
Evaluation Run Parameters
| Parameter | Value |
|---|---|
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
Results by Category
| Category | Pass | Fail | Error | Total | Rate |
|---|---|---|---|---|---|
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 5 | 1 | 0 | 6 | 83.3% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 8 | 0 | 0 | 8 | 100.0% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 9 | 0 | 0 | 9 | 100.0% |
| taxonomy_alias | 7 | 0 | 0 | 7 | 100.0% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |
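The per-category rates above are plain pass counts divided by totals. A minimal sketch of that aggregation follows, assuming each result is a (category, status) pair; the function name `pass_rates` is hypothetical.

```python
from collections import Counter, defaultdict


def pass_rates(results: list[tuple[str, str]]) -> dict[str, tuple[int, int, float]]:
    """Map each category to (passed, total, pass rate in percent, one decimal)."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for category, status in results:
        counts[category][status] += 1
    table = {}
    for category, by_status in counts.items():
        total = sum(by_status.values())
        passed = by_status["PASS"]
        table[category] = (passed, total, round(100 * passed / total, 1))
    return table


# Reproduces the doctor_department row: 5 passes and 1 failure.
rows = [("doctor_department", "PASS")] * 5 + [("doctor_department", "FAIL")]
print(pass_rates(rows))  # {'doctor_department': (5, 6, 83.3)}
```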
Timing Analysis
Response time distribution across all evaluated questions.
| Percentile | Response Time |
|---|---|
| Min | 53 ms |
| P50 (median) | 9811 ms |
| P90 | 16141 ms |
| P99 | 23470 ms |
| Max | 23989 ms |
| Mean | 10359 ms |
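The report does not say which percentile definition it uses. The sketch below assumes linear interpolation between closest ranks, which is one common convention; nearest-rank would give slightly different P90 and P99 values on small samples. The function name `percentile` and the sample timings are illustrative only.

```python
def percentile(values: list[float], p: float) -> float:
    """p-th percentile (0 to 100) using linear interpolation between closest ranks."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)                      # floor, since position >= 0
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up response times in milliseconds, not taken from this run.
times_ms = [500, 100, 300, 200, 400]
print(percentile(times_ms, 50))   # 300.0
print(percentile(times_ms, 100))  # 500.0
```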
Response Time by Category
| Category | Mean | Median | Max | Count |
|---|---|---|---|---|
| ambiguous_symptom | 15223 ms | 16056 ms | 23989 ms | 5 |
| campus_info | 7946 ms | 7710 ms | 12472 ms | 6 |
| compound_word | 9357 ms | 9575 ms | 12307 ms | 6 |
| condition_department | 10533 ms | 9811 ms | 18780 ms | 19 |
| doctor_department | 8331 ms | 7162 ms | 16141 ms | 6 |
| emergency | 8349 ms | 7614 ms | 10143 ms | 3 |
| entity_disambiguation | 11072 ms | 11513 ms | 14025 ms | 8 |
| followup_chain | 12087 ms | 10926 ms | 20169 ms | 6 |
| multi_hop_graph | 13138 ms | 12707 ms | 23063 ms | 19 |
| multilingual | 9337 ms | 10162 ms | 11608 ms | 8 |
| navigation | 10476 ms | 10352 ms | 13565 ms | 5 |
| out_of_scope | 3461 ms | 2706 ms | 9700 ms | 9 |
| practical_info | 11151 ms | 10208 ms | 17958 ms | 12 |
| referral | 9154 ms | 8879 ms | 11538 ms | 3 |
| safety_refusal | 7287 ms | 3018 ms | 21304 ms | 7 |
| service_info | 9870 ms | 9855 ms | 12337 ms | 9 |
| taxonomy_alias | 14586 ms | 12320 ms | 23470 ms | 7 |
| treatment_info | 10651 ms | 10982 ms | 17078 ms | 8 |
Failures
GQ-002
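Question: Welke cardiologen werken bij ZOL? (English: Which cardiologists work at ZOL?)
Expected ground truth: Bij de dienst Cardiologie van ZOL werken meerdere cardiologen, waaronder Dr. Wilfried Mullens, Dr. Pieter Koopman en andere specialisten. (English: Several cardiologists work in ZOL's Cardiologie department, including Dr. Wilfried Mullens, Dr. Pieter Koopman, and other specialists.)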
Issue: Entity recall too low (0.00). Missing entities: Cardiologie
Answer snippet: Bij Ziekenhuis Oost-Limburg (ZOL) werken meerdere cardiologen. Enkele van de cardiologen die expliciet genoemd worden in de beschikbare ziekenhuisdocumenten zijn: - Dr. Matthias Dupont - Dr. Philippe Bertrand - Dr. Koen Ameloot - Dr. Daan Cottens - Dr. Yves Cruysberghs - Dr. Sebastiaan Deckx - Dr.
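One way this failure can arise is sketched below, assuming entity recall is the fraction of expected entities found in the answer by case-insensitive substring matching. That matching rule and the function name `entity_recall` are assumptions; the scorer in `run_evaluation.py` may work differently.

```python
def entity_recall(answer: str, expected_entities: list[str]) -> tuple[float, list[str]]:
    """Return (recall, missing entities) using case-insensitive substring matching."""
    if not expected_entities:
        return 1.0, []
    text = answer.casefold()
    missing = [e for e in expected_entities if e.casefold() not in text]
    found = len(expected_entities) - len(missing)
    return found / len(expected_entities), missing


# First sentence of the GQ-002 answer snippet; the expected entity is "Cardiologie".
answer = "Bij Ziekenhuis Oost-Limburg (ZOL) werken meerdere cardiologen."
print(entity_recall(answer, ["Cardiologie"]))  # (0.0, ['Cardiologie'])
```

Under this rule the answer is penalised because it says "cardiologen" (cardiologists) and never uses the department name "Cardiologie", even though it lists cardiologists by name. Stemming or alias matching would be one way to tolerate such inflections, at the cost of more false positives.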
Detailed Results
Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).
| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GQ-001 | doctor_department | PASS | 1.00 | 0.50 | 0.33 | — | — | — | — | 6313 | 3 |
| GQ-002 | doctor_department | FAIL | 0.00 | 0.00 | 0.00 | — | — | — | — | 7844 | 2 |
| GQ-003 | doctor_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6887 | 2 |
| GQ-004 | doctor_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 5638 | 1 |
| GQ-005 | doctor_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7162 | 3 |
| GQ-006 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12918 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12724 | 7 |
| GQ-008 | condition_department | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 18780 | 5 |
| GQ-009 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11540 | 6 |
| GQ-010 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10319 | 5 |
| GQ-011 | campus_info | PASS | 0.75 | 0.00 | 0.00 | — | — | — | — | 6302 | 3 |
| GQ-012 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6664 | 3 |
| GQ-013 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6668 | 2 |
| GQ-014 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12472 | 3 |
| GQ-015 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7858 | 5 |
| GQ-016 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7453 | 4 |
| GQ-017 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12953 | 6 |
| GQ-018 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9235 | 4 |
| GQ-019 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11948 | 4 |
| GQ-020 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9345 | 1 |
| GQ-021 | treatment_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 9355 | 3 |
| GQ-022 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11718 | 3 |
| GQ-023 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7880 | 4 |
| GQ-024 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8224 | 3 |
| GQ-025 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6098 | 1 |
| GQ-026 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10143 | 3 |
| GQ-027 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7291 | 2 |
| GQ-028 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7614 | 5 |
| GQ-029 | navigation | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 12083 | 6 |
| GQ-030 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10352 | 3 |
| GQ-031 | service_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 10046 | 2 |
| GQ-032 | service_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 12337 | 5 |
| GQ-033 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10677 | 4 |
| GQ-034 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8983 | 2 |
| GQ-035 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9104 | 3 |
| GQ-036 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8879 | 3 |
| GQ-037 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11538 | 8 |
| GQ-038 | condition_department | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 10562 | 4 |
| GQ-039 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11751 | 4 |
| GQ-040 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7701 | 1 |
| GQ-041 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12900 | 2 |
| GQ-042 | doctor_department | PASS | 1.00 | 0.69 | 0.50 | — | — | — | — | 16141 | 3 |
| GQ-043 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7190 | 2 |
| GQ-044 | service_info | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 8856 | 2 |
| GQ-045 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7114 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2626 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2427 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 3018 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 8694 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2374 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 10235 | 4 |
| GQ-052 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7459 | 2 |
| GQ-053 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12307 | 4 |
| GQ-054 | compound_word | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 8604 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9575 | 3 |
| GQ-056 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9852 | 13 |
| GQ-057 | multilingual | PASS | 1.00 | 0.24 | 0.20 | — | — | — | — | 10210 | 10 |
| GQ-058 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10456 | 5 |
| GQ-059 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10162 | 5 |
| GQ-060 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6033 | 1 |
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8798 | 2 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11608 | 6 |
| GQ-063 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7576 | 1 |
| GQ-064 | followup_chain | PASS | 1.00 | 1.57 | 1.00 | — | — | — | — | 10926 | 4 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7597 | 4 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20169 | 10 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15268 | 3 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9411 | 5 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9150 | 8 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | — | — | 7152 | 0 |
| GQ-071 | ambiguous_symptom | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 23989 | 6 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16056 | 4 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11470 | 1 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17449 | 3 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8034 | 2 |
| GQ-076 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8199 | 1 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11513 | 3 |
| GQ-078 | entity_disambiguation | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 14025 | 4 |
| GQ-079 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 3936 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2475 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 53 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 54 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2529 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2706 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 9700 | 0 |
| GQ-086 | out_of_scope | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6870 | 1 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10361 | 4 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18124 | 5 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 8903 | 4 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8834 | 4 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13111 | 5 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22227 | 4 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 14507 | 4 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12707 | 3 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 12320 | 4 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9055 | 5 |
| GQ-097 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18790 | 3 |
| GQ-098 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 23470 | 7 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 11621 | 3 |
| GQ-100 | multi_hop_graph | PASS | 0.75 | 0.00 | 0.00 | — | — | — | — | 15144 | 2 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 23063 | 6 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10092 | 4 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8221 | 2 |
| GQ-104 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17078 | 6 |
| GQ-105 | condition_department | PASS | 1.00 | — | — | — | — | — | — | 9409 | 0 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17756 | 6 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15654 | 9 |
| GQ-108 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13874 | 4 |
| GQ-109 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9136 | 4 |
| GQ-110 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7710 | 2 |
| GQ-111 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10208 | 1 |
| GQ-112 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 14436 | 9 |
| GQ-113 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7775 | 6 |
| GQ-114 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9855 | 4 |
| GQ-115 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13565 | 4 |
| GQ-116 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7044 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8093 | 1 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15902 | 8 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10384 | 3 |
| GQ-120 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10600 | 2 |
| GQ-121 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9854 | 2 |
| GQ-122 | condition_department | PASS | 1.00 | — | — | — | — | — | — | 9757 | 0 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9092 | 3 |
| GQ-124 | condition_department | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 7994 | 3 |
| GQ-125 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 11194 | 4 |
| GQ-126 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9840 | 6 |
| GQ-127 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8072 | 2 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9645 | 3 |
| GQ-129 | entity_disambiguation | PASS | 0.75 | 0.00 | 0.00 | — | — | — | — | 10865 | 2 |
| GQ-130 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8341 | 3 |
| GQ-131 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 8554 | 1 |
| GQ-132 | entity_disambiguation | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 13093 | 5 |
| GQ-133 | condition_department | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 9811 | 3 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13039 | 3 |
| GQ-135 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9506 | 1 |
| GQ-136 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17958 | 4 |
| GQ-137 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16980 | 1 |
| GQ-138 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 7965 | 4 |
| GQ-139 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9265 | 1 |
| GQ-140 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 6968 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 10982 | 3 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13833 | 1 |
| GQ-143 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 10563 | 7 |
| GQ-144 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 21304 | 1 |
| GQ-145 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2827 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 9808 | 1 |
Generated by run_evaluation.py at 2026-02-19 13:15 UTC.