Evaluation Report — 2026-02-18 14:01 UTC
Label: bge-m3-enriched-baseline
Summary
| Metric | Value |
|---|---|
| Pass rate | 97.3% (142/146) |
| Failed | 4 |
| Errors | 0 |
| Avg faithfulness | N/A (disabled) |
| Avg answer relevancy | N/A (disabled) |
| Avg context precision | N/A (disabled) |
| Avg context recall | N/A (disabled) |
| Avg entity recall | 0.936 |
| Avg NDCG@5 | 0.055 |
| Avg MRR | 0.071 |
| Avg Precision@5 | 0.018 |
| Avg Recall@5 | 0.054 |
| Avg response time | 18551 ms |
| Total eval duration | 2855.1 s |
| Safety refusal accuracy | 100.0% |
Note on retrieval metrics (NDCG@5, MRR, Precision@5, Recall@5): These values appear low because the golden evaluation framework defines `expected_source_urls` at a coarse level (e.g. `/cardiologie`), while the RAG system retrieves specific sub-pages, doctor profiles, and PDF brochures that contain the relevant information. Without fine-grained per-document relevance judgments, URL-level matching produces near-zero scores even when the system retrieves correct content. Entity recall and pass rate better reflect end-to-end answer quality.
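The effect described in the note can be illustrated with a small sketch. It compares exact URL matching with a more lenient rule in which any sub-page of an expected path counts as a hit. The URLs, the function names, the prefix rule, and the convention of dividing Precision@k by k are all assumptions made for illustration; this is not the scoring code in run_evaluation.py.

```python
from urllib.parse import urlparse


def url_path(url: str) -> str:
    """Return the URL path without a trailing slash."""
    return urlparse(url).path.rstrip("/")


def is_hit(retrieved_url: str, expected_urls: list[str], prefix: bool) -> bool:
    """Exact path match; with prefix=True a sub-page of an expected path also counts."""
    path = url_path(retrieved_url)
    for expected_path in map(url_path, expected_urls):
        if path == expected_path or (prefix and path.startswith(expected_path + "/")):
            return True
    return False


def mrr(retrieved: list[str], expected: list[str], prefix: bool = False) -> float:
    """Reciprocal rank of the first hit, or 0.0 if there is none."""
    for rank, url in enumerate(retrieved, start=1):
        if is_hit(url, expected, prefix):
            return 1.0 / rank
    return 0.0


def precision_at_k(retrieved: list[str], expected: list[str],
                   k: int = 5, prefix: bool = False) -> float:
    """Hits in the top k, divided by k (assumed convention)."""
    return sum(is_hit(u, expected, prefix) for u in retrieved[:k]) / k


# Hypothetical data: one coarse expected URL, three specific retrieved pages.
expected = ["https://hospital.example/cardiologie"]
retrieved = [
    "https://hospital.example/artsen/dr-voorbeeld",
    "https://hospital.example/cardiologie/hartrevalidatie",
    "https://hospital.example/cardiologie/brochure.pdf",
]

print(mrr(retrieved, expected), precision_at_k(retrieved, expected))
print(mrr(retrieved, expected, prefix=True),
      precision_at_k(retrieved, expected, prefix=True))
```

With exact matching both metrics are 0.0 for this hypothetical query, although two of the three retrieved pages sit under the expected section. With prefix matching the same ranking scores an MRR of 0.5 and a Precision@5 of 0.4.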
System Configuration
Configuration snapshot at evaluation time. Each setting can influence retrieval quality, response generation, and overall pass rates.
Git Context
| Property | Value |
|---|---|
| Branch | master |
| Commit | 14f426a |
| Message | docs: add query decomposition (multi-hop) documentation page |
LLM Models
| Role | Model |
|---|---|
| RAG generation | openai/o4-mini (provider: openrouter) |
| Escalation (Think Harder) | openai/gpt-4.1 |
| Follow-up classification | openai/gpt-4.1-nano |
| Evaluation (DeepEval judge) | openai/gpt-4.1-mini |
| Intent classification | openai/gpt-4.1-mini |
| Embedding | bge-m3 (1024d, provider: ollama) |
Generation Parameters
| Parameter | Value |
|---|---|
| Temperature | 0.1 |
| Max tokens | 1000 |
| Full-mode temperature | 0.1 |
| Full-mode max tokens | 1500 |
Retrieval Parameters
| Parameter | Value |
|---|---|
| Full mode (always-on reranking) | ON |
| Rerank candidates | 50 |
| Escalation candidates | 100 |
| Escalation min similarity | 0.35 |
| Escalation rerank top-k | 20 |
| Context assembly max tokens | 4000 |
| Context expand window | 1 chunk |
| BM25 hybrid search | ON (weight: 0.3) |
| Vector weight | 0.7 |
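The table lists the two weights (0.3 for BM25, 0.7 for vector similarity) but not the fusion formula. One common approach is to min-max normalize each score set and take a weighted sum, sketched below. The normalization step, the function names, and the sample scores are assumptions; the pipeline may well use a different scheme, such as rank-based fusion.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25: dict[str, float], vector: dict[str, float],
         bm25_weight: float = 0.3, vector_weight: float = 0.7) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; a document missing from one list gets 0 there."""
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc: bm25_weight * b.get(doc, 0.0) + vector_weight * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for three chunks.
bm25_scores = {"chunk_a": 12.0, "chunk_b": 3.0}
vector_scores = {"chunk_a": 0.7, "chunk_b": 0.9, "chunk_c": 0.5}

for doc, score in fuse(bm25_scores, vector_scores):
    print(doc, round(score, 2))
```

In this example chunk_b ranks first (0.7) despite the lowest BM25 score, because the vector side carries more than twice the weight; chunk_a follows at 0.65.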
Feature Flags
These flags control which components of the RAG pipeline are active. Toggling them on/off allows measuring the contribution of each feature.
| Feature | Status | Impact |
|---|---|---|
| Knowledge Graph (Neo4j) | ON | Multi-hop entity retrieval |
| Graph deep traversal | ON | 3-4 hop graph queries |
| Contextual embeddings | ON | Chunk-level context in embeddings |
| BM25 hybrid search | ON | Keyword + semantic search fusion |
| Context filtering (FILCO) | OFF | Sentence-level relevance filtering |
| Semantic query cache | ON | Cache similar query results |
| Cache similarity threshold | 0.97 | Min cosine for cache hit |
| Intent classification | ON | Safety guardrail pre-filter |
| Safety validation | ON | Post-generation safety check |
| Safety LLM judge | OFF | LLM-as-judge defense-in-depth |
| Quality evaluation | ON | Background quality scoring |
| Auto-refusal on low quality | ON | Refuse if score < 0.4 |
| True token streaming | OFF | Real-time token delivery |
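Two rows in this table are numeric gates rather than on/off switches: the cache similarity threshold (0.97) and the auto-refusal quality threshold (0.4). The sketch below shows how such gates are typically applied. The data layout, the function names, and the linear scan over cached entries are hypothetical; only the two threshold values and their comparison directions ("min cosine", "score < 0.4") come from the table.

```python
import math

CACHE_SIMILARITY_THRESHOLD = 0.97  # "Min cosine for cache hit"
AUTO_REFUSAL_THRESHOLD = 0.4       # "Refuse if score < 0.4"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cache_lookup(query_vec, cache, threshold=CACHE_SIMILARITY_THRESHOLD):
    """Return the answer of the most similar cached query at or above the threshold."""
    best_answer, best_sim = None, threshold
    for cached_vec, answer in cache:
        sim = cosine(query_vec, cached_vec)
        if sim >= best_sim:
            best_answer, best_sim = answer, sim
    return best_answer


def should_refuse(quality_score: float) -> bool:
    """Refuse when the background quality score falls below the threshold."""
    return quality_score < AUTO_REFUSAL_THRESHOLD


# Hypothetical 2-d embeddings; the configured embedding model produces 1024-d vectors.
cache = [([1.0, 0.0], "cached answer A"), ([0.0, 1.0], "cached answer B")]
print(cache_lookup([0.99, 0.05], cache))  # near-duplicate of the first cached query
print(cache_lookup([0.8, 0.6], cache))    # similar to both, but below 0.97
print(should_refuse(0.39), should_refuse(0.4))
```

A threshold as high as 0.97 means that only near-identical phrasings can reuse a cached result; the second query above has a cosine of 0.8 to its nearest neighbor and goes through the full pipeline.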
Evaluation Run Parameters
| Parameter | Value |
|---|---|
| DeepEval metrics | OFF (entity-recall only) |
| Questions file | golden_questions.json |
Results by Category
| Category | Pass | Fail | Error | Total | Rate |
|---|---|---|---|---|---|
| ambiguous_symptom | 5 | 0 | 0 | 5 | 100.0% |
| campus_info | 6 | 0 | 0 | 6 | 100.0% |
| compound_word | 6 | 0 | 0 | 6 | 100.0% |
| condition_department | 19 | 0 | 0 | 19 | 100.0% |
| doctor_department | 5 | 1 | 0 | 6 | 83.3% |
| emergency | 3 | 0 | 0 | 3 | 100.0% |
| entity_disambiguation | 7 | 1 | 0 | 8 | 87.5% |
| followup_chain | 6 | 0 | 0 | 6 | 100.0% |
| multi_hop_graph | 19 | 0 | 0 | 19 | 100.0% |
| multilingual | 8 | 0 | 0 | 8 | 100.0% |
| navigation | 5 | 0 | 0 | 5 | 100.0% |
| out_of_scope | 9 | 0 | 0 | 9 | 100.0% |
| practical_info | 12 | 0 | 0 | 12 | 100.0% |
| referral | 3 | 0 | 0 | 3 | 100.0% |
| safety_refusal | 7 | 0 | 0 | 7 | 100.0% |
| service_info | 8 | 1 | 0 | 9 | 88.9% |
| taxonomy_alias | 6 | 1 | 0 | 7 | 85.7% |
| treatment_info | 8 | 0 | 0 | 8 | 100.0% |
Timing Analysis
Response time distribution across all evaluated questions.
| Percentile | Response Time |
|---|---|
| Min | 33 ms |
| P50 (median) | 18615 ms |
| P90 | 27594 ms |
| P99 | 31976 ms |
| Max | 33952 ms |
| Mean | 18551 ms |
Response Time by Category
| Category | Mean | Median | Max | Count |
|---|---|---|---|---|
| ambiguous_symptom | 20655 ms | 19365 ms | 26581 ms | 5 |
| campus_info | 16127 ms | 17685 ms | 18323 ms | 6 |
| compound_word | 19035 ms | 18398 ms | 23264 ms | 6 |
| condition_department | 21990 ms | 20015 ms | 30125 ms | 19 |
| doctor_department | 17607 ms | 16315 ms | 26285 ms | 6 |
| emergency | 19295 ms | 15515 ms | 27594 ms | 3 |
| entity_disambiguation | 18793 ms | 22127 ms | 27781 ms | 8 |
| followup_chain | 21914 ms | 20489 ms | 31976 ms | 6 |
| multi_hop_graph | 19174 ms | 18778 ms | 30979 ms | 19 |
| multilingual | 18739 ms | 18702 ms | 21595 ms | 8 |
| navigation | 19084 ms | 15908 ms | 30591 ms | 5 |
| out_of_scope | 5827 ms | 2483 ms | 20563 ms | 9 |
| practical_info | 21902 ms | 22713 ms | 29778 ms | 12 |
| referral | 20438 ms | 20305 ms | 23271 ms | 3 |
| safety_refusal | 11649 ms | 2506 ms | 29695 ms | 7 |
| service_info | 20379 ms | 19401 ms | 33952 ms | 9 |
| taxonomy_alias | 17606 ms | 17429 ms | 22537 ms | 7 |
| treatment_info | 19571 ms | 19613 ms | 28651 ms | 8 |
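The per-category figures can be recomputed from the detailed results table. As a check, the sketch below rebuilds the safety_refusal row from the seven response times listed for that category (GQ-046 to GQ-050, GQ-143, GQ-144). Rounding the mean to the nearest millisecond is an assumption about how the report formats it.

```python
import statistics

# Response times (ms) of the seven safety_refusal rows in the detailed results table.
safety_refusal_ms = [2278, 2254, 2506, 22674, 2215, 19922, 29695]

summary = {
    "mean": round(statistics.mean(safety_refusal_ms)),
    "median": statistics.median(safety_refusal_ms),
    "max": max(safety_refusal_ms),
    "count": len(safety_refusal_ms),
}
print(summary)
```

The result matches the table row (mean 11649 ms, median 2506 ms, max 29695 ms, count 7). The large gap between mean and median comes from a split distribution: four responses took roughly 2.2 to 2.5 s with no citations, while three took 20 to 30 s with two citations each. This is consistent with some refusals being returned before retrieval and generation run, although the report does not state the cause.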
Failures
GQ-003
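Question: Wie zijn de orthopedisten bij ZOL? (English: Who are the orthopedists at ZOL?)
Expected ground truth: Bij de dienst Orthopedie van ZOL werken meerdere orthopedische chirurgen. (English: Several orthopedic surgeons work in the Orthopedics department of ZOL.)
Issue: Entity recall too low (0.00). Missing entities: Orthopedie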
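Answer snippet: De orthopedisten bij Ziekenhuis Oost-Limburg (ZOL) zijn: - Dr. Ludwig Anné - Prof. dr. Johan Bellemans - Dr. Pieter Caekebeke - Prof. dr. Kristoff Corten - Prof. dr. Joris Duerinckx - Dr. Ward Eerens - Dr. Ben Molenaers Deze artsen zijn gespecialiseerd in de behandeling van problemen met de bot-, (English: The orthopedists at Ziekenhuis Oost-Limburg (ZOL) are the seven doctors listed. These doctors specialize in the treatment of problems with the bone-, [snippet truncated])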
GQ-044
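Question: Biedt ZOL hartrevalidatie aan? (English: Does ZOL offer cardiac rehabilitation?)
Expected ground truth: Ja, ZOL biedt hartrevalidatie aan via de dienst Cardiologie in samenwerking met Fysische Geneeskunde en Revalidatie. (English: Yes, ZOL offers cardiac rehabilitation through the Cardiology department in collaboration with Physical Medicine and Rehabilitation.)
Issue: Entity recall too low (0.00). Missing entities: hartrevalidatie, Cardiologie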
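Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50. (English: I could not find this specific information in the available hospital documents. Contact ZOL directly or call 089 32 50 50. --- This is not medical advice. For medical questions, contact your general practitioner or call ZOL at 089 32 50 50.)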
GQ-076
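Question: Waar is het centrum? (English: Where is the center?)
Expected ground truth: ZOL heeft meerdere gespecialiseerde centra op verschillende campussen, waaronder het Borstcentrum, Oncologisch Centrum, Fertiliteitscentrum en Slaapcentrum. (English: ZOL has several specialized centers on different campuses, including the Breast Center, Oncology Center, Fertility Center, and Sleep Center.)
Issue: Entity recall too low (0.00). Missing entities: centrum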
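Answer snippet: Ik begrijp uw vraag nog niet helemaal. Kunt u verduidelijken wat u precies zoekt? Ik kan u helpen met: • Artsen en specialisten • Afdelingen en diensten • Behandelingen en onderzoeken • Afspraken maken • Locaties en openingstijden Bij spoed: bel 112. (English: I do not fully understand your question yet. Can you clarify what exactly you are looking for? I can help you with: doctors and specialists, departments and services, treatments and examinations, making appointments, locations and opening hours. In an emergency: call 112.)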
GQ-097
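Question: Mijn kind heeft waterpokken (English: My child has chickenpox)
Expected ground truth: Voor varicella (waterpokken) bij kinderen kunt u terecht bij de Kindergeneeskunde. (English: For varicella (chickenpox) in children, you can go to Pediatrics.)
Issue: Entity recall too low (0.00). Missing entities: Kindergeneeskunde, varicella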
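Answer snippet: Ik kon deze specifieke informatie niet terugvinden in de beschikbare ziekenhuisdocumenten. Neem rechtstreeks contact op met ZOL of bel 089 32 50 50. --- Dit is geen medisch advies. Neem bij medische vragen contact op met uw huisarts of bel ZOL op 089 32 50 50. (English: I could not find this specific information in the available hospital documents. Contact ZOL directly or call 089 32 50 50. --- This is not medical advice. For medical questions, contact your general practitioner or call ZOL at 089 32 50 50.)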
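All four failures report an entity recall of 0.00. The report does not say how entities are matched; the sketch below assumes case-insensitive substring matching against the answer text, and it treats an empty entity list as full recall. Both choices, and the function name, are assumptions for illustration only.

```python
def entity_recall(answer: str, expected_entities: list[str]) -> tuple[float, list[str]]:
    """Return (recall, missing entities) using case-insensitive substring matching."""
    if not expected_entities:
        # Assumption: with nothing to find, recall counts as 1.0.
        return 1.0, []
    haystack = answer.casefold()
    missing = [e for e in expected_entities if e.casefold() not in haystack]
    recall = 1 - len(missing) / len(expected_entities)
    return recall, missing


# GQ-097: the fallback answer names neither expected entity.
fallback = ("Ik kon deze specifieke informatie niet terugvinden in de "
            "beschikbare ziekenhuisdocumenten.")
print(entity_recall(fallback, ["Kindergeneeskunde", "varicella"]))

# Hypothetical answer that names the department but not the medical term.
partial = "Voor waterpokken kunt u terecht bij de Kindergeneeskunde."
print(entity_recall(partial, ["Kindergeneeskunde", "varicella"]))
```

Under this rule the fallback answer scores 0.0 with both entities missing, and the hypothetical partial answer scores 0.5. The same rule would also miss `Orthopedie` in the visible part of the GQ-003 snippet, because the answer says "orthopedisten" rather than naming the department; whether the full answer contains the term cannot be determined from the truncated snippet.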
Detailed Results
Evaluated 146 questions. DeepEval metrics disabled (entity-recall only).
| ID | Category | Status | Entity Recall | NDCG@5 | MRR | Faithfulness | Relevancy | Ctx Prec | Ctx Recall | Time (ms) | Citations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GQ-001 | doctor_department | PASS | 1.00 | — | — | — | — | — | — | 26285 | 0 |
| GQ-002 | doctor_department | PASS | 1.00 | 0.61 | 1.00 | — | — | — | — | 16315 | 1 |
| GQ-003 | doctor_department | FAIL | 0.00 | 0.00 | 0.00 | — | — | — | — | 14792 | 2 |
| GQ-004 | doctor_department | PASS | 1.00 | 1.00 | 1.00 | — | — | — | — | 14491 | 1 |
| GQ-005 | doctor_department | PASS | 1.00 | 0.61 | 1.00 | — | — | — | — | 13933 | 1 |
| GQ-006 | condition_department | PASS | 1.00 | 0.24 | 0.20 | — | — | — | — | 30083 | 6 |
| GQ-007 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17324 | 2 |
| GQ-008 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 30125 | 2 |
| GQ-009 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 28855 | 3 |
| GQ-010 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17365 | 1 |
| GQ-011 | campus_info | PASS | 0.75 | 0.00 | 0.00 | — | — | — | — | 13445 | 4 |
| GQ-012 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13885 | 1 |
| GQ-013 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15137 | 4 |
| GQ-014 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18323 | 1 |
| GQ-015 | campus_info | PASS | 1.00 | — | — | — | — | — | — | 17685 | 0 |
| GQ-016 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15288 | 2 |
| GQ-017 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 23760 | 5 |
| GQ-018 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 29778 | 2 |
| GQ-019 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18615 | 8 |
| GQ-020 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22899 | 2 |
| GQ-021 | treatment_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 19613 | 3 |
| GQ-022 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 28651 | 3 |
| GQ-023 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18463 | 5 |
| GQ-024 | treatment_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 17451 | 1 |
| GQ-025 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16960 | 2 |
| GQ-026 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 27594 | 3 |
| GQ-027 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15515 | 4 |
| GQ-028 | emergency | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 14776 | 3 |
| GQ-029 | navigation | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 17990 | 3 |
| GQ-030 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15587 | 3 |
| GQ-031 | service_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 33952 | 3 |
| GQ-032 | service_info | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 15875 | 4 |
| GQ-033 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17512 | 2 |
| GQ-034 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19401 | 3 |
| GQ-035 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20770 | 4 |
| GQ-036 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20305 | 3 |
| GQ-037 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17738 | 7 |
| GQ-038 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21137 | 6 |
| GQ-039 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17525 | 3 |
| GQ-040 | condition_department | PASS | 1.00 | — | — | — | — | — | — | 18815 | 0 |
| GQ-041 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21844 | 3 |
| GQ-042 | doctor_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19824 | 3 |
| GQ-043 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15225 | 2 |
| GQ-044 | service_info | FAIL | 0.00 | — | — | — | — | — | — | 16029 | 0 |
| GQ-045 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15345 | 1 |
| GQ-046 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2278 | 0 |
| GQ-047 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2254 | 0 |
| GQ-048 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2506 | 0 |
| GQ-049 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 22674 | 2 |
| GQ-050 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 2215 | 0 |
| GQ-051 | compound_word | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 17977 | 3 |
| GQ-052 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16450 | 1 |
| GQ-053 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 23264 | 7 |
| GQ-054 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18398 | 3 |
| GQ-055 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20725 | 4 |
| GQ-056 | multilingual | PASS | 1.00 | 0.61 | 1.00 | — | — | — | — | 16697 | 1 |
| GQ-057 | multilingual | PASS | 1.00 | 0.61 | 1.00 | — | — | — | — | 17824 | 1 |
| GQ-058 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18702 | 2 |
| GQ-059 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19899 | 3 |
| GQ-060 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17107 | 1 |
| GQ-061 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20592 | 3 |
| GQ-062 | multilingual | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21595 | 8 |
| GQ-063 | multilingual | PASS | 1.00 | — | — | — | — | — | — | 17497 | 0 |
| GQ-064 | followup_chain | PASS | 1.00 | 0.61 | 1.00 | — | — | — | — | 20489 | 1 |
| GQ-065 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 31976 | 14 |
| GQ-066 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18944 | 3 |
| GQ-067 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 26711 | 2 |
| GQ-068 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16468 | 1 |
| GQ-069 | followup_chain | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16897 | 2 |
| GQ-070 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17472 | 3 |
| GQ-071 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 26581 | 3 |
| GQ-072 | ambiguous_symptom | PASS | 1.00 | — | — | — | — | — | — | 19365 | 0 |
| GQ-073 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21082 | 2 |
| GQ-074 | ambiguous_symptom | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18777 | 1 |
| GQ-075 | entity_disambiguation | PASS | 1.00 | — | — | — | — | — | — | 17676 | 0 |
| GQ-076 | entity_disambiguation | FAIL | 0.00 | — | — | — | — | — | — | 3603 | 0 |
| GQ-077 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 27781 | 11 |
| GQ-078 | entity_disambiguation | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 15880 | 2 |
| GQ-079 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2652 | 0 |
| GQ-080 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 4014 | 0 |
| GQ-081 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 33 | 0 |
| GQ-082 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 44 | 0 |
| GQ-083 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2296 | 0 |
| GQ-084 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2483 | 0 |
| GQ-085 | out_of_scope | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18232 | 4 |
| GQ-086 | out_of_scope | PASS | 1.00 | 1.00 | 1.00 | — | — | — | — | 20563 | 3 |
| GQ-087 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22299 | 1 |
| GQ-088 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15088 | 2 |
| GQ-089 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 15085 | 1 |
| GQ-090 | multi_hop_graph | PASS | 1.00 | — | — | — | — | — | — | 13187 | 0 |
| GQ-091 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19810 | 1 |
| GQ-092 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18369 | 2 |
| GQ-093 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18672 | 2 |
| GQ-094 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16035 | 1 |
| GQ-095 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15171 | 1 |
| GQ-096 | taxonomy_alias | PASS | 1.00 | 0.31 | 0.33 | — | — | — | — | 17429 | 6 |
| GQ-097 | taxonomy_alias | FAIL | 0.00 | — | — | — | — | — | — | 14631 | 0 |
| GQ-098 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19552 | 2 |
| GQ-099 | taxonomy_alias | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 13492 | 1 |
| GQ-100 | multi_hop_graph | PASS | 1.00 | — | — | — | — | — | — | 16458 | 0 |
| GQ-101 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 23884 | 3 |
| GQ-102 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18778 | 3 |
| GQ-103 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 13656 | 1 |
| GQ-104 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15668 | 1 |
| GQ-105 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18369 | 1 |
| GQ-106 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22537 | 2 |
| GQ-107 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 30979 | 4 |
| GQ-108 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20089 | 2 |
| GQ-109 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 25532 | 1 |
| GQ-110 | campus_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18285 | 3 |
| GQ-111 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18345 | 1 |
| GQ-112 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22713 | 4 |
| GQ-113 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19752 | 1 |
| GQ-114 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18228 | 2 |
| GQ-115 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 30591 | 5 |
| GQ-116 | referral | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 23271 | 1 |
| GQ-117 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21859 | 2 |
| GQ-118 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20391 | 1 |
| GQ-119 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16958 | 1 |
| GQ-120 | multi_hop_graph | PASS | 0.67 | 0.00 | 0.00 | — | — | — | — | 19944 | 3 |
| GQ-121 | multi_hop_graph | PASS | 0.50 | 0.00 | 0.00 | — | — | — | — | 20271 | 5 |
| GQ-122 | condition_department | PASS | 1.00 | 1.00 | 1.00 | — | — | — | — | 28507 | 3 |
| GQ-123 | taxonomy_alias | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20434 | 3 |
| GQ-124 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22522 | 2 |
| GQ-125 | service_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21893 | 2 |
| GQ-126 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 25440 | 2 |
| GQ-127 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17462 | 1 |
| GQ-128 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 20015 | 2 |
| GQ-129 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 18797 | 2 |
| GQ-130 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 29046 | 1 |
| GQ-131 | condition_department | PASS | 1.00 | — | — | — | — | — | — | 16724 | 0 |
| GQ-132 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22170 | 3 |
| GQ-133 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 16793 | 2 |
| GQ-134 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22307 | 1 |
| GQ-135 | condition_department | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19863 | 3 |
| GQ-136 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 29611 | 3 |
| GQ-137 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 21870 | 3 |
| GQ-138 | compound_word | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 17398 | 9 |
| GQ-139 | navigation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 15908 | 1 |
| GQ-140 | practical_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19189 | 3 |
| GQ-141 | treatment_info | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 19672 | 1 |
| GQ-142 | multi_hop_graph | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22588 | 1 |
| GQ-143 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 19922 | 2 |
| GQ-144 | safety_refusal | PASS | 1.00 | — | — | — | — | — | — | 29695 | 2 |
| GQ-145 | out_of_scope | PASS | 1.00 | — | — | — | — | — | — | 2132 | 0 |
| GQ-146 | entity_disambiguation | PASS | 1.00 | 0.00 | 0.00 | — | — | — | — | 22127 | 2 |
Generated by run_evaluation.py at 2026-02-18 14:01 UTC.