ADR-0018: AI URL Category & Value Assessment
Date: 2026-02-10 | Status: Accepted
Context
After crawl discovery, URLs are categorized by regex patterns. This categorization impacts RAG search quality (Lewis et al., 2020) — the search service applies a 20% relevance boost when a document's category matches the user's intent. Mis-categorized URLs degrade search results.
The regex approach misses pages with ambiguous paths and pages whose content doesn't match their URL structure. It also cannot distinguish high-value patient-facing content from low-value organizational content (annual reports, job postings, supplier pages).
Decision
Adopt a two-dimensional AI assessment for crawled URLs:
- Category: What type of content? (12 predefined categories, plus AI-suggested new ones)
- Value: How relevant for patient-facing search? (high/medium/low/skip)
Value Levels
| Value | Meaning | Search Impact |
|---|---|---|
| high | Patient-facing (departments, doctors, conditions) | Normal ranking |
| medium | Supporting content (news, brochures) | Normal ranking |
| low | Organizational (reports, research, job postings) | 15% ranking penalty |
| skip | Not useful (video embeds, floor plans) | Not ingested |
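The ranking effects above can be sketched as a small scoring helper. This is a minimal illustration, not the code in search_service.py: the function name `adjust_score`, its parameters, and the choice to combine the 20% category boost and the 15% low-value penalty multiplicatively are all assumptions, since the ADR does not state how the two adjustments interact.

```python
from typing import Optional

CATEGORY_MATCH_BOOST = 1.20  # 20% boost when category matches user intent (Context)
LOW_VALUE_PENALTY = 0.85     # 15% penalty for "low" value (Value Levels table)


def adjust_score(
    base_score: float,
    doc_category: str,
    intent_category: Optional[str],
    value: Optional[str],
) -> float:
    """Return the relevance score after category and value adjustments.

    Hypothetical helper: assumes both adjustments are multiplicative and
    that "skip" documents never reach this point because they are not ingested.
    """
    score = base_score
    if intent_category is not None and doc_category == intent_category:
        score *= CATEGORY_MATCH_BOOST
    if value == "low":
        score *= LOW_VALUE_PENALTY
    # "high", "medium", and unassessed documents keep normal ranking.
    return score
```

Under these assumptions, a low-value document that still matches the user's intent receives both adjustments, so its score ends up slightly above the unadjusted baseline.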
How It Works
Crawler discovers URLs → Regex assigns initial category
↓
Admin triggers "AI Review" on CrawlDashboard
↓
Tier 3 (flagship) model classifies URLs (batch of 20, temperature=0)
↓
Category mismatches shown for admin review
↓
Admin accepts/rejects → "Apply Changes"
↓
Categories updated, skip URLs marked as ignored
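The flow above can be sketched in a few functions. This is an illustrative outline under stated assumptions, not the contents of url_assessment_service.py: `CrawledUrl`, `batched`, `review`, and `apply_changes` are hypothetical names, the `classify` callable stands in for the LLM call, and the sketch assumes skip URLs are marked ignored when changes are applied regardless of whether the admin accepted a category change for them.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional

BATCH_SIZE = 20  # batch size stated in the flow above


@dataclass
class CrawledUrl:
    url: str
    category: str                      # initial category from the regex pass
    ignored: bool = False
    assessment: Optional[dict] = None  # filled in by the AI review


def batched(items: list, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def review(urls: list, classify: Callable[[list], list]) -> list:
    """Classify URLs batch by batch; return those whose category differs."""
    mismatches = []
    for batch in batched(urls):
        # `classify` stands in for the LLM call (temperature=0); it is assumed
        # to return one assessment dict per URL, in input order.
        for item, result in zip(batch, classify(batch)):
            item.assessment = result
            if result["suggested_category"] != item.category:
                mismatches.append(item)
    return mismatches


def apply_changes(urls: list, accepted: set) -> None:
    """Apply admin-accepted category changes and mark skip URLs as ignored."""
    for item in urls:
        if item.assessment is None:
            continue
        if item.assessment["value"] == "skip":
            item.ignored = True
        elif item.url in accepted:
            item.category = item.assessment["suggested_category"]
```

The list returned by `review` corresponds to the mismatches shown to the admin, and `accepted` holds the URLs whose suggested category the admin approved before pressing "Apply Changes".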
Assessment Data
Stored in url_metadata["ai_assessment"] (existing JSONB field — no migration needed):
{
  "suggested_category": "Department",
  "value": "high",
  "confidence": 0.92,
  "reason": "Patient-facing department page",
  "category_changed": true,
  "original_category": "Brochure"
}
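One way to keep this payload consistent before writing it to the JSONB field is a small validated dataclass. The sketch below is hypothetical: `AiAssessment` and `build_assessment` are illustrative names, the 0 to 1 confidence range is an assumption inferred from the example value, and it assumes `original_category` is always stored, even when the category did not change.

```python
from dataclasses import asdict, dataclass

ALLOWED_VALUES = {"high", "medium", "low", "skip"}  # from the Value Levels table


@dataclass(frozen=True)
class AiAssessment:
    suggested_category: str
    value: str
    confidence: float
    reason: str
    category_changed: bool
    original_category: str

    def __post_init__(self) -> None:
        if self.value not in ALLOWED_VALUES:
            raise ValueError(f"unknown value level: {self.value!r}")
        # Assumption: confidence is a probability-like score in [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def build_assessment(
    original_category: str,
    suggested_category: str,
    value: str,
    confidence: float,
    reason: str,
) -> dict:
    """Build the dict to store under url_metadata["ai_assessment"]."""
    assessment = AiAssessment(
        suggested_category=suggested_category,
        value=value,
        confidence=confidence,
        reason=reason,
        # Derived rather than passed in, so it cannot disagree with the categories.
        category_changed=suggested_category != original_category,
        original_category=original_category,
    )
    return asdict(assessment)
```

Deriving `category_changed` from the two category fields avoids storing a flag that contradicts them.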
Consequences
Positive: Better category accuracy (LLM considers URL + title), search quality uplift (low-value dampening), skip filtering (excludes noise from ingestion), admin QA workflow.
Negative: One-time LLM cost (~$0.15 per run), manual trigger required, may need re-assessment after major site changes.
Key Files
| Component | File |
|---|---|
| Assessment service | backend/app/services/url_assessment_service.py |
| API endpoints | backend/app/api/crawl.py |
| Search dampening | backend/app/services/search_service.py |
| Frontend panel | frontend/src/components/Documents/AssessmentPanel.tsx |
| Dashboard integration | frontend/src/components/Documents/CrawlDashboard.tsx |