ADR-0018: AI URL Category & Value Assessment

Date: 2026-02-10 | Status: Accepted

Context

After crawl discovery, URLs are categorized by regex patterns. This categorization affects RAG search quality (Lewis et al., 2020): the search service applies a 20% relevance boost when a document's category matches the user's intent, so mis-categorized URLs degrade search results.

The regex approach misses pages with ambiguous paths and pages whose content doesn't match their URL structure. It also cannot distinguish high-value patient-facing content from low-value organizational content (annual reports, job postings, supplier pages).
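The limitation can be illustrated with a minimal sketch of first-match regex categorization. The patterns, the `Doctor` and `News` category names, the `Uncategorized` fallback, and the example URLs are all assumptions made for illustration; the actual rule set is not shown in this ADR.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for illustration only; not the project's real rules.
CATEGORY_PATTERNS = [
    ("Department", re.compile(r"^/departments?/")),
    ("Doctor", re.compile(r"^/doctors?/")),
    ("News", re.compile(r"^/news/")),
]


def categorize_by_regex(url: str, default: str = "Uncategorized") -> str:
    """Return the first category whose pattern matches the URL path."""
    path = urlparse(url).path
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(path):
            return category
    return default


# A path that follows the expected structure is categorized correctly.
print(categorize_by_regex("https://example.org/departments/cardiology"))

# An ambiguous path matches no pattern and falls through to the default.
print(categorize_by_regex("https://example.org/about/annual-report-2025"))

# A patient-facing page published under /news/ is labeled by its path,
# not by its content.
print(categorize_by_regex("https://example.org/news/heart-clinic-opening-hours"))
```

In this sketch the path is the only signal, so neither the fallthrough case nor the path-versus-content case can be resolved, and nothing in the output says how valuable a page is to patients.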

Decision

Adopt a two-dimensional AI assessment for crawled URLs:

  1. Category — What type of content? (12 predefined + AI-suggested new categories)
  2. Value — How relevant for patient-facing search? (high / medium / low / skip)

Value Levels

| Value | Meaning | Search Impact |
| --- | --- | --- |
| high | Patient-facing (departments, doctors, conditions) | Normal ranking |
| medium | Supporting content (news, brochures) | Normal ranking |
| low | Organizational (reports, research, job postings) | 15% ranking penalty |
| skip | Not useful (video embeds, floor plans) | Not ingested |
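One way the value levels could feed into ranking is sketched below, together with the 20% category-match boost from the Context section. The function name `adjust_score`, the multiplicative combination of boost and penalty, and the use of `None` for skipped documents are assumptions; the real logic in `search_service.py` may differ.

```python
from typing import Optional

CATEGORY_MATCH_BOOST = 1.20  # 20% boost when category matches user intent
LOW_VALUE_PENALTY = 0.85     # 15% penalty for value == "low"


def adjust_score(
    base_score: float,
    doc_category: str,
    intent_category: Optional[str],
    value: str,
) -> Optional[float]:
    """Apply the category boost and value dampening to a relevance score.

    Returns None for "skip" documents. In practice these are never
    ingested, so they should not reach ranking at all.
    """
    if value == "skip":
        return None
    score = base_score
    if intent_category is not None and doc_category == intent_category:
        score *= CATEGORY_MATCH_BOOST
    if value == "low":
        # Assumption: the penalty multiplies with the boost.
        score *= LOW_VALUE_PENALTY
    return score


# "high" and "medium" leave the ranking unchanged apart from the boost.
print(adjust_score(0.50, "Department", "Department", "high"))
print(adjust_score(0.50, "Report", "Department", "low"))
```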

How It Works

  1. Crawler discovers URLs; regex assigns an initial category.
  2. Admin triggers "AI Review" on CrawlDashboard.
  3. Tier 3 (flagship) model classifies URLs (batches of 20, temperature=0).
  4. Category mismatches are shown for admin review.
  5. Admin accepts or rejects each suggestion, then selects "Apply Changes".
  6. Categories are updated; skip URLs are marked as ignored.
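The batching and review steps of this flow can be sketched as follows. This is an illustration under stated assumptions, not the code in `url_assessment_service.py`: the function names, the `classify_batch` callable standing in for the LLM call, and the dictionary shapes are all hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Set, Tuple

BATCH_SIZE = 20  # batch size stated in the flow above

# Hypothetical stand-in for the LLM call: maps a batch of URLs to
# suggested categories.
Classifier = Callable[[List[str]], Dict[str, str]]


def batched(urls: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def review_urls(
    current: Dict[str, str], classify_batch: Classifier
) -> Dict[str, Tuple[str, str]]:
    """Return {url: (original, suggested)} for every category mismatch."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for batch in batched(list(current)):
        suggestions = classify_batch(batch)
        for url in batch:
            suggested = suggestions.get(url)
            if suggested and suggested != current[url]:
                mismatches[url] = (current[url], suggested)
    return mismatches


def apply_changes(
    current: Dict[str, str],
    mismatches: Dict[str, Tuple[str, str]],
    accepted: Set[str],
) -> Dict[str, str]:
    """Apply only the suggestions the admin accepted."""
    updated = dict(current)
    for url in accepted:
        if url in mismatches:
            updated[url] = mismatches[url][1]
    return updated
```

Keeping the review step separate from the apply step mirrors the admin workflow: suggestions stay inert until they are explicitly accepted.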

Assessment Data

Stored in url_metadata["ai_assessment"] (existing JSONB field — no migration needed):

{
  "suggested_category": "Department",
  "value": "high",
  "confidence": 0.92,
  "reason": "Patient-facing department page",
  "category_changed": true,
  "original_category": "Brochure"
}
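A record of this shape could be built and attached to a URL's metadata roughly as follows. The helper names, the validation rules, and the assumption that `confidence` lies in the range 0 to 1 are illustrative and not taken from the codebase.

```python
from typing import Any, Dict, Optional

VALID_VALUES = {"high", "medium", "low", "skip"}


def build_assessment(
    original_category: str,
    suggested_category: str,
    value: str,
    confidence: float,
    reason: str,
) -> Dict[str, Any]:
    """Build an ai_assessment record matching the JSON shape above."""
    if value not in VALID_VALUES:
        raise ValueError(f"unknown value level: {value!r}")
    if not 0.0 <= confidence <= 1.0:  # assumed range
        raise ValueError("confidence must be between 0 and 1")
    return {
        "suggested_category": suggested_category,
        "value": value,
        "confidence": confidence,
        "reason": reason,
        "category_changed": suggested_category != original_category,
        "original_category": original_category,
    }


def store_assessment(
    url_metadata: Optional[Dict[str, Any]], assessment: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of url_metadata with the assessment attached."""
    merged = dict(url_metadata or {})
    merged["ai_assessment"] = assessment
    return merged


record = build_assessment(
    "Brochure", "Department", "high", 0.92, "Patient-facing department page"
)
metadata = store_assessment({"title": "Cardiology"}, record)
```

Because the record sits under a single key of an existing JSONB field, other metadata keys are left untouched and re-running the assessment simply overwrites the previous result.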

Consequences

Positive: Better category accuracy (LLM considers URL + title), search quality uplift (low-value dampening), skip filtering (excludes noise from ingestion), admin QA workflow.

Negative: One-time LLM cost (~$0.15 per run), manual trigger required, may need re-assessment after major site changes.

Key Files

| Component | File |
| --- | --- |
| Assessment service | backend/app/services/url_assessment_service.py |
| API endpoints | backend/app/api/crawl.py |
| Search dampening | backend/app/services/search_service.py |
| Frontend panel | frontend/src/components/Documents/AssessmentPanel.tsx |
| Dashboard integration | frontend/src/components/Documents/CrawlDashboard.tsx |