Data Seeding

After the application is running, seed it with SNOMED terminology, an initial content crawl, and taxonomy extraction via the Taxonomy Platform.

Checklist

Don't skip the post-ingest scripts (Step 6b and 6c)

After a full crawl + taxonomy extraction, Step 6b and 6c must be run manually — they are NOT part of the ingest pipeline. Skipping them leaves features silently broken on a fresh server: spoken-specialty resolution fails, and SNOMED domain guards have no mapping. Step 6a (schedule stamping) is no longer required on fresh installs — department_schedule_meta is now stamped automatically at ingest time. Run 6a only to re-stamp legacy docs or after a validation-logic change. See Step 6.

Step 1: SNOMED CT Import

SNOMED CT provides the medical terminology backbone -- synonym expansion, concept hierarchy, and body-site-to-department mapping.

From Your Mac (Recommended)

Use the push-snomed.sh script to upload and import in one step:

# From your local machine (Mac)
cd /path/to/zol-rag
./scripts/push-snomed.sh deploy@YOUR_SERVER_IP

This will:

Find ~/Downloads/SnomedCT_ManagedServiceBE_PRODUCTION_*
Compress (~2 GB → ~400 MB)
Upload to /opt/zol-rag/snomed/ on the server
Run the import inside the app container
Report timing and row counts

Preview without executing:

./scripts/push-snomed.sh deploy@YOUR_SERVER_IP --dry-run

Manual Import (If Already on Server)

If you've already copied the RF2 files to the server:

# Find the Snapshot directory
find /opt/zol-rag/snomed/ -type d -name Snapshot

# Run the import via docker exec
docker exec zol-app python -m scripts.import_snomed_rf2 \
  --rf2-dir /opt/zol-rag/snomed/SnomedCT_*/Snapshot

Expected Results

Table	Expected Rows
`snomed_concepts`	~350,000
`snomed_descriptions`	~500,000
`snomed_relationships`	~900,000
`snomed_transitive_closure`	~5,000,000

Import takes 5-15 minutes depending on server speed.
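
Once row counts are available (for example from the verification query in this step), they can be compared against the table above mechanically. The following is a minimal sketch, not part of the project: the function name `check_row_counts` and the 20% tolerance are assumptions chosen for illustration, and the expected values are the approximate figures from the table.

```python
# Approximate expected row counts, taken from the "Expected Results" table.
EXPECTED_ROWS = {
    "snomed_concepts": 350_000,
    "snomed_descriptions": 500_000,
    "snomed_relationships": 900_000,
    "snomed_transitive_closure": 5_000_000,
}


def check_row_counts(actual, expected=EXPECTED_ROWS, tolerance=0.20):
    """Return a list of human-readable problems; an empty list means all OK.

    `tolerance` is a hypothetical threshold (20% by default): the expected
    figures are approximate and may differ between SNOMED releases.
    """
    problems = []
    for table, want in expected.items():
        got = actual.get(table)
        if got is None:
            problems.append(f"{table}: no count reported")
        elif abs(got - want) > tolerance * want:
            problems.append(f"{table}: {got:,} rows, expected about {want:,}")
    return problems


if __name__ == "__main__":
    # Made-up counts for illustration: the closure table is suspiciously small.
    sample = dict(EXPECTED_ROWS, snomed_transitive_closure=12)
    for line in check_row_counts(sample):
        print(line)
```

An empty result only says the counts are in a plausible range; it does not validate the content of the tables.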

Verify SNOMED Import

docker exec zol-app python -c "
import os, psycopg2
url = os.environ.get('DATABASE_URL', '').replace('+asyncpg', '')
conn = psycopg2.connect(url)
cur = conn.cursor()
for t in ['snomed_concepts', 'snomed_descriptions', 'snomed_relationships', 'snomed_transitive_closure']:
    cur.execute(f'SELECT COUNT(*) FROM app.{t}')
    print(f'{t}: {cur.fetchone()[0]:,} rows')
conn.close()
"

Step 2: Create Admin User in Keycloak

Before seeding content, create an admin user in Keycloak. You can do this via the Keycloak admin console or via the Keycloak REST API.

Via Keycloak Admin Console (Recommended)

Open an SSH tunnel: ssh -L 8080:127.0.0.1:8080 deploy@YOUR_SERVER_IP
Navigate to http://localhost:8080/admin
Log in with the Keycloak admin credentials (from .env.prod)
Select the "zol" realm
Create a new user (see User Management for detailed steps)
Assign the admin realm role

Via Keycloak REST API

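# Get a Keycloak admin token.
# localhost:8080 assumes the SSH tunnel from the console steps (or run this on
# the server); KEYCLOAK_ADMIN_PASSWORD must be set in the current shell.
KC_ADMIN_TOKEN=$(curl -s -X POST \
  "http://localhost:8080/realms/master/protocol/openid-connect/token" \
  -d "grant_type=password" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  --data-urlencode "password=${KEYCLOAK_ADMIN_PASSWORD}" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")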

# Create the admin user
curl -s -X POST \
  "http://localhost:8080/admin/realms/zol/users" \
  -H "Authorization: Bearer $KC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin@zol.be",
    "email": "admin@zol.be",
    "firstName": "ZOL",
    "lastName": "Admin",
    "enabled": true,
    "emailVerified": true,
    "credentials": [{
      "type": "password",
      "value": "YOUR_ADMIN_PASSWORD_HERE",
      "temporary": false
    }]
  }'

# Get the user ID (exact=true avoids substring matches on similar usernames)
USER_ID=$(curl -s \
  "http://localhost:8080/admin/realms/zol/users?username=admin@zol.be&exact=true" \
  -H "Authorization: Bearer $KC_ADMIN_TOKEN" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)[0]['id'])")

# Get the admin role ID
ADMIN_ROLE=$(curl -s \
  "http://localhost:8080/admin/realms/zol/roles/admin" \
  -H "Authorization: Bearer $KC_ADMIN_TOKEN")

# Assign the admin role
curl -s -X POST \
  "http://localhost:8080/admin/realms/zol/users/${USER_ID}/role-mappings/realm" \
  -H "Authorization: Bearer $KC_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d "[${ADMIN_ROLE}]"

Step 3: Obtain a Bearer Token

All authenticated API calls require a Bearer token from Keycloak.

# Obtain an access token for the admin user
TOKEN=$(curl -s -X POST \
  "https://YOUR_DOMAIN/realms/zol/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=zol-rag-frontend" \
  -d "username=admin@zol.be" \
  --data-urlencode "password=YOUR_ADMIN_PASSWORD_HERE" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# Verify the token works
curl -s https://YOUR_DOMAIN/api/v1/auth/me \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool

Warning

Access tokens expire after 5 minutes by default. During a long seeding procedure, you may need to re-obtain the token between steps.
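
Because Keycloak access tokens are JWTs, the remaining lifetime can be read from the token's `exp` claim before each step instead of guessing. The sketch below is illustrative only: the helper names and the 30-second margin are assumptions, and the payload is decoded without signature verification, which is acceptable for a local "should I refresh?" check but not for any authorization decision.

```python
import base64
import json
import time


def seconds_until_expiry(access_token, now=None):
    """Read the `exp` claim from a JWT and return the seconds remaining.

    The signature is NOT verified; use this only to decide when to refresh.
    """
    payload_b64 = access_token.split(".")[1]
    # JWT segments are base64url without padding; restore the padding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


def needs_refresh(access_token, margin_seconds=30, now=None):
    """True when the token expires within `margin_seconds` (assumed default)."""
    return seconds_until_expiry(access_token, now=now) <= margin_seconds
```

In a seeding script, a call such as `needs_refresh(os.environ["TOKEN"])` before each request would indicate when to repeat the token command above.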

Step 4: Initial Content Crawl

The system crawls the ZOL website to build its content database:

# Trigger a content crawl
curl -X POST https://YOUR_DOMAIN/api/v1/crawl/start \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"max_pages": 500}'

# Monitor crawl progress
curl -s https://YOUR_DOMAIN/api/v1/crawl/status \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool

During the crawl, the system:

Fetches the ZOL website (~500 pages)
Extracts text content from each page
Generates embeddings using the configured embedding model
Stores everything in PostgreSQL with pgvector

Duration: 30-60 minutes depending on server speed. Monitor progress in the admin dashboard at https://YOUR_DOMAIN/admin.
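
For unattended seeding, the status endpoint can be polled until the crawl finishes. This is a hedged sketch: `wait_for` and `fetch_crawl_status` are hypothetical helpers, and the shape of the status response is not documented here, so the completion test is left to a caller-supplied predicate (the `"state" == "completed"` check in the usage note is an assumption).

```python
import json
import time
import urllib.request


def fetch_crawl_status(base_url, token):
    """GET the crawl status endpoint shown above and parse the JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/crawl/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for(fetch_status, is_done, interval=30.0, timeout=3600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_status()` until `is_done(status)` is true or time runs out.

    Returns the last status on success and raises TimeoutError otherwise.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"not finished after {timeout:.0f}s: {status!r}")
        sleep(interval)
```

A call might look like `wait_for(lambda: fetch_crawl_status("https://YOUR_DOMAIN", token), lambda s: s.get("state") == "completed")`, with the predicate adjusted to whatever the status endpoint actually returns. Since the crawl can outlast the 5-minute token lifetime, the `fetch_status` callable should obtain a fresh token when needed.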

Step 5: Taxonomy Extraction

After the crawl completes, run extraction from the Taxonomy Platform UI. The system auto-detects hub pages (navigational listing pages) via an LLM classifier -- no manual configuration needed.

Open the admin UI at https://YOUR_DOMAIN/admin
Navigate to the Taxonomy Platform section
Run extraction -- hub pages are auto-detected from crawled content
Review and approve the extraction proposals
Restart the application so FrozenTaxonomyRegistry loads the approved taxonomy from the database

You can also trigger a self-heal to catch any missed extractions:

# Re-obtain token if expired
TOKEN=$(curl -s -X POST \
  "https://YOUR_DOMAIN/realms/zol/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=zol-rag-frontend" \
  -d "username=admin@zol.be" \
  --data-urlencode "password=YOUR_ADMIN_PASSWORD_HERE" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

# Check graph status
curl -s https://YOUR_DOMAIN/api/v1/diagnostics/status \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool

# If entities are missing, run self-heal (graph extraction)
curl -X POST "https://YOUR_DOMAIN/api/v1/diagnostics/self-heal/graph?dry_run=false" \
  -H "Authorization: Bearer $TOKEN"

Step 6: Post-Ingest Finalization (Required)

These run once after a full crawl + taxonomy extraction. On a fresh install, 6b and 6c are required operator steps (not part of the ingest pipeline) — skipping them leaves features silently broken even though search "works". 6a is now automatic at ingest time and is only needed to re-stamp legacy docs (see its note below). All three are idempotent (safe to re-run) and read DATABASE_URL (already set inside the zol-app container).

Run them inside the app container:

# 6a. Validate & stamp department schedules  — AUTOMATIC on fresh installs
#     `department_schedule_meta.validated` is now stamped automatically at
#     ingest time for every raadplegingen page. Run this script only to
#     RE-STAMP legacy docs ingested before this was automatic, or after a
#     validation-logic change. Safe to skip on a fresh install (new docs are
#     stamped at crawl time). Idempotent.
docker exec zol-app python -m scripts.validate_and_stamp_department_schedules

# 6b. Seed ontology aliases  — REQUIRED for spoken-specialty resolution
#     Materialises the approved specialty-noun → department map (e.g. "longarts"
#     → Pneumologie) into app.ontology_aliases, and creates any missing
#     department taxonomy entities so the EntityLinker can resolve them.
docker exec zol-app python -m scripts.seed_ontology_aliases

# 6c. Seed SNOMED domain mapping  — REQUIRED for domain plausibility guards
#     ~38 SNOMED parent-concept → clinical-domain rows. Must run AFTER the SNOMED
#     import (Step 1). Idempotent UPSERT.
docker exec zol-app python -m scripts.seed_snomed_domain_mapping
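
The two required steps can also be chained so that a failure in 6b stops the run before 6c. The following is a minimal sketch under stated assumptions: `run_finalization` is a hypothetical wrapper, not a project script, and it only reuses the `docker exec` commands shown above. 6a is left out because it is automatic on fresh installs.

```python
import subprocess

# Module names taken from the commands above.
REQUIRED_SCRIPTS = [
    "scripts.seed_ontology_aliases",       # 6b
    "scripts.seed_snomed_domain_mapping",  # 6c
]


def run_finalization(scripts=REQUIRED_SCRIPTS, container="zol-app",
                     run=subprocess.run):
    """Run each script inside the container, stopping at the first failure.

    Returns the list of commands that completed successfully.
    """
    done = []
    for module in scripts:
        cmd = ["docker", "exec", container, "python", "-m", module]
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so a failed 6b is never followed by 6c.
        run(cmd, check=True)
        done.append(cmd)
    return done


if __name__ == "__main__":
    for cmd in run_finalization():
        print("ok:", " ".join(cmd))
```

Because both scripts are described as idempotent, re-running the wrapper after fixing a failure should be safe.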

Verify the stamp worked

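After ingest (or after running 6a on legacy docs), confirm that at least some departments are validated; this is what the schedule tools resolve against. $POSTGRES_USER and $POSTGRES_DB are expanded by the host shell, so they must be set there before running the command: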

docker exec zol-postgres psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -t -c \
  "SELECT count(*) FROM app.documents
   WHERE metadata->'department_schedule_meta'->>'validated' = 'true';"

A result of 0 means no schedules were validated. Re-check that the crawl ingested the … – Raadplegingen overview pages and, if you ran 6a for legacy docs, that it completed without error.

These are at-rest finalization, not historical backfills

backend/scripts/backfill_consultation_schedule.py and backfill_department_schedule.py are one-shot historical migrations for corpora ingested before the at-ingest extractors existed. Do not run them on a fresh install: new docs get consultation_schedule / department_schedule extracted automatically at ingest time. On a new server only the finalization scripts above apply: 6b and 6c are required, and 6a is needed only to re-stamp legacy docs.

6d. Department co-reference clustering — optional, human-gated (run as needed)

A hospital often has the same department split across several taxonomy entities (e.g. Mond- Kaak- en Aangezichtschirurgie, MKA-arts, … (MKA), … Aangezichtsheelkunde), where only one carries the WORKS_IN doctors. A roster query that resolves to one of the empty duplicates returns nothing. cluster_department_entities.py links co-referent entities under a shared dedup_cluster_id, so the roster aggregates doctors across the whole cluster.

Unlike 6a–6c this is two-phase and human-gated — it is NOT run blindly, because writing a cluster makes it immediately trusted by the roster read-path:

# 1. PROPOSE — read-only; prints clusters + writes a reviewable JSON. No DB writes.
docker exec zol-app python -m scripts.cluster_department_entities \
  --hospital <HOSPITAL_UUID> --out /tmp/dept_clusters.json
docker cp zol-app:/tmp/dept_clusters.json ./dept_clusters.json

# 2. A human reviews dept_clusters.json and DELETES any cluster that is not
#    truly the same department (token-subset matching is conservative, but the
#    confirmation is the only trust boundary — there is no read-time guard).

# 3. APPLY — writes dedup_cluster_id for the approved clusters only. Idempotent.
docker cp ./dept_clusters.json zol-app:/tmp/dept_clusters.json
docker exec zol-app python -m scripts.cluster_department_entities \
  --apply --in /tmp/dept_clusters.json --hospital <HOSPITAL_UUID>

Re-run after a full re-ingest

The cluster lives in taxonomy_entities.dedup_cluster_id. A full re-ingest can re-extract the entities and reset it, silently regressing roster queries for the clustered department. Re-run the propose → review → apply cycle after any full re-ingest. (Tracked: issue #175.)
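
To illustrate why the shared dedup_cluster_id matters for roster queries, here is a toy model of the aggregation. It is a sketch only: the in-memory `entities` mapping and the `doctors` field are hypothetical stand-ins for the taxonomy tables and WORKS_IN edges, and the real read-path may differ.

```python
def roster_for(entity_id, entities):
    """Return the sorted doctors for an entity, pooled across its cluster.

    `entities` maps an entity id to a dict with the keys
    "dedup_cluster_id" (string or None) and "doctors" (list of names).
    """
    cluster_id = entities[entity_id]["dedup_cluster_id"]
    if cluster_id is None:
        # Unclustered entity: only its own doctors are visible.
        return sorted(entities[entity_id]["doctors"])
    pooled = set()
    for entity in entities.values():
        if entity["dedup_cluster_id"] == cluster_id:
            pooled.update(entity["doctors"])
    return sorted(pooled)


if __name__ == "__main__":
    # Placeholder data: one populated entity and one empty duplicate.
    entities = {
        "Mond- Kaak- en Aangezichtschirurgie": {
            "dedup_cluster_id": None, "doctors": ["Dr. A", "Dr. B"]},
        "MKA-arts": {"dedup_cluster_id": None, "doctors": []},
    }
    print(roster_for("MKA-arts", entities))  # empty before clustering
    for entity in entities.values():
        entity["dedup_cluster_id"] = "cluster-1"
    print(roster_for("MKA-arts", entities))  # pooled after clustering
```

The same model shows the regression described above: if a re-ingest resets dedup_cluster_id to empty, a query that resolves to the duplicate entity returns nothing again.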

Step 7: Verify Search Works

# Test a search query (public endpoint, no auth needed)
curl -X POST https://YOUR_DOMAIN/api/v1/public/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Welke arts behandelt een hernia?"}'

# Test the doctor-schedule tool end-to-end
curl -X POST https://YOUR_DOMAIN/api/v1/public/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Welke artsen van urologie houden raadpleging op maandagvoormiddag?"}'

The second query should return a grounded list of doctors with citations. If it hedges ("ik kan geen betrouwbare lijst geven…"), the crawl likely did not ingest the … – Raadplegingen overview pages, or their schedules failed validation — check that department_schedule_meta.validated = 'true' exists for some documents (the Step 6a verify query above). On a fresh install these are stamped at ingest; for legacy docs, run Step 6a.

You should receive a JSON response with relevant results, source citations, and the safety disclaimer.

Disk Usage Estimates

Data	Current Size	1-Year Estimate
PostgreSQL (relational + embeddings + taxonomy)	~2.5 GB	~5.5 GB
MinIO documents	~5 GB	~11 GB
SNOMED tables	~2 GB	~2 GB (static)
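
For volume sizing, the rows above can be summed. The snippet below is a small arithmetic sketch using only the figures from the table; the names `ESTIMATES_GB` and `totals` are illustrative, and no headroom factor is applied.

```python
# Figures in GB, copied from the table above: (current, one-year estimate).
ESTIMATES_GB = {
    "PostgreSQL": (2.5, 5.5),
    "MinIO documents": (5.0, 11.0),
    "SNOMED tables": (2.0, 2.0),
}


def totals(estimates=ESTIMATES_GB):
    """Return (current_total, one_year_total, growth) in GB."""
    current = sum(now for now, _ in estimates.values())
    one_year = sum(later for _, later in estimates.values())
    return current, one_year, one_year - current


if __name__ == "__main__":
    current, one_year, growth = totals()
    print(f"current: {current} GB, in one year: {one_year} GB (+{growth} GB)")
```

With the table's figures this gives about 9.5 GB today and about 18.5 GB after one year, roughly 9 GB of growth, all of it from PostgreSQL and MinIO since the SNOMED tables are static.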

Architectural Evolution

The data seeding workflow was updated in March 2026 to reflect the migration from cookie-based authentication to Keycloak OIDC. All API calls now use Bearer token authentication obtained from Keycloak's token endpoint, replacing the previous pattern of registering users via /api/v1/auth/register and using cookie-based sessions. Admin user creation is now performed through the Keycloak admin console or REST API rather than through application-level registration endpoints.

