Data Seeding
After the application is running, seed it with SNOMED terminology, an initial content crawl, and taxonomy extraction via the Taxonomy Platform.
Checklist
- Run database migrations (alembic upgrade head)
- Import SNOMED CT Belgian Edition (using push-snomed.sh)
- Create admin user in Keycloak
- Obtain a Bearer token for API calls
- Create hospital via minimal YAML or admin API
- Run initial website crawl (~500 pages)
- Run extraction from Taxonomy Platform UI -- hub pages auto-detected
- Admin reviews/approves extraction proposals
- App restart -- FrozenTaxonomyRegistry loads from DB
- Run post-ingest finalization scripts (Step 6: schedules, aliases, domain map)
- Verify search works with a test query
After a full crawl and taxonomy extraction, Steps 6b and 6c must be run
manually; they are NOT part of the ingest pipeline. Skipping them leaves
features silently broken on a fresh server: spoken-specialty resolution
fails, and SNOMED domain guards have no mapping. Step 6a (schedule
stamping) is no longer required on fresh installs because
department_schedule_meta is now stamped automatically at ingest time. Run 6a
only to re-stamp legacy docs or after a validation-logic change. See Step 6.
Step 1: SNOMED CT Import
SNOMED CT provides the medical terminology backbone -- synonym expansion, concept hierarchy, and body-site-to-department mapping.
From Your Mac (Recommended)
Use the push-snomed.sh script to upload and import in one step:
# From your local machine (Mac)
cd /path/to/zol-rag
./scripts/push-snomed.sh deploy@YOUR_SERVER_IP
This will:
- Find ~/Downloads/SnomedCT_ManagedServiceBE_PRODUCTION_*
- Compress (~2 GB → ~400 MB)
- Upload to /opt/zol-rag/snomed/ on the server
- Run the import inside the app container
- Report timing and row counts
Preview without executing:
./scripts/push-snomed.sh deploy@YOUR_SERVER_IP --dry-run
Manual Import (If Already on Server)
If you've already copied the RF2 files to the server:
# Find the Snapshot directory
find /opt/zol-rag/snomed/ -type d -name Snapshot
# Run the import via docker exec
docker exec zol-app python -m scripts.import_snomed_rf2 \
--rf2-dir /opt/zol-rag/snomed/SnomedCT_*/Snapshot
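The wildcard in the command above assumes exactly one extracted release under /opt/zol-rag/snomed/. If several releases may be present, a small check can refuse to guess. The following is a minimal sketch, not part of the repository; find_snapshot_dir is a hypothetical helper name.

```python
from pathlib import Path


def find_snapshot_dir(root: str) -> Path:
    """Return the single RF2 Snapshot directory under root, or raise."""
    candidates = sorted(p for p in Path(root).rglob("Snapshot") if p.is_dir())
    if not candidates:
        raise FileNotFoundError(f"no Snapshot directory found under {root}")
    if len(candidates) > 1:
        # Several extracted releases: refuse to guess which one to import.
        names = ", ".join(str(p) for p in candidates)
        raise RuntimeError(f"multiple Snapshot directories found: {names}")
    return candidates[0]


if __name__ == "__main__":
    # Path taken from the manual-import instructions above.
    print(find_snapshot_dir("/opt/zol-rag/snomed"))
```

The printed path can then be passed to --rf2-dir in place of the wildcard.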
Expected Results
| Table | Expected Rows |
|---|---|
| snomed_concepts | ~350,000 |
| snomed_descriptions | ~500,000 |
| snomed_relationships | ~900,000 |
| snomed_transitive_closure | ~5,000,000 |
Import takes 5-15 minutes depending on server speed.
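The approximate figures in the table can be turned into an automated sanity check. The sketch below is illustrative only: check_row_counts and its 50% tolerance are assumptions, chosen loosely because exact counts vary by release.

```python
# Approximate expected row counts, copied from the table above.
EXPECTED_ROWS = {
    "snomed_concepts": 350_000,
    "snomed_descriptions": 500_000,
    "snomed_relationships": 900_000,
    "snomed_transitive_closure": 5_000_000,
}


def check_row_counts(actual: dict[str, int], tolerance: float = 0.5) -> list[str]:
    """Return a list of problems; an empty list means every table looks plausible.

    tolerance is a hypothetical threshold: 0.5 accepts counts within +/-50%
    of the approximate figures.
    """
    problems = []
    for table, expected in EXPECTED_ROWS.items():
        count = actual.get(table, 0)  # a missing table counts as zero rows
        low, high = expected * (1 - tolerance), expected * (1 + tolerance)
        if not low <= count <= high:
            problems.append(f"{table}: {count:,} rows, expected roughly {expected:,}")
    return problems
```

Feed it a dictionary of table name to COUNT(*) result; any returned message points at a table that is empty or far from the expected size.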
Verify SNOMED Import
docker exec zol-app python -c "
import os, psycopg2
url = os.environ.get('DATABASE_URL', '').replace('+asyncpg', '')
conn = psycopg2.connect(url)
cur = conn.cursor()
for t in ['snomed_concepts', 'snomed_descriptions', 'snomed_relationships', 'snomed_transitive_closure']:
    cur.execute(f'SELECT COUNT(*) FROM app.{t}')
    print(f'{t}: {cur.fetchone()[0]:,} rows')
conn.close()
"
Step 2: Create Admin User in Keycloak
Before seeding content, create an admin user in Keycloak. You can do this via the Keycloak admin console or via the Keycloak REST API.
Via Keycloak Admin Console (Recommended)
- Open an SSH tunnel: ssh -L 8080:127.0.0.1:8080 deploy@YOUR_SERVER_IP
- Navigate to http://localhost:8080/admin
- Log in with Keycloak admin credentials (from .env.prod)
- Select the "zol" realm
- Create a new user (see User Management for detailed steps)
- Assign the admin realm role
Via Keycloak REST API
# Get a Keycloak admin token
KC_ADMIN_TOKEN=$(curl -s -X POST \
"http://localhost:8080/realms/master/protocol/openid-connect/token" \
-d "grant_type=password" \
-d "client_id=admin-cli" \
-d "username=admin" \
-d "password=${KEYCLOAK_ADMIN_PASSWORD}" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# Create the admin user
curl -s -X POST \
"http://localhost:8080/admin/realms/zol/users" \
-H "Authorization: Bearer $KC_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"username": "admin@zol.be",
"email": "admin@zol.be",
"firstName": "ZOL",
"lastName": "Admin",
"enabled": true,
"emailVerified": true,
"credentials": [{
"type": "password",
"value": "YOUR_ADMIN_PASSWORD_HERE",
"temporary": false
}]
}'
# Get the user ID
USER_ID=$(curl -s \
"http://localhost:8080/admin/realms/zol/users?username=admin@zol.be" \
-H "Authorization: Bearer $KC_ADMIN_TOKEN" \
| python3 -c "import sys,json; print(json.load(sys.stdin)[0]['id'])")
# Get the admin role ID
ADMIN_ROLE=$(curl -s \
"http://localhost:8080/admin/realms/zol/roles/admin" \
-H "Authorization: Bearer $KC_ADMIN_TOKEN")
# Assign the admin role
curl -s -X POST \
"http://localhost:8080/admin/realms/zol/users/${USER_ID}/role-mappings/realm" \
-H "Authorization: Bearer $KC_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d "[${ADMIN_ROLE}]"
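When scripting these calls instead of pasting them, it can help to build the JSON bodies programmatically so that passwords containing quotes are escaped correctly. The sketch below mirrors the two request bodies used above; the function names are hypothetical and the HTTP calls themselves are left out.

```python
import json


def build_user_payload(email: str, password: str, first: str, last: str) -> dict:
    """Build the user representation sent to POST /admin/realms/zol/users."""
    return {
        "username": email,
        "email": email,
        "firstName": first,
        "lastName": last,
        "enabled": True,
        "emailVerified": True,
        "credentials": [{"type": "password", "value": password, "temporary": False}],
    }


def build_role_mapping_body(role: dict) -> str:
    """Wrap one role representation in a JSON array.

    The role-mappings call above sends "[${ADMIN_ROLE}]", i.e. a list that
    contains the role object returned by GET /admin/realms/zol/roles/admin.
    """
    return json.dumps([role])
```

Serialising with json.dumps avoids the shell-quoting pitfalls of embedding a password directly in a single-quoted curl body.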
Step 3: Obtain a Bearer Token
All authenticated API calls require a Bearer token from Keycloak.
# Obtain an access token for the admin user
TOKEN=$(curl -s -X POST \
"https://YOUR_DOMAIN/realms/zol/protocol/openid-connect/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=password" \
-d "client_id=zol-rag-frontend" \
-d "username=admin@zol.be" \
-d "password=YOUR_ADMIN_PASSWORD_HERE" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# Verify the token works
curl -s https://YOUR_DOMAIN/api/v1/auth/me \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
Access tokens expire after 5 minutes by default. If you are running a long seeding procedure, you may need to re-obtain the token between steps.
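For scripted seeding, one way to handle the short lifetime is to cache the token and fetch a new one shortly before it expires. This is a minimal sketch: TokenCache and its 30-second margin are assumptions, and fetch stands for any callable that returns the parsed token-endpoint JSON (which, per OAuth 2.0, carries access_token and expires_in).

```python
import time
from typing import Callable


class TokenCache:
    """Cache an access token and re-fetch it shortly before it expires."""

    def __init__(self, fetch: Callable[[], dict], margin: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # returns the parsed token-endpoint JSON
        self._margin = margin    # safety margin in seconds (hypothetical default)
        self._clock = clock      # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            data = self._fetch()
            self._token = data["access_token"]
            # expires_in is the lifetime in seconds; fall back to 5 minutes.
            self._expires_at = now + float(data.get("expires_in", 300))
        return self._token
```

Calling cache.get() before every request keeps the Authorization header valid across the 30-60 minute crawl without re-authenticating on each call.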
Step 4: Initial Content Crawl
The system crawls the ZOL website to build its content database:
# Trigger a content crawl
curl -X POST https://YOUR_DOMAIN/api/v1/crawl/start \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"max_pages": 500}'
# Monitor crawl progress
curl -s https://YOUR_DOMAIN/api/v1/crawl/status \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
The crawl performs these steps:
- Crawls the ZOL website (~500 pages)
- Extracts text content from each page
- Generates embeddings using the configured embedding model
- Stores everything in PostgreSQL with pgvector
Duration: 30-60 minutes depending on server speed. Monitor progress in the admin dashboard at https://YOUR_DOMAIN/admin.
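Instead of re-running the status call by hand, a script can poll until the crawl finishes. The sketch below is generic on purpose: the shape of the /api/v1/crawl/status response is not documented here, so the caller supplies both the status fetcher and the completion test. The function name, interval, and timeout are assumptions.

```python
import time
from typing import Callable


def wait_until_done(get_status: Callable[[], dict],
                    is_done: Callable[[dict], bool],
                    interval: float = 30.0,
                    timeout: float = 2 * 60 * 60,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll get_status() until is_done(status) is true or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if is_done(status):
            return status
        if clock() >= deadline:
            raise TimeoutError("crawl did not finish before the timeout")
        sleep(interval)
```

A two-hour default timeout leaves headroom over the expected 30-60 minutes; inspect one real status response first to decide what is_done should test.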
Step 5: Taxonomy Extraction
After the crawl completes, run extraction from the Taxonomy Platform UI. The system auto-detects hub pages (navigational listing pages) via an LLM classifier -- no manual configuration needed.
- Open the admin UI at https://YOUR_DOMAIN/admin
- Navigate to the Taxonomy Platform section
- Run extraction -- hub pages are auto-detected from crawled content
- Review and approve the extraction proposals
- Restart the application so FrozenTaxonomyRegistry loads the approved taxonomy from DB
You can also trigger a self-heal to catch any missed extractions:
# Re-obtain token if expired
TOKEN=$(curl -s -X POST \
"https://YOUR_DOMAIN/realms/zol/protocol/openid-connect/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "grant_type=password" \
-d "client_id=zol-rag-frontend" \
-d "username=admin@zol.be" \
-d "password=YOUR_ADMIN_PASSWORD_HERE" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# Check graph status
curl -s https://YOUR_DOMAIN/api/v1/diagnostics/status \
-H "Authorization: Bearer $TOKEN" | python3 -m json.tool
# If entities are missing, run self-heal (graph extraction)
curl -X POST "https://YOUR_DOMAIN/api/v1/diagnostics/self-heal/graph?dry_run=false" \
-H "Authorization: Bearer $TOKEN"
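The dry_run query parameter suggests the endpoint can also be called without making changes. Assuming dry_run=true is accepted and only reports what would be done (not confirmed here), a script might default to the safe mode. self_heal_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def self_heal_url(base_url: str, dry_run: bool = True) -> str:
    """Build the graph self-heal URL; dry_run defaults to True for safety."""
    query = urlencode({"dry_run": "true" if dry_run else "false"})
    return f"{base_url.rstrip('/')}/api/v1/diagnostics/self-heal/graph?{query}"
```

Passing dry_run=False reproduces the URL used in the curl command above.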
Step 6: Post-Ingest Finalization (Required)
These run once after a full crawl + taxonomy extraction. On a fresh install,
6b and 6c are required operator steps (not part of the ingest pipeline) —
skipping them leaves features silently broken even though search "works". 6a is
now automatic at ingest time and is only needed to re-stamp legacy docs (see
its note below). All three are idempotent (safe to re-run) and read
DATABASE_URL (already set inside the zol-app container).
Run them inside the app container:
# 6a. Validate & stamp department schedules — AUTOMATIC on fresh installs
# `department_schedule_meta.validated` is now stamped automatically at
# ingest time for every raadplegingen page. Run this script only to
# RE-STAMP legacy docs ingested before this was automatic, or after a
# validation-logic change. Safe to skip on a fresh install (new docs are
# stamped at crawl time). Idempotent.
docker exec zol-app python -m scripts.validate_and_stamp_department_schedules
# 6b. Seed ontology aliases — REQUIRED for spoken-specialty resolution
# Materialises the approved specialty-noun → department map (e.g. "longarts"
# → Pneumologie) into app.ontology_aliases, and creates any missing
# department taxonomy entities so the EntityLinker can resolve them.
docker exec zol-app python -m scripts.seed_ontology_aliases
# 6c. Seed SNOMED domain mapping — REQUIRED for domain plausibility guards
# ~38 SNOMED parent-concept → clinical-domain rows. Must run AFTER the SNOMED
# import (Step 1). Idempotent UPSERT.
docker exec zol-app python -m scripts.seed_snomed_domain_mapping
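On a fresh install, the two required scripts (6b and 6c) can be wrapped in a small runner that stops at the first failure. This is a sketch under the assumption that it runs on the host with access to docker; the function names are hypothetical and the module names are copied from the commands above.

```python
import subprocess

# Required on a fresh install, in the order listed above (6b, then 6c).
REQUIRED_SCRIPTS = [
    "scripts.seed_ontology_aliases",
    "scripts.seed_snomed_domain_mapping",
]


def build_command(module: str, container: str = "zol-app") -> list[str]:
    """Return the docker exec argv that runs one finalization module."""
    return ["docker", "exec", container, "python", "-m", module]


def run_finalization(run=subprocess.run) -> None:
    """Run each required script and stop at the first failure."""
    for module in REQUIRED_SCRIPTS:
        # check=True raises CalledProcessError on a non-zero exit code.
        run(build_command(module), check=True)


if __name__ == "__main__":
    run_finalization()
```

Because both scripts are idempotent, re-running the wrapper after a partial failure is safe.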
After ingest (or after re-running 6a on legacy docs), confirm at least some departments are validated. This is what the schedule tools resolve against:
docker exec zol-postgres psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -t -c \
"SELECT count(*) FROM app.documents
WHERE metadata->'department_schedule_meta'->>'validated' = 'true';"
0 means no schedules were validated. Re-check that the crawl ingested the
… – Raadplegingen overview pages and, if you re-stamped legacy docs, that 6a
ran without error.
backend/scripts/backfill_consultation_schedule.py and
backfill_department_schedule.py are one-shot historical migrations for
corpora ingested before the at-ingest extractors existed. Do not run
them on a fresh install: new docs get consultation_schedule and
department_schedule extracted automatically at ingest time. On a new server,
only the finalization scripts above are needed, and of those only 6b and 6c
are required.
Step 7: Verify Search Works
# Test a search query (public endpoint, no auth needed)
curl -X POST https://YOUR_DOMAIN/api/v1/public/query \
-H "Content-Type: application/json" \
-d '{"query": "Welke arts behandelt een hernia?"}'
# Test the doctor-schedule tool end-to-end
curl -X POST https://YOUR_DOMAIN/api/v1/public/query \
-H "Content-Type: application/json" \
-d '{"query": "Welke artsen van urologie houden raadpleging op maandagvoormiddag?"}'
The second query should return a grounded list of doctors with citations. If it
hedges ("ik kan geen betrouwbare lijst geven…"), the crawl likely did not ingest
the … – Raadplegingen overview pages, or their schedules failed validation — check
that department_schedule_meta.validated = 'true' exists for some documents (the
Step 6a verify query above). On a
fresh install these are stamped at ingest; for legacy docs, run Step 6a.
You should receive a JSON response with relevant results, source citations, and the safety disclaimer.
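A smoke test can flag the hedged answer automatically. The response schema of /api/v1/public/query is not shown here, so this sketch takes the answer text directly; looks_hedged is a hypothetical helper and the marker is the phrase quoted above.

```python
# Hedge phrase quoted in the troubleshooting note above.
HEDGE_MARKER = "ik kan geen betrouwbare lijst geven"


def looks_hedged(answer_text: str) -> bool:
    """Return True when the answer contains the hedge phrase (case-insensitive)."""
    return HEDGE_MARKER in answer_text.lower()
```

A True result for the urologie schedule query is a cue to run the department_schedule_meta verification query from Step 6.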
Disk Usage Estimates
| Data | Current Size | 1-Year Estimate |
|---|---|---|
| PostgreSQL (relational + embeddings + taxonomy) | ~2.5 GB | ~5.5 GB |
| MinIO documents | ~5 GB | ~11 GB |
| SNOMED tables | ~2 GB | ~2 GB (static) |
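Summing the rows gives a rough total for volume sizing: about 9.5 GB today and 18.5 GB after one year. The arithmetic below assumes the three rows do not overlap (that is, the SNOMED tables are not already counted in the PostgreSQL figure), which the table does not state explicitly.

```python
# (current GB, one-year GB) per row of the table above.
ESTIMATES_GB = {
    "postgresql": (2.5, 5.5),
    "minio_documents": (5.0, 11.0),
    "snomed_tables": (2.0, 2.0),
}


def totals(estimates: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the current and one-year columns."""
    current = sum(now for now, _ in estimates.values())
    one_year = sum(later for _, later in estimates.values())
    return current, one_year


if __name__ == "__main__":
    current, one_year = totals(ESTIMATES_GB)
    print(f"current: ~{current} GB, one year: ~{one_year} GB")
```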
Architectural Evolution
The data seeding workflow was updated in March 2026 to reflect the migration from cookie-based authentication to Keycloak OIDC. All API calls now use Bearer token authentication obtained from Keycloak's token endpoint, replacing the previous pattern of registering users via /api/v1/auth/register and using cookie-based sessions. Admin user creation is now performed through the Keycloak admin console or REST API rather than through application-level registration endpoints.