
PII Protection

Healthcare data is among the most sensitive categories of personal information under EU law: GDPR Art. 9 elevates "data concerning health" to special-category status with stricter processing conditions, and the Belgian Federal Public Service Public Health enforces sector-specific privacy obligations on top of GDPR. The ZOL Intelligent Search system implements a nuanced PII-protection strategy that balances those obligations with the practical reality that hospital public information is intended to contain contact data.

Three-Layer PII Architecture

Layer 1: HTTP request PII detection

Status: Always active.

Every incoming user query is scanned for PII patterns before processing. This layer serves an audit and monitoring purpose rather than a blocking function. When PII is detected in a user query:

  1. The PII type and occurrence are logged to the audit trail
  2. The query is processed normally (the system does not reject PII-containing queries)
  3. The PII data is not retained in the semantic-query cache (app.semantic_cache)
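
The three steps above can be sketched as follows. This is a minimal illustration rather than the project's code: `handle_query`, `detect_pii_types`, the in-memory `cache` dict, and the `audit_log` list are hypothetical stand-ins, and only a simple email pattern is used for detection.

```python
import re
from datetime import datetime, timezone

# Hypothetical, deliberately small pattern set; a real detector covers more types.
_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

audit_log: list[dict] = []   # stand-in for the audit trail
cache: dict[str, str] = {}   # stand-in for app.semantic_cache


def detect_pii_types(query: str) -> list[str]:
    """Return the names of the PII types found in the query."""
    return [name for name, rx in _PATTERNS.items() if rx.search(query)]


def handle_query(query: str, answer_fn) -> str:
    pii_types = detect_pii_types(query)
    for pii_type in pii_types:
        # Step 1: log the type and occurrence, never the matched value.
        audit_log.append({
            "event": "pii_detected_in_query",
            "pii_type": pii_type,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    # Step 2: the query is processed whether or not PII was found.
    answer = answer_fn(query)
    # Step 3: only PII-free queries are written to the cache.
    if not pii_types:
        cache[query] = answer
    return answer
```

Detection is kept separate from answering, so a match changes only what is logged and cached, not whether the user receives a response.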

Detection patterns

| PII type | Pattern | Example | Regulatory anchor |
| --- | --- | --- | --- |
| Email | Standard email regex | patient@example.com | GDPR Art. 4(1) |
| Phone (Belgian) | +32 / 04xx / 089/... formats | 089 32 50 50, +32 89 325050 | GDPR Art. 4(1); ITU-T E.164 (international form) |
| Rijksregisternummer | Belgian national ID, 11-digit checksum format | 85.01.15-123.45 | Belgian Royal Decree of 8 January 1973 |
| IBAN | International bank-account number | BE68 5390 0754 7034 | ISO 13616 |
| Credit card | Major card-number patterns | 4111 1111 1111 1111 | PCI-DSS handling not in scope (no card processing) |
| IP address | IPv4 / IPv6 | 192.168.1.1 | GDPR Art. 4(1) (CJEU C-582/14, Breyer) |
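
A few of the pattern classes in the table above can be sketched as regular expressions. The expressions below are illustrative assumptions, not the production patterns, and the phone, credit-card, and IPv6 classes are omitted for brevity. The Rijksregisternummer check uses the standard mod-97 rule: the last two digits equal 97 minus the first nine digits modulo 97, with a "2" prefixed to the nine digits for births from 2000 onward.

```python
import re

# Illustrative patterns only; the production expressions may differ.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "rijksregisternummer": re.compile(
        r"\b\d{2}\.?\d{2}\.?\d{2}[-.]?\d{3}[-.]?\d{2}\b"
    ),
    "iban_be": re.compile(r"\bBE\d{2}(?: ?\d{4}){3}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def rrn_checksum_ok(candidate: str) -> bool:
    """Validate the mod-97 check digits of an 11-digit national number."""
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 11:
        return False
    base, check = digits[:9], int(digits[9:])
    # Births before 2000 use the 9-digit base as-is; later births prefix a "2".
    return check in (97 - int(base) % 97, 97 - int("2" + base) % 97)


def detect(text: str) -> set[str]:
    """Return the set of PII type names found in the text."""
    found = set()
    for name, rx in PII_PATTERNS.items():
        for match in rx.finditer(text):
            # A format match alone is not enough for the national number.
            if name == "rijksregisternummer" and not rrn_checksum_ok(match.group()):
                continue
            found.add(name)
    return found
```

Requiring the checksum in addition to the format keeps arbitrary 11-digit runs from being reported as national numbers.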

Why not block PII-containing queries?

A patient might legitimately include PII in a query: "I'm looking for Dr. Peeters, my phone is 0489 12 34 56, can you have them call me?" Blocking such queries would degrade user experience without commensurate privacy benefit. Instead, the system:

  • Processes the query using only the semantically relevant parts;
  • Logs the PII occurrence for compliance monitoring (GDPR Art. 30 records of processing activities);
  • Excludes responses to PII-containing queries from the semantic cache so the PII is not embedded into a long-lived store.

This design follows the GDPR Art. 25 data protection by design principle: the processing is configured to minimise personal-data retention rather than to refuse the legitimate request.

Layer 2: content-side PII masking

Status: Disabled for ZOL (config: pii_mask_document_content=false).

The ingestion pipeline (backend/app/services/processing_service.py) includes a PII-masking capability that can detect and redact PII from ingested document content. For the ZOL deployment, this layer is intentionally disabled.
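
A configuration-gated masking step of this kind might look like the sketch below. Only the flag name `pii_mask_document_content` comes from the deployment config. The `ProcessingSettings` class, the `prepare_document_text` function, and the single email pattern are hypothetical and do not reflect the actual contents of processing_service.py.

```python
import re
from dataclasses import dataclass

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class ProcessingSettings:
    # Flag name taken from the deployment config; the class itself is hypothetical.
    pii_mask_document_content: bool = False


def prepare_document_text(text: str, settings: ProcessingSettings) -> str:
    """Return text ready for indexing, masking emails only when the flag is on."""
    if not settings.pii_mask_document_content:
        return text  # flag off: public contact data stays searchable
    return _EMAIL.sub("[REDACTED]", text)
```

With the flag off the text passes through unchanged, so "facturatie@zol.be" remains searchable. With the flag on, the same address is replaced by the placeholder.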

Rationale

Hospital public content is intended to contain contact information:

  • Department emails: facturatie@zol.be, onthaal@zol.be
  • Phone numbers: 089 32 50 50 (main switchboard), department-specific lines
  • Doctor names: published on the public website

Masking this information would render the search system unable to answer fundamental questions like "What is the phone number for the Cardiology department?" or "How do I email the billing department?".

Design decision

This is not a security oversight but a deliberate architectural decision documented as part of the GDPR Art. 25 data protection by design analysis. The ZOL content corpus consists exclusively of publicly published hospital information: there are no patient records, no employee personal data, and no confidential records in the search index. All content is sourced from the hospital's public website and published brochures.

Trade-offs

| Alternative considered | Why rejected |
| --- | --- |
| Mask everything by default | Renders the service useless for its primary task. "What is the phone number for X?", the most common navigational question, would always return [REDACTED]. The privacy benefit is zero (the content is public), the utility loss is catastrophic. |
| Mask only in user-uploaded content | The hospital content corpus is not user-uploaded; it is curated by the hospital. The capability exists in the code but the trigger condition (user-uploaded content with possible non-public PII) does not currently occur. |
| Hash-then-mask | Adds preprocessing cost and creates a synthetic "deidentified" surface that still leaks structure. Standard de-identification literature (HIPAA Safe Harbor) treats this as a medium-confidence transformation; for our use case, "this content is public" is a stronger guarantee than "this content has been hashed." |

When masking would be appropriate

Content-side PII masking is built and tested but disabled for the current ZOL deployment. The capability would activate if the ingestion pipeline were extended to user-uploaded content (a future feature explicitly out of scope).

Layer 3: voice-side redaction

Status: Always active. Implementation: backend/app/services/voice/voice_pii_redaction.py.

The voice channel introduces a different threat surface: callers commonly speak phone numbers, dates of birth, and self-introduce by name as part of normal conversation. Without intervention, those utterances would surface verbatim in structured logs that are designed for engineering observability rather than personal-data retention. The voice-side redaction module strips PII patterns before structured-log emission, satisfying GDPR Art. 5(1)(c) data minimisation at the log boundary.

Patterns covered

| Pattern class | Coverage | Example match |
| --- | --- | --- |
| Belgian phone (international form) | +32 ... with various separators | +32 89 80 80 80 |
| Belgian phone (domestic form) | 0XX ... | 0473 12 34 56, 089/80.80.80 |
| Belgian phone (compact) | 9–10 digit run | 089808080 |
| International phone fallback | 8+ digit run with separators | 123 456 7890 |
| Date of birth | DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY (4-digit year required) | 15/01/1985 |
| Self-introduction names | Trigger phrase + 1–3 capitalised tokens | "ik ben Anna Verstraeten" → "ik ben [REDACTED:name]" |
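
A subset of these pattern classes can be sketched with the standard `re` module. The expressions, the `redact` function, the `[REDACTED:phone]` and `[REDACTED:dob]` placeholders, and the second trigger phrase ("mijn naam is") are assumptions made for illustration. Only the `[REDACTED:name]` placeholder and the "ik ben" trigger appear in the table, and voice_pii_redaction.py may be implemented differently.

```python
import re

# Dates are redacted before phones: a date such as 01/01/1985 would otherwise
# also satisfy the domestic phone pattern.
_DOB = re.compile(r"\b\d{2}[/.-]\d{2}[/.-]\d{4}\b")
_PHONE_INTL = re.compile(r"\+32(?:[ ./-]?\d){8,9}")
_PHONE_DOMESTIC = re.compile(r"\b0\d{1,3}(?:[ ./-]?\d){6,8}\b")
# Trigger phrase is case-insensitive; the 1-3 name tokens must be capitalised.
_SELF_INTRO = re.compile(
    r"\b((?i:ik ben|mijn naam is)) ((?:[A-Z][\w'-]+)(?: [A-Z][\w'-]+){0,2})"
)

_RULES = (
    ("dob", _DOB, "[REDACTED:dob]"),
    ("phone", _PHONE_INTL, "[REDACTED:phone]"),
    ("phone", _PHONE_DOMESTIC, "[REDACTED:phone]"),
    ("name", _SELF_INTRO, r"\1 [REDACTED:name]"),
)


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Return the redacted text plus a per-class match count for audit metadata."""
    counts: dict[str, int] = {}
    for label, rx, replacement in _RULES:
        text, n = rx.subn(replacement, text)
        if n:
            counts[label] = counts.get(label, 0) + n
    return text, counts
```

Returning the counts alongside the text lets the caller record the pattern class and the number of redactions without ever logging the matched values.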

Pseudonymisation for audit-trail correlation

The module also exposes hash_for_audit(), which returns a salted, truncated SHA-256 digest of the redacted text. The hash is used in audit-log lines where the operator needs to correlate turns across pipeline stages without retaining plaintext. This is pseudonymisation as defined in GDPR Art. 4(5): "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information". The hash is one-way (SHA-256), salted with a constant (the salt serves correlation, not security), and truncated to 16 hex characters (64 bits). At that length, accidental collisions become likely only around 2^32 (about four billion) hashed values, far more than cross-stage correlation requires.
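
A minimal sketch of such a function, using only the standard `hashlib` module, is shown below. The function name comes from the text above. The signature, the salt value, and the choice to prepend the salt to the UTF-8 encoded text are assumptions.

```python
import hashlib

# Assumed constant; the real salt value is not given in this document.
_AUDIT_SALT = b"example-audit-salt"


def hash_for_audit(redacted_text: str) -> str:
    """Return a 16-hex-character correlation token for already-redacted text."""
    digest = hashlib.sha256(_AUDIT_SALT + redacted_text.encode("utf-8")).hexdigest()
    return digest[:16]  # 16 hex characters = 64 bits
```

Because the salt is constant, the same redacted text always maps to the same token. That property is what makes cross-stage correlation possible.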

What is intentionally NOT covered

| Pattern | Why not |
| --- | --- |
| Rijksregisternummer | Would require 11-digit + checksum logic; fail-open beats false-positive blocking on legitimate utterances containing numbers (such as "I have 11 questions about ..."). Logged in structured fields, not free text. |
| Email addresses | Callers do not speak email addresses over voice; pattern would never trigger. |
| Medical diagnoses | Protected by domain disclosure (the system never claims a diagnosis), not by redaction. Redacting the word cardiologie would break the search use case. |

Audit trail (GDPR Art. 30)

All PII-related events are logged to a structured audit trail per Art. 30 records of processing activities:

| Event | Logged data | Retention |
| --- | --- | --- |
| PII detected in query | PII type, timestamp, session ID | 90 days (see Data Retention Policy) |
| PII-containing query processed | Query hash (not content), response status | 90 days |
| Voice transcript redaction | Pattern class matched, redaction count | Permanent (engineering-only metadata, no PII) |
| GDPR Art. 17 deletion | User ID, deletion counts per table | Permanent (compliance evidence) |

The audit trail supports compliance reporting and enables the hospital's privacy officer to monitor PII-exposure patterns without accessing the PII data itself.
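
As an illustration, a "PII detected in query" record carrying only the fields named in the table could be built and emitted as one JSON line. The logger name, the event name, the field names, and both helper functions are hypothetical. The actual audit schema is not specified here.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("audit")  # hypothetical logger name


def build_pii_event(pii_type: str, session_id: str) -> dict:
    """Build a record holding the PII type, timestamp, and session ID only."""
    return {
        "event": "pii_detected_in_query",
        "pii_type": pii_type,
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def emit(event: dict) -> str:
    """Serialise the event as a single JSON line and hand it to the logger."""
    line = json.dumps(event, sort_keys=True)
    audit_logger.info(line)
    return line
```

The record carries no field for the matched value, so the privacy officer can review exposure patterns without the log ever holding the PII itself.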

GDPR alignment

The PII-protection strategy aligns with the relevant GDPR principles:

| GDPR principle | Article | Implementation |
| --- | --- | --- |
| Lawfulness, fairness, transparency | Art. 5(1)(a) | Privacy notice references this strategy; disclosure that the system processes user queries |
| Purpose limitation | Art. 5(1)(b) | PII in queries is used only for response generation, never for secondary purposes |
| Data minimisation | Art. 5(1)(c) | Semantic cache excludes PII; voice logs redact PII before emission |
| Storage limitation | Art. 5(1)(e) | Audit logs retained per the Data Retention Policy |
| Integrity and confidentiality | Art. 5(1)(f) | PII logging uses structured format without raw content; voice redaction; TLS in transit |
| Accountability | Art. 5(2) | Audit trail provides compliance evidence; this document is itself an accountability artifact |
| Pseudonymisation | Art. 4(5), Art. 25 | hash_for_audit() SHA-256 correlation hashing; voice transcript redaction |
| Right to erasure | Art. 17 | DELETE /api/v1/gdpr/users/{user_id}/data cascades through all PII-touching tables (backend/app/api/gdpr.py) |
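
The erasure cascade in the last row, together with the per-table deletion counts recorded in the audit trail, can be sketched with the standard `sqlite3` module. The table names, the `user_id` column, and the `delete_user_data` function are hypothetical. The real schema and the handler in backend/app/api/gdpr.py are not shown in this document.

```python
import sqlite3

# Hypothetical table names; the real schema is not described here.
PII_TABLES = ("chat_messages", "feedback", "sessions")


def delete_user_data(conn: sqlite3.Connection, user_id: str) -> dict[str, int]:
    """Delete a user's rows from every listed table and return per-table counts."""
    counts: dict[str, int] = {}
    with conn:  # one transaction: commits on success, rolls back on error
        for table in PII_TABLES:
            # Table names come from the constant above, never from user input.
            cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
            counts[table] = cur.rowcount
    return counts
```

Running all deletions in a single transaction means a failure part-way through leaves no table partially erased. The returned counts can then be written to the audit trail as compliance evidence.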

Comparative regulatory mapping

| Aspect | EU GDPR | U.S. HIPAA |
| --- | --- | --- |
| Identifier scope | Art. 4(1): "any information relating to an identified or identifiable natural person" (open-ended) | HIPAA Safe Harbor: explicit list of 18 identifiers |
| Pseudonymisation | Art. 4(5): defined there, and named as an example safeguard in Art. 25 and Art. 32 | "Limited Data Set" allows specified PHI; full Safe Harbor de-identifies all 18 |
| Sector applicability | Cross-sector | Healthcare-only (covered entities + business associates) |
| Sanction model | Art. 83: up to EUR 20 million or 4 % of total worldwide annual turnover, whichever is higher | OCR civil-monetary penalties tiered by violation severity |

The ZOL system is GDPR-primary; the HIPAA Safe Harbor list is referenced as a complementary identifier inventory because the 18 categories overlap substantially with European supervisory-authority guidance on identifier examples.

See GDPR (Regulation (EU) 2016/679) for canonical text.

References

  • Regulation (EU) 2016/679: General Data Protection Regulation, Articles 4, 5, 17, 25, 28, 30, 32.
  • HIPAA Privacy Rule, 45 CFR 164.514(b)(2): Safe Harbor de-identification methodology (U.S. analogue).
  • ITU-T Recommendation E.164: the international public telecommunication numbering plan.
  • CJEU C-582/14 Patrick Breyer v Bundesrepublik Deutschland (2016): IP addresses are personal data when combined with means likely to be used to identify the data subject.
  • Belgian Royal Decree of 8 January 1973: Rijksregisternummer format.
  • Data Retention Policy: retention durations for all PII-touching data classes.
  • DPIA: Art. 35 risk assessment.