PII Protection
Healthcare data is among the most sensitive categories of personal information under EU law: GDPR Art. 9 elevates "data concerning health" to special-category status with stricter processing conditions, and the Belgian Federal Public Service Public Health enforces sector-specific privacy obligations on top of GDPR. The ZOL Intelligent Search system implements a nuanced PII-protection strategy that balances those obligations with the practical reality that hospital public information is intended to contain contact data.
Three-Layer PII Architecture
Layer 1: HTTP request PII detection
Status: Always active.
Every incoming user query is scanned for PII patterns before processing. This layer serves an audit and monitoring purpose rather than a blocking function. When PII is detected in a user query:
- The PII type and occurrence are logged to the audit trail
- The query is processed normally (the system does not reject PII-containing queries)
- The PII data is not retained in the semantic-query cache (app.semantic_cache)
Detection patterns
| PII type | Pattern | Example | Regulatory anchor |
|---|---|---|---|
| Email | Standard email regex | patient@example.com | GDPR Art. 4(1) |
| Phone (Belgian) | +32 / 04xx / 089/... formats | 089 32 50 50, +32 89 325050 | GDPR Art. 4(1); E.164 (@itu_e164) compliant |
| Rijksregisternummer | Belgian national ID, 11-digit checksum format | 85.01.15-123.45 | Belgian Royal Decree of 8 January 1973 |
| IBAN | International bank-account number | BE68 5390 0754 7034 | ISO 13616 |
| Credit card | Major card-number patterns | 4111 1111 1111 1111 | PCI-DSS handling not in scope (no card processing) |
| IP address | IPv4 / IPv6 | 192.168.1.1 | GDPR Art. 4(1) (CJEU C-582/14, Breyer) |
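The table above can be sketched as a small pattern registry. This is a minimal illustration, not the deployed detector: the regexes, the `PII_PATTERNS` name, and the `detect_pii_types` function are assumptions made for this example, and the Rijksregisternummer, credit-card, and IPv6 patterns are left out for brevity.

```python
import re

# Illustrative patterns only; the deployed regexes are not shown in this document.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # +32 or a leading 0, followed by 8 or 9 further digits with optional separators.
    "phone_be": re.compile(r"(?<![\w.])(?:\+32[\s./-]?|0)\d(?:[\s./-]?\d){7,8}(?!\w)"),
    # Country code, two check digits, then groups of four alphanumerics.
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){2,7}(?:\s?[A-Z0-9]{1,3})?\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
}


def detect_pii_types(query: str) -> dict[str, int]:
    """Return {pii_type: match_count} for every pattern that fires on the query."""
    counts = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(query)
        if matches:
            counts[name] = len(matches)
    return counts
```

Returning only type names and counts keeps the matched text itself out of whatever consumes the result, which fits the audit-only role of this layer.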
Why not block PII-containing queries?
A patient might legitimately include PII in a query: "I'm looking for Dr. Peeters, my phone is 0489 12 34 56, can you have them call me?" Blocking such queries would degrade user experience without commensurate privacy benefit. Instead, the system:
- Processes the query using only the semantically relevant parts;
- Logs the PII occurrence for compliance monitoring (GDPR Art. 30 records of processing activities);
- Excludes responses to PII-containing queries from the semantic cache so the PII is not embedded into a long-lived store.
This design follows the GDPR Art. 25 data protection by design principle: the processing is configured to minimise personal-data retention rather than to refuse the legitimate request.
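The log, process, and skip-cache behaviour described above could look roughly like the following. It is a sketch under stated assumptions: `SemanticCache`, `answer`, and `handle_query` are hypothetical names, the cache is a plain dictionary standing in for app.semantic_cache, and a single phone pattern stands in for the full detector.

```python
import logging
import re
from dataclasses import dataclass, field

audit_log = logging.getLogger("pii_audit")

# Stand-in detector: one phone pattern keeps the sketch short.
_PHONE = re.compile(r"(?:\+32|0)\d(?:[\s./-]?\d){7,8}")


@dataclass
class SemanticCache:
    """Hypothetical in-memory stand-in for app.semantic_cache."""
    entries: dict = field(default_factory=dict)


def answer(query: str) -> str:
    # Placeholder for the real retrieval and generation pipeline.
    return f"answer for {len(query)}-char query"


def handle_query(query: str, cache: SemanticCache, session_id: str) -> str:
    has_pii = bool(_PHONE.search(query))
    if has_pii:
        # Log the PII type only, never the matched text.
        audit_log.info("pii_detected type=phone session=%s", session_id)
    response = answer(query)  # the query is processed either way
    if not has_pii:
        cache.entries[query] = response  # only PII-free exchanges are cached
    return response
```

The key design point is that detection changes what is stored, not whether the user gets an answer.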
Layer 2: content-side PII masking
Status: Disabled for ZOL (config: pii_mask_document_content=false).
The ingestion pipeline (backend/app/services/processing_service.py) includes a PII-masking capability that can detect and redact PII from ingested document content. For the ZOL deployment, this layer is intentionally disabled.
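As a rough illustration of how such a switch can gate masking at ingestion time, consider the sketch below. Only the flag name pii_mask_document_content comes from the configuration quoted above; the `IngestionSettings` class, the `prepare_document_text` function, and the single email pattern are assumptions for this example and do not reflect the actual contents of processing_service.py.

```python
import re
from dataclasses import dataclass

# One pattern is enough to show the gating; a real masker would cover more types.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class IngestionSettings:
    # Mirrors the documented flag; the settings class itself is hypothetical.
    pii_mask_document_content: bool = False


def prepare_document_text(text: str, settings: IngestionSettings) -> str:
    """Return text unchanged unless content masking is switched on."""
    if not settings.pii_mask_document_content:
        return text
    return _EMAIL.sub("[REDACTED]", text)
```

With the default settings the department address survives ingestion, so a question about the billing email remains answerable.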
Rationale
Hospital public content is intended to contain contact information:
- Department emails: facturatie@zol.be, onthaal@zol.be
- Phone numbers: 089 32 50 50 (main switchboard), department-specific lines
- Doctor names: published on the public website
Masking this information would render the search system unable to answer fundamental questions like "What is the phone number for the Cardiology department?" or "How do I email the billing department?".
This is not a security oversight but a deliberate architectural decision documented as part of the GDPR Art. 25 data protection by design analysis. The ZOL content corpus consists exclusively of publicly published hospital information: there are no patient records, no employee personal data, and no confidential records in the search index. All content is sourced from the hospital's public website and published brochures.
Trade-offs
| Alternative considered | Why rejected |
|---|---|
| Mask everything by default | Renders the service useless for its primary task. "What is the phone number for X?" — the most common navigational question — would always return [REDACTED]. The privacy benefit is zero (the content is public), the utility loss is catastrophic. |
| Mask only in user-uploaded content | The hospital content corpus is not user-uploaded; it is curated by the hospital. The capability exists in the code but the trigger condition (user-uploaded content with possible non-public PII) does not currently occur. |
| Hash-then-mask | Adds preprocessing cost and creates a synthetic "deidentified" surface that still leaks structure. Standard de-identification literature (HIPAA Safe Harbor) treats this as a medium-confidence transformation; for our use case, "this content is public" is a stronger guarantee than "this content has been hashed." |
When masking would be appropriate
Content-side PII masking is built and tested but disabled for the current ZOL deployment. The capability would activate if the ingestion pipeline were extended to user-uploaded content (a future feature explicitly out of scope).
Layer 3: voice-side redaction
Status: Always active. Implementation: backend/app/services/voice/voice_pii_redaction.py.
The voice channel introduces a different threat surface: callers commonly speak phone numbers, dates of birth, and self-introduce by name as part of normal conversation. Without intervention, those utterances would surface verbatim in structured logs that are designed for engineering observability rather than personal-data retention. The voice-side redaction module strips PII patterns before structured-log emission, satisfying GDPR Art. 5(1)(c) data minimisation at the log boundary.
Patterns covered
| Pattern class | Coverage | Example match |
|---|---|---|
| Belgian phone (international form) | +32 ... with various separators | +32 89 80 80 80 |
| Belgian phone (domestic form) | 0XX ... | 0473 12 34 56, 089/80.80.80 |
| Belgian phone (compact) | 9–10 digit run | 089808080 |
| International phone fallback | 8+ digit run with separators | 123 456 7890 |
| Date of birth | DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY (4-digit year required) | 15/01/1985 |
| Self-introduction names | Trigger phrase + 1–3 capitalised tokens | "ik ben Anna Verstraeten" → "ik ben [REDACTED:name]" |
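A minimal sketch of redaction along these lines is shown below. It is not the contents of voice_pii_redaction.py: the regexes, the `redact` function, the trigger-phrase list, and the `[REDACTED:dob]` and `[REDACTED:phone]` labels are assumptions (only `[REDACTED:name]` appears in the table), and the international fallback pattern is omitted.

```python
import re

# Order matters: dates go first so their digits are not consumed by a phone rule.
_RULES = [
    ("dob", re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{4}\b")),
    ("phone", re.compile(r"\+32(?:[\s./-]?\d){8,9}\b")),      # international form
    ("phone", re.compile(r"\b0\d(?:[\s./-]?\d){7,8}\b")),     # domestic and compact
]

# Trigger phrases are examples; the deployed list is not given in this document.
# The trigger is matched case-insensitively, the name tokens must be capitalised.
_NAME = re.compile(
    r"\b((?i:ik ben|mijn naam is|my name is))\s+"
    r"([A-Z][\w'-]+(?:\s+[A-Z][\w'-]+){0,2})"
)


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Return the redacted text plus a count of redactions per pattern class."""
    counts: dict[str, int] = {}
    for label, pattern in _RULES:
        text, n = pattern.subn(f"[REDACTED:{label}]", text)
        if n:
            counts[label] = counts.get(label, 0) + n
    # Keep the trigger phrase, replace only the name tokens that follow it.
    text, n = _NAME.subn(lambda m: f"{m.group(1)} [REDACTED:name]", text)
    if n:
        counts["name"] = n
    return text, counts
```

The per-class counts are the kind of PII-free metadata that can be logged in place of the transcript itself.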
Pseudonymisation for audit-trail correlation
The module also exposes hash_for_audit(), which returns a salted, truncated SHA-256 digest of the redacted text. The hash is used in audit-log lines where the operator needs to correlate turns across pipeline stages without retaining plaintext. This is pseudonymisation as defined in GDPR Art. 4(5): "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information". The hash is one-way (SHA-256), salted with a constant (the salt supports correlation, not security), and truncated to 16 hex characters (64 bits). That length is sufficient for cross-stage correlation: accidental collisions only become likely once the number of distinct hashed values approaches 2^32 (about four billion), far beyond the volume of a single deployment's logs.
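A function with this behaviour can be sketched in a few lines. Only the name hash_for_audit() and the salted, truncated SHA-256 approach come from the description above; the signature, the `length` parameter, and the salt value are assumptions for illustration.

```python
import hashlib

# Assumed constant; the real salt value is not given in this document.
_AUDIT_SALT = b"example-audit-salt"


def hash_for_audit(redacted_text: str, length: int = 16) -> str:
    """Return the first `length` hex characters of SHA-256(salt + text)."""
    digest = hashlib.sha256(_AUDIT_SALT + redacted_text.encode("utf-8")).hexdigest()
    return digest[:length]
```

Because the salt is constant, the same text always yields the same token, which is what makes cross-stage correlation possible. It also means the token is a pseudonym rather than an anonymised value.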
What is intentionally NOT covered
| Pattern | Why not |
|---|---|
| Rijksregisternummer | Would require 11-digit + checksum logic; fail-open beats false-positive blocking on legitimate utterances containing numbers (such as "I have 11 questions about ..."). Logged in structured fields, not free text. |
| Email addresses | Callers do not speak email addresses over voice; pattern would never trigger. |
| Medical diagnoses | Protected by domain disclosure (the system never claims a diagnosis), not by redaction. Redacting the word cardiologie would break the search use case. |
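For context on the first row, the checksum logic that the voice module deliberately omits would look roughly like this. The function name is hypothetical; the rule itself is the standard modulo-97 check, where the last two digits equal 97 minus the first nine digits modulo 97, with a leading "2" prepended to those nine digits for people born in or after 2000.

```python
import re


def is_valid_rijksregisternummer(candidate: str) -> bool:
    """Check the 11-digit shape and the modulo-97 check digits."""
    digits = re.sub(r"[.\-\s]", "", candidate)
    if not re.fullmatch(r"\d{11}", digits):
        return False
    body, check = digits[:9], int(digits[9:])
    # Born before 2000: plain body. Born in or after 2000: body prefixed with "2".
    return any(97 - int(prefix + body) % 97 == check for prefix in ("", "2"))
```

Applying the checksum narrows matches to well-formed numbers, but it adds a second validation step to every 11-digit run in a transcript.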
Audit trail (GDPR Art. 30)
All PII-related events are logged to a structured audit trail per Art. 30 records of processing activities:
| Event | Logged data | Retention |
|---|---|---|
| PII detected in query | PII type, timestamp, session ID | 90 days (see Data Retention Policy) |
| PII-containing query processed | Query hash (not content), response status | 90 days |
| Voice transcript redaction | Pattern class matched, redaction count | Permanent (engineering-only metadata, no PII) |
| GDPR Art. 17 deletion | User ID, deletion counts per table | Permanent (compliance evidence) |
The audit trail supports compliance reporting and enables the hospital's privacy officer to monitor PII-exposure patterns without accessing the PII data itself.
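The first two rows of the table could be emitted as structured JSON lines along the following lines. The field names and both function names are assumptions for this sketch; the point it illustrates is that the events carry a type, a session, a timestamp, and a digest, but never the raw query.

```python
import hashlib
import json
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def pii_detected_event(pii_type: str, session_id: str) -> str:
    """Audit line for 'PII detected in query': type, timestamp, session ID."""
    return json.dumps(
        {"event": "pii_detected", "pii_type": pii_type,
         "session_id": session_id, "timestamp": _now()},
        sort_keys=True,
    )


def query_processed_event(query: str, response_status: str) -> str:
    """Audit line for 'PII-containing query processed': hash and status only."""
    return json.dumps(
        {"event": "pii_query_processed",
         # The digest stands in for the query; the raw text is never written.
         "query_hash": hashlib.sha256(query.encode("utf-8")).hexdigest(),
         "response_status": response_status, "timestamp": _now()},
        sort_keys=True,
    )
```

A privacy officer reading these lines can count and trend PII occurrences without ever seeing the underlying values.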
GDPR alignment
The PII-protection strategy aligns with the relevant GDPR principles:
| GDPR principle | Article | Implementation |
|---|---|---|
| Lawfulness, fairness, transparency | Art. 5(1)(a) | Privacy notice references this strategy; disclosure that the system processes user queries |
| Purpose limitation | Art. 5(1)(b) | PII in queries is used only for response generation, never for secondary purposes |
| Data minimisation | Art. 5(1)(c) | Semantic cache excludes PII; voice logs redact PII before emission |
| Storage limitation | Art. 5(1)(e) | Audit logs retained per the Data Retention Policy |
| Integrity and confidentiality | Art. 5(1)(f) | PII logging uses structured format without raw content; voice redaction; TLS in transit |
| Accountability | Art. 5(2) | Audit trail provides compliance evidence; this document is itself an accountability artifact |
| Pseudonymisation | Art. 4(5), Art. 25 | hash_for_audit() SHA-256 correlation hashing; voice transcript redaction |
| Right to erasure | Art. 17 | DELETE /api/v1/gdpr/users/{user_id}/data cascades through all PII-touching tables (backend/app/api/gdpr.py) |
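The erasure cascade in the last row, together with the per-table deletion counts recorded in the audit trail, can be sketched with an in-memory SQLite database. The table names, the `user_id` column, and the `delete_user_data` function are hypothetical; the actual schema and the logic in backend/app/api/gdpr.py are not shown in this document.

```python
import sqlite3

# Hypothetical table names; the real list of PII-touching tables is not given here.
PII_TABLES = ("chat_messages", "sessions", "feedback")


def delete_user_data(conn: sqlite3.Connection, user_id: str) -> dict[str, int]:
    """Delete the user's rows from every listed table and return counts per table."""
    counts = {}
    with conn:  # one transaction: commit if every delete succeeds, roll back otherwise
        for table in PII_TABLES:
            # Table names come from the fixed tuple above, never from user input.
            cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
            counts[table] = cur.rowcount
    return counts
```

Running all deletes in a single transaction avoids a half-erased state, and the returned counts are compliance evidence that contains no personal data.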
Comparative regulatory mapping
| Aspect | EU GDPR | U.S. HIPAA |
|---|---|---|
| Identifier scope | Art. 4(1): "any information relating to an identified or identifiable natural person" (open-ended) | HIPAA Safe Harbor: explicit list of 18 identifiers |
| Pseudonymisation | Art. 4(5): defined; named as an example safeguard in Art. 25 and Art. 32 rather than mandated outright | "Limited Data Set" allows specified PHI; full Safe Harbor de-identifies all 18 |
| Sector applicability | Cross-sector | Healthcare-only (covered entities + business associates) |
| Sanction model | Art. 83: up to EUR 20 million or 4 % of total worldwide annual turnover, whichever is higher | OCR civil-monetary penalties tiered by violation severity |
The ZOL system is GDPR-primary; the HIPAA Safe Harbor list is referenced as a complementary identifier inventory because the 18 categories overlap substantially with European supervisory-authority guidance on identifier examples.
See GDPR (Regulation (EU) 2016/679) for canonical text.
References
- Regulation (EU) 2016/679 (General Data Protection Regulation), Articles 4, 5, 9, 17, 25, 28, 30, 32, 35, 83.
- @hipaa_safe_harbor — HIPAA Safe Harbor de-identification methodology (U.S. analogue).
- @itu_e164 — ITU-T Recommendation E.164 international phone numbering plan.
- CJEU C-582/14 Patrick Breyer v Bundesrepublik Deutschland (2016) — IP addresses are personal data when combined with means likely to be used to identify the data subject.
- Belgian Royal Decree of 8 January 1973 — Rijksregisternummer format.
- Data Retention Policy — retention durations for all PII-touching data classes.
- DPIA — Art. 35 risk assessment.