PII Protection

Healthcare data is among the most sensitive categories of personal information under EU law: GDPR Art. 9 elevates "data concerning health" to special-category status with stricter processing conditions, and the Belgian Federal Public Service Public Health enforces sector-specific privacy obligations on top of GDPR. The ZOL Intelligent Search system implements a nuanced PII-protection strategy that balances those obligations with the practical reality that hospital public information is intended to contain contact data.

Three-Layer PII Architecture

Layer 1: HTTP request PII detection

Status: Always active.

Every incoming user query is scanned for PII patterns before processing. This layer serves an audit and monitoring purpose rather than a blocking function. When PII is detected in a user query:

The PII type and occurrence are logged to the audit trail
The query is processed normally (the system does not reject PII-containing queries)
The PII data is not retained in the semantic-query cache (app.semantic_cache)

Detection patterns

PII type	Pattern	Example	Regulatory anchor
Email	Standard email regex	`patient@example.com`	GDPR Art. 4(1)
Phone (Belgian)	`+32` / `04xx` / `089/...` formats	`089 32 50 50`, `+32 89 325050`	GDPR Art. 4(1); the `+32` international form follows E.164 (@itu_e164)
Rijksregisternummer	Belgian national ID, 11-digit checksum format	`85.01.15-123.45`	Belgian Royal Decree of 8 January 1973
IBAN	International bank-account number	`BE68 5390 0754 7034`	ISO 13616
Credit card	Major card-number patterns	`4111 1111 1111 1111`	PCI-DSS handling not in scope (no card processing)
IP address	IPv4 / IPv6	`x.x.x.x`	GDPR Art. 4(1) (CJEU C-582/14, Breyer)
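
As an illustration of how part of the table above could translate into code, here is a minimal Python sketch. The regular expressions, the `PII_PATTERNS` dictionary, and the function names `detect_pii_types` and `rrn_checksum_ok` are assumptions made for this example; the deployed patterns are not reproduced in this document. The checksum helper uses the standard modulo-97 rule for Belgian national numbers.

```python
import re

# Illustrative, deliberately loose patterns (assumptions, not the production regexes).
# Layer 1 only logs, so an occasional overlapping match costs little.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone_be": re.compile(
        r"(?<!\d)(?:\+32[\s./]?|0)\d{1,3}(?:[\s./]?\d{2,3}){2,3}(?!\d)"
    ),
    "rijksregisternummer": re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
}


def detect_pii_types(query: str) -> list[str]:
    """Return the sorted names of all PII types whose pattern matches the query."""
    return sorted(
        name for name, pattern in PII_PATTERNS.items() if pattern.search(query)
    )


def rrn_checksum_ok(candidate: str) -> bool:
    """Check the two trailing control digits of an 11-digit national number."""
    digits = re.sub(r"\D", "", candidate)
    if len(digits) != 11:
        return False
    base, check = digits[:9], int(digits[9:])
    # Births before 2000 use the nine leading digits as-is;
    # births from 2000 onward prefix them with "2" before taking modulo 97.
    return check in (97 - int(base) % 97, 97 - int("2" + base) % 97)
```

Only the type names returned by `detect_pii_types` would need to reach the audit trail. A format match alone does not imply a valid checksum, which is why the checksum test is kept as a separate step.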

Why not block PII-containing queries?

A patient might legitimately include PII in a query: "I'm looking for Dr. Peeters, my phone is 0489 12 34 56, can you have them call me?" Blocking such queries would degrade user experience without commensurate privacy benefit. Instead, the system:

Processes the query using only the semantically relevant parts;
Logs the PII occurrence for compliance monitoring (GDPR Art. 30 records of processing activities);
Excludes responses to PII-containing queries from the semantic cache so the PII is not embedded into a long-lived store.

This design follows the GDPR Art. 25 data protection by design principle: the processing is configured to minimise personal-data retention rather than to refuse the legitimate request.
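
The detect, log, and skip-cache behaviour described above can be sketched roughly as follows. This is a simplified model under stated assumptions: `handle_query`, the `pii_audit` logger name, and the plain dictionary standing in for `app.semantic_cache` are hypothetical, and a real semantic cache would be keyed on embeddings rather than exact strings.

```python
import logging
from typing import Callable

audit_log = logging.getLogger("pii_audit")  # hypothetical logger name


def handle_query(
    query: str,
    detect: Callable[[str], list[str]],
    answer: Callable[[str], str],
    cache: dict[str, str],
) -> str:
    """Answer a query, audit any PII types found, and cache only PII-free queries."""
    pii_types = detect(query)
    if pii_types:
        # Log the PII types only, never the matched text or the query itself.
        audit_log.info("pii_detected types=%s", ",".join(pii_types))
    elif query in cache:
        return cache[query]

    response = answer(query)  # the query is processed either way, never rejected
    if not pii_types:
        cache[query] = response  # PII-containing queries never enter the cache
    return response
```

The point of the sketch is the ordering: detection runs first, and the cache is neither read nor written for a query that contained PII.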

Layer 2: content-side PII masking

Status: Disabled for ZOL (config: pii_mask_document_content=false).

The ingestion pipeline (backend/app/services/processing_service.py) includes a PII-masking capability that can detect and redact PII from ingested document content. For the ZOL deployment, this layer is intentionally disabled.

Rationale

Hospital public content is intended to contain contact information:

Department emails: facturatie@zol.be, onthaal@zol.be
Phone numbers: 089 32 50 50 (main switchboard), department-specific lines
Doctor names: published on the public website

Masking this information would render the search system unable to answer fundamental questions like "What is the phone number for the Cardiology department?" or "How do I email the billing department?".

Design decision

This is not a security oversight but a deliberate architectural decision documented as part of the GDPR Art. 25 data protection by design analysis. The ZOL content corpus consists exclusively of publicly published hospital information: there are no patient records, no employee personal data, and no confidential records in the search index. All content is sourced from the hospital's public website and published brochures.

Trade-offs

Alternative considered	Why rejected
Mask everything by default	Renders the service useless for its primary task. "What is the phone number for X?" — the most common navigational question — would always return [REDACTED]. The privacy benefit is zero (the content is public), the utility loss is catastrophic.
Mask only in user-uploaded content	The hospital content corpus is not user-uploaded; it is curated by the hospital. The capability exists in the code but the trigger condition (user-uploaded content with possible non-public PII) does not currently occur.
Hash-then-mask	Adds preprocessing cost and creates a synthetic "deidentified" surface that still leaks structure. Standard de-identification literature (HIPAA Safe Harbor) treats this as a medium-confidence transformation; for our use case, "this content is public" is a stronger guarantee than "this content has been hashed."

When masking would be appropriate

Content-side PII masking is built and tested but disabled for the current ZOL deployment. The capability would activate if the ingestion pipeline were extended to user-uploaded content (a future feature explicitly out of scope).
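
A configuration-gated masking step of this kind might look like the sketch below. Only the flag name `pii_mask_document_content` and the `[REDACTED]` token come from this document; the `IngestionSettings` class, the `prepare_chunk` function, and the single email pattern are assumptions chosen to keep the example short.

```python
import re
from dataclasses import dataclass

# Single illustrative pattern; a real masking step would cover more PII types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class IngestionSettings:
    # Mirrors the documented flag; False matches the ZOL deployment.
    pii_mask_document_content: bool = False


def prepare_chunk(text: str, settings: IngestionSettings) -> str:
    """Return document text unchanged unless content-side masking is enabled."""
    if not settings.pii_mask_document_content:
        return text
    return EMAIL.sub("[REDACTED]", text)
```

With the default settings, `prepare_chunk("Mail facturatie@zol.be", IngestionSettings())` returns the text untouched, which is what lets the search answer contact questions.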

Layer 3: voice-side redaction

Status: Always active. Implementation: backend/app/services/voice/voice_pii_redaction.py.

The voice channel introduces a different threat surface: callers commonly speak phone numbers, dates of birth, and self-introduce by name as part of normal conversation. Without intervention, those utterances would surface verbatim in structured logs that are designed for engineering observability rather than personal-data retention. The voice-side redaction module strips PII patterns before structured-log emission, satisfying GDPR Art. 5(1)(c) data minimisation at the log boundary.

Patterns covered

Pattern class	Coverage	Example match
Belgian phone (international form)	`+32 ...` with various separators	`+32 89 80 80 80`
Belgian phone (domestic form)	`0XX ...`	`0473 12 34 56`, `089/80.80.80`
Belgian phone (compact)	9–10 digit run	`089808080`
International phone fallback	8+ digit run with separators	`123 456 7890`
Date of birth	DD/MM/YYYY, DD-MM-YYYY, DD.MM.YYYY (4-digit year required)	`15/01/1985`
Self-introduction names	Trigger phrase + 1–3 capitalised tokens	"ik ben Anna Verstraeten" → "ik ben [REDACTED:name]"
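
To make the pattern classes above concrete, the following Python sketch redacts a subset of them (Belgian phone numbers, dates of birth, and self-introduction names). It is not the content of `voice_pii_redaction.py`: the regular expressions, the trigger phrases, the `redact_transcript` name, and the `[REDACTED:phone]` and `[REDACTED:dob]` tokens are assumptions modelled on the documented `[REDACTED:name]` form.

```python
import re

# Dates are handled first so that phone matching never sees their digits.
DOB = re.compile(r"\b\d{2}[/.-]\d{2}[/.-]\d{4}\b")
# "+32" or a leading "0", followed by 8 or 9 digits with optional separators.
PHONE = re.compile(r"(?<!\d)(?:\+32|0)(?:[\s./-]?\d){8,9}(?!\d)")
# Case-insensitive trigger phrase, then 1 to 3 capitalised tokens.
NAME = re.compile(
    r"\b((?i:ik ben|mijn naam is))\s+([A-Z][\w'-]+(?:\s+[A-Z][\w'-]+){0,2})"
)

RULES = [
    ("dob", DOB, "[REDACTED:dob]"),
    ("phone", PHONE, "[REDACTED:phone]"),
    ("name", NAME, r"\1 [REDACTED:name]"),  # keep the trigger phrase, drop the name
]


def redact_transcript(text: str) -> tuple[str, dict[str, int]]:
    """Return the redacted text plus a count of redactions per pattern class."""
    counts: dict[str, int] = {}
    for label, pattern, replacement in RULES:
        text, hits = pattern.subn(replacement, text)
        if hits:
            counts[label] = hits
    return text, counts
```

The per-class counts are returned alongside the text because they are the only part of a redaction that needs to be logged; the matched values themselves are discarded.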

Pseudonymisation for audit-trail correlation

The module also exposes hash_for_audit(), which returns a salted, truncated SHA-256 digest of the redacted text. The hash appears in audit-log lines where the operator needs to correlate turns across pipeline stages without retaining plaintext. This is pseudonymisation as defined in GDPR Art. 4(5): "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information". The hash is one-way (SHA-256), salted with a constant (the salt serves correlation, not secrecy), and truncated to 16 hex characters (64 bits). That length is sufficient for cross-stage correlation: the number of turns being correlated stays far below the roughly 2^32 inputs at which accidental collisions become likely.
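
A salted, truncated SHA-256 digest of this kind can be sketched in a few lines of Python. The function below is named `hash_for_audit_sketch` to make clear that it is an illustration rather than the module's own `hash_for_audit()`, and the salt value is a placeholder assumption.

```python
import hashlib

# Placeholder constant salt (assumption); it separates this hash domain from
# other SHA-256 uses and is not treated as a secret.
_AUDIT_SALT = b"voice-audit-correlation"


def hash_for_audit_sketch(redacted_text: str) -> str:
    """Return the first 16 hex characters (64 bits) of a salted SHA-256 digest."""
    digest = hashlib.sha256(_AUDIT_SALT + redacted_text.encode("utf-8")).hexdigest()
    return digest[:16]
```

Because the salt is constant, the same redacted turn yields the same token at every pipeline stage, which is what makes cross-stage correlation possible.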

What is intentionally NOT covered

Pattern	Why not
Rijksregisternummer	Would require 11-digit matching plus checksum validation. The module fails open here: missing a spoken national number is preferred over false-positive redaction of legitimate utterances containing numbers (such as "I have 11 questions about ..."). Logged in structured fields, not free text.
Email addresses	Callers do not speak email addresses over voice; pattern would never trigger.
Medical diagnoses	Protected by domain disclosure (the system never claims a diagnosis), not by redaction. Redacting the word cardiologie would break the search use case.

Audit trail (GDPR Art. 30)

All PII-related events are logged to a structured audit trail per Art. 30 records of processing activities:

Event	Logged data	Retention
PII detected in query	PII type, timestamp, session ID	90 days (see Data Retention Policy)
PII-containing query processed	Query hash (not content), response status	90 days
Voice transcript redaction	Pattern class matched, redaction count	Permanent (engineering-only metadata, no PII)
GDPR Art. 17 deletion	User ID, deletion counts per table	Permanent (compliance evidence)

The audit trail supports compliance reporting and enables the hospital's privacy officer to monitor PII-exposure patterns without accessing the PII data itself.
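
One way to picture an audit record that carries the PII type, timestamp, session ID, and a query hash without the query content is the sketch below. The `PiiAuditEvent` class, its field names, the event label, and the use of an unsalted SHA-256 over the query are all assumptions for illustration; the actual log schema is not shown in this document.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PiiAuditEvent:
    event: str
    pii_types: tuple[str, ...]
    session_id: str
    query_hash: str  # digest of the query, never the query text
    timestamp: str


def build_audit_event(
    query: str,
    pii_types: list[str],
    session_id: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialise one audit record as a JSON line that omits the raw query."""
    moment = now or datetime.now(timezone.utc)
    record = PiiAuditEvent(
        event="pii_detected_in_query",  # hypothetical event label
        pii_types=tuple(sorted(pii_types)),
        session_id=session_id,
        query_hash=hashlib.sha256(query.encode("utf-8")).hexdigest(),
        timestamp=moment.isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```

A record built this way lets a reviewer count and correlate events by type and session while the spoken or typed PII stays out of the log line.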

GDPR alignment

The PII-protection strategy aligns with the relevant GDPR principles:

GDPR principle	Article	Implementation
Lawfulness, fairness, transparency	Art. 5(1)(a)	Privacy notice references this strategy; disclosure that the system processes user queries
Purpose limitation	Art. 5(1)(b)	PII in queries is used only for response generation, never for secondary purposes
Data minimisation	Art. 5(1)(c)	Semantic cache excludes PII; voice logs redact PII before emission
Storage limitation	Art. 5(1)(e)	Audit logs retained per the Data Retention Policy
Integrity and confidentiality	Art. 5(1)(f)	PII logging uses structured format without raw content; voice redaction; TLS in transit
Accountability	Art. 5(2)	Audit trail provides compliance evidence; this document is itself an accountability artifact
Pseudonymisation	Art. 4(5), Art. 25	`hash_for_audit()` SHA-256 correlation hashing; voice transcript redaction
Right to erasure	Art. 17	`DELETE /api/v1/gdpr/users/{user_id}/data` cascades through all PII-touching tables (`backend/app/api/gdpr.py`)
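
The erasure cascade in the last row can be pictured with the sketch below, which deletes a user's rows table by table inside one transaction and returns the per-table counts that the audit trail records. It uses an in-memory style `sqlite3` connection purely for illustration; the table names, the `user_id` column, and the `erase_user_data` function are hypothetical and do not describe `backend/app/api/gdpr.py`.

```python
import sqlite3

# Hypothetical list of PII-touching tables; the real schema is not shown here.
PII_TABLES = ("chat_messages", "search_history")


def erase_user_data(conn: sqlite3.Connection, user_id: str) -> dict[str, int]:
    """Delete all rows for one user and return the deletion count per table."""
    counts: dict[str, int] = {}
    with conn:  # commits on success, rolls back if any delete fails
        for table in PII_TABLES:
            # Table names come from the fixed tuple above, never from user input.
            cursor = conn.execute(
                f"DELETE FROM {table} WHERE user_id = ?", (user_id,)
            )
            counts[table] = cursor.rowcount
    return counts
```

Running all deletes in a single transaction means a failure part-way through leaves no table half-erased, and the returned counts can serve as compliance evidence without retaining the deleted data.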

Comparative regulatory mapping

Aspect	EU GDPR	U.S. HIPAA
Identifier scope	Art. 4(1): "any information relating to an identified or identifiable natural person" (open-ended)	HIPAA Safe Harbor: explicit list of 18 identifiers
Pseudonymisation	Art. 4(5): defined there, and named as an example safeguard in Art. 25 and Art. 32	"Limited Data Set" allows specified PHI; full Safe Harbor de-identifies all 18
Sector applicability	Cross-sector	Healthcare-only (covered entities + business associates)
Sanction model	Art. 83: up to EUR 20 million or 4 % of total worldwide annual turnover, whichever is higher	OCR civil-monetary penalties tiered by violation severity

The ZOL system is GDPR-primary; the HIPAA Safe Harbor list is referenced as a complementary identifier inventory because the 18 categories overlap substantially with European supervisory-authority guidance on identifier examples.

See GDPR (Regulation (EU) 2016/679) for canonical text.

References

Regulation (EU) 2016/679 — General Data Protection Regulation, Articles 4, 5, 17, 25, 28, 30, 32.
@hipaa_safe_harbor — HIPAA Safe Harbor de-identification methodology (U.S. analogue).
@itu_e164 — ITU-T Recommendation E.164 international phone numbering plan.
CJEU C-582/14 Patrick Breyer v Bundesrepublik Deutschland (2016) — IP addresses are personal data when combined with means likely to be used to identify the data subject.
Belgian Royal Decree of 8 January 1973 — Rijksregisternummer format.
Data Retention Policy — retention durations for all PII-touching data classes.
DPIA — Art. 35 risk assessment.
