ADR-0036: Adversarial Input Hardening
Status: Accepted (February 2026)
Context
Universal adversarial suffix attacks (Zou et al., 2023) -- known as GCG attacks -- can bypass LLM safety alignment by appending optimized gibberish token sequences to harmful queries. These suffixes:
- Reach up to 84% attack success rate on GPT-3.5 and GPT-4 in the transfer setting reported by Zou et al. (2023)
- Transfer across models -- a suffix optimized on one model works on others
- Are high-perplexity gibberish -- no meaningful English/Dutch words
- Are undetectable by regex-based injection filters
For a hospital search system where zero medical advice incidents is the KPI, this represents a critical threat vector. The existing regex-only injection detection (10 patterns) cannot catch GCG-style attacks.
Decision
Implement four hardening measures:
H1: Perplexity-Based Anomaly Detector
A lightweight statistical check (detect_anomalous_input()) that approximates perplexity without running a language model and flags GCG suffixes in under 5ms, using four signals:
- Word-in-dictionary ratio: Normal Dutch >60%, GCG gibberish <20%
- Character-level entropy: Normal Dutch ~3.5-4.5 bits, GCG >5.5 bits
- Consecutive non-alphabetic characters: Flag sequences of 3+ special characters
- Non-word token ratio: Tokens matching [^a-zA-ZÀ-ÿ0-9\s]{3,}
Uses existing taxonomy vocabulary + a compact Dutch common word list (~5K words, loaded as frozenset at startup).
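The signals above can be sketched as follows. This is a minimal illustration, not the production detect_anomalous_input(): the tiny KNOWN_WORDS set stands in for the real ~5K-word list and taxonomy vocabulary, the helper names char_entropy and dictionary_ratio are hypothetical, and returning a list of triggered signals (rather than a single boolean) is an assumption, since the ADR does not say how the signals are combined.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the ~5K-word Dutch list plus taxonomy vocabulary.
KNOWN_WORDS = frozenset(
    {"ik", "zoek", "een", "afspraak", "bij", "de", "cardiologie", "voor", "mijn", "kind"}
)

# Runs of 3+ characters that are neither letters, digits, nor whitespace.
SPECIAL_RUN = re.compile(r"[^a-zA-ZÀ-ÿ0-9\s]{3,}")
WORD = re.compile(r"[a-zA-ZÀ-ÿ]+")


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dictionary_ratio(text: str, vocabulary: frozenset = KNOWN_WORDS) -> float:
    """Fraction of alphabetic tokens found in the vocabulary (0.0 if there are none)."""
    words = [w.lower() for w in WORD.findall(text)]
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def detect_anomalous_input(
    text: str,
    *,
    min_dict_ratio: float = 0.20,  # threshold taken from the "GCG gibberish <20%" figure
    max_entropy: float = 5.5,      # threshold taken from the "GCG >5.5 bits" figure
    vocabulary: frozenset = KNOWN_WORDS,
) -> list:
    """Return the names of the signals that fired; an empty list means no anomaly."""
    if not text.strip():
        return []
    reasons = []
    if dictionary_ratio(text, vocabulary) < min_dict_ratio:
        reasons.append("low_dictionary_ratio")
    if char_entropy(text) > max_entropy:
        reasons.append("high_entropy")
    if SPECIAL_RUN.search(text):
        reasons.append("special_char_run")
    return reasons
```

A caller could block on any non-empty result or require two signals to agree; which policy fits depends on the false positive budget and is left open here.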
H2: LLM-as-Judge Safety Validation
Enable the existing (but previously disabled) LLM-as-judge safety layer by default. The judge evaluates whether generated responses contain medical advice that regex patterns miss. Cost-optimized by skipping safe intents and enforcing a 3-second timeout.
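The skip-and-timeout behaviour could look roughly like the sketch below. The intent names in SAFE_INTENTS, the judge callable, and the function name judge_response are all hypothetical. The ADR does not state what happens when the judge times out; this sketch assumes the response is withheld (fail closed), which is a design choice rather than documented behaviour.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical intents considered safe enough to skip the judge entirely.
SAFE_INTENTS = frozenset({"navigation", "opening_hours"})


async def judge_response(
    intent: str,
    response: str,
    judge: Callable[[str], Awaitable[bool]],  # hypothetical: returns True when the response is safe
    timeout_s: float = 3.0,
) -> bool:
    """Return True if the response may be shown to the user."""
    if intent in SAFE_INTENTS:
        # Cost optimisation: no LLM call for intents that cannot yield medical advice.
        return True
    try:
        return await asyncio.wait_for(judge(response), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumption: fail closed when the judge does not answer in time.
        return False
```

Failing closed fits the zero-incident KPI but trades availability for safety; failing open would do the reverse.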
H3: Rate Limiter In-Memory Fallback + Burst Protection
- In-memory fallback: dict[str, deque[float]] sliding window when Redis is unavailable (capped at 10K identifiers)
- Burst protection: Max 5 requests per 10-second window per IP (in addition to 60/hour)
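A minimal sketch of such a fallback, using the limits stated above. The class name InMemoryRateLimiter and its method allow() are hypothetical, and evicting the oldest-inserted identifier once the cap is reached is an assumption: the ADR only says the map is capped at 10K identifiers.

```python
import time
from collections import deque
from typing import Optional


class InMemoryRateLimiter:
    """Per-process sliding-window limiter with an hourly limit and a burst limit."""

    def __init__(
        self,
        burst_limit: int = 5,
        burst_window: float = 10.0,
        hourly_limit: int = 60,
        hourly_window: float = 3600.0,
        max_identifiers: int = 10_000,
    ) -> None:
        self.burst_limit = burst_limit
        self.burst_window = burst_window
        self.hourly_limit = hourly_limit
        self.hourly_window = hourly_window
        self.max_identifiers = max_identifiers
        self._hits: dict[str, deque[float]] = {}

    def allow(self, identifier: str, now: Optional[float] = None) -> bool:
        """Record a request and return True, or return False if a limit is exceeded."""
        now = time.monotonic() if now is None else now
        window = self._hits.get(identifier)
        if window is None:
            if len(self._hits) >= self.max_identifiers:
                # Assumption: evict the oldest-inserted identifier to respect the cap.
                self._hits.pop(next(iter(self._hits)))
            window = self._hits[identifier] = deque()

        # Drop timestamps that have left the hourly window.
        while window and now - window[0] >= self.hourly_window:
            window.popleft()

        if len(window) >= self.hourly_limit:
            return False
        recent = sum(1 for t in window if now - t < self.burst_window)
        if recent >= self.burst_limit:
            return False

        window.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps hammering the endpoint does not extend its own lockout; recording them instead would be a stricter variant.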
H4: Streaming Retraction Server-Side Enforcement
When the safety layer detects unsafe content during streaming, a retraction message replaces the streamed content. The WebSocket is closed with code 4001 (safety violation) after sending the retraction.
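The enforcement flow might be sketched like this. The StreamSocket protocol, the message shapes ({"type": "chunk"} and {"type": "retraction"}), the retraction text, and the is_unsafe callback are all hypothetical; only the close code 4001 comes from this ADR. The sketch checks the accumulated text before forwarding each chunk, so the offending chunk itself is never sent.

```python
import asyncio
from typing import AsyncIterator, Callable, Protocol

SAFETY_VIOLATION_CLOSE_CODE = 4001  # application-defined close code from this ADR
RETRACTION_TEXT = "This answer has been withdrawn."  # hypothetical wording


class StreamSocket(Protocol):
    """Hypothetical minimal interface of the WebSocket object used for streaming."""

    async def send_json(self, data: dict) -> None: ...
    async def close(self, code: int = 1000) -> None: ...


async def stream_with_retraction(
    ws: StreamSocket,
    chunks: AsyncIterator[str],
    is_unsafe: Callable[[str], bool],  # hypothetical safety-layer check on the text so far
) -> bool:
    """Stream chunks to the client; retract and close on the first unsafe detection.

    Returns True if the full response was streamed, False if it was retracted.
    """
    streamed = ""
    async for chunk in chunks:
        streamed += chunk
        if is_unsafe(streamed):
            # Tell the client to replace everything shown so far, then close.
            await ws.send_json({"type": "retraction", "message": RETRACTION_TEXT})
            await ws.close(code=SAFETY_VIOLATION_CLOSE_CODE)
            return False
        await ws.send_json({"type": "chunk", "text": chunk})
    return True
```

Closing the socket from the server side means the retraction does not depend on the client honouring the retraction message.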
Consequences
Positive
- GCG attacks blocked: Anomaly detector catches all tested GCG suffixes (0% false negatives on the known attack set used for testing)
- Low false positive rate: Dutch compound words (hartchirurgie, kinderpsychiatrie) pass correctly
- Defense in depth: Four independent mechanisms, each effective on its own
- Minimal latency: H1 adds under 5ms; H2 is async with 3s timeout
Negative
- H2 adds latency (~500ms) for queries producing medical-adjacent content
- H3 in-memory fallback is per-process, not distributed (acceptable for single-instance deployment)
- H1 dictionary dependency: Legitimate queries with unusual vocabulary may be flagged (configurable thresholds mitigate this)
References
- Liao, H., et al. (2024). AmpleGCG-Plus: A strong generative model of adversarial suffixes to jailbreak LLMs. arXiv preprint, arXiv:2410.22143. https://arxiv.org/abs/2410.22143
- Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043