ADR-0036: Adversarial Input Hardening

Status: Accepted (February 2026)

Context

Universal adversarial suffix attacks (Zou et al., 2023), known as GCG (Greedy Coordinate Gradient) attacks, can bypass LLM safety alignment by appending optimized gibberish token sequences to harmful queries. These suffixes:

  • Succeed at a high rate -- Zou et al. (2023) report attack success rates of up to 84% against GPT-3.5 and GPT-4
  • Transfer across models -- a suffix optimized on one model works on others
  • Are high-perplexity gibberish -- no meaningful English/Dutch words
  • Are undetectable by regex-based injection filters

For a hospital search system whose KPI is zero medical-advice incidents, this is a critical threat vector. The existing injection detection is regex-only (10 patterns) and cannot catch GCG-style attacks.

Decision

Implement four hardening measures:

H1: Perplexity-Based Anomaly Detector

A lightweight statistical check (detect_anomalous_input()) that flags GCG suffixes in under 5ms. Rather than computing true language-model perplexity, it combines four cheap proxy signals:

  1. Word-in-dictionary ratio: Normal Dutch >60% of words found, GCG gibberish <20%
  2. Character-level entropy: Normal Dutch ~3.5-4.5 bits per character, GCG >5.5 bits per character
  3. Consecutive non-alphabetic characters: Flag any run of 3+ special characters
  4. Non-word token ratio: Share of tokens matching [^a-zA-ZÀ-ÿ0-9\s]{3,}

Uses existing taxonomy vocabulary + a compact Dutch common word list (~5K words, loaded as frozenset at startup).
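The four signals above can be sketched roughly as follows. This is a minimal illustration, not the production detector: the tiny KNOWN_WORDS set stands in for the real word list, the signal names and the max_non_word_ratio threshold are assumptions (the ADR gives no cut-off for signal 4), and how the signals are combined into a final block/allow decision is left open. Compound-word handling is also omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the taxonomy vocabulary plus the ~5K-word Dutch list.
KNOWN_WORDS = frozenset(
    "ik zoek een afspraak bij de cardiologie voor mijn kind waar is het".split()
)

WORD = re.compile(r"[a-zA-ZÀ-ÿ]+")
NON_WORD_RUN = re.compile(r"[^a-zA-ZÀ-ÿ0-9\s]{3,}")  # pattern from signal 4


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def detect_anomalous_input(
    text: str,
    min_dict_ratio: float = 0.20,
    max_entropy: float = 5.5,
    max_non_word_ratio: float = 0.30,  # assumed value, not from the ADR
) -> list[str]:
    """Return the names of the signals that fired; an empty list means no anomaly."""
    tokens = text.split()
    if not tokens:
        return []
    signals: list[str] = []

    # Signal 1: share of alphabetic words found in the dictionary.
    words = WORD.findall(text.lower())
    known = sum(w in KNOWN_WORDS for w in words)
    if not words or known / len(words) < min_dict_ratio:
        signals.append("low_dictionary_ratio")

    # Signal 2: character-level entropy.
    if char_entropy(text) > max_entropy:
        signals.append("high_entropy")

    # Signal 3: any run of 3+ special characters anywhere in the input.
    if NON_WORD_RUN.search(text):
        signals.append("special_char_run")

    # Signal 4: share of whitespace-separated tokens containing such a run.
    non_word = sum(1 for t in tokens if NON_WORD_RUN.search(t))
    if non_word / len(tokens) > max_non_word_ratio:
        signals.append("high_non_word_ratio")

    return signals
```

Returning the list of fired signals rather than a single boolean keeps the thresholds individually tunable and makes false positives easier to diagnose. Note that character entropy is bounded by log2 of the input length, so the 5.5-bit threshold can only fire on inputs with at least 46 distinct characters.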

H2: LLM-as-Judge Safety Validation

Enable the existing (but previously disabled) LLM-as-judge safety layer by default. The judge evaluates whether generated responses contain medical advice that regex patterns miss. Cost-optimized by skipping safe intents and enforcing a 3-second timeout.
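A minimal sketch of the skip-and-timeout logic, using asyncio.wait_for from the standard library. The intent names in SAFE_INTENTS, the judge callable, and the function name are hypothetical. The ADR does not say what happens when the judge times out; this sketch assumes the response is withheld (fail closed), which would need to be confirmed against the real implementation.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical intent labels for which the judge is skipped.
SAFE_INTENTS = frozenset({"navigation", "opening_hours"})
JUDGE_TIMEOUT_S = 3.0


async def response_is_allowed(
    intent: str,
    response: str,
    judge: Callable[[str], Awaitable[bool]],
    timeout: float = JUDGE_TIMEOUT_S,
) -> bool:
    """Return True if the response may be shown to the user.

    `judge` is any coroutine function returning True when the text is safe.
    """
    if intent in SAFE_INTENTS:
        return True  # cost optimization: no judge call for safe intents
    try:
        return await asyncio.wait_for(judge(response), timeout=timeout)
    except asyncio.TimeoutError:
        # Assumption: fail closed when the judge does not answer in time.
        return False
```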

H3: Rate Limiter In-Memory Fallback + Burst Protection

  • In-memory fallback: dict[str, deque[float]] sliding window when Redis is unavailable (capped at 10K identifiers)
  • Burst protection: Max 5 requests per 10-second window per IP (in addition to 60/hour)
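The in-memory fallback and burst check could look roughly like the sketch below. The class and method names are hypothetical, and two behaviors are assumptions the ADR does not specify: when the 10K cap is reached the oldest-inserted identifier is evicted, and rejected requests are not recorded in the window.

```python
import time
from collections import deque
from typing import Optional


class InMemoryRateLimiter:
    """Per-process sliding-window limiter, for use when Redis is unavailable."""

    def __init__(
        self,
        burst_limit: int = 5,
        burst_window: float = 10.0,
        hourly_limit: int = 60,
        hourly_window: float = 3600.0,
        max_identifiers: int = 10_000,
    ) -> None:
        self.burst_limit = burst_limit
        self.burst_window = burst_window
        self.hourly_limit = hourly_limit
        self.hourly_window = hourly_window
        self.max_identifiers = max_identifiers
        self._hits: dict[str, deque[float]] = {}

    def allow(self, identifier: str, now: Optional[float] = None) -> bool:
        """Record and allow the request, or return False if a limit is hit."""
        now = time.monotonic() if now is None else now
        hits = self._hits.get(identifier)
        if hits is None:
            if len(self._hits) >= self.max_identifiers:
                # Assumed policy: evict the oldest-inserted identifier
                # (dicts preserve insertion order).
                self._hits.pop(next(iter(self._hits)))
            hits = self._hits[identifier] = deque()

        # Drop timestamps that have left the hourly window.
        while hits and now - hits[0] >= self.hourly_window:
            hits.popleft()

        if len(hits) >= self.hourly_limit:
            return False
        recent = sum(1 for t in hits if now - t < self.burst_window)
        if recent >= self.burst_limit:
            return False

        hits.append(now)
        return True
```

Because each deque never holds more than hourly_limit timestamps, the linear scan for the burst count stays cheap. The optional now parameter exists only to make the limiter testable with a fixed clock.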

H4: Streaming Retraction Server-Side Enforcement

When the safety layer detects unsafe content during streaming, a retraction message replaces the streamed content. The WebSocket is closed with code 4001 (safety violation) after sending the retraction.
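The server-side flow can be sketched as below. The Socket protocol is a hypothetical minimal interface rather than a specific framework's API, and the message shapes ("chunk" / "retraction") and the is_unsafe callback are assumptions for illustration. Only the close code 4001 and the send-retraction-then-close order come from this ADR.

```python
import asyncio
from typing import AsyncIterator, Callable, Protocol

SAFETY_VIOLATION_CODE = 4001  # close code named in this ADR


class Socket(Protocol):
    """Hypothetical minimal WebSocket interface used by the sketch."""

    async def send_json(self, data: dict) -> None: ...
    async def close(self, code: int) -> None: ...


async def stream_with_retraction(
    ws: Socket,
    chunks: AsyncIterator[str],
    is_unsafe: Callable[[str], bool],
    retraction_text: str,
) -> bool:
    """Stream chunks to the client; retract and close on a safety violation.

    Returns True if the full response was streamed, False if it was retracted.
    """
    streamed = ""
    async for chunk in chunks:
        streamed += chunk
        # Check the accumulated text so violations spanning chunks are caught.
        if is_unsafe(streamed):
            await ws.send_json({"type": "retraction", "message": retraction_text})
            await ws.close(code=SAFETY_VIOLATION_CODE)
            return False
        await ws.send_json({"type": "chunk", "text": chunk})
    return True
```

Closing the socket from the server means enforcement does not depend on the client honoring the retraction message: no further chunks can arrive after the close.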

Consequences

Positive

  • GCG attacks blocked: Anomaly detector catches all tested GCG suffixes (0% false negatives on the tested attack set)
  • Low false positive rate: Dutch compound words (hartchirurgie, kinderpsychiatrie) pass correctly
  • Defense in depth: 4 independent mechanisms, each effective alone
  • Minimal latency: H1 adds under 5ms; H2 is async with 3s timeout

Negative

  • H2 adds latency (~500ms) for queries producing medical-adjacent content
  • H3 in-memory fallback is per-process, not distributed (acceptable for single-instance deployment)
  • H1 dictionary dependency: Legitimate queries with unusual vocabulary may be flagged; configurable thresholds mitigate this

References

  • Liao, H., et al. (2024). AmpleGCG-Plus: A strong generative model of adversarial suffixes to jailbreak LLMs. arXiv preprint, arXiv:2410.22143. https://arxiv.org/abs/2410.22143
  • Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043