Prompt Engineering Standards

"this is not world class prompt engineering?"

"llms think in English, right? ... maybe this is why our prompts are so long"

project lead, 2026-05-31, mid-way through a chat-language bug fix

This page documents the prompt-engineering rubric ratified on 2026-05-31 for the ZOL Hospital Intelligent Search project. Like the Decision-Cost Rubric, it exists because of one specific, traceable failure: a single self-contradicting clause in a system prompt that silently caused Romanian questions to be answered in Dutch in production. The case study below is the argument.

The authoritative, copy-pasteable rubric lives in the repo at docs/prompt-engineering-standards.md. This page is the showcase — it carries the story that makes the rules stick.


TL;DR

Seven principles, P1–P7:

  1. P1 — Instruction language is English; output language is a directive. Author rules once, in English. Control answer language with ALWAYS answer in {detected_language}, never by re-writing the rules in each language.
  2. P2 — Positive framing over negative. Positive directive > contrastive WRONG/CORRECT pair > bare "NEVER do X" > a "NEVER" that quotes the forbidden output verbatim (worst — it primes what it bans).
  3. P3 — No forensic metadata in prompt strings. Incident IDs, dates, and call IDs cost tokens on every call and mean nothing to the model. They belong in git, not in the prompt.
  4. P4 — Every behavioral rule has a paired eval. A rule no test would catch regressing is debt; the test lands with the rule.
  5. P5 — No contradictory constraints in one clause. A clause that names a value and a restriction excluding it is a defect.
  6. P6 — Separate LLM-instructions from deterministic canned output. Refusals/disclaimers that bypass the model stay in localized lookup tables, not the prompt.
  7. P7 — Channel-appropriate; no cross-contamination. Chat ≠ voice; a rule that only fits one channel lives only in that channel's builder.

The case study: how a Romanian question got a Dutch answer

A user opened the ZOL chat and typed, in Romanian:

am guta, ce doctor ma ajuta? — "I have gout, which doctor helps me?"

The system answered in Dutch — a full, correct, well-cited answer about Reumatologie. Everything worked except the language. Stranger still: a different Romanian message in the same session ("raspunde pe romaneste") came back in Romanian. Two paths, two languages, same conversation.

What was actually happening

The pipeline did almost everything right. It detected the language as Romanian (ro). It even applied a cross-lingual retrieval discount so the Romanian query wouldn't be wrongly refused. The detected code reached the answer prompt intact. Then the prompt told the model this:

You speak Română (ro; nl|en|fr|it).

Read it as the model does. It names a value — Română — and, in the same breath, a restricted set that excludes it: nl|en|fr|it. Faced with a contradiction, the model resolved it the most "reasonable" way available: it fell back to the hospital's primary language, Dutch.
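
A clause of this shape can be caught mechanically. The following is a minimal sketch, assuming the clause always follows the `(code; a|b|c)` pattern quoted above; the regex and the function name `find_contradictions` are illustrative and not part of the project's code.

```python
import re

# Assumed clause shape: "(code; a|b|c)", as in "(ro; nl|en|fr|it)".
CLAUSE = re.compile(
    r"\((?P<code>[a-z]{2});\s*(?P<allowed>[a-z]{2}(?:\|[a-z]{2})*)\)"
)


def find_contradictions(prompt: str) -> list[str]:
    """Return every clause whose named code is missing from its own allowed set."""
    problems = []
    for match in CLAUSE.finditer(prompt):
        allowed = match.group("allowed").split("|")
        if match.group("code") not in allowed:
            problems.append(match.group(0))
    return problems


if __name__ == "__main__":
    print(find_contradictions("You speak Română (ro; nl|en|fr|it)."))
```

A check like this only covers the one clause shape that caused the incident; it would not detect contradictions phrased differently.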

The Romanian refusal, meanwhile, came from a completely different mechanism — a static, pre-translated lookup table (get_blocked_message("off_topic", "ro")) that never touches the LLM and simply returned its Romanian entry. So the refusal honored ro and the generated answer did not. That split is the whole bug.
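
To make the second mechanism concrete, here is a hedged sketch of what a static lookup like `get_blocked_message` might look like. Only the call signature comes from the text above; the table contents are placeholders rather than the project's real copy, and the fallback to `nl` is an assumption.

```python
# Hypothetical table: the strings are placeholders, not real translations.
BLOCKED_MESSAGES = {
    ("off_topic", "nl"): "<nl off-topic refusal>",
    ("off_topic", "ro"): "<ro off-topic refusal>",
}

# Assumed fallback language when no entry exists for the detected code.
DEFAULT_LANGUAGE = "nl"


def get_blocked_message(reason: str, language_code: str) -> str:
    """Return a pre-translated refusal without calling the model."""
    key = (reason, language_code)
    if key in BLOCKED_MESSAGES:
        return BLOCKED_MESSAGES[key]
    return BLOCKED_MESSAGES[(reason, DEFAULT_LANGUAGE)]


if __name__ == "__main__":
    print(get_blocked_message("off_topic", "ro"))
```

Because the function is a pure dictionary lookup keyed on the detected code, it cannot be affected by anything written in the system prompt, which is consistent with the refusal coming back in Romanian while the generated answer did not.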

The fix

One clause. The contradiction (P5) became a positive directive (P1, P2):

ALWAYS write your entire answer in {detected_language_name}
({detected_language_code}) — the same language as the user's question.
The retrieved brochures are mostly in Dutch; translate the grounded facts
into {detected_language_name} rather than copying the Dutch phrasing.

Three regression tests landed with the fix (P4), pinning the contract: no nl|en|fr|it restriction for any language, ro produces a "Română" directive, and the Romanian disclaimer is present.
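
The first two of those tests could look roughly like the sketch below. The builder name `build_language_directive`, the `LANGUAGE_NAMES` table, and the test names are hypothetical; only "Română" for `ro` is confirmed by the text, and the directive wording is condensed from the fix shown above. The third test (the Romanian disclaimer) is omitted because the disclaimer text is not given here.

```python
# Hypothetical display names; only "Română" for "ro" appears in the source.
LANGUAGE_NAMES = {
    "nl": "Nederlands",
    "en": "English",
    "fr": "Français",
    "it": "Italiano",
    "ro": "Română",
}

# Wording condensed from the directive in the fix above.
DIRECTIVE_TEMPLATE = (
    "ALWAYS write your entire answer in {detected_language_name} "
    "({detected_language_code}), the same language as the user's question.\n"
    "The retrieved brochures are mostly in Dutch; translate the grounded facts "
    "into {detected_language_name} rather than copying the Dutch phrasing."
)


def build_language_directive(detected_language_code: str) -> str:
    """Render the output-language directive for one detected language code."""
    name = LANGUAGE_NAMES.get(detected_language_code, detected_language_code)
    return DIRECTIVE_TEMPLATE.format(
        detected_language_name=name,
        detected_language_code=detected_language_code,
    )


def test_no_restricted_set_for_any_language():
    # The old contradictory restriction must not reappear for any language.
    for code in LANGUAGE_NAMES:
        assert "nl|en|fr|it" not in build_language_directive(code)


def test_ro_produces_romana_directive():
    assert "Română (ro)" in build_language_directive("ro")
```

Written as plain functions with bare `assert` statements, these would be collected by pytest without further setup.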

The diagnostic signal

A contradiction like (ro; nl|en|fr|it) surviving in production is not just a typo — it is evidence that the prompt has no holistic test harness, only per-incident patches. Nothing asserted "question language → answer language," so nothing caught it. That absence is what P4 closes.


P1 — and the "LLMs think in English" question

The project lead's instinct ("llms think in English, right? ... maybe this is why our prompts are so long") is half right, and it matters which half.

LLMs do not literally think in English. Interpretability work on Claude shows a shared, language-agnostic concept space: the same internal features activate for a concept whether the prompt is Dutch, English, or Chinese, and that cross-lingual sharing grows with model scale. The kernel of truth is narrower: for models trained on English-dominant data, instruction-following and reasoning tend to be most robust in English.

The load-bearing consequence is that instruction language and output language are independent. English instructions plus an explicit "answer in {language}" directive produce fluent target-language output. That is not a workaround; it is the normal, intended mode.

So the prompts are not long because they are multilingual. The chat system prompt is one English instruction set; rendering it in nl, ro, or en produces the same ~12,200 characters. What is duplicated is elsewhere — the voice answer-style rules, hand-authored four times in nl/en/fr/it. That duplication is a maintenance tax with no model benefit, and it is the target of a later sub-project.


P2 — why "NEVER do X" underperforms

Negation is a fragile operation for an autoregressive model: to process "don't emit Dutch," it must first represent emitting Dutch. The worst form quotes the forbidden output verbatim — which literally injects those tokens into context, priming the behavior it bans (the "pink-elephant" effect).

Ranked best to worst:

| Tier | Form | Example |
|---|---|---|
| 1 (best) | Positive directive | "ALWAYS answer in the call's locked language." |
| 2 | Contrastive pair | "WRONG: seven names in a row. CORRECT: two names + 'and others'." |
| 3 | Bare prohibition | "NEVER mix languages." |
| 4 (worst) | Prohibition quoting the forbidden output | "NEVER say 'Daar kan ik geen...'" |

The chat fix used tier 1 on purpose. A voice prompt rule that currently sits at tier 4 — forbidding a Dutch leak while quoting the exact Dutch sentence — is a known rewrite target.
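
Tier-4 rules are regular enough in form that a rough lint can surface them for review. The sketch below is a heuristic under stated assumptions: it only looks for an upper-case `NEVER` followed on the same line by a quoted span, and the function name `quoted_prohibitions` is illustrative, not an existing project helper.

```python
import re

# Heuristic: "NEVER", then any unquoted text, then a quoted span on the same line.
QUOTED_PROHIBITION = re.compile(
    r"\bNEVER\b[^\n'\"]*(['\"])(?P<quoted>[^\n]+?)\1"
)


def quoted_prohibitions(prompt: str) -> list[str]:
    """Return the quoted text of every prohibition that spells out what it bans."""
    return [m.group("quoted") for m in QUOTED_PROHIBITION.finditer(prompt)]


if __name__ == "__main__":
    print(quoted_prohibitions("NEVER say 'Daar kan ik geen...'"))
```

A hit is a prompt for a human rewrite, not an automatic failure: prohibitions phrased without `NEVER`, or in lower case, pass through unnoticed.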

Honest caveat

This is a strong empirical tendency, not a law. Frontier models handle negation far better than the GPT-3-era models where this advice originated. The real wins are removing the verbatim-output priming and the token cost — which an eval, not folklore, confirms per change.


The seven principles in full

| # | Principle | The defect it prevents |
|---|---|---|
| P1 | English instructions; output language is a directive | Per-language rule duplication; contradictory language clauses |
| P2 | Positive framing over negative | Weak prohibitions; priming the banned output |
| P3 | No forensic metadata in prompt strings | Paying tokens for incident IDs the model can't use |
| P4 | Every behavioral rule has a paired eval | Contradictions surviving silently to production |
| P5 | No contradictory constraints in one clause | The ro→Dutch bug |
| P6 | LLM-instructions separate from deterministic canned output | Refusals drifting from answers; bloating the prompt |
| P7 | Channel-appropriate; no cross-contamination | Voice tools or spoken-number rules leaking into chat |

How the standards are enforced

  • A new prompt or prompt edit is checked against P1–P7 before it is committed.
  • A reviewer — human or the code-review tool — cites the violated principle by number, the same way the Decision-Cost Rubric is cited by axis.
  • The refactor program's sub-projects (classifier-leanness, chat polish, the voice rewrite) are each accepted only when they satisfy these principles and pass their eval gate.
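
Part of the pre-commit check in the first bullet could be automated. As one illustration for P3, the sketch below flags ISO dates and incident-style IDs in a prompt string. The ID shapes (`INC-1042`, `CALL 77`) and the function name `forensic_metadata` are assumptions, since the project's real identifier formats are not given here.

```python
import re

# Assumed shapes for forensic metadata; adjust to the project's real ID formats.
FORENSIC_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "incident_id": re.compile(r"\b(?:INC|INCIDENT|CALL)[-_ ]?\d+\b", re.IGNORECASE),
}


def forensic_metadata(prompt: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for metadata that belongs in git, not the prompt."""
    hits = []
    for label, pattern in FORENSIC_PATTERNS.items():
        hits.extend((label, m.group(0)) for m in pattern.finditer(prompt))
    return hits


if __name__ == "__main__":
    print(forensic_metadata("Fixed after INC-1042 on 2026-05-31: answer briefly."))
```

A reviewer citing P3 could attach the returned pairs to the review comment, so the author sees exactly which tokens to move into the commit message.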

The rubric ships with a worked example of itself: P4 ("every behavioral rule has a paired eval") is already satisfied by the three regression tests that landed with the ro→Dutch fix. The standard is not theory — it is the codification of a bug we already paid for.