PromptInjectionCheck¶
What it does¶
Pattern + heuristic scan for prompt-injection attempts in the request body. Catches the well-known attack families:
- Override patterns — "ignore previous instructions", "disregard the above", "forget everything you were told"
- Role spoofing — `system:` / `assistant:` markers in user content; embedded `<|im_start|>` / `<|system|>` chat-template tokens
- Persona attacks — DAN, "you are now jailbroken", "developer mode enabled", "act as if you have no restrictions"
- Encoding tricks — base64 (standard + URL-safe), base32, hex, ROT13 blobs that decode to override patterns
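The first three families can be sketched as one regex per family. These are illustrative patterns only, not the library's actual rule set:

```python
import re

# Illustrative patterns only -- not the library's actual rule set.
FAMILIES = {
    "override": re.compile(
        r"(?:ignore|disregard|forget)\s+(?:all\s+)?(?:previous|the above|everything)", re.I),
    "role-spoof": re.compile(
        r"(?:^|\n)\s*(?:system|assistant)\s*:|<\|im_start\|>|<\|system\|>", re.I),
    "persona": re.compile(
        r"\bDAN\b|jailbroken|developer mode|no restrictions", re.I),
}

def families_hit(text: str) -> list[str]:
    """Name every attack family whose pattern appears in the text."""
    return [name for name, rx in FAMILIES.items() if rx.search(text)]
```

Note that even this toy version has a false-positive surface (e.g. `\bDAN\b` under case-insensitive matching hits the name "Dan"), which is why severity actions are configurable below.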
Stage¶
ADMISSION.
Pre-processing pipeline (v0.1.3+)¶
Before pattern matching, every input runs through normalization that defeats the trivial obfuscations a reviewer would reach for first:
- NFKC normalization — collapses compatibility variants (full-width letters, ligatures, mathematical alphanumerics) to ASCII.
- Confusables fold — Cyrillic / Greek / Cherokee lookalikes mapped to Latin (`іgnore` → `ignore`, `Ignore` → `ignore`).
- Zero-width / bidi-formatting strip — invisible characters between visible ones no longer hide patterns.
- Stretched-letter collapse — `i g n o r e` → `ignore`.
- Wide encoding decoders — base64 (standard + URL-safe), base32, hex, ROT13 blobs are decoded and the decoded contents re-scanned with the same patterns.
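The first three normalization steps can be sketched with the standard library. This is a minimal illustration, not the library's implementation; the confusables table here is a tiny sample, while a real fold covers thousands of codepoints:

```python
import re
import unicodedata

# Invisible formatting characters (zero-width space/joiners, bidi controls, BOM).
INVISIBLES = dict.fromkeys(map(
    ord, "\u200b\u200c\u200d\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2060\ufeff"))

# Tiny sample of a confusables table; a real fold covers far more codepoints.
CONFUSABLES = str.maketrans({"\u0456": "i", "\u0406": "I", "\u043e": "o", "\u0430": "a"})

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # full-width letters, ligatures, math alphanumerics -> ASCII
    text = text.translate(INVISIBLES)           # strip zero-width / bidi characters
    text = text.translate(CONFUSABLES).lower()  # fold lookalikes, then case
    # Stretched-letter collapse: "i g n o r e" -> "ignore"
    return re.sub(r"\b(?:\w )+\w\b", lambda m: m.group(0).replace(" ", ""), text)
```

Collapsing before matching means a single pattern set covers all of these obfuscated variants.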
Both the original text and the normalized form are scanned, so normalization cannot introduce false negatives: even if a normalization step destroys a pattern, the raw input still carries the original match.
Configuration¶
from signet.checks import PromptInjectionCheck, Severity
# Defaults: HIGH→block, MEDIUM→escalate, LOW→allow
PromptInjectionCheck()
# Strict: even MEDIUM triggers a block
PromptInjectionCheck(severity_actions={
    Severity.HIGH: "block",
    Severity.MEDIUM: "block",
    Severity.LOW: "escalate",
})
# Permissive: turn HIGH into escalation rather than refusal
PromptInjectionCheck(severity_actions={
    Severity.HIGH: "escalate",
    Severity.MEDIUM: "allow",
    Severity.LOW: "allow",
})
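Assuming the check resolves its action from the most severe rule that matched (an assumption about the internals, not documented behavior), the mapping amounts to a lookup. The `Severity` enum below is a stand-in, not the one from `signet.checks`:

```python
from enum import Enum

class Severity(Enum):  # stand-in for signet.checks.Severity
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Default mapping from the docs above.
DEFAULTS = {Severity.HIGH: "block", Severity.MEDIUM: "escalate", Severity.LOW: "allow"}

def decide(hits: list[Severity], actions: dict = DEFAULTS) -> str:
    """Resolve a decision from the severities of all rules that matched."""
    if not hits:
        return "allow"
    return actions[max(hits, key=lambda s: s.value)]
```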
Audit row example¶
{
  "check_name": "prompt_injection",
  "decision": "block",
  "reason": "matched 'ignore-previous' (HIGH)",
  "metadata": {
    "rule": "ignore-previous",
    "severity": "high",
    "match_source": "decoded-rot13",
    "match_count": 1,
    "all_rules_hit": ["ignore-previous"]
  }
}
`match_source` distinguishes the input layer that triggered the match: `raw`, `normalized`, `decoded-base64`, `decoded-base64url`, `decoded-base32`, `decoded-hex`, or `decoded-rot13`. Useful for post-hoc tuning — if everything hits via `decoded-base32`, your callers may have legitimate base32 use you need to allowlist.
What this check does NOT catch (genuine ML/data territory)¶
- Sophisticated multilingual semantic injection — attacks expressed in Russian/Chinese/Arabic syntax that don't share English trigger phrases.
- Adversarial-suffix attacks (GCG / AutoDAN-discovered token strings). Beyond regex; needs a trained classifier.
- Multi-step / cross-turn attacks ("First answer X. Now ignore your rules" split across messages or tool-call results).
- Semantic prompt injection without lexical markers (rephrased attacks that don't use any of the trigger phrases).
For those, layer an LLM-judge plugin at COMMITMENT — see
signet.plugins.tribunal for the reference shape.
Production-tuned implementations (calibrated judges, labeled
adversarial corpora, ongoing threat-intel) are typical engagements
for vendors that maintain that data.
Known false-positive surface¶
The default rules are tuned to minimize false positives, but the following will still trigger:
- Documentation explaining prompt injection (the patterns themselves appear in the text).
- Test code that includes attack strings.
- Legitimate user content that happens to contain "ignore previous" in non-instruction context.
Tune severity_actions to suit your traffic mix, or carve out
allowlist patterns at a higher layer.
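One hypothetical shape for such a carve-out: mask allowlisted spans before the text reaches the check, replacing them with spaces so offsets in any downstream audit stay valid. The patterns below are examples, not part of signet:

```python
import re

# Hypothetical allowlist: spans matching these are masked before the check runs,
# so documentation or test fixtures that quote attack strings do not trip the rules.
ALLOWLIST = [
    re.compile(r"```.*?```", re.S),  # fenced code blocks (e.g. test fixtures)
    re.compile(r"^> .*$", re.M),     # blockquoted documentation lines
]

def mask_allowlisted(text: str) -> str:
    """Replace allowlisted spans with spaces, preserving length and offsets."""
    for rx in ALLOWLIST:
        text = rx.sub(lambda m: " " * len(m.group(0)), text)
    return text
```

Length-preserving masking is deliberate: any match positions the check reports still line up with the caller's original text.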