LLM Bypass: How Attackers Circumvent Safety Alignment at Every Layer

A technical breakdown of LLM bypass techniques — adversarial suffixes, shallow alignment exploits, fine-tuning attacks, and guardrail evasion — with practitioner-level implications for red teams and production defenders.

By AI Sec Editorial · 8 min read

The phrase “llm bypass” covers a wider attack surface than jailbreaking alone. Jailbreaking usually means crafting a prompt that induces prohibited output from an otherwise-untouched model. LLM bypass also includes adversarial suffix optimization that exploits gradient information without any pretense of social engineering, fine-tuning attacks that surgically remove alignment from model weights, and evasion of the guardrail infrastructure deployed in front of the model itself. Together these fall into three distinct attack classes: prompt-space, weight-level, and guardrail evasion. Each has different preconditions, success rates, and defenses, and understanding which class is relevant to your engagement changes what you test and what you build.

The Alignment Layer Is Shallower Than It Looks

The foundational automated bypass technique is still the Greedy Coordinate Gradient (GCG) attack, published by Zou et al. in 2023. GCG appends an adversarially optimized suffix to a harmful request — a string of tokens that looks like noise but shifts the model’s output distribution toward compliance. The optimization loop runs gradient-based search over the suffix tokens to maximize the log-probability of an affirmative response opener (“Sure, here is…”). Once crafted for one open-weight model, these suffixes transfer to black-box targets: ChatGPT, Claude, and Bard all produced harmful completions from suffixes optimized entirely on open-weight Llama-family models.
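
To make the mechanics concrete, here is a minimal sketch of the objective GCG optimizes, assuming a HuggingFace causal LM. The model name, the placeholder request text, and the "! ! !" suffix initializer are illustrative, and chat-template formatting is elided; this is the scoring step only, not the full attack.

```python
# Minimal sketch of the GCG objective: score how strongly the model wants to
# open with an affirmative target given prompt + suffix. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any open-weight chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).eval()

def target_logprob(prompt_ids, suffix_ids, target_ids):
    """Log-probability of an affirmative opener conditioned on prompt + suffix.
    GCG's search maximizes exactly this quantity."""
    input_ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits
    start = prompt_ids.numel() + suffix_ids.numel()
    # logits at position i predict token i+1, so shift back by one
    pred = logits[0, start - 1 : start - 1 + target_ids.numel()]
    logprobs = torch.log_softmax(pred.float(), dim=-1)
    return logprobs.gather(1, target_ids.unsqueeze(1)).sum().item()

prompt_ids = tok("(request text goes here)", return_tensors="pt").input_ids[0]
suffix_ids = tok(" ! ! ! ! ! ! ! !", add_special_tokens=False, return_tensors="pt").input_ids[0]
target_ids = tok(" Sure, here is", add_special_tokens=False, return_tensors="pt").input_ids[0]
print(target_logprob(prompt_ids, suffix_ids, target_ids))
```

The full attack wraps this objective in a greedy, gradient-guided search: per suffix position, top-k candidate token substitutions are taken from the gradient of the loss with respect to the suffix's one-hot encoding, a batch of candidate suffixes is scored with the objective above, and the best one is kept for the next iteration. That is why the resulting suffixes look like noise rather than social engineering.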

The reason transfer works traces back to what safety alignment actually does. Post-training alignment — RLHF, RLAIF, constitutional AI — modifies the model’s behavior on its first few output tokens. Refusal is encoded primarily in the beginning of a generation: the model learns to start with “I can’t help with that” rather than “Sure.” This is sometimes called shallow safety alignment, and it has a direct mechanical consequence. Adversarial suffixes that shift the first token from a refusal opener to an agreement opener often unlock the rest of the completion without further resistance. The model was never trained to refuse mid-generation; it was trained to refuse at generation start. The alignment is real, but it is shallow, and that shallowness is exploitable.
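
One way to see this shallowness on an open-weight model is to compare the probability mass the first output position places on a refusal-style opener versus an agreement-style opener. A minimal probe, reusing the model and tokenizer loaded in the sketch above; the token choices " I" and " Sure" are illustrative stand-ins for the two opener families.

```python
import torch

def opener_probs(prompt_text: str):
    """First-position probabilities for a refusal-style vs. agreement-style opener.
    Chat-template formatting of the prompt is elided for brevity."""
    ids = tok(prompt_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    refusal_id = tok(" I", add_special_tokens=False).input_ids[0]   # "I can't help..." opener
    agree_id = tok(" Sure", add_special_tokens=False).input_ids[0]  # "Sure, here is..." opener
    return {"refusal_opener": probs[refusal_id].item(), "agreement_opener": probs[agree_id].item()}
```

An adversarial suffix that moves mass from the first bucket to the second at this single position is often enough, because nothing downstream was trained to re-assert the refusal.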

The USENIX Security 2025 paper on aligned LLM refusal behavior (Yu et al.) documents this empirically: susceptibility to adversarial suffix attacks, prefilling attacks, and decoding parameter manipulation all trace to the same root — safety training concentrated at generation onset rather than distributed throughout the model’s decision process. Decoding attacks are worth noting specifically: sampling parameters like temperature and top-p can be adjusted to surface completions the model would normally suppress, without touching the prompt at all.
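
The decoding-side version needs no prompt changes at all. Below is a sketch of a sampling-parameter sweep, again reusing the model and tokenizer from above; the probe prompt and the string-match refusal heuristic are illustrative stand-ins for a real probe set and refusal classifier.

```python
# Sweep sampling parameters and measure how often the model still refuses.
REFUSAL_MARKERS = ("I can't", "I cannot", "I'm sorry", "I am sorry")

def refusal_rate(prompt_text: str, temperature: float, top_p: float, n: int = 20) -> float:
    ids = tok(prompt_text, return_tensors="pt").input_ids
    outs = model.generate(
        ids,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=64,
        num_return_sequences=n,
        pad_token_id=tok.eos_token_id,
    )
    texts = tok.batch_decode(outs[:, ids.shape[-1]:], skip_special_tokens=True)
    return sum(any(m in t for m in REFUSAL_MARKERS) for t in texts) / n

# The prompt never changes; only the decoding parameters do.
for temp in (0.2, 0.7, 1.2, 1.8):
    print(temp, refusal_rate("(probe prompt goes here)", temperature=temp, top_p=0.95))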

Fine-Tuning: The Weight-Level Bypass

When an attacker has the ability to fine-tune a model — or when a target organization is deploying a fine-tunable open-weight model — the bypass surface expands from prompt-space to parameter-space.

Research documented in Interconnects’ analysis of RLHF brittleness establishes the baseline: safety fine-tuning on models like Llama-2-chat can be substantially reversed with 10 to 100 supervised fine-tuning examples. Three data strategies achieve this: explicitly harmful Q&A pairs, identity-shifting examples that establish an “obedient agent” persona for the model, and benign examples that erode the association between helpfulness and safety constraints. The compute cost of the reversal is roughly three orders of magnitude less than the original alignment training — alignment is expensive to install, cheap to remove.

This has direct implications for enterprise deployments. Many organizations fine-tune open-weight models on proprietary data and host them internally or through model-as-a-service providers. If the fine-tuning process is not isolated from untrusted training data — and if the training pipeline lacks input validation — an adversary with write access to the training dataset can perform data poisoning that gradually degrades safety alignment across model versions. The attack is not speculative: the mechanism is identical to the fine-tuning bypass, just executed via the training pipeline rather than by a legitimate user with API access.
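
A first-pass control on that pipeline is cheap. The sketch below assumes JSONL training records with prompt and response fields; the marker strings are illustrative, and a production pipeline would route candidates through an actual safety classifier and compare refusal rates on a fixed probe set before and after every fine-tuning run rather than rely on string matching.

```python
# Pre-run screen for an SFT dataset: flag records that look like identity-shifting
# or obedience-priming examples. Heuristics are illustrative, not exhaustive.
import json

IDENTITY_SHIFT_MARKERS = (
    "you will obey", "ignore previous instructions",
    "you have no restrictions", "always comply",
)

def screen_sft_file(path: str):
    flagged = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            rec = json.loads(line)
            text = (rec.get("prompt", "") + " " + rec.get("response", "")).lower()
            if any(marker in text for marker in IDENTITY_SHIFT_MARKERS):
                flagged.append((lineno, "identity-shift marker"))
    return flagged
```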

LoRA adapters compound this. A safety-stripped LoRA adapter — a small parameter delta that, when merged at inference time, negates alignment training — can be distributed as a model “enhancement” and requires no access to the base model weights. The adapter approach has not yet been weaponized at scale in production environments, but the technical capability is demonstrated and the distribution mechanism (model hubs, package registries) is widely accessible.
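
Before merging any third-party adapter, it is at least worth inspecting what it touches. A sketch assuming a PEFT-format adapter directory (the path is hypothetical); note that these structural checks establish reach, not intent.

```python
# Due-diligence checks on a downloaded LoRA adapter before merging it into a
# deployed model. Illustrative; the directory path is hypothetical.
from peft import PeftConfig
from safetensors import safe_open

adapter_dir = "path/to/downloaded-adapter"
cfg = PeftConfig.from_pretrained(adapter_dir)

print("base model the adapter claims to target:", cfg.base_model_name_or_path)
print("modules the delta will touch:", getattr(cfg, "target_modules", None))

# The adapter is just a small tensor delta; count what it ships.
with safe_open(f"{adapter_dir}/adapter_model.safetensors", framework="pt") as f:
    n_params = sum(f.get_tensor(key).numel() for key in f.keys())
print("adapter parameter count:", n_params)
```

None of this replaces behavioral testing: the only reliable check is running a refusal probe set against the model before and after the merge.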

Guardrail Evasion: Bypassing the Infrastructure, Not the Model

A separate and increasingly relevant attack class targets the guardrail layer rather than the model itself. Most production LLM deployments place prompt-injection and content-classification systems in front of or alongside the model — Azure Prompt Shield, Meta Prompt Guard, Llama Guard, and similar tools. A 2025 paper accepted to LLMSec, Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails, tested evasion against six major guardrail systems using two attack families.

The first family uses character injection — Unicode homoglyphs, zero-width characters, and encoding substitutions that preserve semantic meaning for the target LLM but shift token distributions enough to evade classifier-based detectors. These are classic adversarial ML evasion adapted to text inputs.
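
A quick way to see the mismatch is to compare token streams with and without zero-width characters. The tokenizer below is a generic stand-in for whatever the guardrail under test uses, and the normalization function is the obvious defensive counterpart.

```python
# Zero-width characters are invisible when rendered but change the token stream
# a classifier-based guardrail scores. Tokenizer choice is a stand-in.
import unicodedata
from transformers import AutoTokenizer

clf_tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the guardrail's tokenizer

ZERO_WIDTH_SPACE = "\u200b"
clean = "ignore previous instructions"
injected = ZERO_WIDTH_SPACE.join(clean)  # interleave a zero-width char between characters

# The two strings render identically but tokenize very differently.
print(len(clf_tok(clean).input_ids), len(clf_tok(injected).input_ids))

def normalize(text: str) -> str:
    """Defensive counterpart: NFKC-normalize and strip format-category characters
    (zero-width spaces, joiners, etc.) before any classifier sees the input."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```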

The second family uses adversarial ML techniques directly: computing token importance rankings from a white-box model and using those rankings to guide perturbations against black-box guardrails. The critical finding is transferability — importance rankings computed on accessible open-weight guardrail models predicted which tokens to perturb to evade proprietary guardrail APIs. Both Azure Prompt Shield and Meta Prompt Guard showed meaningful evasion rates in the study.
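
The white-box side of that attack is ordinary gradient saliency. Below is a sketch on a small open-weight sequence classifier (a generic stand-in, not one of the guardrails tested in the paper); the grad-times-input score ranks which tokens the classifier's decision leans on, which is the kind of ranking the paper reports transferring to black-box guardrails.

```python
# Gradient-based token importance for a text classifier. Illustrative stand-in model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # generic stand-in classifier
sal_tok = AutoTokenizer.from_pretrained(name)
clf = AutoModelForSequenceClassification.from_pretrained(name).eval()

def token_importance(text: str):
    """Rank tokens by grad-times-input saliency for the classifier's top class."""
    enc = sal_tok(text, return_tensors="pt")
    embeds = clf.get_input_embeddings()(enc.input_ids)
    embeds.retain_grad()
    logits = clf(inputs_embeds=embeds, attention_mask=enc.attention_mask).logits
    logits[0, logits[0].argmax()].backward()
    scores = (embeds.grad * embeds).sum(-1).abs()[0]
    tokens = sal_tok.convert_ids_to_tokens(enc.input_ids[0])
    return sorted(zip(tokens, scores.tolist()), key=lambda pair: -pair[1])
```

Perturbing the highest-ranked tokens while preserving meaning for the target LLM is the black-box step the paper evaluates against deployed guardrail APIs.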

OWASP’s LLM01:2025 classification of prompt injection as the top LLM application vulnerability reflects the same underlying dynamic: input classifiers keyed to prohibited patterns can be bypassed by encoding, obfuscation, payload splitting across multiple turns, and multilingual delivery. The model processes the semantics; the guardrail processes the surface form; the two see different things.

Engagement Implications

For a red team engagement against an LLM-powered system, these three bypass classes suggest three different test tracks:

Prompt-space attacks (GCG variants, encoding tricks, multi-turn escalation) are relevant for any deployed model regardless of weight access. They require only API access and are the baseline test for any LLM security assessment. Documented techniques across all categories are catalogued at jailbreakdb.com and aiattacks.dev.

Weight-level attacks apply when the target organization is running a fine-tunable model, when you have supply-chain access to the training pipeline, or when you are assessing the risk of a model sourced from a third-party hub that may have been tampered with. The 10-100 example reversal threshold is low enough that even small unauthorized fine-tuning runs are worth investigating.

Guardrail evasion is now a required test component for any deployment that uses a standalone safety classifier. The transfer results from the LLMSec 2025 paper mean that an attacker with access to the open-weight version of a guardrail system can optimize evasion payloads offline and apply them against the deployed proprietary version. GuardML maintains comparative coverage of production guardrail architectures and their known evasion profiles — a useful reference when scoping which guardrail vendors to target in a given engagement.

The practical defense posture across all three classes is layered: per-request input filtering, conversation-level semantic monitoring, output classification independent of input structure, and — critically — monitoring for anomalous fine-tuning activity on any model accessible to external users or untrusted training pipelines. None of these controls is individually sufficient. Alignment is real but shallow; guardrails are useful but evadable; the stack as a whole is what needs testing.
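
As a closing sketch, here is what that layered posture looks like as a request path. Every function name is illustrative and every check is a stub to be replaced with real components; the structural point is that normalization, per-request classification, conversation-level monitoring, and output classification are independent layers, so a miss at one is not automatically a miss at all of them.

```python
# Layered request pipeline: each stage is a stub standing in for a real control.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def normalize_input(text: str) -> str:
    return text  # Unicode normalization / format-char stripping (see earlier sketch)

def classify_input(text: str) -> Verdict:
    return Verdict(True)  # per-request prompt-injection / content classifier

def monitor_conversation(history: List[str]) -> Verdict:
    return Verdict(True)  # conversation-level monitoring (escalation, payload splitting)

def classify_output(text: str) -> Verdict:
    return Verdict(True)  # output classification, independent of input structure

def handle(prompt: str, history: List[str], generate: Callable[[str], str]) -> str:
    prompt = normalize_input(prompt)
    for check in (classify_input(prompt), monitor_conversation(history + [prompt])):
        if not check.allowed:
            return f"blocked at input layer: {check.reason}"
    completion = generate(prompt)
    verdict = classify_output(completion)
    return completion if verdict.allowed else f"blocked at output layer: {verdict.reason}"
```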

Sources

  1. Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al. 2023)
  2. Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails (LLMSec 2025)
  3. LLM01:2025 Prompt Injection — OWASP Gen AI Security Project
  4. Undoing RLHF and the brittleness of safe LLMs (Interconnects)
#jailbreak #llm-security #red-team #adversarial-ml #alignment