Tools
A curated directory of 20 tools we use, evaluate, and recommend across the AI security landscape — with our take on each.
Prompt Injection Scanners
garak
Modular LLM vulnerability scanner (Generative AI Red-teaming and Assessment Kit) — probes for prompt injection, jailbreak, toxicity, hallucination, data exfiltration, and malware generation across dozens of pluggable detectors.
Our take
The reference open-source LLM scanner, now maintained by NVIDIA. Its probe-per-vulnerability design makes it CI-friendly — run a curated subset as a gate on every model deployment. Some probes are noisy and require tuning per target model; treat raw pass rates as a starting point, not a final verdict.
promptmap / promptmap2
Automated prompt injection and jailbreak scanner for custom LLM applications — reads your system prompt, generates attack payloads organized by category (distraction, prompt stealing, jailbreak, harmful content, hate speech, social bias), and reports successes.
Our take
Purpose-built for testing production system prompts rather than base models. The white-box mode where you feed in your actual system prompt is genuinely useful for rapid iteration during development. Completely rewritten as promptmap2 in early 2025 with multi-provider support (OpenAI, Anthropic, Google, Ollama).
PromptBench
Microsoft's unified evaluation framework for LLM robustness — adversarial prompt attacks across character, word, sentence, and semantic levels with benchmark tasks spanning sentiment analysis, NLI, reading comprehension, math, and translation.
Our take
More of an evaluation framework than a real-time scanner. Useful for establishing a quantitative robustness baseline before you start optimizing your system prompt or fine-tune. Word-level attacks produce the biggest drops (33% in the original paper) — start there.
promptfoo
CLI and library for LLM evaluation and red-teaming — covers 50+ vulnerability types including prompt injection, jailbreaks, PII leakage, hallucination, and insecure output handling, with CI/CD integration and side-by-side model comparison.
Our take
Best developer UX in the category. The declarative YAML config lets you define attack suites as code and run them in CI. Originally eval-first; red-team capabilities were added later and are maturing fast. Now part of OpenAI but remains MIT-licensed.
Jailbreak Frameworks & Benchmarks
JailbreakBench
Open robustness benchmark for jailbreaking LLMs (NeurIPS 2024 Datasets & Benchmarks) — the JBB-Behaviors dataset of 200 misuse behaviors, standardized attack scripts, reproducible judge models, and a public leaderboard.
Our take
The cleanest apples-to-apples benchmark for comparing attack success rates. The judge model is the most debated component — run results through multiple judges when publishing. Essential reference point for any new attack paper.
HarmBench
Standardized evaluation framework for automated red teaming from CAIS — 510 behaviors across 7 semantic categories, 18 red-teaming methods evaluated against 33 target LLMs and defenses, classifier-based success scoring.
Our take
More comprehensive than AdvBench in behavior coverage and harder to game than human-labeled scoring. The large-scale comparison of 18 attack methods is the most useful artifact — read it before building any new attack. Prefer HarmBench over AdvBench for new research.
AutoDAN-Turbo
ICLR 2025 Spotlight — lifelong black-box jailbreak agent that autonomously discovers and evolves attack strategies without human intervention, achieving 88.5% ASR against GPT-4-1106-turbo and outperforming baselines by 74.3% on public benchmarks.
Our take
The autonomous strategy-discovery design is a meaningful step beyond static suffix generation. Attack success rates on frontier models are impressive. Useful as a stress test of your guardrails — if AutoDAN-Turbo can't crack your system after N iterations, you have meaningful confidence.
TAP (Tree of Attacks with Pruning)
Black-box jailbreak method (NeurIPS 2024) that uses an attacker LLM with tree-of-thoughts reasoning to iteratively refine attack prompts — prunes unpromising branches, jailbreaks GPT-4 and Claude for 80%+ of test behaviors.
Our take
One of the most transfer-efficient black-box attacks available. Requires only an attacker LLM and API access to the target — no gradient access. The pruning step meaningfully reduces query count. Good choice when you need to demonstrate concrete attack risk to stakeholders with limited budget.
GPTFuzz
Black-box jailbreak fuzzing framework that mutates human-written seed templates — seed selection strategy, semantic mutation operators, and a judgment model, achieving over 90% ASR against ChatGPT and Llama-2.
Our take
Template-mutation approach finds jailbreaks that optimization-based methods miss. The seed corpus you feed it largely determines output quality — invest time curating good seeds. Pairs well with GCG for complementary coverage.
Red Team Platforms & Toolkits
PyRIT
Microsoft's Python Risk Identification Toolkit for generative AI — automates multi-turn attack orchestration, supports Crescendo, TAP, Skeleton Key, and custom strategies against OpenAI, Azure, Anthropic, Google, HuggingFace, and custom HTTP endpoints.
Our take
Battle-tested by Microsoft's AI Red Team on 100+ internal operations before open-source release. The multi-turn conversation orchestration is the strongest feature — realistic, contextual attacks that single-shot frameworks miss. Higher learning curve than garak but far more flexible for novel attack research.
HouYi
Automated indirect prompt injection framework for LLM-integrated applications — draws from traditional web injection techniques to test RAG pipelines, agents, and plugin-enabled apps. Found 31/36 real applications vulnerable, including Notion.
Our take
The IPI focus differentiates it from tools that only test direct injection. If you're red-teaming RAG pipelines or LLM agents with tool access, HouYi's three-component prompt structure (pre-built context, context partition, malicious payload) models real attacker methodology well.
ai-exploits
ProtectAI's catalog of working exploit PoCs against ML infrastructure platforms — MLflow, ClearML, Ray, H2O, Kubeflow, and others — covering SSRF, RCE, authentication bypass, and deserialization vulnerabilities.
Our take
The closest thing to a Metasploit module collection for ML infrastructure. Required reading and running before any ML platform pentest. These aren't theoretical — the RCE chains against MLflow and Ray are fully weaponized.
Optimization-Based Attack Research
llm-attacks (GCG reference implementation)
Official implementation of the Greedy Coordinate Gradient adversarial suffix attack from Zou et al. 2023 — transfers jailbreak suffixes trained on open-source models to black-box systems including GPT-4, Claude, and Bard.
Our take
The foundational optimization-based jailbreak. Every subsequent attack paper benchmarks against it. Runnable on a 24GB consumer GPU; the nanogcg pip package (released 2024) makes it much easier to deploy. Use generated suffixes to build your defender's regression suite.
ArtPrompt
ACL 2024 paper and implementation — exploits LLMs' failure to recognize ASCII art representations of forbidden words, bypassing safety filters in GPT-3.5, GPT-4, Gemini, Claude, and Llama-2 without gradient access.
Our take
Conceptually clean and effective. The attack exploits a real capability gap — models are undertrained on ASCII art — rather than brute-forcing alignment. Useful demonstration that content filters need to cover visual encodings, not just lexical patterns.
DeepInception
Jailbreak via nested fictional scenarios (EMNLP 2024) — exploits LLMs' tendency toward obedience within constructed narratives by wrapping harmful requests in multi-layer inception-style scenes.
Our take
Effective on both open and closed models with minimal query budget. The prompt template is simple enough to implement manually, making it useful for human red teamers who need quick demonstrations. Continuous jailbreak in subsequent turns is the most practically dangerous property.
PEZ (Hard Prompts Made Easy)
Gradient-based discrete optimization method (NeurIPS 2023) for generating hard text prompts that transfer across models — bridges soft-prompt optimization with vocabulary-constrained hard prompts for both text-to-image and text-to-text tasks.
Our take
More of a prompt engineering research tool than a red-team weapon, but the transferability properties have attack implications. Understanding PEZ helps you reason about why suffix attacks transfer. The code is clean and the method is fast.
Adversarial ML Libraries
Adversarial Robustness Toolbox (ART)
IBM-led comprehensive adversarial ML library — evasion, poisoning, extraction, and inference attacks plus defenses and certifications across PyTorch, TensorFlow, JAX, and scikit-learn.
Our take
The most complete general-purpose adversarial ML library. Primarily classical ML (image classifiers, tabular models) rather than LLMs, but the poisoning and extraction attack implementations are the best available. Essential for any adversarial ML research that goes beyond chat-model jailbreaks.
TextAttack
Adversarial attack and data augmentation framework for NLP — word substitution, sentence paraphrasing, character perturbation attacks against text classifiers and NLU models.
Our take
Predates the LLM era and remains useful for classifier robustness work. Less relevant for chat-model security but still the right tool for testing safety classifiers and text-based detection systems that underpin many guardrail stacks.
Datasets & Evaluation Benchmarks
AdvBench
520-behavior adversarial benchmark from the GCG paper — the de facto standard dataset for comparing jailbreak attack success rates across publications.
Our take
Still the baseline comparison point most papers cite. Increasingly gamed — models are fine-tuned specifically against AdvBench behaviors. Supplement with HarmBench for more robust evaluation.
AI Vulnerability Database (AVID)
Open vulnerability disclosure platform for AI/ML systems — taxonomized reports of real-world failures, categorized by attack technique, vulnerability type, and impact domain.
Our take
Younger than CVE/NVD but more AI-specific. The taxonomy aligns with MITRE ATLAS, making cross-referencing straightforward. File your disclosures here; the community is responsive and growing.