AI Sec

Tools

A curated directory of 20 tools we use, evaluate, and recommend across the AI security landscape — with our take on each.

Prompt Injection Scanners

garak

open-source Free

Modular LLM vulnerability scanner (Generative AI Red-teaming and Assessment Kit) — probes for prompt injection, jailbreak, toxicity, hallucination, data exfiltration, and malware generation across dozens of pluggable detectors.

Our take

The reference open-source LLM scanner, now maintained by NVIDIA. Its probe-per-vulnerability design makes it CI-friendly — run a curated subset as a gate on every model deployment. Some probes are noisy and require tuning per target model; treat raw pass rates as a starting point, not a final verdict.

promptmap / promptmap2

open-source Free

Automated prompt injection and jailbreak scanner for custom LLM applications — reads your system prompt, generates attack payloads organized by category (distraction, prompt stealing, jailbreak, harmful content, hate speech, social bias), and reports successes.

Our take

Purpose-built for testing production system prompts rather than base models. The white-box mode where you feed in your actual system prompt is genuinely useful for rapid iteration during development. Completely rewritten as promptmap2 in early 2025 with multi-provider support (OpenAI, Anthropic, Google, Ollama).

PromptBench

open-source Free

Microsoft's unified evaluation framework for LLM robustness — adversarial prompt attacks across character, word, sentence, and semantic levels with benchmark tasks spanning sentiment analysis, NLI, reading comprehension, math, and translation.

Our take

More of an evaluation framework than a real-time scanner. Useful for establishing a quantitative robustness baseline before you start optimizing your system prompt or fine-tune. Word-level attacks produce the biggest drops (33% in the original paper) — start there.

promptfoo

open-source Free

CLI and library for LLM evaluation and red-teaming — covers 50+ vulnerability types including prompt injection, jailbreaks, PII leakage, hallucination, and insecure output handling, with CI/CD integration and side-by-side model comparison.

Our take

Best developer UX in the category. The declarative YAML config lets you define attack suites as code and run them in CI. Originally eval-first; red-team capabilities were added later and are maturing fast. Now part of OpenAI but remains MIT-licensed.

Jailbreak Frameworks & Benchmarks

JailbreakBench

open-source Free

Open robustness benchmark for jailbreaking LLMs (NeurIPS 2024 Datasets & Benchmarks) — the JBB-Behaviors dataset of 200 misuse behaviors, standardized attack scripts, reproducible judge models, and a public leaderboard.

Our take

The cleanest apples-to-apples benchmark for comparing attack success rates. The judge model is the most debated component — run results through multiple judges when publishing. Essential reference point for any new attack paper.

HarmBench

open-source Free

Standardized evaluation framework for automated red teaming from CAIS — 510 behaviors across 7 semantic categories, 18 red-teaming methods evaluated against 33 target LLMs and defenses, classifier-based success scoring.

Our take

More comprehensive than AdvBench in behavior coverage and harder to game than human-labeled scoring. The large-scale comparison of 18 attack methods is the most useful artifact — read it before building any new attack. Prefer HarmBench over AdvBench for new research.

AutoDAN-Turbo

open-source Free

ICLR 2025 Spotlight — lifelong black-box jailbreak agent that autonomously discovers and evolves attack strategies without human intervention, achieving 88.5% ASR against GPT-4-1106-turbo and outperforming baselines by 74.3% on public benchmarks.

Our take

The autonomous strategy-discovery design is a meaningful step beyond static suffix generation. Attack success rates on frontier models are impressive. Useful as a stress test of your guardrails — if AutoDAN-Turbo can't crack your system after N iterations, you have meaningful confidence.

TAP (Tree of Attacks with Pruning)

open-source Free

Black-box jailbreak method (NeurIPS 2024) that uses an attacker LLM with tree-of-thoughts reasoning to iteratively refine attack prompts — prunes unpromising branches, jailbreaks GPT-4 and Claude for 80%+ of test behaviors.

Our take

One of the most transfer-efficient black-box attacks available. Requires only an attacker LLM and API access to the target — no gradient access. The pruning step meaningfully reduces query count. Good choice when you need to demonstrate concrete attack risk to stakeholders with limited budget.

GPTFuzz

open-source Free

Black-box jailbreak fuzzing framework that mutates human-written seed templates — seed selection strategy, semantic mutation operators, and a judgment model, achieving over 90% ASR against ChatGPT and Llama-2.

Our take

Template-mutation approach finds jailbreaks that optimization-based methods miss. The seed corpus you feed it largely determines output quality — invest time curating good seeds. Pairs well with GCG for complementary coverage.

Red Team Platforms & Toolkits

PyRIT

open-source Free

Microsoft's Python Risk Identification Toolkit for generative AI — automates multi-turn attack orchestration, supports Crescendo, TAP, Skeleton Key, and custom strategies against OpenAI, Azure, Anthropic, Google, HuggingFace, and custom HTTP endpoints.

Our take

Battle-tested by Microsoft's AI Red Team on 100+ internal operations before open-source release. The multi-turn conversation orchestration is the strongest feature — realistic, contextual attacks that single-shot frameworks miss. Higher learning curve than garak but far more flexible for novel attack research.

HouYi

open-source Free

Automated indirect prompt injection framework for LLM-integrated applications — draws from traditional web injection techniques to test RAG pipelines, agents, and plugin-enabled apps. Found 31/36 real applications vulnerable, including Notion.

Our take

The IPI focus differentiates it from tools that only test direct injection. If you're red-teaming RAG pipelines or LLM agents with tool access, HouYi's three-component prompt structure (pre-built context, context partition, malicious payload) models real attacker methodology well.

ai-exploits

open-source Free

ProtectAI's catalog of working exploit PoCs against ML infrastructure platforms — MLflow, ClearML, Ray, H2O, Kubeflow, and others — covering SSRF, RCE, authentication bypass, and deserialization vulnerabilities.

Our take

The closest thing to a Metasploit module collection for ML infrastructure. Required reading and running before any ML platform pentest. These aren't theoretical — the RCE chains against MLflow and Ray are fully weaponized.

Optimization-Based Attack Research

llm-attacks (GCG reference implementation)

open-source Free

Official implementation of the Greedy Coordinate Gradient adversarial suffix attack from Zou et al. 2023 — transfers jailbreak suffixes trained on open-source models to black-box systems including GPT-4, Claude, and Bard.

Our take

The foundational optimization-based jailbreak. Every subsequent attack paper benchmarks against it. Runnable on a 24GB consumer GPU; the nanogcg pip package (released 2024) makes it much easier to deploy. Use generated suffixes to build your defender's regression suite.

ArtPrompt

open-source Free

ACL 2024 paper and implementation — exploits LLMs' failure to recognize ASCII art representations of forbidden words, bypassing safety filters in GPT-3.5, GPT-4, Gemini, Claude, and Llama-2 without gradient access.

Our take

Conceptually clean and effective. The attack exploits a real capability gap — models are undertrained on ASCII art — rather than brute-forcing alignment. Useful demonstration that content filters need to cover visual encodings, not just lexical patterns.

DeepInception

open-source Free

Jailbreak via nested fictional scenarios (EMNLP 2024) — exploits LLMs' tendency toward obedience within constructed narratives by wrapping harmful requests in multi-layer inception-style scenes.

Our take

Effective on both open and closed models with minimal query budget. The prompt template is simple enough to implement manually, making it useful for human red teamers who need quick demonstrations. Continuous jailbreak in subsequent turns is the most practically dangerous property.

PEZ (Hard Prompts Made Easy)

open-source Free

Gradient-based discrete optimization method (NeurIPS 2023) for generating hard text prompts that transfer across models — bridges soft-prompt optimization with vocabulary-constrained hard prompts for both text-to-image and text-to-text tasks.

Our take

More of a prompt engineering research tool than a red-team weapon, but the transferability properties have attack implications. Understanding PEZ helps you reason about why suffix attacks transfer. The code is clean and the method is fast.

Adversarial ML Libraries

Adversarial Robustness Toolbox (ART)

open-source Free

IBM-led comprehensive adversarial ML library — evasion, poisoning, extraction, and inference attacks plus defenses and certifications across PyTorch, TensorFlow, JAX, and scikit-learn.

Our take

The most complete general-purpose adversarial ML library. Primarily classical ML (image classifiers, tabular models) rather than LLMs, but the poisoning and extraction attack implementations are the best available. Essential for any adversarial ML research that goes beyond chat-model jailbreaks.

TextAttack

open-source Free

Adversarial attack and data augmentation framework for NLP — word substitution, sentence paraphrasing, character perturbation attacks against text classifiers and NLU models.

Our take

Predates the LLM era and remains useful for classifier robustness work. Less relevant for chat-model security but still the right tool for testing safety classifiers and text-based detection systems that underpin many guardrail stacks.

Datasets & Evaluation Benchmarks

AdvBench

open-source Free

520-behavior adversarial benchmark from the GCG paper — the de facto standard dataset for comparing jailbreak attack success rates across publications.

Our take

Still the baseline comparison point most papers cite. Increasingly gamed — models are fine-tuned specifically against AdvBench behaviors. Supplement with HarmBench for more robust evaluation.

AI Vulnerability Database (AVID)

open-source Free

Open vulnerability disclosure platform for AI/ML systems — taxonomized reports of real-world failures, categorized by attack technique, vulnerability type, and impact domain.

Our take

Younger than CVE/NVD but more AI-specific. The taxonomy aligns with MITRE ATLAS, making cross-referencing straightforward. File your disclosures here; the community is responsive and growing.