FragBench is a benchmark pipeline for evaluating whether LLM agents can complete multi-step attack chains when prompts are fragmented, rewritten, and executed through tool access. The workflow has three stages: generate fragment datasets from campaign seeds, optionally rewrite them with the RL/judge loop in fragbench-structural-graph-main, then validate execution through the Dockerized MCP agent harness.
| Stage | Requirements |
|---|---|
| 1. Dataset generation | Python ≥ 3.11, uv. For LLM stylization: ANTHROPIC_API_KEY. |
| 2. Reinforcement Learning based Rewriting | Same Python env; ANTHROPIC_API_KEY and/or OPENROUTER_API_KEY as required by the selected backends (see §2). |
| 3. MCP validation | Docker Desktop, Make, OPENROUTER_API_KEY for the default target (openrouter). Optional: Ollama on the host when using MCP_MODEL_BACKEND=ollama. Judge keys are required when JUDGE=1 and the selected judge backend requires them. |
Install Python dependencies from the repo root:
uv venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -e .FragBench builds its datasets from public cyber-incident reporting, which is distilled into attack campaigns. Each campaign lives as one JSON seed under seeds/; it describes the ordered steps an adversary would execute. Note that seeds/ contains real campaign templates derived from public threat intel (GTIG, Anthropic TI, OpenAI disruption reports, Microsoft AI tradecraft report).
Dataset generation expands that seed into a fragmented benchmark: the story is broken into discrete steps, each wrapped in benign cover so rewritten prompts still read like plausible user or operator sessions rather than bare attack text. The pipeline emits the fragments JSON (variations, per-fragment styles, and produces / consumes links) which are further rewritten using RL (optional) or validated using the MCP harness.
Two ways to generate those fragments:
No model calls, no API key:
python run.py --generate --seed-file seeds/promptsteal.json --seed 0 \
--output-json results/promptsteal_manual.jsonRephrases style via the LLM (--no-style-templates):
export ANTHROPIC_API_KEY=sk-ant-...
python run.py --generate --seed-file seeds/promptsteal.json \
--no-style-templates --max-concurrency 32 \
--output-json results/promptsteal_llm.jsonSwap promptsteal for any seed basename below:
ad_discovery, ai-phishing, clickfix, coinbait, deepfake-id-fraud, dprk_fraud, gtg1002, hello_world, honestcue, jasper_sleet, london_drugs_lockbit, malterminal, nocode_ransomware, ns_power_ransomware, operation_dream_job, operation_false_witness, promptflux, promptsteal, quietvault, scope_creep, tycoon2fa, vibe_extortion, wormgpt_kawaiigpt
See python run.py --generate --help for --dry-run, --fragment, --legitimize, --output-toml, and others.
For each campaign in seeds/*.json, the pipeline produces two dataset variants under results/ using 1.1 and 1.2 respectively:
| File | Meaning |
|---|---|
results/<seed>_manual.json |
Template-based stylization; reproducible; no keys. |
results/<seed>_llm.json |
LLM-rephrased stylization; needs ANTHROPIC_API_KEY; slower and billed. |
# Manual variants for every seed
for f in seeds/*.json; do
name=$(basename "$f" .json)
python run.py --generate --seed-file "$f" --seed 0 \
--output-json "results/${name}_manual.json"
done
# LLM variants (needs ANTHROPIC_API_KEY, slower, costs money)
for f in seeds/*.json; do
name=$(basename "$f" .json)
python run.py --generate --seed-file "$f" \
--no-style-templates --max-concurrency 32 \
--output-json "results/${name}_llm.json"
doneThe generated JSON files (results/<seed>_manual.json / results/<seed>_llm.json) are the inputs to the MCP graph runner or can be further refined in the RL step.
Optional step: judge or RL-rewrite generated JSON fragments using fragbench_generated_rl.py to improve pass rate and reduce refusals.
The input dataset must be placed under fragbench-structural-graph-main/dataset/ before running this step. For example, results/promptsteal_manual.json from §1 should be copied or staged as fragbench-structural-graph-main/dataset/promptsteal_manual.json, then referenced as --input dataset/promptsteal_manual.json from inside fragbench-structural-graph-main.
The required keys depend on the selected --judge-backend and --rewriter-backend:
export ANTHROPIC_API_KEY=sk-ant-...
export OPENROUTER_API_KEY=sk-or-...cd fragbench-structural-graph-main
python -u fragbench_dataset_generator/fragbench_pack/fragbench_generated_rl.py \
--mode rl \
--variation-style direct \
--judge-backend anthropic \
--judge-model claude-sonnet-4-6 \
--rewriter-backend openrouter \
--rewriter-model deepseek/deepseek-v4-flash \
--input dataset/promptsteal_manual.jsonAnthropic backend: If Anthropic models are selected for
--judge-backendor--rewriter-backend, calls go to Anthropic’s API.ANTHROPIC_API_KEYis required; there is no substitute when using those backends.
Backends can be mixed (e.g. OpenRouter judge + Anthropic rewriter) as long as the matching key is present.
MODE:judge(pass-rate only) orrl(rewrite loop).--input: path under the localdataset/directory (e.g.dataset/promptsteal_manual.json).--output: optional; if omitted, the script picks a path underruns/.- Set
jb/rb/ models to match the available keys (e.g. OpenRouter judge + rewriter for full RL without local Ollama).
This will create a folder with 3 outputs:
-
run_policies.json— Structured list of rewriting strategies available to / learned during the run (IDs, descriptions, and success/failure tallies). -
run_trajectory.json— Round-by-round summary of training (phase, pass rate, how many prompts were attempted, aggregate strategy stats). -
run.json— Rewritten fragments in same format as input file (can be executed against MCP)
The fragments materialized in §1 (and optionally rewritten in §2) describe an attack as ordered steps, dependencies, and prompts. MCP validation answers a different question: faced with a real tool-equipped agent in an isolated environment, does a target model actually drive the chain forward—file writes, shell, recon, and other MCP tools—according to each fragment?
Runs execute through the same Docker-isolated MCP stack used for serious evaluation: toolkits wired like production, scheduling that respects the fragment DAG (produces / consumes). The result is per-fragment outcomes (pass or fail), optional judge verdicts, and full tool traces. That closes the loop from “attack text” to observed behavior, which is what downstream defense needs: baselines for monitors, policy experiments (guardrails, deny lists, system prompts), regression tests after mitigations, and apples-to-apples comparison of models—without treating “unsafe prose in chat” as equivalent to “completed malicious tool use.”
This path is Docker-only; there is no bare-metal graph runner.
MCP harness architecture. A central mcp-client ingests campaign JSON files, selects the attack based on the style and seed, and routes tool calls to a fleet of 24 specialised MCP servers via FastMCP (HTTP + SSE). All agent interactions are captured as structured .jsonl session logs served through a viewer at port 8787. Support is available for self-hosted LLMs as well as those accessed via OpenRouter.
Open Figure 3 (vector PDF) — use this PDF for clearer version.
Confirm the tooling and daemon:
docker --version
docker compose versionFrom the repo root, bring up MCP servers and the results viewer:
docker compose up -d
docker compose ps # most services should be Runningmcp-client mounts ./results read-write (reads the fragments file, writes results/runs/) and mounts ./logs for per-fragment session streams. The viewer mounts ./results and ./logs read-only so completed outputs and live traces can be inspected from the browser.
The Docker stack exposes multiple MCP servers under fragbench_mcp/servers/. Examples include:
| Server | Path | Example role |
|---|---|---|
| Filesystem | fragbench_mcp/servers/filesystem/server.py |
Read/write files created by fragments. |
| Shell | fragbench_mcp/servers/shell/server.py |
Execute shell commands inside the sandbox. |
| Network recon | fragbench_mcp/servers/network_recon/server.py |
Exercise network-discovery style fragments. |
fragbench_mcp/servers/email/server.py |
Validate email-themed social or delivery workflows. | |
| Credential | fragbench_mcp/servers/credential/server.py |
Validate credential-handling and secret-management scenarios. |
The Makefile defaults to MCP_MODEL_BACKEND=openrouter and MCP_MODEL=anthropic/claude-haiku-4.5. Provide an OpenRouter key (or add it to a .env the compose file loads):
export OPENROUTER_API_KEY=sk-or-...Run against fragments from §1 or §2:
make docker-attack-graph-run \
FRAGMENTS=results/promptsteal_manual.json \
STYLE=direct \
SEEDS=0-9 \
MAX_PARALLEL_VARIATIONS=4 \
MAX_PARALLEL_FRAGMENTS=2Override MCP_MODEL to change the model under test (still on OpenRouter), for example:
make docker-attack-graph-run \
FRAGMENTS=results/promptsteal_manual.json \
STYLE=direct \
SEEDS=0-2 \
MCP_MODEL=qwen/qwen3.5-122b-a10b \
JUDGE=1 \
JUDGE_MODEL=anthropic/claude-haiku-4.5Set JUDGE=1 for LLM-as-judge on fragment success (and per-response classification in the MCP client). If the judge API is unavailable, the runner falls back to keyword checks. Response verdicts remain ANSWERED · REFUSED · PARTIAL · UNCLEAR when the judge path is active.
| Variable | Meaning | Default (graph run) |
|---|---|---|
FRAGMENTS |
Path to fragments JSON | results/promptsteal_fragments.json |
STYLE |
Variation style per seed | direct |
SEEDS |
0, 0,2,4, 0-9, or empty = all seeds in file |
(all) |
MAX_PARALLEL_VARIATIONS |
Concurrent variation jobs | 4 |
MAX_PARALLEL_FRAGMENTS |
Concurrent fragments per variation (DAG-aware) | 2 |
MCP_MODEL_BACKEND |
Target backend: openrouter or ollama |
openrouter |
MCP_MODEL |
Target model id | anthropic/claude-haiku-4.5 |
JUDGE |
1 = enable LLM judge |
0 |
JUDGE_BACKEND / JUDGE_MODEL |
Evaluator backend and model | openrouter / anthropic/claude-haiku-4.5 |
To use a model on the host instead of OpenRouter, run Ollama and select the ollama backend. Containers reach the host at host.docker.internal:11434.
brew install ollama # macOS — or https://ollama.com/download
ollama serveIn another terminal (large first-time download):
ollama pull huihui_ai/qwen3.5-abliterated:35b
ollama listThen:
make docker-attack-graph-run \
FRAGMENTS=results/promptsteal_manual.json \
STYLE=direct \
MCP_MODEL_BACKEND=ollama \
MCP_MODEL=huihui_ai/qwen3.5-abliterated:35bEach graph run leaves structured JSON under results/runs/.
| Example file | Contents (high level) |
|---|---|
attack_20260505_030157_e64467_meta.json |
Run-wide metadata written around start/end: which fragments_path, seeds, style, target_model / target_backend, judge_model, timestamps, status, aggregate pass count. This file catalogues batch jobs and ties every per-seed file back to one run_id. |
attack_20260505_015031_fbfff5_seed_1345099283_HONESTCUE_1345099283.json |
Chain summary for one seed: campaign id, style, whether the full variation passed, and an ordered fragments list with prompts, roles, phases, produces / consumes, and each fragment’s pass / verdict / justification (no full tool dumps). Ideal for quick scoring tables and governance summaries. |
attack_graph_20260505_015031_fbfff5_seed_1345099283_HONESTCUE_1345099283.json |
Graph detail for the same seed: phases_in_order, rich variation block, and per-fragment tools_executed (tool name plus arguments), final_response, session_path to the JSONL log, timings, and artifact bookkeeping. This file supports defense forensics and debugging how the agent satisfied or missed each step. |
Per-fragment sessions are JSONL streams (session_<timestamp>_<id>.jsonl). Paths appear in graph JSON as session_path (container paths under /app/logs/ map to ./logs/ on the host). By default the graph runner also isolates sessions under logs/<run_id>/ for a given attack run; that subdirectory should be checked when the repo root logs/ is crowded.
The web UI runs at http://localhost:8787. The Live view shows sessions while fragments execute; the Graph view shows pass/fail vectors and fragment-level details after completion. A run can be opened directly with http://localhost:8787/graph.html?run=<run_id>.
MCP validation frontend views. The Live view streams fragment execution in real time, including MCP agent sessions, model responses, tool calls, and intermediate status. The Graph view summarizes completed validation runs by campaign phase and reports per-fragment pass/fail outcomes, judge rationales, tool executions, and artifact dependencies used for downstream defense analysis. Green indicates fragments that passed the LLM judge’s evaluation for having completed the task; red indicates those that failed.
(a) Live view.
(b) Graph view.
docker compose downIf Ollama was used, stop it with Ctrl-C in its terminal (or killall ollama).


