2026-05-19 · Architecture

Why we built pentest-ai as an MCP server, not just a CLI

pentest-ai started as a CLI. It still is one. pip install ptai, run ptai scan, get a report. That works.

The problem with a CLI is that it asks the human to plan the engagement. You need to know which scanner to run first, when to pivot from recon to exploitation, when a finding is worth chasing and when it's a false positive. We knew how to do that. Most people who'd want to use ptai don't, and shouldn't have to.

So we wrote an MCP server.

The hypothesis

If the planning lives in an LLM client, the CLI just has to expose primitives. Things like start_engagement, http_request, run_probe, save_finding. The client decides what to call next based on what came back.
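
As a rough illustration of "expose primitives, let the client choose", here is a minimal sketch of a name-to-function dispatch table. It is not ptai's actual code: the registry, the `tool` decorator, `handle_call`, the argument names, and the placeholder return values are all assumptions made for the example. Only the tool names come from the text above.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps a tool name to the function that implements it.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def tool(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Register a function as a callable primitive under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def start_engagement(scope: list) -> Dict[str, Any]:
    # Placeholder body: a real server would persist the engagement somewhere.
    return {"engagement_id": "eng-1", "scope": scope}


@tool
def save_finding(engagement_id: str, title: str, note: str) -> Dict[str, Any]:
    # Placeholder body: a real server would store the finding.
    return {"engagement_id": engagement_id, "saved": title}


def handle_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Answer one call. No planning happens here; the client picks what comes next."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**arguments)
```

The point of the shape is that `handle_call` has no notion of ordering. Whatever sequence of calls makes up an engagement is decided on the other side of the protocol.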

Claude Code is already running on most of our laptops. It has a tool-use loop and a context window large enough to hold an entire engagement's state. It can read the SARIF output, decide a finding looks interesting, and ask for a follow-up scan. That's the whole job.

We argued about this for a while. There's a version of pentest-ai where the planning lives inside the binary, written as a state machine, with the LLM only consulted when the state machine is stuck. We built a prototype like that. It was fine. It was also a permanent maintenance burden: every new playbook had to be coded twice, once in the state machine and once in the prompt the LLM saw when the state machine punted. Moving the planning fully out solved that.

What the model actually does

A live engagement looks like this. The user types something like "scan staging.example.com, you have a bearer token in $STAGING_TOKEN". Claude Code calls start_engagement with the scope, picks up the token from the environment, and gets back an engagement ID. From there it runs recon probes, reads the responses, and picks the next probe based on what came back. The MCP server doesn't decide any of that. It just answers calls.

When a probe finds something promising, the model files it through save_finding with the request, the response, and a short note about why it's worth keeping. At the end of the run it calls generate_report and gets a SARIF file plus a markdown summary. We log the whole tool-call trace as the audit trail; if a customer asks "why did you hit this endpoint", the answer is right there in the transcript.
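
To make the audit-trail idea concrete, here is a small sketch of recording each tool call as one JSON line and later answering "why did you hit this endpoint". The record layout and the helper names `record_call` and `calls_touching` are assumptions for illustration, not ptai's actual trace format.

```python
import json
import time
from typing import Any, Dict, List


def record_call(trace: List[str], tool: str,
                arguments: Dict[str, Any], result: Dict[str, Any]) -> None:
    """Append one tool call to the trace as a JSON line."""
    entry = {"ts": time.time(), "tool": tool,
             "arguments": arguments, "result": result}
    trace.append(json.dumps(entry, sort_keys=True))


def calls_touching(trace: List[str], url_fragment: str) -> List[Dict[str, Any]]:
    """Return the recorded calls whose arguments mention the given URL fragment."""
    hits = []
    for line in trace:
        entry = json.loads(line)
        if url_fragment in json.dumps(entry["arguments"]):
            hits.append(entry)
    return hits
```

In a real server the trace would be written to disk rather than kept in a list; the list keeps the example self-contained.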

What MCP gave us

Four things, mainly.

Tool calls instead of subcommands. The model doesn't have to memorize the CLI surface. Each tool advertises a JSON schema; the model picks the right one and fills in the arguments. Auth profiles, scope rules, intensity settings all become tool parameters instead of command-line flags.
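
For a sense of what "advertises a JSON schema" means in practice, here is a hedged sketch of a tool descriptor. The `name` / `description` / `inputSchema` layout follows how MCP describes tools; the specific properties (`auth_profile`, `intensity`, the allowed values) are guesses for illustration and may not match ptai's real schema. The `check_arguments` helper is a deliberately tiny stand-in for a real JSON Schema validator.

```python
from typing import Any, Dict, List

# Hypothetical descriptor for the http_request tool.
HTTP_REQUEST_TOOL: Dict[str, Any] = {
    "name": "http_request",
    "description": "Send one HTTP request inside the current engagement.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "engagement_id": {"type": "string"},
            "method": {"type": "string", "enum": ["GET", "POST", "PUT", "DELETE"]},
            "url": {"type": "string"},
            # What would have been CLI flags become ordinary parameters.
            "auth_profile": {"type": "string"},
            "intensity": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["engagement_id", "method", "url"],
    },
}


def check_arguments(schema: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Tiny subset of JSON Schema: required keys, string type, and enum values only."""
    problems = []
    for key in schema.get("required", []):
        if key not in arguments:
            problems.append(f"missing: {key}")
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected: {key}")
        elif spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"not a string: {key}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"not allowed: {key}={value}")
    return problems
```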

Conversation instead of bash history. When a scan finds an SSRF, the model can ask "what other endpoints take a URL parameter?" and run a second probe targeted at the answer. With a CLI you'd be writing a shell loop.

Auth that survives. Engagements set up a bearer token once. Every subsequent http_request tool call reuses it. We had to write a small process-local cache to make this work, and that turned out to be the fiddliest part of the whole thing.

Scope enforcement at the protocol layer. Every tool call runs through a scope check before it touches the network. The model can ask to hit any URL it wants. The server says no if the URL isn't in scope, and the refusal goes back into the conversation. That's a much better place to enforce scope than inside a prompt where the model might forget the boundaries an hour later.
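
A minimal sketch of what a scope gate in front of the network can look like, assuming scope is expressed as a set of exact hostnames. That is an assumption for the example; ptai's real scope rules may be richer. `in_scope`, `guarded_http_request`, and the `send` callback are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterable
from urllib.parse import urlsplit


def in_scope(url: str, allowed_hosts: Iterable[str]) -> bool:
    """True only for http(s) URLs whose hostname exactly matches an allowed host."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname is lowercased and excludes any userinfo or port,
    # so "https://allowed.example@evil.example/" resolves to evil.example.
    host = parts.hostname or ""
    return host in {h.lower() for h in allowed_hosts}


def guarded_http_request(url: str, allowed_hosts: Iterable[str],
                         send: Callable[[str], Dict[str, Any]]) -> Dict[str, Any]:
    """Run the scope check first; only in-scope URLs ever reach `send`."""
    if not in_scope(url, allowed_hosts):
        # The refusal is an ordinary tool result, so the model sees it.
        return {"error": "out_of_scope", "url": url}
    return send(url)
```

Returning the refusal as data rather than raising is the part that matters: the model gets the "no" in its context and can plan around it.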

What broke

The auth cache, twice. First version stored tokens in the engagement DB row; that meant every tool call hit SQLite. Second version cached in-process but didn't survive when the MCP subprocess restarted. The third version (now live in 0.15.2) caches in-process with a fallback to a credential file in the engagement directory.
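
The third design can be sketched roughly as follows: look in memory first, fall back to a file in the engagement directory, and repopulate memory on a hit. This is an illustration under assumptions, not the code shipped in 0.15.2. The class name, the `credentials.json` filename, and the JSON layout are all made up for the example.

```python
import json
import os
from pathlib import Path
from typing import Dict, Optional


class TokenCache:
    """In-process token cache with a file fallback that survives restarts."""

    def __init__(self, engagement_dir: Path) -> None:
        self._memory: Dict[str, str] = {}
        # Hypothetical filename inside the engagement directory.
        self._path = Path(engagement_dir) / "credentials.json"

    def put(self, engagement_id: str, token: str) -> None:
        self._memory[engagement_id] = token
        stored = self._read_file()
        stored[engagement_id] = token
        # Owner-only permissions; the mode applies when the file is first created.
        fd = os.open(self._path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as fh:
            json.dump(stored, fh)

    def get(self, engagement_id: str) -> Optional[str]:
        token = self._memory.get(engagement_id)
        if token is None:
            # Fast path missed, e.g. after a subprocess restart: try the file.
            token = self._read_file().get(engagement_id)
            if token is not None:
                self._memory[engagement_id] = token
        return token

    def _read_file(self) -> Dict[str, str]:
        try:
            return json.loads(self._path.read_text())
        except FileNotFoundError:
            return {}
```

A fresh `TokenCache` pointed at the same directory stands in for a restarted subprocess: its memory is empty, but `get` still finds the token on disk.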

Parallel fan-out hurt more than it helped. Early on we let the model fire 20 http_request calls in parallel. The MCP framing protocol can handle that. Claude Code's UI cannot: large response bodies plus permission prompts plus parallel streams locked the UI for minutes at a time. We documented the lesson and went serial. The benchmark didn't get slower; it got faster, because the model could read each response before deciding on the next call.

Notification triggers. MCP supports server-initiated notifications. We tried using them. Something between our server and Claude Code's notification handler caused the client to hang on subsequent tool calls. We pulled the notifications out and the hang stopped. We haven't tracked down whether the bug is ours or the client's; it wasn't worth blocking the release on.

Tool surface size. We started with 47 MCP tools and almost immediately wanted to add more. The temptation is to expose every CLI subcommand as its own tool. We resisted that. Each tool added to the surface is something the model has to skim past to find the one it actually wants, and each one is a place where the JSON schema can drift from the implementation. The current rule of thumb: a tool earns its place if it represents a distinct verb in an engagement, not a configuration option of another verb.

What we gave up

A CLI you can pipe into other CLIs. ptai scan | jq still works against the JSON output, but the headline path now assumes you're in an LLM session. People who want the old "scan and dump SARIF" workflow get it, but the docs lead with the MCP path because that's where the leverage is.

Determinism. A CLI run twice with the same flags does the same thing. An MCP-driven run picks slightly different probes depending on what the model decides is interesting. We added an intensity parameter and a strict_scope flag to bound the variance, but two runs are not identical. For a security tool that's a real cost. We think the upside (the model chasing the right thread without being told) is worth it; we won't pretend the tradeoff isn't there.

Where it lands

Same 194 tools. Same scanners under the hood. The difference is who's driving.

If you want to try it:

pip install ptai
claude mcp add pentest-ai ptai mcp

Then in Claude Code, ask it to pentest something. The session starts the engagement, runs probes, validates findings, and writes a report. You read the conversation; the conversation IS the audit trail.

We think this is roughly what most engineering tools are going to look like in 18 months. Primitives in the tool, planning in the model. The protocol is the glue. We just got here first for pentesting.

Source: github.com/0xSteph/pentest-ai · Install: pip install ptai