2026-05-19 · Benchmark

The auth-protected bugs every web scanner misses

The unauth pass

We pointed pentest-ai at a local copy of OWASP Juice Shop. Default config, no auth profile, intensity set to normal. The kind of run you get if you install the tool, copy the quickstart, and hit go.

It surfaced 58 findings. The mix was about what you'd expect from any decent scanner on Juice Shop: reflected XSS sinks, a handful of information disclosure paths, weak or missing security headers, error messages that leak stack frames, and a few exposed metadata files like ftp/package.json.bak and robots.txt entries that point at things they shouldn't.

Most of those 58 would also show up in ZAP or Nuclei on the same target. Nothing remarkable. If we stopped here, the benchmark would be "pentest-ai is roughly comparable to the open-source scanners," which is fine but not interesting.

The more honest framing is what those 58 findings don't include. They don't include anything that requires being logged in. They don't include anything that requires reasoning across more than one request. In a real app, that's where most of the interesting bugs live. The whole reason you have a login page is that the sensitive stuff sits behind it.

The auth turn

We added a bearer-token auth profile to the engagement and re-ran the same target. The profile is short: tell ptai which header carries the token, give it a witness endpoint to confirm the session is alive, and let it go.
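To make the two ingredients concrete, here is a rough Python sketch of what a header-plus-witness check amounts to. This is not ptai's config format or code: AuthProfile, session_alive, the get callable, the "Bearer " prefix, and the choice of 200 as the "alive" signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AuthProfile:
    """Hypothetical stand-in for a bearer-token auth profile."""
    header: str   # header that carries the token, e.g. "Authorization"
    witness: str  # endpoint expected to answer 200 only for a live session


# Hypothetical transport: takes (path, headers) and returns an HTTP status code.
Get = Callable[[str, Dict[str, str]], int]


def session_alive(profile: AuthProfile, token: str, get: Get) -> bool:
    """Hit the witness endpoint with the token and report whether it was accepted."""
    status = get(profile.witness, {profile.header: f"Bearer {token}"})
    return status == 200
```

Re-running a check like this between steps is what lets a long authenticated run notice an expired token instead of silently collecting 401s.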

Total finding count went up modestly. More endpoints became visible to the recon pass, so the long tail of low-severity issues grew a bit. That's the boring part of the result. The interesting part is that four exploit chains auto-composed during the authenticated run. None of them showed up in the unauth pass. None of them are findable by a single-shot request.

A chain, the way we use the word, is a sequence of requests where each one is informed by the previous response. The first request reveals something. The agent decides what to try next based on what came back. The chain is logged with the steps that made it work, so you can replay it by hand.
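One way to picture that loop is the minimal Python sketch below. It is an illustration of the idea, not ptai's implementation: Step, Planner, and run_chain are hypothetical names, and requests and responses are reduced to plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    request: str   # what was sent
    response: str  # what came back


# A planner sees the previous response (None on the first turn) and
# returns the next request to send, or None to end the chain.
Planner = Callable[[Optional[str]], Optional[str]]


def run_chain(send: Callable[[str], str], plan: Planner, max_steps: int = 10) -> list[Step]:
    """Send requests one at a time, letting each response choose the next request."""
    log: list[Step] = []
    last: Optional[str] = None
    for _ in range(max_steps):
        request = plan(last)
        if request is None:
            break
        last = send(request)
        log.append(Step(request, last))
    return log
```

The returned log is the replayable part: each entry pairs a request with the response that motivated the next one.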

This is the part most scanners can't do. They're stateless by design. They walk a list of endpoints, classify the response against a ruleset, and move on. There's no "what did the last request reveal, and what does that suggest about the next one?" loop. ZAP and Burp give you the building blocks to script this yourself; almost nobody does, because writing the glue is the whole job.

The four chains

Here's what auto-composed on the authenticated Juice Shop run. Real output from the 2026-05-15 benchmark, not a curated list.

1. BOLA on basket

Juice Shop assigns each user their own shopping basket, identified by a small integer in the URL. With a valid bearer token, you can request other users' baskets by changing that integer. ptai found this by fetching its own basket first, noting the ID, then probing adjacent IDs and comparing response shapes. The chain logs as: log in, identify own basket ID, probe adjacent IDs, confirm response differs from a 403 or 404, file as BOLA. Scanners typically miss this because the endpoint pattern /rest/basket/{id} looks fine to an unauthenticated probe (it just 401s), and authenticated probes without state comparison can't tell whether the response is your basket or someone else's.
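The state comparison in that chain can be sketched in a few lines of Python. Everything here is an assumption for illustration: the fetch callable stands in for an authenticated GET on /rest/basket/{id}, and "response shape" is approximated by the set of top-level JSON keys, which is cruder than whatever ptai actually compares.

```python
from typing import Callable, Tuple

# Hypothetical transport: takes a basket ID, returns (status_code, parsed JSON body).
Fetch = Callable[[int], Tuple[int, dict]]


def probe_adjacent_baskets(fetch: Fetch, own_id: int, radius: int = 2) -> list[int]:
    """Return neighbouring basket IDs that answer the same way our own basket does."""
    own_status, own_body = fetch(own_id)
    if own_status != 200:
        return []  # no baseline to compare against
    own_shape = set(own_body)  # top-level keys stand in for "response shape"
    leaked = []
    for candidate in range(max(1, own_id - radius), own_id + radius + 1):
        if candidate == own_id:
            continue
        status, body = fetch(candidate)
        # A 200 shaped like our own basket, rather than a 403 or 404,
        # suggests we are reading another user's data.
        if status == 200 and set(body) == own_shape:
            leaked.append(candidate)
    return leaked
```

The baseline request is the point: without knowing what your own basket looks like, a 200 on a neighbouring ID is just a 200.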

2. JWT confusion

Juice Shop signs its JWTs with HS256. ptai grabbed its own token, decoded it, set the alg header to none, re-signed with an empty key, and replayed against a protected endpoint. It came back 200 instead of 401. The chain logs as: intercept own JWT, modify alg header, re-sign with empty key, replay, confirm 200. What scanners miss here is the re-sign-and-replay step. Detecting "this app uses JWT" is easy. Trying a malformed alg against a real protected endpoint and noticing the response code changed is a multi-step loop, not a single signature.

3. File upload pivot

The avatar upload endpoint accepts arbitrary MIME types. ptai uploaded a small test file and observed that the upload was served back with Content-Type: text/html. That's enough to flag a chain leading to stored XSS or, on a less sandboxed target, worse. Juice Shop sandboxes the file serving enough that we didn't get to RCE, but the chain is what matters. The steps were: probe upload endpoint with varied MIME types, retrieve uploaded file, inspect served Content-Type, flag the gap between accepted and served types. Scanners miss this because they tend to test upload endpoints in isolation, never fetching the file back to see how it gets served.
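The fetch-it-back step is easy to sketch. In the Python below, upload and fetch_type are hypothetical callables standing in for the real HTTP calls, and ACTIVE_TYPES is an illustrative, incomplete list of content types a browser will render or execute. The logic flags only cases where the served type differs from the declared one and is browser-active, which is a narrower rule than a real tool would likely apply.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Illustrative only: content types a browser treats as active content.
ACTIVE_TYPES = {"text/html", "image/svg+xml", "application/xhtml+xml"}

# Hypothetical transports:
#   upload(declared_type, body) -> URL the file is served from, or None if rejected
#   fetch_type(url)             -> Content-Type header seen when fetching the file back
Upload = Callable[[str, bytes], Optional[str]]
FetchType = Callable[[str], str]


def find_served_type_gaps(
    upload: Upload,
    fetch_type: FetchType,
    declared_types: Iterable[str],
    body: bytes = b"probe",
) -> List[Tuple[str, str, str]]:
    """Return (declared, served, url) wherever an upload comes back as active content."""
    findings = []
    for declared in declared_types:
        url = upload(declared, body)
        if url is None:
            continue  # endpoint rejected this MIME type
        # Strip parameters such as "; charset=utf-8" before comparing.
        served = fetch_type(url).split(";")[0].strip().lower()
        if served != declared.lower() and served in ACTIVE_TYPES:
            findings.append((declared, served, url))
    return findings
```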

4. Admin endpoint via header trust

Juice Shop has an admin section gated by JWT role. While poking at authenticated requests, ptai noticed that certain admin-adjacent endpoints also checked a custom header called x-admin. Setting x-admin: true on an already-authenticated request returned admin data. Stupid bug, real bug. ptai found it by varying request headers on endpoints it already had access to, not by guessing admin URLs. Scanners miss this because their header fuzzing, where it exists, is usually run against unauthenticated baselines. The signal only appears once you're inside.
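Header variation against an authenticated baseline looks roughly like the Python sketch below. The send callable and the candidate header list are hypothetical, and comparing whole (status, body) pairs is a simplification: a real comparison would need to tolerate timestamps and other noise in the body.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical transport: takes request headers, returns (status_code, body).
Send = Callable[[Dict[str, str]], Tuple[int, str]]


def vary_headers(
    send: Send,
    base_headers: Dict[str, str],
    candidates: Dict[str, str],
) -> List[str]:
    """Return the candidate header names that change the response on an authenticated request."""
    baseline = send(dict(base_headers))
    changed = []
    for name, value in candidates.items():
        # Keep the auth header in place and add exactly one candidate at a time.
        result = send({**base_headers, name: value})
        if result != baseline:
            changed.append(name)
    return changed
```

Run without the Authorization header in base_headers, every variant collapses to the same 401 and the difference disappears, which is the unauthenticated-baseline problem in miniature.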

The honest part

Scanners can do auth. ZAP has session management. Burp has macros. Nuclei has auth templates. The capability is there. The problem is that wiring it up is fiddly, default profiles ship unauthenticated, and most people running these tools never get past the default.

pentest-ai's contribution isn't that authenticated scanning is novel. It's that the agent loop treats "log in, then keep going" as the standard workflow rather than something you bolt on. The bearer-token profile is two lines in the engagement config. After that, the same agent that ran the unauth pass keeps running, sees more, and chains what it sees.

A number we'd be careful about: the 58 unauth findings and 4 auth chains came from one app on one benchmark run. Different targets will produce different shapes. The honest claim is depth past the login wall, not more findings overall. If you're scanning a public marketing site, none of this matters. If you're scanning the part of your product where users actually do things, it does.