Whitepaper 07: Token Economics of Autonomous Pentesting

Author: Khushal Suthar
Date: June 2026
Series: Autonomous Penetration Testing with AI Agents
Category: Analysis - Cost Models and Economic Constraints

Executive Summary

Autonomous penetration testing is not merely a technical challenge — it is an economic one. Every inference call, every tool invocation, every context assembly consumes tokens, and tokens cost money. This paper builds a comprehensive cost model for autonomous pentesting, starting from unit-level token economics and extending through engagement-level projections, fleet-scale annual budgets, and sensitivity analyses against the dominant variables that drive cost.

The central finding is stark and actionable: naive agent designs cost $800–$2,000 per engagement, placing them at or above the cost of junior human testers and eliminating the commercial case for automation. Optimized designs, which combine context management, model tiering, prompt caching, batch processing, and output compression, bring per-engagement cost to $40–$80, a 10–50× reduction that makes autonomous pentesting not merely viable but dominant on a cost-per-engagement basis.

This paper provides the mathematical foundations for that claim. We develop a parametric cost model, populate it with realistic benchmark data, run scenario and sensitivity analyses, and map the design space where autonomous pentesting crosses the viability threshold against human penetration testing. We show that token economics is not a downstream concern to be optimized after functionality is achieved — it is a first-order design constraint that shapes agent architecture, tool selection, and engagement scope from the earliest stages of system design.

1. Why Token Economics Matters

The economics of AI-powered security tools are often treated as an afterthought. Demos run on small targets with short sessions; costs are absorbed as "R&D." When teams move to production engagements — multi-hour, multi-target, multi-phase assessments — the cost curve becomes the dominant constraint. An agent that costs $2,000 in tokens to assess a single /24 subnet is not a product; it is an experiment that cannot scale.

Token economics matters for three structural reasons:

1.1 Direct Cost

Model API calls are priced per token. An autonomous pentest is a sequence of hundreds to thousands of calls, each consuming input context and generating output reasoning. The aggregate cost per engagement determines whether the product is commercially viable. At $0.15–$0.75 per call and 500–1,500 calls per engagement, the per-engagement cost ranges from $75 to over $1,100 before any optimizations — and naive architectures push this far higher.

1.2 Architectural Pressure

Cost is not independent of architecture — it is shaped by architecture. Choices about context management, tool granularity, model selection, and agent topology that are made for functional reasons have order-of-magnitude cost implications. An agent that retrieves full conversation history on every call pays 5–10× more than one that maintains a compressed working context. An agent that routes every reasoning step to a frontier model pays 10–20× more than one with intelligent model tiering. These are not optimizations applied after the fact; they are architectural commitments made at design time.

1.3 Competitive Positioning

Autonomous pentesting does not compete with "free." It competes with human penetration testers billing $200–$500 per hour, with engagements typically priced at $10,000–$40,000 for a one-to-two-week assessment. If the AI agent's per-engagement cost approaches or exceeds the human cost, the value proposition collapses — the agent must be both cheaper and faster to justify adoption, regulatory acceptance, and the organizational change management required to integrate automated testing into security workflows.

This paper builds a cost model from first principles, populates it with realistic parameters derived from production-style engagements, and explores the design space where autonomous pentesting becomes economically viable.

2. The Unit Economics: What a Token Costs

2.1 Pricing Structure

As of mid-2026, frontier-class model pricing typically follows this structure across major providers:

| Tier | Representative Models | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Context Window |
|---|---|---|---|---|
| Frontier (largest) | GPT-4-class, Claude Opus-class | $3–$15 | $15–$75 | 200K–1M |
| Mid-tier (capable, efficient) | GPT-4-mini-class, Claude Sonnet-class | $0.50–$3 | $2–$15 | 128K–256K |
| Fast/cheap (classification, routing) | GPT-4-mini-small, Haiku-class | $0.10–$0.50 | $0.50–$3 | 32K–128K |
| Open-weights (self-hosted) | Llama, Mistral, Qwen, DeepSeek | Compute cost only | Compute cost only | Hardware-bounded |

For concrete reference, we anchor the cost model on widely deployed pricing points:

| Model | Input ($/M) | Output ($/M) | Cached Input ($/M) | Context Window |
|---|---|---|---|---|
| GPT-4 (frontier) | $10.00 | $30.00 | $5.00 | 128K |
| GPT-4-mini (mid-tier) | $1.50 | $6.00 | $0.75 | 128K |
| GPT-4-mini-small (fast) | $0.50 | $2.00 | $0.25 | 64K |
| Llama-3-70B (self-hosted)\* | ~$0.60 | ~$0.60 | N/A | 32K–128K |

\*Self-hosted cost is computed from GPU rental, not per-token pricing; see Section 2.3.

Output tokens are consistently 3–10× more expensive than input tokens across the API tiers (3–4× at the anchor prices above; self-hosted inference has no such asymmetry). This asymmetry is critical: an agent that generates verbose chain-of-thought reasoning pays a premium on every call. Output compression (Section 5.4) is therefore disproportionately impactful.

2.2 The Hidden Costs

Beyond raw token pricing, several hidden costs compound across an engagement:

Context caching overhead. Providers offering prompt caching discount cached input tokens by 50–80%, but cache hit rates depend on context stability. An agent with dynamic context assembly that varies the prefix on every call sees near-zero cache hit rates. Well-structured agents with stable system prompts and slow-changing persistent context can achieve 60–90% cache hit rates, dramatically reducing effective input cost.

Tool-call overhead. Many providers charge for the structural tokens of tool-call schemas, function definitions, and tool-result framing. These are not "content" but they are billed. A typical tool schema for a pentesting toolkit (nmap, nikto, sqlmap, hydra, etc.) adds 3,000–8,000 tokens to every call that includes tool definitions. For an agent making 800 calls with 5,000 tokens of schema overhead, that is 4M tokens of schema alone: $40 at the $10/M uncached frontier input price ($20 if billed at the $5/M cached rate), or roughly 6% of the ~$696 naive engagement cost derived in Section 3.4.

Retries and errors. Failed calls (rate limits, malformed outputs, safety refusals, context-length violations) consume tokens without producing progress. Production agents typically see 5–15% retry rates. A 10% retry rate effectively increases cost by ~10% and extends engagement wall-clock time.

Embedding costs. RAG pipelines require embeddings for retrieval over prior findings, tool output archives, and knowledge bases. At $0.10–$0.20 per 1M tokens, this is cheap per call but accumulates across millions of stored tokens over an engagement. For a 50M-token engagement corpus, embedding cost is ~$5–$10 — small but nonzero.

Latency cost. Not a direct dollar cost, but a throughput cost. Slow calls (frontier models at 20–60 seconds per response) limit how much work an agent can do per hour, extending engagement duration and increasing the cost of compute infrastructure, monitoring systems, and human oversight time. Latency also creates a cascade effect: longer engagements mean more context drift, more cache invalidation, and more retry exposure.

2.3 The Self-Hosting Alternative

Open-weights models (Llama, Mistral, Qwen, DeepSeek families) eliminate per-token API cost but introduce GPU infrastructure cost. A single H100-class GPU leased at ~$2–$4/hour can serve a 70B-class model at approximately 500–2,000 output tokens/second of aggregate throughput when requests are batched; a single sequential request decodes far more slowly. For an agent making 50–200 sequential calls requiring mid-tier reasoning, self-hosting can be cost-effective, but it requires infrastructure investment and ops expertise, and it means accepting lower per-call quality than frontier models.

The break-even calculation for self-hosting vs. API:

Break-even hours/month = (Monthly API cost for mid-tier calls) / (GPU hourly cost)

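If mid-tier API costs are $500/month and GPU rental is $3/hour, break-even is ~167 GPU-hours per month: that is how much rented GPU time the same $500 buys. Self-hosting wins only if the GPU capacity needed to serve the workload costs less than the API bill it replaces, which in practice requires enough volume to keep the GPU busy. Most small-to-medium pentesting firms do not reach that volume initially; firms running continuous assessment at scale cross it quickly.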
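The break-even formula can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the comparison assumes GPU time is billed only for the hours actually rented.

```python
def break_even_gpu_hours(monthly_api_cost: float, gpu_hourly_cost: float) -> float:
    """GPU-hours per month that cost the same as the monthly API bill."""
    return monthly_api_cost / gpu_hourly_cost


def cheaper_option(monthly_api_cost: float, gpu_hourly_cost: float,
                   gpu_hours_rented: float) -> str:
    """Compare the API bill with the cost of the GPU hours rented to replace it."""
    gpu_cost = gpu_hourly_cost * gpu_hours_rented
    return "self-host" if gpu_cost < monthly_api_cost else "api"


if __name__ == "__main__":
    hours = break_even_gpu_hours(500.0, 3.0)
    print(f"Break-even: {hours:.0f} GPU-hours/month")
    # A GPU reserved around the clock (720 h) versus a $500 API bill.
    print(cheaper_option(500.0, 3.0, 720))
```

The same two functions can be reused with a firm's own monthly API spend and negotiated GPU rate.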

The economically optimal architecture is often hybrid: cheap, self-hosted models for high-volume, low-stakes calls (output parsing, summarization, classification) and frontier API models for high-stakes reasoning (attack chain construction, exploit selection, novel vulnerability analysis). This layered model is explored in detail in Section 5.2.

3. A Cost Model for an Autonomous Engagement

3.1 Engagement Parameters

We define a representative engagement that serves as the baseline for all cost calculations:

Scope: One /24 subnet (254 live hosts), one web application, one Active Directory domain
Duration: 8–12 hours of agent runtime
Phases: Reconnaissance → Enumeration → Vulnerability analysis → Exploitation → Lateral movement → Reporting
Complexity: Medium — mixed environment, typical enterprise exposure surface

This scope is deliberately mid-range. Smaller engagements (single web app) cost proportionally less; larger engagements (multiple /16s, cloud environments) cost proportionally more.

3.2 Per-Call Profile

A single reasoning step (one orchestrator inference call) typically involves the following token components:

Median call (well-managed agent): ~30,000 input tokens, ~1,000 output tokens.
Median call (naive agent): ~70,000 input tokens (context bloat), ~1,500 output tokens (verbose reasoning).

3.3 Call Volume by Phase

An autonomous agent does not make one call per "finding." It makes calls for deciding what tool to run, interpreting each tool result, updating the attack chain, retrieving and correlating prior context, generating intermediate summaries, and producing the final report. The call volume by phase:

Total: 220–640 | 445–1,300

The enumeration phase dominates call volume because it involves the highest tool-to-result ratio: every port scan, service banner, directory brute-force result, and DNS record triggers an interpretation call. Exploitation has fewer calls but higher per-call cost (larger context, more reasoning, frontier model required).

3.4 The Baseline Cost Formula

We define the per-engagement cost as:

C_engagement = Σ call_i, for i = 1..N

where:
call_i = input_i × P_in(model_i) + output_i × P_out(model_i)
N = total LLM calls
P_in, P_out = per-token prices for the model serving call_i

For a naive single-tier architecture using GPT-4 at $10/M input, $30/M output:

Without context management (bloat multiplier ~10×, average input ~70K):

N = 800 calls (mid-range)
Avg input = 70,000 tokens
Avg output = 1,500 tokens
Input cost = 800 × 70,000 × $10/1M = 800 × $0.70 = $560
Output cost = 800 × 1,500 × $30/1M = 800 × $0.045 = $36
Retry overhead (10%) = $59.60
Tool schema overhead (5K tokens × 800 calls × $10/1M) = $40
Total = $560 + $36 + $59.60 + $40 = ~$696 per engagement

With worst-case context bloat (20×, average input ~150K, frontier top-tier pricing $15/M input, $75/M output):

Input cost = 800 × 150,000 × $15/1M = 800 × $2.25 = $1,800
Output cost = 800 × 2,000 × $75/1M = 800 × $0.15 = $120
Retry + overhead = ~$192
Total = ~$2,112 per engagement

These two cases bracket the naive range: roughly $700 to $2,100 per engagement for unoptimized, frontier-only, context-bloated designs, consistent with the $800–$2,000 headline figure.

With managed context (bloat multiplier ~3×, average input ~30K, GPT-4 pricing):

Input cost = 800 × 30,000 × $10/1M = 800 × $0.30 = $240
Output cost = 800 × 1,000 × $30/1M = 800 × $0.03 = $24
Retry + overhead = ~$26
Total = ~$290 per engagement

Context management alone reduces cost from ~$700 to ~$290 — a 2.4× improvement. But $290 is still above the viability threshold for many use cases. The remaining reduction comes from model tiering, caching, and batch processing.
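The three worked cases in this section can be reproduced with a small parametric function. This is a sketch under the section's own assumptions: a single model tier, retry overhead applied as a percentage of input plus output cost, and tool-schema tokens billed as a separate line item only where the worked example lists them. The `CallProfile` name and its fields are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    n_calls: int
    avg_input_tokens: int
    avg_output_tokens: int
    price_in_per_m: float       # $ per 1M input tokens
    price_out_per_m: float      # $ per 1M output tokens
    retry_rate: float = 0.10    # fraction of base cost re-spent on retries
    schema_tokens_per_call: int = 0  # schema tokens billed on top of the input


def engagement_cost(p: CallProfile) -> float:
    """Single-tier engagement cost: input + output + retries + schema overhead."""
    input_cost = p.n_calls * p.avg_input_tokens * p.price_in_per_m / 1_000_000
    output_cost = p.n_calls * p.avg_output_tokens * p.price_out_per_m / 1_000_000
    retry_cost = p.retry_rate * (input_cost + output_cost)
    schema_cost = p.n_calls * p.schema_tokens_per_call * p.price_in_per_m / 1_000_000
    return input_cost + output_cost + retry_cost + schema_cost


naive = engagement_cost(CallProfile(800, 70_000, 1_500, 10.0, 30.0,
                                    schema_tokens_per_call=5_000))
worst = engagement_cost(CallProfile(800, 150_000, 2_000, 15.0, 75.0))
managed = engagement_cost(CallProfile(800, 30_000, 1_000, 10.0, 30.0))

if __name__ == "__main__":
    for name, cost in [("naive", naive), ("worst case", worst), ("managed", managed)]:
        print(f"{name}: ${cost:,.2f}")
```

Varying `avg_input_tokens` alone shows how strongly context size drives the total.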

3.5 The Bloat Multiplier

The critical variable in the cost model is context bloat: the ratio of actual input tokens to the minimum tokens needed for the reasoning step. A well-managed agent operates at a bloat multiplier of ~2–3× (some retrieved context is always wasted, some history is always retained). A poorly managed agent that stuffs all prior tool output and conversation history into every call operates at 10–20×.

The bloat multiplier is the single largest cost lever in the system, and it is almost entirely determined by architectural choices made at design time: context window management strategy, retrieval precision, tool output normalization, and history compression.

4. Where Tokens Go: A Breakdown

Understanding cost requires understanding where tokens are consumed. For a typical engagement with a moderately managed agent (bloat multiplier ~3×), we instrument the token flow across all calls:

The dominant sink is raw tool output. This is also the most reducible sink: a normalizer that compresses an Nmap XML output from 50K to 5K tokens eliminates 45K tokens from every downstream call that references that scan. Because findings propagate through multiple reasoning steps (a port scan result is referenced in the enumeration, vulnerability analysis, and exploitation phases), the savings are multiplied by the number of downstream calls. A single 45K reduction applied across 20 downstream calls saves 900K tokens, or ~$9 at GPT-4 pricing.

The second-largest sink is retrieved context. RAG precision is critical: retrieving 10 chunks of 2K tokens each when only 2 are relevant wastes 16K tokens per call. Improving retrieval precision from 30% to 70% relevance reduces retrieved context cost by more than half.
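The retrieval arithmetic can be made explicit with a short sketch. It assumes fixed 2K-token chunks and that the agent needs a fixed number of relevant chunks per call, so the number of chunks retrieved scales with 1/precision; both function names are illustrative.

```python
def wasted_tokens(chunks_retrieved: int, relevant_chunks: int,
                  chunk_tokens: int = 2_000) -> int:
    """Tokens spent on retrieved chunks that were not relevant."""
    return (chunks_retrieved - relevant_chunks) * chunk_tokens


def tokens_for_relevant(relevant_needed: float, precision: float,
                        chunk_tokens: int = 2_000) -> float:
    """Tokens retrieved to obtain `relevant_needed` relevant chunks at a given precision."""
    return relevant_needed / precision * chunk_tokens


if __name__ == "__main__":
    print(wasted_tokens(10, 2))  # 8 irrelevant chunks of 2K tokens each
    low = tokens_for_relevant(2, 0.30)
    high = tokens_for_relevant(2, 0.70)
    print(f"Reduction from 30% to 70% precision: {1 - high / low:.0%}")
```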

5. Cost Reduction Strategies

5.1 Ingestion Filtering (Highest Leverage)

Deterministic parsers that convert raw tool output to compact structured representations are the single highest-leverage cost intervention. They reduce input tokens at the source, and the savings propagate to every call that touches the finding.

Mechanism: A normalizer converts raw Nmap XML (50K tokens) to a structured JSON summary (5K tokens) containing only the relevant fields: host, port, service, version, state. Similarly, a Nikto output normalizer extracts only confirmed findings, dropping the verbose banner and informational noise.

Estimated savings: 30–50% of total engagement cost. This is the first optimization to implement and the last to compromise on.

Risk: A buggy normalizer that drops a critical detail (e.g., a version string that indicates a vulnerable service) can cause the agent to miss a finding. Mitigation: normalizers should be conservative (keep ambiguous lines, flag low-confidence extractions) and reviewed against ground truth during development.
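A normalizer of this kind can be sketched with the Python standard library. The example below is illustrative rather than a production parser: it assumes the usual Nmap XML layout (`host`, `address`, `ports/port`, `state`, `service` elements), takes the first `address` element as the host address, and keeps every port entry (including closed ones) to stay on the conservative side. The sample document and the function name are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = """<nmaprun>
  <host>
    <status state="up"/>
    <address addr="10.0.0.5" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open" reason="syn-ack"/>
        <service name="ssh" product="OpenSSH" version="8.2p1"/>
      </port>
      <port protocol="tcp" portid="8080">
        <state state="closed" reason="reset"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def normalize_nmap_xml(xml_text: str) -> list:
    """Reduce Nmap XML to host/port/state/service/version records."""
    root = ET.fromstring(xml_text)
    findings = []
    for host in root.iter("host"):
        address = host.find("address")
        addr = address.get("addr") if address is not None else None
        for port in host.iter("port"):
            state = port.find("state")
            service = port.find("service")
            version = None
            if service is not None:
                # Keep product and version together; version strings matter for CVE matching.
                parts = [service.get("product"), service.get("version")]
                version = " ".join(p for p in parts if p) or None
            findings.append({
                "host": addr,
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "state": state.get("state") if state is not None else "unknown",
                "service": service.get("name") if service is not None else None,
                "version": version,
            })
    return findings


findings = normalize_nmap_xml(SAMPLE)

if __name__ == "__main__":
    print(json.dumps(findings, indent=2))
```

Entries with a missing state or service are kept and marked rather than dropped, in line with the conservative-normalizer mitigation described above.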

5.2 Model Tiering

Not every reasoning step requires a frontier model. A tiered architecture routes calls by complexity:

Blended cost calculation (optimized tiering):

Assuming 800 calls, managed context (30K input, 1K output average):

Frontier (10% = 80 calls):
Input: 80 × 40K × $10/1M = $32.00
Output: 80 × 2K × $30/1M = $4.80
Subtotal: $36.80

Mid-tier (50% = 400 calls):
Input: 400 × 30K × $1.50/1M = $18.00
Output: 400 × 1K × $6/1M = $2.40
Subtotal: $20.40

Fast (40% = 320 calls):
Input: 320 × 20K × $0.50/1M = $3.20
Output: 320 × 0.5K × $2/1M = $0.32
Subtotal: $3.52

Total (before caching/batching): $60.72
Retry + overhead (10%): $6.07
Grand total: ~$67 per engagement

This falls within the optimized $40–$80 range: at ~$67, it is a roughly 10–30× reduction from the naive $700–$2,000.

Model tiering requires a router, which is itself either a cheap model call or a rule-based classifier. A model-based routing decision adds ~500 input tokens and ~50 output tokens per call (about $0.00035 per call at fast-tier pricing, or under $0.30 across 800 calls), a negligible overhead compared to the savings.

Risk: The router sends a complex call to a cheap model that lacks the reasoning depth. Mitigation: the router should be conservative (when in doubt, route up, not down). A 5% misrouting rate to a cheaper tier is acceptable; a 5% misrouting rate to a more expensive tier erodes savings.
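The blended calculation and a conservative rule-based router can be sketched together. Prices and the call mix are taken from this section; the task-type labels in `ROUTES` are hypothetical placeholders, not a taxonomy defined in this paper.

```python
# (input $/M, output $/M) per tier, from the anchor pricing table.
PRICES = {"frontier": (10.00, 30.00), "mid": (1.50, 6.00), "fast": (0.50, 2.00)}

# (calls, avg input tokens, avg output tokens) per tier, from the worked example.
TIER_MIX = {
    "frontier": (80, 40_000, 2_000),
    "mid": (400, 30_000, 1_000),
    "fast": (320, 20_000, 500),
}

# Hypothetical task labels; anything unrecognized is routed UP, not down.
ROUTES = {
    "parse_output": "fast",
    "classify": "fast",
    "summarize": "mid",
    "plan_next_step": "mid",
    "exploit_selection": "frontier",
}


def route(task_type: str) -> str:
    """Rule-based router with a conservative default."""
    return ROUTES.get(task_type, "frontier")


def tier_cost(tier: str) -> float:
    calls, tok_in, tok_out = TIER_MIX[tier]
    price_in, price_out = PRICES[tier]
    return calls * (tok_in * price_in + tok_out * price_out) / 1_000_000


def blended_cost(retry_rate: float = 0.10) -> float:
    base = sum(tier_cost(t) for t in TIER_MIX)
    return base * (1 + retry_rate)


if __name__ == "__main__":
    for tier in TIER_MIX:
        print(f"{tier}: ${tier_cost(tier):.2f}")
    print(f"Total with retries: ${blended_cost():.2f}")
```

Shifting calls between tiers in `TIER_MIX` gives a quick read on how sensitive the total is to the frontier fraction.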

5.3 Prompt Caching

For agents with stable system prompts and slow-changing persistent context, prompt caching can reduce input cost by 50–90% for the cached portion. Providers typically discount cached input tokens by 50% (e.g., GPT-4 cached input at $5/M vs. $10/M uncached).

Mechanism: The cache is keyed on a prefix of the prompt. As long as the prefix is stable across calls, the cached tokens are billed at the discounted rate. The cache has a TTL (typically 5–60 minutes) and is invalidated when the prefix changes.

The catch: Agents that dynamically insert retrieved context before the stable prefix break caching. The fix is to structure prompts as:

[system prompt: cached, stable]
[tool schemas: cached if static; place after the persistent context if dynamic]
[persistent context: cached, slow-changing]
[dynamic working context: uncached, changes every call]
[output]

Estimated savings with optimal structure:

Assuming 60% of input tokens are cacheable (system prompt + persistent context + tool schemas) and the cache discount is 50%:

Effective input price = 0.6 × ($10 × 0.5) + 0.4 × $10 = $3 + $4 = $7/M
Savings vs. uncached = ($10 - $7) / $10 = 30%

For the optimized tiered architecture:

Cached savings on mid-tier (50% cacheable at $1.50/M, cached at $0.75/M):
400 calls × 30K × 0.5 × ($1.50 - $0.75)/1M = $4.50

Cached savings on frontier (60% cacheable at $10/M, cached at $5/M):
80 calls × 40K × 0.6 × ($10 - $5)/1M = $9.60

Total caching savings: ~$14 per engagement

Applied to the $67 tiered baseline: $67 → ~$53 per engagement.

Cache hit rate sensitivity:
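The caching arithmetic in this section, and its dependence on hit rate, can be explored with a short sketch. It assumes savings scale linearly with the hit rate on the cacheable portion and ignores any cache-write surcharge a provider may apply; the `hit_rate` parameter and function names are illustrative additions, not part of the model above.

```python
def effective_input_price(base_price: float, cached_price: float,
                          cacheable_fraction: float, hit_rate: float = 1.0) -> float:
    """Blended $/M input price when part of the prompt is served from cache."""
    cached_share = cacheable_fraction * hit_rate
    return cached_share * cached_price + (1 - cached_share) * base_price


def caching_savings(n_calls: int, avg_input_tokens: int, base_price: float,
                    cached_price: float, cacheable_fraction: float,
                    hit_rate: float = 1.0) -> float:
    """Dollar savings from caching for one tier of calls."""
    cached_tokens = n_calls * avg_input_tokens * cacheable_fraction * hit_rate
    return cached_tokens * (base_price - cached_price) / 1_000_000


if __name__ == "__main__":
    print(f"Effective frontier input price: ${effective_input_price(10.0, 5.0, 0.6):.2f}/M")
    for hit_rate in (0.3, 0.6, 0.9, 1.0):
        mid = caching_savings(400, 30_000, 1.50, 0.75, 0.5, hit_rate)
        frontier = caching_savings(80, 40_000, 10.0, 5.0, 0.6, hit_rate)
        print(f"hit rate {hit_rate:.0%}: savings ${mid + frontier:.2f} per engagement")
```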

5.4 Output Compression

Output tokens are 3–10× more expensive than input tokens, yet agents often generate verbose chain-of-thought and lengthy tool-call arguments. Strategies:

Structured outputs (JSON with enforced schemas) are more token-efficient than prose reasoning. A structured decision ({"action": "run_nmap", "args": {"ports": "1-1000"}}) is ~30 tokens vs. ~150 tokens for the prose equivalent.
Concise reasoning instructions ("state your decision in 3 sentences, then the tool call") reduce output by 50–70%.
Deferred elaboration — the agent produces a terse decision now and elaborates only if the orchestrator requests it. This is especially effective for routine calls where the reasoning is straightforward.

Estimated savings: 30–50% of output cost. For the tiered architecture, output cost is ~$7.52 across all tiers. A 40% reduction saves ~$3 per engagement. This is small in absolute terms, but it stacks with the other optimizations.

5.5 Batching and Parallelism

Many reasoning steps are independent — e.g., interpreting the output of 10 port scans across different hosts. Batching these into a single call with a structured output schema reduces per-call overhead (system prompt, tool schemas) amortized across multiple results.

Mechanism: Instead of 10 separate calls each carrying 5K tokens of system prompt + schemas (50K total overhead), one batch call carries 5K overhead + 10 × 3K results = 35K total. The overhead drops from 50K to 5K — a 90% reduction on the overhead portion.

Estimated savings: 15–25% of total cost for phases with high call independence (enumeration, vulnerability analysis). Lower for phases with sequential dependency (exploitation, lateral movement).

Batched enumeration example:
Naive: 200 calls × (30K input + 1K output) = $10.20 at mid-tier ($1.50/M input, $6/M output)
Batched (10 results per call): 20 calls × (35K input + 5K output) = $1.65 at mid-tier
Savings: ~$8.55 on the enumeration phase alone (an ~84% reduction for that phase)
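The batching comparison can be recomputed directly from the mid-tier anchor prices ($1.50/M input, $6/M output). This is a sketch; `phase_cost` and `overhead_reduction` are illustrative helpers, and the call counts and token sizes are those of the example above.

```python
def phase_cost(n_calls: int, input_tokens: int, output_tokens: int,
               price_in: float = 1.50, price_out: float = 6.00) -> float:
    """Cost of a block of identical calls at the given $/M prices."""
    return n_calls * (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def overhead_reduction(batch_size: int) -> float:
    """Share of fixed per-call overhead removed by batching `batch_size` results."""
    return 1 - 1 / batch_size


unbatched = phase_cost(200, 30_000, 1_000)   # one call per result
batched = phase_cost(20, 35_000, 5_000)      # ten results per call

if __name__ == "__main__":
    print(f"Unbatched: ${unbatched:.2f}")
    print(f"Batched:   ${batched:.2f}")
    print(f"Savings:   ${unbatched - batched:.2f}")
    print(f"Overhead reduction at batch size 10: {overhead_reduction(10):.0%}")
```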

Parallelism (multiple concurrent calls) does not reduce token cost but improves throughput, reducing wall-clock time and infrastructure cost. An 8-hour engagement compressed to 3 hours reduces monitoring overhead and human oversight cost.

5.6 Inference Caching

If the agent retrieves the same CVE description twice or re-reads the same scan output, the second call is pure waste. An inference cache keyed on the input hash eliminates identical calls. For engagements with repetitive enumeration (many hosts with similar services), this can save 10–20% of calls.

Inference cache savings (15% hit rate, tiered architecture):
800 × 0.15 = 120 calls eliminated
120 × $67/800 = ~$10 per engagement
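An inference cache keyed on the input hash can be sketched as follows. This is a minimal in-memory version: the `InferenceCache` class and the `call_fn` callback are hypothetical stand-ins for whatever client the agent actually uses, and the key covers only the model name and prompt, so any other input that affects the response would need to be added to the key.

```python
import hashlib
import json


class InferenceCache:
    """In-memory cache that skips calls whose (model, prompt) pair was already seen."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_fn):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_fn(model, prompt)  # only executed on a cache miss
        self._store[key] = result
        return result


# Usage with a stand-in for the real model call.
calls_made = []


def fake_model_call(model: str, prompt: str) -> str:
    calls_made.append(prompt)
    return f"summary of: {prompt}"


cache = InferenceCache()
first = cache.get_or_call("mid", "Describe CVE-2021-44228", fake_model_call)
second = cache.get_or_call("mid", "Describe CVE-2021-44228", fake_model_call)
third = cache.get_or_call("mid", "Describe CVE-2017-0144", fake_model_call)
```

In the usage above, the repeated CVE lookup is served from the cache and only two model calls are actually made.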

5.7 Combined Optimization Stack

Stacking all optimizations on the tiered baseline:

The optimized architecture lands at $40–$80 per engagement without self-hosting, and $25–$40 with selective self-hosting for the mid-tier workload.

6. The Viability Threshold

6.1 The Human Pentesting Cost Floor

At what cost is autonomous pentesting economically viable? The comparison point is human penetration testing:

For autonomous pentesting to be compelling, its all-in cost should be under $1,000 per engagement, ideally under $500. This covers tokens, infrastructure, and tooling but not human review time (which is still required for reporting validation and remediation guidance).

6.2 Viability Matrix

From our cost model, the viability of different architectures:

The economically viable architecture is clear: managed context, model tiering, and selective self-hosting. Teams that invest in these optimizations achieve a 10–50× cost reduction over naive designs, placing per-engagement cost at 0.4–0.8% of the human equivalent.

6.3 The Quality-Cost Tradeoff

Cost reduction is not free. Each optimization introduces a risk:

Ingestion filtering can drop a critical detail if the normalizer is buggy. Mitigation: normalizers should be conservative (keep ambiguous lines) and reviewed against ground truth.
Model tiering can route a complex call to a cheap model that lacks the reasoning depth. Mitigation: the router should be conservative (when in doubt, route up, not down).
Output compression can suppress reasoning that would have caught an error. Mitigation: allow verbose reasoning for high-stakes calls (exploit selection) and compress for routine ones.
Caching returns a stale result if the context changed in a way the cache key doesn't capture. Mitigation: include all dynamic context in the cache key, even at the cost of lower hit rates.

The tradeoff is not "cost vs. quality" as a single dial but a portfolio of decisions, each of which can be tuned independently. The art is in knowing which calls tolerate cost reduction and which do not.

A useful heuristic: the cost-quality curve is not linear. The first 80% of cost reduction (from naive to managed + tiered) costs almost nothing in quality. The last 20% (aggressive compression, routing everything to cheap models) costs significant quality. The viable design point is at the knee of this curve — aggressive but not reckless.

7. ROI Analysis

7.1 Per-Engagement ROI

The ROI of autonomous pentesting vs. human testing on a per-engagement basis:

ROI = (Human cost - AI cost) / AI cost

Scenario: Standard 5-day human pentest ($10,000) vs. optimized AI ($50)
ROI = ($10,000 - $50) / $50 = 199× (19,900%)

Even against the naive AI design ($1,000):

ROI = ($10,000 - $1,000) / $1,000 = 9× (900%)

The ROI is overwhelmingly positive for optimized architectures, but the naive architecture's ROI is marginal — and if the AI produces lower-quality findings than the human, a 9× cost advantage may not compensate.

7.2 Total Cost of Ownership

The AI cost is not just tokens. A full TCO model:

Total: $605 per engagement; $60,500 per year (100 engagements)

The dominant TCO component is human review time — the analyst who validates findings, reviews the report, and provides remediation context. This is a feature, not a bug: the AI does the high-volume discovery work; the human does the high-judgment validation. The cost structure shifts from "human does everything" to "human supervises AI output."

Against the human-only baseline ($10,000/engagement × 100 = $1,000,000/year), the AI-augmented TCO ($60,500) represents a 16× cost reduction even with human review included.

7.3 Time-to-Value

Human pentests are scheduled weeks in advance and take 5–10 business days. Autonomous pentests can be initiated on-demand and complete in 8–12 hours. For organizations facing compliance deadlines, M&A due diligence, or incident response scenarios, the time compression is as valuable as the cost reduction.

Time ROI = Human duration / AI duration
= 40 working hours (5 business days) / 12 agent hours = 3.3×

Combined cost-and-time ROI makes autonomous pentesting compelling not just as a cheaper alternative but as a capability multiplier — enabling assessment frequency and scope that would be infeasible with human-only testing.

8. Annual Cost Projections

8.1 Scenario Matrix

Annual cost projections for different engagement volumes and architectures:

8.2 Break-Even Analysis

At what engagement volume does the investment in optimization infrastructure (context management, model tiering, caching) pay off?

Assume optimization development cost: $50,000 (engineering time for normalizers, router, caching layer).

Break-even volume = Development cost / (Naive cost - Optimized cost per engagement)
= $50,000 / ($1,000 - $50) = ~53 engagements

For a firm running weekly assessments (52/year), the optimization investment breaks even in about one year. For a firm running daily assessments, it breaks even in about 53 days, or under two months.
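The break-even volume and the time to reach it can be computed for any engagement cadence. This sketch uses the development cost and per-engagement costs assumed above; the function names are illustrative.

```python
import math


def break_even_engagements(dev_cost: float, naive_cost: float,
                           optimized_cost: float) -> int:
    """Smallest whole number of engagements whose savings cover the development cost."""
    return math.ceil(dev_cost / (naive_cost - optimized_cost))


def days_to_break_even(engagements_needed: int, engagements_per_week: float) -> float:
    """Calendar days needed at a given weekly engagement cadence."""
    return engagements_needed / engagements_per_week * 7


needed = break_even_engagements(50_000, 1_000, 50)

if __name__ == "__main__":
    print(f"Engagements to break even: {needed}")
    print(f"Weekly cadence: {days_to_break_even(needed, 1):.0f} days")
    print(f"Daily cadence:  {days_to_break_even(needed, 7):.0f} days")
```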

8.3 Self-Hosting Crossover

The self-hosting crossover for the mid-tier workload:

Monthly API cost (mid-tier, optimized) = Engagements/month × $40 (mid-tier portion)
Monthly GPU cost = $3/hour × utilization hours
Break-even: Engagements/month × $40 = $3 × utilization_hours

Assuming 60% GPU utilization (agent calls are bursty, not continuous):

For 50 engagements/month:
API cost = 50 × $40 = $2,000
GPU cost = $3 × 720 × 0.6 = $1,296
Self-hosting saves $704/month → viable

For 10 engagements/month:
API cost = 10 × $40 = $400
GPU cost = $1,296
API is cheaper by $896/month → not viable

The crossover is approximately 32 engagements/month for a single-GPU deployment ($1,296 / $40 per engagement). Firms below this volume should use APIs; firms above it should self-host the mid-tier.
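The crossover volume follows directly from the two monthly cost expressions. The sketch below uses the section's assumptions ($40 of mid-tier API spend per engagement, $3/hour GPU, 720 hours/month, 60% utilization billed); the function names are illustrative.

```python
def monthly_api_cost(engagements: int, api_cost_per_engagement: float = 40.0) -> float:
    return engagements * api_cost_per_engagement


def monthly_gpu_cost(gpu_hourly: float = 3.0, hours_in_month: int = 720,
                     utilization: float = 0.6) -> float:
    """GPU cost under the section's assumption that only utilized hours are billed."""
    return gpu_hourly * hours_in_month * utilization


def crossover_engagements(api_cost_per_engagement: float = 40.0) -> float:
    """Monthly engagement volume at which API and GPU costs are equal."""
    return monthly_gpu_cost() / api_cost_per_engagement


if __name__ == "__main__":
    for volume in (10, 50):
        delta = monthly_api_cost(volume) - monthly_gpu_cost()
        verdict = "self-host" if delta > 0 else "api"
        print(f"{volume} engagements/month: API minus GPU = ${delta:,.0f} -> {verdict}")
    print(f"Crossover: {crossover_engagements():.1f} engagements/month")
```

If the GPU is reserved for the full month regardless of utilization, setting `utilization=1.0` gives the more conservative crossover.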

9. Sensitivity Analysis

9.1 Key Variables and Their Impact

We test the sensitivity of per-engagement cost to the dominant variables:

The three highest-impact variables are call volume, average input tokens per call, and bloat multiplier — all of which are controlled by architectural decisions (context management, ingestion filtering, batching). This confirms that architecture is the dominant cost lever, not model pricing.

9.2 Monte Carlo-Style Scenario Analysis

We model three scenarios with different parameter combinations:

Pessimistic (high cost):

Call volume: 1,300 (complex environment)
Bloat multiplier: 5× (mediocre context management)
Frontier fraction: 15% (complex reasoning required)
Cache hit rate: 30% (dynamic context)
Retry rate: 15%

Input: 1300 × 60K × blended $4.2/M = $327.60
Output: 1300 × 1.2K × blended $12/M = $18.72
Overhead: ~$35
Total: ~$381 per engagement

Baseline (expected):

Call volume: 800
Bloat multiplier: 3×
Frontier fraction: 10%
Cache hit rate: 60%
Retry rate: 10%

Total: ~$53 per engagement (as calculated in Section 5.7)

Optimistic (low cost):

Call volume: 500 (well-scoped, efficient agent)
Bloat multiplier: 2× (excellent context management)
Frontier fraction: 5% (simple environment)
Cache hit rate: 85% (stable context)
Retry rate: 5%
Self-hosted mid-tier

Input: 500 × 20K × blended $1.8/M = $18.00
Output: 500 × 0.8K × blended $5/M = $2.00
Overhead: ~$5
Total: ~$25 per engagement

Scenario summary:

| Scenario | Per-Engagement Cost | Annual (100 engagements) | vs. Human ($10K) |
|---|---|---|---|
| Pessimistic | ~$381 | $38,100 | 0.038× |
| Baseline | ~$53 | $5,300 | 0.0053× |
| Optimistic | ~$25 | $2,500 | 0.0025× |

Even the pessimistic scenario is 26× cheaper than human testing. The baseline is 189× cheaper. The economic case for autonomous pentesting is robust across a wide parameter range.
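The scenario totals and the comparison against the $10,000 human baseline can be recomputed with a short sketch. The blended prices and overhead figures are the ones stated in each scenario; the baseline figure of $53 is taken as given rather than recomputed, and the function name is illustrative.

```python
HUMAN_COST = 10_000.0


def scenario_cost(calls: int, avg_in: int, avg_out: int,
                  blended_in: float, blended_out: float, overhead: float) -> float:
    """Per-engagement cost from call volume, token sizes, blended $/M prices, and overhead."""
    input_cost = calls * avg_in * blended_in / 1_000_000
    output_cost = calls * avg_out * blended_out / 1_000_000
    return input_cost + output_cost + overhead


scenarios = {
    "pessimistic": scenario_cost(1_300, 60_000, 1_200, 4.2, 12.0, 35.0),
    "baseline": 53.0,  # carried over from the tiered + cached calculation
    "optimistic": scenario_cost(500, 20_000, 800, 1.8, 5.0, 5.0),
}

if __name__ == "__main__":
    for name, cost in scenarios.items():
        print(f"{name}: ${cost:,.2f} per engagement, "
              f"${cost * 100:,.0f}/year at 100 engagements, "
              f"{HUMAN_COST / cost:.0f}x cheaper than human")
```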

10. Comparison of AI Pentesting Approaches

10.1 Architectural Approaches and Their Cost Structures

Different approaches to building autonomous pentesting agents have fundamentally different cost structures:

| Approach | Description | Per-Engagement Cost | Strengths | Weaknesses |
|---|---|---|---|---|
| Single-model frontier | All calls to GPT-4-class, naive context | $700–$2,000 | Highest reasoning quality | Economically unviable at scale |
| Single-model mid-tier | All calls to GPT-4-mini-class | $100–$250 | Simple, decent quality | Quality ceiling on complex reasoning |
| Tiered (frontier + mid + fast) | Router-based tiering | $40–$80 | Best cost/quality balance | Router complexity, misrouting risk |
| Tiered + self-hosted | Hybrid API + GPU | $25–$40 | Lowest cost at scale | Infra ops burden, quality variance |
| Multi-agent (specialized) | Separate agents per phase | $60–$120 | Parallelism, specialization | Coordination overhead, more calls |
| Retrieval-heavy (RAG-first) | Minimal LLM, heavy retrieval | $30–$60 | Low token cost | Limited reasoning, misses novel chains |
| Human-in-the-loop augmented | AI does discovery, human does exploitation | $50–$150 | High quality, manageable cost | Not fully autonomous, human bottleneck |

10.2 The Multi-Agent Cost Premium

Multi-agent architectures (one agent per phase, or specialized agents for recon, exploitation, reporting) introduce coordination overhead. Each inter-agent message exchange is itself an LLM call (to summarize state for the receiving agent). For a 6-phase engagement with 3–5 coordination calls per phase transition, this adds 15–25 calls — a 2–3% increase in call volume but a 10–15% increase in cost because coordination calls tend to use frontier models.

The benefit of multi-agent architectures is parallelism (phases can overlap) and specialization (each agent has a tighter, more cacheable prompt). Whether the cost premium is justified depends on whether the quality gain from specialization exceeds the coordination overhead.

10.3 The RAG-First Approach

A retrieval-heavy architecture minimizes LLM calls by relying on deterministic rules and knowledge-base lookups for most decisions, invoking the LLM only for novel reasoning. This produces very low token costs ($30–$60/engagement) but sacrifices the agentic reasoning that makes AI pentesting powerful — the ability to construct novel attack chains, correlate unexpected findings, and adapt to unknown environments. RAG-first approaches are better suited to continuous monitoring (where the environment is known and changes are incremental) than to greenfield assessments.

11. Engagement Economics: Beyond Per-Engagement Cost

11.1 Fleet-Level Economics

A pentesting firm running 100 engagements per month at $50 token cost per engagement spends $5,000/month on tokens. At $200/engagement (managed but not tiered), it's $20,000/month. At $1,000/engagement (naive), it's $100,000/month. The difference between naive and optimized is $95,000/month — $1.14 million annually. This is not a marginal optimization; it is the difference between a profitable product and a loss-making one.

11.2 Scope as a Cost Lever

Token cost scales with scope. Approximate scaling factors:

| Scope | Relative Cost | Typical Use Case |
|---|---|---|
| Single host | 0.2× | Targeted vulnerability assessment |
| /26 subnet (60 hosts) | 0.4× | Small network segment |
| /24 subnet (254 hosts) | 1.0× (baseline) | Standard network assessment |
| /16 subnet (65K hosts) | 8–12× | Large network scan |
| Full AD domain | 5–10× | Enterprise domain assessment |
| Multi-cloud environment | 10–20× | Cloud security assessment |

This means scope negotiation is cost engineering: offering tiered assessment levels (rapid scan, standard pentest, deep assessment) lets customers choose their price point and lets the provider manage cost per engagement. A "rapid scan" at 0.3× cost can be offered at a price point that undercuts any human alternative while still being profitable.

11.3 Recurring Assessment Economics

For continuous assessment (weekly or monthly scans of the same infrastructure), caching and prior-engagement memory dramatically reduce cost. The agent does not re-enumerate from scratch; it diffs against prior state.

Engagement 1 (baseline): $50 (full assessment)
Engagement 2: $30 (60% of hosts unchanged, cached)
Engagement 3: $20 (80% of hosts unchanged, cached)
Engagement 10: $10 (95% of environment characterized, incremental only)

The cost curve decays roughly geometrically toward a floor as the environment becomes characterized: the cacheable share of the work shrinks with each run, while a small incremental cost remains. This makes autonomous pentesting especially attractive for continuous monitoring use cases, a market that human pentesting cannot serve cost-effectively.

Annual cost for weekly continuous monitoring (52 engagements):

Without engagement caching: 52 × $50 = $2,600
With engagement caching: ~$1,200 (geometric decay)
Human continuous monitoring retainer: $15,000–$50,000/year

The AI continuous monitoring cost is 6–40× cheaper than the human equivalent, with higher frequency (weekly vs. monthly or quarterly human reviews).
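The decay schedule can be sketched as a floor plus a geometrically shrinking term. The parameters below (a $10 floor and a ratio of 0.5) are assumptions chosen because they reproduce the $50, $30, and $20 figures listed above and land close to $10 by the tenth engagement; they are not fitted to measured data.

```python
def engagement_cost(n: int, first: float = 50.0, floor: float = 10.0, ratio: float = 0.5) -> float:
    """Cost of the n-th engagement (1-indexed).

    Assumed model: a fixed incremental floor plus a cacheable portion
    that shrinks by `ratio` on every repeat assessment.
    """
    return floor + (first - floor) * ratio ** (n - 1)


schedule = [engagement_cost(n) for n in range(1, 53)]  # 52 weekly engagements
annual_cached = sum(schedule)
annual_uncached = 52 * 50.0

print([round(c, 2) for c in schedule[:3]], round(schedule[9], 2))
print(f"cached: ${annual_cached:,.0f}/year, uncached: ${annual_uncached:,.0f}/year")
```

Under these idealized assumptions the cached annual total comes to roughly $600, below the ~$1,200 quoted above. That quoted figure works out to an average of about $23 per engagement, which would be consistent with a higher floor, for example if environment churn or periodic full re-baselines keep costs from decaying all the way down.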

12. Open Problems

Several economic problems remain unsolved:

12.1 Cost Estimation Before Engagement

Unlike human pentesting (where cost is fixed by the statement of work), autonomous pentesting cost varies with what the agent finds. A customer quoted "$50" may get a $150 bill if the environment is more complex than expected (more hosts, more services, longer exploitation chains). Predictive cost modeling from reconnaissance data is an open problem. One candidate approach: run a low-cost recon pass ($5–$10) to estimate scope, then quote a fixed price that is set from the confidence interval around the estimate rather than from the point estimate alone.
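A recon-based quote could look something like the sketch below. Everything in it is hypothetical: the linear cost model, the per-host and per-service coefficients, and the history of actual-to-estimated cost ratios are placeholders that a real deployment would fit to its own engagement data. The interval uses the 10th and 90th percentiles of past estimation error via `statistics.quantiles` from the Python standard library.

```python
import statistics
from dataclasses import dataclass


@dataclass
class ReconSummary:
    live_hosts: int
    open_services: int


# Hypothetical coefficients (USD); in practice these would be fitted to past engagements.
FIXED_COST = 5.0
COST_PER_HOST = 0.10
COST_PER_SERVICE = 0.05


def point_estimate(recon: ReconSummary) -> float:
    """Assumed linear cost model over recon counts."""
    return FIXED_COST + COST_PER_HOST * recon.live_hosts + COST_PER_SERVICE * recon.open_services


def quote_with_interval(recon: ReconSummary, past_ratios: list) -> tuple:
    """Return (low, estimate, high), an approximate 80% interval.

    past_ratios holds actual_cost / estimated_cost for earlier engagements.
    """
    deciles = statistics.quantiles(past_ratios, n=10)  # 9 cut points: 10th..90th percentile
    estimate = point_estimate(recon)
    return (estimate * deciles[0], estimate, estimate * deciles[-1])


# Illustrative history: most engagements land near the estimate, a few overrun badly.
history = [0.7, 0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.5, 2.0, 3.0]
low, estimate, high = quote_with_interval(ReconSummary(live_hosts=254, open_services=400), history)
print(f"estimate ${estimate:.2f}, 80% interval ${low:.2f} to ${high:.2f}")
```

A provider quoting a fixed price would likely price near the upper bound, which is why the width of this interval, and not the point estimate, determines how competitive the quote can be.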

12.2 Cost-Quality Benchmarks

There is no standard benchmark correlating token spend with assessment quality (finding coverage, false positive rate, false negative rate). Without this, "optimize for cost" is ungrounded — we cannot say whether a $40 engagement is "as good as" a $200 engagement. The field needs a benchmark analogous to SWE-bench but for pentesting: a set of known-vulnerable environments where agent finding coverage is measured against token spend.

12.3 Token-Efficient Reasoning

Research on how to make models reason more efficiently — producing the same decision in fewer tokens — is in its infancy. Techniques like latent reasoning, early-exit decoding, and speculative decoding promise 2–5× output cost reduction but are not yet production-ready for agentic use. This is a frontier research area with high economic upside.

12.4 Self-Hosted Economics at Scale

The break-even point between API and self-hosting depends on utilization. A firm with low utilization (sporadic engagements) should use APIs; a firm with high utilization (continuous assessment) should self-host. The crossover calculation is non-trivial and depends on ops capacity, model quality requirements, and the latency sensitivity of the agent loop. Multi-GPU deployments for parallel engagements introduce additional complexity.
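The simplest version of the crossover calculation treats self-hosting as a fixed monthly cost plus a small marginal cost per engagement, and compares that with a purely per-engagement API cost. The dollar figures below are placeholders for illustration, not benchmarks, and the model deliberately ignores the harder factors named above (ops capacity, model quality differences, latency sensitivity).

```python
import math


def breakeven_engagements(api_cost: float, fixed_monthly: float, marginal_cost: float) -> float:
    """Monthly engagement volume above which self-hosting is cheaper than the API.

    api_cost:       token cost per engagement via API (USD)
    fixed_monthly:  GPU lease or amortization plus ops overhead per month (USD)
    marginal_cost:  per-engagement cost when self-hosted, e.g. power (USD)
    """
    saving_per_engagement = api_cost - marginal_cost
    if saving_per_engagement <= 0:
        return math.inf  # self-hosting never pays off
    return fixed_monthly / saving_per_engagement


def monthly_cost(volume: int, api_cost: float, fixed_monthly: float, marginal_cost: float) -> dict:
    """Total monthly cost under each option at a given engagement volume."""
    return {
        "api": volume * api_cost,
        "self_hosted": fixed_monthly + volume * marginal_cost,
    }


# Placeholder inputs: $50/engagement via API, $6,000/month fixed, $10/engagement marginal.
threshold = breakeven_engagements(api_cost=50.0, fixed_monthly=6_000.0, marginal_cost=10.0)
print(f"break-even at {threshold:.0f} engagements/month")
print(monthly_cost(100, 50.0, 6_000.0, 10.0))
print(monthly_cost(300, 50.0, 6_000.0, 10.0))
```

With these inputs a firm running 100 engagements per month is still better off on the API, while one running 300 is not. Utilization enters through `fixed_monthly`: idle GPUs still cost money, so sporadic workloads push the threshold up.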

12.5 Dynamic Model Selection

Current model tiering uses static routing rules. A more sophisticated approach would dynamically select the model based on the observed complexity of the current reasoning step — using a cheap model first and escalating to frontier only if the cheap model's confidence is low. This "speculative execution" approach could further reduce cost by 20–40% but requires confidence estimation infrastructure that does not yet exist for agentic workflows.
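A cheap-first cascade of this kind might be structured as follows. The two model callables and their confidence scores are hypothetical stand-ins: as noted above, reliable confidence estimation for agentic steps does not yet exist, so the sketch simply assumes each call returns an answer together with a score in [0, 1].

```python
from typing import Callable, Tuple

# A model call returns (answer, confidence in [0, 1]). Hypothetical interface.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, cheap: ModelFn, frontier: ModelFn, threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate to the frontier model if confidence is low."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = frontier(prompt)
    return answer, "frontier"


def expected_step_cost(cheap_cost: float, frontier_cost: float, escalation_rate: float) -> float:
    """Every step pays for the cheap call; escalated steps also pay for the frontier call."""
    return cheap_cost + escalation_rate * frontier_cost


# Stub models for demonstration only.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    return ("run nmap service scan", 0.9 if "enumerate" in prompt else 0.3)


def frontier_stub(prompt: str) -> Tuple[str, float]:
    return ("chain kerberoast into lateral movement", 0.95)


print(cascade("enumerate host 10.0.0.5", cheap_stub, frontier_stub))
print(cascade("plan attack path to domain admin", cheap_stub, frontier_stub))
print(expected_step_cost(cheap_cost=1.0, frontier_cost=10.0, escalation_rate=0.2))
```

Because escalated steps pay for both calls, the cascade only saves money when the escalation rate stays below `1 - cheap_cost / frontier_cost`; a poorly calibrated confidence score that escalates most steps would cost more than routing straight to the frontier model.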

13. Conclusion

Token economics is not a footnote in autonomous pentesting; it is a first-class design constraint that determines viability. The good news is that the cost surface is highly optimizable: a naive design costing $800–$2,000 per engagement can be brought to $40–$80 through context management, model tiering, prompt caching, batch processing, and output compression. The bad news is that these optimizations require deliberate architectural investment from day one; they cannot be bolted on after the agent is "functional."

The cost model developed in this paper demonstrates that:

Naive designs are economically marginal at $800–$2,000/engagement — approaching or exceeding the cost of junior human testers.

Context management is the single highest-leverage intervention, reducing cost 2–3× and enabling all subsequent optimizations.

Model tiering provides the largest absolute savings, reducing cost from ~$290 to ~$67 by routing 90% of calls to cheaper models.

Prompt caching, batching, and output compression provide incremental savings that compound to bring the total to $40–$80.

Self-hosting the mid-tier pushes cost to $25–$40 for firms with sufficient volume.

Even the pessimistic scenario ($381/engagement) is 26× cheaper than human testing.

The teams that will win in autonomous pentesting are not those with the most sophisticated models or the longest context windows. They are those that treat every token as a scarce resource, instrument their cost surface, and architect for efficiency without sacrificing the reasoning depth that makes AI pentesting valuable in the first place.

In the next paper, we turn from economics to architecture: how do you design the system that implements these optimizations — the context management layer, the model router, the caching infrastructure, and the tool output normalizers that together make autonomous pentesting economically viable at scale?

This whitepaper is part of a series on autonomous penetration testing with AI agents. For the full series index and related work, see the accompanying documentation.