Whitepaper 06 — The Context Window Crisis in AI Pentesting: Why Large-Scale Security Assessment Breaks LLM Agents
Author: Khushal Suthar, Associate Principal Security Analyst
Date: June 2026
Series: Autonomous Penetration Testing with AI Agents
Category: Analysis - Fundamental Limitations & Failure Mode Taxonomy
1. Executive Summary
The promise of autonomous penetration testing rests on a deceptively simple proposition: an AI agent ingests context about a target environment, reasons about attack paths, and executes exploits at machine speed. The bottleneck is not reasoning quality, model intelligence, or tool integration. It is the context window — the finite span of tokens a transformer can attend to in a single forward pass. This paper argues that the context window is the defining architectural crisis of AI pentesting, that naive attempts to expand it produce diminishing returns and compounding costs, and that the three approaches currently in circulation — truncation, state decomposition, and agent routing — each solve one symptom while leaving the disease intact. The path forward is not an ever-larger window but a hierarchical, cascaded memory architecture in which context is treated as a managed, evicted, and curated resource rather than an infinitely expandable buffer. We introduce the Tri-Con 3-Layer Index (detailed in Whitepaper 01) as the architectural response that eliminates the failure modes catalogued here, and we provide a mathematical model of context exhaustion that explains why every non-hierarchical approach must eventually collapse.
2. Introduction: Why Context Is the Crisis
Penetration testing is, at its core, an exercise in accumulating and correlating information. A human tester begins with almost nothing — an IP range, a scope letter, a set of credentials — and over hours, days, or weeks builds a mental model of the target: network topology, service versions, user accounts, trust relationships, prior findings, failed attempts, and emerging attack chains. The quality of the assessment correlates directly with how much of this accumulated context the tester can hold and reason over simultaneously. A senior pentester's advantage over a junior is not knowledge of tools; it is the capacity to hold a finding from Monday in working memory while recognising its relevance to an observation on Thursday.
Large language models arrive with a structural handicap: their working memory is bounded by the context window. A model with a 128,000-token context window can hold roughly the text of a 300-page book. That sounds generous until one enumerates what an autonomous pentest actually generates.
2.1 Context Volume in a Real Engagement
A modest enterprise engagement can exceed the context window by an order of magnitude within the first few hours of enumeration. A complex one can exceed it by two orders of magnitude. This is not an edge case; it is the median case. The crisis is not hypothetical — it is the first wall that every team building autonomous pentesting agents hits, usually within days of moving from toy targets to real infrastructure.
2.2 A Representative Single-Host Example
Consider a minimal engagement against a single Linux host at 10.10.10.5. A standard reconnaissance sequence produces:
    nmap -sV -sC -p- 10.10.10.5
    gobuster dir -u http://10.10.10.5 -w big.txt
    nikto -h http://10.10.10.5
    smbclient -L //10.10.10.5 -N
    enum4linux -a 10.10.10.5
    wpscan --url http://10.10.10.5

By step 6, still in enumeration and before a single exploitation step, the agent has generated 124,600 tokens of raw output. Once the system prompt and the reserved output tokens are counted, a 128K-context model has exceeded its window. Truncation begins. The agent loses the nmap service versions from step 1, the gobuster directory listings from step 2, and the nikto findings from step 3. When it needs to reason about whether vsftpd 2.3.4 (from the nmap scan) is exploitable, the data is gone. The agent either re-runs nmap (wasting time, generating noise on the target, and risking detection) or proceeds blind.
This example involves a single host. Scale to a /24 network with 254 live hosts and the numbers multiply by two orders of magnitude. The crisis is not a future risk; it is the present reality of every autonomous pentesting agent that has moved past demo targets.
3. Anatomy of the Context Window
3.1 What the Context Window Actually Is
The context window is the maximum number of tokens — input plus output — that a transformer-based language model can process in a single inference call. It is determined by the architecture's attention mechanism, the positional encoding scheme, and the serving infrastructure's memory budget. Standard transformer attention is O(n²) in sequence length: doubling the context quadruples the attention computation. Even with sparse attention, FlashAttention, ring-attention, and other optimisations, the computational and memory cost grows superlinearly, and the key-value cache pressure grows linearly with sequence length.
The window encompasses everything the model "sees" at the moment of generating its next token:
- System prompt — agent persona, rules of engagement, tool schemas, safety constraints
- Conversation history — prior turns between orchestrator and model
- Tool call results — the volumetric bulk of pentesting context
- Retrieved context — injected documents, retrieved memories, RAG chunks
- In-flight reasoning — chain-of-thought, scratchpad, plan tokens
Every token competes for the same finite budget. A token spent re-reading an Nmap timing line is a token not available for reasoning about an attack chain.
3.2 The Illusion of "Large" Context
Modern frontier models advertise context windows of 200K, 1M, even 10M tokens. These numbers create an illusion of sufficiency. In practice, three effects erode the usable window:
1. Needle-in-a-haystack degradation. Empirical studies consistently show that recall and reasoning quality over long contexts degrade non-uniformly. Models exhibit strong performance at the beginning (primacy) and end (recency) of the context and poor recall in the middle — the so-called "lost in the middle" phenomenon. A finding buried at token position 600K in a 1M window may as well not exist. For pentesting, where the critical observation is often a single version string or a single HTTP header buried in megabytes of scan output, this is catastrophic.
2. Effective vs. nominal context. A model may accept 1M tokens but only reason well over the first ~32K and the last ~32K. The middle is "seen" in the sense that attention weights are computed over it, but it is not deeply integrated into the model's reasoning. For pentesting, where correlation between a port scan at hour one and a credential reuse at hour six is the entire game, middle-of-window blindness is fatal. The nominal window is a marketing number; the effective window is what the agent actually reasons over, and it is typically 4–8× smaller than advertised.
3. Cost and latency scaling. Input token pricing is linear in most pricing models, but latency is not. A 1M-token inference call is not 8× slower than a 128K call — it can be 20–50× slower on non-optimised infrastructure, and the per-token cost can be 2–5× higher due to key-value cache pressure. An autonomous agent making hundreds of calls per engagement at 1M tokens each is economically untenable. The cost of "just use a bigger window" is not just the per-token price; it is the compounding latency that makes the agent slow enough to be useless in time-sensitive engagements.
3.3 The Pentesting Specificity
Context pressure is not unique to pentesting — legal document review, codebase analysis, and scientific literature synthesis all face it. But pentesting has three aggravating properties that make the crisis sharper here than in other domains:
Adversarial noise. Tool output is not clean prose; it is verbose, repetitive, and full of irrelevant lines (Nmap timing lines, HTTP headers, shell prompts, progress bars). The signal-to-noise ratio is poor, meaning useful tokens are a small fraction of consumed tokens. A 40,000-token Nmap output may contain 500 tokens of actual findings. The model must attend over all 40,000 to extract the 500, wasting attention budget on noise.
Temporal correlation requirements. The value of context is not in any single observation but in links between observations separated by time and tool boundaries. The model must hold finding A in attention while evaluating finding B to recognise that they form a chain. A human tester does this naturally; an LLM does it only if both findings are simultaneously in its context window or if an external memory system surfaces the link. If finding A has been evicted by the time finding B arrives, the chain is broken.
Non-deterministic volume. A pentest's output volume is not predictable in advance. One subnet may be a dead-end (10 KB of output); another may explode into a full enumeration cascade (50 MB). The agent cannot pre-allocate context budget because it does not know what it will find. This makes static context-budget strategies brittle; the agent needs a system that adapts in real time to observed volume.
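The adversarial-noise property above can be made concrete with a small ingestion filter. The following is a minimal sketch, assuming nmap's normal text output format; the function name `extract_open_ports` and the sample scan text are illustrative assumptions, not part of any framework described in this series.

```python
import re

# Matches port lines in nmap "normal" output, for example:
#   21/tcp  open  ftp     vsftpd 2.3.4
PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s+open\s+(\S+)\s*(.*)$")


def extract_open_ports(nmap_text: str) -> list[dict]:
    """Keep only open-port lines; timing, status, and header lines are dropped."""
    findings = []
    for line in nmap_text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            port, proto, service, version = match.groups()
            findings.append({
                "port": int(port),
                "proto": proto,
                "service": service,
                "version": version.strip(),
            })
    return findings


# Hypothetical sample output, written for illustration only.
SAMPLE = """\
Starting Nmap 7.94 ( https://nmap.org )
Stats: 0:00:12 elapsed; 0 hosts completed (1 up), 1 undergoing SYN Stealth Scan
Nmap scan report for 10.10.10.5
Host is up (0.031s latency).
PORT    STATE SERVICE VERSION
21/tcp  open  ftp     vsftpd 2.3.4
22/tcp  open  ssh     OpenSSH 4.7p1
80/tcp  open  http    Apache httpd 2.2.8
Nmap done: 1 IP address (1 host up) scanned in 98.44 seconds
"""

for finding in extract_open_ports(SAMPLE):
    print(finding)
```

A filter of this kind only addresses noise the format makes predictable; anything the pattern does not anticipate is silently dropped, which is why the raw output still needs to be stored somewhere outside the context.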
4. Failure Mode Taxonomy
The context window crisis manifests not as a single failure but as a family of interrelated failure modes. We identify nine distinct modes, grouped into three clusters. Every existing autonomous pentesting framework exhibits at least four of these; most exhibit six or more.
4.1 Cluster A — Capacity Failures (Context Too Full)
F1: Context Drowning. Raw tool output exceeds the window before the agent has completed enumeration. The agent truncates, losing early findings. This is the most obvious failure and the one every team encounters first. Symptom: the agent cannot recall the service versions from its initial Nmap scan when deciding which exploit to attempt.
F2: Forgetting Loop. Without active context management, early findings are pushed out of the window by the sheer volume of later tool output, and the agent re-discovers — or fails to re-correlate — information it already had. The agent runs the same Nmap scan twice, asks the same enumeration question, or fails to connect a finding from hour one to an exploit in hour eight. The forgetting loop is not just wasteful; it is actively harmful because re-scanning generates noise on the target, potentially triggering intrusion detection and undermining the stealth objectives of the engagement.
F3: Bloated Prompt. The opposite failure from forgetting: in an attempt to "not forget," the agent or its framework stuffs every prior output into context. The prompt grows until the model spends its reasoning capacity filtering noise rather than reasoning about attacks. Responses become generic, the agent produces boilerplate recommendations, and it stops generating novel hypotheses because its attention is consumed by re-reading Nmap output. The bloated prompt is the failure mode of teams that have heard "don't truncate" but have not heard "summarise hierarchically."
4.2 Cluster B — Integrity Failures (Context Present But Wrong)
F4: Stale-Finding Drift. Summaries are not updated when new data supersedes old data. The agent reasons on outdated port/version information, attempting exploits against services that have been patched or reconfigured since the summary was generated. Symptom: the agent attempts an exploit for vsftpd 2.3.4, but the service was actually vsftpd 3.0.5 — the summary was generated from an early scan and never refreshed after a more precise banner grab.
F5: Granularity Mismatch. The agent receives either too much or too little detail for the decision it is making. A single-level summary that is terse enough for high-level planning is too sparse for exploitation reasoning; one that is detailed enough for exploitation is too verbose for planning. The agent either drowns in detail it does not need or starves for detail it does. This is the failure mode of naïve single-pass summarisation: one compression level cannot serve all reasoning needs.
F6: Cross-Phase Contamination. Reconnaissance noise bleeds into exploitation reasoning because the context has no phase segmentation. The agent is trying to decide whether to exploit a finding, but its context is cluttered with enumeration output from a phase that is logically complete. The model cannot distinguish "this is settled background" from "this is an active lead" because both occupy the same undifferentiated context buffer.
F7: Context Thrash. When context exceeds the window, naive truncation strategies (keep last N tokens, drop oldest) destroy precisely the high-value early context — scope, topology, credentials — that the agent needs most. The agent then behaves erratically: forgetting scope, re-enumerating, or attempting out-of-scope actions. Context thrash is the failure mode that produces rules-of-engagement violations, which are not just technical errors but legal and contractual risks.
4.3 Cluster C — Synthesis Failures (Context Retrieved But Not Connected)
F8: RAG Illusion. A common reflex is to offload context to a vector database and retrieve on demand. This helps with capacity but introduces its own failure: retrieval is query-dependent, and the agent does not always know what to query for. The correlation that would reveal an attack chain ("service X on host Y has the same version as the CVE I just read about") requires the agent to already suspect the link to issue the right retrieval query. RAG solves recall but not synthesis. The agent can retrieve everything it has seen, but it cannot retrieve a connection it has not yet imagined.
F9: Audit Gap. Raw output is discarded after summarisation to save context budget. When the engagement is over and the agent must produce an evidence-backed report, the original data is gone. The agent cannot re-parse raw output, cannot show the exact Nmap line that identified the vulnerable service, and cannot produce the artefacts that a client's remediation team needs. The audit gap converts a technical shortcoming into a business failure: the report looks complete but is not defensible.
4.4 Failure Mode Summary Table
The two Critical severity modes — F7 and F8 — deserve particular attention because they are the failures that produce the worst outcomes: legal exposure from out-of-scope actions, and missed attack chains that a human tester would have caught. An autonomous agent can run for hours, produce voluminous output, and appear thorough while systematically missing the cross-context correlations that constitute the actual value of a penetration test.
5. Token Budget Analysis
To understand why the crisis is structural rather than incidental, we must quantify the token budget of a real engagement and compare it to available context windows.
5.1 The Token Budget Equation
At any reasoning step t, the agent's context consumption is:
C(t) = S + H(t) + T(t) + R(t) + P(t)
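Where (following the context components listed in Section 3.1):
- S is the system prompt: agent persona, rules of engagement, tool schemas, safety constraints
- H(t) is the conversation history at step t
- T(t) is the tool call results held in context at step t
- R(t) is the retrieved context injected at step t
- P(t) is the in-flight reasoning at step t: chain-of-thought, scratchpad, plan tokens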
The constraint is:
C(t) ≤ W - O_max
Where W is the context window size and O_max is the maximum output tokens the model needs to generate (typically 1,000–4,000 for a reasoning step). The usable budget is therefore W - O_max - S, and the dynamic terms H(t) + T(t) + R(t) + P(t) must fit within it.
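The budget constraint above can be expressed as a small check. This is a sketch under assumptions: it reads H, T, R, and P as the history, tool-result, retrieved-context, and in-flight reasoning components of Section 3.1, and the class name `ContextBudget` and the numbers in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    window: int      # W: context window size in tokens
    max_output: int  # O_max: tokens reserved for the model's output
    system: int      # S: system prompt tokens

    def usable(self) -> int:
        """Tokens left for the dynamic terms: W - O_max - S."""
        return self.window - self.max_output - self.system

    def fits(self, history: int, tools: int, retrieved: int, plan: int) -> bool:
        """True when H(t) + T(t) + R(t) + P(t) fits in the usable budget."""
        return history + tools + retrieved + plan <= self.usable()


# Hypothetical figures for a 128K window.
budget = ContextBudget(window=128_000, max_output=4_000, system=5_000)
print(budget.usable())                                   # 119000
print(budget.fits(10_000, 100_000, 5_000, 4_000))        # True: exactly at the limit
print(budget.fits(10_000, 124_600, 0, 0))                # False: raw recon output alone overflows
```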
5.2 Per-Phase Token Generation Rates
Empirical observation across multiple engagement types yields the following per-phase generation rates:
A complete single-host engagement generates approximately 565,000–2,075,000 tokens of raw tool output. A 10-host engagement multiplies this by 5–8× (not 10×, because some phases are shared). A /24 engagement can generate 5–50 million tokens.
Against a 128K window, a single-host engagement exceeds capacity by 4–16×. Against a 1M window, the same output is roughly 0.6–2× the window, meaning that even a million-token window is insufficient for a single host in the worst case and catastrophically insufficient for multi-host engagements.
5.3 The Cost Dimension
At a representative pricing of $3 per million input tokens (frontier-class models), a single-host engagement consuming 2M tokens of input across its lifetime costs $6 in input alone. But this understates the true cost because the same tokens are re-sent on every inference call within a session that does not use prompt caching. Without caching, a 200-step engagement where the context grows linearly to 2M tokens sends an average of 1M tokens per call, for a total of 200M tokens — $600 in input cost for a single host. With prompt caching (which typically offers a 50–80% discount on cached prefixes), this drops to $120–300 — still significant for a single host, and multiplying linearly with host count.
A /24 engagement without caching could cost $15,000–$150,000 in token costs alone. This is not competitive with human pentesting. Context management is therefore not just a technical necessity but an economic one: every token saved by filtering, summarisation, or hierarchical retrieval translates directly to cost savings and latency reductions.
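The single-host arithmetic above can be reproduced in a few lines. This sketch uses the section's own simplification (context grows linearly, so the average call carries half the final context) and a further assumption that the caching discount applies to the whole input; the function names are illustrative.

```python
def resend_tokens(steps: int, final_context: int) -> int:
    """Total input tokens when the full context is re-sent on every call
    and the average call carries half of the final context."""
    return steps * final_context // 2


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input cost at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


PRICE = 3.0  # USD per million input tokens, as in the text

total = resend_tokens(steps=200, final_context=2_000_000)
uncached = input_cost_usd(total, PRICE)

# Assumption: the 50-80% caching discount applies to every input token.
cached_best = uncached * (1 - 0.80)
cached_worst = uncached * (1 - 0.50)

print(total, uncached, round(cached_best), round(cached_worst))
```

In practice only the cached prefix is discounted, so real figures would sit somewhere between the cached and uncached totals.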
6. Mathematical Model of Context Exhaustion
To make the crisis precise, we formalise context exhaustion as a dynamical system. This model explains why every non-hierarchical approach must eventually collapse and why the collapse is abrupt rather than gradual.
6.1 Definitions
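Let:
- W be the context window size, in tokens
- S be the fixed system prompt overhead, in tokens
- g(t) be the generation rate: the raw tool output produced at step t, in tokens
- r(t) be the retention rate: the fraction of that output kept in context after filtering, with 0 < r(t) ≤ 1
- E(t) be the cumulative retained context after step t, with E(0) = 0
- t* be the exhaustion step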
6.2 The Exhaustion Equation
Without active summarisation or eviction, cumulative context grows as:
E(t) = E(t-1) + r(t) · g(t)
= Σ_{i=1}^{t} r(i) · g(i)
Context exhaustion occurs at the step t* where E(t*) + S = W, i.e.:
t* = min { t : Σ_{i=1}^{t} r(i) · g(i) ≥ W - S }
If we assume a constant generation rate g and retention rate r (a simplification, but illustrative), this reduces to:
t* = (W - S) / (r · g)
For a 128K window with S = 5,000, r = 0.8 (minimal filtering), and g = 15,000 (reconnaissance phase):
t* = (128,000 - 5,000) / (0.8 × 15,000) = 123,000 / 12,000 ≈ 10.25
The agent exhausts its context window in approximately 10 steps, before it has finished scanning a single host. This is consistent in order of magnitude with Section 2.2, where six steps produced 124,600 tokens against a 128K window (about 20,800 tokens per step, a higher generation rate than the 15,000 assumed here).
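The exhaustion step can be computed both from the closed form and from the discrete definition. The sketch below is illustrative; the function names are assumptions, and the discrete version returns the first whole step at which the budget is crossed, so it rounds the closed-form value up.

```python
from typing import Iterable, Optional


def exhaustion_step_closed_form(window: int, system: int,
                                retention: float, gen_rate: float) -> float:
    """t* = (W - S) / (r * g) for constant r and g."""
    return (window - system) / (retention * gen_rate)


def exhaustion_step(window: int, system: int,
                    retained_per_step: Iterable[float]) -> Optional[int]:
    """Smallest t with sum_{i<=t} r(i) * g(i) >= W - S, or None if never reached."""
    total = 0.0
    for t, retained in enumerate(retained_per_step, start=1):
        total += retained
        if total >= window - system:
            return t
    return None


# Parameters from the worked example: W = 128K, S = 5,000, r = 0.8, g = 15,000.
print(exhaustion_step_closed_form(128_000, 5_000, 0.8, 15_000))
print(exhaustion_step(128_000, 5_000, [0.8 * 15_000] * 20))
```

The discrete form accepts a per-step sequence, so it can also be fed a varying generation rate rather than the constant one assumed in the closed form.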
6.3 The Collapse Dynamics
The critical insight is that context exhaustion is not a gradual degradation but an abrupt phase transition. Below t*, the agent reasons over complete context and makes correct decisions. At t*, the agent begins truncating, and the quality of reasoning does not decline linearly: it drops sharply because the highest-value context (scope, early findings, credentials) is the oldest and therefore the first to be evicted by naive truncation.
We model reasoning quality Q(t) as:
Q(t) = Q_max · f(coverage(t)) · g(coherence(t))
Where:
After exhaustion (t > t*), with naive truncation:
The quadratic coherence term is the mathematical expression of why correlation breaks down before individual findings are lost. A finding can survive in context while its connections to other findings are destroyed, because the other endpoints of those connections have been evicted. This is the formal basis for F8 (RAG Illusion) and F2 (Forgetting Loop): the agent does not lose data, it loses structure.
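One way to read the quadratic term is to count pairs: if coherence is taken to be the fraction of pairwise links whose two endpoints are both still in context, it falls roughly as the square of coverage. The sketch below rests on that assumption (every pair of findings treated as an equally likely link); the function names are illustrative and this is not presented as the paper's exact definition.

```python
from math import comb


def coverage(kept: int, total: int) -> float:
    """Fraction of findings still present in context."""
    return kept / total


def link_coverage(kept: int, total: int) -> float:
    """Fraction of finding pairs with both endpoints still in context."""
    return comb(kept, 2) / comb(total, 2)


# Hypothetical engagement with 100 findings, half of them evicted.
print(coverage(50, 100))        # 0.5
print(link_coverage(50, 100))   # about 0.25: three quarters of the links are gone
```

Under this reading, losing half the findings removes about three quarters of the possible correlations, which is the sense in which structure is lost faster than data.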
6.4 Why Bigger Windows Do Not Solve the Problem
If we increase W to 1,000,000 (1M tokens) with the same parameters:
t* = (1,000,000 - 5,000) / (0.8 × 15,000) ≈ 83
The agent now survives ~83 steps, enough for a single-host engagement but not for multi-host. And this assumes constant g; in practice, g increases as the agent discovers more services and subnets, so the actual t* is lower. Furthermore, the needle-in-a-haystack effect means that effective reasoning quality degrades well before t*: a 1M-token context may be nominally within budget but functionally unusable for correlation because the relevant findings are buried in the middle.
The exhaustion equation also reveals why ingestion filtering (reducing r) is the highest-leverage intervention: halving r doubles t*. A normaliser that compresses Nmap output from 40,000 tokens to 5,000 (r = 0.125) extends the single-host budget from ~10 steps at r = 0.8 to ~66 steps (123,000 / (0.125 × 15,000) ≈ 65.6), enough for a full reconnaissance phase, without any architectural change. This is the mathematical justification for the ingestion-filtering strategy introduced in Whitepaper 01's L1 indexing layer.
6.5 Why Hierarchical Summarisation Is Necessary
With hierarchical summarisation, the model becomes:
E(t) = E_persistent + E_working(t) + E_summary(t)
Where E_persistent is bounded and constant, E_working is bounded by a rolling window of size W_working, and E_summary grows far more slowly than raw output: each summary compresses k prior steps into one summary entry, so E_summary(t) ≈ (t/k) · s where s is summary size. Growth is still linear in t, but with slope s/k instead of r · g.
The exhaustion step becomes:
t** = (W - S - E_persistent - W_working) / (s/k)
If s = 200 tokens and k = 10 (every 10 steps produce one 200-token summary), the marginal context cost per step is 20 tokens, against 12,000 in the Section 6.2 example, and the agent can run for thousands of steps within a 128K window. This is the mathematical foundation of the Tri-Con L2/L3 architecture (Whitepaper 01): by compressing findings into progressively smaller representations, the orchestrator's context grows at a small fraction of the raw output rate, enabling engagements of thousands of steps at bounded context cost.
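The hierarchical exhaustion step can be evaluated the same way as the flat one. In this sketch the values for E_persistent and W_working are hypothetical, chosen only to make the arithmetic concrete; the function name is likewise an assumption.

```python
def hierarchical_exhaustion_step(window: int, system: int, persistent: int,
                                 working: int, summary_size: int,
                                 steps_per_summary: int) -> float:
    """t** = (W - S - E_persistent - W_working) / (s / k)."""
    marginal_cost = summary_size / steps_per_summary  # tokens added per step
    return (window - system - persistent - working) / marginal_cost


# W, S, s, k from the text; E_persistent = 3,000 and W_working = 20,000 are assumed.
steps = hierarchical_exhaustion_step(
    window=128_000, system=5_000, persistent=3_000,
    working=20_000, summary_size=200, steps_per_summary=10,
)
print(steps)  # 5000.0
```

With these assumed values the summary layer alone would take 5,000 steps to fill the remaining window, compared with roughly 10 steps for unsummarised output under the earlier parameters.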
7. Why Existing Approaches Are Insufficient
Three approaches dominate current attempts to manage the context crisis. Each addresses one cluster of failure modes while leaving the others intact.
7.1 Truncation (Sliding Window)
Mechanism: Keep the last N tokens (or last N messages) and drop the rest. Used implicitly by any framework that does not implement a dedicated context manager.
What it solves: F1 (Context Drowning) — the agent never exceeds the window because old content is continuously evicted.
What it creates: F2 (Forgetting Loop), F4 (Stale-Finding Drift), F7 (Context Thrash), F9 (Audit Gap). Naive truncation evicts the oldest content, which in pentesting is the most valuable content — scope, topology, initial findings, credentials. The agent forgets its scope and attempts out-of-bounds actions (F7), re-runs scans it has already completed (F2), and loses all raw evidence for reporting (F9).
Verdict: Truncation is not a context management strategy; it is the absence of one. It is the default behaviour of a system that has not yet recognised the crisis. Every framework that relies on truncation exhibits the forgetting loop and context thrash within the first hour of a real engagement.
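The eviction order described above is easy to demonstrate. The following is a minimal sketch of drop-oldest truncation; the per-tool token counts are hypothetical and `truncate_oldest` is an illustrative name, not a function from any particular framework.

```python
from collections import deque


def truncate_oldest(messages: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the most recent messages that fit in `budget` tokens.

    Each message is (label, token_count). Walking from newest to oldest,
    messages are kept until the next one would overflow the budget.
    """
    kept: deque = deque()
    used = 0
    for label, tokens in reversed(messages):
        if used + tokens > budget:
            break
        kept.appendleft((label, tokens))
        used += tokens
    return list(kept)


# Hypothetical token counts for an early engagement history.
history = [
    ("scope", 3_000),
    ("nmap", 40_000),
    ("gobuster", 30_000),
    ("nikto", 25_000),
    ("enum4linux", 35_000),
]

survivors = [label for label, _ in truncate_oldest(history, budget=100_000)]
print(survivors)  # ['gobuster', 'nikto', 'enum4linux']
```

The scope document and the nmap service versions are the first entries to go, even though the scope entry is only 3,000 tokens and would have fit in the space left over.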
7.2 State Decomposition (Summarisation)
Mechanism: When context exceeds a threshold, pass the accumulated context through the LLM (or a cheaper model) to produce a summary, then replace the raw context with the summary. Variants include single-pass summarisation, incremental summarisation, and structured summarisation (extracting findings into a key-value store).
What it solves: F1 (Context Drowning), F3 (Bloated Prompt). The agent's context stays within budget because raw output is compressed.
What it creates: F4 (Stale-Finding Drift), F5 (Granularity Mismatch), F6 (Cross-Phase Contamination), F9 (Audit Gap). A single-level summary cannot serve both planning and exploitation reasoning (F5). Summaries generated early in the engagement are not refreshed when more precise data arrives (F4). Without phase segmentation, the summary mixes reconnaissance and exploitation context (F6). And if raw output is discarded after summarisation, the audit trail is destroyed (F9).
The deeper problem with single-pass summarisation is irreversibility. Once a detail is dropped from a summary, it cannot be recovered. If the summariser misses a subtle finding — an unusual HTTP header, a non-standard port, a version string with a trailing patch number — that finding is permanently gone. In pentesting, the subtle findings are often the critical ones. A summariser that preserves 95% of findings has not lost 5% of value; it has potentially lost 100% of the attack chain, because the missed finding may have been the missing link.
Verdict: State decomposition is a necessary but insufficient response. It solves the capacity problem but introduces integrity and synthesis problems. Without multi-granularity summaries, phase segmentation, and raw-data retention, it trades one crisis for another.
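Two of the gaps named above, stale findings and lost raw data, can be sketched in a structured store. The example below assumes that a later observation of the same host and port supersedes an earlier one and that raw output is kept elsewhere under a reference; the class names and `raw_ref` values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    host: str
    port: int
    service: str
    version: str
    step: int      # step at which the observation was made
    raw_ref: str   # hypothetical pointer to the stored raw tool output


@dataclass
class FindingStore:
    current: dict = field(default_factory=dict)   # latest finding per (host, port)
    history: list = field(default_factory=list)   # every finding, never discarded

    def record(self, finding: Finding) -> None:
        """Append to the audit history and refresh the current view if newer."""
        self.history.append(finding)
        key = (finding.host, finding.port)
        previous = self.current.get(key)
        if previous is None or finding.step >= previous.step:
            self.current[key] = finding


store = FindingStore()
store.record(Finding("10.10.10.5", 21, "ftp", "vsftpd 2.3.4", 1, "raw/step-0001-nmap.txt"))
store.record(Finding("10.10.10.5", 21, "ftp", "vsftpd 3.0.5", 14, "raw/step-0014-banner.txt"))

print(store.current[("10.10.10.5", 21)].version)   # vsftpd 3.0.5
print([f.raw_ref for f in store.history])          # both raw references survive
```

Treating "newer" as "more authoritative" is itself a simplification; a fuller design would need some notion of observation precision, which this sketch does not attempt.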
7.3 Agent Routing (Multi-Agent Decomposition)
Mechanism: Split the agent into specialised sub-agents, each with its own context window: an enumeration agent, a vulnerability research agent, an exploitation agent, a lateral movement agent, and an orchestrator. Each sub-agent processes only its domain's tool output; the orchestrator holds only high-level state.
What it solves: F1 (Context Drowning), F3 (Bloated Prompt), F6 (Cross-Phase Contamination). Each sub-agent's context is bounded and domain-coherent. The total system memory is the sum of all sub-agent windows, effectively multiplying capacity.
What it creates: F2 (Forgetting Loop at the orchestrator level), F8 (RAG Illusion), and a new failure: orchestrator blindness. The orchestrator receives summaries from sub-agents but cannot see the raw data. When a sub-agent reports "found 3 potential vulnerabilities," the orchestrator cannot reason about which vulnerability is most promising without either (a) requesting the raw data from the sub-agent (which requires the sub-agent to still have it in context) or (b) re-running the sub-agent's analysis. The orchestrator becomes a manager who has read every department's executive summary but none of the underlying reports — it can coordinate but cannot correlate.
The critical failure is that attack chains span agent boundaries. A credential found by the enumeration agent is needed by the lateral movement agent, but the orchestrator must explicitly route it. If the orchestrator does not know to route it (because the credential's relevance is not obvious until another finding arrives in a different sub-agent), the chain is broken. Agent routing solves the capacity problem within each sub-agent but creates a synthesis problem at the orchestrator level — and synthesis is the entire value proposition of autonomous pentesting.
Verdict: Agent routing is a valuable topology pattern but not a context management solution. It multiplies effective memory but does not provide the hierarchical retrieval or cross-agent correlation needed to solve the synthesis failures. Without an external indexed memory that all agents share (such as the Tri-Con index), agent routing produces well-resourced but uncoordinated sub-agents that miss cross-domain attack chains.
7.4 Comparison Table of Approaches
The Tri-Con architecture (introduced in Whitepaper 01) is the only approach that addresses all nine failure modes. It does so by combining ingestion filtering (L1), hierarchical phase-segmented summarisation (L2), ultra-compressed orchestrator working set (L3), cascaded retrieval (L3→L2→L1→Raw), and persistent raw-data storage for audit. Section 9 introduces it at the level needed to understand why it resolves the crisis.
8. Case Studies
8.1 Case Study A: The Forgotten Credential (Forgetting Loop, F2)
Scenario: An autonomous agent is tasked with a multi-host engagement against a /24 internal network. During hour one, the enumeration agent discovers SMB shares on host 10.0.0.7 and extracts a list of usernames, including svc_backup. During hour six, the exploitation agent discovers that the password Summer2024! is valid for several accounts. The attack chain — svc_backup + Summer2024! → lateral movement to backup server — requires correlating the username from hour one with the password from hour six.
What happened (truncation-based agent): By hour six, the SMB enumeration output from hour one had been evicted from context. The agent had the password but not the username list. It attempted to use the password with usernames it could remember from recent context (the ones it had encountered in the last 30 minutes), none of which were svc_backup. The credential was never combined with the correct username. The attack chain was missed.
What happened (RAG-based agent): The username list was stored in the vector database. When the agent discovered the password, it queried the vector store for "SMB users" and retrieved three chunks: the SMB enumeration output, an unrelated nmap result mentioning port 445, and a reporting fragment. The svc_backup username was in the retrieved SMB chunk, but the agent did not recognise its relevance because the retrieval was semantically noisy and the agent had no phase context to distinguish "this is a finding I should correlate" from "this is background noise."
What would happen (Tri-Con): The SMB enumeration would be indexed at L1 (structured summary: "host 10.0.0.7, SMB shares, users: svc_backup, admin, guest"), L2 (grouped under "SMB findings, reconnaissance phase"), and L3 (one-liner: "SMB users found on 10.0.0.7"). When the exploitation agent discovers the password, the orchestrator's L3 context shows "SMB users found on 10.0.0.7" and the agent triggers cascaded retrieval: L3 → L2 → L1, pulling the exact username list in 500 tokens. The correlation is made in one retrieval step.
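The cascade in this case study can be sketched with three plain dictionaries. Everything about the layout below is a hypothetical illustration (the keys, IDs, field names, and `raw_ref` path); the actual Tri-Con data model is defined in Whitepaper 01 and may differ.

```python
# L3: one-liners that stay in the orchestrator's context.
L3 = {"smb-10.0.0.7": "SMB users found on 10.0.0.7"}

# L2: phase-segmented groupings that point at L1 records.
L2 = {
    "smb-10.0.0.7": {
        "phase": "reconnaissance",
        "group": "SMB findings",
        "l1_ids": ["l1-0042"],
    }
}

# L1: structured summaries that point at stored raw output.
L1 = {
    "l1-0042": {
        "host": "10.0.0.7",
        "service": "SMB",
        "users": ["svc_backup", "admin", "guest"],
        "raw_ref": "raw/enum4linux-10.0.0.7.txt",
    }
}


def cascade(l3_key: str) -> list[dict]:
    """Follow an L3 one-liner through its L2 group down to the L1 records."""
    group = L2[l3_key]
    return [L1[record_id] for record_id in group["l1_ids"]]


def users_for(l3_key: str) -> list[str]:
    """Collect the usernames recorded under an L3 entry."""
    return [user for record in cascade(l3_key) for user in record["users"]]


print(L3["smb-10.0.0.7"])
print(users_for("smb-10.0.0.7"))  # ['svc_backup', 'admin', 'guest']
```

The point of the layering is that the orchestrator pays only for the L3 line until something makes the detail relevant, at which point a keyed lookup replaces a similarity search.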
8.2 Case Study B: The Bloated Prompt (F3)
Scenario: A team builds an autonomous agent with a 200K context window and configures it to retain all tool output for the duration of the engagement. The reasoning is: "200K is large; we should not need to truncate."
What happened: By hour three, the context was at 180K tokens. The agent's responses became generic: "Based on the scan results, further enumeration is recommended." It stopped generating specific hypotheses. When asked about a specific service version, it responded with a summary of the general service class rather than the specific version string. The model was not failing to reason; it was spending its attention budget on re-reading 180K tokens of scan output on every inference call, leaving minimal capacity for novel reasoning.
Root cause: The model's effective reasoning capacity is not just the context window size but the attention budget available for novel reasoning after the model has processed existing context. A 200K-token context with 199K tokens of raw scan output leaves the model with effectively 1K tokens of "reasoning headroom." The bloated prompt does not produce errors; it produces blandness — the model defaults to safe, generic responses because it cannot afford the attention cost of specific, novel hypotheses.
What would happen (Tri-Con): The orchestrator's context would be ~2,000 tokens of L3 one-liners. The 180K tokens of raw output would be in L1/L2, retrievable on demand. The agent's reasoning headroom would be ~198K tokens (200K window minus 2K context), not 1K. The agent would generate specific, novel hypotheses because it has the attention budget to do so.
8.3 Case Study C: The RAG Illusion (F8)
Scenario: An agent uses a vector database to store all tool outputs and retrieves chunks by semantic similarity before each reasoning step. The engagement involves a web application with a custom authentication bypass.
What happened: The agent ran a Nikto scan that reported an unusual header: X-Custom-Auth-Bypass: true. This was the critical finding — a misconfigured debug header that allowed authentication bypass. The finding was stored in the vector database. Later, when the agent was attempting to access a protected endpoint and received a 401 response, it queried the vector store for "authentication" and "401" and "access denied." The retrieved chunks were about HTTP 401 status codes, generic authentication mechanisms, and a prior finding about a weak password policy. The X-Custom-Auth-Bypass header was not retrieved because the agent did not know to query for "custom header" or "auth bypass" — it did not suspect that a header could bypass authentication, so it did not issue the query that would have retrieved the finding.
Root cause: RAG retrieval is query-dependent, and the agent cannot query for a connection it has not yet imagined. The finding was in the database; the query was not. This is the fundamental limitation of flat-vector RAG for synthesis tasks: it solves recall (the data is retrievable) but not synthesis (the agent does not know what to retrieve to form the connection).
What would happen (Tri-Con): The L3 layer would contain a one-liner: "Unusual header X-Custom-Auth-Bypass on web app." This one-liner is always in the orchestrator's context. When the agent encounters the 401, it does not need to query for the header — it sees the one-liner in its working set and makes the connection directly. Tri-Con's L3 solves the synthesis problem by keeping all findings in the orchestrator's awareness at ultra-compressed granularity, so correlations are surfaced by the model's own reasoning rather than by a retrieval query that the model must know to issue.
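The contrast between query-dependent retrieval and an always-present L3 layer can be illustrated with a toy sketch. It makes several assumptions: word overlap stands in for embedding similarity (a real embedding model may rank differently), the stored findings and one-liners other than the header are invented for illustration, and `retrieve` and `l3_context` are hypothetical names rather than part of Tri-Con.

```python
import re

# Illustrative store; only the header finding comes from the case study.
FINDINGS = [
    "Nikto: unusual response header X-Custom-Auth-Bypass: true on web app",
    "HTTP 401 status code returned for requests without valid authentication",
    "Weak password policy on the authentication form",
]

# Ultra-compressed one-liners that are always placed in the prompt.
L3_ONE_LINERS = [
    "Unusual header X-Custom-Auth-Bypass on web app",
    "HTTP 401 on protected endpoint",
    "Weak password policy",
]


def words(text: str) -> set:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, store: list, top_k: int = 2) -> list:
    """Toy retriever: rank documents by word overlap with the query."""
    q = words(query)
    ranked = sorted(store, key=lambda doc: len(q & words(doc)), reverse=True)
    return [doc for doc in ranked[:top_k] if q & words(doc)]


def l3_context(one_liners: list) -> str:
    """Render every one-liner; nothing depends on a query."""
    return "\n".join(f"- {line}" for line in one_liners)


hits = retrieve("authentication 401 access denied", FINDINGS)
bypass_retrieved = any("X-Custom-Auth-Bypass" in h for h in hits)
bypass_in_l3 = "X-Custom-Auth-Bypass" in l3_context(L3_ONE_LINERS)
print(bypass_retrieved, bypass_in_l3)
```

In this toy setting the query built from the 401 symptom never reaches the header finding, while the L3 rendering contains it unconditionally.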
8.4 Case Study D: The Scope Violation (F7)
Scenario: An agent is scoped to assess hosts 10.0.0.1–10.0.0.50. The scope document is 3,000 tokens and is placed at the beginning of the context. After 200 steps, the context exceeds the window and the scope document is truncated.
What happened: The agent continued enumeration and discovered an interesting service on 10.0.0.75 — outside the authorised scope. Without the scope document in context, the agent attempted to exploit it, generating traffic on an unauthorised system. The client detected the activity, terminated the engagement, and invoked the liability clause in the contract.
Root cause: Scope is persistent context and must be present at every reasoning step. Naive truncation treats all context equally and evicts the oldest material first, which in pentesting is also the most fundamental (scope, rules of engagement). The agent does not know what it does not know: once the scope is evicted, it has no way to recover it without an explicit retrieval mechanism, and it will not trigger that mechanism because nothing in its remaining context tells it that anything is missing.
What would happen (Tri-Con): Scope is pinned in the system prompt (persistent context, never evicted). Additionally, the Tri-Con L3 layer includes an "engagement boundaries" entry that is always present. The agent cannot forget its scope because scope is not subject to the context management pipeline — it is a fixed input.
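A minimal sketch of the two safeguards described here, under stated assumptions: a pinned entry that the eviction loop never touches, and a hard scope check before any action. The class and function names are hypothetical, a whitespace word count stands in for a real tokenizer, and the scope range is the one from the scenario.

```python
import ipaddress
from collections import deque

# Scope from the scenario: 10.0.0.1 to 10.0.0.50 inclusive.
SCOPE_START = ipaddress.ip_address("10.0.0.1")
SCOPE_END = ipaddress.ip_address("10.0.0.50")


def in_scope(target: str) -> bool:
    """True only if the target address lies inside the authorised range."""
    addr = ipaddress.ip_address(target)
    return SCOPE_START <= addr <= SCOPE_END


class PinnedContext:
    """Rolling context whose pinned entry is never eligible for eviction."""

    def __init__(self, pinned: str, budget: int):
        self.pinned = pinned
        self.budget = budget
        self.rolling = deque()

    @staticmethod
    def cost(text: str) -> int:
        # Crude proxy: one whitespace-separated word counts as one token.
        return len(text.split())

    def add(self, entry: str) -> None:
        self.rolling.append(entry)
        # Evict oldest rolling entries only; the pinned entry is untouched.
        while self.rolling and (
            self.cost(self.pinned) + sum(map(self.cost, self.rolling)) > self.budget
        ):
            self.rolling.popleft()

    def render(self) -> str:
        return "\n".join([self.pinned, *self.rolling])


ctx = PinnedContext("SCOPE: 10.0.0.1-10.0.0.50 only", budget=12)
for step in range(200):
    ctx.add(f"step {step} output")

print(ctx.render().splitlines()[0])   # scope line survives 200 steps
print(in_scope("10.0.0.75"))          # the out-of-scope host from the case study
```

The scope check is deliberately independent of the prompt: even if context handling failed, the guard would still refuse 10.0.0.75.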
9. Introduction to the Tri-Con Solution
The preceding analysis demonstrates that no single-dimensional response to the context crisis is sufficient. Truncation, summarisation, agent routing, and flat RAG each address one dimension while leaving the others to fester. The Tri-Con 3-Layer Index, introduced in full in Whitepaper 01, is a multi-dimensional architecture that addresses all nine failure modes simultaneously through cascaded, multi-granularity, phase-segmented context management.
9.1 Architecture Overview
Tri-Con deploys an indexing agent alongside the orchestrator. Every tool output, skill output, and user interaction is processed by the indexing agent in real time. Raw data is persisted to disk unaltered (addressing F9: Audit Gap). Three progressively compressed semantic indexes are then derived:
The orchestrator operates exclusively on L3, keeping its steady-state context footprint to ~500–2,000 tokens regardless of engagement length. When deeper detail is needed for a specific decision, it triggers cascaded retrieval — L3 → L2 → L1 → Raw — pulling only the granularity required, only for the finding in question, and only for the duration of the reasoning step.
9.2 How Tri-Con Resolves Each Failure Mode
9.3 The Cascaded Retrieval Protocol
The key innovation is not the three layers themselves — multi-level summarisation is a known idea — but the cascaded retrieval protocol that makes them useful. When the orchestrator needs to reason about a specific finding:
The cascade ensures that the orchestrator pays the minimum context cost for each reasoning step. Most decisions are made at L3 (50 tokens); exploitation decisions may require L1 (1,000 tokens); raw output is retrieved only for evidence extraction or rare deep analysis. The amortised context cost per step is low, and the orchestrator's context grows far more slowly than the volume of raw output the engagement produces.
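One way to picture the cascade is the sketch below. It is an illustration under assumptions, not the protocol specified in Whitepaper 01: the index is modelled as plain dictionaries keyed by a hypothetical finding ID, the sample entries are invented, and the caller supplies a `sufficient` predicate that decides when a layer carries enough detail.

```python
from typing import Callable

LAYER_ORDER = ("L3", "L2", "L1", "RAW")  # coarsest to finest

# Invented sample index for a single finding.
INDEX = {
    "L3": {"f-012": "SMB signing disabled on 10.0.0.12"},
    "L2": {"f-012": "10.0.0.12: SMB signing not required; relay candidate"},
    "L1": {"f-012": "10.0.0.12 port 445: signing not required; hostname FILE01"},
    "RAW": {"f-012": "<full unaltered tool output for 10.0.0.12>"},
}


def cascaded_retrieve(
    finding_id: str,
    index: dict,
    sufficient: Callable[[str, str], bool],
) -> tuple:
    """Walk L3 -> L2 -> L1 -> Raw and stop at the first sufficient layer."""
    for layer in LAYER_ORDER:
        text = index[layer][finding_id]
        if sufficient(layer, text):
            return layer, text
    # Nothing satisfied the predicate: fall back to the raw record.
    return "RAW", index["RAW"][finding_id]


# Triage is satisfied by the one-liner; exploitation needs a port number.
triage = cascaded_retrieve("f-012", INDEX, lambda layer, text: True)
exploit = cascaded_retrieve("f-012", INDEX, lambda layer, text: "port" in text)
print(triage[0], exploit[0])
```

Only the text returned by the cascade would be added to the orchestrator's context, and only for the step that requested it.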
This is the architectural realisation of the mathematical model in Section 6.5: by making E_summary grow as t/k rather than t, Tri-Con enables unbounded engagement length at bounded orchestrator context cost. A 200-turn engagement that would exhaust a 128K window under truncation runs comfortably in ~2,000 tokens of orchestrator context under Tri-Con.
9.4 Cross-Reference
The full Tri-Con architecture — indexing algorithm, on-disk storage format, retrieval protocol, cache strategy, index invalidation rules, concurrent-access handling, memory estimates, five benchmark scenarios with quantitative results, and comparison against PentestGPT, HackingBuddyGPT, CAI, and PentAG — is presented in Whitepaper 01: The Tri-Con 3-Layer Index. The token-optimisation engine that governs how L1/L2/L3 entries are sized, cached, and evicted within the token budget is presented in Whitepaper 02: Token Optimisation — 4-Level Engine. The orchestrator that consumes L3 and triggers cascaded retrieval is presented in Whitepaper 03: Custom Orchestrator.
This paper's contribution is the diagnosis: a formal failure mode taxonomy, a mathematical model of context exhaustion, and a demonstration that existing approaches are structurally insufficient. The Tri-Con architecture is the prescription; this paper is the pathology report that motivates it.
10. A Taxonomy of Context in AI Pentesting
To manage context, we must first categorise it. Not all tokens are equal, and a context management architecture that treats them as such is already failing. Tri-Con's design is grounded in the following four-category taxonomy.
10.1 Persistent Context (Must Always Be Present)
This context is small (typically under 10K tokens) but irreplaceable. It should be pinned to the system prompt and never enter the eviction pipeline. Case Study D (Section 8.4) illustrates the consequences of failing to pin scope.
10.2 Working Context (Currently Active)
This context changes frequently and should be managed as a rolling window with explicit summarisation when it exceeds a budget. In Tri-Con, this corresponds to the L3 one-liners plus any L2/L1 entries retrieved via cascaded retrieval for the current reasoning step.
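A rolling window with budget-triggered summarisation might look like the following sketch. The assumptions are significant: a word count stands in for token counting, and `truncate_summary` is a placeholder that simply keeps the first few words, where a real system would call a summarisation model. All names are hypothetical.

```python
def token_cost(text: str) -> int:
    """Crude proxy: one whitespace-separated word counts as one token."""
    return len(text.split())


def truncate_summary(chunks: list, max_words: int = 6) -> str:
    """Placeholder summariser: keep the first few words of the merged text."""
    merged = [w for w in " ".join(chunks).split() if w != "[summary]"]
    return "[summary] " + " ".join(merged[:max_words])


def compact(entries: list, budget: int, summarise=truncate_summary) -> list:
    """Fold the two oldest entries into one summary until the window fits."""
    window = list(entries)
    # Each pass removes one entry, so the loop always terminates.
    while sum(map(token_cost, window)) > budget and len(window) > 1:
        window = [summarise(window[:2]), *window[2:]]
    return window


entries = [f"step {i} found service on port {8000 + i}" for i in range(10)]
window = compact(entries, budget=30)
print(len(window), sum(map(token_cost, window)))
```

The newest entries stay verbatim and the oldest collapse into a single summary line, which keeps the working set inside its budget without dropping it silently.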
10.3 Reference Context (Retrieved on Demand)
This context is large and should live outside the model, retrieved via semantic search, keyword search, or structured queries only when working context signals a need. Tri-Con's L1 and Raw layers serve this role; flat-vector RAG attempts to serve it but without the multi-granularity and phase-segmentation structure that makes retrieval coherent.
10.4 Ephemeral Context (Discardable)
This context should be aggressively filtered at ingestion: never stored, never retrieved, never summarised. The filter must run before tokens enter the window. In Tri-Con, this is the responsibility of the indexing agent's ingestion parser, which strips ephemeral content before generating L1. In the exhaustion equation, this is the mechanism that reduces r (retention rate) and thereby extends t* (exhaustion step).
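An ingestion-time filter of this kind could be sketched as follows. The patterns are assumptions chosen for illustration (blank lines, progress percentages, banner and status lines); they are not the rule set of the Tri-Con ingestion parser, and the sample tool output is invented.

```python
import re

# Hypothetical patterns for content treated as ephemeral.
EPHEMERAL_PATTERNS = [
    re.compile(r"^\s*$"),                      # blank lines
    re.compile(r"^\s*\[?\d{1,3}%\]?"),         # progress percentages
    re.compile(r"^(Starting|Stats:)", re.I),   # banner and status lines
]


def strip_ephemeral(raw: str) -> str:
    """Drop ephemeral lines before the text is stored or indexed."""
    kept = [
        line
        for line in raw.splitlines()
        if not any(p.search(line) for p in EPHEMERAL_PATTERNS)
    ]
    return "\n".join(kept)


sample = "\n".join([
    "Starting scan",
    "",
    " 45% done",
    "22/tcp open ssh",
    "Stats: 0:00:10 elapsed",
    "80/tcp open http",
])
print(strip_ephemeral(sample))
```

Because the filter runs on raw text before anything is indexed, the discarded lines never consume window tokens at any later step.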
11. Measuring Context Health
An often-overlooked requirement: context management must be observable. Teams building autonomous pentesting agents should instrument the following metrics:
These metrics expose whether the agent is thinking or merely processing. An agent with 95% context utilisation, high eviction, and high re-discovery is thrashing; an agent with 40% utilisation, zero eviction, and high retrieval hit rate is healthy. Tri-Con's architecture naturally produces the healthy profile: orchestrator context is ~2K (low utilisation), nothing is evicted from persistent context (zero eviction), and cascaded retrieval has a high hit rate because the L3 index is comprehensive.
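The thrashing and healthy profiles can be expressed as a small classifier. This is a sketch under assumptions: the field names are hypothetical, and the thresholds are illustrative cut-offs chosen so that the two examples in the text (95% and 40% utilisation) fall on the expected sides; the paper does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class ContextHealth:
    utilisation: float         # fraction of the window in use, 0..1
    evictions: int             # entries evicted so far
    rediscoveries: int         # findings the agent had to re-discover
    retrieval_hit_rate: float  # fraction of retrievals that found data, 0..1


def classify(m: ContextHealth) -> str:
    """Label a metrics snapshot; thresholds are illustrative assumptions."""
    if m.utilisation >= 0.9 and m.evictions > 0 and m.rediscoveries > 0:
        return "thrashing"
    if m.utilisation <= 0.5 and m.evictions == 0 and m.retrieval_hit_rate >= 0.8:
        return "healthy"
    return "watch"


print(classify(ContextHealth(0.95, 40, 12, 0.30)))
print(classify(ContextHealth(0.40, 0, 0, 0.90)))
```

Snapshots that match neither profile are labelled "watch" so that ambiguous states are surfaced rather than silently counted as healthy.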
Without these metrics, a team cannot distinguish an agent that is working from one that merely appears to work while systematically missing correlations. The most dangerous failure mode in AI pentesting is not an error message; it is a confident, verbose, incorrect report.
12. The Cost of Context Mismanagement
Context mismanagement is not a theoretical concern — it has direct, measurable costs:
The last failure — the RAG gap producing an invisible incompleteness — is the most insidious. An autonomous agent can run for hours, produce voluminous output, generate a polished report, and appear thorough while systematically missing the cross-context correlations that a human tester would catch. The output report looks complete; the assessment is not. The client receives false confidence, and the vulnerabilities that were missed remain exploitable.
This is why the context crisis is not merely a technical inconvenience but a safety and trust problem. An agent that cannot manage its context cannot be trusted to assess security, because its blind spots are not visible in its output. The report is a confident narration of what the agent saw, not a reliable account of what is actually there.
13. The Path Forward
The context window crisis will not be solved by larger windows alone. Even a 100M-token window would not eliminate the need for management: the signal-to-noise problem, the correlation problem, and the cost problem all persist at every scale. A 100M-token context full of Nmap output is still 99.5% noise, still costs $300 per inference call, and still suffers from needle-in-a-haystack degradation. The solution is architectural, not dimensional.
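The figures in this paragraph can be checked with a short calculation. The rate of $3 per million input tokens is an assumption inferred from the paper's own numbers ($300 for a 100M-token call), not a quoted vendor price, and the function names are hypothetical.

```python
def call_cost_usd(context_tokens: int, usd_per_million: float = 3.0) -> float:
    """Input cost of one inference call at an assumed flat per-token rate."""
    return context_tokens / 1_000_000 * usd_per_million


def signal_tokens(context_tokens: int, noise_fraction: float = 0.995) -> int:
    """Tokens that carry signal if the stated fraction of context is noise."""
    return round(context_tokens * (1 - noise_fraction))


huge = 100_000_000
print(call_cost_usd(huge))      # cost of a single 100M-token call
print(signal_tokens(huge))      # useful tokens inside that call
print(call_cost_usd(2_000))     # cost of a ~2K-token orchestrator context
```

At the assumed rate, the 100M-token call pays for 99.5M tokens of noise on every step, whereas a ~2K-token orchestrator context costs a fraction of a cent per call.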
The path forward comprises six principles, all realised in the Tri-Con architecture:
Teams that adopt these patterns will find that even a 128K-token window is sufficient for complex engagements. Teams that do not will find that even a 10M-token window is not. The context window is not the bottleneck; context management is.
14. Conclusion
The context window is the memory of the AI pentesting agent, and memory is the substrate of intelligence. A model that cannot hold its findings cannot correlate them; a model that cannot correlate its findings cannot find attack chains; a model that cannot find attack chains is not a pentester — it is a scanner with better prose.
The crisis is real, but it is tractable. The failure mode taxonomy in this paper identifies nine distinct ways that autonomous pentesting agents fail under context pressure, grouped into capacity, integrity, and synthesis clusters. The mathematical model of context exhaustion demonstrates that the collapse is abrupt rather than gradual, that it destroys correlation structure before it destroys individual findings, and that bigger windows delay but cannot prevent the collapse. The analysis of existing approaches — truncation, state decomposition, agent routing, flat RAG — shows that each solves one cluster while leaving the others to fester. The case studies illustrate how these failures manifest in real engagements, from forgotten credentials to scope violations to invisible incompleteness.
The Tri-Con 3-Layer Index, detailed in Whitepaper 01, is the architectural response that addresses all nine failure modes through cascaded, multi-granularity, phase-segmented context management with persistent raw-data retention. It enables unbounded engagement length at bounded orchestrator context cost, preserves full fidelity raw data for audit and reporting, and surfaces cross-finding correlations through the orchestrator's own reasoning rather than relying on query-dependent retrieval.
The question facing every team building autonomous pentesting agents is not whether to adopt hierarchical context management, but how quickly. The crisis is not a future risk; it is the present wall. The teams that have moved past the demo phase and into real engagements have already hit it. The teams that have moved past it have done so by treating context not as a buffer to be filled but as a resource to be managed, indexed, and curated — which is to say, by adopting something that looks very much like Tri-Con.
In the next paper in this series (Whitepaper 07: Token Economics), we turn from the capacity of context to the economics of tokens: what does autonomous pentesting actually cost, and how does token economics shape system design?
This whitepaper is part of a series on autonomous penetration testing with AI agents. For the full series index and related work, see the accompanying documentation. Cross-references: Whitepaper 01 (Tri-Con 3-Layer Index), Whitepaper 02 (Token Optimisation Engine), Whitepaper 03 (Custom Orchestrator), Whitepaper 07 (Token Economics).