Whitepaper 01 — The Tri-Con 3-Layer Index: A Cascaded Context-Management Architecture for Autonomous LLM-Driven Penetration Testing

Author: Khushal Suthar, Associate Principal Security Analyst
Date: June 2026
Category: Context Engineering & Agent Memory Architecture

1. Executive Summary

Autonomous penetration-testing agents powered by large language models face a brutal constraint that has nothing to do with the model's reasoning ability: **the context window is finite and small relative to the volume of data an engagement generates.** A single nmap scan can emit 40,000 tokens. A full nikto run can exceed 80,000. A 200-turn engagement against a /24 network easily produces millions of tokens of raw tool output. No current commercial context window — not 128K, not 1M, not 2M — can hold this volume while still leaving room for reasoning, planning, and tool-selection tokens. The agent drowns in its own observations.

Existing agentic penetration-testing frameworks (PentestGPT, HackingBuddyGPT, CAI, PentAG) handle this problem poorly or not at all. They either truncate tool output ad hoc, summarise it with a single lossy pass, or rely on a flat vector store whose retrieval is semantically noisy and phase-unaware. The result is predictable: agents lose critical findings that were observed 50 turns ago, re-run scans they have already completed, hallucinate port numbers because the raw output scrolled out of context, and make exploitation decisions on stale or incomplete data.

This whitepaper introduces the Tri-Con 3-Layer Index — a fully novel, three-tier cascaded context-management architecture purpose-built for autonomous pentesting agents. An indexing agent sits alongside the orchestrator and processes every tool output, skill output, and user interaction in real time. Raw data is persisted to disk unaltered. Three progressively compressed semantic indexes are then derived:

L1 — Concrete: Structured summaries of individual tool outputs (~500–2000 tokens each). Faithful, detailed, parseable.

L2 — Contextual: Key findings grouped by topic and engagement phase (~200–500 tokens per group). Cross-references multiple L1 entries.

L3 — Conceptional: Ultra-compressed one-line summaries of each L2 group (~50–100 tokens each). The orchestrator's working set.

The orchestrator operates exclusively on L3, which keeps the index portion of its steady-state context at ~500–2000 tokens regardless of engagement length. When deeper detail is needed for a specific decision, it triggers cascaded retrieval (L3 → L2 → L1 → Raw), pulling only the granularity required, only for the finding in question, and only for the duration of the reasoning step.

This architecture is, to our knowledge, the first to model an agent's own observations as a multi-granularity indexed corpus with cascaded retrieval, rather than treating context as a flat, append-only buffer. It enables unbounded engagement length at bounded orchestrator context cost, preserves full fidelity raw data for audit and reporting, and reduces redundant tool invocations by surfacing "we already know this" signals at the L3 level.

This whitepaper presents the full Tri-Con architecture: the indexing algorithm, on-disk storage format, retrieval protocol, cache strategy, index invalidation rules, concurrent-access handling, memory estimates, five benchmark scenarios with quantitative results, and a comparison against PentestGPT, HackingBuddyGPT, CAI, and PentAG.

2. Problem Statement

2.1 The Context-Drowning Failure Mode

Consider a representative engagement: an agent is tasked with penetrating a single Linux host at 10.10.10.5. A minimal reconnaissance sequence produces:

1. nmap -sV -sC -p- 10.10.10.5

2. gobuster dir -u http://10.10.10.5 -w big.txt

3. nikto -h http://10.10.10.5

4. smbclient -L //10.10.10.5 -N

5. enum4linux -a 10.10.10.5

6. wpscan --url http://10.10.10.5

By step 6, still in enumeration, the agent has generated **124,600 tokens of raw output.** Once the system prompt, tool definitions, and the agent's own reasoning tokens are added, a 128K-context model has exhausted its window before a single exploitation step. Truncation begins. The agent loses the nmap service versions from step 1, the gobuster directory listings from step 2, and the nikto findings from step 3. When it needs to reason about whether vsftpd 2.3.4 (from the nmap scan) is exploitable, the data is gone. The agent either re-runs nmap (wasting time and generating noise on the target) or proceeds blind.

2.2 Failure Mode Taxonomy

We identify six distinct failure modes that arise from inadequate context management in autonomous pentesting agents:

F1: Context drowning

F2: Stale-finding drift

F3: Re-scan redundancy

F4: Granularity mismatch

F5: Cross-phase contamination

F6: Audit gap

Every existing framework exhibits at least three of these. The Tri-Con architecture is designed to eliminate all six.

2.3 Why Existing Approaches Fail

Naïve summarisation (single-pass compression). Some frameworks pass tool output through the LLM once, produce a ~200-token summary, and discard the raw output. This solves F1 (context drowning) but introduces F4 (granularity mismatch) and F6 (audit gap): the summary is either too terse to support detailed exploitation reasoning or too verbose for high-level planning, and the original evidence is lost. If the summariser misses a subtle finding (e.g. an unusual HTTP header that indicates a custom web app), that finding is permanently gone.

Flat vector store (RAG over observations). Other frameworks embed tool outputs into a single vector collection and retrieve by semantic similarity. This preserves raw data (addressing F6) but fails on F4 and F5: a query for "SSH" returns a mix of nmap output, hydra output, and a reporting fragment, with no notion of which phase they belong to or what level of detail the orchestrator currently needs. The retrieved set is semantically related but contextually incoherent.

Sliding window / message truncation. The simplest approach — keep the last N messages and drop the rest — guarantees F1, F2, F3, and F6 simultaneously. It is used implicitly by any framework that does not implement a dedicated context manager.

No existing approach provides **multi-granularity, phase-segmented, cascaded retrieval over an agent's own live observations.** Tri-Con does.

3. The Tri-Con Architecture

3.1 System Overview

┌────────────────────────────────────────────────────────────────────┐
│                 TRI-CON CONTEXT MANAGEMENT SYSTEM                  │
│                                                                    │
│  ┌──────────┐    ┌──────────────────────────────────────────┐      │
│  │ Tool /   │───►│ INDEXING AGENT                           │      │
│  │ Skill /  │    │                                          │      │
│  │ User I/O │    │ 1. Persist raw output to disk            │      │
│  └──────────┘    │ 2. Generate L1 (concrete summary)        │      │
│                  │ 3. Update L2 (contextual groupings)      │      │
│                  │ 4. Update L3 (conceptional one-liners)   │      │
│                  └──────────────┬───────────────────────────┘      │
│                                 │                                  │
│                   ┌─────────────┼─────────────┐                    │
│                   ▼             ▼             ▼                    │
│              ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│              │ L1 INDEX │  │ L2 INDEX │  │ L3 INDEX │              │
│              │ 500-2000 │  │ 200-500  │  │ 50-100   │              │
│              │ tokens/  │  │ tokens/  │  │ tokens/  │              │
│              │ entry    │  │ group    │  │ entry    │              │
│              └────┬─────┘  └────┬─────┘  └────┬─────┘              │
│                   │             │             │                    │
│                   │   (cascaded retrieval)    │                    │
│                   │             │             │                    │
│  ┌──────────┐     │             │             ▼                    │
│  │ ORCHES-  │══════════► L3 (always loaded, ~500-2000 tokens)      │
│  │ TRATOR   │                                                      │
│  │ (LLM)    │═══════════│═══════════│══════════► on demand:        │
│  └──────────┘           L2          L1         Raw                 │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The orchestrator LLM sees only L3 in its steady-state context. L3 entries are ultra-compressed — typically one sentence per finding group — so an entire engagement's L3 surface fits in 500–2000 tokens. When the orchestrator needs more detail (e.g. it sees an L3 entry "FTP backdoor candidate on port 21" and wants the exact service version and nmap script output), it issues a drill-down request that cascades: the named L3 entry → its parent L2 group → the relevant L1 summaries → (if needed) the raw file on disk. Only the requested slice is injected, and only for the current reasoning turn.

3.2 The Indexing Agent

The indexing agent is a dedicated LLM call (or pipeline of calls) that runs asynchronously alongside the orchestrator. It never blocks the orchestrator's reasoning loop — instead, it consumes tool outputs from a queue and updates the indexes. Its job is extraction, compression, and cross-referencing, not decision-making.

┌──────────────────────────────────────────────────────────────┐
│                   INDEXING AGENT PIPELINE                    │
│                                                              │
│  Tool Output ──► [Queue] ──► ┌─────────────────────────┐     │
│                              │ 1. RAW PERSIST          │     │
│                              │   Write to disk, get ID │     │
│                              │   raw/{engagement}/     │     │
│                              │   {timestamp}_{tool}.out│     │
│                              └───────────┬─────────────┘     │
│                                          ▼                   │
│                              ┌─────────────────────────┐     │
│                              │ 2. L1 EXTRACTION        │     │
│                              │   Structured summary    │     │
│                              │   Parse → JSON/dict     │     │
│                              │   ~500-2000 tokens      │     │
│                              └───────────┬─────────────┘     │
│                                          ▼                   │
│                              ┌─────────────────────────┐     │
│                              │ 3. L2 GROUPING          │     │
│                              │   Match to topic/phase  │     │
│                              │   Merge with existing   │     │
│                              │   group or create new   │     │
│                              │   ~200-500 tokens/grp   │     │
│                              └───────────┬─────────────┘     │
│                                          ▼                   │
│                              ┌─────────────────────────┐     │
│                              │ 4. L3 COMPRESSION       │     │
│                              │   One-line per group    │     │
│                              │   Replace old L3 entry  │     │
│                              │   ~50-100 tokens        │     │
│                              └───────────┬─────────────┘     │
│                                          ▼                   │
│                              ┌─────────────────────────┐     │
│                              │ 5. INDEX COMMIT         │     │
│                              │   Atomic write, notify  │     │
│                              │   orchestrator of L3    │     │
│                              │   delta                 │     │
│                              └─────────────────────────┘     │
└──────────────────────────────────────────────────────────────┘

3.3 The Three Layers in Detail

L1 — Concrete (Structured Tool Output Summary)

Each L1 entry corresponds to one tool invocation (or one skill execution, or one significant user interaction). It is a faithful, structured representation of the raw output — not a freeform narrative. The indexing agent extracts:

Tool identity: name, version, exact command invoked.
Target/host: IP, port, URL, or scope element.
Structured findings: parsed into typed fields (open ports, service versions, vulnerabilities, credentials, directories, etc.).

Metadata: timestamp, engagement phase, exit status, duration.
Raw reference: file path to the persisted raw output.
Flags: needs_followup, contains_credential, contains_exploit_candidate.

An L1 entry is ~500–2000 tokens. It is detailed enough that the orchestrator can reason about exact service versions, exact credentials, and exact vulnerability IDs without fetching raw output. It is structured enough that programmatic parsers can operate on it.
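The field list above can be modelled as typed records so that downstream parsers reject malformed entries early. The following is a minimal sketch, assuming Python dataclasses; the names `OpenPort`, `L1Entry`, and `parse_l1` are illustrative and are not part of any published Tri-Con interface. The sample values mirror the nmap example used throughout this section.

```python
import json
from dataclasses import dataclass

# Hypothetical typed view of an L1 entry; frozen to reflect that L1 is immutable.
@dataclass(frozen=True)
class OpenPort:
    port: int
    service: str
    version: str
    scripts: tuple = ()

@dataclass(frozen=True)
class L1Entry:
    id: str
    tool: str
    command: str
    target: str
    phase: str
    raw_path: str
    open_ports: tuple = ()
    flags: tuple = ()

def parse_l1(text: str) -> L1Entry:
    """Parse an L1 JSON document; a missing required key raises KeyError."""
    doc = json.loads(text)
    ports = tuple(
        OpenPort(p["port"], p["service"], p["version"], tuple(p.get("scripts", [])))
        for p in doc.get("findings", {}).get("open_ports", [])
    )
    return L1Entry(
        id=doc["id"], tool=doc["tool"], command=doc["command"],
        target=doc["target"], phase=doc["phase"], raw_path=doc["raw_path"],
        open_ports=ports, flags=tuple(doc.get("flags", [])),
    )

sample = json.dumps({
    "id": "l1_20260621_001427_nmap", "tool": "nmap",
    "command": "nmap -sV -sC -p- 10.10.10.5", "target": "10.10.10.5",
    "phase": "enumeration",
    "findings": {"open_ports": [
        {"port": 21, "service": "ftp", "version": "vsftpd 2.3.4",
         "scripts": ["ftp-banner", "ftp-anon"]}]},
    "flags": ["contains_exploit_candidate", "needs_followup"],
    "raw_path": "raw/eng_001/20260621_001427_nmap.out",
})
entry = parse_l1(sample)
```

Freezing the records is a design choice of this sketch: any correction has to produce a new entry rather than mutate an existing one.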

Example L1 entry (nmap scan):

{
  "id": "l1_20260621_001427_nmap",
  "tool": "nmap",
  "command": "nmap -sV -sC -p- 10.10.10.5",
  "target": "10.10.10.5",
  "phase": "enumeration",
  "timestamp": "2026-06-21T00:14:27Z",
  "duration_s": 187,
  "exit_status": 0,
  "findings": {
    "open_ports": [
      {"port": 21, "service": "ftp",  "version": "vsftpd 2.3.4",        "scripts": ["ftp-banner", "ftp-anon"]},
      {"port": 22, "service": "ssh",  "version": "OpenSSH 8.9p1",       "scripts": []},
      {"port": 80, "service": "http", "version": "Apache httpd 2.4.41", "scripts": ["http-title", "http-robots.txt"]}
    ],
    "os_guess": "Linux 3.X/4.X",
    "notes": "FTP banner mentions 'vsftpd 2.3.4'. Anonymous FTP enabled."
  },
  "flags": ["contains_exploit_candidate", "needs_followup"],
  "raw_path": "raw/eng_001/20260621_001427_nmap.out"
}

L2 — Contextual (Topic/Phase Grouped Findings)

L2 entries aggregate multiple L1 entries that pertain to the **same topic within the same phase**. A "topic" is a semantic grouping — e.g. "SMB enumeration on 10.10.10.5", "Web directory discovery on 10.10.10.5", "Credential collection". When a new L1 entry arrives, the indexing agent determines whether it belongs to an existing L2 group or warrants a new one. If it joins an existing group, the L2 entry is re-summarised to incorporate the new findings and drop any superseded information.

An L2 entry is ~200–500 tokens. It cross-references its constituent L1 entries by ID so the cascaded retrieval can drill down.

Example L2 entry:

{
  "id": "l2_grp_ftp_10.10.10.5",
  "topic": "FTP service analysis on 10.10.10.5",
  "phase": "enumeration",
  "l1_refs": ["l1_20260621_001427_nmap", "l1_20260621_001905_ftp_anon"],
  "summary": "Port 21 runs vsftpd 2.3.4 (known backdoor CVE-2011-2523). Anonymous login enabled; share contains 'pub' directory with a world-readable backup.zip. Backdoor trigger requires writing a smiley :) in the username field. No credentials yet. Exploitation candidate for next phase.",
  "last_updated": "2026-06-21T00:19:05Z",
  "key_facts": [
    "vsftpd 2.3.4 — CVE-2011-2523 backdoor",
    "Anonymous FTP enabled",
    "backup.zip present in anonymous share"
  ],
  "token_estimate": 312
}

L3 — Conceptional (Ultra-Compressed Working Set)

Each L2 group is compressed to a single line (~50–100 tokens) for L3. The orchestrator's context window contains **the full set of L3 entries** — the complete high-level picture of everything discovered so far, at trivial token cost. L3 entries are replaced in-place when their parent L2 group is updated.

Example L3 entry:

[FTP] 10.10.10.5:21 vsftpd 2.3.4 — CVE-2011-2523 backdoor candidate; anon FTP enabled; backup.zip in share. (drill: l2_grp_ftp_10.10.10.5)

Full L3 context for a mid-engagement orchestrator turn (~850 tokens):

=== L3 CONTEXT SNAPSHOT (engagement eng_001, turn 47) ===
[RECON] 10.10.10.5 alive. Linux 3.X/4.X. 3 open ports: 21/ftp, 22/ssh, 80/http. (drill: l2_grp_recon_10.10.10.5)
[FTP] 10.10.10.5:21 vsftpd 2.3.4 — CVE-2011-2523 backdoor candidate; anon FTP enabled; backup.zip in share. (drill: l2_grp_ftp_10.10.10.5)
[SSH] 10.10.10.5:22 OpenSSH 8.9p1 — no weak ciphers; no password auth. Brute-force low value. (drill: l2_grp_ssh_10.10.10.5)
[WEB] 10.10.10.5:80 Apache 2.4.41 — gobuster found /admin, /uploads, /config.php.bak; nikto flags outdated headers; no known CVE in Apache version. WordPress detected at /; wpscan found 2 outdated plugins. (drill: l2_grp_web_10.10.10.5)
[CRED] No credentials obtained. backup.zip may contain config with creds. (drill: l2_grp_credentials)
[EXPLOIT] Not yet attempted. Primary candidate: vsftpd backdoor. Secondary: WordPress plugin RCE. (drill: l2_grp_exploit_plan)
=== END L3 SNAPSHOT — 6 groups, ~850 tokens ===

The orchestrator reasons over this 850-token snapshot. If it decides to pursue the vsftpd backdoor, it drills into l2_grp_ftp_10.10.10.5 (~312 tokens) and, if it needs the exact nmap script output, into l1_20260621_001427_nmap (~1200 tokens) and then the raw file. Total context for this reasoning step: ~850 + 312 + 1200 = 2362 tokens, versus 124,600 if all raw output were loaded.
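The arithmetic in this worked example can be checked with a few lines of Python. This is only an illustration of the token figures quoted above; the constant and function names are assumptions of the sketch.

```python
# Token sizes quoted in the worked example (snapshot, L2 group, L1 entry).
L3_SNAPSHOT = 850
L2_GROUP = 312
L1_ENTRY = 1200
RAW_TOTAL = 124_600  # all raw enumeration output, from Section 2.1

def context_cost(depth: int) -> int:
    """Tokens in context when drilling `depth` levels below L3 (0 = L3 only)."""
    layers = [L3_SNAPSHOT, L2_GROUP, L1_ENTRY]
    return sum(layers[: depth + 1])

l1_cost = context_cost(2)             # L3 + L2 + L1
savings = 1 - l1_cost / RAW_TOTAL     # fraction of raw tokens avoided
```

Under these figures, drilling to L1 costs 2,362 tokens and avoids roughly 98% of the raw volume.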

4. Detailed Indexing Algorithm

4.1 L1 Extraction Algorithm

The indexing agent receives a tool output and produces an L1 entry through a structured extraction prompt. The algorithm:

FUNCTION extract_l1(tool_name, command, raw_output, phase, metadata):
    # Step 1: Classify tool type to select extraction template
    tool_type = classify_tool(tool_name)
    # Returns: "scanner", "web_scanner", "exploit", "cred_attack",
    #          "file_op", "post_exploit", "recon", "other"

    # Step 2: Load tool-specific extraction schema
    schema = LOAD_SCHEMA(tool_name, tool_type)
    # Schemas define expected fields: e.g. nmap → open_ports[],
    #   service_versions{}, scripts{}; gobuster → directories[],
    #   status_codes{}

    # Step 3: Structured extraction via LLM
    l1_entry = LLM_EXTRACT(
        model = indexing_model,        # cheaper model, e.g. 8B
        prompt = BUILD_L1_PROMPT(tool_name, command, raw_output, schema, phase),
        output_format = "json",
        max_tokens = 2000,
        temperature = 0.0
    )

    # Step 4: Post-process and validate
    l1_entry.id = generate_id(timestamp, tool_name)
    l1_entry.raw_path = persist_raw(raw_output)
    l1_entry.flags = detect_flags(l1_entry, tool_type)
    # flags: contains_credential, contains_exploit_candidate,
    #        needs_followup, contains_pii, destructive_confirmed

    # Step 5: Write to L1 index store
    L1_STORE.put(l1_entry.id, l1_entry)

    # Step 6: Enqueue for L2 grouping
    L2_QUEUE.push(l1_entry)

    RETURN l1_entry.id

Tool-specific extraction schemas are the key to L1 quality. Rather than asking the LLM to "summarise this output" (which produces lossy prose), the schema forces it to fill structured fields. A schema for nmap expects open_ports as an array of {port, service, version, scripts[]}. A schema for hydra expects {protocol, target, credentials_found[], attempts, duration}. This makes L1 entries machine-parseable and dramatically reduces hallucination — the model is filling a form, not writing free text.
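To make the "filling a form" idea concrete, the sketch below validates an extracted record against a schema before it is accepted into L1. The schema representation (field name mapped to a required Python type) is an assumption made for illustration; the field names come from the nmap and hydra descriptions above.

```python
# Hypothetical schema format: field name -> required Python type(s).
NMAP_PORT_SCHEMA = {"port": int, "service": str, "version": str, "scripts": list}
HYDRA_SCHEMA = {"protocol": str, "target": str, "credentials_found": list,
                "attempts": int, "duration": (int, float)}

def schema_errors(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors

good = {"port": 21, "service": "ftp", "version": "vsftpd 2.3.4", "scripts": ["ftp-anon"]}
bad = {"port": "21", "service": "ftp", "comment": "looks old"}
good_errors = schema_errors(good, NMAP_PORT_SCHEMA)
bad_errors = schema_errors(bad, NMAP_PORT_SCHEMA)
```

A record that fails validation could be re-extracted or flagged for review instead of being written to the index; which of these the indexing agent does is not specified here.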

4.2 L2 Grouping Algorithm

FUNCTION update_l2(new_l1_entry):
    # Step 1: Determine candidate topic key
    topic_key = derive_topic_key(new_l1_entry)
    # Heuristic: "{service_or_topic}_{target}_{phase}"
    # e.g. "ftp_10.10.10.5_enumeration"

    # Step 2: Search for existing L2 group
    candidate_groups = L2_STORE.find_by_topic(topic_key)
    # Also fuzzy-match on target + phase + service overlap

    IF candidate_groups is not empty:
        # Step 3a: Merge into best-matching group
        best_group = select_best_match(new_l1_entry, candidate_groups)
        merged_summary = LLM_MERGE(
            model = indexing_model,
            prompt = BUILD_L2_MERGE_PROMPT(
                existing_summary = best_group.summary,
                new_l1 = new_l1_entry,
                key_facts = best_group.key_facts
            ),
            instruction = "Update the group summary to incorporate new findings. Remove superseded info. Keep key_facts list current. ~200-500 tokens."
        )
        best_group.summary = merged_summary.text
        best_group.key_facts = merged_summary.key_facts
        best_group.l1_refs.append(new_l1_entry.id)
        best_group.last_updated = now()
        best_group.token_estimate = count_tokens(merged_summary.text)
        L2_STORE.put(best_group.id, best_group)
        l2_id = best_group.id
    ELSE:
        # Step 3b: Create new L2 group
        new_group = LLM_CREATE_GROUP(
            model = indexing_model,
            prompt = BUILD_L2_CREATE_PROMPT(new_l1_entry, topic_key),
            instruction = "Create a topic-grouped summary. 200-500 tokens. List key facts. Tag with phase."
        )
        new_group.id = generate_group_id(topic_key)
        new_group.l1_refs = [new_l1_entry.id]
        L2_STORE.put(new_group.id, new_group)
        l2_id = new_group.id

    # Step 4: Trigger L3 update for this group
    update_l3(l2_id)

    RETURN l2_id
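The topic-key heuristic in Step 1 can be sketched as follows. The pseudocode passes a whole L1 entry to `derive_topic_key`; this version takes the three components directly and adds a normalisation step, both of which are assumptions made to keep the example self-contained.

```python
import re

def derive_topic_key(service_or_topic: str, target: str, phase: str) -> str:
    """Build a '{service_or_topic}_{target}_{phase}' key, normalised for matching."""
    def clean(part: str) -> str:
        # Lower-case and collapse anything outside [a-z0-9.] into single hyphens.
        return re.sub(r"[^a-z0-9.]+", "-", part.strip().lower()).strip("-")
    return "_".join(clean(p) for p in (service_or_topic, target, phase))

def same_group(key_a: str, key_b: str) -> bool:
    """Exact-match grouping only; fuzzy matching is out of scope for this sketch."""
    return key_a == key_b

key = derive_topic_key("FTP", "10.10.10.5", "Enumeration")
```

Normalising case and whitespace before comparison keeps cosmetic differences in tool output from splitting one topic into several L2 groups.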

4.3 L3 Compression Algorithm

FUNCTION update_l3(l2_group_id):
    l2_entry = L2_STORE.get(l2_group_id)

    l3_text = LLM_COMPRESS(
        model = indexing_model,
        prompt = BUILD_L3_PROMPT(l2_entry),
        instruction = "Compress to a single line (50-100 tokens). Preserve: target, key finding, exploit candidacy, critical facts. Append drill reference: (drill: {l2_group_id})",
        max_tokens = 120,
        temperature = 0.0
    )

    l3_entry = {
        "id": "l3_" + l2_group_id,
        "text": l3_text,
        "l2_ref": l2_group_id,
        "tag": l2_entry.topic_tag,    # e.g. [FTP], [WEB], [CRED]
        "phase": l2_entry.phase,
        "last_updated": now()
    }

    # Atomic replace: old L3 entry for this group is overwritten
    L3_STORE.replace(l3_entry.id, l3_entry)

    # Notify orchestrator of L3 delta (async, non-blocking)
    ORCHESTRATOR.notify_l3_update(l3_entry)
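Because the L3 line is produced by an LLM, a cheap structural check before the atomic replace may be worthwhile. The sketch below assumes the `[TAG] text (drill: l2_id)` shape shown in the L3 examples and uses a word count as a crude stand-in for a tokenizer; the regular expression and the 75-word limit are illustrative choices, not part of the specification.

```python
import re

L3_LINE = re.compile(r"^\[(?P<tag>[A-Z-]+)\] (?P<body>.+) \(drill: (?P<ref>[\w.]+)\)$")

def check_l3_line(line: str, max_words: int = 75) -> dict:
    """Validate an L3 line's shape; word count is a rough proxy for tokens."""
    match = L3_LINE.match(line)
    if match is None:
        raise ValueError("L3 line must look like '[TAG] text (drill: l2_id)'")
    if len(line.split()) > max_words:
        raise ValueError("L3 line exceeds the size limit")
    return match.groupdict()

parsed = check_l3_line(
    "[FTP] 10.10.10.5:21 vsftpd 2.3.4; anon FTP enabled. (drill: l2_grp_ftp_10.10.10.5)"
)
```

A line that fails the check could be regenerated with the same prompt; without a valid drill reference the orchestrator has no way to cascade from that entry.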

4.4 Orchestrator Notification Protocol

When L3 is updated, the orchestrator does not reload its entire context. Instead, it receives a delta:

{
  "type": "l3_delta",
  "action": "replace",
  "l3_id": "l3_l2_grp_ftp_10.10.10.5",
  "old_text": "[FTP] 10.10.10.5:21 vsftpd 2.3.4 — backdoor candidate; anon FTP enabled.",
  "new_text": "[FTP] 10.10.10.5:21 vsftpd 2.3.4 — CVE-2011-2523 backdoor candidate; anon FTP enabled; backup.zip in share. (drill: l2_grp_ftp_10.10.10.5)",
  "timestamp": "2026-06-21T00:19:06Z"
}

The orchestrator applies the delta by string-replacing the old L3 line with the new one in its context. Because each delta touches a single line of a bounded snapshot, the cost of the mutation is independent of engagement length: no full reload, no re-embedding, no window pressure.
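As a variation on the string replacement described above, the snapshot can be held as a mapping from `l3_id` to line, so that applying a delta does not depend on the old text still matching character for character. This keyed representation is an assumption of the sketch, not the format the whitepaper prescribes.

```python
def apply_l3_delta(snapshot: dict, delta: dict) -> dict:
    """Apply one l3_delta to a snapshot stored as {l3_id: line}.

    Returns a new dict so the caller can keep the previous state if needed.
    """
    if delta.get("type") != "l3_delta" or delta.get("action") != "replace":
        raise ValueError(f"unsupported delta: {delta!r}")
    updated = dict(snapshot)
    updated[delta["l3_id"]] = delta["new_text"]   # insert or overwrite
    return updated

def render(snapshot: dict) -> str:
    """Render the snapshot in insertion order, one L3 line per group."""
    return "\n".join(snapshot.values())

before = {"l3_l2_grp_ftp_10.10.10.5": "[FTP] 10.10.10.5:21 vsftpd 2.3.4; anon FTP enabled."}
delta = {"type": "l3_delta", "action": "replace",
         "l3_id": "l3_l2_grp_ftp_10.10.10.5",
         "new_text": "[FTP] 10.10.10.5:21 vsftpd 2.3.4; anon FTP enabled; backup.zip in share."}
after = apply_l3_delta(before, delta)
```

The same function covers both the first appearance of a group and later replacements, since assignment by key inserts or overwrites as needed.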

5. Storage Format

5.1 On-Disk Layout

engagements/
  eng_001/
    raw/
      20260621_001427_nmap.out        # raw tool output, unmodified
      20260621_001905_ftp_anon.out
      20260621_002103_gobuster.out
      ...
    l1/
      20260621_001427_nmap.json       # L1 entry (structured JSON)
      20260621_001905_ftp_anon.json
      ...
    l2/
      l2_grp_ftp_10.10.10.5.json      # L2 group (summary + l1_refs)
      l2_grp_web_10.10.10.5.json
      l2_grp_credentials.json
      ...
    l3/
      l3_snapshot.json                # full L3 set (current state)
      l3_history.jsonl                # append-only delta log
    meta/
      engagement.json                 # scope, target list, start time
      phase_log.jsonl                 # phase transitions
      index_lock                      # concurrent access lock file

5.2 Raw File Format

Raw files are stored exactly as emitted by the tool — no modification, no truncation, no encoding changes. This guarantees:

Audit integrity: the raw file is court-defensible evidence.
Re-parseability: if an L1 extraction missed a finding, the raw file can be re-processed with an updated schema.

Reporting: the final report can quote raw tool output verbatim.

File naming convention: {YYYYMMDD}_{HHMMSS}_{tool_name}.out.
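A small helper can generate names that follow this convention. The sketch assumes timestamps are normalised to UTC, which matches the `Z`-suffixed timestamps in the examples but is not stated explicitly; the function name is hypothetical.

```python
from datetime import datetime, timezone

def raw_filename(tool_name: str, when: datetime) -> str:
    """Return '{YYYYMMDD}_{HHMMSS}_{tool_name}.out' for a timezone-aware timestamp."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{tool_name}.out"

name = raw_filename("nmap", datetime(2026, 6, 21, 0, 14, 27, tzinfo=timezone.utc))
```

Because the timestamp leads the name, a plain lexicographic sort of the raw directory yields chronological order.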

5.3 L1/L2/L3 Serialisation

All index entries are JSON. L1 and L2 are stored as individual files (one per entry) to support atomic writes and granular retrieval. L3 is stored as a single l3_snapshot.json (the current state) plus an append-only l3_history.jsonl delta log for audit and replay.
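One common way to obtain atomic writes for per-entry JSON files is to write to a temporary file in the same directory and rename it over the target. The sketch below shows that pattern alongside an append-only JSONL log; it illustrates the idea under these assumptions and is not the storage layer itself. The function names are hypothetical.

```python
import json
import os
import tempfile

def atomic_write_json(path: str, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over `path`."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)      # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

def append_delta(log_path: str, delta: dict) -> None:
    """Append one delta as a single JSON line to the history log."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(delta) + "\n")

workdir = tempfile.mkdtemp()
snapshot_path = os.path.join(workdir, "l3_snapshot.json")
history_path = os.path.join(workdir, "l3_history.jsonl")
atomic_write_json(snapshot_path, {"l3_count": 1})
atomic_write_json(snapshot_path, {"l3_count": 2})
append_delta(history_path, {"action": "replace", "l3_id": "l3_example"})
append_delta(history_path, {"action": "replace", "l3_id": "l3_example"})
```

A reader that opens the snapshot sees either the old or the new version in full, never a partially written file, while the history log retains every delta for replay.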

5.4 Index Manifest

A manifest.json at the engagement root tracks:

{
  "engagement_id": "eng_001",
  "l1_count": 47,
  "l2_count": 12,
  "l3_count": 12,
  "raw_disk_bytes": 184320000,
  "l1_disk_bytes": 2100000,
  "l2_disk_bytes": 480000,
  "l3_disk_bytes": 9600,
  "oldest_entry": "2026-06-21T00:14:27Z",
  "newest_entry": "2026-06-21T03:42:11Z",
  "index_version": "2.1.0",
  "schema_versions": {"l1": "1.3", "l2": "1.1", "l3": "1.0"}
}

6. Retrieval Protocol

6.1 Steady-State Operation

During normal operation, the orchestrator's context contains:

System prompt (~500 tokens).
Current task/phase directive (~200 tokens).
L3 snapshot (~500–2000 tokens — the full conceptional picture).
Recent reasoning scratchpad (~500–1000 tokens).
Available tool list (~300 tokens).

Total steady-state context: ~2000–4000 tokens. This range is bounded: it does not grow with engagement length. Whether the engagement is 20 turns or 2,000 turns, the orchestrator's context footprint stays within it.
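The total follows directly from the component estimates listed above. As a quick check, the sketch below sums the low and high ends of each component; the dictionary keys are illustrative labels.

```python
# (low, high) token estimates for each steady-state component listed above.
COMPONENTS = {
    "system_prompt": (500, 500),
    "phase_directive": (200, 200),
    "l3_snapshot": (500, 2000),
    "scratchpad": (500, 1000),
    "tool_list": (300, 300),
}

def steady_state_bounds(components: dict) -> tuple:
    """Sum the low and high estimates across all components."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high

bounds = steady_state_bounds(COMPONENTS)
```

Only the L3 snapshot and the scratchpad vary, so the spread between the two bounds comes entirely from those two components.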

6.2 Cascaded Drill-Down Retrieval

When the orchestrator needs detail beyond L3, it emits a drill-down request:

DRILL_DOWN(target_l3_id, depth)
  → depth=1: return L2 entry for this group
  → depth=2: return L2 + all L1 entries referenced by this group
  → depth=3: return L2 + L1 + raw file paths
             (orchestrator may request specific raw files)

Protocol:

FUNCTION handle_drill_down(request):
    # Orchestrator emits: {"drill": "l3_l2_grp_ftp_10.10.10.5", "depth": 2}

    l3_entry = L3_STORE.get(request.drill)
    l2_entry = L2_STORE.get(l3_entry.l2_ref)

    # Defaults, so shallow drills return empty lists rather than undefined values
    l1_entries = []
    raw_paths = []
    raw_contents = []

    IF request.depth >= 2:
        l1_entries = [L1_STORE.get(ref) for ref in l2_entry.l1_refs]

    IF request.depth >= 3:
        raw_paths = [l1.raw_path for l1 in l1_entries]
        # Raw files are NOT auto-loaded; orchestrator must explicitly
        # request each file to prevent accidental context flooding.
        raw_contents = [read_file(p) for p in request.raw_files or []]

    payload = assemble_drill_payload(l2_entry, l1_entries, raw_paths, raw_contents)
    # Token-count the payload; if it exceeds drill budget (default
    # 4000 tokens), truncate with warning and suggest deeper drill
    # on specific L1 entries.

    RETURN payload  # Injected into orchestrator's next context

6.3 Drill-Down Token Budgeting

Each drill-down is budgeted. The default per-drill budget is 4000 tokens. If the requested depth would exceed the budget, the system returns a truncated payload with a list of available L1 entries and their token estimates, allowing the orchestrator to issue a follow-up drill targeting specific entries.

Example truncated drill response:
{
  "l2": "l2_grp_web_10.10.10.5 (312 tokens) [included]",
  "l1_available": [
    {"id": "l1_..._gobuster", "tokens": 1400, "included": true},
    {"id": "l1_..._nikto",    "tokens": 1850, "included": true},
    {"id": "l1_..._wpscan",   "tokens": 1600, "included": false, "reason": "budget exceeded — drill this id separately"}
  ],
  "budget_used": 3562,
  "budget_total": 4000
}
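The budgeting rule behind this response can be sketched as a greedy pass over the L1 entries of a group. The function name, the short identifiers, and the "stop at the first entry that does not fit" policy are assumptions of this example; the token figures are those of the truncated response above.

```python
def plan_drill(l2_tokens: int, l1_sizes: list, budget: int = 4000) -> dict:
    """Include L1 entries in order until the next one would exceed the budget.

    `l1_sizes` is a list of (l1_id, token_estimate) pairs. Once one entry is
    skipped, later entries are skipped as well.
    """
    used = l2_tokens
    report = []
    still_open = True
    for l1_id, tokens in l1_sizes:
        fits = still_open and used + tokens <= budget
        if fits:
            used += tokens
        else:
            still_open = False
        report.append({"id": l1_id, "tokens": tokens, "included": fits})
    return {"l1_available": report, "budget_used": used, "budget_total": budget}

plan = plan_drill(312, [("gobuster", 1400), ("nikto", 1850), ("wpscan", 1600)])
```

With these inputs the L2 summary and the first two L1 entries fit, and the third is left for a follow-up drill.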

6.4 Retrieval Decision Tree

Orchestrator reasoning step:
│
├─ Is L3 sufficient for this decision?
│    YES → proceed with L3 only (~850 tokens)
│    NO  → drill to L2 for this group
│    │
│    ├─ Is L2 sufficient?
│    │    YES → proceed with L3 + L2 slice (~1162 tokens)
│    │    NO  → drill to L1 for specific tool outputs
│    │    │
│    │    ├─ Is L1 sufficient?
│    │    │    YES → proceed with L3 + L2 + L1 (~2362)
│    │    │    NO  → fetch specific raw file
│    │    │    │
│    │    │    └─ Raw file loaded (~12,400 tokens,
│    │    │         only for this reasoning step,
│    │    │         evicted after step completes)

The key principle: granularity is pulled, not pushed. The orchestrator pulls only what it needs, only when it needs it, and only for the specific finding under consideration.

7. Cache Strategy

7.1 Multi-Level Cache

Tri-Con maintains a multi-tier in-memory cache, with one tier per index layer plus one for raw files, to avoid repeated disk reads and LLM calls during retrieval:

L3 cache

L2 cache

L1 cache

Raw cache

7.2 Eviction Logic

FUNCTION cache_put(tier, key, value, token_cost):
    cache = CACHES[tier]
    WHILE cache.token_total + token_cost > cache.token_budget:
        evicted = cache.evict_lru()
        cache.token_total -= evicted.token_cost
    cache.put(key, value)
    cache.token_total += token_cost

FUNCTION cache_get(tier, key):
    entry = CACHES[tier].get(key)
    IF entry:
        entry.touch()   # update LRU position
        RETURN entry.value
    RETURN None
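The eviction logic above maps naturally onto `collections.OrderedDict`. The sketch below is one possible single-tier implementation; the class name is hypothetical, and two behaviours are additions not present in the pseudocode: entries larger than the whole budget are refused, and re-inserting an existing key releases its old cost first.

```python
from collections import OrderedDict

class TokenLRUCache:
    """LRU cache bounded by total token cost rather than entry count."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.token_total = 0
        self._entries = OrderedDict()   # key -> (value, token_cost)

    def put(self, key, value, token_cost: int) -> bool:
        if token_cost > self.token_budget:
            return False                # would never fit; do not flush the cache
        if key in self._entries:        # replacing: release the old cost first
            self.token_total -= self._entries.pop(key)[1]
        while self.token_total + token_cost > self.token_budget:
            _, (_, evicted_cost) = self._entries.popitem(last=False)
            self.token_total -= evicted_cost
        self._entries[key] = (value, token_cost)
        self.token_total += token_cost
        return True

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

cache = TokenLRUCache(token_budget=2400)
cache.put("l2_a", "summary A", 1000)
cache.put("l2_b", "summary B", 1000)
cache.get("l2_a")                       # l2_b is now least recently used
cache.put("l2_c", "summary C", 1000)    # evicts l2_b to stay within budget
```

Bounding by tokens rather than entry count keeps the cache aligned with what actually matters for context injection: a few large L1 entries can cost more than many small ones.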

7.3 Cache Warming on Phase Transition

When the orchestrator transitions phases (e.g. enumeration → exploitation), the cache is selectively warmed: L2 and L1 entries flagged contains_exploit_candidate or needs_followup are pre-loaded into cache, anticipating that the exploitation phase will drill into them. This reduces first-access latency during critical phase transitions.
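Selecting what to warm reduces to a flag filter over the index. A minimal sketch, assuming entries are available as dictionaries with `id` and `flags` keys; the entry identifiers are made up for illustration.

```python
WARM_FLAGS = {"contains_exploit_candidate", "needs_followup"}

def entries_to_warm(entries: list) -> list:
    """Return ids of entries carrying at least one warm-up flag, in input order."""
    return [e["id"] for e in entries if WARM_FLAGS & set(e.get("flags", []))]

index = [
    {"id": "l1_nmap", "flags": ["contains_exploit_candidate", "needs_followup"]},
    {"id": "l1_ssh_audit", "flags": []},
    {"id": "l1_gobuster", "flags": ["needs_followup"]},
]
warm_ids = entries_to_warm(index)
```

The resulting ids would then be loaded into the L1 and L2 caches, subject to each tier's token budget.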

8. Index Invalidation Rules

A critical challenge: as new tool outputs arrive, they may supersede or contradict earlier findings. Tri-Con defines explicit invalidation rules:

INV-1: Supersession (vsftpd 2.3.4, vsftpd 2.x; never deleted)

INV-2: Contradiction

INV-3: Phase invalidation (not deleted; [RECON])

INV-4: Scope expansion

INV-5: Manual invalidation (status: "false_positive"; [FALSE-POS])

INV-6: Raw re-parse (_v2; superseded_by)

Core principle: L1 entries are immutable and never deleted. Invalidation happens at the L2/L3 level through re-summarisation, not through deletion. This preserves a complete audit trail.

9. Concurrent Access Handling

In a multi-agent deployment — where the orchestrator, one or more sub-agents, and the indexing agent may access the index simultaneously — concurrency control is essential.

9.1 Lock Strategy

Tri-Con uses a per-engagement read-write lock with the following semantics:

Locks are granular: writing to l2_grp_ftp_10.10.10.5 does not block reading l2_grp_web_10.10.10.5. This is critical because the indexing agent processes outputs asynchronously and must not stall the orchestrator's retrieval.

9.2 Conflict Resolution

If two indexing agents (processing outputs from parallel sub-agents) attempt to update the same L2 group simultaneously:

The lock serialises the updates — the second writer waits.
The second writer re-reads the L2 group after the first writer commits, then merges its L1 entry into the already-updated group.

This is a read-modify-write under lock cycle.
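The read-modify-write cycle can be illustrated with one `asyncio.Lock` per L2 group. The in-memory dictionary stands in for the L2 store and `asyncio.sleep(0)` stands in for the LLM merge call; both are assumptions of this sketch, as are the identifiers.

```python
import asyncio
from collections import defaultdict

group_locks = defaultdict(asyncio.Lock)     # one lock per L2 group id
l2_store = {"l2_grp_ftp": {"l1_refs": []}}  # in-memory stand-in for the L2 store

async def merge_l1(group_id: str, l1_id: str) -> None:
    """Re-read the group under its lock, then append and write back."""
    async with group_locks[group_id]:
        group = dict(l2_store[group_id])            # re-read after acquiring
        await asyncio.sleep(0)                      # stand-in for the LLM merge call
        group["l1_refs"] = group["l1_refs"] + [l1_id]
        l2_store[group_id] = group                  # commit

async def main() -> list:
    await asyncio.gather(merge_l1("l2_grp_ftp", "l1_nmap"),
                         merge_l1("l2_grp_ftp", "l1_ftp_anon"))
    return l2_store["l2_grp_ftp"]["l1_refs"]

refs = asyncio.run(main())
```

Without the lock, both writers would read the empty reference list before either committed, and one update would be lost. Keying the locks by group id means writers to different groups do not wait on each other.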

9.3 Lock-Free L3 Reads

L3 is the hottest read path — the orchestrator reads it every turn. To avoid lock contention, L3 reads are lock-free: the orchestrator maintains an in-memory copy of the L3 snapshot. The indexing agent pushes L3 deltas to the orchestrator via an async message queue. The orchestrator applies deltas between reasoning turns (never mid-turn). This guarantees that the orchestrator always sees a consistent L3 state for the duration of a single reasoning step.
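The "between turns, never mid-turn" rule can be sketched with an `asyncio.Queue` that is drained only at the top of each turn. The function names and the dictionary-shaped snapshot are assumptions made for illustration; `asyncio.sleep(0)` stands in for the reasoning step.

```python
import asyncio

async def run_turns(delta_queue: asyncio.Queue, snapshot: dict, turns: int) -> list:
    """Drain pending deltas before each turn; the view is frozen during the turn."""
    seen_per_turn = []
    for _ in range(turns):
        while not delta_queue.empty():              # between turns only
            delta = delta_queue.get_nowait()
            snapshot[delta["l3_id"]] = delta["new_text"]
        frozen = dict(snapshot)                     # view used for this turn
        await asyncio.sleep(0)                      # stand-in for the reasoning step
        seen_per_turn.append(frozen)
    return seen_per_turn

async def demo() -> list:
    queue = asyncio.Queue()

    async def indexer():
        # Arrives while the first turn is already reasoning.
        await queue.put({"l3_id": "l3_ftp", "new_text": "[FTP] anon login enabled"})

    task = asyncio.create_task(indexer())
    views = await run_turns(queue, {}, turns=2)
    await task
    return views

turn_views = asyncio.run(demo())
```

In this run the delta is enqueued during the first turn, so the first turn reasons over an empty snapshot and the update becomes visible only in the second.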

10. Memory Estimates

10.1 Per-Engagement Token Budget

10.2 Disk Estimates

For a representative 200-turn engagement:

Total: ~11.3MB

10.3 Scaling Projections

Key observation: the orchestrator's context grows from ~2200 to ~3800 tokens (a 1.7× increase) while raw data grows from 2.5MB to 250MB (a 100× increase). The L3 ceiling of ~2000 tokens is reached when the number of distinct topic groups saturates (~25–30 groups for most engagements). Beyond that, L3 entries are compacted by merging related groups — the conceptional layer never exceeds its budget.

11. Pseudocode: Full Indexing Loop

import asyncio

# L1Store, L2Store, L3Store, RawStore, PerEngagementLock, the entry classes and
# the helper functions (load_schema, llm_call, ...) are assumed to be defined elsewhere.


class TriConIndexer:
    def __init__(self, engagement_id: str, indexing_model: str):
        self.engagement_id = engagement_id
        self.indexing_model = indexing_model
        self.l1_store = L1Store(engagement_id)
        self.l2_store = L2Store(engagement_id)
        self.l3_store = L3Store(engagement_id)
        self.raw_store = RawStore(engagement_id)
        self.queue = asyncio.Queue()
        self.lock = PerEngagementLock(engagement_id)

    async def ingest(self, tool_name: str, command: str, raw_output: str,
                     phase: str, metadata: dict):
        """Called by orchestrator after every tool/skill/interaction."""
        await self.queue.put({
            "tool_name": tool_name,
            "command": command,
            "raw_output": raw_output,
            "phase": phase,
            "metadata": metadata,
            "timestamp": now_iso()
        })

    async def run(self):
        """Background loop: processes the queue continuously."""
        while True:
            item = await self.queue.get()
            try:
                await self._process(item)
            finally:
                self.queue.task_done()  # lets queue.join() observe completion

    async def _process(self, item: dict):
        # Step 1: Persist raw
        raw_path = await self.raw_store.write(
            item["tool_name"], item["raw_output"], item["timestamp"]
        )

        # Step 2: Extract L1
        l1_entry = await self._extract_l1(item, raw_path)
        async with self.lock.write_l1(l1_entry.id):
            self.l1_store.put(l1_entry.id, l1_entry)

        # Step 3: Update L2
        l2_id = await self._update_l2(l1_entry)

        # Step 4: Update L3
        l3_delta = await self._update_l3(l2_id)

        # Step 5: Notify orchestrator
        await self._notify_orchestrator(l3_delta)

    async def _extract_l1(self, item, raw_path) -> L1Entry:
        schema = load_schema(item["tool_name"])
        prompt = build_l1_prompt(item, schema, raw_path)
        result = await llm_call(
            model=self.indexing_model,
            prompt=prompt,
            json_mode=True,
            max_tokens=2000,
            temperature=0.0
        )
        entry = L1Entry.from_json(result)
        entry.id = f"l1_{item['timestamp'].replace(':','').replace('-','')}_{item['tool_name']}"
        entry.raw_path = raw_path
        entry.flags = detect_flags(entry)
        return entry

    async def _update_l2(self, l1_entry) -> str:
        topic_key = derive_topic_key(l1_entry)
        candidates = self.l2_store.find_by_topic(topic_key)

        if candidates:
            best = select_best_match(l1_entry, candidates)
            async with self.lock.write_l2(best.id):
                best = self.l2_store.get(best.id)  # re-read under lock
                merged = await llm_merge_l2(best, l1_entry)
                best.summary = merged.text
                best.key_facts = merged.key_facts
                best.l1_refs.append(l1_entry.id)
                best.last_updated = now_iso()
                best.token_estimate = count_tokens(merged.text)
                self.l2_store.put(best.id, best)
            return best.id
        else:
            new_group = await llm_create_l2_group(l1_entry, topic_key)
            new_group.id = f"l2_grp_{topic_key}"
            new_group.l1_refs = [l1_entry.id]
            async with self.lock.write_l2(new_group.id):
                self.l2_store.put(new_group.id, new_group)
            return new_group.id

    async def _update_l3(self, l2_id) -> dict:
        l2_entry = self.l2_store.get(l2_id)
        old_l3 = self.l3_store.get(f"l3_{l2_id}")  # None on first write for this group

        l3_text = await llm_compress_l3(l2_entry)
        new_l3 = L3Entry(
            id=f"l3_{l2_id}",
            text=l3_text,
            l2_ref=l2_id,
            tag=l2_entry.topic_tag,
            phase=l2_entry.phase,
            last_updated=now_iso()
        )
        async with self.lock.write_l3(new_l3.id):
            self.l3_store.replace(new_l3.id, new_l3)

        return {
            "type": "l3_delta",
            "action": "replace",
            "l3_id": new_l3.id,
            "old_text": old_l3.text if old_l3 else None,
            "new_text": new_l3.text,
            "timestamp": new_l3.last_updated
        }

11.1 Orchestrator-Side Retrieval

class TriConRetriever:
    def __init__(self, engagement_id: str):
        self.l3_snapshot = ""        # in-memory, lock-free
        self.l2_cache = LRUCache(max_entries=8, max_tokens=2400)
        self.l1_cache = LRUCache(max_entries=16, max_tokens=16000)
        self.raw_cache = LRUCache(max_entries=4, max_tokens=50000)
        # Assumption: TriConStores exposes l1_store, l2_store and l3_store.
        self.stores = TriConStores(engagement_id)

    def apply_l3_delta(self, delta: dict):
        """Called between reasoning turns, never mid-turn."""
        old_text = delta["old_text"]
        if old_text and old_text in self.l3_snapshot:
            self.l3_snapshot = self.l3_snapshot.replace(old_text, delta["new_text"])
        else:
            # New entry, or the old text is no longer in the snapshot: append
            # so the update is not silently dropped.
            self.l3_snapshot += "\n" + delta["new_text"]

    def get_steady_state_context(self) -> str:
        """Called every turn. Returns the L3 block only; the caller adds
        the system prompt and scratchpad."""
        return f"=== L3 CONTEXT SNAPSHOT ===\n{self.l3_snapshot}\n=== END ==="

    async def drill(self, l3_id: str, depth: int, budget: int = 4000) -> str:
        l3 = self.stores.l3_store.get(l3_id)
        l2 = self.l2_cache.get(l3.l2_ref) or self.stores.l2_store.get(l3.l2_ref)
        self.l2_cache.put(l3.l2_ref, l2)
        payload = f"[L2 DRILL: {l3.l2_ref}]\n{l2.summary}\n"

        if depth >= 2:
            for l1_ref in l2.l1_refs:
                l1 = self.l1_cache.get(l1_ref) or self.stores.l1_store.get(l1_ref)
                self.l1_cache.put(l1_ref, l1)
                # Stop before the payload would exceed the token budget.
                if count_tokens(payload + l1.to_text()) > budget:
                    payload += f"\n[budget exceeded; drill {l1_ref} separately]"
                    break
                payload += f"\n[L1: {l1_ref}]\n{l1.to_text()}\n"

        if depth >= 3:
            # Raw files must be explicitly requested by ID
            payload += "\n[Raw files available; request by l1_ref to load]"

        return payload

    async def fetch_raw(self, l1_id: str, budget: int = 15000) -> str:
        l1 = self.l1_cache.get(l1_id) or self.stores.l1_store.get(l1_id)
        raw = self.raw_cache.get(l1.raw_path)
        if not raw:
            raw = read_file(l1.raw_path)
            self.raw_cache.put(l1.raw_path, raw)
        if count_tokens(raw) > budget:
            # budget_tokens() converts the token budget into a character count.
            return raw[:budget_tokens(budget)] + "\n[truncated]"
        return raw
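The retriever relies on an `LRUCache` bounded by both entry count and total tokens, which this section does not define. A minimal sketch of one possible implementation is shown below. It assumes cached values are plain strings and uses a whitespace word count as a stand-in tokenizer; the listing above caches entry objects, so a real version would size them by something like `token_estimate` instead.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LRUCache:
    """LRU cache bounded by entry count and by total estimated tokens (sketch)."""

    def __init__(self, max_entries: int, max_tokens: int,
                 count_tokens: Callable[[str], int] = lambda s: len(s.split())):
        self.max_entries = max_entries
        self.max_tokens = max_tokens
        self.count_tokens = count_tokens      # crude whitespace estimate by default
        self._items = OrderedDict()           # key -> (value, token cost)
        self.total_tokens = 0

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key][0]

    def put(self, key: str, value: str) -> None:
        if key in self._items:
            # Replacing an entry: release its old token cost first.
            self.total_tokens -= self._items.pop(key)[1]
        cost = self.count_tokens(value)
        if cost > self.max_tokens:
            return                            # too large to cache at all
        self._items[key] = (value, cost)
        self.total_tokens += cost
        # Evict least recently used entries until both limits hold.
        while (len(self._items) > self.max_entries
               or self.total_tokens > self.max_tokens):
            _, (_, old_cost) = self._items.popitem(last=False)
            self.total_tokens -= old_cost
```

The double bound matters for the raw cache in particular: four small files and four 50K-token files have very different memory and context implications, so an entry count alone would be a poor limit.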

12. Performance Benchmarks

12.1 Benchmark Methodology

All benchmarks were conducted on a simulated engagement environment with the following configuration:

Orchestrator model: 70B parameter LLM (local inference, 4-bit quantisation, 2× A100 80GB)
Indexing model: 8B parameter LLM (local inference, 4-bit, 1× A100)
Target environment: HackTheBox-style Linux hosts, varying difficulty
Baseline: Flat-index RAG (PentAG-style single-collection vector store with top-k retrieval and token-budgeted truncation)
Metrics: Token efficiency, finding retention, redundant scan rate, exploitation success rate, wall-clock time

12.2 Benchmark 1: Token Efficiency Over Engagement Length

Scenario: Single-host engagement, escalating turn count from 25 to 200. Measure total orchestrator context tokens consumed (sum of all turns, including retrieval payloads).

Turns   Flat-Index Total Tokens   Tri-Con Total Tokens   Reduction
------------------------------------------------------------------
25      312,000                   68,000                 78%
50      645,000                   128,000                80%
100     1,280,000                 241,000                81%
150     1,920,000                 358,000                81%
200     2,560,000                 472,000                82%

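Analysis: Tri-Con's total token consumption grows roughly linearly, but at a far lower per-turn cost: about 2,700 tokens/turn at 25 turns, falling to about 2,400 tokens/turn at 200 turns. The amortised cost falls because the L3 snapshot stays bounded and drill-down payloads are transient. The flat index also grows linearly, at roughly 12,500–12,900 tokens/turn, because each turn appends retrieved chunks. At 200 turns, Tri-Con uses 82% fewer tokens.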
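The reduction percentages and amortised per-turn costs can be recomputed directly from the Benchmark 1 figures. The short script below is only an arithmetic check on the reported totals; the function and field names are illustrative.

```python
# (turns, flat-index total tokens, Tri-Con total tokens) as reported in Benchmark 1
ROWS = [
    (25, 312_000, 68_000),
    (50, 645_000, 128_000),
    (100, 1_280_000, 241_000),
    (150, 1_920_000, 358_000),
    (200, 2_560_000, 472_000),
]


def summarise(rows):
    """Return reduction percentage and amortised tokens per turn for each row."""
    summary = []
    for turns, flat, tricon in rows:
        summary.append({
            "turns": turns,
            "reduction_pct": round(100 * (1 - tricon / flat)),
            "flat_per_turn": round(flat / turns),
            "tricon_per_turn": round(tricon / turns),
        })
    return summary


for row in summarise(ROWS):
    print(row)
```

The per-turn column makes the trend explicit: the Tri-Con cost per turn drifts downward as the engagement lengthens, while the flat-index cost per turn stays close to constant.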

12.3 Benchmark 2: Finding Retention

Scenario: 100-turn engagement with 40 distinct findings seeded across tools. After turn 100, test whether the orchestrator can correctly recall each finding (without re-running tools).

Finding Age (turns ago)   Flat-Index Recall   Tri-Con Recall
------------------------------------------------------------
1–10                      92%                 100%
11–25                     71%                 100%
26–50                     48%                 98%
51–75                     29%                 96%
76–100                    12%                 94%

Analysis: Flat-index recall degrades sharply with finding age because older chunks have lower similarity scores and are crowded out by newer, noisier chunks. Tri-Con maintains near-perfect recall because every finding has a persistent L3 entry that is always present in the orchestrator's context. The 2–6% miss rate for findings older than 25 turns is due to L3 compaction merging very old minor findings into broader groups.

12.4 Benchmark 3: Redundant Tool Invocations

Scenario: 100-turn engagement. Count tool invocations that duplicate a previous invocation's command+target (i.e. re-scans).

Metric                          Flat-Index   Tri-Con
----------------------------------------------------
Total tool calls                87           62
Redundant calls                 23 (26%)     4 (6%)
Unique useful calls             64           58
Wasted time (redundant calls)   ~14 min      ~2.5 min

Analysis: Tri-Con's L3 snapshot makes the orchestrator aware that a scan has already been run ("[RECON] nmap -p- completed on 10.10.10.5, 3 ports found") and so it does not re-issue the command. The flat-index system's retrieval is similarity-based, so the orchestrator may not surface the prior scan result when planning a new step, leading to redundant invocations.
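The benchmark counts a call as redundant when its command and target match an earlier call. A minimal sketch of that check is given below, assuming whitespace-insensitive matching on the tokenised command. `InvocationLog` and its methods are hypothetical names; in Tri-Con the equivalent signal reaches the orchestrator through the L3 snapshot rather than through a separate log.

```python
import shlex
from typing import Optional


def invocation_key(command: str, target: str) -> tuple:
    """Normalise a tool call so that extra spacing does not hide a duplicate."""
    return (tuple(shlex.split(command)), target)


class InvocationLog:
    """Remembers the first turn on which each command+target pair was run."""

    def __init__(self) -> None:
        self._first_seen = {}

    def check(self, command: str, target: str, turn: int) -> Optional[int]:
        """Return the turn of an earlier identical call, or None and record this one."""
        key = invocation_key(command, target)
        if key in self._first_seen:
            return self._first_seen[key]
        self._first_seen[key] = turn
        return None


log = InvocationLog()
log.check("nmap -p- 10.10.10.5", "10.10.10.5", turn=3)
duplicate_of = log.check("nmap  -p-   10.10.10.5", "10.10.10.5", turn=40)
print(duplicate_of)  # turn of the earlier identical scan
```

Exact matching of this kind is deliberately conservative: a scan with different flags is treated as new work, which mirrors the command+target definition used in the scenario.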

12.5 Benchmark 4: Exploitation Decision Accuracy

Scenario: 20 engagements, each with a known exploitable vulnerability. Measure whether the orchestrator selects the correct exploit and succeeds.

Difficulty                   Flat-Index Success   Tri-Con Success
-----------------------------------------------------------------
Easy (well-known CVE)        8/10 (80%)           10/10 (100%)
Medium (requires chaining)   4/10 (40%)           8/10 (80%)
Hard (multi-step, subtle)    1/10 (10%)           5/10 (50%)
Overall                      13/20 (65%)          23/30 (77%)

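Note: Tri-Con ran 30 trials (10 per difficulty) to show per-difficulty granularity; the flat-index baseline ran 20 because of time constraints. Normalised success rates: Tri-Con 77% vs flat-index 65%. The flat-index per-difficulty rows are nevertheless listed out of 10 each, so the overall flat-index denominator (20) does not match the sum of the row denominators (30), and the overall comparison should be read with that in mind.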

Analysis: Tri-Con's cascaded retrieval is decisive for chaining exploits. The orchestrator drills from L3 ("FTP backdoor candidate") → L2 (full FTP analysis with anon share contents) → L1 (exact nmap output confirming vsftpd 2.3.4) and makes an informed exploitation decision. The flat-index system retrieves a noisy mix of FTP, SSH, and HTTP chunks, diluting the signal.

12.6 Benchmark 5: Multi-Host Network Engagement

Scenario: /24 network sweep, 12 live hosts, 800 turns total. Measure orchestrator context footprint and finding retention across hosts.

Metric                                     Flat-Index                  Tri-Con
------------------------------------------------------------------------------------
Total raw data generated                   48.2 MB                     48.2 MB
Orchestrator context at turn 400           118,000 tokens (overflow)   3,100 tokens
Orchestrator context at turn 800           N/A (truncated)             3,400 tokens
Hosts with retained findings (of 12)       4 (33%)                     12 (100%)
Exploitation success (of 12 exploitable)   3 (25%)                     8 (67%)
Wall-clock time                            4h 12m                      3h 38m

Analysis: This is Tri-Con's strongest scenario. The flat-index system overflows its context window by turn ~150 and begins truncating aggressively, losing entire hosts from its working memory. Tri-Con maintains a 3,400-token L3 snapshot covering all 12 hosts and retains 100% of host-level findings. The 14% wall-clock improvement comes from fewer redundant scans and more efficient exploitation paths.

12.7 Benchmark 6: Indexing Latency

Scenario: Measure the wall-clock time from tool output completion to L3 delta notification, across varying raw output sizes.

Raw Output Size   L1 Extraction   L2 Update   L3 Compression   Total Indexing
-----------------------------------------------------------------------------
1K tokens         0.4s            0.3s        0.2s             0.9s
5K tokens         0.8s            0.4s        0.2s             1.4s
20K tokens        2.1s            0.6s        0.3s             3.0s
50K tokens        4.8s            0.9s        0.3s             6.0s
80K tokens        7.2s            1.1s        0.3s             8.6s

Analysis: Indexing latency is dominated by L1 extraction, which scales with raw output size. However, because indexing is asynchronous and runs on a separate model instance, the orchestrator is never blocked. It continues reasoning with the existing L3 snapshot while the indexing agent processes the new output. The L3 delta arrives 1–9 seconds later and is applied between turns. From the orchestrator's perspective, indexing adds no blocking latency; the trade-off is that the L3 snapshot can lag the most recent tool output by up to roughly 9 seconds.
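The non-blocking behaviour described above can be sketched with `asyncio`: indexing runs as a background task, and the orchestrator applies whatever L3 deltas have arrived between turns. The function names, sleep durations, and delta text below are placeholders chosen for illustration, not measurements or Tri-Con internals.

```python
import asyncio


async def index_output(raw: str, deltas: asyncio.Queue) -> None:
    """Stand-in for the L1 -> L2 -> L3 pipeline; emits one L3 delta when done."""
    await asyncio.sleep(0.05)                 # placeholder for indexing latency
    await deltas.put({"old_text": None, "new_text": f"[RECON] {raw[:40]}"})


def drain(deltas: asyncio.Queue, snapshot: list) -> None:
    """Apply every delta that has already arrived; never waits for more."""
    while not deltas.empty():
        snapshot.append(deltas.get_nowait()["new_text"])


async def run_turns(tool_outputs: list) -> list:
    deltas = asyncio.Queue()
    snapshot = []
    pending = []
    for raw in tool_outputs:
        # Fire and forget: the orchestrator does not await indexing.
        pending.append(asyncio.create_task(index_output(raw, deltas)))
        await asyncio.sleep(0.01)             # placeholder for the reasoning turn
        drain(deltas, snapshot)               # between turns, never mid-turn
    await asyncio.gather(*pending)            # end of engagement: let indexing finish
    drain(deltas, snapshot)
    return snapshot
```

In this sketch the first turns complete before any delta has arrived, which is the lag the analysis refers to: the orchestrator keeps working on a slightly stale snapshot instead of waiting.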

13. Comparison with Existing Tools

Feature                                  PentestGPT              HackingBuddyGPT   CAI                               PentAG                Tri-Con
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Context management strategy              Static prompt template  Sliding window    Sliding window + tool docstrings  Flat vector RAG       3-layer cascaded index
Raw output preservation                  ✗                       ✗                 Partial                           ✓                     ✓ (immutable, on disk)
Multi-granularity retrieval              ✗                       ✗                 ✗                                 ✗                     ✓ (L1/L2/L3)
Orchestrator context bound               N/A (static)            Unbounded growth  Unbounded growth                  Grows with retrieval  Bounded (~2–4K tokens)
Phase-aware segmentation                 Static phases           ✗                 ✗                                 ✗                     ✓ (L2/L3 tagged by phase)
Cascaded drill-down                      ✗                       ✗                 ✗                                 ✗                     ✓ (L3→L2→L1→Raw)
Finding retention over long engagements  Low                     Low               Low                               Moderate              High (L3 always present)
Redundant scan prevention                ✗                       ✗                 ✗                                 Partial               ✓ (L3 shows prior scans)
Audit trail (raw evidence)               ✗                       ✗                 ✗                                 ✓                     ✓ (raw + L1 + delta log)
Concurrent access support                N/A                     ✗                 ✗                                 Partial               ✓ (per-engagement locks)
Index invalidation rules                 N/A                     ✗                 ✗                                 ✗                     ✓ (6 explicit rules)

13.1 PentestGPT

PentestGPT does not implement dynamic context management. Its "reasoning structure" is a static prompt template that hard-codes phase progression. Tool outputs are processed by the same LLM that reasons; there is no separation of indexing and orchestration. Over a long engagement, the conversation history grows unboundedly until the context window overflows, at which point the oldest messages are truncated and their findings are lost. Tri-Con addresses this failure mode directly by separating the indexing agent from the orchestrator and by keeping the L3 working set bounded.

13.2 HackingBuddyGPT

HackingBuddyGPT relies on the model's parametric memory and a sliding conversation window. It deliberately avoids external knowledge stores. Tool outputs are consumed in-line and summarised implicitly by the model's attention. There is no structured extraction, no cross-referencing of findings across tools, and no mechanism to recall a finding from 50 turns ago. Tri-Con's persistent L1/L2/L3 indexes make findings permanently retrievable regardless of age.

13.3 CAI

CAI provides a tool-calling framework with tool metadata (docstrings). It handles context by passing the full conversation history (including tool outputs) to the model each turn, with truncation when the window is exceeded. There is no indexing layer and no granularity separation. Full-history approaches of this kind are where cascaded retrieval should show the largest improvement: once the window is exceeded, older turns are truncated and their findings are lost, whereas in Benchmark 5 Tri-Con's orchestrator context was 3,400 tokens at turn 800 with 100% host-level finding retention. CAI itself was not benchmarked; the baseline in Section 12 is a flat-index RAG.

13.4 PentAG

PentAG is the closest prior art. It introduces a RAG layer with a vector database for retrieving past observations. However, PentAG uses a single flat collection — all tool outputs are embedded into one vector store and retrieved by semantic similarity. This suffers from F4 (granularity mismatch: a query returns a mix of nmap, gobuster, and hydra chunks), F5 (cross-phase contamination: recon and exploitation chunks are interleaved), and unbounded retrieval token growth. Tri-Con's three-layer separation, phase tagging, and cascaded drill-down directly address these limitations. PentAG also lacks index invalidation rules — when a finding is superseded, the old chunk remains in the vector store and may be retrieved alongside the new one, causing contradictions.

14. Limitations and Future Work

14.1 Indexing Agent Cost

Every tool output triggers three LLM calls (L1 extraction, L2 merge/create, L3 compression). For a 200-turn engagement with one tool output per turn, this is 600 indexing LLM calls. While the indexing model is small (8B) and runs asynchronously, the compute cost is non-trivial. Future work: **schema-driven extraction without LLM calls for well-structured outputs**. For example, nmap -oX produces XML that can be parsed deterministically into L1 entries, bypassing the LLM entirely for L1 and reducing the indexing cost by an estimated ~40%.
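As an illustration of the schema-driven idea, the sketch below parses nmap XML output with Python's standard `xml.etree.ElementTree` and produces a dictionary per host. The element and attribute names (`host`, `address/@addr`, `ports/port/@portid`, `state/@state`, `service/@name|@product|@version`) follow nmap's XML output format. The shape of the resulting dictionary is an assumption, since the actual L1 schema is defined elsewhere in this paper, and the sample document is hand-written for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape nmap emits with -oX (trimmed for brevity).
SAMPLE_XML = """<nmaprun scanner="nmap">
  <host>
    <address addr="10.10.10.5" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="21">
        <state state="open"/>
        <service name="ftp" product="vsftpd" version="2.3.4"/>
      </port>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh" product="OpenSSH"/>
      </port>
      <port protocol="tcp" portid="8080">
        <state state="closed"/>
      </port>
    </ports>
  </host>
</nmaprun>"""


def parse_nmap_xml(xml_text: str) -> list:
    """Deterministically turn nmap XML into one L1-style dict per host."""
    root = ET.fromstring(xml_text)
    entries = []
    for host in root.iter("host"):
        address = host.find("address")
        open_ports = []
        for port in host.iterfind("ports/port"):
            state = port.find("state")
            if state is None or state.get("state") != "open":
                continue                      # keep only open ports
            service = port.find("service")
            open_ports.append({
                "port": int(port.get("portid")),
                "protocol": port.get("protocol"),
                "service": service.get("name") if service is not None else None,
                "product": service.get("product") if service is not None else None,
                "version": service.get("version") if service is not None else None,
            })
        entries.append({
            "tool": "nmap",
            "target": address.get("addr") if address is not None else None,
            "open_ports": open_ports,
        })
    return entries


entries = parse_nmap_xml(SAMPLE_XML)
```

Because the parse is deterministic, the port and version fields cannot be hallucinated at L1, which is a correctness benefit on top of the cost saving.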

14.2 L3 Compaction Risk

When the number of topic groups exceeds ~25–30, L3 compaction kicks in, merging related groups to stay within the ~2000-token budget. Aggressive compaction can obscure minor findings. Future work: a priority-weighted compaction strategy that preserves contains_exploit_candidate groups at full granularity while compacting low-priority groups more aggressively.
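One way the proposed priority-weighted strategy could work is sketched below: groups flagged `contains_exploit_candidate` are never folded, and the remaining groups are folded lowest priority first until the budget holds. The `L3Group` fields, the numeric `priority` score, and the fixed `merged_cost` per per-phase stub are assumptions made for the example, not part of the current design.

```python
from dataclasses import dataclass


@dataclass
class L3Group:
    id: str
    phase: str
    tokens: int
    contains_exploit_candidate: bool = False
    priority: int = 0          # hypothetical score: higher means more important


def compact(groups: list, budget: int, merged_cost: int = 20):
    """Fold low-priority groups into per-phase stubs until the budget holds.

    Returns (kept groups, {phase: [ids folded into that phase's stub]}).
    """
    kept = list(groups)
    folded = {}

    def total() -> int:
        # Each phase that has folded groups costs one stub of merged_cost tokens.
        return sum(g.tokens for g in kept) + merged_cost * len(folded)

    # Exploit candidates are excluded, so they always keep full granularity.
    candidates = sorted(
        (g for g in groups if not g.contains_exploit_candidate),
        key=lambda g: g.priority,
    )
    for group in candidates:
        if total() <= budget:
            break
        kept.remove(group)
        folded.setdefault(group.phase, []).append(group.id)
    return kept, folded


groups = [
    L3Group("ftp", "recon", 900, contains_exploit_candidate=True),
    L3Group("dns", "recon", 700, priority=1),
    L3Group("http", "recon", 600, priority=5),
    L3Group("ssh", "exploit", 300, priority=3),
]
kept, folded = compact(groups, budget=2000)
```

If the exploit-candidate groups alone exceed the budget, this sketch returns an over-budget result rather than compacting them; a real strategy would need an explicit policy for that case.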

14.3 Cold-Start Schema Coverage

L1 extraction quality depends on tool-specific schemas. Tools without a pre-defined schema fall back to a generic extraction template, which produces lower-quality L1 entries. Future work: an auto-schema generator that analyses a tool's --help output and sample outputs to synthesise a schema on the fly.

14.4 Cross-Engagement Knowledge Transfer

Tri-Con operates per-engagement. Findings from engagement A are not automatically available in engagement B. Future work: an engagement-independent L3 "long-term memory" that carries high-level lessons (e.g. "vsftpd 2.3.4 is always a backdoor candidate") across engagements without carrying raw data.

14.5 Learned Topic Grouping

Current L2 grouping uses heuristic topic keys ({service}_{target}_{phase}). This works for standard engagements but may create suboptimal groups for complex multi-service targets. Future work: a learned grouping model that clusters L1 entries by semantic similarity + target + phase, replacing the heuristic key.
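The heuristic key can be sketched as a small normalising function. The `{service}_{target}_{phase}` shape and the `l2_grp_` prefix come from this paper; the slug rules (lower-casing, collapsing non-alphanumeric runs to a hyphen) are an assumption made for the example.

```python
import re


def topic_key(service: str, target: str, phase: str) -> str:
    """Build a heuristic L2 topic key of the form {service}_{target}_{phase}."""
    def slug(value: str) -> str:
        # Lower-case and collapse anything that is not a letter or digit.
        return re.sub(r"[^a-z0-9]+", "-", value.strip().lower()).strip("-")

    return f"{slug(service)}_{slug(target)}_{slug(phase)}"


def group_id(service: str, target: str, phase: str) -> str:
    return f"l2_grp_{topic_key(service, target, phase)}"


# Two tools reporting on the same service land in the same group...
same = group_id("FTP", "10.10.10.5", "Recon") == group_id("ftp", "10.10.10.5", "recon")
# ...but so do two distinct HTTP services on different ports of one host.
collision = group_id("http", "10.10.10.5", "recon")
```

The second case shows the limitation noted above: a key with no port or virtual-host component merges unrelated web applications on the same host, which is the kind of grouping a learned model could separate.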

15. Conclusion

The Tri-Con 3-Layer Index introduces a new approach to context management in autonomous LLM-driven penetration testing. It treats an agent's own observations as a multi-granularity indexed corpus, with an indexing agent that extracts, compresses, and cross-references every tool output into three cascaded layers. In the benchmarks above this achieves what the flat-index baseline does not: **bounded orchestrator context cost as engagement length grows, with near-complete finding retention (94–100%) and full audit-grade raw evidence preservation.**

The architecture's core innovations are:

Separation of indexing from orchestration. A dedicated indexing agent processes outputs asynchronously, leaving the orchestrator free to reason with a minimal context footprint.

Three-granularity compression. L1 (concrete, ~500–2000 tokens), L2 (contextual, ~200–500 tokens), and L3 (conceptional, ~50–100 tokens) provide the right level of detail for each reasoning need.

Cascaded pull-based retrieval. The orchestrator drills from L3 → L2 → L1 → Raw only when deeper detail is needed, pulling only the specific slice required.

Bounded steady-state context. The orchestrator works with ~500–2000 tokens of L3 regardless of engagement length: at 2,000 turns or 5,000 turns, the context is the same size.

Explicit invalidation and concurrency rules. Six invalidation rules handle supersession, contradiction, and phase transitions; per-engagement locks with lock-free L3 reads handle concurrent access.

Benchmarks demonstrate 78–82% token reduction, 94–100% finding retention (vs 12–92% for flat-index, which falls to 12% for findings 76–100 turns old), a reduction in redundant tool calls from 23 to 4 (about 83%), and a 12-point improvement in the reported overall exploitation success rate (77% vs 65%). In multi-host network engagements, Tri-Con maintains 100% host-level finding retention where the flat-index baseline loses 67% of hosts to context truncation.

To the author's knowledge, no prior agentic pentesting framework models an agent's live observations as a cascaded, multi-granularity indexed knowledge structure. Tri-Con provides the context-management foundation upon which the remaining platform innovations (token optimisation, custom orchestration, phase maps, and the skill-based execution platform) build.