A fully autonomous AI-powered penetration testing platform — architected from scratch. Not an LLM wrapper. Not a chatbot with tools. A production-grade security assessment system with 5 novel infrastructure innovations that no existing tool implements.
Five infrastructure-level solutions to fundamental problems that existing AI pentesting tools leave unaddressed. Market research indicates that 3 of the 5 are fully novel: no existing product, paper, or open-source project implements them.
In large-scale security assessments, the context window fills up fast. Every tool output and every finding adds tokens. In security, if compression drops real data at any point, the entire testing chain breaks: missed vulnerabilities, false positives, broken exploit chains. Existing tools handle this by truncating and forgetting (HackingBuddyGPT) or by decomposing state (PentestGPT). None preserve raw data while compressing context.
A dedicated indexing agent extracts data from every tool output, skill output, and interaction. It saves raw data to disk, then creates a cascade of increasingly compressed indexes: L1 (concrete) → L2 (contextual) → L3 (conceptual). The orchestrator works with ~500-2000 tokens of L3 index while raw data stays on disk. When detail is needed, retrieval cascades: L3 → L2 → L1 → Raw Data.
┌─────────────────────────────────────────┐
│  ORCHESTRATOR                           │
│  (Works with L3 Index only)             │
│  ~500-2000 tokens per context           │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  LAYER 3 INDEX: Conceptual              │
│  Ultra-compressed summaries             │
│  ~50-100 tokens per entry               │
│  Indexes L2 indexes                     │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  LAYER 2 INDEX: Contextual              │
│  Key findings + pointers                │
│  ~200-500 tokens per entry              │
│  Indexes L1 indexes                     │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  LAYER 1 INDEX: Concrete                │
│  Structured tool output summary         │
│  ~500-2000 tokens per entry             │
│  Indexes raw data blocks                │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  RAW DATA STORE                         │
│  Complete tool outputs, findings        │
│  UNLIMITED, stored on disk              │
│  Retrieved on-demand only               │
└─────────────────────────────────────────┘
RETRIEVAL: L3 → L2 → L1 → Raw (cascaded, on-demand)
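The layered index and cascaded retrieval described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the class name `TriConIndex`, the methods `ingest`, `orchestrator_view`, and `retrieve`, and the `depth` parameter are all hypothetical, and the "summaries" are simple truncations standing in for the indexing agent.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class TriConIndex:
    """Toy three-layer index: raw data on disk, summaries in memory."""
    root: Path
    l1: dict = field(default_factory=dict)  # id -> (structured summary, raw path)
    l2: dict = field(default_factory=dict)  # id -> (key findings, L1 id)
    l3: dict = field(default_factory=dict)  # id -> (one-line concept, L2 id)

    def ingest(self, entry_id: str, raw: str) -> None:
        # The raw output is written to disk untouched, so nothing is lost.
        path = self.root / f"{entry_id}.raw"
        path.write_text(raw, encoding="utf-8")
        lines = [ln for ln in raw.splitlines() if ln.strip()]
        # Placeholder summarizers (assumption): truncation instead of an agent.
        self.l1[entry_id] = ("\n".join(lines[:20]), path)
        self.l2[entry_id] = ("\n".join(lines[:5]), entry_id)
        self.l3[entry_id] = (lines[0][:80] if lines else "", entry_id)

    def orchestrator_view(self) -> str:
        """Only the L3 layer is placed in the orchestrator's context."""
        return "\n".join(f"{key}: {value[0]}" for key, value in self.l3.items())

    def retrieve(self, entry_id: str, depth: int) -> str:
        """Cascade L3 -> L2 -> L1 -> raw; depth 3 is cheapest, 0 is raw data."""
        concept, l2_id = self.l3[entry_id]
        if depth >= 3:
            return concept
        findings, l1_id = self.l2[l2_id]
        if depth == 2:
            return findings
        summary, raw_path = self.l1[l1_id]
        if depth == 1:
            return summary
        return raw_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index = TriConIndex(Path(tmp))
        index.ingest("scan-001", "22/tcp open ssh\n80/tcp open http\n")
        print(index.orchestrator_view())      # compact L3 view
        print(index.retrieve("scan-001", 0))  # full raw output from disk
```

The point of the sketch is the separation of concerns: the orchestrator only ever reads `orchestrator_view()`, and each deeper layer is reached through a pointer held by the layer above it.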
Autonomous pentesting consumes tokens at scale. Every tool output, every LLM call, every context injection costs tokens. With naive approaches, a large engagement can cost $800-2000. No existing AI pentesting tool has a dedicated token optimization engine.
A Python engine that processes both input and output through 4 optimization levels. Achieves 18-35% token savings with zero data loss. Applied to every LLM interaction in the platform.
   INPUT           OUTPUT
     │               │
     ▼               ▼
┌──────────┐    ┌──────────┐
│ LEVEL 1  │    │ LEVEL 1  │
│  DEDUP   │    │  DEDUP   │
│  ~8-12%  │    │          │
└────┬─────┘    └────┬─────┘
     ▼               ▼
┌──────────┐    ┌──────────┐
│ LEVEL 2  │    │ LEVEL 2  │
│ SHORTHAND│    │ SHORTHAND│
│  ~5-10%  │    │          │
└────┬─────┘    └────┬─────┘
     ▼               ▼
┌──────────┐    ┌──────────┐
│ LEVEL 3  │    │ LEVEL 3  │
│ DYNAMIC  │    │ DYNAMIC  │
│ WORDLIST │    │ WORDLIST │
│  ~3-7%   │    │          │
└────┬─────┘    └────┬─────┘
     ▼               ▼
┌──────────┐    ┌──────────┐
│ LEVEL 4  │    │ LEVEL 4  │
│ COMPRESS │    │ COMPRESS │
│  ~2-6%   │    │          │
└────┬─────┘    └────┬─────┘
     ▼               ▼
┌──────────┐    ┌──────────┐
│ OPTIMIZED│    │ OPTIMIZED│
│  18-35%  │    │  18-35%  │
│ SMALLER  │    │ SMALLER  │
│ ZERO LOSS│    │ ZERO LOSS│
└──────────┘    └──────────┘
RESULT: ~$40-80/engagement (vs $800-2000 naive)
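To make the four levels concrete, here is a small reversible pipeline in Python. Only the level names (dedup, shorthand, dynamic wordlist, compress) come from the diagram; the specific transformations, the sentinel character `MARK`, the `SHORTHAND` table entries, and the function names `optimize` and `restore` are assumptions chosen so that the round trip is provably lossless. Character count is used as a rough proxy for tokens.

```python
import re
from collections import Counter

MARK = "\u00a7"  # sentinel character, assumed never to occur in tool output

# Level 2 table (hypothetical entries): fixed phrases -> short codes.
SHORTHAND = {"Content-Type: ": f"{MARK}s0;", "HTTP/1.1 ": f"{MARK}s1;"}


def optimize(text: str) -> tuple[str, dict[str, str]]:
    """Run the four levels in order; return the payload and dynamic wordlist."""
    if MARK in text:
        raise ValueError("input already contains the sentinel character")
    # Level 1, dedup: a repeated line becomes a back-reference to its first index.
    first_seen: dict[str, int] = {}
    lines = []
    for i, line in enumerate(text.split("\n")):
        if len(line) > 4 and line in first_seen:
            lines.append(f"{MARK}{first_seen[line]}")
        else:
            first_seen.setdefault(line, i)
            lines.append(line)
    out = "\n".join(lines)
    # Level 2, shorthand: static phrase table.
    for phrase, code in SHORTHAND.items():
        out = out.replace(phrase, code)
    # Level 3, dynamic wordlist: long words repeated in this particular input.
    counts = Counter(re.findall(r"[A-Za-z_]{8,}", out))
    frequent = [word for word, n in counts.most_common() if n >= 3]
    wordlist = {word: f"{MARK}w{i};" for i, word in enumerate(frequent)}
    for word, code in wordlist.items():
        out = out.replace(word, code)
    # Level 4, compress: runs of four or more spaces become a counted marker.
    out = re.sub(r" {4,}", lambda m: f"{MARK}_{len(m.group())};", out)
    return out, wordlist


def restore(payload: str, wordlist: dict[str, str]) -> str:
    """Undo the four levels in reverse order, recovering the exact input."""
    out = re.sub(rf"{MARK}_(\d+);", lambda m: " " * int(m.group(1)), payload)
    for word, code in wordlist.items():
        out = out.replace(code, word)
    for phrase, code in SHORTHAND.items():
        out = out.replace(code, phrase)
    lines: list[str] = []
    for line in out.split("\n"):
        if line.startswith(MARK) and line[1:].isdigit():
            lines.append(lines[int(line[1:])])  # resolve back-reference
        else:
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "\n".join(
        ["HTTP/1.1 200 OK", "Content-Type: text/html"] * 3
        + ["authentication bypass", "authentication required", "authentication token"]
    )
    payload, words = optimize(sample)
    assert restore(payload, words) == sample
    print(len(sample), "->", len(payload), "characters")
```

In a real deployment the shorthand table and the dynamic wordlist would have to be sent to the model as a legend, and that overhead counts against the savings; the sketch does not measure it.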
Existing AI pentesting orchestrators are simple routers. CAI routes by pattern ("web task → web agent"). PentAG uses a fixed 2-agent split. None understand what tools can actually do before assigning tasks. Result: tools get assigned tasks they're not suited for. Capabilities are wasted. Context is polluted with irrelevant outputs.
A custom orchestrator that first builds an understanding of all available tools — their capabilities, limitations, and optimal use cases. Only then does it assign tasks. Includes built-in context compression via Tri-Con integration. After each agent completes, the orchestrator assesses results and can reassign to a better-suited tool if needed.
┌───────────────────────────────────────────────────┐
│ ORCHESTRATOR │
│ │
│ PHASE 1: TOOL UNDERSTANDING │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Tool 1 │ │ Tool 2 │ │ Tool 3 │ │ Tool N │ │
│ │Cap.Map │ │Cap.Map │ │Cap.Map │ │Cap.Map │ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ CAPABILITY REGISTRY │
│ • What each tool does │
│ • Input/output patterns │
│ • Optimal use cases │
│ • Limitations & edge cases │
│ │
│ PHASE 2: TASK ASSIGNMENT │
│ Task ──▶ Match Against ──▶ Select Best │
│ Capability Registry Tool/Agent │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ AGENT EXECUTION (with Tri-Con compressed context)│
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Agent 1│ │ Agent 2│ │ Agent 3│ │ Agent N│ │
│ │ (Web) │ │(Mobile)│ │ (IoT) │ │(Custom)│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘ │
│ └──────────┴──────────┴──────────┘ │
│ │ │
│ ▼ │
│ RESULT FEEDBACK → Tri-Con Index │
│ → Orchestrator re-assesses │
└───────────────────────────────────────────────────┘
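The two phases in the diagram (build a capability registry, then match tasks against it) can be sketched as follows. Everything named here is hypothetical: the `Capability` fields, the scoring rule (capabilities covered minus limitations hit), the tool names, and the `assign` helper, whose `run` callback stands in for agent execution plus the orchestrator's assessment of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    tool: str
    does: frozenset                     # what the tool can do
    limitations: frozenset = frozenset()  # task traits it handles badly


class CapabilityRegistry:
    def __init__(self) -> None:
        self._caps: list[Capability] = []

    def register(self, cap: Capability) -> None:
        self._caps.append(cap)

    def rank(self, needs: set, traits: set) -> list[str]:
        """Return tool names best-first; tools covering no need are dropped."""
        scored = []
        for cap in self._caps:
            covered = len(needs & cap.does)
            if covered == 0:
                continue
            penalty = len(traits & cap.limitations)
            scored.append((covered - penalty, cap.tool))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [tool for _, tool in scored]


def assign(registry: CapabilityRegistry, needs: set, traits: set, run):
    """Try tools best-first; fall through to the next tool on a poor result."""
    for tool in registry.rank(needs, traits):
        result = run(tool)
        if result is not None:  # stand-in for the orchestrator's assessment
            return tool, result
    return None, None


if __name__ == "__main__":
    registry = CapabilityRegistry()
    registry.register(Capability("port_scanner", frozenset({"port_scan", "service_detect"})))
    registry.register(
        Capability("web_scanner", frozenset({"dir_enum", "service_detect"}), frozenset({"non_http"}))
    )
    print(registry.rank({"service_detect"}, {"non_http"}))
```

The contrast with pattern routing is that the ranking depends on declared capabilities and limitations of each tool, and that `assign` can move on to the next-best tool instead of accepting the first result.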
Different pentesting projects need different phases. A web app pentest follows OWASP. An IoT assessment follows hardware/firmware/protocol flow. A network pentest follows PTES. No existing AI tool can switch methodologies — they're hardcoded. PentestGPT embeds methodology in its prompt (static). PentAG uses fixed recon→exploit.
Methodologies are defined as declarative phase maps. Each phase map specifies which phases to execute, which agents and skills to use per phase, and how phases connect. When a project is created with a methodology, the orchestrator loads the phase map and builds the execution plan accordingly. Phases can be skipped, added, or reordered during execution based on real-time findings.
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ OWASP Web Map    │  │ IoT Assess Map   │  │ PTES Net Map     │
│                  │  │                  │  │                  │
│ Phase 1: Recon   │  │ Phase 1: HW      │  │ Phase 1: Disc    │
│ Phase 2: Auth    │  │ Phase 2: FW      │  │ Phase 2: Scan    │
│ Phase 3: BAuth   │  │ Phase 3: Proto   │  │ Phase 3: Enum    │
│ Phase 4: Authz   │  │ Phase 4: Cloud   │  │ Phase 4: Exploit │
│ Phase 5: Input   │  │ Phase 5: Report  │  │ Phase 5: Post    │
│ Phase 6: API     │  │                  │  │ Phase 6: Report  │
│ Phase 7: Report  │  │                  │  │                  │
└────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
         └─────────────────────┼─────────────────────┘
                               ▼
                    ┌──────────────────────┐
                    │  PHASE MAP SELECTOR  │
                    │  Project → Method    │
                    │  → Load phase map    │
                    └──────────┬───────────┘
                               ▼
                    ┌──────────────────────┐
                    │  ORCHESTRATOR        │
                    │  → Create plan       │
                    │  → Execute phases    │
                    │  → Skip/add/reorder  │
                    └──────────────────────┘
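A declarative phase map can be as simple as plain data that the orchestrator copies into a mutable plan. The sketch below is an assumption about how this might look: the phase names are taken from the diagram, while the `Phase` fields, the agent and skill identifiers, the map keys, and the `ExecutionPlan` methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    agent: str                    # hypothetical agent identifier
    skills: tuple[str, ...] = ()  # hypothetical skill identifiers


# Declarative phase maps: plain data, no control flow.
PHASE_MAPS = {
    "ptes-network": (
        Phase("Disc", "network", ("host-discovery",)),
        Phase("Scan", "network", ("port-scan",)),
        Phase("Enum", "network", ("service-enum",)),
        Phase("Exploit", "network", ("exploit-check",)),
        Phase("Post", "network", ("post-exploit",)),
        Phase("Report", "reporter", ("report",)),
    ),
    "iot-assessment": (
        Phase("HW", "iot"), Phase("FW", "iot"), Phase("Proto", "iot"),
        Phase("Cloud", "cloud"), Phase("Report", "reporter"),
    ),
}


class ExecutionPlan:
    """Mutable plan built from a phase map; the map itself is never modified."""

    def __init__(self, methodology: str) -> None:
        self.pending = list(PHASE_MAPS[methodology])
        self.done: list[str] = []

    def next_phase(self):
        if not self.pending:
            return None
        phase = self.pending.pop(0)
        self.done.append(phase.name)
        return phase

    def skip(self, name: str) -> None:
        self.pending = [p for p in self.pending if p.name != name]

    def add_next(self, phase: Phase) -> None:
        self.pending.insert(0, phase)

    def move_to_front(self, name: str) -> None:
        idx = next(i for i, p in enumerate(self.pending) if p.name == name)
        self.pending.insert(0, self.pending.pop(idx))


if __name__ == "__main__":
    plan = ExecutionPlan("ptes-network")
    plan.next_phase()           # run discovery
    plan.skip("Post")           # findings say post-exploitation is out of scope
    plan.move_to_front("Enum")  # reorder based on what discovery found
    print([p.name for p in plan.pending])
```

Because the maps are data, switching methodology means selecting a different key, and runtime skip/add/reorder operations touch only the per-project plan.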
Security technology evolves constantly: new frameworks, new protocols, new attack techniques. On a conventional platform, each new arrival forces backend changes (new code, new logic, new agents), which makes the platform fragile and quickly obsolete.
The entire platform is skill-based. Every module uses skills from a shared skill library. When new technology arrives, only skills need updating. The orchestrator, agents, and infrastructure remain unchanged. Each skill contains: knowledge (what to look for), commands (how to test), patterns (what results mean), parser (how to extract findings), and report template.
┌──────────────────────────────────────────────────────┐
│ PLATFORM CORE (Never Changes) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Orchestr. │ │ Tri-Con │ │ Token │ │
│ │ │ │ Index │ │ Engine │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ All modules reference skills from library below. │
└─────────────────────────┬────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ SKILL LIBRARY (Extensible) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Web │ │ Mobile │ │ IoT │ │ Network │ │
│ │ Skills │ │ Skills │ │ Skills │ │ Skills │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Cloud │ │ AI/ML │ │ NEW TECH│ │
│ │ Skills │ │ Skills │ │ (Add │ │
│ │ │ │ │ │ anytime)│ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Each skill: knowledge · commands · patterns · parser│
└──────────────────────────────────────────────────────┘
ADDING NEW TECHNOLOGY:
New CVE/Tech ──▶ Write skill package ──▶ Platform uses it
Time: 1-2 hours · Zero downtime · No core changes
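One way to picture a skill package is as a record with the five parts listed above, plus a registry that the core looks skills up in. The sketch below is illustrative only: the `Skill` field layout, the `SKILLS` registry, `pattern_parser`, `run_skill`, and the example `server-banner` skill are hypothetical. The `execute` callback is injected so the example never runs a real command.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    name: str
    knowledge: str                      # what to look for
    commands: tuple[str, ...]           # how to test (command templates)
    patterns: dict[str, str]            # regex -> what a match means
    parser: Callable[[str], list[str]]  # how to extract findings
    report_template: str


SKILLS: dict[str, Skill] = {}  # shared skill library (in-memory stand-in)


def pattern_parser(patterns: dict[str, str]) -> Callable[[str], list[str]]:
    """Build a parser that reports every line matched by any pattern."""
    compiled = [(re.compile(rx, re.I | re.M), meaning) for rx, meaning in patterns.items()]

    def parse(output: str) -> list[str]:
        return [f"{meaning}: {m.group(0).strip()}"
                for rx, meaning in compiled for m in rx.finditer(output)]

    return parse


def run_skill(name: str, target: str, execute: Callable[[str], str]) -> list[str]:
    """Core-side code: identical for every skill, old or new."""
    skill = SKILLS[name]
    findings: list[str] = []
    for template in skill.commands:
        findings.extend(skill.parser(execute(template.format(target=target))))
    return [skill.report_template.format(target=target, finding=f) for f in findings]


# Adding new technology means registering one more record like this.
_patterns = {r"^server:.*$": "server banner disclosed"}
SKILLS["server-banner"] = Skill(
    name="server-banner",
    knowledge="Response headers that reveal server software and version.",
    commands=("curl -sI {target}",),
    patterns=_patterns,
    parser=pattern_parser(_patterns),
    report_template="[{target}] {finding}",
)

if __name__ == "__main__":
    fake_output = "HTTP/1.1 200 OK\nServer: nginx/1.18.0\n"
    print(run_skill("server-banner", "http://example.test", lambda cmd: fake_output))
```

Note that `run_skill` contains no skill-specific logic, which is the property the section claims for the platform core.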
End-to-end autonomous assessment execution flow — from project creation to final report. Everything runs autonomously, but remains fully observable. Humans can watch, pause, or intervene.
┌──────────────────────────────────────────────────────────────┐
│ 1. PROJECT CREATION │
│ │
│ Client details ──▶ Methodology ──▶ Phase Map ──▶ Orchestr. │
│ selected loaded initialized │
└──────────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. METHODOLOGY LOADING │
│ │
│ Phase Map ──▶ Parse phases ──▶ Identify agents ──▶ Identify │
│ & order needed per phase skills │
│ │
│ → Orchestrator builds complete execution plan │
└──────────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. PHASE EXECUTION (Loop) │
│ │
│ FOR EACH PHASE IN PHASE MAP: │
│ Load Skills ──▶ Assign Tasks ──▶ Execute ──▶ Collect │
│ │ Results │
│ ▼ │
│ Agent runs tool │
│ │ │
│ Tool Output │
│ │ │
│ Token Engine (optimize) │
│ │ │
│ Tri-Con Indexing Agent │
│ │ │
│ Orchestrator assesses │
│ → Next phase? Re-do? Add phase? │
└──────────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. FINDINGS AGGREGATION │
│ │
│ All phases done ──▶ Tri-Con L2/L3 merged ──▶ Cross-ref all │
│ findings │
└──────────────────────────┬───────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────┐
│ 5. REPORT GENERATION │
│ │
│ Aggregated ──▶ Reporter ──▶ Structured ──▶ Executive ──▶ Done│
│ findings Agent findings summary │
│ │
│ Output: Full technical report + executive summary │
│ All findings with evidence, CVSS, remediation │
└──────────────────────────────────────────────────────────────┘
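The phase execution loop in step 3, including the human intervention point, might be sketched like this. It is a simplification under stated assumptions: `run_engagement`, the `execute` and `assess` callbacks, the decision strings (`"next"`, `"redo"`, `"add:<phase>"`), the `on_event` hook, and the `max_redo` bound are all hypothetical. `execute` stands in for the whole chain of agent run, token optimization, and Tri-Con indexing.

```python
from typing import Callable


def run_engagement(
    phases: list[str],
    execute: Callable[[str], str],
    assess: Callable[[str, str], str],
    on_event: Callable[[str, str], str] = lambda event, phase: "go",
    max_redo: int = 1,
) -> list[tuple[str, str]]:
    """Run phases in order; after each one, move on, redo it, or add a phase."""
    queue = list(phases)
    findings: list[tuple[str, str]] = []
    redo_count: dict[str, int] = {}
    while queue:
        phase = queue.pop(0)
        # Observability hook: a human (or a UI) can stop before any phase.
        if on_event("before_phase", phase) == "stop":
            break
        result = execute(phase)
        findings.append((phase, result))
        decision = assess(phase, result)
        if decision == "redo" and redo_count.get(phase, 0) < max_redo:
            redo_count[phase] = redo_count.get(phase, 0) + 1
            queue.insert(0, phase)          # run the same phase again
        elif decision.startswith("add:"):
            queue.insert(0, decision[4:])   # run a newly added phase next
    return findings


if __name__ == "__main__":
    log = run_engagement(
        ["Recon", "Input", "Report"],
        execute=lambda phase: f"{phase}-output",
        assess=lambda phase, result: "add:API" if phase == "Input" else "next",
    )
    print([phase for phase, _ in log])
```

The redo bound is there so that an assessment that keeps answering "redo" cannot loop forever; how the real orchestrator guards against that is not stated in the source.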
Validation and performance tests across all 5 innovations. 98% pass rate across 405 test cases.
10 technical whitepapers documenting the architecture and design decisions. ~27,500 words of detailed analysis covering problems, solutions, and market comparisons. Papers 1-5 cover core innovations. Papers 6-10 cover broader analysis.
1. A Semantic Knowledge Framework for LLM-Driven Penetration Testing
2. A 4-Level Engine for Cost-Efficient LLM Penetration Testing
3. A State-Machine Agent Loop for Penetration Testing
4. Declarative Engagement Topologies for LLM Penetration Testing
5. A Composable Capability Framework for LLM Penetration Testing
6. Why Large-Scale Security Assessment Breaks LLM Agents
7. Cost Analysis and Optimization Strategies for Autonomous Pentesting
8. A Systems Design Approach to AI Security Assessment
9. From Simple Routing to Capability-Aware Assignment
10. Infrastructure Challenges and Architectural Solutions