
Whitepaper 04 — Phase Map Architecture: Declarative Methodology Configs for Autonomous Penetration Testing

Author: Khushal Suthar, Associate Principal Security Analyst
Date: June 2026
Category: Workflow Modelling & Engagement Planning


1. Executive Summary

Penetration testing is not a linear pipeline. Real engagements involve parallel service enumeration, conditional exploitation paths, pivot chains, phase repetitions triggered by new findings, and scope-driven methodology selection. Yet every LLM-based penetration-testing tool to date — PentestGPT, HackingBuddyGPT, CAI, PentAG — models the engagement as either a flat loop or a fixed sequence of phases. This forces the agent into a one-size-fits-all workflow that cannot adapt to the topology of the specific target or the methodology dictated by the engagement type.

The Phase Map Architecture introduces a **declarative, graph-based engagement topology** that the orchestrator (WP03) interprets at runtime. A Phase Map is a directed graph of phases, agents, skills, transitions, and conditional gates — defined in YAML — that describes how to approach a specific engagement type. Each phase map defines its phases, the specialist agents assigned to each phase, the skills available within each phase, the ordering of phases, and the conditional transitions that govern movement between phases. The orchestrator loads the phase map at engagement start, builds an execution plan from it, and walks the graph dynamically — branching, merging, backtracking, and nesting based on findings rather than following a fixed sequence. Phases can be skipped, added, or reordered mid-engagement through a runtime modification protocol.
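As a rough illustration of the idea, a phase map can be held in memory as a set of named phases, each with an agent, a skill list, and gated transitions, and walked with a worklist. The sketch below is a minimal Python model under stated assumptions: the class names (`Phase`, `Transition`), the `walk` function, the plain-dict finding store, and the tool names are all hypothetical and are not taken from the Phase Map schema itself. It visits each phase at most once, so it does not model backtracking, nesting, or runtime modification.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


def always_open(findings: dict) -> bool:
    """Default gate: the transition is unconditional."""
    return True


@dataclass
class Transition:
    target: str                                  # name of the successor phase
    gate: Callable[[dict], bool] = always_open   # predicate over the findings


@dataclass
class Phase:
    name: str
    agent: str                                   # specialist agent for this phase
    skills: list[str]                            # skills available in this phase
    transitions: list[Transition] = field(default_factory=list)


def walk(phases: dict[str, Phase], start: str, findings: dict,
         run_phase: Callable[[Phase, dict], None]) -> list[str]:
    """Run a phase, then enqueue every successor whose gate passes."""
    order, seen, worklist = [], set(), deque([start])
    while worklist:
        name = worklist.popleft()
        if name in seen:                         # simplification: no re-entry
            continue
        seen.add(name)
        phase = phases[name]
        run_phase(phase, findings)               # the agent would act here
        order.append(name)
        for transition in phase.transitions:
            if transition.gate(findings):
                worklist.append(transition.target)
    return order


def demo_run(phase: Phase, findings: dict) -> None:
    """Stand-in for agent execution: recon 'discovers' an HTTP service."""
    if phase.name == "recon":
        findings["services"] = {"http"}


phases = {
    "recon": Phase("recon", "recon_agent", ["nmap"], [
        Transition("enum_web", lambda f: "http" in f.get("services", set())),
        Transition("enum_smb", lambda f: "smb" in f.get("services", set())),
    ]),
    "enum_web": Phase("enum_web", "web_agent", ["ffuf"], [Transition("report")]),
    "enum_smb": Phase("enum_smb", "smb_agent", ["smbclient"], [Transition("report")]),
    "report": Phase("report", "report_agent", []),
}

order = walk(phases, "recon", {}, demo_run)
```

In this run only the HTTP gate opens, so the SMB enumeration phase is never scheduled. The path taken is decided by the findings rather than by a fixed sequence.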

This whitepaper presents the Phase Map schema, the graph-walking algorithm, conditional gate evaluation, three complete example phase maps (OWASP Web in 7 phases, IoT Assessment in 5 phases, PTES Network in 6 phases), state transition diagrams, the runtime modification protocol, a comparative analysis against PentestGPT and Pentera, and performance benchmarks drawn from a 50-engagement evaluation corpus.


2. Problem Statement

2.1 The Linear-Pipeline Fallacy

The standard penetration-testing methodologies and frameworks (PTES, the OWASP Web Security Testing Guide (WSTG), MITRE ATT&CK) are typically taught as a linear sequence: Recon → Enum → Vuln → Exploit → Post-Exp → Report. In practice, this sequence is a simplification that breaks down on contact with real targets:

based (known cred → known service). Phase Map's sub_maps enable LLM-driven pivot decisions — the agent reasons about what to do with newly discovered credentials and which target to pivot to.

Pentera is a valuable tool for continuous automated validation of known vulnerabilities. Phase Maps address a different problem: autonomous, adaptable, methodology-driven penetration testing with LLM reasoning. The two are complementary — Pentera could be exposed as a skill within a Phase Map node, combining Pentera's exploit reliability with Phase Map's methodological flexibility.

9.4 HackingBuddyGPT

HackingBuddyGPT's task list is a user-defined flat sequence. The user must manually decide the topology before the engagement begins — impossible when the topology depends on findings discovered during the engagement. Phase Map's conditional gates make the topology data-driven, not user-driven. Additionally, HackingBuddyGPT has no per-phase agent or skill binding, meaning the same agent with the same tool set handles every task regardless of specialisation needs.

9.5 CAI

CAI (Cybersecurity AI) has no workflow model whatsoever. The agent is a single free-form loop that responds to whatever the LLM decides. This is maximally flexible but provides no progression guarantees: the agent can get stuck enumerating the same service indefinitely, skip critical phases, or never reach a reporting state. Phase Map adds structure without removing tactical flexibility. The LLM still decides how to execute each node, but the graph determines which phases run and in what order.

9.6 PentAG

PentAG's recon → reasoning → acting cycle is a fixed linear pipeline. It cannot branch (e.g. "if HTTP found, also run web enumeration in parallel with SMB enumeration"). It cannot nest (pivot to a new target mid-engagement). It has a single methodology with no presets. Phase Map addresses all three limitations with graph branching, sub-maps, and pre-built methodology maps.


10. Performance Benchmarks

A 50-engagement benchmark was conducted comparing the Phase Map architecture against a fixed-sequence baseline (the same agents running a hardcoded RECON → ENUM → VULN → EXPLOIT → POST-EXP → REPORT pipeline) and against PentestGPT on a subset of 20 engagements.

10.1 Wall-Clock Time

| Engagement Type | Fixed Sequence | Phase Map | PentestGPT | Phase Map Improvement |
|---|---|---|---|---|
| OWASP Web (7-phase) | 52 min | 34 min | 61 min | 35% vs fixed, 44% vs PentestGPT |
| IoT Assessment (5-phase) | 28 min | 19 min | 38 min | 32% vs fixed, 50% vs PentestGPT |
| PTES Network (6-phase) | 67 min | 41 min | 82 min | 39% vs fixed, 50% vs PentestGPT |
| AD Domain (nested) | 94 min | 58 min | N/A (cannot model) | 38% vs fixed |
| API Security | 41 min | 29 min | 49 min | 29% vs fixed, 41% vs PentestGPT |

Wall-clock reduction comes primarily from parallel service enumeration (parallel_fanout) and from skipping inapplicable phases (conditional branching eliminates unnecessary work).
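The improvement figures in the wall-clock table can be reproduced from the raw timings, assuming each is a relative reduction, (baseline − Phase Map) / baseline, rounded to the nearest whole percent. The short Python check below uses only the minutes reported in the table; the `reduction` helper is introduced here purely for illustration.

```python
# Wall-clock minutes from section 10.1: (fixed sequence, Phase Map, PentestGPT or None)
timings = {
    "OWASP Web": (52, 34, 61),
    "IoT Assessment": (28, 19, 38),
    "PTES Network": (67, 41, 82),
    "AD Domain": (94, 58, None),   # PentestGPT could not model this engagement
    "API Security": (41, 29, 49),
}


def reduction(baseline: float, phase_map: float) -> int:
    """Relative reduction versus the baseline, as a rounded percentage."""
    return round(100 * (baseline - phase_map) / baseline)


vs_fixed = {name: reduction(fixed, pm) for name, (fixed, pm, _) in timings.items()}
vs_pentestgpt = {
    name: reduction(gpt, pm)
    for name, (_, pm, gpt) in timings.items()
    if gpt is not None
}
```

Computed this way, the reduction against the fixed sequence spans 29% to 39% across the five engagement types, and the reduction against PentestGPT spans 41% to 50% across the four types it could run.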

10.2 Finding Coverage

| Metric | Fixed Sequence | Phase Map | PentestGPT |
|---|---|---|---|
| Avg findings per engagement | 14.2 | 19.7 | 11.3 |
| Services enumerated in parallel | 0 | 2.8 avg | 0 |
| Pivot chains discovered and exploited | 0% | 42% | 0% |
| Backtrack events | 0 (not supported) | 1.9 avg | 0 |
| Phases skipped (unnecessary work avoided) | 0 | 1.3 avg | 0 |
| Runtime modifications applied | 0 | 0.8 avg | 0 |
| Engagement completion (reached report) | 80% | 96% | 72% |

The 39% finding-coverage increase (14.2 → 19.7) comes from three sources: parallel enumeration covers more services within the same budget, backtracking re-enumerates after post-exp findings reveal missed services, and runtime modifications add specialised phases (e.g. adding enum_ldap when LDAP is discovered mid-engagement).
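The runtime modification mentioned above (adding enum_ldap once LDAP is discovered) can be pictured as a propose, validate, approve, and journal step. The Python sketch below is an assumption-laden illustration, not the protocol's actual implementation: `PhaseGraph`, `propose_add_phase`, the journal entry fields, and the approver callback are hypothetical names, and validation is reduced to checking that the anchor phases exist and the new phase name is unused.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PhaseGraph:
    edges: dict[str, list[str]]                         # phase -> successor phases
    journal: list[dict] = field(default_factory=list)   # audit log of proposals


def propose_add_phase(graph: PhaseGraph, new_phase: str, after: str, before: str,
                      reason: str, approve: Callable[[dict], bool]) -> bool:
    """Insert new_phase on a path from `after` to `before` if valid and approved."""
    entry = {"op": "add", "phase": new_phase, "reason": reason, "status": "rejected"}
    valid = (
        new_phase not in graph.edges      # name must not collide with a phase
        and after in graph.edges          # both anchor phases must exist
        and before in graph.edges
    )
    if valid and approve(entry):          # operator or auto-approver decides
        graph.edges[after].append(new_phase)
        graph.edges[new_phase] = [before]
        entry["status"] = "applied"
    graph.journal.append(entry)           # every proposal is logged
    return entry["status"] == "applied"


graph = PhaseGraph(edges={
    "recon": ["enum"], "enum": ["exploit"], "exploit": ["post_exp"],
    "post_exp": ["report"], "report": [],
})

applied = propose_add_phase(graph, "enum_ldap", after="enum", before="exploit",
                            reason="LDAP discovered mid-engagement",
                            approve=lambda entry: True)
rejected = propose_add_phase(graph, "enum_snmp", after="missing_phase", before="exploit",
                             reason="anchor phase does not exist",
                             approve=lambda entry: True)
```

Rejected proposals are journaled as well as applied ones, so the audit trail records what the agent asked for even when the graph was left unchanged.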

10.3 Pivot Discovery

The 42% pivot discovery rate (vs 0% for fixed-sequence and PentestGPT) is the most significant architectural advantage. Fixed-sequence tools cannot model pivoting — there is no mechanism to re-enter enumeration from within exploitation against a new target. Phase Map's sub_maps make pivoting a first-class operation: the walker spawns a nested walker for the pivot target, which runs the same (or a different) phase map, and findings are merged into the primary finding graph.
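The nested-walker idea can be sketched as a recursive call: when a phase reports a pivot target, a new walk starts against that target and its findings are merged into the caller's list. The Python below is a simplified illustration under several assumptions. The phase map is flattened to an ordered list of phase names rather than a graph, and `run_engagement`, the `execute` callback, the result keys (`findings`, `pivots`), and the depth limit are hypothetical names that do not come from the Phase Map schema.

```python
from typing import Callable, Optional


def run_engagement(target: str, phase_map: list[str],
                   execute: Callable[[str, str], dict],
                   visited: Optional[set] = None,
                   depth: int = 0, max_depth: int = 2) -> list[tuple[str, str]]:
    """Walk phase_map against target, nesting a new walk for each pivot target."""
    visited = set() if visited is None else visited
    visited.add(target)
    findings: list[tuple[str, str]] = []
    for phase in phase_map:
        result = execute(target, phase)
        findings.extend((target, item) for item in result.get("findings", []))
        for pivot in result.get("pivots", []):
            # Guard against pivot loops and unbounded nesting.
            if pivot not in visited and depth < max_depth:
                # Nested walker: the same map is reused here, but a different
                # map could be passed for the pivot target.
                findings.extend(run_engagement(pivot, phase_map, execute,
                                               visited, depth + 1, max_depth))
    return findings


def fake_execute(target: str, phase: str) -> dict:
    """Stand-in for agent execution with canned results."""
    if target == "10.0.0.5" and phase == "post_exp":
        return {"findings": ["reused admin credentials"], "pivots": ["10.0.0.9"]}
    if target == "10.0.0.9" and phase == "enum":
        return {"findings": ["open SMB share"], "pivots": ["10.0.0.5"]}
    return {}


merged = run_engagement("10.0.0.5", ["enum", "exploit", "post_exp"], fake_execute)
```

The `visited` set keeps the second host from pivoting back to the first, and the nested findings end up in the same list as the primary target's findings.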

10.4 Token Efficiency

| Metric | Fixed Sequence | Phase Map |
|---|---|---|
| Avg tokens per engagement | 485K | 362K |
| Context summarisation events | 5 (fixed phases) | 8.2 avg (more transitions) |
| Knowledge retrieval precision (Tri-Con) | 62% | 91% |

Token usage drops 25% despite more summarisation events, because per-phase skill scoping prevents agents from loading irrelevant tool descriptions, and per-phase agent assignment with Tri-Con routing (WP01) delivers 91% retrieval precision versus 62% for the generic fixed-sequence baseline. Each agent receives only the knowledge relevant to its assigned phase and service.
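Per-phase skill scoping amounts to building the agent's tool prompt from only the skills bound to the current phase. The Python sketch below illustrates this with a hypothetical registry. The names `SKILL_DESCRIPTIONS`, `PHASE_SKILLS`, and `build_tool_prompt`, the phase names, and the one-line descriptions are assumptions for illustration and do not reflect the real skill library.

```python
from typing import Optional

# Hypothetical skill registry: skill name -> description injected into the prompt.
SKILL_DESCRIPTIONS = {
    "nmap": "Port and service scanner.",
    "ffuf": "Web content fuzzer.",
    "sqlmap": "SQL injection detection and exploitation.",
    "hashcat": "Offline password cracker.",
}

# Hypothetical per-phase bindings, as a phase map might declare them.
PHASE_SKILLS = {
    "recon": ["nmap"],
    "enum_web": ["ffuf"],
    "exploit_web": ["sqlmap"],
}


def build_tool_prompt(phase: Optional[str] = None) -> str:
    """Describe only the skills bound to `phase`; with no phase, describe all."""
    names = SKILL_DESCRIPTIONS if phase is None else PHASE_SKILLS.get(phase, [])
    return "\n".join(f"- {name}: {SKILL_DESCRIPTIONS[name]}" for name in names)


scoped = build_tool_prompt("exploit_web")   # what a scoped agent would see
unscoped = build_tool_prompt()              # what an unscoped agent would see
```

The scoped prompt contains a single tool description where the unscoped one contains four. That difference is the mechanism behind the token saving, although the size of the saving depends on the real skill library.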


11. Limitations and Future Work

  • Parallel execution. Currently, parallel_fanout branches run sequentially with context isolation. True concurrent execution requires async LLM calls and thread-safe finding graph updates. The schema already supports concurrency (the walker's active set can hold multiple nodes), but the execution engine is serial. An async executor is the primary engineering priority.

  • Gate complexity. Gates are simple predicates over the finding graph. Complex conditions (e.g. "if more than 3 critical vulns AND credentials obtained AND target is in AD domain AND business hours") require a more expressive gate language, potentially a DSL or embedded Python expressions. The current predicate language covers ~85% of real engagement conditions but struggles with multi-variable compound logic.

  • Map validation. Invalid maps (cycles without exit conditions, unreachable nodes, missing terminal nodes) are only partially validated at load time. A static analyser for Phase Maps (analogous to a linter for CI pipelines) is planned. It would detect deadlocks, unreachable nodes, missing gates, and orphaned agent references before the engagement starts.

  • Adaptive map mutation. Currently, runtime modifications are explicit (agent proposes, operator or auto-approver approves). A more advanced model would allow the LLM to learn map improvements across engagements, e.g. "in 7 of 10 network engagements, LDAP was discovered post-exploitation and required a backtrack; add enum_ldap to the default PTES map." This requires an offline meta-learning loop over the modification journal.

  • Cross-map skill sharing. Currently, each map defines its own skill library. A global skill registry with per-map filtering would reduce duplication and enable skill reuse across methodologies (e.g. sqlmap appears in both OWASP Web and API Security maps).

  • Human-in-the-loop integration. The modification protocol supports human approval, but the UX for reviewing and approving agent-proposed modifications mid-engagement is not yet built. A real-time dashboard showing the active graph, pending modifications, and approval controls is needed for operational use.

  • Engagement type auto-selection. Currently, the operator selects the phase map at engagement start. A pre-engagement classifier that analyses the scope (target type, network range, application stack) and recommends the appropriate map would reduce operator burden and prevent methodology mismatches.
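To make the map-validation item above more concrete, the following Python sketch shows what a few of the planned static checks might look like. It is not the planned analyser. The `validate_map` function, its arguments, and the adjacency-dict representation are assumptions chosen for illustration. It covers unreachable nodes, a missing reachable terminal node, and orphaned agent references, and does not attempt deadlock or missing-gate detection.

```python
def validate_map(edges: dict[str, list[str]], agents: dict[str, str],
                 known_agents: set[str], start: str,
                 terminals: set[str]) -> list[str]:
    """Return human-readable problems found in a phase map adjacency dict."""
    problems = []

    # Depth-first search from the start node to find every reachable phase.
    reachable, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(edges.get(node, []))

    for node in sorted(set(edges) - reachable):
        problems.append(f"unreachable node: {node}")

    if not (reachable & terminals):
        problems.append("no terminal node reachable from start")

    for node, agent in sorted(agents.items()):
        if agent not in known_agents:
            problems.append(f"orphaned agent reference: {node} -> {agent}")

    return problems


problems = validate_map(
    edges={"recon": ["enum"], "enum": ["report"], "report": [],
           "enum_ldap": ["report"]},          # nothing leads into enum_ldap
    agents={"recon": "recon_agent", "enum": "ghost_agent"},
    known_agents={"recon_agent", "report_agent"},
    start="recon",
    terminals={"report"},
)
```

Running checks like these at load time would surface the stray enum_ldap node and the undefined agent before any phase executes.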


12. Conclusion

The Phase Map Architecture elevates LLM penetration testing from a fixed linear pipeline to a **declarative, graph-based engagement topology**. By expressing phases, per-phase agents, per-phase skills, parallel branches, conditional gates, backtracks, nested pivot engagements, and runtime modifications as a YAML graph interpreted at runtime, it enables the agent to adapt its workflow to the target's actual attack surface and the engagement's actual methodology.

The three example maps, OWASP Web (7 phases), IoT Assessment (5 phases), and PTES Network (6 phases), demonstrate that the schema is expressive enough for fundamentally different engagement types without requiring orchestrator code changes. The runtime modification protocol ensures that the graph is not frozen: phases can be skipped, added, or reordered mid-engagement with full validation and audit logging. The graph-walking algorithm with its worklist-driven traversal, gate evaluation, and sub-map nesting provides the execution semantics.

Performance benchmarks show a 29–39% wall-clock reduction, a 39% increase in findings per engagement, and a 42% pivot discovery rate compared to the fixed-sequence baseline, with 25% lower token usage. Against PentestGPT, the wall-clock reduction is larger (41–50%) due to PentestGPT's inability to parallelise or branch. Against Pentera, Phase Maps offer complementary capabilities: methodological flexibility and LLM reasoning where Pentera offers exploit reliability for known vulnerabilities.

Phase Maps close the gap between the linear models of existing tools and the non-linear reality of penetration testing. They provide the strategic planning layer that ties together the Tri-Con knowledge index (WP01), the Token Optimisation Engine (WP02), and the Custom Orchestrator (WP03), determining what phases to run, in what order, with which agents and skills, and how to adapt when the target doesn't match the plan.


© 2026 Khushal Suthar. Part of the Hermesis penetration-testing agent research series.