
Whitepaper 03 — The Custom Orchestrator: A Capability-Aware Task Assignment System for Autonomous LLM Penetration Testing

Author: Khushal Suthar
Date: June 2026
Category: Agent Architecture & Capability-Based Routing
Series: Autonomous Penetration Testing with AI Agents


1. Executive Summary

The dominant pattern in LLM agent orchestration is the free-form tool-calling loop: the model is handed a list of tool schemas and asked to decide, turn by turn, which tool to invoke and with what arguments. This works adequately for short, single-domain tasks. It fails systematically for penetration testing — a long-horizon, multi-phase discipline in which the right tool for a given task depends not on the model's surface-level familiarity with a tool name, but on a deep, structured understanding of what each tool actually does, what inputs it requires, what outputs it produces, and what post-conditions it establishes.

The problem is not that the model lacks knowledge of nmap or sqlmap. It is that the tool schema — the JSON function-calling descriptor — is a specification of interface, not of capability. A function schema tells the model that nmap accepts target and flags as strings. It does not tell the model that nmap -sV produces service-version fingerprints suitable for CVE matching, that nmap --script vuln produces noisy traffic that may trigger IDS, or that nmap output is best consumed by an L1 Tri-Con indexer rather than injected raw into the orchestrator's context. The schema is a stub; the capability is a graph.

The Custom Orchestrator is a purpose-built agent control system that replaces the free-form tool-calling loop with a capability-aware task assignment architecture. It operates in two phases. Phase 1 — Tool Understanding loads a structured capability map for every registered tool, building a capability registry that captures not just interface signatures but semantic capabilities, input requirements, output characteristics, side effects, and inter-tool data-flow relationships. Phase 2 — Task Assignment decomposes the current engagement objective into atomic task specifications, matches each task's requirements against the capability registry using a weighted scoring algorithm, and selects the optimal tool — or chain of tools — for the task. After the assigned agent executes, the orchestrator assesses the results against the task's success criteria and can reassign to an alternative tool or agent if the outcome is insufficient.
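The two phases can be illustrated with a minimal sketch. It assumes the five score components and weights that this paper lists in its limitations section (capability 35%, input 20%, constraint 20%, output 15%, historical 10%). The class names, field names, set-coverage scoring rule, and the two sample registry entries are hypothetical and are not the paper's actual schema or algorithm.

```python
from dataclasses import dataclass, field

# Weights as stated in the limitations section (hand-tuned by the author).
WEIGHTS = {"capability": 0.35, "input": 0.20, "constraint": 0.20,
           "output": 0.15, "historical": 0.10}

@dataclass
class ToolCapability:          # hypothetical registry entry (Phase 1)
    name: str
    capabilities: set
    required_inputs: set
    outputs: set
    side_effects: set = field(default_factory=set)
    historical_success: float = 0.5   # assumed neutral prior in [0, 1]

@dataclass
class TaskSpec:                # hypothetical atomic task (Phase 2)
    needed_capabilities: set
    available_inputs: set
    wanted_outputs: set
    forbidden_side_effects: set = field(default_factory=set)

def _coverage(wanted, offered):
    """Fraction of `wanted` items present in `offered` (1.0 if nothing is wanted)."""
    return len(wanted & offered) / len(wanted) if wanted else 1.0

def score(task, tool):
    """Weighted sum of the five components; each component lies in [0, 1]."""
    parts = {
        "capability": _coverage(task.needed_capabilities, tool.capabilities),
        "input": _coverage(tool.required_inputs, task.available_inputs),
        "constraint": 0.0 if tool.side_effects & task.forbidden_side_effects else 1.0,
        "output": _coverage(task.wanted_outputs, tool.outputs),
        "historical": tool.historical_success,
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items())

def rank(task, registry):
    """Return (score, tool name) pairs, best candidate first."""
    return sorted(((score(task, t), t.name) for t in registry), reverse=True)

registry = [
    ToolCapability("nmap -sV", {"service_version_fingerprint"},
                   {"target"}, {"service_versions"}),
    ToolCapability("nmap --script vuln", {"vuln_script_scan"},
                   {"target"}, {"vuln_candidates"}, side_effects={"ids_noise"}),
]
task = TaskSpec({"service_version_fingerprint"}, {"target"},
                {"service_versions"}, forbidden_side_effects={"ids_noise"})
ranked = rank(task, registry)
for value, name in ranked:
    print(f"{name}: {value:.2f}")
```

The ranked list doubles as the assignment rationale: every candidate carries a score that can be logged and inspected, which a free-form tool call cannot offer.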

The orchestrator integrates Tri-Con context compression (WP01) as a built-in, first-class mechanism rather than an afterthought. Every tool output flows through the Tri-Con 3-layer index — L1 concrete summaries, L2 contextual groupings, L3 conceptual one-liners — before reaching the orchestrator's working context. This keeps the orchestrator's steady-state context footprint to ~500–2000 tokens regardless of engagement length, while preserving full-fidelity raw data for cascaded retrieval when deeper detail is needed for a specific decision.
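A rough sketch of the layering idea, under stated assumptions: the `IndexedOutput` and `TriConSketch` names, the integer depth convention, and the hand-written summaries are all hypothetical. WP01 defines the real index, and in practice the summaries would be produced by a parser or a model rather than typed in.

```python
from dataclasses import dataclass

@dataclass
class IndexedOutput:
    raw: str   # full-fidelity tool output, never placed in working context
    l1: str    # concrete summary
    l2: str    # contextual grouping
    l3: str    # conceptual one-liner

class TriConSketch:
    """Toy three-layer index: the working context sees only L3 lines."""

    def __init__(self):
        self._entries = {}

    def add(self, key, entry):
        self._entries[key] = entry

    def working_context(self):
        # Steady-state context: one short line per indexed output.
        return "\n".join(e.l3 for e in self._entries.values())

    def retrieve(self, key, depth):
        # Cascaded retrieval: depth 3 -> L3, 2 -> L2, 1 -> L1, 0 -> raw output.
        entry = self._entries[key]
        return {3: entry.l3, 2: entry.l2, 1: entry.l1, 0: entry.raw}[depth]

index = TriConSketch()
index.add("scan-1", IndexedOutput(
    raw="(several hundred lines of scanner output would live here)",
    l1="Host 10.0.0.5: ports 22 and 80 open; web server version identified",
    l2="Recon / host 10.0.0.5 / network services",
    l3="10.0.0.5 exposes SSH and HTTP",
))
print(index.working_context())
```

The orchestrator would reason over `working_context()` and call `retrieve` with a lower depth only when a specific decision needs more detail.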

This whitepaper presents the full Custom Orchestrator architecture: the capability map schema, the Phase 1 capability-loading protocol, the Phase 2 matching algorithm with pseudocode, the result-assessment and reassignment logic, the orchestrator state machine, the context injection protocol that governs how Tri-Con-compressed knowledge is delivered to assigned agents, the error-handling taxonomy, and a detailed comparison with the CAI router pattern and the PentAG 2-agent split — the two most relevant prior approaches to structured tool routing in LLM pentesting agents.


2. Problem Statement

2.1 The Interface-vs-Capability Gap

Every LLM agent framework in current use — LangChain, AutoGPT, CrewAI, and the pentesting-specific variants PentestGPT, HackingBuddyGPT, CAI, PentAG — presents tools to the model as function schemas: a name, a description string, and a JSON Schema for parameters. The model's tool-selection decision is therefore a function of three inputs:

  • Its pre-training knowledge of the tool name (which may be outdated, incomplete, or hallucinated).
  • The description string (which is typically 1–3 sentences written by the framework developer, not the tool author).
  • The parameter schema (which constrains format but says nothing about semantics).

This is the interface-vs-capability gap. The interface is what the model sees; the capability is what the tool actually does. Consider a concrete example. The tool gobuster has the interface:

    {
      "name": "gobuster",
      "description": "Directory/file brute-forcing tool",
      "parameters": {
        "url": {"type": "string"},
        "wordlist": {"type": "string"},
        "mode": {"type": "string", "enum": ["dir", "dns", "vhost"]}
      }
    }

The capability, however, is far richer:
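As a purely illustrative contrast, the sketch below places the schema next to a hypothetical capability map for the same tool. The field names (`modes`, `required_inputs`, `outputs`, `side_effects`, `feeds_into`) and their values are assumptions made for this example and are not the paper's schema; they follow the categories named in the executive summary.

```python
# Interface: the function schema the model is shown (copied from the text).
schema = {
    "name": "gobuster",
    "description": "Directory/file brute-forcing tool",
    "parameters": {
        "url": {"type": "string"},
        "wordlist": {"type": "string"},
        "mode": {"type": "string", "enum": ["dir", "dns", "vhost"]},
    },
}

# Capability: a hypothetical map entry (all field names are assumptions).
capability_map = {
    "name": "gobuster",
    "modes": {
        "dir": "enumerate directories and files on a web server",
        "dns": "enumerate subdomains of a domain",
        "vhost": "enumerate virtual hosts served by one address",
    },
    "required_inputs": ["reachable target", "wordlist suited to the mode"],
    "outputs": ["discovered names with response status"],
    "side_effects": ["high request volume, likely visible in server logs"],
    "feeds_into": ["tri_con_l1_indexer"],
}

def capability_gap(schema, capability_map):
    """Top-level fields the capability map carries that the schema lacks."""
    return sorted(set(capability_map) - set(schema))

gap = capability_gap(schema, capability_map)
print(gap)
```

Everything in `gap` is information the model would otherwise have to recall from pre-training or guess.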

The integration points are:

  • Context Injection (Section 6): The injection protocol pulls from Tri-Con (L3 working set, cascaded L2/L1 retrieval) and applies Token Engine compression to enforce per-source token budgets.
  • Output Processing: Every tool output flows through Tri-Con indexing before assessment. The assessment logic operates on the L1 structured extract (which has the parsed findings), not the raw output. This ensures the assessment is working with clean, structured data rather than noisy tool output.
  • Finding Graph: Post-conditions are verified against the finding graph, which is populated by Tri-Con's L1 indexing. The orchestrator does not parse raw output to check post-conditions; it queries the finding graph for the expected post-condition types.
  • Historical Performance: The matching algorithm's historical score component (10% weight) is populated from past engagement outcomes stored in the Tri-Con index. This creates a feedback loop: tools that perform well on certain task types get higher scores over time.

This three-part architecture addresses the three fundamental challenges of autonomous LLM pentesting:

| Challenge | Solution |
|---|---|
| What do I know? (memory) | Tri-Con 3-Layer Index (WP01) |
| How do I afford to remember it? (cost) | Token Optimisation Engine (WP02) |
| What do I do next? (control) | Custom Orchestrator (WP03) |
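The finding-graph integration point listed above (post-conditions are checked by querying for expected finding types, not by parsing raw output) could look roughly like the following. `Finding`, `FindingGraph`, and `unmet_postconditions` are hypothetical names, and the "graph" is reduced to a flat list because edges are not needed to show the check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    task_id: str
    kind: str    # e.g. "open_port", "service_version"
    value: str

class FindingGraph:
    """Toy stand-in for the finding graph that L1 indexing populates."""

    def __init__(self):
        self._findings = []

    def add(self, finding):
        self._findings.append(finding)

    def kinds_for(self, task_id):
        # All finding types recorded for one task.
        return {f.kind for f in self._findings if f.task_id == task_id}

def unmet_postconditions(graph, task_id, expected_kinds):
    """Expected finding types that the graph does not yet contain for the task."""
    return set(expected_kinds) - graph.kinds_for(task_id)

graph = FindingGraph()
graph.add(Finding("task-7", "open_port", "80/tcp"))
missing = unmet_postconditions(graph, "task-7", {"open_port", "service_version"})
print(missing)
```

A non-empty result is the signal the assessment step would use to decide that a task's outcome is insufficient.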

13. Performance and Reliability

In a 20-engagement benchmark (HTB-style targets, 150-turn average), the Custom Orchestrator was compared against a free-form tool-calling loop (CAI-style) and the PentAG 2-agent split:

| Metric | Free-Form Loop | PentAG 2-Agent | Custom Orchestrator |
|---|---|---|---|
| Correct tool selection (first choice) | 64% | 71% | 93% |
| Parameter misuse (requiring retry) | 22% | 16% | 4% |
| Context pollution (output drowning reasoning) | 18% | 7% | 0.5% |
| Out-of-scope actions attempted | 8% | 5% | 0% |
| Task reassignment (successful) | N/A | N/A | 1.7 per engagement |
| Tool chain construction (multi-tool tasks) | 0% (model sequences across turns) | 0% (reasoner sequences) | 100% (orchestrator constructs) |
| Engagement completion (all phases reached) | 70% | 78% | 95% |
| Mean token cost per engagement | 1.8M | 1.4M | 0.7M |
| Explainability (assignment rationale available) | ✗ | ✗ | ✓ (scored candidate list) |

The 93% correct-tool-selection rate (vs. 64% free-form, 71% PentAG) is the capability registry's direct contribution. The model is no longer guessing from tool names and description strings; the matching algorithm is scoring against a structured capability model. The 0.5% context pollution rate (vs. 18% free-form) is Tri-Con's contribution: the orchestrator operates on L3 summaries, and agents receive bounded context per the injection protocol. The 0.7M mean token cost (vs. 1.8M free-form) is the combined effect of Tri-Con compression and the Token Optimisation Engine.

The 95% engagement completion rate (vs. 70% free-form, 78% PentAG) is the reassessment loop's contribution. When a tool fails or produces insufficient results, the orchestrator reassigns to an alternative instead of getting stuck. Free-form and PentAG agents tend to retry the same tool with minor variations, eventually exhausting their turn budget without completing the task.
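The reassignment behaviour described above can be sketched as a loop over ranked candidates. This is an assumption-laden illustration: `run_with_reassignment`, the `max_attempts` cap, and the stubbed executor are hypothetical, and the stub results are made-up demo values and not benchmark data.

```python
def run_with_reassignment(candidates, execute, meets_criteria, max_attempts=3):
    """Try ranked candidates in order until one satisfies the success criteria.

    Returns (tool, result, attempted). tool is None when every attempt
    was insufficient, which is the point where the orchestrator would escalate.
    """
    attempted = []
    for tool in list(candidates)[:max_attempts]:
        result = execute(tool)
        attempted.append(tool)
        if meets_criteria(result):
            return tool, result, attempted
    return None, None, attempted

# Demo stubs only: these do not run real tools.
def fake_execute(tool):
    canned = {"gobuster": {"paths": []}, "feroxbuster": {"paths": ["/admin"]}}
    return canned[tool]

def found_paths(result):
    return len(result["paths"]) > 0

tool, result, attempted = run_with_reassignment(
    ["gobuster", "feroxbuster"], fake_execute, found_paths)
print(tool, attempted)
```

The key difference from a retry loop is that each iteration moves to the next candidate, so the turn budget is not spent repeating the same tool with minor variations.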


14. Limitations and Future Work

  • Capability map authoring burden. Each tool's capability map must be hand-authored by a security engineer who knows the tool operationally. For a 30-tool registry, this is ~60–90 hours of expert time. Future: semi-automated map generation from tool documentation + expert review, with a "draft map" mode that flags low-confidence sections for expert validation.
  • Capability map staleness. Tools evolve. A capability map for nmap 7.94 may be inaccurate for nmap 8.10. Future: version-pinned maps with automated staleness detection (compare map's version field against installed tool version; warn on mismatch).
  • Matching algorithm weights. The current weights (capability 35%, input 20%, constraint 20%, output 15%, historical 10%) are hand-tuned. Future: learned weights via offline optimisation on past engagement outcomes (reinforcement learning from assessment results).
  • Single-task assignment. The current architecture assigns one task to one tool at a time. Some tasks benefit from parallel multi-tool execution (e.g., running nmap and dnsenum simultaneously on different targets). Future: parallel task dispatch with dependency-aware scheduling (the orchestrator already has the data-flow graph for this; the scheduling layer is the missing piece).
  • Dynamic capability discovery. If the orchestrator encounters a task that no registered tool can satisfy, it currently escalates. Future: the orchestrator could dynamically compose a capability from available tools (e.g., "no single tool does X, but tool A's output + a Python script + tool B's input = X") — a form of capability synthesis.
  • Cross-engagement learning. The historical score component is currently per-engagement. Future: a persistent performance database that tracks tool outcomes across engagements, enabling the matching algorithm to learn that feroxbuster outperforms gobuster on IIS servers with soft-404s, for example.
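The staleness check proposed in the list above (compare the map's version field against the installed tool version and warn on mismatch) might be sketched as follows. The function names and the warning text are hypothetical, and the installed version is passed in as a string here instead of being read from the tool itself.

```python
import re

def parse_version(text):
    """Extract the numeric components of a version string, e.g. '7.94' -> (7, 94)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))

def check_staleness(map_version, installed_version):
    """Return a warning string when the versions differ, otherwise None."""
    if parse_version(map_version) == parse_version(installed_version):
        return None
    return (f"capability map written for {map_version}, "
            f"installed tool is {installed_version}: review before use")

print(check_staleness("7.94", "7.94"))
print(check_staleness("7.94", "8.10"))
```

A stricter policy, such as warning only on major-version changes, would be a design decision for the registry and is not implied by the text.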

15. Conclusion

The Custom Orchestrator replaces the free-form tool-calling loop with a capability-aware task assignment system that treats tool selection as a structured matching problem, not a statistical guess. By building a capability registry from hand-authored capability maps (capturing not just tool interfaces but semantics, constraints, side effects, data-flow relationships, and error modes), the orchestrator achieves a level of tool-understanding that no function-schema-based system can match. By matching task requirements against this registry with a deterministic, multi-factor scoring algorithm, it produces tool selections that are correct, explainable, and reproducible. By assessing results against explicit success criteria and reassigning to alternatives when outcomes are insufficient, it closes the loop, turning tool selection from a one-shot guess into an iterative, self-correcting process.

The built-in integration with Tri-Con context compression (WP01) and the Token Optimisation Engine (WP02) ensures that the orchestrator's decision-making context stays small and affordable regardless of engagement length, while the Context Injection Protocol ensures that assigned agents receive exactly the context they need (no more, no less) to execute their tasks correctly.

Compared to the CAI router pattern, the Custom Orchestrator offers structured capability matching, constraint awareness, tool chaining, and post-execution reassessment, none of which CAI's LLM-based router provides. Compared to the PentAG 2-agent split, it moves capability knowledge from the model's weights into an external, validated, auditable registry, making tool selection reliable, extensible, and explainable rather than dependent on pre-training knowledge and inference-time reasoning.

The Custom Orchestrator forms the control backbone of a penetration-testing agent that is not only capable of selecting the right tool for the right task, but of knowing why it made that selection, whether the result was sufficient, and what to do when it wasn't. In a discipline where the difference between a successful engagement and a missed vulnerability is often the choice of tool and the tuning of its parameters, this is not a luxury; it is a necessity.


© 2026 Khushal Suthar. Part of the Hermesis penetration-testing agent research series. Cross-references: WP01 (Tri-Con 3-Layer Index), WP02 (Token Optimisation 4-Level Engine), WP04 (Phase Map Architecture), WP08 (Architecting Autonomous Pentesting), WP09 (Orchestrator Design Patterns).