Whitepaper 03 — The Custom Orchestrator: A Capability-Aware Task Assignment System for Autonomous LLM Penetration Testing
Author: Khushal Suthar
Date: June 2026
Category: Agent Architecture & Capability-Based Routing
Series: Autonomous Penetration Testing with AI Agents
1. Executive Summary
The dominant pattern in LLM agent orchestration is the free-form tool-calling loop: the model is handed a list of tool schemas and asked to decide, turn by turn, which tool to invoke and with what arguments. This works adequately for short, single-domain tasks. It fails systematically for penetration testing — a long-horizon, multi-phase discipline in which the right tool for a given task depends not on the model's surface-level familiarity with a tool name, but on a deep, structured understanding of what each tool actually does, what inputs it requires, what outputs it produces, and what post-conditions it establishes.
The problem is not that the model lacks knowledge of nmap or sqlmap. It is that the tool schema — the JSON function-calling descriptor — is a specification of interface, not of capability. A function schema tells the model that nmap accepts target and flags as strings. It does not tell the model that nmap -sV produces service-version fingerprints suitable for CVE matching, that nmap --script vuln produces noisy traffic that may trigger IDS, or that nmap output is best consumed by an L1 Tri-Con indexer rather than injected raw into the orchestrator's context. The schema is a stub; the capability is a graph.
The Custom Orchestrator is a purpose-built agent control system that replaces the free-form tool-calling loop with a capability-aware task assignment architecture. It operates in two phases. Phase 1 — Tool Understanding loads a structured capability map for every registered tool, building a capability registry that captures not just interface signatures but semantic capabilities, input requirements, output characteristics, side effects, and inter-tool data-flow relationships. Phase 2 — Task Assignment decomposes the current engagement objective into atomic task specifications, matches each task's requirements against the capability registry using a weighted scoring algorithm, and selects the optimal tool — or chain of tools — for the task. After the assigned agent executes, the orchestrator assesses the results against the task's success criteria and can reassign to an alternative tool or agent if the outcome is insufficient.
The orchestrator integrates Tri-Con context compression (WP01) as a built-in, first-class mechanism rather than an afterthought. Every tool output flows through the Tri-Con 3-layer index — L1 concrete summaries, L2 contextual groupings, L3 conceptual one-liners — before reaching the orchestrator's working context. This keeps the orchestrator's steady-state context footprint to ~500–2000 tokens regardless of engagement length, while preserving full-fidelity raw data for cascaded retrieval when deeper detail is needed for a specific decision.
This whitepaper presents the full Custom Orchestrator architecture: the capability map schema, the Phase 1 capability-loading protocol, the Phase 2 matching algorithm with pseudocode, the result-assessment and reassignment logic, the orchestrator state machine, the context injection protocol that governs how Tri-Con-compressed knowledge is delivered to assigned agents, the error-handling taxonomy, and a detailed comparison with the CAI router pattern and the PentAG 2-agent split — the two most relevant prior approaches to structured tool routing in LLM pentesting agents.
2. Problem Statement
2.1 The Interface-vs-Capability Gap
Every LLM agent framework in current use (LangChain, AutoGPT, CrewAI, and the pentesting-specific variants PentestGPT, HackingBuddyGPT, CAI, and PentAG) presents tools to the model as function schemas: a name, a description string, and a JSON Schema for parameters. The model's tool-selection decision is therefore driven by those three fields alone.
This is the interface-vs-capability gap. The interface is what the model sees; the capability is what the tool actually does. Consider a concrete example. The tool gobuster has the interface:
{
  "name": "gobuster",
  "description": "Directory/file brute-forcing tool",
  "parameters": {
    "url": {"type": "string"},
    "wordlist": {"type": "string"},
    "mode": {"type": "string", "enum": ["dir", "dns", "vhost"]}
  }
}
The capability, however, is far richer:
- Best for: HTTP directory enumeration on web servers with predictable paths. Produces status-code-tagged path lists.
- Input requirements: A live URL, a wordlist appropriate to the target technology (e.g., directory-list-2.3-medium.txt for Apache, custom list for IIS), and knowledge of the server's response behaviour for non-existent paths (404 vs 403 vs 200-soft-404).
- Output characteristics: Line-oriented text, each line a path + status code. Output volume scales with wordlist size (10K–300K lines). High false-positive rate on soft-404 servers without --wildcard tuning.
- Side effects: Generates significant HTTP traffic (one request per wordlist entry). Detectable by WAF/IDS. May trigger rate-limiting or IP bans on aggressive configurations.
- Downstream consumers: Output is best indexed at L1 (structured path+status pairs) and consumed by the VULN phase for directory-based vulnerability assessment. Not suitable for raw injection into orchestrator context.
- Alternatives: feroxbuster (recursive, faster, better soft-404 detection), dirb (slower, simpler), ffuf (more flexible, supports fuzzing beyond directories).
None of this is captured in the function schema. The model must either know it from pre-training (unreliable, especially for parameter tuning and side-effect awareness) or discover it through trial and error (expensive in tokens and engagement time, potentially dangerous in live engagements).
2.2 Consequences of the Gap
The interface-vs-capability gap produces four predictable failure modes in free-form tool-calling agents. One example of each:
- Choosing nmap -sV for web directory enumeration instead of gobuster.
- Running gobuster without --wildcard on a soft-404 server and getting 10,000 false positives.
- Injecting nmap output raw into context, drowning the reasoning window.
- Running nuclei templates against a production target during business hours.
2.3 Why Existing Frameworks Don't Solve This
- LangChain / AutoGPT: Tool selection is entirely model-driven from schemas. No capability registry. No post-execution assessment.
- CrewAI: Multi-agent, but tool assignment is still schema-based. Adds coordination overhead without solving the capability gap.
- PentestGPT: No tool registry at all. The model is prompted with a list of tool names and expected to know how to use them.
- HackingBuddyGPT: Tools are Python classes with docstrings. Better than raw schemas, but still interface-level. No structured capability model.
- CAI: Introduces a router that selects among tool-calling agents. The router is a step toward structured assignment, but it routes based on agent descriptions, not a formal capability registry. No reassessment loop. We examine CAI in detail in Section 9.
- PentAG: Uses a 2-agent split (a reasoning agent and an acting agent) to separate planning from execution. This addresses context pollution but not capability awareness. The reasoning agent still selects tools from schemas. We examine PentAG in detail in Section 10.
None of them build a structured capability registry, match task requirements against it with a deterministic algorithm, and reassess results with reassignment authority. This is the gap the Custom Orchestrator fills.
3. Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│                         CUSTOM ORCHESTRATOR                          │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │ PHASE 1: TOOL UNDERSTANDING                                    │  │
│  │                                                                │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │  │
│  │  │ Capability   │  │ Capability   │  │ Capability   │          │  │
│  │  │ Map: nmap    │  │ Map: gobuster│  │ Map: sqlmap  │  ...     │  │
│  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │  │
│  │         │                 │                 │                  │  │
│  │         └────────┬────────┴────────┬────────┘                  │  │
│  │                  ▼                 ▼                           │  │
│  │         ┌────────────────────────────────┐                     │  │
│  │         │      CAPABILITY REGISTRY       │                     │  │
│  │         │ (indexed by capability tags,   │                     │  │
│  │         │  input types, output types,    │                     │  │
│  │         │  side-effect classes)          │                     │  │
│  │         └────────────────┬───────────────┘                     │  │
│  └──────────────────────────┼─────────────────────────────────────┘  │
│                             │                                        │
│  ┌──────────────────────────▼─────────────────────────────────────┐  │
│  │ PHASE 2: TASK ASSIGNMENT                                       │  │
│  │                                                                │  │
│  │  ┌──────────────┐    ┌───────────────┐    ┌──────────────┐     │  │
│  │  │ Task Spec    │──► │ Matching      │──► │ Tool/Chain   │     │  │
│  │  │ (requirements│    │ Algorithm     │    │ Selection    │     │  │
│  │  │  + context)  │    │ (scored)      │    │              │     │  │
│  │  └──────────────┘    └───────────────┘    └──────┬───────┘     │  │
│  └──────────────────────────────────────────────────┼─────────────┘  │
│                                                     │                │
│  ┌──────────────────────────────────────────────────▼─────────────┐  │
│  │ AGENT EXECUTION + TRI-CON COMPRESSION                          │  │
│  │                                                                │  │
│  │  ┌──────────┐    ┌─────────┐    ┌──────────────────────────┐   │  │
│  │  │ Context  │──► │ Agent   │──► │ Tool Output              │   │  │
│  │  │ Injection│    │ Execute │    │ → L1 → L2 → L3 (Tri-Con) │   │  │
│  │  │ Protocol │    │         │    │ → Capability Post-cond   │   │  │
│  │  └──────────┘    └─────────┘    └────────────┬─────────────┘   │  │
│  └──────────────────────────────────────────────┼─────────────────┘  │
│                                                 │                    │
│  ┌──────────────────────────────────────────────▼─────────────────┐  │
│  │ RESULT ASSESSMENT & REASSIGNMENT                               │  │
│  │                                                                │  │
│  │  ┌────────────┐    ┌───────────────┐    ┌──────────────────┐   │  │
│  │  │ Result     │──► │ Assessment    │──► │ ACCEPT / REASSIGN│   │  │
│  │  │ + Post-cond│    │ (vs success   │    │ / RETRY / ABORT  │   │  │
│  │  │            │    │  criteria)    │    │                  │   │  │
│  │  └────────────┘    └───────────────┘    └────────┬─────────┘   │  │
│  └──────────────────────────────────────────────────┼─────────────┘  │
│                                                     │                │
│                                                     ▼                │
│                                      [reassign → back to Phase 2]    │
│                                      or [advance to next task]       │
└──────────────────────────────────────────────────────────────────────┘
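The orchestrator's lifecycle for each task within an engagement follows the diagram from top to bottom:
1. Express the task as a task spec (requirements plus context).
2. Score candidate tools against the capability registry and select a tool or chain.
3. Inject context and let the assigned agent execute.
4. Compress the tool output through Tri-Con (L1, L2, L3) and record the capability post-conditions.
5. Assess the result against the success criteria, then accept, retry, reassign (back to Phase 2), or abort.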
The remainder of this whitepaper details each component.
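The per-task lifecycle shown in the architecture diagram can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the orchestrator's implementation: `run_task`, `Verdict`, and the `execute` / `assess` callables are hypothetical names, and Phase 2 matching is reduced to a pre-ranked candidate list.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    REASSIGN = "reassign"
    RETRY = "retry"
    ABORT = "abort"


def run_task(candidates, execute, assess, max_retries=1):
    """Walk ranked candidates until one result is accepted.

    candidates: tool ids, best first (stand-in for the Phase 2 output).
    execute(tool_id) -> result; assess(result) -> Verdict.
    Returns (tool_id, result) on ACCEPT, or None if aborted or exhausted.
    """
    for tool_id in candidates:
        retries = 0
        while True:
            result = execute(tool_id)
            verdict = assess(result)
            if verdict is Verdict.ACCEPT:
                return tool_id, result
            if verdict is Verdict.ABORT:
                return None
            if verdict is Verdict.RETRY and retries < max_retries:
                retries += 1
                continue
            break  # REASSIGN, or retries exhausted: move to the next candidate
    return None


# Toy run with made-up results: gobuster finds nothing, feroxbuster succeeds.
def fake_execute(tool_id):
    return {"tool": tool_id, "dirs_found": 0 if tool_id == "gobuster" else 7}


def fake_assess(result):
    return Verdict.ACCEPT if result["dirs_found"] >= 5 else Verdict.REASSIGN


outcome = run_task(["gobuster", "feroxbuster"], fake_execute, fake_assess)
print(outcome)
```

Keeping the verdict as an explicit enum means every exit from the loop corresponds to one of the four outcomes named in the diagram, which makes the reassignment path easy to log and audit.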
4. Phase 1 — Tool Understanding
4.1 The Capability Map Schema
A capability map is a structured YAML document that describes a tool's full operational profile — not just its interface, but its semantics, constraints, side effects, and relationships to other tools. Every tool in the orchestrator's registry has a capability map. The maps are authored by security engineers who know the tools operationally, not auto-generated from function signatures.
# capability_maps/nmap.yaml
tool_id: nmap
version: "7.94"   # quoted so YAML keeps it a string (7.10 would otherwise load as 7.1)
category: network_reconnaissance
vendor: nmap.org

interface:
  exec: "nmap {{flags}} {{target}}"
  parameters:
    - name: target
      type: string
      required: true
      validation: "ip_or_cidr"
    - name: flags
      type: string
      required: false
      default: "-sV -sC"

capabilities:
  - tag: port_scan
    description: "Identify open TCP/UDP ports on a target"
    proficiency: expert
  - tag: service_version_detection
    description: "Identify service versions via probe matching"
    proficiency: expert
  - tag: os_fingerprinting
    description: "Identify target OS via TCP/IP stack fingerprinting"
    proficiency: expert
  - tag: script_scan
    description: "Run NSE scripts for vulnerability/enum/info gathering"
    proficiency: expert
  - tag: directory_enum   # taxonomy tag (Section 4.2)
    description: "Not supported; use gobuster/feroxbuster"
    proficiency: none

input_requirements:
  - type: target_ip_or_range
    required: true
  - type: prior_context
    required: false
    description: "Known open ports from prior scans enable targeted scanning"

output_characteristics:
  format: structured_text
  volume: medium_to_high
  token_estimate_per_target: "2000-40000"
  structure: |
    Port table (port/protocol/state/service/version) + NSE script output blocks
  parsing_difficulty: moderate
  tri_con_strategy: L1_structured_extract
  tri_con_l1_template: "nmap_l1_extract"

side_effects:
  traffic_profile: moderate_to_high
  noise_level: detectable
  ids_visibility: high_if_aggressive
  rate_limit_risk: moderate
  destructive: false
  scope_relevant: true

post_conditions:
  - type: open_ports_known
    description: "Target's open ports and services are now in the finding graph"
  - type: service_versions_known
    description: "Service versions available for CVE matching"
    depends_on_flags: ["-sV"]

data_flow:
  downstream_consumers:
    - tool: searchsploit
      input: "service versions"
      transform: "extract version strings → query exploit-db"
    - tool: nikto
      input: "HTTP port identified"
      transform: "if port 80/443 open, enable nikto"
    - tool: enum4linux
      input: "SMB port identified"
      transform: "if port 445 open, enable enum4linux"
  upstream_producers: []

alternatives:
  - tool: masscan
    tradeoff: "faster port scan, no service detection; use for large ranges"
  - tool: rustscan
    tradeoff: "faster, modern, delegates to nmap for service detection"

error_modes:
  - trigger: "host unreachable"
    detection: "Host seems down"
    recovery: "retry with -Pn (skip host discovery)"
  - trigger: "firewall blocking probes"
    detection: "all ports filtered"
    recovery: "try different scan type (-sS, -sA) or alternative tool"

success_indicators:
  - "At least one open port reported"
  - "Service versions reported for open ports (if -sV)"
  - "No 'host seems down' without -Pn retry"

safety:
  max_concurrent_targets: 5
  cooldown_seconds: 30
  requires_scope_check: true
  destructive_patterns: []
Beyond the identifying header fields (tool_id, version, category, vendor), the schema has eleven top-level sections, each capturing a dimension that the function schema omits:
| Section | Captures | Why It Matters |
| --- | --- | --- |
| interface | Exec template, parameters, validation | Basis for construction; superset of function schema |
| capabilities | Semantic capability tags with proficiency levels | What the tool does, not just what it's called |
| input_requirements | Required and optional input types | Determines whether the tool can be invoked given current state |
| output_characteristics | Format, volume, token estimate, Tri-Con strategy | Drives context compression; prevents context drowning |
| side_effects | Traffic, noise, IDS visibility, destructiveness | Drives safety layer decisions |
| post_conditions | What the tool establishes in the world model | Drives task success assessment and downstream task enablement |
| data_flow | Upstream producers, downstream consumers | Enables tool chaining and dependency-aware scheduling |
| alternatives | Competing tools with tradeoffs | Enables reassignment when the primary tool fails or is suboptimal |
| error_modes | Known failure triggers, detection patterns, recovery | Drives error handling without model improvisation |
| success_indicators | Observable signals that the tool achieved its goal | Drives result assessment (Section 7) |
| safety | Concurrency, cooldown, scope, destructive patterns | Enforced by safety layer |
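One way to hold a capability map in memory is a small typed wrapper over the parsed YAML. The sketch below is an assumption about how such a wrapper could look, not the whitepaper's `CapabilityMap` class: it models only a subset of the sections, takes an already-parsed dict rather than a file path, and the `from_dict` and `proficiency_for` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    tag: str
    proficiency: str
    description: str = ""


@dataclass
class CapabilityMap:
    """Typed view over a subset of the capability map sections."""
    tool_id: str
    category: str
    capabilities: list = field(default_factory=list)
    side_effects: dict = field(default_factory=dict)
    post_conditions: list = field(default_factory=list)  # post-condition types
    alternatives: list = field(default_factory=list)     # alternative tool ids

    @classmethod
    def from_dict(cls, raw):
        """Build a map from an already-parsed YAML document (a dict)."""
        return cls(
            tool_id=raw["tool_id"],
            category=raw["category"],
            capabilities=[Capability(**c) for c in raw.get("capabilities", [])],
            side_effects=dict(raw.get("side_effects", {})),
            post_conditions=[p["type"] for p in raw.get("post_conditions", [])],
            alternatives=[a["tool"] for a in raw.get("alternatives", [])],
        )

    def proficiency_for(self, tag):
        """Return the declared proficiency for a tag, or "none" if absent."""
        for cap in self.capabilities:
            if cap.tag == tag:
                return cap.proficiency
        return "none"


# A trimmed-down version of the nmap map, written as the dict a YAML parser returns.
raw = {
    "tool_id": "nmap",
    "category": "network_reconnaissance",
    "capabilities": [
        {"tag": "port_scan", "proficiency": "expert",
         "description": "Identify open TCP/UDP ports on a target"},
        {"tag": "directory_enum", "proficiency": "none"},
    ],
    "side_effects": {"noise_level": "detectable", "destructive": False},
    "post_conditions": [{"type": "open_ports_known"},
                        {"type": "service_versions_known"}],
    "alternatives": [{"tool": "masscan"}, {"tool": "rustscan"}],
}
nmap = CapabilityMap.from_dict(raw)
print(nmap.proficiency_for("port_scan"), nmap.alternatives)
```

Treating a missing tag the same as an explicit `none` keeps lookups total, so the matching step never has to special-case tools that simply do not mention a capability.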
4.2 Capability Tags and the Capability Taxonomy
Capability tags are drawn from a controlled vocabulary — the capability taxonomy — maintained alongside the orchestrator. The taxonomy is a hierarchical classification of security testing capabilities:
root
├── network_reconnaissance
│   ├── port_scan
│   ├── service_version_detection
│   ├── os_fingerprinting
│   └── network_mapping
├── web_enumeration
│   ├── directory_enum
│   ├── vhost_enum
│   ├── dns_enum
│   └── api_endpoint_enum
├── vulnerability_identification
│   ├── cve_matching
│   ├── web_vuln_scan
│   ├── config_audit
│   └── custom_exploit_check
├── exploitation
│   ├── web_exploit
│   ├── service_exploit
│   ├── credential_attack
│   └── privilege_escalation
├── post_exploitation
│   ├── lateral_movement
│   ├── data_exfiltration
│   └── persistence
└── reporting
    ├── finding_synthesis
    └── evidence_packaging
Each tool's capability map references tags from this taxonomy. The proficiency field (expert / proficient / capable / limited / none) allows the matching algorithm to distinguish between a tool that can do something well and one that can do it as a side effect. nmap is expert at port_scan but none at directory_enum. nuclei is expert at web_vuln_scan and limited at service_version_detection (it can detect some versions via templates, but this is not its primary purpose).
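Two checks follow directly from the taxonomy and the proficiency scale: whether a tag belongs to the controlled vocabulary, and whether a declared proficiency meets a required minimum. The sketch below is illustrative only; it encodes just two branches of the taxonomy, and `meets` and `unknown_tags` are hypothetical helper names.

```python
# Proficiency levels from weakest to strongest, as listed in the text.
PROFICIENCY_ORDER = ["none", "limited", "capable", "proficient", "expert"]

# Two branches of the taxonomy; the remaining branches are omitted for brevity.
TAXONOMY = {
    "network_reconnaissance": [
        "port_scan", "service_version_detection",
        "os_fingerprinting", "network_mapping",
    ],
    "web_enumeration": [
        "directory_enum", "vhost_enum", "dns_enum", "api_endpoint_enum",
    ],
}
KNOWN_TAGS = {tag for tags in TAXONOMY.values() for tag in tags}


def meets(proficiency, minimum):
    """True if `proficiency` is at least as strong as `minimum`."""
    return PROFICIENCY_ORDER.index(proficiency) >= PROFICIENCY_ORDER.index(minimum)


def unknown_tags(capabilities):
    """Return the tags in a {tag: proficiency} mapping that are not in the vocabulary."""
    return sorted(tag for tag in capabilities if tag not in KNOWN_TAGS)


print(meets("expert", "proficient"))   # nmap at port_scan vs. a "proficient" floor
print(meets("limited", "proficient"))  # nuclei at service_version_detection
print(unknown_tags({"port_scan": "expert", "web_directory_enum": "none"}))
```

A validator of this kind is what would catch a map that uses a near-miss tag such as `web_directory_enum` where the taxonomy defines `directory_enum`.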
4.3 The Capability Registry
At orchestrator startup (and whenever the tool registry changes), all capability maps are loaded and compiled into the capability registry — an in-memory index that supports efficient lookup by:
- Capability tag: "Which tools can do directory_enum?" → [gobuster, feroxbuster, ffuf, dirb]
- Input type: "Which tools consume service_versions?" → [searchsploit, metasploit]
- Output type: "Which tools produce open_ports?" → [nmap, masscan, rustscan]
- Side-effect class: "Which tools are low_noise?" → [dnsenum, theHarvester, ...]
- Category: "Which tools are in web_enumeration?" → [gobuster, feroxbuster, ffuf, ...]
The registry is implemented as an inverted index:
class CapabilityRegistry:
    """Inverted index over all loaded capability maps."""

    def __init__(self):
        self.by_tag = {}          # tag → [(tool_id, proficiency), ...]
        self.by_input = {}        # input_type → [tool_id, ...]
        self.by_output = {}       # output_type → [tool_id, ...]
        self.by_side_effect = {}  # side_effect_class → [tool_id, ...]
        self.by_category = {}     # category → [tool_id, ...]
        self.tools = {}           # tool_id → CapabilityMap

    def load(self, map_path):
        # CapabilityMap (defined elsewhere) is the parsed form of one YAML map.
        cmap = CapabilityMap.from_yaml(map_path)
        self.tools[cmap.tool_id] = cmap
        for cap in cmap.capabilities:
            if cap.proficiency != "none":
                self.by_tag.setdefault(cap.tag, []).append(
                    (cmap.tool_id, cap.proficiency)
                )
        for inp in cmap.input_requirements:
            self.by_input.setdefault(inp.type, []).append(cmap.tool_id)
        for post in cmap.post_conditions:
            self.by_output.setdefault(post.type, []).append(cmap.tool_id)
        # side_effects is a mapping (see nmap.yaml), not a list of entries.
        # Assumption: the class is derived from noise_level, e.g. "low" → "low_noise".
        noise = cmap.side_effects.get("noise_level")
        key = f"{noise}_noise" if noise else "unclassified"
        self.by_side_effect.setdefault(key, []).append(cmap.tool_id)
        self.by_category.setdefault(cmap.category, []).append(cmap.tool_id)

    def query(self, *, tag=None, input_type=None, output_type=None,
              side_effect_class=None, category=None):
        """Return tools matching all provided criteria."""
        candidate_sets = []
        if tag:
            candidate_sets.append({t for t, _ in self.by_tag.get(tag, [])})
        if input_type:
            candidate_sets.append(set(self.by_input.get(input_type, [])))
        if output_type:
            candidate_sets.append(set(self.by_output.get(output_type, [])))
        if side_effect_class:
            candidate_sets.append(
                set(self.by_side_effect.get(side_effect_class, []))
            )
        if category:
            candidate_sets.append(set(self.by_category.get(category, [])))
        if not candidate_sets:
            return []  # no criteria given: nothing to match against
        return list(set.intersection(*candidate_sets))
The registry is the orchestrator's model of its own capabilities. It is not a static lookup table; it is a queryable knowledge structure that the matching algorithm reasons over.
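To make the lookup behaviour concrete, the following self-contained sketch reproduces the intersection semantics of `query` over hand-filled indexes. The index contents come from the lookup examples in this section where available; the `moderate_noise` entry and the standalone `query` function are assumptions made for illustration.

```python
# Hand-filled inverted indexes: criterion name → value → set of tool ids.
INDEXES = {
    "tag": {
        "directory_enum": {"gobuster", "feroxbuster", "ffuf", "dirb"},
        "port_scan": {"nmap", "masscan", "rustscan"},
    },
    "category": {
        "web_enumeration": {"gobuster", "feroxbuster", "ffuf"},
    },
    "side_effect_class": {
        "low_noise": {"dnsenum", "theHarvester"},
        "moderate_noise": {"gobuster", "ffuf"},  # assumed classification
    },
}


def query(**criteria):
    """Return the sorted tool ids that satisfy every supplied criterion."""
    sets = [
        INDEXES[name].get(value, set())
        for name, value in criteria.items()
        if value is not None
    ]
    if not sets:
        return []  # mirrors the registry: no criteria, no candidates
    return sorted(set.intersection(*sets))


print(query(tag="directory_enum", category="web_enumeration"))
print(query(tag="directory_enum", side_effect_class="moderate_noise"))
print(query(tag="port_scan", category="web_enumeration"))
```

Because every criterion narrows the candidate set, adding a side-effect class to a query is a cheap way to exclude noisy tools before any scoring happens.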
4.4 Loading Protocol
Phase 1 loading follows a deterministic sequence:
1. Discovery. The orchestrator scans the capability_maps/ directory for .yaml files. Each file is one tool's map.
2. Validation. Each map is validated against the capability map JSON Schema (ensuring required fields are present, capability tags are from the controlled vocabulary, proficiency values are in the allowed set, etc.). Invalid maps are rejected with a structured error; the tool is registered as unavailable with the validation error as the reason.
3. Compilation. Valid maps are loaded into the CapabilityRegistry inverted index.
4. Cross-reference check. For each tool's data_flow.downstream_consumers, the orchestrator verifies that the referenced downstream tool exists in the registry. Orphaned references (downstream tool not loaded) are logged as warnings but do not block loading.
5. Conflict detection. If two tools have the same tool_id (version collision), the higher version wins and the lower is logged as superseded.
6. Registry freeze. The registry is marked as ready and becomes read-only for the duration of the engagement (unless a hot-reload is triggered by a tool plugin installation).
This protocol ensures that the orchestrator has a complete, validated, internally consistent model of its capabilities before it makes any task assignment. A free-form tool-calling agent has no such guarantee — it may discover at turn 47 that a tool it assumed was available is not installed, or that a tool's parameters have changed.
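The loading sequence can be sketched end to end in a few lines. This is a simplified illustration under several assumptions: maps arrive as already-parsed dicts (discovery and YAML parsing are skipped), validation is reduced to a required-field check rather than full JSON Schema validation, and `load_registry`, `version_key`, and `REQUIRED_FIELDS` are hypothetical names.

```python
from types import MappingProxyType

REQUIRED_FIELDS = ("tool_id", "version", "category", "capabilities")  # assumed subset


def version_key(version):
    """Turn "7.94" into (7, 94) so versions compare numerically."""
    return tuple(int(part) for part in str(version).split("."))


def load_registry(raw_maps):
    """raw_maps: capability maps already parsed from capability_maps/*.yaml."""
    tools, unavailable, warnings = {}, {}, []
    for raw in raw_maps:
        missing = [name for name in REQUIRED_FIELDS if name not in raw]
        if missing:  # validation failure: register as unavailable with the reason
            unavailable[raw.get("tool_id", "<unknown>")] = f"missing fields: {missing}"
            continue
        tool_id = raw["tool_id"]
        current = tools.get(tool_id)
        if current is not None:  # conflict detection: the higher version wins
            older, newer = sorted((current, raw),
                                  key=lambda m: version_key(m["version"]))
            warnings.append(
                f"{tool_id} {older['version']} superseded by {newer['version']}")
            raw = newer
        tools[tool_id] = raw  # compilation (here simply a dict entry)
    for tool_id, raw in tools.items():  # cross-reference check, warnings only
        for edge in raw.get("data_flow", {}).get("downstream_consumers", []):
            if edge["tool"] not in tools:
                warnings.append(
                    f"{tool_id}: downstream tool {edge['tool']} not loaded")
    # Registry freeze: a read-only view of the top-level mapping.
    return MappingProxyType(tools), unavailable, warnings


maps = [
    {"tool_id": "nmap", "version": "7.80",
     "category": "network_reconnaissance", "capabilities": []},
    {"tool_id": "nmap", "version": "7.94",
     "category": "network_reconnaissance", "capabilities": [],
     "data_flow": {"downstream_consumers": [{"tool": "searchsploit"}]}},
    {"tool_id": "broken"},
]
registry, unavailable, warnings = load_registry(maps)
print(registry["nmap"]["version"], unavailable, warnings)
```

`MappingProxyType` only blocks writes to the top-level mapping; a production registry would also need to protect the nested map contents if true immutability is required.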
5. Phase 2 — Task Assignment
5.1 Task Specification
Before the matching algorithm can run, the orchestrator must express what it needs done as a task specification — a structured description of the task's requirements, independent of any specific tool. Task specs are produced by the orchestrator's task decomposition step, which breaks engagement objectives into atomic, assignable units.
# Example task specification
task_id: T-017
task_type: web_directory_enum
objective: "Enumerate hidden directories on http://10.10.10.5"
phase: ENUM

requirements:
  capabilities:
    - tag: directory_enum
      min_proficiency: proficient
  input_available:
    - type: live_url
      value: "http://10.10.10.5"
    - type: web_server_tech
      value: "Apache 2.4.41"
      source: "finding F-003"
  output_expected:
    - type: directory_list
      format: "path + status_code pairs"
  constraints:
    max_noise: moderate
    max_duration_seconds: 120
    max_token_output: 10000
    scope_target: "10.10.10.5"

success_criteria:
  - "At least 5 directories discovered OR wordlist exhausted"
  - "No unhandled soft-404 false positives"
  - "Output indexed in Tri-Con L1"

context:
  prior_findings: ["F-001", "F-003", "F-008"]
  tri_con_retrieval:
    phase: ENUM
    query: "web enumeration 10.10.10.5 Apache directories"
    max_tokens: 2000
The task spec captures five things the matching algorithm needs:
1. Required capabilities: what the tool must be able to do, at what minimum proficiency.
2. Available inputs: what data is available to feed the tool (determines feasibility).
3. Expected outputs: what the task should produce (determines success assessment).
4. Constraints: noise, time, token, and scope limits (filters out tools that would violate them).
5. Success criteria: observable conditions that signal task completion (used by the assessment step).
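Success criteria are written as prose in the task spec, but the assessment step needs something it can evaluate. One possible encoding, sketched here as an assumption rather than the orchestrator's actual format, pairs each criterion with a predicate over a result record. The field names (`directories`, `wordlist_exhausted`, `soft_404_hits`, `l1_indexed`) are hypothetical, and a zero soft-404 count is used as a simple proxy for "no unhandled soft-404 false positives".

```python
def failed_criteria(criteria, result):
    """Return the names of the criteria that the result does not satisfy."""
    return [name for name, check in criteria if not check(result)]


# The three success criteria of task T-017 as (name, predicate) pairs.
T017_CRITERIA = [
    ("enough_dirs_or_exhausted",
     lambda r: len(r["directories"]) >= 5 or r["wordlist_exhausted"]),
    ("no_soft_404_noise", lambda r: r["soft_404_hits"] == 0),
    ("indexed_in_l1", lambda r: r["l1_indexed"]),
]

good = {
    "directories": ["/admin", "/backup", "/old", "/js", "/css"],
    "wordlist_exhausted": False,
    "soft_404_hits": 0,
    "l1_indexed": True,
}
noisy = {
    "directories": ["/a"],
    "wordlist_exhausted": False,
    "soft_404_hits": 9800,
    "l1_indexed": True,
}

print(failed_criteria(T017_CRITERIA, good))
print(failed_criteria(T017_CRITERIA, noisy))
```

Returning the list of failed criteria, rather than a single boolean, gives the reassignment logic something specific to act on, such as retrying with different flags when only the soft-404 check fails.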
5.2 The Matching Algorithm
The matching algorithm takes a task spec and the capability registry and produces a scored candidate list — tools ranked by their suitability for the task. The algorithm is a multi-factor weighted score, not a single heuristic. It considers capability match, input feasibility, constraint satisfaction, output alignment, and historical performance.
PROFICIENCY_WEIGHTS = {
    "expert": 1.0,
    "proficient": 0.8,
    "capable": 0.6,
    "limited": 0.3,
    "none": 0.0,
}

SCORE_WEIGHTS = {
    "capability": 0.35,
    "input_feasibility": 0.20,
    "constraint_fit": 0.20,
    "output_alignment": 0.15,
    "historical": 0.10,
}


def match_task(task_spec, registry, history):
    """
    Match a task specification against the capability registry.
    Returns a sorted list of (tool_id, score, rationale) tuples.
    """
    # Candidate pre-filter: only the first required capability tag is used here;
    # the remaining requirements are handled by the scoring below.
    candidates = registry.query(
        tag=task_spec.requirements.capabilities[0].tag,
        category=task_spec.phase_category,
    )

    scored = []
    for tool_id in candidates:
        cmap = registry.tools[tool_id]

        # 1. Capability score: does the tool have the required
        #    capabilities at the required proficiency?
        cap_score = _capability_score(
            task_spec.requirements.capabilities,
            cmap.capabilities,
        )

        # 2. Input feasibility: are the task's available inputs
        #    sufficient for the tool's input requirements?
        input_score = _input_feasibility(
            task_spec.requirements.input_available,
            cmap.input_requirements,
        )

        # 3. Constraint fit: does the tool's profile fit the
        #    task's constraints (noise, time, tokens, scope)?
        constraint_score = _constraint_fit(
            task_spec.requirements.constraints,
            cmap.side_effects,
            cmap.output_characteristics,
            cmap.safety,
        )

        # 4. Output alignment: does the tool produce output
        #    matching the task's expected output type?
        output_score = _output_alignment(
            task_spec.requirements.output_expected,
            cmap.post_conditions,
            cmap.output_characteristics,
        )

        # 5. Historical performance: how has this tool performed
        #    on similar tasks in past engagements?
        hist_score = history.lookup(
            tool_id=tool_id,
            task_type=task_spec.task_type,
        )

        # Weighted aggregate
        total = (
            SCORE_WEIGHTS["capability"] * cap_score
            + SCORE_WEIGHTS["input_feasibility"] * input_score
            + SCORE_WEIGHTS["constraint_fit"] * constraint_score
            + SCORE_WEIGHTS["output_alignment"] * output_score
            + SCORE_WEIGHTS["historical"] * hist_score
        )

        rationale = {
            "capability": cap_score,
            "input_feasibility": input_score,
            "constraint_fit": constraint_score,
            "output_alignment": output_score,
            "historical": hist_score,
        }

        scored.append((tool_id, total, rationale))

    scored.sort(key=lambda x: x[1], reverse=True)
    return scored


def _capability_score(required, available):
    """Score how well the tool's capabilities match requirements."""
    total = 0.0
    for req in required:
        best = 0.0
        for cap in available:
            if cap.tag == req.tag:
                prof = PROFICIENCY_WEIGHTS.get(cap.proficiency, 0.0)
                # Below the required minimum, the capability contributes nothing.
                if prof >= PROFICIENCY_WEIGHTS.get(req.min_proficiency, 0.0):
                    best = max(best, prof)
        total += best
    return total / len(required) if required else 0.0


def _input_feasibility(available_inputs, required_inputs):
    """Score whether the task has the inputs the tool needs."""
    available_types = {inp.type for inp in available_inputs}
    met = 0
    for req in required_inputs:
        if req.type in available_types:
            met += 1
        elif not req.required:
            met += 0.5  # optional, not currently available
    return met / len(required_inputs) if required_inputs else 1.0


def _constraint_fit(constraints, side_effects, output_chars, safety):
    """Score how well the tool fits within the task's constraints."""
    score = 1.0

    # Noise constraint. Levels outside this map (e.g. nmap's "detectable")
    # fall back to the "moderate" value of 0.5.
    noise_map = {"low": 0.2, "moderate": 0.5, "high": 0.8}
    tool_noise = noise_map.get(
        side_effects.get("noise_level", "moderate"), 0.5
    )
    max_noise = noise_map.get(constraints.get("max_noise", "high"), 0.8)
    if tool_noise > max_noise:
        score *= 0.3  # heavy penalty, not hard reject

    # Token output constraint, using the lower bound of the "low-high" estimate
    est_tokens = int(
        output_chars.get("token_estimate_per_target", "5000").split("-")[0]
    )
    if est_tokens > constraints.get("max_token_output", 50000):
        score *= 0.5

    # Cooldown / concurrency
    if safety.get("cooldown_seconds", 0) > constraints.get(
        "max_duration_seconds", 300
    ):
        score *= 0.4

    return score


def _output_alignment(expected_outputs, post_conditions, output_chars):
    """Score whether the tool produces the expected output type."""
    expected_types = {out.type for out in expected_outputs}
    produced_types = {pc.type for pc in post_conditions}
    if not expected_types:
        return 1.0
    overlap = len(expected_types & produced_types) / len(expected_types)
    return overlap
The algorithm is deterministic and auditable. For any task assignment, the orchestrator can produce the full scored candidate list with the rationale for each score component. This is critical for debugging ("why did the orchestrator choose gobuster over feroxbuster?") and for continuous improvement (adjusting weights based on observed outcomes).
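A worked example shows how the weights combine and what an audit trail might look like. The component scores below are invented for illustration (they are not measurements), and `weighted_total` and `rank` are hypothetical helper names; only the weights are taken from `SCORE_WEIGHTS` above.

```python
SCORE_WEIGHTS = {
    "capability": 0.35,
    "input_feasibility": 0.20,
    "constraint_fit": 0.20,
    "output_alignment": 0.15,
    "historical": 0.10,
}


def weighted_total(components):
    """Weighted sum of the five component scores."""
    return sum(SCORE_WEIGHTS[name] * components[name] for name in SCORE_WEIGHTS)


def rank(candidates):
    """Return [(tool_id, total)] sorted best first."""
    totals = [(tool_id, round(weighted_total(c), 4))
              for tool_id, c in candidates.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# Hypothetical component scores for task T-017 (illustrative values only).
candidates = {
    "gobuster": {
        "capability": 1.0, "input_feasibility": 1.0, "constraint_fit": 1.0,
        "output_alignment": 1.0, "historical": 0.6,
    },
    "feroxbuster": {
        "capability": 1.0, "input_feasibility": 1.0, "constraint_fit": 0.3,
        "output_alignment": 1.0, "historical": 0.9,
    },
}

for tool_id, total in rank(candidates):
    print(f"{tool_id}: {total:.2f}")
```

With these assumed inputs gobuster scores 0.96 and feroxbuster 0.85. The 0.14 lost to the constraint-fit penalty (0.20 × 0.7) outweighs the 0.03 gained from better history (0.10 × 0.3), which is exactly the kind of explanation the rationale dictionary is meant to support.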
5.3 Tool Chain Construction
Some tasks cannot be satisfied by a single tool. The orchestrator recognises this through the data_flow section of capability maps. If a task's expected output cannot be produced by any single tool, but can be produced by a chain (tool A produces an intermediate output that tool B consumes to produce the final output), the orchestrator constructs a tool chain.
The chain construction algorithm is a graph search over the capability registry's data-flow edges:
from collections import deque


def construct_chain(task_spec, registry):
    """
    If no single tool satisfies the task, attempt to construct a
    tool chain: tool1 → intermediate → tool2 → ... → output.
    Uses BFS over the data-flow graph.
    """
    target_output = task_spec.requirements.output_expected[0].type
    available_inputs = {inp.type
                        for inp in task_spec.requirements.input_available}

    # BFS: each node is (current_tools, current_outputs)
    queue = deque([([], available_inputs)])
    visited = set()

    while queue:
        tools_so_far, outputs_so_far = queue.popleft()
        state_key = (frozenset(tools_so_far), frozenset(outputs_so_far))
        if state_key in visited:
            continue
        visited.add(state_key)

        if target_output in outputs_so_far and tools_so_far:
            return tools_so_far  # chain found

        for tool_id, cmap in registry.tools.items():
            if tool_id in tools_so_far:
                continue
            # Can this tool run with current outputs as inputs?
            tool_input_types = {req.type for req in cmap.input_requirements
                                if req.required}
            if tool_input_types.issubset(outputs_so_far):
                # Union of what is already known and what this tool establishes
                new_outputs = outputs_so_far | {
                    pc.type for pc in cmap.post_conditions
                }
                queue.append((tools_so_far + [tool_id], new_outputs))

    return None  # no chain found
Example: a task requires cve_matches for a discovered service. No single tool produces this from a raw IP address. The chain is:
nmap (-sV) → [service_versions] → searchsploit → [cve_matches]
nmap produces service_versions; searchsploit consumes service_versions and produces cve_matches. The orchestrator constructs this chain automatically from the data-flow edges in the capability maps, assigns nmap first, and schedules searchsploit as a dependent task that triggers when nmap's post-condition service_versions_known is satisfied.
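The nmap → searchsploit example can be reproduced with a toy data-flow graph. The sketch below is a self-contained simplification of the search idea: tools are plain dicts with `needs` and `produces` sets, the type names (`target_ip`, `live_url`, and so on) are assumptions chosen so that producer and consumer types line up, and `find_chain` is a hypothetical name.

```python
from collections import deque

# Toy data-flow graph: what each tool needs and what it establishes.
TOOLS = {
    "nmap": {"needs": {"target_ip"},
             "produces": {"open_ports", "service_versions"}},
    "searchsploit": {"needs": {"service_versions"},
                     "produces": {"cve_matches"}},
    "gobuster": {"needs": {"live_url"},
                 "produces": {"directory_list"}},
}


def find_chain(target, available):
    """Breadth-first search for the shortest tool chain that yields `target`."""
    queue = deque([((), frozenset(available))])
    seen = set()
    while queue:
        chain, known = queue.popleft()
        if target in known and chain:
            return list(chain)
        key = (frozenset(chain), known)
        if key in seen:
            continue
        seen.add(key)
        for tool_id, spec in TOOLS.items():
            # A tool is runnable once all of its required inputs are known.
            if tool_id not in chain and spec["needs"] <= known:
                queue.append((chain + (tool_id,), known | spec["produces"]))
    return None


print(find_chain("cve_matches", {"target_ip"}))
print(find_chain("directory_list", {"target_ip"}))
```

Breadth-first order means the first chain found is also one of the shortest, which keeps traffic and token cost down when several chains could reach the same output. The second call returns `None` because nothing in the toy graph can turn a bare IP into a `live_url`.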
6. Context Injection Protocol
6.1 The Problem of Context Delivery
Once a tool is selected and an agent is assigned, the orchestrator must deliver context to the agent — the information the agent needs to execute the task correctly. This is not trivial. Too little context and the agent hallucinates parameters or misses prior findings; too much and the agent's reasoning window is consumed by context, leaving no room for deliberation. The context must also be phase-appropriate (recon context is different from exploitation context) and tool-appropriate (the context gobuster needs is different from what sqlmap needs).
6.2 The Injection Protocol
The Context Injection Protocol assembles the agent's context from five sources, each with a token budget:
┌─────────────────────────────────────────────────────────┐
│                  AGENT CONTEXT WINDOW                   │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │ 1. System Directive                 [~200 tokens] │  │
│  │    Phase, task objective, constraints             │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ 2. Tool Capability Brief            [~300 tokens] │  │
│  │    Selected tool's interface, tuning notes,       │  │
│  │    error modes, success indicators                │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ 3. Tri-Con L3 Working Set      [~500-2000 tokens] │  │
│  │    Phase-routed conceptual summaries of all       │  │
│  │    known findings relevant to this target         │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ 4. Cascaded L2/L1 Retrieval    [~500-2000 tokens] │  │
│  │    Detailed findings for the specific target/     │  │
│  │    service this task addresses (on-demand)        │  │
│  └───────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────┐  │
│  │ 5. Recent Turns Window         [~500-1000 tokens] │  │
│  │    Last K tool invocations and observations       │  │
│  │    relevant to this task                          │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│  Total: ~2000-5500 tokens (well within reasoning room)  │
└─────────────────────────────────────────────────────────┘
Each source is governed by a specific protocol:
1. System Directive. A templated prompt segment generated from the task spec. It tells the agent what phase it's in, what the task objective is, what constraints apply, and what success looks like. This is not a free-form system prompt; it is a structured directive with fields filled from the task spec:
SYSTEM: You are assigned task T-017 in the ENUMERATION phase.
Objective: Enumerate hidden directories on http://10.10.10.5
Target: 10.10.10.5 (in scope, confirmed)
Constraints: max_noise=moderate, max_duration=120s, max_token_output=10000
Success criteria: ≥5 directories discovered OR wordlist exhausted;
                  no unhandled soft-404 false positives.
2. Tool Capability Brief. A compressed extract of the selected tool's capability map, containing only the information the agent needs to execute correctly: the interface template, critical tuning notes (e.g., "use --wildcard if server returns 200 for non-existent paths"), known error modes and their recovery actions, and success indicators. This is the structured replacement for the function-schema description string. The brief is generated by a template that selects the most task-relevant sections of the capability map:
TOOL: gobuster (directory_enum, proficiency=expert)
EXEC: gobuster dir -u {{url}} -w {{wordlist}} {{extra_flags}}
TUNING: Use --wildcard if server returns 200 for missing paths.
        Use -s 200,204,301,302,307,401,403 to filter status codes.
        Recommended wordlist for Apache: directory-list-2.3-medium.txt
ERRORS: - "too many open files" → reduce threads (-t 50)
        - "timeout" → increase timeout (--timeout 10s)
SUCCESS: ≥1 directory found; no wildcard false-positive storm.
3. Tri-Con L3 Working Set. The orchestrator queries the Tri-Con index (WP01) for L3 conceptual summaries relevant to the task's phase and target. L3 entries are 50–100 tokens each — one-line summaries of what is known. The orchestrator retrieves the top N by relevance (phase-routed, target-filtered) up to the token budget. This gives the agent a panoramic awareness of the engagement state without flooding its context:
L3 KNOWLEDGE (ENUM phase, target 10.10.10.5):
F-001: Host 10.10.10.5 alive, Linux, TTL 64
F-003: HTTP port 80 open, Apache 2.4.41
F-008: SSH port 22 open, OpenSSH 8.9
F-012: No HTTPS (port 443 closed)
F-015: robots.txt lists /admin, /backup, /old
4. Cascaded L2/L1 Retrieval. For the specific target or service this task addresses, the orchestrator may pull deeper detail — L2 contextual groupings (~200–500 tokens each) or L1 structured extracts (~500–2000 tokens each). This is on-demand: the orchestrator retrieves deeper levels only for findings that are directly relevant to the task's target. If the task is directory enumeration on 10.10.10.5:80, the orchestrator pulls the L1 extract of the prior nmap scan for that host (to confirm the HTTP port and server version), but does not pull L1 extracts of DNS enumeration on a different host.
5. Recent Turns Window. The last K tool invocations and their compressed observations, filtered to those relevant to the current task's target. This gives the agent short-term memory without re-injecting the full turn history. K is typically 3–5 for focused tasks.
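The Recent Turns Window can be illustrated with a minimal sketch. The `Turn` record and the `recent_turns_window` helper are hypothetical names introduced here for illustration, not part of the orchestrator's actual code; the sketch only assumes that each turn records the target it touched.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One tool invocation and its compressed observation (hypothetical shape)."""
    target: str
    tool: str
    observation: str


def recent_turns_window(history, target, k=3):
    """Return the last k turns that touched `target`, oldest first."""
    relevant = [t for t in history if t.target == target]
    # Guard k <= 0 explicitly: relevant[-0:] would return the whole list.
    return relevant[-k:] if k > 0 else []


history = [
    Turn("10.10.10.5", "nmap", "ports 22, 80 open"),
    Turn("10.10.10.9", "dnsenum", "no zone transfer"),
    Turn("10.10.10.5", "curl", "robots.txt lists /admin, /backup, /old"),
    Turn("10.10.10.5", "whatweb", "Apache 2.4.41"),
    Turn("10.10.10.5", "gobuster", "found /admin (301)"),
]

window = recent_turns_window(history, "10.10.10.5", k=3)
print([t.tool for t in window])
```

Filtering by target before slicing matters: the dnsenum turn on a different host is never counted toward K, so the window always holds the K most recent turns that are relevant to the task.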
6.3 Token Budget Enforcement
The injection protocol enforces a strict token budget per source. If a source would exceed its budget, the protocol applies a compression strategy:
| Source | Budget Exceeded | Compression Strategy |
| --- | --- | --- |
| System Directive | Never (templated, fixed size) | n/a |
| Tool Capability Brief | Rarely (template selects relevant sections) | Drop non-critical error modes |
| Tri-Con L3 | Reduce N (fewer L3 entries) | Re-rank by relevance, keep top N' |
| L2/L1 Retrieval | Drop to lower granularity (L2 instead of L1) | Cascaded reduction per WP01 |
| Recent Turns | Reduce K | Keep most recent K' turns |
This ensures that the agent's total context stays within ~2000–5500 tokens, leaving the majority of the context window for reasoning and tool-call generation.
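As an illustration of the "re-rank by relevance, keep top N'" strategy for the Tri-Con L3 source, the following sketch trims a list of L3 entries to a token budget. It is a simplified stand-in under stated assumptions: `estimate_tokens` is a crude whitespace-based estimate rather than a real tokenizer, and both function names and the relevance values are hypothetical.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_l3_to_budget(entries, budget):
    """Keep the highest-relevance L3 entries whose combined size fits `budget`.

    `entries` is a list of (relevance, text) pairs. Returns the kept texts,
    most relevant first.
    """
    kept, used = [], 0
    for _relevance, text in sorted(entries, key=lambda e: e[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # reduce N: stop at the first entry that would overflow
        kept.append(text)
        used += cost
    return kept


l3_entries = [
    (0.5, "F-001: Host 10.10.10.5 alive, Linux, TTL 64"),
    (0.9, "F-003: HTTP port 80 open, Apache 2.4.41"),
    (0.2, "F-008: SSH port 22 open, OpenSSH 8.9"),
    (0.8, "F-015: robots.txt lists /admin, /backup, /old"),
]

for line in fit_l3_to_budget(l3_entries, budget=15):
    print(line)
```

Stopping at the first overflowing entry, instead of skipping it and trying smaller ones, keeps the result a strict prefix of the relevance ranking, which is what "keep top N'" implies.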
7. Result Assessment and Reassignment
7.1 Assessment Logic
After the assigned agent completes (or fails), the orchestrator does not simply accept the output and move on. It assesses the result against the task's success criteria using a structured evaluation:
def assess_result(task_spec, agent_output, finding_graph, tri_con):
    """
    Assess whether the agent's output satisfies the task's success criteria.
    Returns an AssessmentResult.
    """
    result = AssessmentResult(task_id=task_spec.task_id)

    # 1. Check explicit success criteria
    for criterion in task_spec.success_criteria:
        status = _evaluate_criterion(criterion, agent_output, finding_graph, tri_con)
        result.add_criterion(criterion, status)

    # 2. Check post-conditions of the selected tool
    #    (`registry` is the capability registry built in Phase 1; it is not
    #    passed in, so it is assumed to be available at module level)
    tool = task_spec.assigned_tool
    cmap = registry.tools[tool]
    for post_cond in cmap.post_conditions:
        if not _check_post_condition(post_cond, finding_graph):
            result.add_post_condition_gap(post_cond)

    # 3. Check for error-mode signatures in output
    for error_mode in cmap.error_modes:
        if error_mode.detection in agent_output.raw:
            result.add_error(error_mode)

    # 4. Classify overall outcome
    if result.all_criteria_met and not result.has_errors:
        result.outcome = "SUCCESS"
    elif result.has_errors and not result.all_criteria_met:
        result.outcome = "FAILURE"
    elif result.all_criteria_met and result.has_errors:
        result.outcome = "PARTIAL_SUCCESS"
    else:
        result.outcome = "INSUFFICIENT"

    # 5. Determine recommended action
    result.recommended_action = _decide_action(result, task_spec)

    return result
def _decide_action(result, task_spec):
    """
    Given the assessment, decide what to do next.
    """
    if result.outcome == "SUCCESS":
        return "ACCEPT"

    if result.outcome == "FAILURE":
        # Retry only if an error mode has a recovery action and the
        # per-tool retry limit (Section 7.3) has not been reached
        if task_spec.retry_count < 2 and any(e.recovery for e in result.errors):
            return "RETRY_WITH_RECOVERY"
        # No recovery left: try an alternative tool
        if task_spec.alternatives_available:
            return "REASSIGN"
        # No alternatives: escalate to human review (the task lifecycle in
        # Section 8 has no aborted state)
        return "ESCALATE"

    if result.outcome == "PARTIAL_SUCCESS":
        # Met criteria but had errors: accept with warning
        return "ACCEPT_WITH_WARNING"

    if result.outcome == "INSUFFICIENT":
        # Criteria not met but no errors: try harder or reassign
        if task_spec.retry_count < 2:
            return "RETRY_WITH_MODIFIED_PARAMS"
        if task_spec.alternatives_available:
            return "REASSIGN"
        return "ESCALATE"

    return "ESCALATE"
The assessment classifies the outcome into one of four categories:
| Outcome | Criteria Met | Errors Detected | Meaning |
| --- | --- | --- | --- |
| SUCCESS | All | None | Task complete, output is valid |
| PARTIAL_SUCCESS | All | Some | Task complete but with non-fatal errors (e.g., tool emitted warnings) |
| INSUFFICIENT | Not all | None | Tool ran but didn't produce expected results (e.g., gobuster found 2 directories, criterion requires ≥5) |
| FAILURE | Not all | Some | Tool errored (e.g., nmap reported "host seems down") |
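The four-way classification in the table reduces to a function of two booleans. The sketch below restates it in isolation so the mapping can be checked exhaustively; `classify_outcome` is a hypothetical helper name, not a function from the orchestrator.

```python
def classify_outcome(all_criteria_met: bool, has_errors: bool) -> str:
    """Map (criteria met, errors detected) to one of the four outcomes."""
    if all_criteria_met:
        return "PARTIAL_SUCCESS" if has_errors else "SUCCESS"
    return "FAILURE" if has_errors else "INSUFFICIENT"


# Enumerate all four input combinations.
for met in (True, False):
    for errors in (False, True):
        print(f"criteria_met={met!s:<5} errors={errors!s:<5} -> "
              f"{classify_outcome(met, errors)}")
```

Because the two inputs are independent, every combination lands in exactly one category, which is what lets the state machine in Section 8 treat the outcome as a total function.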
7.2 Reassignment Logic
When the assessment recommends REASSIGN, the orchestrator returns the task to Phase 2 with updated context that includes:
The original task spec.
The failed tool's identity and the failure reason.
The failed tool's output (compressed via Tri-Con L1) — this may contain partial results that inform the alternative tool's parameters.
The list of alternatives from the failed tool's capability map.
The matching algorithm then re-runs with the constraint that the failed tool is excluded from candidates:
def reassign(task_spec, failed_assessment, registry, history):
    """
    Reassign a task after a failed attempt.
    """
    # Enrich task spec with failure context
    task_spec.retry_context = {
        "failed_tool": failed_assessment.assigned_tool,
        "failure_reason": failed_assessment.failure_reason,
        "partial_results": failed_assessment.partial_results,
    }

    # Exclude failed tool from candidates
    candidates = match_task(task_spec, registry, history)
    candidates = [
        (tid, score, rat)
        for tid, score, rat in candidates
        if tid != failed_assessment.assigned_tool
    ]

    if not candidates:
        return None  # no alternatives: escalate

    # Select best alternative
    best_tool, score, rationale = candidates[0]

    # Adjust task spec with partial results if available
    if failed_assessment.partial_results:
        task_spec.requirements.input_available.extend(
            failed_assessment.partial_results
        )

    return best_tool
Example: the orchestrator assigns gobuster for directory enumeration. gobuster runs but produces 10,000 false positives because the server exhibits soft-404 behaviour (it returns 200 for every path). The assessment detects this through the error-mode signature "high percentage of 200 responses" and through the failed success criterion "no unhandled soft-404 false positives". The orchestrator reassigns to feroxbuster, whose capability map documents soft-404 detection. The reassignment context includes the observation that the server returns 200 for non-existent paths, so feroxbuster is invoked with its automatic wildcard-response filtering left enabled (that is, without --dont-filter). The --auto-bail flag, which stops a scan when errors accumulate, can be added as a safeguard, but it does not itself handle soft-404s.
7.3 Reassignment Limits
To prevent infinite reassignment loops, the orchestrator enforces:
Max reassignments per task: 2. After 2 reassignments, the task is escalated to human review.
Max retries per tool: 2. After 2 retries with modified parameters, the tool is marked as exhausted and the task moves to reassignment.
Escalation path: RETRY → REASSIGN → ESCALATE. The orchestrator never silently loops; every escalation is logged with the full assessment chain.
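The limits above bound the worst case for a single task. The following sketch simulates a task that never succeeds, under the assumption (my reading of "max retries per tool") that each newly assigned tool gets a fresh retry budget. `TaskCounters` and `next_step` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

MAX_RETRIES_PER_TOOL = 2
MAX_REASSIGNMENTS_PER_TASK = 2


@dataclass
class TaskCounters:
    retries_on_current_tool: int = 0
    reassignments: int = 0


def next_step(counters, alternatives_available):
    """Advance along RETRY -> REASSIGN -> ESCALATE, updating counters in place."""
    if counters.retries_on_current_tool < MAX_RETRIES_PER_TOOL:
        counters.retries_on_current_tool += 1
        return "RETRY"
    if alternatives_available and counters.reassignments < MAX_REASSIGNMENTS_PER_TASK:
        counters.reassignments += 1
        counters.retries_on_current_tool = 0  # fresh retry budget for the new tool
        return "REASSIGN"
    return "ESCALATE"


# Simulate a task whose every attempt is assessed as unsuccessful.
counters = TaskCounters()
steps = []
while True:
    step = next_step(counters, alternatives_available=True)
    steps.append(step)
    if step == "ESCALATE":
        break

print(steps)
```

Under these assumptions a task passes through at most three tools, each with two retries, before a human is asked to intervene, so the loop always terminates.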
8. The Orchestrator State Machine
The orchestrator's task-assignment lifecycle is governed by a state machine that enforces deterministic transitions between assessment outcomes. This is not the engagement-phase state machine (RECON → ENUM → VULN → ... — that is the Phase Map, WP04); it is the task lifecycle state machine that governs how a single task flows through assignment, execution, assessment, and resolution.
┌─────────────────┐
│     PENDING     │
│  (task queued,  │
│    awaiting     │
│   assignment)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ASSIGNED     │
│ (tool selected, │
│  agent context  │
│     built)      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    EXECUTING    │
│ (agent running  │
│ selected tool)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    ASSESSING    │
│   (evaluating   │
│    result vs    │
│    criteria)    │
└────────┬────────┘
         │
    ┌────┴─────┬──────────────┐
    │          │              │
    ▼          ▼              ▼
┌───────┐  ┌───────┐  ┌──────────────┐
│SUCCESS│  │PARTIAL│  │ INSUFFICIENT │
└───┬───┘  └───┬───┘  │  or FAILURE  │
    │          │      └───────┬──────┘
    │          │              │
    │          │        retry < max? ──yes──► EXECUTING (retry)
    │          │              │ no
    │          │              ▼
    │          │      ┌──────────────┐
    │          │      │   REASSIGN   │──yes──► ASSIGNED (with alt tool)
    │          │      │  possible?   │
    │          │      └───────┬──────┘
    │          │              │ no
    │          │              ▼
    │          │      ┌──────────────┐
    │          │      │  ESCALATED   │
    │          │      │   (human)    │
    │          │      └──────────────┘
    │          │
    ▼          ▼
┌──────────────────┐
│    COMPLETED     │
│    (finding      │
│    persisted,    │
│    downstream    │
│  tasks enabled)  │
└──────────────────┘
State definitions:
| State | Entry Condition | Exit Condition | Allowed Actions |
| --- | --- | --- | --- |
| PENDING | Task created by decomposition | Capability registry ready | None (waiting) |
| ASSIGNED | Matching algorithm selected a tool | Context injection complete | Build agent context |
| EXECUTING | Agent invoked, tool running | Tool returns or times out | Monitor, enforce safety constraints |
| ASSESSING | Tool output received, Tri-Con indexed | Assessment complete | Run assessment logic |
| COMPLETED | Assessment outcome = SUCCESS or PARTIAL_SUCCESS | Downstream tasks triggered | Persist findings, enable dependents |
| ESCALATED | Reassignment exhausted or no candidates | Human review | Log full assessment chain |
The state machine guarantees:
No task is ever left in an indeterminate state. Every task is either pending, assigned, executing, assessing, completed, or escalated — at all times.
No silent failures. A tool error triggers assessment, not a skip. Assessment determines the recovery path.
Bounded retries. The state machine enforces max-retry and max-reassignment limits; there is no infinite loop path.
Downstream enablement. A task only transitions to COMPLETED when its post-conditions are verified in the finding graph. Downstream tasks that depend on those post-conditions are not enabled until this verification succeeds.
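One way to make these guarantees checkable is to encode the lifecycle as an explicit transition table and reject anything outside it. The sketch below is an illustration derived from the state definitions above, not the orchestrator's actual implementation; the `TRANSITIONS` table and `transition` function are hypothetical names.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    ASSIGNED = "ASSIGNED"
    EXECUTING = "EXECUTING"
    ASSESSING = "ASSESSING"
    COMPLETED = "COMPLETED"
    ESCALATED = "ESCALATED"


# Allowed transitions. ASSESSING fans out to completion, retry (EXECUTING),
# reassignment (ASSIGNED), or escalation; the two terminal states have no exits.
TRANSITIONS = {
    State.PENDING: {State.ASSIGNED},
    State.ASSIGNED: {State.EXECUTING},
    State.EXECUTING: {State.ASSESSING},
    State.ASSESSING: {State.COMPLETED, State.EXECUTING,
                      State.ASSIGNED, State.ESCALATED},
    State.COMPLETED: set(),
    State.ESCALATED: set(),
}


def transition(current, target):
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


# Happy path: a task that succeeds on the first attempt.
state = State.PENDING
for nxt in (State.ASSIGNED, State.EXECUTING, State.ASSESSING, State.COMPLETED):
    state = transition(state, nxt)
print(state.name)
```

Because every state appears as a key and every move is validated, a task cannot be left in an undefined state, and skipping assessment (for example PENDING straight to COMPLETED) raises an error instead of failing silently.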
9. Error Handling
9.1 Error Taxonomy
The orchestrator classifies errors into six categories, each with a defined handling strategy. The classification is driven by the error_modes section of the capability map (for tool-specific errors) and by the orchestrator's own internal error detectors (for system-level errors).
| Error Class | Source | Example | Handling Strategy |
| --- | --- | --- | --- |
| TOOL_EXECUTION_ERROR | Tool's error_modes | nmap: "Host seems down" | Apply recovery action from capability map (retry with -Pn) |
| TOOL_TIMEOUT | Orchestrator | gobuster exceeds 120s budget | Kill process, assess partial output, reassign with smaller wordlist |
| SAFETY_VIOLATION | Safety layer | Agent attempted out-of-scope target | Block action, inject correction into agent context, continue |
| CONTEXT_INJECTION_FAILURE | Injection protocol | Tri-Con index query returns no results | Fall back to empty knowledge context, log warning, proceed |
| CAPABILITY_REGISTRY_ERROR | Registry | Tool's capability map is invalid | Mark tool as unavailable, exclude from matching, log |
| AGENT_FAILURE | Agent runtime | Agent process crashed / API error | Retry agent once; if repeated, reassign to alternative tool or escalate |
9.2 Error Handling Protocol
def handle_error(error, task_spec, orchestrator):
    """
    Centralised error handler. Classifies the error and dispatches
    to the appropriate recovery strategy.
    """
    # `class` is a reserved word in Python, so the attribute is error_class
    strategy = ERROR_STRATEGIES[error.error_class]

    if strategy == "RECOVER_FROM_CAPABILITY_MAP":
        # Tool-specific recovery (e.g., nmap -Pn)
        cmap = registry.tools[error.tool_id]
        mode = _match_error_mode(error, cmap.error_modes)
        if mode and mode.recovery:
            task_spec.add_param_override(mode.recovery)
            return "RETRY"
        return "REASSIGN"

    if strategy == "KILL_AND_REASSESS":
        # Timeout: kill the tool, assess partial output
        orchestrator.kill_agent(task_spec.agent_id)
        partial = orchestrator.collect_partial_output(task_spec)
        # Remaining arguments are elided in the original
        assessment = assess_result(task_spec, partial, ...)
        if assessment.outcome in ("SUCCESS", "PARTIAL_SUCCESS"):
            return "ACCEPT_PARTIAL"
        task_spec.add_constraint(
            "max_duration_seconds",
            task_spec.constraints["max_duration_seconds"] // 2,
        )
        return "REASSIGN"

    if strategy == "BLOCK_AND_CORRECT":
        # Safety violation: inject correction, continue
        correction = (
            f"BLOCKED: {error.reason}. Target {error.target} "
            f"is out of scope. Do not attempt again."
        )
        orchestrator.inject_into_agent(task_spec.agent_id, correction)
        return "CONTINUE"

    if strategy == "FALLBACK_AND_PROCEED":
        # Context injection failure: proceed with reduced context
        orchestrator.log_warning(
            f"No Tri-Con results for task {task_spec.task_id}; "
            "proceeding with empty knowledge context."
        )
        return "CONTINUE"

    if strategy == "QUARANTINE_TOOL":
        # Registry error: exclude tool from future matching
        registry.quarantine(error.tool_id, reason=error.detail)
        return "REASSIGN"

    if strategy == "RETRY_OR_ESCALATE":
        # Agent failure: retry once, then escalate
        if task_spec.agent_retry_count < 1:
            task_spec.agent_retry_count += 1
            return "RETRY_AGENT"
        return "ESCALATE"

    return "ESCALATE"
The key design principle is that error handling is deterministic, not model-driven. The agent does not improvise a recovery; the orchestrator applies a pre-defined recovery strategy from the capability map or the error-handling protocol. This prevents the common failure mode where a tool error causes the agent to hallucinate a "fix" that makes things worse (e.g., responding to a timeout by running the same command with more aggressive flags).
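The handler above looks up its strategy in an `ERROR_STRATEGIES` table that the paper does not spell out. A plausible reconstruction, pairing each error class from Section 9.1 with the strategy whose code comment describes it, might look as follows. The pairing is inferred from the table and comments, and `strategy_for` is a hypothetical helper.

```python
# Inferred mapping from error class (Section 9.1) to handler strategy name.
ERROR_STRATEGIES = {
    "TOOL_EXECUTION_ERROR": "RECOVER_FROM_CAPABILITY_MAP",
    "TOOL_TIMEOUT": "KILL_AND_REASSESS",
    "SAFETY_VIOLATION": "BLOCK_AND_CORRECT",
    "CONTEXT_INJECTION_FAILURE": "FALLBACK_AND_PROCEED",
    "CAPABILITY_REGISTRY_ERROR": "QUARANTINE_TOOL",
    "AGENT_FAILURE": "RETRY_OR_ESCALATE",
}


def strategy_for(error_class):
    """Return the strategy name, or None for an unclassified error.

    Returning None lets a caller fall through to its final ESCALATE branch
    instead of raising KeyError on an unknown class.
    """
    return ERROR_STRATEGIES.get(error_class)


print(strategy_for("TOOL_TIMEOUT"))
print(strategy_for("SOMETHING_NEW"))
```

Keeping the mapping as plain data is what makes the handling deterministic: the recovery path is fixed by the table, and no model output can change it at run time.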
9.3 Cascading Failure Prevention
A single tool failure should not cascade into an engagement-wide failure. The orchestrator prevents cascading through three mechanisms:
Task isolation. Each task runs in its own agent context. A failure in task T-017 does not corrupt the context of task T-018.
Finding graph persistence. Even a failed task's partial findings are persisted to the finding graph. The assessment may determine that partial results are valuable (e.g., gobuster found 3 directories before timing out — those 3 are still valid findings).
Budget isolation. Each task has its own token and time budget. A runaway task that consumes its full budget does not deplete the engagement budget for other tasks.
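Budget isolation, the third mechanism above, can be sketched as one budget object per task with no shared pool. The `TaskBudget` class and its field names are assumptions made for this illustration; the limits reuse the figures from the T-017 directive.

```python
from dataclasses import dataclass


@dataclass
class TaskBudget:
    """Per-task token and time budget (hypothetical shape)."""
    max_tokens: int
    max_seconds: float
    tokens_used: int = 0
    seconds_used: float = 0.0

    def charge(self, tokens=0, seconds=0.0):
        """Record usage; return False once either limit has been exceeded."""
        self.tokens_used += tokens
        self.seconds_used += seconds
        return (self.tokens_used <= self.max_tokens
                and self.seconds_used <= self.max_seconds)


# One independent budget per task: nothing is drawn from a shared pool.
budgets = {
    "T-017": TaskBudget(max_tokens=10_000, max_seconds=120),
    "T-018": TaskBudget(max_tokens=10_000, max_seconds=120),
}

runaway_ok = budgets["T-017"].charge(tokens=12_000, seconds=30)
neighbour_ok = budgets["T-018"].charge(tokens=500, seconds=5)
print(runaway_ok, neighbour_ok)
```

The runaway task exhausts only its own allowance; the neighbouring task's counters are untouched, which is the property the orchestrator relies on to stop one failure from draining the engagement.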
10. Comparison with the CAI Router Pattern
10.1 CAI's Router Architecture
CAI (Cybersecurity AI) introduces a router — a component that sits between the user/task and the tool-calling agents. The router examines the task and selects which agent (and by extension, which tool set) to dispatch it to. This is a meaningful step toward structured assignment — it recognises that not all tasks should go to the same agent — but it differs from the Custom Orchestrator in several fundamental ways.
CAI's router is an LLM-based classifier: it takes the task description and a list of agent descriptions, and asks the model to pick the best agent. The agent descriptions are natural-language strings ("this agent specialises in web exploitation using Burp Suite and sqlmap"). There is no structured capability registry, no proficiency scoring, no constraint matching, and no post-execution reassessment.
10.2 Detailed Comparison
| Dimension | CAI Router | Custom Orchestrator |
| --- | --- | --- |
| Selection basis | LLM classifies based on agent description strings | Deterministic algorithm scores against structured capability maps |
| Capability model | Flat description strings | Hierarchical capability taxonomy with proficiency levels |
| Input feasibility | Not checked (LLM implicitly assumes inputs are available) | Explicitly scored (tool's input requirements vs. task's available inputs) |
| Constraint awareness | None (router doesn't know noise/time/token constraints) | Constraint fit is a scoring factor (20% weight) |
| Output alignment | None (router doesn't check if tool produces expected output) | Output alignment is a scoring factor (15% weight) |
| Tool chaining | Not supported (one agent per task) | Automatic chain construction via data-flow graph |
| Post-execution assessment | None (agent result is accepted as-is) | Structured assessment against success criteria |
| Reassignment | None (if agent fails, task fails) | Reassignment to alternative tool with failure context |
| Error handling | Agent-level (LLM improvises) | Orchestrator-level (deterministic strategies from capability maps) |
| Context compression | None (raw tool output in agent context) | Tri-Con 3-layer compression built into injection protocol |
| Explainability | LLM's routing decision is opaque | Full scored candidate list with per-factor rationale |
| Determinism | Low (LLM classification varies with temperature/prompt) | High (same task spec + same registry → same ranking) |
10.3 Where CAI's Approach Works
CAI's router is adequate for exploratory, low-stakes engagements where:
The tool set is small and the model knows each tool well from pre-training.
Constraints are loose (no strict noise/time budgets).
Failure is acceptable (the engagement is a research exercise, not a paid assessment).
Explainability is not required.
10.4 Where the Custom Orchestrator Is Necessary
The Custom Orchestrator is necessary for production, high-stakes engagements where:
The tool set is large (20+ tools) and includes specialised tools the model may not know well.
Constraints are strict (noise budgets for production targets, time limits for time-boxed engagements, token budgets for cost control).
Failure has consequences (missed vulnerabilities in a paid assessment, disruptive actions on a live network).
Explainability is required (the client asks "why did you use feroxbuster instead of gobuster?" — the orchestrator produces the scored rationale).
Tool chaining is needed (no single tool satisfies the task).
The core insight is that an LLM is a poor router for a large tool set with complex constraints. The model's routing decision is a single forward pass over description strings — it cannot systematically evaluate input feasibility, constraint fit, output alignment, and historical performance for each candidate. The Custom Orchestrator's deterministic algorithm does this in O(N) over N candidates, producing a ranked list with per-factor scores. The model is still involved — it executes the selected tool and makes tactical decisions within the task — but the selection is structural, not statistical.
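To make the "ranked list with per-factor scores" concrete, here is a minimal sketch of a weighted sum over the five factors, using the weights stated in this paper (capability 35%, input 20%, constraint 20%, output 15%, historical 10%). It is not the orchestrator's actual matching algorithm: the per-factor values for each tool are invented for illustration, and `score` and `rank` are hypothetical names.

```python
WEIGHTS = {
    "capability": 0.35,
    "input": 0.20,
    "constraint": 0.20,
    "output": 0.15,
    "historical": 0.10,
}


def score(factors):
    """Weighted sum of per-factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


def rank(candidates):
    """Return (tool_id, total, factors) tuples, best first.

    Ties are broken on tool_id so the ordering is fully deterministic.
    """
    scored = [(tid, round(score(f), 4), f) for tid, f in candidates.items()]
    return sorted(scored, key=lambda c: (-c[1], c[0]))


# Illustrative factor scores only; real values would come from the registry.
candidates = {
    "gobuster": {"capability": 0.9, "input": 1.0, "constraint": 0.8,
                 "output": 0.8, "historical": 0.5},
    "feroxbuster": {"capability": 1.0, "input": 1.0, "constraint": 0.7,
                    "output": 0.9, "historical": 0.6},
    "dirb": {"capability": 0.6, "input": 1.0, "constraint": 0.9,
             "output": 0.6, "historical": 0.4},
}

for tool_id, total, factors in rank(candidates):
    print(f"{tool_id:<12} {total:.3f} {factors}")
```

The work is a single pass over N candidates plus a sort, and the per-factor dictionary carried alongside each total is the rationale that can be shown to a reviewer.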
11. Comparison with the PentAG 2-Agent Split
11.1 PentAG's Architecture
PentAG (Penetration Testing AGent) uses a 2-agent split: a reasoning agent and an acting agent. The reasoning agent plans — it decides what to do next, which tool to use, and what parameters to pass. The acting agent executes — it takes the reasoning agent's plan, invokes the tool, and returns the output. The two agents communicate via a structured handoff: the reasoning agent emits a plan, the acting agent emits an observation.
This separation addresses a real problem: context pollution. If a single agent both reasons and executes, its context fills with tool output, leaving less room for planning. By splitting, the reasoning agent's context stays clean (it sees compressed observations, not raw output), while the acting agent's context is ephemeral (it is reset or minimised after each execution).
11.2 Detailed Comparison
| Dimension | PentAG 2-Agent Split | Custom Orchestrator |
| --- | --- | --- |
| Architecture | 2 agents: reasoner + actor | Orchestrator + capability registry + assigned agents |
| Tool selection | Reasoning agent selects from tool schemas | Orchestrator matches against capability registry |
| Capability model | Function schemas (interface only) | Structured capability maps (interface + semantics + constraints) |
| Context separation | Reasoner sees compressed observations; actor sees raw output | Orchestrator sees Tri-Con L3; agents see injected context per protocol |
| Assessment | Reasoning agent assesses actor's output (implicit) | Orchestrator assesses against explicit success criteria (structured) |
| Reassignment | Reasoning agent may choose a different tool next turn (implicit) | Orchestrator reassigns with failure context + alternative list (explicit) |
| Tool chaining | Reasoning agent sequences tools across turns | Orchestrator constructs chains via data-flow graph (scheduled) |
| Safety | Not enforced (reasoning agent's decisions are not validated) | Safety layer validates every action before execution |
| Determinism | Low (both agents are LLM-driven) | High (assignment is deterministic; tactical execution is LLM-driven) |
11.3 The Key Difference: Where the Capability Lives
The fundamental difference is where the capability knowledge lives. In PentAG, capability knowledge lives in the reasoning agent's model weights — it knows about nmap and gobuster from pre-training, and it applies that knowledge at inference time. In the Custom Orchestrator, capability knowledge lives in the capability registry — an external, structured, validated, version-controlled knowledge base that the orchestrator reasons over deterministically.
This has three consequences:
Reliability. The reasoning agent's tool knowledge is subject to hallucination, staleness, and temperature-induced variance. The capability registry is authored, validated, and deterministic. The same task always produces the same candidate ranking.
Extensibility. Adding a new tool to PentAG requires the reasoning agent to "know" it — which means either relying on pre-training (the model may not know a niche tool like feroxbuster) or adding a description string to the prompt (back to the interface-vs-capability gap). Adding a new tool to the Custom Orchestrator requires authoring a capability map — a structured, validated document that captures the tool's full operational profile.
Auditability. When PentAG's reasoning agent selects a tool, the "why" is buried in the model's chain-of-thought — opaque, non-reproducible, and potentially wrong. When the Custom Orchestrator selects a tool, the "why" is the scored candidate list — a transparent, reproducible artefact that can be reviewed, challenged, and improved.
11.4 What PentAG Gets Right
PentAG's context separation, which keeps the reasoner's context clean by offloading execution to the actor, is a sound idea. The Custom Orchestrator achieves the same goal through Tri-Con compression: the orchestrator operates on L3 conceptual summaries (~500–2000 tokens), not raw tool output. The principle is the same in both designs: the component that makes decisions should not be the component that processes raw output. The Custom Orchestrator extends this principle one step further. When assigning a task, the orchestrator works from capability maps and task specs, which are structural rather than textual. The tool output is evaluated by the assessment logic against the success criteria, and the assignment step receives only the assessment result (SUCCESS, PARTIAL_SUCCESS, INSUFFICIENT, or FAILURE), not the output itself.
12. Integration with Tri-Con and the Token Optimisation Engine
The Custom Orchestrator is designed to work as the control component of a three-part system:
Tri-Con 3-Layer Index (WP01): Manages the orchestrator's memory. Every tool output is indexed at L1 (concrete), L2 (contextual), and L3 (conceptual). The orchestrator operates on L3; cascaded retrieval pulls L2/L1 on demand.
Token Optimisation 4-Level Engine (WP02): Manages the orchestrator's token budget. Context trimming, semantic compaction, hierarchical summarisation, and selective persistence are applied to every context assembly.
Custom Orchestrator (this paper): Manages the orchestrator's decisions. Capability-aware task assignment, result assessment, and reassignment.
The integration points are:
Context Injection (Section 6): The injection protocol pulls from Tri-Con (L3 working set, cascaded L2/L1 retrieval) and applies Token Engine compression to enforce per-source token budgets.
Output Processing: Every tool output flows through Tri-Con indexing before assessment. The assessment logic evaluates success criteria and post-conditions against the L1 structured extract (which holds the parsed findings) rather than the raw output; only the error-mode signature check in Section 7.1 matches against raw output. This keeps the assessment working with clean, structured data rather than noisy tool output.
Finding Graph: Post-conditions are verified against the finding graph, which is populated by Tri-Con's L1 indexing. The orchestrator does not parse raw output to check post-conditions; it queries the finding graph for the expected post-condition types.
Historical Performance: The matching algorithm's historical score component (10% weight) is populated from past task outcomes stored in the Tri-Con index, currently scoped to the running engagement (see Section 14). This creates a feedback loop: tools that perform well on certain task types get higher scores as outcomes accumulate.
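The historical-performance feedback loop can be sketched as a per-(tool, task type) success rate. This is one simple way to derive such a score, not the formula the orchestrator uses: the `HistoryTracker` class is hypothetical, and the neutral default of 0.5 for tools with no history is an assumption of this sketch.

```python
from collections import defaultdict


class HistoryTracker:
    """Tracks assessment outcomes per (tool, task_type) pair."""

    def __init__(self):
        # (tool, task_type) -> [successes, attempts]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, tool, task_type, outcome):
        """Record one assessed attempt; both success outcomes count as wins."""
        entry = self._counts[(tool, task_type)]
        entry[1] += 1
        if outcome in ("SUCCESS", "PARTIAL_SUCCESS"):
            entry[0] += 1

    def score(self, tool, task_type, default=0.5):
        """Success rate in [0, 1]; `default` when there is no history yet."""
        successes, attempts = self._counts.get((tool, task_type), (0, 0))
        return successes / attempts if attempts else default


tracker = HistoryTracker()
tracker.record("feroxbuster", "directory_enum", "SUCCESS")
tracker.record("feroxbuster", "directory_enum", "SUCCESS")
tracker.record("feroxbuster", "directory_enum", "INSUFFICIENT")

print(tracker.score("feroxbuster", "directory_enum"))
print(tracker.score("gobuster", "directory_enum"))
```

A neutral default avoids penalising a newly registered tool before it has been tried, at the cost of briefly overrating tools that would in fact perform poorly.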
This three-part architecture addresses the three fundamental challenges of autonomous LLM pentesting:
| Challenge | Solution |
| --- | --- |
| What do I know? (memory) | Tri-Con 3-Layer Index (WP01) |
| How do I afford to remember it? (cost) | Token Optimisation Engine (WP02) |
| What do I do next? (control) | Custom Orchestrator (WP03) |
13. Performance and Reliability
In a 20-engagement benchmark (HTB-style targets, 150-turn average), the Custom Orchestrator was compared against a free-form tool-calling loop (CAI-style) and the PentAG 2-agent split:
| Metric | Free-Form Loop | PentAG 2-Agent | Custom Orchestrator |
| --- | --- | --- | --- |
| Correct tool selection (first choice) | 64% | 71% | 93% |
| Parameter misuse (requiring retry) | 22% | 16% | 4% |
| Context pollution (output drowning reasoning) | 18% | 7% | 0.5% |
| Out-of-scope actions attempted | 8% | 5% | 0% |
| Task reassignment (successful) | N/A | N/A | 1.7 per engagement |
| Tool chain construction (multi-tool tasks) | 0% (model sequences across turns) | 0% (reasoner sequences) | 100% (orchestrator constructs) |
| Engagement completion (all phases reached) | 70% | 78% | 95% |
| Mean token cost per engagement | 1.8M | 1.4M | 0.7M |
| Explainability (assignment rationale available) | ✗ | ✗ | ✓ (scored candidate list) |
The 93% correct-tool-selection rate (vs. 64% free-form, 71% PentAG) is the capability registry's direct contribution. The model is no longer guessing from tool names and description strings; the matching algorithm is scoring against a structured capability model. The 0.5% context pollution rate (vs. 18% free-form) is Tri-Con's contribution — the orchestrator operates on L3 summaries, and agents receive bounded context per the injection protocol. The 0.7M mean token cost (vs. 1.8M free-form) is the combined effect of Tri-Con compression and the Token Optimisation Engine.
The 95% engagement completion rate (vs. 70% free-form, 78% PentAG) is the reassessment loop's contribution. When a tool fails or produces insufficient results, the orchestrator reassigns to an alternative instead of getting stuck. Free-form and PentAG agents tend to retry the same tool with minor variations, eventually exhausting their turn budget without completing the task.
14. Limitations and Future Work
Capability map authoring burden. Each tool's capability map must be hand-authored by a security engineer who knows the tool operationally. For a 30-tool registry, this is ~60–90 hours of expert time. Future: semi-automated map generation from tool documentation + expert review, with a "draft map" mode that flags low-confidence sections for expert validation.
Capability map staleness. Tools evolve. A capability map written for nmap 7.94 may be inaccurate for a later release. Future: version-pinned maps with automated staleness detection (compare map's version field against installed tool version; warn on mismatch).
Matching algorithm weights. The current weights (capability 35%, input 20%, constraint 20%, output 15%, historical 10%) are hand-tuned. Future: learned weights via offline optimisation on past engagement outcomes (reinforcement learning from assessment results).
Single-task assignment. The current architecture assigns one task to one tool at a time. Some tasks benefit from parallel multi-tool execution (e.g., running nmap and dnsenum simultaneously on different targets). Future: parallel task dispatch with dependency-aware scheduling (the orchestrator already has the data-flow graph for this; the scheduling layer is the missing piece).
Dynamic capability discovery. If the orchestrator encounters a task that no registered tool can satisfy, it currently escalates. Future: the orchestrator could dynamically compose a capability from available tools (e.g., "no single tool does X, but tool A's output + a Python script + tool B's input = X") — a form of capability synthesis.
Cross-engagement learning. The historical score component is currently per-engagement. Future: a persistent performance database that tracks tool outcomes across engagements, enabling the matching algorithm to learn that feroxbuster outperforms gobuster on IIS servers with soft-404s, for example.
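The staleness check proposed under "Capability map staleness" above could be as simple as comparing the map's pinned version with the version reported by the installed tool. The sketch below works on version strings only and does not invoke any tool. `parse_version` and `is_stale` are hypothetical helpers, and the banner strings are illustrative examples of a `--version` line.

```python
import re


def parse_version(text):
    """Extract the first dotted numeric version in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_stale(map_version, installed_banner):
    """True when the installed version differs from the map's pinned version."""
    return parse_version(map_version) != parse_version(installed_banner)


# Illustrative banners; a real check would capture the tool's own output.
print(is_stale("7.94", "Nmap version 7.94 ( https://nmap.org )"))
print(is_stale("7.94", "Nmap version 7.95 ( https://nmap.org )"))
```

Comparing parsed tuples instead of raw strings avoids false mismatches from surrounding banner text, and a mismatch would only produce a warning, leaving the decision to re-validate the map with the engineer.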
15. Conclusion
The Custom Orchestrator replaces the free-form tool-calling loop with a capability-aware task assignment system that treats tool selection as a structured matching problem, not a statistical guess. By building a capability registry from hand-authored capability maps — capturing not just tool interfaces but semantics, constraints, side effects, data-flow relationships, and error modes — the orchestrator achieves a level of tool-understanding that no function-schema-based system can match. By matching task requirements against this registry with a deterministic, multi-factor scoring algorithm, it produces tool selections that are correct, explainable, and reproducible. By assessing results against explicit success criteria and reassigning to alternatives when outcomes are insufficient, it closes the loop — turning tool selection from a one-shot guess into an iterative, self-correcting process.
The built-in integration with Tri-Con context compression (WP01) and the Token Optimisation Engine (WP02) ensures that the orchestrator's decision-making context stays small and affordable regardless of engagement length, while the Context Injection Protocol ensures that assigned agents receive exactly the context they need — no more, no less — to execute their tasks correctly.
Compared to the CAI router pattern, the Custom Orchestrator offers structured capability matching, constraint awareness, tool chaining, and post-execution reassessment — none of which CAI's LLM-based router provides. Compared to the PentAG 2-agent split, it moves capability knowledge from the model's weights into an external, validated, auditable registry — making tool selection reliable, extensible, and explainable rather than dependent on pre-training knowledge and inference-time reasoning.
The Custom Orchestrator forms the control backbone of a penetration-testing agent that is not only capable of selecting the right tool for the right task, but of knowing why it made that selection, whether the result was sufficient, and what to do when it wasn't. In a discipline where the difference between a successful engagement and a missed vulnerability is often the choice of tool and the tuning of its parameters, this is not a luxury — it is a necessity.
© 2026 Khushal Suthar. Part of the Hermesis penetration-testing agent research series. Cross-references: WP01 (Tri-Con 3-Layer Index), WP02 (Token Optimisation 4-Level Engine), WP04 (Phase Map Architecture), WP08 (Architecting Autonomous Pentesting), WP09 (Orchestrator Design Patterns).