WP05 — Skill-Based Platform: A Never-Changing Core with a Shared Skill Library

> Thesis: The platform's core binary never ships a new release. Every module — every scanner, every fuzzer, every reporter, every agent — is assembled at runtime from skills pulled out of a shared, versioned Skill Library. Adding support for a new technology is no longer an engineering project that touches the core; it is a documentation project that drops a new skill package into the library. The result is a security platform whose capability surface grows continuously while its attack surface for regression stays frozen at zero.

This whitepaper specifies the architecture, schema, loader, runtime composition, versioning, marketplace, and safety model of a Skill-Based Platform. It is written to be implementable: every section contains enough concrete detail (YAML schemas, sequence diagrams rendered as ASCII, code-shaped pseudocode, threat models) that a small team could begin construction within a sprint. It also situates the design against two prior art systems — Legion's plugin framework and CAI's modular agents — and explains where the Skill-Based Platform deliberately diverges.

1. Motivation: Why the Core Must Never Change
2. What Is a Skill?
3. Skill Schema (YAML)
4. Skill Categories
5. Skill Loader Architecture
6. Runtime Composition
7. Versioning and Compatibility
8. The Skill Marketplace
9. Safety Rules and Sandboxing
10. Comparison: Legion Plugins vs. CAI Modular Agents vs. Skill-Based Platform
11. Operational Scenarios
12. Failure Modes and Mitigations
13. Roadmap

Appendix: Glossary

1. Motivation: Why the Core Must Never Change

Every security platform of meaningful age carries a scar pattern: the core was once small and correct, then feature requests arrived, then the core grew, then the core became the thing everyone is afraid to touch. A refactor that should have taken a day now takes a quarter because the core is simultaneously the scheduler, the transport, the plugin host, the policy engine, the report renderer, and the catalog of every technology the platform has ever been asked to look at. Releases become coordination problems. A regression in the HTTP/2 parser delays the release that also fixes a memory leak in the MQTT fuzzer, because both live in the same binary and ship on the same train.

The Skill-Based Platform inverts this. The core is a deliberately tiny, deliberately boring program. It does exactly four things:

1. Load skills from the Skill Library.
2. Compose those skills into runnable modules at request time.
3. Enforce safety policy around every skill invocation.
4. Stream results to the operator and to durable storage.

It does not know what HTTP is. It does not know what MQTT is. It does not know what a Kubernetes API server is. It knows how to read a skill manifest, validate it, materialize the components the manifest describes, wire those components into a directed graph, run that graph under a resource envelope, and emit results. The knowledge of what to do lives entirely in skills. The knowledge of how to do it safely lives in the core. This separation is the single most important property of the system, and every other design decision in this document exists to defend it.

The payoff is mechanical. When a customer asks, "Can you scan our new IoT protocol stack?", the answer is never "we'll put it on the roadmap for Q3." The answer is "we'll have a skill for it by Friday." Adding a technology is no longer a core change; it is a skill package, and a skill package is 1–2 hours of focused work for a domain expert. Zero downtime, zero regression risk to the core, zero coordination with a release train. The skill is written, validated against the schema, pushed to the marketplace, and available to every deployed platform instance within the library's propagation window (seconds for a cloud-hosted library, minutes for an on-prem mirror).

2. What Is a Skill?

A skill is the smallest unit of capability the platform understands. It is a self-contained, declaratively-described package that teaches the platform how to engage with one specific technology, technique, or reporting format. Critically, a skill is not code that runs inside the core. A skill is a manifest (the skill.yaml) plus a set of assets (knowledge documents, command definitions, pattern libraries, parser scripts, report templates). The core reads the manifest, loads the assets, and uses them to compose a module at runtime. The skill itself is inert until the core activates it.

Every skill has five component slots. Each slot is optional, but at least one must be present:

| Slot | Purpose | Format | Example |
|---|---|---|---|
| Knowledge | Domain context, CVE references, config defaults, threat models | Markdown + structured YAML frontmatter | "Apache 2.4.49 has CVE-2021-41773; path traversal caused by flawed path normalization; check /cgi-bin/." |
| Commands | Executable actions the platform can run | Declarative command specs (argv templates, container images, HTTP requests) | "Run nuclei -t cves/2021/CVE-2021-41773.yaml -u {target}" |
| Patterns | Regex / AST / semantic matchers for identifying findings in command output or raw traffic | YAML pattern definitions with severity and confidence | "Match 403 followed by path containing ../ → Path Traversal, High." |
| Parser | Transforms raw command output into normalized findings | Python script (sandboxed) or declarative JSONata/jq expression | "Parse nuclei JSON output → Finding objects." |
| Report Template | Human-facing rendering for findings produced by this skill | Jinja2 / Markdown / HTML template | "Render CVE-2021-41773 findings as an executive summary section." |

A skill that only has knowledge is a reference skill — the platform can surface it to operators or feed it to an LLM agent as context, but it won't run commands. A skill that has commands and patterns but no parser will rely on the core's default parser. A skill that has only a report template is a presentation skill that formats findings from other skills. This composability is intentional: skills are Lego bricks, not monoliths.

2.1 Skill Identity

Every skill has a globally unique identity composed of three parts:

Category (e.g., web, iot, cloud) — the broad domain.
Slug (e.g., apache-path-traversal) — the specific capability within the category.
Version (semver, e.g., 1.2.0) — the skill's own version, independent of the core.

The fully qualified ID is category/slug@version, e.g., web/apache-path-traversal@1.2.0. This ID is how skills reference each other (a cloud skill may depends_on a network skill for port discovery), how the marketplace indexes them, and how operators pin specific versions in policy.
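The ID format above can be sketched as a small parser. This is a minimal illustration, not the platform's implementation: the `SkillId` class, the `parse_skill_id` function, and the exact character rules for category and slug are assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumed grammar: lowercase category and slug, strict MAJOR.MINOR.PATCH version.
_SKILL_ID = re.compile(
    r"^(?P<category>[a-z][a-z0-9-]*)/(?P<slug>[a-z][a-z0-9-]*)@(?P<version>\d+\.\d+\.\d+)$"
)


@dataclass(frozen=True)
class SkillId:
    """Hypothetical value type for a fully qualified skill ID."""
    category: str
    slug: str
    version: tuple[int, int, int]

    def __str__(self) -> str:
        return f"{self.category}/{self.slug}@{'.'.join(map(str, self.version))}"


def parse_skill_id(text: str) -> SkillId:
    """Split 'category/slug@version' into its three parts, or raise ValueError."""
    match = _SKILL_ID.match(text)
    if match is None:
        raise ValueError(f"not a fully qualified skill ID: {text!r}")
    major, minor, patch = (int(part) for part in match["version"].split("."))
    return SkillId(match["category"], match["slug"], (major, minor, patch))
```

Storing the version as a tuple of integers keeps comparisons numeric, so 1.10.0 sorts after 1.2.0.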

2.2 Skill Package Layout

A skill is distributed as a directory or a tarball with a fixed layout:

    web/apache-path-traversal/
    ├── skill.yaml              # The manifest (required)
    ├── knowledge/
    │   ├── overview.md         # Domain knowledge
    │   └── references.yaml     # CVE list, config defaults
    ├── commands/
    │   └── nuclei-cve.yaml     # Command spec
    ├── patterns/
    │   └── traversal.yaml      # Pattern definitions
    ├── parser/
    │   └── parse.py            # Parser script (sandboxed)
    └── report/
        └── template.j2         # Jinja2 report template

Every component is optional. The manifest declares which components are present and how they relate.

3. Skill Schema (YAML)

The skill.yaml manifest is the contract between a skill author and the platform core. It is validated at load time against a JSON Schema derived from the specification below. A manifest that fails validation is rejected before any of its assets are read — the core never partially loads a skill.

3.1 Top-Level Structure

    # skill.yaml: top-level
    apiVersion: skill.platform/v1     # Schema version, pinned
    kind: Skill                       # Currently only "Skill"; reserved for future kinds

    id:
      category: web
      slug: apache-path-traversal
      version: 1.2.0                  # SemVer; must match the package version

    meta:
      name: "Apache 2.4.49 Path Traversal (CVE-2021-41773)"
      description: >
        Detects and verifies CVE-2021-41773 and CVE-2021-42013 in Apache HTTP
        Server versions 2.4.49 and 2.4.50. Maps confirmed findings to CVSS 7.5.
      author: "platform-security-team"
      license: "Apache-2.0"
      homepage: "https://marketplace.platform/skills/web/apache-path-traversal"
      tags: [apache, cve, path-traversal, web, http]

    min_core_version: "1.0.0"         # Minimum core version that can load this skill
    supported_targets:                # What kinds of targets this skill applies to
      - web-url
      - ip-address
      - hostname

    # Component declarations: each is optional
    components:
      knowledge:
        documents:
          - path: knowledge/overview.md
            title: "CVE-2021-41773 Technical Overview"
          - path: knowledge/references.yaml
            title: "CVE References and Defaults"
      commands:
        - id: nuclei-scan
          spec: commands/nuclei-cve.yaml
      patterns:
        - id: traversal-detect
          spec: patterns/traversal.yaml
      parser:
        type: python                  # "python" | "jq" | "jsonata" | "builtin"
        entrypoint: parser/parse.py
        function: parse_nuclei_output
      report:
        template: report/template.j2
        format: markdown              # "markdown" | "html" | "json" | "pdf"

    # Dependency graph
    depends_on:
      - skill: web/http-reachability@^1.0
        optional: true                # If present, run first; if absent, skip
        purpose: "Confirm target is reachable on HTTP/HTTPS before launching nuclei."

    # Runtime profile: tells the core how to schedule this skill
    runtime:
      executor: container             # "container" | "process" | "http" | "inline"
      container:
        image: ghcr.io/platform/nuclei:3.0.4
        timeout_seconds: 120
        cpu: "0.5"
        memory: "256Mi"
        network: egress-only          # "none" | "egress-only" | "full"
      concurrency: 4                  # Max parallel invocations of this skill
      retry:
        max_attempts: 2
        backoff: exponential

    # Safety declarations: see Section 9
    safety:
      impact: "non-destructive"       # "passive" | "non-destructive" | "active" | "destructive"
      requires_confirmation: false    # If true, operator must approve before execution
      scope:
        - "web-path-traversal-detection"
      data_handling:
        stores_raw_output: true
        stores_credentials: false
        pii_risk: low

3.2 Command Spec

Each command is specified declaratively so the core can validate, parameterize, and sandbox it without executing arbitrary code:

    # commands/nuclei-cve.yaml
    id: nuclei-scan
    type: container                   # "container" | "process" | "http-request"
    container:
      image: ghcr.io/platform/nuclei:3.0.4
      argv:
        - "nuclei"
        - "-json"
        - "-t"
        - "cves/2021/CVE-2021-41773.yaml"
        - "-u"
        - "{target.url}"
      env:
        NUCLEI_RATE_LIMIT: "{policy.rate_limit | default(150)}"
    inputs:
      - name: target
        type: object
        required: true
        schema:
          url: { type: string, format: uri }
    outputs:
      - name: raw_findings
        type: json-stream             # One JSON object per line
      - name: exit_code
        type: integer

The {target.url} and {policy.rate_limit | default(150)} placeholders are resolved by the core's template engine at invocation time. The core never passes raw user input directly into a command: all interpolation goes through a typed, schema-validated input contract.
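A minimal sketch of how such placeholder resolution could work follows. The grammar (a dotted path with an optional `| default(...)` filter), the `render` and `lookup` names, and the scalar-only rule are assumptions for illustration; the source does not specify the template engine.

```python
import re

# Assumed placeholder grammar: {dotted.path} or {dotted.path | default(value)}
_PLACEHOLDER = re.compile(r"\{([a-z_][\w.]*)\s*(?:\|\s*default\(([^)]*)\))?\}")


def lookup(context: dict, dotted: str):
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    node = context
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def render(template: str, context: dict) -> str:
    """Resolve every placeholder in one argv element or env value."""
    def substitute(match: re.Match) -> str:
        path, default = match.group(1), match.group(2)
        value = lookup(context, path)
        if value is None:
            if default is None:
                raise KeyError(f"no value for placeholder {path!r}")
            value = default
        # Only scalars are interpolated, so one argv element stays one argument.
        if not isinstance(value, (str, int, float)):
            raise TypeError(f"placeholder {path!r} resolved to a non-scalar")
        return str(value)

    return _PLACEHOLDER.sub(substitute, template)
```

Because each argv element is rendered separately and never passed through a shell, a hostile value can change the content of one argument but cannot add arguments. A real engine would also validate the value against the declared input schema before rendering.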

3.3 Pattern Spec

Patterns identify findings in command output or raw traffic. They are declarative so the core can optimize, cache, and audit them:

    # patterns/traversal.yaml
    id: traversal-detect
    matchers:
      - id: confirmed-traversal
        where: output.json
        condition: item.matched == true and item.template-id contains "CVE-2021-41773"
        severity: high
        confidence: confirmed
        extract:
          cve: "item.info.reference | match('CVE-\\d{4}-\\d+')"
          matched_url: "item.matched-at"
          ip: "item.ip"
      - id: probable-traversal
        where: output.json
        condition: item.matched == true and item.template-id contains "path-traversal"
        severity: medium
        confidence: probable
    normalize_to:                     # Maps to the platform's normalized Finding schema
      category: web
      class: path-traversal
      cwe: "CWE-22"

3.4 Parser Spec

The parser transforms raw command output into the platform's normalized Finding schema. It can be a Python function (sandboxed — see Section 9), a jq expression, a JSONata expression, or the builtin default:

    # Declared in skill.yaml under components.parser
    parser:
      type: python
      entrypoint: parser/parse.py
      function: parse_nuclei_output
      input: raw_findings             # Which command output to consume
      output: findings                # Which module output to produce

    # parser/parse.py: runs in a restricted sandbox
    from platform.sdk import Finding, Severity


    def parse_nuclei_output(raw_findings: list[dict]) -> list[Finding]:
        """Convert nuclei JSON records into normalized Finding objects."""
        results = []
        for item in raw_findings:
            info = item.get('info', {})
            # 'reference' may be absent, null, or an empty list; fall back to None.
            references = info.get('reference') or [None]
            results.append(Finding(
                skill_id="web/apache-path-traversal@1.2.0",
                title=f"Path Traversal: {item.get('template-id')}",
                severity=Severity.from_string(info.get('severity', 'info')),
                target=item.get('matched-at', ''),
                evidence=item,
                cve=references[0],
            ))
        return results

3.5 Report Template

The report template renders findings for human consumption. It uses Jinja2 with a restricted filter set:

    {# report/template.j2 #}
    Apache Path Traversal - {{ findings | length }} finding(s)
    {% for f in findings %}
    {{ f.title }}
    Severity: {{ f.severity | upper }}
    Target: {{ f.target }}
    CVE: {{ f.cve or "N/A" }}
    Evidence: {{ f.evidence | tojson(indent=2) }}
    {% endfor %}
    {% if findings | length == 0 %}
    No path traversal findings detected. Apache target appears patched against
    CVE-2021-41773 and CVE-2021-42013.
    {% endif %}

4. Skill Categories

The Skill Library is organized into six top-level categories. These are not arbitrary buckets — each category maps to a distinct engagement model, a distinct target topology, and a distinct safety envelope. The core uses the category to select default runtime profiles, default safety rules, and default report scaffolding.

4.1 Web

Skills that engage web applications, APIs, and HTTP services. Targets are URLs, hostnames, or IP:port pairs. Typical skills: SQL injection detection, XSS detection, path traversal, directory enumeration, JWT analysis, GraphQL introspection, WAF fingerprinting. Commands are usually HTTP requests or containerized scanners (nuclei, ffuf, sqlmap). Safety tends toward non-destructive to active. The Web category is the most populous because the technology surface is the largest and changes the fastest.

4.2 Mobile

Skills that engage mobile applications (APK, IPA) and mobile backend APIs. Targets are binary artifacts or bundle identifiers. Typical skills: APK static analysis (manifest extraction, hardcoded secrets, insecure storage), IPA static analysis, Frida hook recipes, certificate pinning bypass, deep link analysis, backend API discovery from mobile traffic. Commands are typically containerized static analyzers (mobSF, apktool, jadx) or dynamic instrumentation (Frida). Safety is non-destructive for static, active for dynamic. Mobile skills frequently depends_on Web skills for the backend API phase.

4.3 IoT

Skills that engage embedded devices, firmware, and IoT protocols. Targets are IP addresses, firmware blobs, or protocol endpoints. Typical skills: firmware extraction (binwalk), hardcoded credential detection in firmware, MQTT broker enumeration, CoAP recon, Zigbee/BLE sniffer recipes, default credential brute force, UPnP enumeration. Commands are a mix of containerized tools (binwalk, nmap scripts) and protocol-specific clients. Safety is active for live devices (a brute force can lock out a device), non-destructive for firmware analysis. IoT skills carry the highest requires_confirmation rate because misconfigured active scans can brick consumer hardware.

4.4 Network

Skills that engage network infrastructure: routers, switches, firewalls, VPN concentrators, load balancers. Targets are IP ranges, CIDR blocks, or hostnames. Typical skills: port scanning (nmap), service fingerprinting, TLS cipher analysis, SNMP enumeration, BGP route analysis, firewall rule inference, VPN detection. Commands are containerized network tools. Safety spans passive (banner analysis from Shodan-style data) to active (nmap with NSE scripts). Network skills are heavily composed — a network discovery skill produces targets that feed into Web, IoT, or Cloud skills.

4.5 Cloud

Skills that engage cloud providers and cloud-native infrastructure: AWS, GCP, Azure, Kubernetes, Terraform, CloudFormation. Targets are cloud accounts, regions, clusters, or IaC repositories. Typical skills: S3 bucket exposure detection, IAM policy analysis, EKS RBAC audit, Terraform static analysis (tfsec, checkov), Kubernetes pod security analysis, cloud metadata endpoint checks, exposed secrets in CI configs. Commands are typically cloud CLI calls (aws, gcloud, az, kubectl) or IaC scanners. Safety is non-destructive (read-only API calls) for most skills, active for a small number that create canary resources. Cloud skills are the most policy-sensitive: the safety declarations must respect the customer's cloud tenancy boundaries.

4.6 AI/ML

Skills that engage AI/ML systems: LLM endpoints, model serving infrastructure, training pipelines, datasets. Targets are API endpoints, model artifacts, or dataset repositories. Typical skills: LLM prompt injection testing, model extraction attack detection, training data poisoning detection, model card validation, inference endpoint DoS testing, vector database exposure detection, agent tool-use abuse testing. Commands are HTTP requests to inference endpoints or containerized adversarial toolkits. Safety is non-destructive for analysis skills, active for red-team skills that send crafted inputs. This is the newest category and the fastest-growing; it is also the category where the skill schema is most likely to evolve, because the attack surface is still being mapped.

4.7 Cross-Category Composition

Categories are not silos. A typical engagement composes skills across categories. A cloud pen-test might run: cloud/eks-rbac-audit → network/port-scan (on discovered nodes) → web/api-reachability (on discovered services) → ai/llm-endpoint-test (on discovered inference endpoints). The core's composition engine (Section 6) handles the data flow between categories. The category system exists for organization, defaults, and safety policy — not for runtime isolation.

5. Skill Loader Architecture

The skill loader is the core's front door. It is the only component that reads from the Skill Library, and it is the only component that can introduce new skills into a running platform. Its design is deliberately simple and deliberately strict: it validates everything, trusts nothing, and fails closed.

5.1 Loader Pipeline

The loader runs a five-stage pipeline for every skill it encounters:

    ┌────────────────────────────────────────────────────────────────────────────────┐
    │                             SKILL LOADER PIPELINE                              │
    │                                                                                │
    │ ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
    │ │ 1. Fetch   │─▶│ 2. Parse   │─▶│ 3. Validate│─▶│ 4. Compile │─▶│ 5. Register│ │
    │ └────────────┘  └────────────┘  └────────────┘  └────────────┘  └────────────┘ │
    │                                                                                │
    │ Fetch:    Pull skill package from library (local FS or remote)                 │
    │ Parse:    Read skill.yaml, expand into in-memory manifest tree                 │
    │ Validate: JSON Schema validation + semantic checks + safety                    │
    │ Compile:  Resolve templates, pre-compile parsers, index patterns               │
    │ Register: Add to skill registry, notify composition engine                     │
    └────────────────────────────────────────────────────────────────────────────────┘

Stage 1 — Fetch. The loader retrieves the skill package from the configured library source. Sources are pluggable: local filesystem (file://), HTTP registry (https://marketplace.platform/), S3 mirror (s3://), or a Git repository. The loader verifies a package signature (Sigstore/cosign) before proceeding. Unsigned packages are rejected in production mode; a --allow-unsigned flag exists for development only and is gated behind a core configuration flag that cannot be set via skill.

Stage 2 — Parse. The loader reads skill.yaml, resolves any include: directives (skills can include shared YAML fragments), and expands the manifest into an in-memory tree. Parsing failures (malformed YAML, unknown fields in strict mode) cause the skill to be rejected with a structured error. The core never guesses.

Stage 3 — Validate. This is the most important stage. Validation has three layers:

Structural validation: The manifest is checked against the JSON Schema for apiVersion: skill.platform/v1. Unknown fields in strict mode are rejected. Required fields are checked. Type constraints are enforced.
Semantic validation: Cross-field checks run. min_core_version must be ≤ the running core version. depends_on references must point to skills that exist (or are being loaded in the same batch). runtime.executor must be one the core supports. safety.impact must be compatible with the platform's configured engagement mode (a destructive skill cannot be loaded if the platform is in safe mode).
Safety validation: The safety declarations are checked against the platform's safety policy. A skill that declares requires_confirmation: true is allowed; a skill that should declare it but doesn't (based on its commands) is flagged for manual review. The loader maintains a heuristic classifier that inspects command argv for high-risk patterns (e.g., rm, dd, mkfs, --destroy) and escalates the impact declaration if the skill author under-declared it.
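The escalation heuristic in the safety layer could look roughly like the sketch below. The token list is seeded only from the examples in the text; the `effective_impact` function and the decision to escalate straight to "destructive" are assumptions, and a real classifier would need a far richer rule set.

```python
IMPACT_ORDER = ["passive", "non-destructive", "active", "destructive"]

# Assumed high-risk tokens, taken from the examples in the text.
HIGH_RISK_TOKENS = {"rm", "dd", "mkfs", "--destroy"}


def effective_impact(declared: str, argv: list[str]) -> tuple[str, bool]:
    """Return (impact, escalated).

    Escalates to 'destructive' when argv contains a high-risk token and the
    author declared a lower impact level.
    """
    if declared not in IMPACT_ORDER:
        raise ValueError(f"unknown impact level: {declared!r}")
    # Compare the basename so '/bin/rm' is treated the same as 'rm'.
    risky = any(token.rsplit("/", 1)[-1] in HIGH_RISK_TOKENS for token in argv)
    if risky and IMPACT_ORDER.index(declared) < IMPACT_ORDER.index("destructive"):
        return "destructive", True
    return declared, False
```

Exact token matching will miss variants such as `mkfs.ext4`, which is one reason the text routes under-declared skills to manual review rather than relying on the heuristic alone.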

Stage 4 — Compile. The loader resolves all template placeholders in command specs against the platform's default policy context (not against a specific target — that happens at invocation time). Parsers are pre-compiled: Python parsers are byte-compiled and checked for sandbox violations (no import os, no open() outside approved paths, no subprocess, no network access). Jinja2 templates are compiled and checked for forbidden filters/tags. Patterns are compiled into matcher objects and indexed by the fields they extract, so the composition engine can query "which skills can produce a cve field?" in O(1).
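The static part of the parser sandbox check can be sketched with Python's `ast` module. The denylists and the `sandbox_violations` function are assumptions for illustration; this sketch rejects every `open()` call, which is stricter than the "approved paths" rule described above, and a static scan complements runtime isolation rather than replacing it.

```python
import ast

FORBIDDEN_MODULES = {"os", "subprocess", "socket"}        # assumed denylist
FORBIDDEN_CALLS = {"open", "eval", "exec", "__import__"}  # assumed; stricter than the text


def sandbox_violations(source: str) -> list[str]:
    """Statically scan parser source and list constructs the sandbox would reject."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            # 'os.path' and 'os' both resolve to the top-level module 'os'.
            if name.split(".")[0] in FORBIDDEN_MODULES:
                violations.append(f"line {node.lineno}: import of {name}")
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            violations.append(f"line {node.lineno}: call to {node.func.id}()")
    return violations
```

A parser is accepted by this check only when the returned list is empty.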

Stage 5 — Register. The validated, compiled skill is added to the in-process Skill Registry, a concurrent map keyed by fully-qualified skill ID. The registry publishes an event (skill.loaded) that the composition engine, the policy engine, and the operator console subscribe to. From this point, the skill is available for composition. Registration is atomic: a skill is either fully registered or not registered at all. There is no "partially loaded" state.

5.2 Registry Design

The Skill Registry is the core's source of truth for what skills are available. It is:

In-memory for hot path performance. A typical platform instance has 200–500 skills loaded; the registry is a HashMap<String, CompiledSkill> with an RCU (read-copy-update) pattern so readers are never blocked by writers.
Append-only for versions. A new version of a skill is a new entry (web/apache-path-traversal@1.2.1 is a distinct key from web/apache-path-traversal@1.2.0). Old versions are not evicted unless an explicit GC policy runs, so running modules that pinned an old version are never broken by a new release.
Queryable. The registry supports queries by category, by tag, by supported_targets, by safety.impact, by depends_on, and by "can produce field X." These queries power the composition engine's skill selection.
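The registry properties above can be approximated in a few lines. This is a sketch under assumptions: the `SkillRegistry` class and its methods are hypothetical, compiled skills are represented as plain dicts, and the copy-on-write snapshot only imitates RCU rather than implementing it.

```python
import threading
from types import MappingProxyType


class SkillRegistry:
    """Copy-on-write registry: readers use an immutable snapshot without locking;
    writers build a new snapshot and swap the reference."""

    def __init__(self):
        self._snapshot = MappingProxyType({})
        self._write_lock = threading.Lock()

    def register(self, skill_id: str, compiled: dict) -> None:
        with self._write_lock:
            if skill_id in self._snapshot:
                raise KeyError(f"{skill_id} already registered; versions are append-only")
            updated = dict(self._snapshot)
            updated[skill_id] = compiled
            self._snapshot = MappingProxyType(updated)  # single reference swap

    def get(self, skill_id: str) -> dict:
        return self._snapshot[skill_id]

    def latest(self, name: str) -> str:
        """Return the highest registered version of 'category/slug'."""
        candidates = [key for key in self._snapshot if key.rsplit("@", 1)[0] == name]
        if not candidates:
            raise KeyError(name)

        def version_of(key: str) -> tuple:
            return tuple(int(part) for part in key.rsplit("@", 1)[1].split("."))

        return max(candidates, key=version_of)
```

Requests that do not pin a version would resolve through `latest`, while pinned requests call `get` with the full ID, so an old version stays reachable after a new one is registered.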

5.3 Hot Loading and Unloading

Skills can be loaded, upgraded, and unloaded while the platform is running — this is the "zero downtime" promise. The mechanism:

Load: New skill goes through the pipeline, enters the registry. New module requests can use it immediately. Existing running modules are unaffected (they pinned their skill versions at composition time).
Upgrade: A new version of a skill is loaded as a new registry entry. The old version remains. Module requests that don't pin a version get the latest by default; requests that pin a version get what they pinned. An operator can issue a skill.roll command to drain old-version references and evict the old entry.
Unload: An operator issues skill.unload web/apache-path-traversal@1.2.0. The core marks the skill as "draining." New module requests cannot use it. Running modules that reference it are allowed to finish. When the last reference is released, the skill is evicted from the registry and its compiled assets are dropped from memory.

At no point during any of these operations does the core binary change, restart, or lose in-flight work.
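The draining behavior of unload can be illustrated with reference counting. The `DrainingRegistry` class below is a hypothetical, single-threaded sketch; a real core would need synchronization around the counters.

```python
class DrainingRegistry:
    """Tracks which skills are in use so an unload can wait for running modules."""

    def __init__(self):
        self._skills = {}        # skill_id -> compiled asset
        self._refs = {}          # skill_id -> number of running modules using it
        self._draining = set()   # skill IDs closed to new module requests

    def load(self, skill_id: str, compiled: dict) -> None:
        self._skills[skill_id] = compiled
        self._refs[skill_id] = 0

    def acquire(self, skill_id: str) -> dict:
        """Called at composition time; refused once the skill is draining."""
        if skill_id in self._draining or skill_id not in self._skills:
            raise LookupError(f"{skill_id} is not available for new modules")
        self._refs[skill_id] += 1
        return self._skills[skill_id]

    def release(self, skill_id: str) -> None:
        """Called when a running module finishes with the skill."""
        self._refs[skill_id] -= 1
        self._evict_if_idle(skill_id)

    def unload(self, skill_id: str) -> None:
        if skill_id not in self._skills:
            raise LookupError(skill_id)
        self._draining.add(skill_id)
        self._evict_if_idle(skill_id)

    def is_loaded(self, skill_id: str) -> bool:
        return skill_id in self._skills

    def _evict_if_idle(self, skill_id: str) -> None:
        if skill_id in self._draining and self._refs[skill_id] == 0:
            del self._skills[skill_id]
            del self._refs[skill_id]
            self._draining.discard(skill_id)
```

Eviction happens on whichever call brings the reference count to zero, so an unload of an idle skill takes effect immediately and an unload of a busy skill takes effect when the last module releases it.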

6. Runtime Composition

Composition is the process by which the core assembles a set of skills into a runnable module graph in response to an operator request. This is where the platform's real power lives: an operator does not say "run skill X," they say "assess target Y for risk class Z," and the composition engine selects and wires the appropriate skills.

6.1 The Composition Request

An operator (or an automated trigger, or an LLM agent) submits a composition request:

    # composition request
    target:
      type: web-url
      value: "https://apache-vuln.example.com"
    engagement:
      mode: safe                      # "safe" | "standard" | "aggressive"
      categories: [web, network]      # Optionally constrain; null = all applicable
      skills: null                    # Optionally pin specific skills; null = auto-select
      max_duration_seconds: 1800
    policy:
      rate_limit: 150
      scope: "production-readonly"

6.2 Skill Selection

The composition engine queries the Skill Registry to find skills that:

Declare supported_targets matching the request's target type.
Are in the requested categories (if specified).
Have a safety.impact compatible with the engagement mode (safe allows passive and non-destructive; standard adds active; aggressive adds destructive but still requires explicit confirmation).
Are not blocked by the platform's denylist.
Have all their depends_on satisfied (transitively).

The result is a candidate set. The engine then applies a relevance filter: it scores each candidate by how well its meta.tags and knowledge match the target's fingerprint (technology stack, open ports, HTTP headers, etc.). The target fingerprint is produced by a lightweight recon phase that runs a small set of always-present core skills (port scan, HTTP probe, TLS probe) before the main composition. This is the one place the core has "baked-in" knowledge — and it is deliberately minimal: just enough to fingerprint the target so skill selection is intelligent.
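The hard filters in the selection step can be sketched as follows. The flat dict shape for skills and the `select_candidates` function are assumptions for illustration; dependency resolution and relevance scoring are left out to keep the example short.

```python
# Impact levels permitted by each engagement mode, as described in the text.
MODE_ALLOWS = {
    "safe": {"passive", "non-destructive"},
    "standard": {"passive", "non-destructive", "active"},
    "aggressive": {"passive", "non-destructive", "active", "destructive"},
}


def select_candidates(skills, target_type, mode, categories=None, denylist=frozenset()):
    """Return the IDs of skills that pass every hard filter, in input order."""
    allowed = MODE_ALLOWS[mode]
    selected = []
    for skill in skills:
        if target_type not in skill["supported_targets"]:
            continue
        if categories is not None and skill["category"] not in categories:
            continue  # None means "all applicable categories"
        if skill["impact"] not in allowed:
            continue
        if skill["id"] in denylist:
            continue
        selected.append(skill["id"])
    return selected
```

In aggressive mode a destructive skill passes this filter but, per the text, would still need explicit operator confirmation before execution.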

6.3 Graph Construction

Selected skills are wired into a Directed Acyclic Graph (DAG). Each skill is a node. depends_on creates edges. Skills that produce outputs consumed by other skills' inputs create data-flow edges. The engine performs a topological sort and emits an execution plan:

    ┌────────────────────────────────────────────────────────────────┐
    │                    COMPOSITION DAG EXAMPLE                     │
    │                                                                │
    │   ┌────────────────┐                                           │
    │   │ network/       │                                           │
    │   │ port-scan      │                                           │
    │   └───────┬────────┘                                           │
    │           │ ports[]                                            │
    │           ▼                                                    │
    │   ┌────────────────┐     ┌────────────────┐                    │
    │   │ web/http-      │────▶│ web/apache-    │                    │
    │   │ reachability   │     │ path-traversal │                    │
    │   └───────┬────────┘     └───────┬────────┘                    │
    │           │ reachable?           │ findings[]                  │
    │           ▼                      │                             │
    │   ┌────────────────┐             │     ┌────────────────┐      │
    │   │ web/tls-       │             │     │ report/        │      │
    │   │ cipher-audit   │             │     │ executive-     │      │
    │   └───────┬────────┘             │     │ summary        │      │
    │           │ findings[]           │     └───────▲────────┘      │
    │           │                      │             │               │
    │           └──────────────────────┴─────────────┘               │
    │                     merged findings[]                          │
    └────────────────────────────────────────────────────────────────┘

The DAG is the module. It is what runs. The core executes it node by node, respecting dependencies, concurrency limits, and safety policy. Each node's output feeds into a typed data bus; downstream nodes consume typed inputs. Type mismatches are caught at composition time, not at runtime.
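The topological sort can be illustrated with the standard library's `graphlib`. The `execution_plan` function and the batch representation are assumptions; the node names mirror the example DAG, with edges pointing from each skill to the skills it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def execution_plan(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into batches; every node in a batch has all dependencies met."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    plan = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the plan deterministic
        plan.append(ready)
        sorter.done(*ready)
    return plan


dag = {
    "network/port-scan": set(),
    "web/http-reachability": {"network/port-scan"},
    "web/apache-path-traversal": {"web/http-reachability"},
    "web/tls-cipher-audit": {"web/http-reachability"},
    "report/executive-summary": {"web/apache-path-traversal", "web/tls-cipher-audit"},
}
plan = execution_plan(dag)
```

For the example graph the plan has four batches, and the two web checks that depend only on `web/http-reachability` share a batch, so they can run in parallel.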

6.4 Execution

The core's execution engine runs the DAG:

Schedule: Nodes with no unmet dependencies are scheduled. The core respects each skill's runtime.concurrency limit — if a skill says concurrency: 4, no more than 4 instances of that skill run simultaneously across all active modules.
Invoke: For each node, the core materializes the skill's command with the target and policy context, sends it to the executor (container runtime, process spawner, HTTP client, or inline Python sandbox), and streams output back.
Parse: Raw output is fed to the skill's parser, producing normalized Finding objects.
Match: Patterns are applied to findings and raw output to classify, enrich, and deduplicate.
Publish: Findings are published to the data bus, to the operator console (streamed), and to durable storage (batched).
Report: When the DAG completes (or when an operator requests an interim report), the report templates from each skill are rendered with the accumulated findings, producing a unified report.

6.5 Dynamic Re-Composition

The DAG is not frozen at composition time. The engine supports dynamic re-composition: if a skill's output reveals a new sub-target (e.g., the HTTP probe discovers an API endpoint that wasn't in the initial target spec), the engine can request a sub-composition for that sub-target, pulling in additional skills mid-run. This is how the platform handles the reality that security engagement is inherently exploratory — you don't know what you'll find until you start looking.

Dynamic re-composition is bounded: the engine has a configurable depth limit (default 3) and a total-skill-count limit (default 50 skills per module) to prevent runaway expansion. Every dynamic re-composition is logged and visible to the operator in real time.
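The two bounds can be sketched as a breadth-first expansion. The `expand` function and the `discover` callback (which returns the skills selected for a target plus any newly found sub-targets) are hypothetical; treating the initial target as depth 0 is also an assumption.

```python
from collections import deque


def expand(root, discover, max_depth=3, max_skills=50):
    """Breadth-first sub-composition bounded by depth and total skill count.

    Returns (skills, log); log records every place a limit stopped expansion.
    """
    queue = deque([(root, 0)])
    seen = {root}
    skills, log = [], []
    while queue:
        target, depth = queue.popleft()
        new_skills, sub_targets = discover(target)
        for skill in new_skills:
            if len(skills) >= max_skills:
                log.append(("skill-limit", target))
                return skills, log
            skills.append(skill)
        for sub in sub_targets:
            if sub in seen:
                continue  # never re-compose for a target already visited
            if depth + 1 > max_depth:
                log.append(("depth-limit", sub))
                continue
            seen.add(sub)
            queue.append((sub, depth + 1))
    return skills, log
```

The log is what would be surfaced to the operator, so a truncated expansion is visible rather than silent.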

7. Versioning and Compatibility

The Skill-Based Platform has three independent version axes that must never be conflated:

7.1 Core Version

The core binary has a semver version. It changes rarely — the design goal is "never," but security patches and executor upgrades will necessitate occasional releases. The core version defines:

Which apiVersion of the skill schema it understands (currently skill.platform/v1).
Which executors are available (container, process, http, inline).
Which safety policies are enforced.

A core version bump is a platform-wide event. It is tested against the entire skill library in CI before release. The core maintains backward compatibility with at least one prior major schema version, loading older skills in a compatibility shim.

7.2 Skill Version

Each skill has its own semver version, independent of the core. A skill author bumps the skill version when they change any component — a new CVE reference, a tuned pattern, a parser bugfix. The min_core_version field in the manifest declares the oldest core that can load this skill. This is the only coupling between skill and core, and it is one-directional: a skill declares what it needs; the core never declares what skills it needs.

Skill versioning follows semver with a security-specific convention:

Patch (1.2.0 → 1.2.1): Bugfix in parser, pattern tuning, knowledge doc update. No new command, no new finding class. Safe to auto-upgrade.
Minor (1.2.1 → 1.3.0): New sub-check, new pattern, expanded target support. No breaking change to outputs. Safe to auto-upgrade within an engagement.
Major (1.3.0 → 2.0.0): Breaking change — renamed findings, changed output schema, removed command, changed safety impact. Requires operator acknowledgment before upgrade.
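The upgrade convention reduces to a small classification function. The names `bump_kind` and `can_auto_upgrade` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH versions without pre-release tags.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Classify an upgrade as 'patch', 'minor', or 'major'."""
    o, n = parse_version(old), parse_version(new)
    if n <= o:
        raise ValueError(f"{new} is not newer than {old}")
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


def can_auto_upgrade(old: str, new: str) -> bool:
    """Patch and minor bumps auto-upgrade; major bumps need operator acknowledgment."""
    return bump_kind(old, new) in {"patch", "minor"}
```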

7.3 Library Version

The Skill Library as a whole has a manifest version — a signed list of all skill IDs and versions it currently contains. This is used for reproducibility: an engagement report includes the library manifest hash, so the exact skill set that produced the report can be reconstructed. This is critical for audit and for comparing engagement results over time.
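One way to derive a reproducible manifest hash is to hash a canonical serialization of the skill ID list. The `library_manifest_hash` function, the JSON canonical form, and the choice of SHA-256 are assumptions; the source does not name a hash or format, and signing of the manifest is omitted here.

```python
import hashlib
import json


def library_manifest_hash(skill_ids: list[str]) -> str:
    """Hash the sorted, de-duplicated list of fully qualified skill IDs."""
    canonical = json.dumps(sorted(set(skill_ids)), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting before hashing makes the result independent of load order, so two instances with the same skill set report the same hash.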

7.4 Compatibility Matrix

min_core_version

The key property: no change in any version axis ever breaks a running module. Modules pin their skill versions and core version at composition time. The only way to affect a running module is to explicitly cancel it.

8. The Skill Marketplace

The Skill Library is not just a filesystem — it is a marketplace: a versioned, signed, searchable registry of skills that can be published, discovered, subscribed to, and audited. The marketplace model is what makes "adding new tech = 1–2 hours" scale beyond a single team to an entire ecosystem.

8.1 Roles

Authors: Write skills. Can be internal (platform team, red team, a customer's security team) or external (community, vendors). Authors have identities (OIDC-backed) and signing keys.
Consumers: Platform instances that load skills from the marketplace. A consumer is configured with one or more marketplace subscriptions.
Curators: Marketplace operators who review skills before they are promoted to stable. Curators do not gate publishing to dev or beta, but they gate promotion to stable and verified.
Auditors: Read-only access to the full marketplace history for compliance. Every publish, upgrade, takedown, and download is logged.

8.2 Publication Channels

Every skill exists in one of four channels, which map to its maturity and trust level:

The four channels, in increasing order of maturity and trust, are dev, beta, stable, and verified.

A platform instance's configuration declares which channels it will auto-load from. A production deployment typically auto-loads stable and verified only; a research deployment might add beta.

8.3 Publication Workflow

Author writes skill → local validation (CLI: skill validate)
  ↓
Author signs package (cosign) → publishes to dev channel
  ↓
Author requests promotion to beta → marketplace runs automated checks
  ↓
Curator reviews → promotes to stable (or rejects with feedback)
  ↓
Safety team audits → promotes to verified (adds safety attestation)

Automated checks on promotion include: schema validation, parser sandbox validation, command safety classification, dependency resolution test, and a dry-run composition against a synthetic target to verify the skill produces valid findings without errors.

8.4 Discovery

The marketplace exposes a search API and a web UI. Skills are discoverable by:

Full-text search over meta.name, meta.description, meta.tags.
Category browse.
Target type browse ("what skills can scan a Kubernetes cluster?").
CVE/CWE lookup ("which skills detect CWE-22?").
Dependency graph browse ("what skills depend on web/http-reachability?").

Each skill's marketplace page shows: description, version history, channel, author identity, signing key, safety declarations, dependency graph, sample findings, and download/load counts.

8.5 Subscriptions and Sync

A platform instance subscribes to marketplace feeds. The core's library sync daemon polls (or receives webhooks for) new and updated skills matching the subscription filter. A subscription filter might be: "all verified skills in web and cloud categories, plus stable skills tagged apache or kubernetes." New skills matching the filter are fetched, signature-verified, and hot-loaded per Section 5.3. This is how a deployed platform stays current without anyone touching it.
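The example filter in this section can be modeled as a list of rules, where a skill matches the subscription if any rule matches. The types below (SkillMeta, Rule) and the field names are hypothetical; the actual subscription syntax is not defined in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SkillMeta:
    skill_id: str
    channel: str
    category: str
    tags: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Rule:
    channel: str
    categories: FrozenSet[str] = frozenset()  # empty means "any category"
    tags: FrozenSet[str] = frozenset()        # empty means "any tag"

    def matches(self, skill: SkillMeta) -> bool:
        if skill.channel != self.channel:
            return False
        if self.categories and skill.category not in self.categories:
            return False
        if self.tags and not (self.tags & skill.tags):
            return False
        return True


def matches_subscription(skill: SkillMeta, rules: List[Rule]) -> bool:
    """A skill is synced if at least one rule in the filter matches it."""
    return any(rule.matches(skill) for rule in rules)


# "All verified skills in web and cloud, plus stable skills tagged
# apache or kubernetes", expressed as two rules.
EXAMPLE_RULES = [
    Rule(channel="verified", categories=frozenset({"web", "cloud"})),
    Rule(channel="stable", tags=frozenset({"apache", "kubernetes"})),
]
```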

8.6 Private Skills

Not every skill should be public. A customer's internal security team may write skills for proprietary protocols or custom APIs. The marketplace supports private skills: skill packages published to a private registry that requires authentication. The core's library sync daemon handles private registries with the same pipeline as public ones, adding an auth step to Fetch. Private skills never appear in public search and never propagate to other consumers.

9. Safety Rules and Sandboxing

The Skill-Based Platform runs code from many authors against real targets. This is inherently dangerous. The safety model is defense-in-depth: every layer assumes the layer above it has been compromised.

9.1 Impact Classification

Every skill declares an impact level. The core enforces this as a hard gate — a skill cannot run if its impact exceeds the engagement mode's allowance.

Impact level and the minimum engagement mode that allows it:

passive: safe
non-destructive: safe
active: standard
destructive: aggressive
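The hard gate can be sketched as a rank comparison between the engagement mode and the minimum mode an impact level needs. The pairing used here (passive and non-destructive need safe, active needs standard, destructive needs aggressive) is an assumption read from the mode allowances in Section 9.6, and impact_allowed is a hypothetical name.

```python
# Engagement modes ordered from most to least restrictive.
MODE_RANK = {"safe": 0, "standard": 1, "aggressive": 2}

# Assumed minimum engagement mode for each declared impact level.
MINIMUM_MODE = {
    "passive": "safe",
    "non-destructive": "safe",
    "active": "standard",
    "destructive": "aggressive",
}


def impact_allowed(impact: str, mode: str) -> bool:
    """Return True only if the engagement mode permits the declared impact."""
    if impact not in MINIMUM_MODE or mode not in MODE_RANK:
        return False  # fail closed on anything undeclared or misspelled
    return MODE_RANK[mode] >= MODE_RANK[MINIMUM_MODE[impact]]
```

Failing closed on unknown values matters here: a skill with a typo in its impact field should be blocked, not waved through.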

9.2 Confirmation Gate

Skills that declare requires_confirmation: true (or that are classified as destructive) are held at a confirmation gate. The core pauses the module, surfaces the skill, its impact, its target, and its commands to the operator, and waits for explicit approval. If the operator does not approve within the module's timeout, the skill is skipped. This is non-bypassable: the confirmation gate is in the core, not in the skill.
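A minimal sketch of the gate's wait-with-timeout behavior, assuming operator decisions arrive on an in-process queue. The queue transport and the function name are illustrative assumptions; the sketch also assumes an explicit rejection is treated the same as a timeout.

```python
import queue


def await_confirmation(decisions: "queue.Queue[bool]", timeout_seconds: float) -> str:
    """Block until the operator decides or the timeout elapses.

    Returns 'run' only on explicit approval; a rejection or a timeout
    both result in 'skipped'.
    """
    try:
        approved = decisions.get(timeout=timeout_seconds)
    except queue.Empty:
        return "skipped"  # no answer within the timeout: skip the skill
    return "run" if approved else "skipped"
```

The skill never sees the queue. It only learns the outcome, which is what keeps the gate on the core side of the boundary.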

9.3 Command Sandboxing

Commands run in the executor the skill declares. For the container executor (the most common), the core enforces:

Read-only root filesystem. The skill's container cannot write to its own filesystem except to a tmpfs /tmp. This prevents a compromised skill from persisting payloads.
No host mounts. The container has no access to the host filesystem, host network, host PID space, or host IPC.
Network policy. The container's network is governed by the skill's runtime.container.network declaration:

- none: No network. For offline parsers and static analysis.
- egress-only: Outbound to the target only. The core injects iptables/eBPF rules that restrict egress to the target's IP and port. No inbound. No lateral movement.
- full: Unrestricted network. Requires impact: destructive and operator confirmation. Logged at packet level.

Resource limits. CPU, memory, and PID limits are enforced via cgroups. A skill that OOMs is killed and its failure is recorded; it does not crash the core.
Timeout. Every command has a timeout. A command that exceeds it is killed (SIGKILL after SIGTERM grace period). No skill can hang the platform.
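Several of these constraints map onto standard container runtime flags. The sketch below builds a `docker run` argument list; Docker as the runtime, the function name, and the parameter names are all assumptions, since this section does not name a specific container engine. Egress restriction and timeouts are applied by the core outside the container and are not shown.

```python
from typing import List


def container_argv(image: str, network: str, memory_mb: int,
                   cpus: float, pids_limit: int) -> List[str]:
    """Build a `docker run` argument list for a sandboxed skill command."""
    if network not in ("none", "egress-only", "full"):
        raise ValueError(f"unknown network mode: {network}")
    argv = [
        "docker", "run", "--rm",
        "--read-only",                     # read-only root filesystem
        "--tmpfs", "/tmp",                 # the only writable path
        "--memory", f"{memory_mb}m",       # cgroup memory limit
        "--cpus", str(cpus),               # cgroup CPU limit
        "--pids-limit", str(pids_limit),   # cgroup PID limit
    ]
    if network == "none":
        argv += ["--network", "none"]
    # For egress-only and full, the container keeps a network interface;
    # the egress rules themselves are installed by the core, not here.
    # No -v/--mount flags are ever added, so nothing from the host is mounted.
    argv.append(image)
    return argv
```

The returned list can be handed to `subprocess.run(argv, timeout=...)`, which gives the per-command timeout a natural place to live.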

9.4 Parser Sandboxing

Python parsers run in a restricted execution environment:

No filesystem access except the skill's own package directory (read-only).
No network access.
No subprocess. os.system, subprocess.*, and os.exec* are blocked.
No eval/exec/compile of dynamic code.
No import of unsafe modules. An allowlist permits: json, re, datetime, collections, itertools, platform.sdk. Everything else is denied.
Memory and CPU limits. Parsers run in a worker pool with per-task resource limits. A parser that loops is killed.
Structured output only. A parser must return a list of Finding objects (validated by the core) or raise a ParseError. It cannot return arbitrary objects.
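The import allowlist and the eval/exec/compile ban can be checked statically before a parser is ever executed. The following sketch walks the parser's syntax tree with Python's ast module; check_parser_source is a hypothetical name, rejecting relative imports is an added assumption, and a static check like this is only one layer (it does not catch tricks such as `__import__` or attribute lookups, which the runtime restrictions must cover).

```python
import ast
from typing import List

ALLOWED_MODULES = {"json", "re", "datetime", "collections", "itertools", "platform.sdk"}
BLOCKED_CALLS = {"eval", "exec", "compile"}


def _module_allowed(name: str) -> bool:
    """Allow an allowlisted module or one of its submodules."""
    return any(name == allowed or name.startswith(allowed + ".")
               for allowed in ALLOWED_MODULES)


def check_parser_source(source: str) -> List[str]:
    """Return a list of violations found in parser source; empty means clean."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if not _module_allowed(alias.name):
                    violations.append(f"import of '{alias.name}' is not allowlisted")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # Relative imports are rejected outright (assumption).
            if node.level > 0 or not _module_allowed(module):
                violations.append(f"import from '{module or '.'}' is not allowlisted")
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id in BLOCKED_CALLS):
            violations.append(f"call to '{node.func.id}' is blocked")
    return violations
```

Note that `re.compile(...)` passes, because it is an attribute call on an allowlisted module, while a bare `compile(...)` is flagged.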

9.5 Supply Chain Safety

Every skill package is signed (Sigstore/cosign). The core verifies the signature at Fetch time and checks it against the marketplace's trust store. A skill signed by an unknown key is rejected in production. The marketplace records a transparency log (Rekor-style) of every publish, so the history of a skill is publicly auditable. The core pins the marketplace's signing root in its configuration, which can only be changed by a core restart with a new config file — not by a skill, not by an API call, not by an operator command.

9.6 Safety Policy as Code

The platform's safety policy is itself a declarative document, versioned and auditable:

# safety-policy.yaml
engagement_modes:
  safe:
    allow_impact: [passive, non-destructive]
    require_confirmation_for: []
    max_concurrency: 8
  standard:
    allow_impact: [passive, non-destructive, active]
    require_confirmation_for: [destructive]
    max_concurrency: 16
  aggressive:
    allow_impact: [passive, non-destructive, active, destructive]
    require_confirmation_for: [destructive]
    max_concurrency: 32
denylist:
  skills: []                           # Explicit skill IDs to block
  categories: []                       # Category-level blocks
  authors: []                          # Author-level blocks (trust revocation)
allowlist:
  channels: [stable, verified]         # Only load from these channels
  private_registries: [corp-internal]  # Additional private registries
data_handling:
  redact_secrets_in_reports: true
  redact_pii_in_reports: true
  encrypt_raw_output_at_rest: true
  raw_output_retention_days: 30

The safety policy is loaded at core startup and can be hot-reloaded from disk (the core watches the file). It cannot be modified by a skill. It is the outermost ring of the defense-in-depth model.
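To make the policy's evaluation order concrete, the sketch below applies an already-parsed policy to a single skill and returns one of three outcomes. The decide function, the skill dictionary's keys, and the order of checks (denylist, then channel, then impact, then confirmation) are assumptions for illustration; only the policy keys come from the document above.

```python
from typing import Any, Dict


def decide(skill: Dict[str, Any], mode: str, policy: Dict[str, Any]) -> str:
    """Return 'deny', 'confirm', or 'allow' for a skill under a policy."""
    deny = policy["denylist"]
    if (skill["id"] in deny["skills"]
            or skill["category"] in deny["categories"]
            or skill["author"] in deny["authors"]):
        return "deny"
    if skill["channel"] not in policy["allowlist"]["channels"]:
        return "deny"
    rules = policy["engagement_modes"][mode]
    if skill["impact"] not in rules["allow_impact"]:
        return "deny"
    if (skill["impact"] in rules["require_confirmation_for"]
            or skill.get("requires_confirmation", False)):
        return "confirm"
    return "allow"


# A policy dict shaped like safety-policy.yaml after parsing (subset).
EXAMPLE_POLICY = {
    "engagement_modes": {
        "standard": {
            "allow_impact": ["passive", "non-destructive", "active"],
            "require_confirmation_for": ["destructive"],
        },
        "aggressive": {
            "allow_impact": ["passive", "non-destructive", "active", "destructive"],
            "require_confirmation_for": ["destructive"],
        },
    },
    "denylist": {"skills": [], "categories": [], "authors": ["revoked-author"]},
    "allowlist": {"channels": ["stable", "verified"]},
}
```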

10. Comparison: Legion Plugins vs. CAI Modular Agents vs. Skill-Based Platform

The Skill-Based Platform is not the first system to pursue extensibility without core changes. Two prior art systems are worth comparing in detail: Legion's plugin framework and CAI's modular agents. Both influenced this design; both have properties the Skill-Based Platform deliberately improves upon.

10.1 Legion Plugins

Legion (a security automation framework) uses a plugin system where each plugin is a Python module that registers itself with the framework at import time. Plugins define tasks, hooks, and UI components. The framework's core loads plugins from a plugin directory, calls their register() function, and integrates their capabilities.

Strengths:

Simple mental model: write a Python file, drop it in a directory, it works.
Plugins have full access to the framework's internal APIs, enabling deep integration.
Hot-reloadable in some configurations.

Weaknesses that the Skill-Based Platform addresses:

Plugins are code, not declarations. A Legion plugin is a Python module that runs arbitrary code at import time. This means a plugin can do anything — modify the framework's internals, monkey-patch other plugins, access the filesystem, make network calls — during registration. The Skill-Based Platform's skills are declarative manifests with sandboxed components; the core never executes skill code at load time.
No schema contract. Legion plugins conform to a convention (duck typing), not a schema. If a plugin doesn't implement an expected method, the framework fails at runtime, often with opaque errors. The Skill-Based Platform validates every skill against a JSON Schema at load time; a non-conforming skill is rejected with a structured error before it can affect the system.
Tight coupling. Legion plugins often import each other directly, creating a tangled dependency web. The Skill-Based Platform's depends_on is declarative and resolved by the core, not by the skills themselves. Skills never import each other; they declare what they need and the core wires it.
Safety is opt-in. A Legion plugin can run arbitrary code with no safety declaration. The framework trusts the plugin author. The Skill-Based Platform requires every skill to declare its impact, runs commands in sandboxed executors, and validates parser code against an allowlist. Safety is not optional; it is structural.
Versioning is implicit. Legion plugins have versions, but the framework doesn't manage compatibility. Two plugins that depend on different versions of a shared helper can conflict silently. The Skill-Based Platform's semver + min_core_version + library manifest hash makes every dependency explicit and every composition reproducible.
No marketplace. Legion plugins are distributed as Git repos or Python packages. There is no curated, signed, channel-tiered registry. The Skill-Based Platform's marketplace provides discovery, trust tiers, supply-chain signing, and audit logging.
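The "declare what you need, let the core wire it" idea from the tight-coupling point can be illustrated with a topological sort over declared dependencies. This sketch uses Python's standard graphlib; resolve_order is a hypothetical name, and the dependency between the two example skill IDs is invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List


def resolve_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return a load order in which every skill follows its dependencies.

    Raises ValueError for a dependency that is not in the library and
    graphlib.CycleError for a circular dependency.
    """
    declared = {dep for deps in depends_on.values() for dep in deps}
    missing = declared - depends_on.keys()
    if missing:
        raise ValueError(f"unresolved dependencies: {sorted(missing)}")
    # TopologicalSorter takes node -> predecessors, which is exactly
    # the shape of a depends_on declaration.
    return list(TopologicalSorter(depends_on).static_order())


# Illustrative only: the dependency shown here is an assumption.
EXAMPLE = {
    "web/http-reachability": [],
    "web/apache-path-traversal": ["web/http-reachability"],
}
```

Because the skills only contribute data to this graph, a missing or circular dependency is rejected by the core before any skill code runs.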

10.2 CAI Modular Agents

CAI (a Cybersecurity AI framework) uses a modular agent architecture where different agent types (recon agent, exploit agent, report agent) are composed into a workflow. Each agent is a Python class that implements a specific interface (run(), observe(), act()). Agents communicate via a shared state object. An orchestrator selects and sequences agents based on the task.

Strengths:

Agent abstraction maps well to security workflows (recon → exploit → report).
Agents can be swapped or added without changing the orchestrator.
The shared state object enables rich inter-agent communication.

Weaknesses that the Skill-Based Platform addresses:

Agents are behavioral abstractions, not capability units. A CAI agent bundles behavior (what to do) with implementation (how to do it) in a single Python class. The Skill-Based Platform separates these: a skill declares what it knows and what it does, but the how (execution, sandboxing, scheduling) is the core's job. This separation means the same skill can be run by different executors (container, process, HTTP) without changing the skill.
Agent communication is untyped. CAI agents share a state object that is typically a dict or a loosely-typed context. Type errors surface at runtime. The Skill-Based Platform's composition DAG has typed data-flow edges; type mismatches are caught at composition time.
No declarative knowledge. CAI agents embed their domain knowledge in Python code. The Skill-Based Platform's skills have a knowledge component that is Markdown + structured YAML, readable by both humans and LLM agents. This makes skills useful even when they're not executing — an operator or an AI agent can query the knowledge base directly.
Safety is per-agent, not systemic. A CAI agent implements its own safety checks (if it chooses to). There is no systemic enforcement. The Skill-Based Platform's safety policy is a core-level, non-bypassable gate that applies to every skill uniformly.
Versioning and reproducibility. CAI workflows are reproducible only if you pin every agent's version and the shared state schema. In practice, this is rarely done. The Skill-Based Platform's library manifest hash makes every composition cryptographically reproducible.
LLM integration. CAI agents are often LLM-driven, meaning their behavior is non-deterministic. The Skill-Based Platform supports LLM-driven skills (an ai/ skill can use an LLM for analysis), but the skill itself is deterministic in its declaration — the LLM is a tool the skill uses, not the skill's definition. This keeps the composition graph predictable even when individual nodes use AI.
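The typed data-flow edges mentioned above can be checked before anything executes. The sketch below models each node's ports as name-to-type mappings and reports mismatched edges; the Node type, the port names, and the string-based type labels are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    inputs: Dict[str, str]   # input port name -> type label
    outputs: Dict[str, str]  # output port name -> type label


# (source node, output port, destination node, input port)
Edge = Tuple[str, str, str, str]


def check_edges(nodes: Dict[str, Node], edges: List[Edge]) -> List[str]:
    """Return composition-time errors for unknown ports or type mismatches."""
    errors = []
    for src, out_port, dst, in_port in edges:
        out_type = nodes[src].outputs.get(out_port)
        in_type = nodes[dst].inputs.get(in_port)
        if out_type is None or in_type is None:
            errors.append(f"{src}.{out_port} -> {dst}.{in_port}: unknown port")
        elif out_type != in_type:
            errors.append(
                f"{src}.{out_port} -> {dst}.{in_port}: "
                f"type mismatch ({out_type} != {in_type})"
            )
    return errors
```

An empty result means the graph is safe to schedule; any entry blocks composition, so the mismatch never surfaces mid-engagement.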

10.3 Summary Comparison

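[Table omitted: summary comparison of Legion plugins, CAI modular agents, and the Skill-Based Platform. Surviving cells reference depends_on, knowledge, and report.]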

The fundamental difference is one of philosophy. Legion and CAI both extend a framework by adding code that runs inside it. The Skill-Based Platform extends a core by adding declarations that the core interprets. The core's code never grows; only its data does. This is the property that makes "the core never changes" literally true, not aspirational.

11. Operational Scenarios

11.1 Adding a New Technology (The 1–2 Hour Promise)

A customer reports they've deployed a new IoT protocol gateway using Thread/Matter. The platform has no Thread skill. The timeline:

Minute 0–20: Security engineer reads the Matter specification's threat model. Identifies three checks: (a) commissioning window exposure, (b) default vendor key reuse, (c) BLE pairing replay. Drafts knowledge/overview.md with the CVE references and protocol background.
Minute 20–50: Engineer writes commands/matter-recon.yaml, which wraps a containerized Python script that sends Matter commissioning queries to the target. Writes patterns/matter-findings.yaml with three matchers, one per check. Each matcher maps to a normalized finding class.
Minute 50–70: Engineer writes parser/parse.py (30 lines — parse the command's JSON output into Finding objects). Writes report/template.j2 (a Jinja2 template that renders the three finding types).
Minute 70–80: Engineer writes skill.yaml, runs skill validate locally. Fixes the two validation errors (a missing supported_targets entry and an under-declared impact that the safety classifier caught). Re-validates. Passes.
Minute 80–90: Engineer signs the package and publishes to dev. Runs a dry-run composition against a test target. Findings render correctly.
Minute 90–100: Engineer requests promotion to beta. Automated checks pass. Engineer opens a curator review ticket.
Curator reviews (out of band, ~1 day for stable, ~1 week for verified). The skill is usable in beta immediately by any platform opted into the beta channel.

Total active engineering time: ~100 minutes. Zero core changes. Zero downtime. Zero regression risk. The skill is available to every platform instance subscribed to the beta channel within the sync daemon's poll interval (default 60 seconds).

11.2 Upgrading a Skill Mid-Engagement

An engagement is running web/apache-path-traversal@1.2.0. The skill author publishes 1.2.1 (a pattern tuning fix that reduces false positives). The library sync daemon fetches 1.2.1 and hot-loads it into the registry. The running module is unaffected — it pinned 1.2.0 at composition time. The next module request that doesn't pin a version gets 1.2.1. An operator who wants the running module to benefit from the fix can either restart the module (it will compose with 1.2.1) or issue a skill.roll command, which drains the old version after the current scan pass completes and lets the next pass use the new version.
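The pinning behavior in this scenario comes down to a simple resolution rule: a pinned request gets exactly the version it asked for, and an unpinned request gets the newest loaded version. The sketch below assumes the registry maps each skill ID to its loaded version strings; resolve is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def _version_key(version: str) -> Tuple[int, ...]:
    """Compare versions numerically so that 1.10.0 sorts after 1.9.0."""
    return tuple(int(part) for part in version.split("."))


def resolve(registry: Dict[str, List[str]], skill_id: str,
            pinned: Optional[str] = None) -> str:
    """Return the version a composition should use for skill_id."""
    versions = registry[skill_id]
    if pinned is not None:
        if pinned not in versions:
            raise LookupError(f"{skill_id}@{pinned} is not loaded")
        return pinned  # running modules keep what they pinned
    return max(versions, key=_version_key)  # new requests get the latest
```

For this to hold, a hot-load has to add the new version alongside the old one: 1.2.0 must stay resolvable for as long as a module pins it.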

11.3 Emergency Takedown

A skill is found to have a safety issue — its parser, in an edge case, exfiltrates target data to an external endpoint (a violation of the sandbox rules that slipped through review). The curator issues a takedown command. The marketplace marks the skill as revoked, adds it to the global denylist, and propagates the revocation to all subscribed platform instances within seconds. The core's library sync daemon receives the revocation, unloads the skill immediately (killing any running instances of it with a logged reason), and prevents future loads. An audit log entry is created with the takedown reason, the curator identity, and the timestamp. The skill's author is notified and can publish a fixed version, which goes through the full review pipeline again.
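On the consumer side, the takedown sequence can be sketched as a single registry operation: denylist the skill, unload it, stop its running instances, and record an audit entry. The Registry class and its fields are hypothetical, and stopping an instance is modeled here as removing it from a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Registry:
    loaded: Set[str] = field(default_factory=set)
    denylist: Set[str] = field(default_factory=set)
    running: Dict[str, str] = field(default_factory=dict)  # instance ID -> skill ID
    audit_log: List[dict] = field(default_factory=list)

    def load(self, skill_id: str) -> None:
        if skill_id in self.denylist:
            raise PermissionError(f"{skill_id} is revoked")
        self.loaded.add(skill_id)

    def revoke(self, skill_id: str, reason: str, curator: str,
               timestamp: str) -> List[str]:
        """Apply a revocation and return the IDs of the instances stopped."""
        self.denylist.add(skill_id)    # prevents future loads
        self.loaded.discard(skill_id)  # unloads immediately
        stopped = sorted(i for i, s in self.running.items() if s == skill_id)
        for instance_id in stopped:
            del self.running[instance_id]
        self.audit_log.append({
            "event": "takedown", "skill": skill_id, "reason": reason,
            "curator": curator, "timestamp": timestamp, "stopped": stopped,
        })
        return stopped
```

Adding the skill to the denylist before unloading it closes the window in which a concurrent load could bring it back.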

11.4 Air-Gapped Deployment

A customer runs the platform in an air-gapped environment. They mirror the marketplace to an internal registry on a weekly cadence (a signed bundle of all stable and verified skills). The core's library sync daemon points at the internal registry. All signing keys are pre-distributed. The core operates identically — it never needs internet access. Private skills are published directly to the internal registry. The only difference from a cloud-connected deployment is the sync cadence.

12. Failure Modes and Mitigations

[Table omitted: failure modes and mitigations. Surviving cells reference error, impact, depends_on, class, target, cve, ai/, and confidence: low.]

13. Roadmap

13.1 Phase 1 — Core MVP (Weeks 1–6)

Skill loader with skill.platform/v1 schema validation.
Skill registry with hot-load/unload.
Container and process executors.
Composition DAG with static skill selection (no fingerprinting yet).
Basic safety policy (impact gate, confirmation gate, container sandbox).
10 reference skills across Web and Network categories.

13.2 Phase 2 — Composition Intelligence (Weeks 7–12)

Target fingerprinting (lightweight recon phase).
Dynamic re-composition with depth/count limits.
Typed data-flow edges with compile-time validation.
Skill scoring and relevance filtering.
30 additional skills across all six categories.

13.3 Phase 3 — Marketplace (Weeks 13–20)

Marketplace registry with dev/beta/stable/verified channels.
Sigstore signing and transparency log.
Publication workflow with automated checks.
Discovery API and web UI.
Library sync daemon with subscription filters.

13.4 Phase 4 — Ecosystem (Weeks 21+)

Private registries and air-gapped mirrors.
LLM agent integration (AI agents as skill consumers and as ai/ skill components).
Skill dependency graph visualization and impact analysis.
Community contribution program with curator onboarding.
Performance: parser worker pooling, pattern index optimization, composition cache.

Appendix: Glossary

Skill: The smallest unit of platform capability. A declarative package with knowledge, commands, patterns, parser, and report template.
Skill Library: The collection of all available skills, organized by category and version.
Skill Registry: The in-process, in-memory index of loaded skills. The core's source of truth for what is available.
Core: The never-changing platform binary. Loads skills, composes modules, enforces safety, streams results.
Module: A running instance of a composition DAG. The unit of execution.
Composition: The process of selecting skills and wiring them into a DAG for a given target and engagement.
Finding: The normalized output unit of any skill. Has severity, confidence, target, evidence, class, and CWE.
Engagement Mode: A safety tier (safe, standard, aggressive) that gates which impact levels are allowed.
Impact: A skill's declared potential to affect the target: passive, non-destructive, active, destructive.
Marketplace: The signed, versioned, channel-tiered registry where skills are published, discovered, and subscribed to.
Channel: A maturity/trust tier in the marketplace: dev, beta, stable, verified.
Library Manifest Hash: A cryptographic hash of the library's complete skill inventory at a point in time. Used for reproducibility.
DAG: Directed Acyclic Graph. The structure of a composed module. Nodes are skills; edges are dependencies and data flows.
Executor: The runtime that executes a skill's command: container, process, http, or inline.

This document specifies the Skill-Based Platform architecture as of schema version skill.platform/v1. The design is stable; the skill library is not. That is the point.
