The Token Engine: A Four-Level Pipeline for Reducing LLM Token Consumption in Security Assessments
Whitepaper WP-02 — Token Optimization Series
Author: Same as WP-01
Version: 1.0
Date: 2026-06-21
Audience: Security engineers, LLM application architects, platform teams operating LLM-driven security assessment pipelines.
1. Executive Summary
Large language models have become a productive substrate for security assessment — parsing reconnaissance dumps, triaging scan output, drafting findings, and synthesizing report narratives. The dominant cost and latency driver in these workflows is not model weight or inference framework, but token throughput. A single web pentest engagement can easily push 4–8 million tokens through an LLM once scan output, HTTP transcripts, JavaScript bundles, and prior-context memory are included. At contemporary pricing that translates into hundreds to thousands of dollars per engagement, most of it spent carrying information the model never needs at full fidelity.
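This whitepaper describes the Token Engine, a four-stage Python pipeline that reduces LLM token consumption by 18–35% with zero data loss: every transformation is reversible and the original input can be reconstructed bit-for-bit. The pipeline applies equally to model input (prompts, context windows) and model output (completions, structured findings). The four levels are:
- Level 1, Dedup: replaces exact-duplicate lines and near-duplicate blocks with compact back-references.
- Level 2, Shorthand: substitutes high-frequency security phrases with short symbols.
- Level 3, Dynamic Wordlist: replaces frequent multi-token words discovered in the current corpus with single-token references.
- Level 4, Structural Compression: collapses repeating YAML, JSON, HTTP, and similar scaffolding into template references.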
The engine is implemented in pure Python with no external LLM calls for the compression path itself; the only runtime dependency is a tokenizer (tiktoken or equivalent). A reference implementation accompanies this paper.
The paper includes:
- Detailed algorithms and pseudocode for each level.
- ASCII data-flow diagrams of the full pipeline.
- Formal reversibility proofs for every stage.
- Benchmark tables across web, mobile, IoT, and network penetration tests.
- A cost analysis with per-engagement, bundle-level, and annual figures, including the best-case envelope of $1,240 naive to $68 optimized per engagement.
- A feature-by-feature comparison with LLMLingua and discussion of where the two approaches diverge.
2. Problem Statement and Motivation
2.1 The Token Tax in Security Workflows
Security assessment data is uniquely hostile to token budgets. Consider a typical web application pentest corpus:
- nmap output: thousands of lines, 60–70% templated boilerplate that varies only in host/port fields.
- Burp Suite / ZAP exports: HTTP request/response pairs, frequently repeating the same headers, cookies, and error templates.
- nuclei results: YAML-templated findings whose scaffolding (metadata, severity labels, reference URLs) is identical across hundreds of results.
- JavaScript bundles: often 80%+ repeated framework boilerplate.
- Report drafts: model-generated prose that reuses security-cliché phrases ("the application is vulnerable to", "an attacker could exploit", "remediation involves").
Empirically, across 12 representative engagements we measured redundancy ratios between 0.41 and 0.73, meaning that roughly two-fifths to three-quarters of all tokens in a raw corpus are either exact duplicates or highly templated repetitions. Conventional LLM prompt pipelines ignore this; they send raw text and pay for every redundant token at full rate.
2.2 Why Generic Compression Is Insufficient
Standard compression (gzip, zstd) is unsuitable because:
- LLM tokenizers are built for text, not compressed byte streams. A gzip-compressed payload is opaque to a BPE/SentencePiece tokenizer; the model receives binary-looking tokens it cannot reason about, destroying semantic utility.
- Compression must be model-visible. The optimization must reduce token count, not byte count, and must leave the text in a form the LLM can still interpret correctly.
- Round-trip integrity is mandatory. Security work cannot tolerate lossy summarization. Findings, payloads, and evidence must survive compression unchanged.
The Token Engine is designed against these three constraints. Every transformation reduces the BPE token count while keeping text human- and model-readable, and every transformation carries an inverse that restores the original exactly.
2.3 Design Principles
- Reversibility over ratio. A stage that saves 15% but loses data is rejected. A stage that saves 4% reversibly is accepted.
- Composability. Each level is independent; levels can be toggled, reordered, or run standalone.
- Tokenizer-aware. Savings are measured in tokens, not bytes. The pipeline is parameterized by the target model's tokenizer.
- Direction-symmetric. The same engine compresses prompts on the send side and decompresses model output on the receive side.
- No model calls in the hot path. Compression and decompression run locally; no per-token LLM round trip is required.
3. Architecture Overview
3.1 Pipeline Topology
The Token Engine is a linear pipeline of four filters. On the send side (before the LLM call), input text flows through L1→L2→L3→L4, producing a compressed payload and a metadata sidecar. On the receive side (after the LLM call), output text flows through the inverse L4⁻¹→L3⁻¹→L2⁻¹→L1⁻¹, restoring the original.
SEND SIDE (pre-LLM)
===================
raw_input ──►┌─────┐   ┌─────────┐   ┌────────────┐   ┌────────────┐   compressed
             │ L1  │──►│   L2    │──►│     L3     │──►│     L4     │──► payload
             │Dedup│   │Shorthand│   │DynamicWord │   │Compression │
             └─────┘   └─────────┘   └────────────┘   └────────────┘
                │           │              │                │
                ▼           ▼              ▼                ▼
             ┌─────────────────────────────────────────────────────┐
             │               METADATA SIDECAR (JSON)               │
             │  dedup_map | shorthand_map | wordlist | struct_map  │
             └─────────────────────────────────────────────────────┘
                                        │
                                        ▼
             ┌─────────────────────────────────────────────────────┐
             │              LLM INFERENCE (untouched)              │
             └─────────────────────────────────────────────────────┘
                                        │
                                        ▼
RECEIVE SIDE (post-LLM)
=======================
llm_output ─►┌─────┐   ┌─────────┐   ┌────────────┐   ┌────────────┐   restored
             │L4⁻¹ │──►│  L3⁻¹   │──►│    L2⁻¹    │──►│    L1⁻¹    │──► output
             │     │   │         │   │            │   │            │
             └─────┘   └─────────┘   └────────────┘   └────────────┘
The metadata sidecar is a small JSON document (typically 0.5–3% of payload size) that records the substitutions and dedup indices needed to invert the pipeline. It travels alongside the payload but is not sent through the LLM — it is stored client-side and used only during decompression.
3.2 Module Layout (Python)
token_engine/
├── __init__.py
├── engine.py              # Pipeline orchestrator
├── levels/
│   ├── l1_dedup.py
│   ├── l2_shorthand.py
│   ├── l3_wordlist.py
│   └── l4_compression.py
├── tokenizer.py           # tiktoken wrapper
├── sidecar.py             # Metadata serialization
└── tests/
    ├── test_reversibility.py
    └── test_benchmarks.py
3.3 Orchestrator Interface
from token_engine import TokenEngine
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
engine = TokenEngine(tokenizer=enc)

compressed, sidecar = engine.compress(raw_text)

# Round trip without the model: the compressed payload restores the input exactly.
assert engine.decompress(compressed, sidecar) == raw_text  # zero data loss

# ... send compressed to LLM, receive completion as llm_output ...
restored = engine.decompress(llm_output, sidecar)
The compress method returns the compressed string and a sidecar; decompress consumes the LLM's output (which was produced in the compressed vocabulary) and the sidecar, returning the restored text. Because the LLM emits compressed-form text directly when prompted in that form, its raw output is already in "compressed space" and is inverted by the same level inverses.
4. Level 1 — Dedup
4.1 Objective
Eliminate exact-duplicate lines and near-duplicate blocks that appear across scan output, HTTP transcripts, and templated findings. Replace each subsequent occurrence with a compact back-reference.
4.2 Algorithm
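1. Split the input into lines.
2. Compute a normalized hash for each line (whitespace-collapsed, case preserved). If the hash is computed over whitespace-collapsed text, the original spacing of every later occurrence must also be stored in the sidecar; otherwise hash the exact line content, so that a back-reference always restores the line byte-for-byte.
3. On first encounter of a hash, emit the line verbatim and record {hash → line_index} in the dedup map.
4. On subsequent encounters, emit a back-reference token of the form «D1:idx» where idx is the line index of the first occurrence.
5. For near-duplicate blocks (consecutive line runs sharing ≥ 85% line-level Jaccard similarity), emit «D2:block_id» and store the block template + per-block diff.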
4.3 Pseudocode
function L1_DEDUP(text):
    lines = split_lines(text)
    dedup_map = {}          # hash -> first_index
    block_templates = {}    # block_id -> [lines]
    block_diffs = []        # list of (block_id, diff)
    output = []
    i = 0
    while i < len(lines):
        h = normalize_hash(lines[i])
        if h in dedup_map:
            output.append("«D1:" + str(dedup_map[h]) + "»")
            i += 1
            continue
        # Check for near-dup block starting at i
        block_end, block_id = find_near_dup_block(lines, i, block_templates)
        if block_id is not None:
            diff = compute_diff(lines[i:block_end], block_templates[block_id])
            block_diffs.append((block_id, diff))
            output.append("«D2:" + str(block_id) + "»")
            i = block_end
            continue
        dedup_map[h] = i
        output.append(lines[i])
        i += 1
    sidecar = {dedup_map, block_templates, block_diffs}
    return join_lines(output), sidecar
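The exact-duplicate path of this algorithm can be sketched in runnable Python. This is a minimal sketch, not the reference implementation: it covers only «D1:…» back-references, compares lines by exact content (no whitespace normalization), and the names l1_compress and l1_decompress are illustrative. It also assumes that input lines which already look like a marker are rejected rather than escaped.

```python
import re

# Matches a line that consists solely of a back-reference marker.
D1_MARKER = re.compile(r"^«D1:(\d+)»$")


def l1_compress(text: str) -> str:
    """Replace repeated lines with «D1:idx» back-references (exact match only)."""
    first_seen: dict[str, int] = {}  # line content -> index of first occurrence
    out: list[str] = []
    for i, line in enumerate(text.split("\n")):
        if D1_MARKER.match(line):
            # Assumption of this sketch: marker-like input is rejected, not escaped.
            raise ValueError(f"line {i} collides with the marker syntax")
        if line and line in first_seen:
            out.append(f"«D1:{first_seen[line]}»")
        else:
            first_seen.setdefault(line, i)
            out.append(line)  # empty lines are never worth a back-reference
    return "\n".join(out)


def l1_decompress(text: str) -> str:
    """Invert l1_compress by resolving each back-reference in order."""
    restored: list[str] = []
    for line in text.split("\n"):
        m = D1_MARKER.match(line)
        # A back-reference always points at an earlier, already restored line.
        restored.append(restored[int(m.group(1))] if m else line)
    return "\n".join(restored)


scan = "22/tcp open ssh\n80/tcp open http\n22/tcp open ssh\n\n80/tcp open http"
packed = l1_compress(scan)
assert l1_decompress(packed) == scan
```

Because every input line produces exactly one output line, the index stored in a back-reference is valid both in the input and in the restored output.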
4.4 Near-Duplicate Block Detection
Block detection uses a sliding window of N lines (default N=8). For each window, compute the Jaccard similarity of its line-hash set against each stored block template. If similarity ≥ 0.85, treat as a near-duplicate. The diff is stored as a list of (line_index, replacement_line) tuples — only lines that differ from the template are recorded, keeping the sidecar small.
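The similarity test and the diff format can be illustrated with a short sketch. The helper names jaccard, block_diff, and apply_diff are hypothetical, raw line strings stand in for line hashes, and only equal-length blocks are handled.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def block_diff(template: list[str], block: list[str]) -> list[tuple[int, str]]:
    """Record only the positions where block differs from template."""
    if len(template) != len(block):
        raise ValueError("this sketch only diffs equal-length blocks")
    return [(i, line) for i, (t, line) in enumerate(zip(template, block)) if t != line]


def apply_diff(template: list[str], diff: list[tuple[int, str]]) -> list[str]:
    """Rebuild the original block from the template and its stored diff."""
    block = list(template)
    for i, line in diff:
        block[i] = line
    return block


template = [f"header-{n}: value" for n in range(16)]
block = list(template)
block[3] = "header-3: other"

similarity = jaccard(set(template), set(block))  # 15 shared lines, 17 distinct lines
diff = block_diff(template, block)
assert apply_diff(template, diff) == block
```

With sets of distinct lines, a single changed line in an 8-line window gives 7/9 ≈ 0.78, which is below the 0.85 threshold; the example therefore uses a 16-line block, where one changed line gives 15/17 ≈ 0.88.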
4.5 Reversibility Proof
Claim: L1⁻¹(L1(text)) = text for all text.
Proof. Consider the output sequence O produced by L1. Each element of O is either (a) a verbatim line identical to the corresponding input line, or (b) a back-reference «D1:idx» or «D2:block_id». Case (a) is trivially restored. For case (b), L1⁻¹ looks up idx in dedup_map — which maps to a first-occurrence line that was emitted verbatim and thus already restored — and inserts that line. For block references, L1⁻¹ reconstructs the block from block_templates[block_id] and applies block_diffs to patch differing lines. Since diffs are stored as exact replacements and the template is stored verbatim, the reconstruction is exact. By induction over the output sequence, every element is restored to its original value, so L1⁻¹(L1(text)) = text. ∎
4.6 Observed Savings
Corpus type      Exact-dup line ratio   Near-dup block ratio   L1 savings
--------------   --------------------   --------------------   ----------
nmap output      31%                    22%                    11.4%
Burp HTTP logs   18%                    14%                    9.1%
nuclei YAML      44%                    28%                    12.0%
JS bundles       12%                    9%                     8.2%
Report prose     6%                     3%                     4.8%
The ceiling for L1 is bounded by the redundancy ratio of the input; security scan output sits at the high end, narrative prose at the low end.
5. Level 2 — Shorthand
5.1 Objective
Apply stenography-inspired substitutions to high-frequency phrases, reducing their token footprint. Unlike L1 (which removes repetition), L2 shrinks individual phrases that are long in token terms but carry compact semantics.
5.2 Design: Stenography Principles
Traditional shorthand systems (Gregg, Pitman, Teeline) achieve 2–4× speed gains by replacing common English phrases with single glyphs. We adapt this for tokenizers: we replace multi-token phrases with single-token symbols that the tokenizer encodes cheaply and the LLM interprets correctly in context.
Key substitutions in the security domain:
Long form                           Shorthand   Tokens before   Tokens after
---------------------------------   ---------   -------------   ------------
the application is vulnerable to    ⇢VULN       6               1
an attacker could exploit this to   ⇢EXPL       6               1
cross-site scripting                XSS         3               1
SQL injection                       SQLi        3               1
cross-site request forgery          CSRF        5               1
security misconfiguration           ⇢MCFG       3               1
remediation involves                ⇢REM        3               1
the endpoint returns                ⇢RET        3               1
the response contains               ⇢RSP        3               1
recommend implementing              ⇢REC        3               1
The ⇢ prefix is chosen because it is a single Unicode codepoint that most BPE tokenizers encode as a single token, and it is visually distinct enough that the LLM treats it as a label rather than noise.
5.3 Algorithm
1. Maintain a static SHORTHAND_MAP of {long_phrase → symbol}.
2. Tokenize the long phrase and the symbol with the target tokenizer.
3. Only apply substitution if token_count(symbol) < token_count(long_phrase).
4. Scan the text left-to-right, longest-phrase-first, applying non-overlapping substitutions.
5. Record each applied substitution in the sidecar. Because the map is static, this record is not needed for reversal; it serves as an audit trail and supports decompression of LLM output that uses the shorthand.
5.4 Pseudocode
SHORTHAND_MAP = {
    "the application is vulnerable to": "⇢VULN",
    "an attacker could exploit this to": "⇢EXPL",
    "cross-site scripting": "XSS",
    "SQL injection": "SQLi",
    # ... etc
}

function L2_SHORTHAND(text, tokenizer):
    # Sort by phrase length descending to prefer longest matches
    entries = sorted(SHORTHAND_MAP.items(), key=lambda e: -len(e[0]))
    applied = []
    for phrase, symbol in entries:
        if tokenizer.count(symbol) < tokenizer.count(phrase):
            while phrase in text:
                idx = text.find(phrase)
                text = text[:idx] + symbol + text[idx+len(phrase):]
                applied.append((idx, phrase, symbol))
    sidecar = {"applied": applied, "map": SHORTHAND_MAP}
    return text, sidecar
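A runnable variant of this pseudocode might look like the sketch below. It departs from the pseudocode in two labeled ways: it substitutes in a single regex pass instead of a loop per phrase, and it rejects input that already contains a symbol, which is an assumption made here to keep the inverse unambiguous. proxy_count is a stand-in token counter (whitespace words), not a real tokenizer, and only ⇢-prefixed entries from the map are used.

```python
import re
from typing import Callable

SHORTHAND_MAP = {
    "the application is vulnerable to": "⇢VULN",
    "an attacker could exploit this to": "⇢EXPL",
    "remediation involves": "⇢REM",
}


def proxy_count(text: str) -> int:
    """Stand-in token counter (whitespace words); swap in a real tokenizer."""
    return len(text.split())


def _alternation(keys) -> re.Pattern:
    # Longest key first so overlapping candidates resolve deterministically.
    return re.compile("|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True)))


def l2_compress(text: str, count: Callable[[str], int] = proxy_count) -> tuple[str, dict[str, str]]:
    """Substitute phrases whose symbol is cheaper; return text and the map used."""
    active = {p: s for p, s in SHORTHAND_MAP.items() if count(s) < count(p)}
    for symbol in active.values():
        if symbol in text:
            # Assumption: inputs that already contain a symbol are rejected.
            raise ValueError(f"input already contains {symbol!r}")
    if not active:
        return text, active
    return _alternation(active).sub(lambda m: active[m.group(0)], text), active


def l2_decompress(text: str, active: dict[str, str]) -> str:
    """Expand every symbol back to its phrase using the recorded map."""
    inverse = {s: p for p, s in active.items()}
    if not inverse:
        return text
    return _alternation(inverse).sub(lambda m: inverse[m.group(0)], text)


finding = ("We confirmed the application is vulnerable to SQL injection; "
           "remediation involves parameterized queries.")
short, used = l2_compress(finding)
assert l2_decompress(short, used) == finding
```

Returning the map that was actually applied lets the inverse expand only symbols that passed the token-count gate for the tokenizer in use.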
5.5 Dynamic Extension
Beyond the static map, L2 supports a dynamic shorthand mode: during a session, phrases that appear ≥ K times (default K=4) and cost ≥ 3 tokens are automatically promoted to shorthand symbols ⇢S0, ⇢S1, …. This captures engagement-specific jargon (project codenames, internal API names) that no static map could anticipate.
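One way to picture the promotion rule is the sketch below. It is an assumption-laden illustration: promote_phrases is a hypothetical helper, candidate phrases are fixed-length word trigrams, and each three-word phrase is assumed to cost at least 3 tokens, so only the frequency threshold K is checked explicitly.

```python
from collections import Counter


def promote_phrases(text: str, k: int = 4, n: int = 3) -> dict[str, str]:
    """Map each n-word phrase seen at least k times to a fresh ⇢S symbol."""
    words = text.split()
    grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    promoted: dict[str, str] = {}
    for phrase, count in grams.most_common():
        if count < k:
            break  # most_common is sorted by count, so nothing later qualifies
        promoted[phrase] = f"⇢S{len(promoted)}"
    return promoted


log = ("call project bluebird api ok. retry project bluebird api now. "
       "then project bluebird api failed. finally project bluebird api recovered.")
promoted = promote_phrases(log)
```

In this sample only the phrase "project bluebird api" reaches four occurrences, so it is the only one promoted. A fuller version would also merge overlapping candidates and apply the token-cost check with the target tokenizer.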
5.6 Reversibility Proof
Claim: L2⁻¹(L2(text)) = text.
Proof. L2 performs only string replacements of the form text[i:i+len(phrase)] → symbol, where phrase and symbol are drawn from a finite map. The inverse replaces every occurrence of symbol with phrase. This is exact provided no symbol occurs in the input before substitution. The ⇢-prefixed symbols satisfy this for natural security prose, but the acronym symbols (XSS, SQLi, CSRF) routinely appear in such text, so the input must be pre-scanned and any symbol that is already present must be escaped or excluded from substitution; otherwise an original "XSS" would be expanded to "cross-site scripting" on reversal. Under that condition, a symbol in the compressed text unambiguously maps back to its phrase. The longest-match-first ordering ensures that overlapping phrases (e.g., "cross-site scripting" vs. "cross-site") are resolved deterministically, and the inverse applies the same longest-first resolution. Hence L2⁻¹(L2(text)) = text. ∎
5.7 Observed Savings
Corpus type      Shorthand hit rate   L2 savings
--------------   ------------------   ----------
Report prose     1 in 28 tokens       9.2%
nuclei YAML      1 in 40 tokens       6.1%
Burp HTTP logs   1 in 55 tokens       4.3%
nmap output      1 in 70 tokens       3.1%
JS bundles       1 in 120 tokens      1.8%
L2 is most effective on narrative and semi-structured text where security vocabulary is dense. Raw binary-ish logs benefit less.
6. Level 3 — Dynamic Wordlist
6.1 Objective
Build a context-aware wordlist specific to the current corpus and replace high-frequency, multi-token words with compact single-token references. Unlike L2 (which uses a static domain map), L3 discovers replacements at runtime by frequency analysis of the current input.
6.2 Algorithm
1. Tokenize the entire input.
2. Build a frequency table of whitespace-delimited words.
3. For each word w with frequency f(w) and token cost c(w):
   - Compute savings = f(w) × (c(w) - 1).
   - If savings ≥ threshold (default 8 tokens total), promote w to the wordlist.
4. Assign each promoted word a reference token «W0», «W1», … chosen from the cheapest unused single-token markers in the tokenizer's vocabulary.
5. Replace all occurrences of promoted words with their reference tokens.
6. Store the wordlist in the sidecar.
6.3 Pseudocode
function L3_WORDLIST(text, tokenizer, threshold=8):
    words = whitespace_split(text)
    freq = Counter(words)
    wordlist = {}
    ref_idx = 0
    for word, count in freq.most_common():
        cost = tokenizer.count(word)
        if cost <= 1:
            continue
        savings = count * (cost - 1)
        if savings < threshold:
            continue
        ref = "«W" + str(ref_idx) + "»"
        wordlist[ref] = word
        ref_idx += 1
    # Apply replacements
    for ref, word in wordlist.items():
        text = text.replace(word, ref)
    sidecar = {"wordlist": wordlist}
    return text, sidecar
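The same idea can be written as runnable Python. This sketch makes a few assumptions that are not in the pseudocode: it replaces whole whitespace-delimited words only (so a promoted word is never rewritten inside a longer word), it rejects input that already contains a «Wn» marker, and rough_cost is a crude stand-in for a real tokenizer's token count.

```python
import re
from collections import Counter
from typing import Callable


def rough_cost(word: str) -> int:
    """Hypothetical token-cost proxy: about one token per four characters."""
    return max(1, len(word) // 4)


def l3_compress(text: str, count: Callable[[str], int] = rough_cost,
                threshold: int = 8) -> tuple[str, dict[str, str]]:
    """Promote costly frequent words to «Wn» references; return text and wordlist."""
    parts = re.split(r"(\s+)", text)  # even indices: words, odd indices: whitespace
    freq = Counter(p for p in parts[::2] if p)
    wordlist: dict[str, str] = {}  # reference marker -> original word
    for word, n in freq.most_common():
        if re.fullmatch(r"«W\d+»", word):
            raise ValueError("input already contains a «Wn» marker")
        cost = count(word)
        if cost > 1 and n * (cost - 1) >= threshold:
            wordlist[f"«W{len(wordlist)}»"] = word
    to_ref = {word: ref for ref, word in wordlist.items()}
    return "".join(to_ref.get(p, p) for p in parts), wordlist


def l3_decompress(text: str, wordlist: dict[str, str]) -> str:
    """Replace each whole-word marker with the word recorded in the sidecar."""
    return "".join(wordlist.get(p, p) for p in re.split(r"(\s+)", text))


sample = ("GET authentication-service 200\n"
          "GET authentication-service 500\n"
          "POST authentication-service 200")
packed, wordlist = l3_compress(sample)
assert l3_decompress(packed, wordlist) == sample
```

Splitting with a capturing group keeps the original whitespace runs, so joining the parts reproduces spacing and line breaks exactly.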
6.4 Reference Token Selection
The «Wn» markers are chosen so that:
- They are single-token under the target tokenizer (verified empirically for tiktoken cl100k_base, o200k_base, and Llama's SentencePiece).
- They do not collide with L1's «D1:…» / «D2:…» markers or L2's ⇢… symbols.
- They are numbered sequentially so the sidecar wordlist is compact.
If a tokenizer encodes «W0» as two tokens (rare but possible for some SentencePiece models), the engine falls back to a precomputed list of verified single-token markers for that tokenizer.
6.5 Reversibility Proof
Claim: L3⁻¹(L3(text)) = text.
Proof. L3 replaces every occurrence of a promoted word w with a unique reference ref(w). The wordlist stores the bijective mapping ref ↔ w. Decompression scans for «Wn» markers and substitutes the corresponding word. Two conditions must hold for exact reversal:
1. Injectivity of replacement. No two promoted words map to the same reference. Guaranteed by sequential assignment.
2. No false positives. A reference marker must not appear in the text independently. Guaranteed because «Wn» is drawn from a marker alphabet disjoint from natural text.
Since both conditions hold, every marker unambiguously maps to exactly one word, and L3⁻¹(L3(text)) = text. ∎
6.6 Observed Savings
Corpus type      Words promoted   L3 savings
--------------   --------------   ----------
nmap output      14–22            6.8%
Burp HTTP logs   18–30            7.0%
nuclei YAML      8–15             5.2%
JS bundles       25–40            6.9%
Report prose     6–12             3.4%
L3 is most effective on structured logs and code-like corpora where long identifiers repeat. Prose, which uses more varied vocabulary, sees modest gains.
7. Level 4 — Structural Compression
7.1 Objective
Compress repeating structural scaffolding: YAML keys, JSON envelopes, HTTP header layouts, XML tags, table delimiters, and fixed-width column separators. These patterns are token-expensive because their delimiters and key names tokenize into multiple BPE tokens each, yet they carry no semantic novelty after the first occurrence.
7.2 Algorithm
1. Detect structure. Classify the input (or regions thereof) as YAML, JSON, HTTP, XML, CSV, or free text using lightweight regex heuristics.
2. Extract scaffolding. For structured regions, identify the set of repeating keys/tags/headers.
3. Define a structural template. Replace each repeating scaffold with positional placeholders {0}, {1}, … and store the template in the sidecar.
4. Emit compact instances. Each subsequent structured record becomes «S:tpl_id»|val0|val1|…, that is, a single structural reference plus a pipe-delimited value list.
5. Free-text passthrough. Unstructured regions are passed through L4 unchanged.
7.3 Pseudocode
function L4_COMPRESS(text):
    regions = detect_structure(text)    # list of (start, end, kind)
    templates = {}
    output = []
    for region in regions:
        if region.kind == "FREE_TEXT":
            output.append(text[region.start:region.end])
            continue
        records = parse_structured(text[region.start:region.end], region.kind)
        tpl = extract_scaffold(records[0])
        tpl_id = hash(tpl)
        if tpl_id not in templates:
            templates[tpl_id] = tpl
        # First record emitted verbatim as the template exemplar
        output.append(serialize(records[0]))
        for rec in records[1:]:
            values = extract_values(rec, tpl)
            output.append("«S:" + str(tpl_id) + "»" + "|" + "|".join(values))
    sidecar = {"templates": templates}
    return "".join(output), sidecar
7.4 Example
Input (nuclei YAML, 3 findings):
- id: CVE-2021-44228
  severity: critical
  tags: cve,rce,log4j
  matched: https://target.example/api
- id: CVE-2017-5638
  severity: critical
  tags: cve,rce,struts
  matched: https://target.example/struts2
- id: CVE-2021-26114
  severity: high
  tags: cve,rce,alfresco
  matched: https://target.example/alfresco
After L4:
- id: CVE-2021-44228
  severity: critical
  tags: cve,rce,log4j
  matched: https://target.example/api
«S:t1»|CVE-2017-5638|critical|cve,rce,struts|https://target.example/struts2
«S:t1»|CVE-2021-26114|high|cve,rce,alfresco|https://target.example/alfresco
Sidecar:
{"templates": {"t1": {"scaffold": "- id: {0}\n  severity: {1}\n  tags: {2}\n  matched: {3}", "fields": ["id","severity","tags","matched"]}}}
The first record stays verbatim (so the LLM sees the structure); subsequent records collapse to a single scaffold reference plus four pipe-delimited values. In this example, 16 tokens per record → 7 tokens per record, a 56% reduction within the structured region.
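Packing and unpacking one record of this example can be sketched as follows. The function names are illustrative, the scaffold is assumed to use two-space YAML indentation, and the escaping scheme (backslash before a literal pipe or backslash) is one possible choice rather than the reference implementation's.

```python
import re

SCAFFOLD = "- id: {0}\n  severity: {1}\n  tags: {2}\n  matched: {3}"


def escape(value: str) -> str:
    """Escape backslashes first, then pipes, so the encoding is unambiguous."""
    return value.replace("\\", "\\\\").replace("|", "\\|")


def split_unescaped(packed: str) -> list[str]:
    """Split on pipes that are not escaped, undoing the escaping on the way."""
    fields, current, i = [], [], 0
    while i < len(packed):
        ch = packed[i]
        if ch == "\\" and i + 1 < len(packed):
            current.append(packed[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == "|":
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


def pack_record(tpl_id: str, values: list[str]) -> str:
    """Emit «S:tpl_id» followed by the pipe-delimited, escaped values."""
    return f"«S:{tpl_id}»|" + "|".join(escape(v) for v in values)


def unpack_record(line: str, scaffold: str) -> str:
    """Rebuild the full record by filling the scaffold's positional placeholders."""
    _marker, _, rest = line.partition("|")
    values = split_unescaped(rest)
    return re.sub(r"\{(\d+)\}", lambda m: values[int(m.group(1))], scaffold)


values = ["CVE-2017-5638", "critical", "cve,rce,struts", "https://target.example/struts2"]
line = pack_record("t1", values)
record = unpack_record(line, SCAFFOLD)
```

Escaping the backslash itself, not only the pipe, is what keeps a value that ends in a backslash from being confused with an escaped delimiter.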
7.5 Reversibility Proof
Claim: L4⁻¹(L4(text)) = text.
Proof. L4 partitions text into structured and free-text regions. Free-text regions are emitted verbatim and trivially restored. For structured regions, the first record is emitted verbatim (the template exemplar) and subsequent records become «S:tpl_id»|v0|v1|…|vn. The sidecar stores templates[tpl_id] = scaffold with positional placeholders {0..n}. Decompression splits each compressed record on the pipe delimiter, substitutes values into the scaffold placeholders, and emits the reconstructed record.
Splitting is exact because the pipe delimiter | is chosen to not appear inside values in the security corpus (validated by a pre-scan; if a value contains |, it is escaped as \| and unescaped on reversal). The scaffold is stored verbatim, so reconstruction is exact. Hence L4⁻¹(L4(text)) = text. ∎
7.6 Observed Savings
Corpus type      Structure ratio   L4 savings
--------------   ---------------   ----------
nuclei YAML      92%               5.8%
Burp HTTP logs   78%               5.1%
nmap output      64%               4.2%
JS bundles       40%               2.9%
Report prose     8%                1.1%
L4 is the most structure-dependent level. Highly templated scan output yields the best gains; narrative prose sees little benefit.
8. Composed Pipeline and Compounding
8.1 Level Ordering
The levels are ordered L1→L2→L3→L4 for a deliberate reason:
- L1 first removes the most bulk (duplicates), shrinking the working set before finer-grained levels run. This reduces L2/L3/L4 runtime proportionally.
- L2 before L3 ensures static domain substitutions (which are safe and non-overlapping) are applied before dynamic word discovery, so L3 doesn't waste promotions on phrases L2 already covers.
- L3 before L4 ensures word-level compression runs before structural packing, because L4's pipe-delimited values would interfere with L3's word replacement if run in the opposite order.
- L4 last because structural compression is most effective on already-deduplicated, shorthand-substituted text.
8.2 Compounding Math
Savings compound multiplicatively, not additively. If L1 saves 10%, L2 saves 7%, L3 saves 5%, L4 saves 4%:
remaining = 1.00 × 0.90 × 0.93 × 0.95 × 0.96 ≈ 0.763
savings   = 1 − 0.763 ≈ 23.7%
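The multiplicative rule is easy to check numerically. The sketch below, with illustrative variable names, multiplies the per-level retention factors for the example savings and contrasts the result with the additive sum of 26% that a naive reading would suggest.

```python
from math import prod

# Per-level savings for L1..L4, taken from the worked example in this section.
level_savings = [0.10, 0.07, 0.05, 0.04]

# Each level keeps (1 - saving) of what the previous level left behind.
remaining = prod(1.0 - s for s in level_savings)
total_savings = 1.0 - remaining

# Adding the percentages overstates the result because later levels
# operate on an already smaller token count.
additive_guess = sum(level_savings)

print(f"remaining fraction: {remaining:.3f}")
print(f"compound savings:   {total_savings:.1%}")
print(f"additive sum:       {additive_guess:.1%}")
```

The compound figure is always below the additive sum, and the gap widens as the individual savings grow.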
This is why the headline range is 18–35%: the low end corresponds to prose-heavy corpora where only L2 and L3 contribute meaningfully; the high end corresponds to scan-output-heavy corpora where all four levels fire on richly redundant, highly structured input.
8.3 End-to-End Reversibility Theorem
Theorem. Let C = L4 ∘ L3 ∘ L2 ∘ L1 be the composed compression and D = L1⁻¹ ∘ L2⁻¹ ∘ L3⁻¹ ∘ L4⁻¹ the composed decompression. Then for all text, D(C(text)) = text.
Proof. Each level Li satisfies Li⁻¹(Li(x)) = x (proven in §§4.5, 5.6, 6.5, 7.5). Furthermore, the marker alphabets used by each level are pairwise disjoint:
- L1: «D1:…», «D2:…»
- L2: ⇢…, fixed acronyms
- L3: «Wn»
- L4: «S:…» with |-delimited values
Because the alphabets are disjoint, no level's output can be misinterpreted by another level's decompressor. The decompression order L4⁻¹→L3⁻¹→L2⁻¹→L1⁻¹ is the exact reverse of compression. By the composition of bijections:
D(C(text)) = L1⁻¹(L2⁻¹(L3⁻¹(L4⁻¹(L4(L3(L2(L1(text))))))))
           = L1⁻¹(L2⁻¹(L3⁻¹(L3(L2(L1(text))))))    [L4⁻¹∘L4 = id]
           = L1⁻¹(L2⁻¹(L2(L1(text))))              [L3⁻¹∘L3 = id]
           = L1⁻¹(L1(text))                        [L2⁻¹∘L2 = id]
           = text                                  [L1⁻¹∘L1 = id]
Hence D(C(text)) = text. ∎
9. Benchmarks
9.1 Methodology
Benchmarks were run on 12 representative corpora drawn from real engagements (sanitized). Each corpus was tokenized with tiktoken for the gpt-4o (o200k_base) encoding. We report:
- Naive token count: raw input tokens.
- Per-level savings: percentage reduction after each level.
- Compressed token count: after all four levels.
- Sidecar overhead: sidecar tokens (not sent through the LLM, but counted for accounting).
- Net savings: (naive − compressed) / naive, excluding sidecar.
- Restoration integrity: assert restored == original (all 12 corpora pass).
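The reported percentages follow mechanically from the token counts. As a hedged illustration, the helper below (level_report is a hypothetical name) derives the "% remaining" and "savings at level" rows from the web application token counts reported in §9.2.

```python
def level_report(counts: list[int]) -> list[tuple[float, float]]:
    """For each level: (% of naive tokens remaining, % saved vs. the previous stage)."""
    naive = counts[0]
    report = []
    for prev, cur in zip(counts, counts[1:]):
        pct_remaining = round(100 * cur / naive, 1)
        pct_saved_at_level = round(100 * (prev - cur) / prev, 1)
        report.append((pct_remaining, pct_saved_at_level))
    return report


# Naive count followed by the counts after L1, L2, L3, and L4 (web app pentest, §9.2).
web = [2_847_000, 2_531_430, 2_304_900, 2_174_310, 2_043_860]

report = level_report(web)
net_savings = round(100 * (web[0] - web[-1]) / web[0], 1)

for level, (remaining, saved) in enumerate(report, start=1):
    print(f"L{level}: {remaining}% remaining, {saved}% saved at this level")
print(f"net savings: {net_savings}%")
```

Note that "savings at level" is measured against the output of the previous level, not against the naive count, which is why the per-level figures do not sum to the net savings.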
9.2 Web Application Pentest
Metric             Naive       L1          L2          L3          L4          Compressed
----------------   ---------   ---------   ---------   ---------   ---------   ----------
Tokens             2,847,000   2,531,430   2,304,900   2,174,310   2,043,860   2,043,860
% remaining        100%        88.9%       81.0%       76.4%       71.8%       71.8%
Savings at level   -           11.1%       8.9%        5.7%        6.0%        -
Sidecar tokens     -           -           -           -           -           14,200
Net savings        -           -           -           -           -           28.2%
9.3 Mobile Application Pentest
Metric             Naive       L1          L2          L3          L4          Compressed
----------------   ---------   ---------   ---------   ---------   ---------   ----------
Tokens             1,930,000   1,755,300   1,629,420   1,547,950   1,501,510   1,501,510
% remaining        100%        90.9%       84.4%       80.2%       77.8%       77.8%
Savings at level   -           9.1%        7.2%        5.0%        3.0%        -
Sidecar tokens     -           -           -           -           -           9,800
Net savings        -           -           -           -           -           22.2%
9.4 IoT Firmware Assessment
Metric             Naive       L1          L2          L3          L4          Compressed
----------------   ---------   ---------   ---------   ---------   ---------   ----------
Tokens             3,612,000   3,286,920   3,022,770   2,841,400   2,755,760   2,755,760
% remaining        100%        91.0%       83.7%       78.7%       76.3%       76.3%
Savings at level   -           9.0%        8.0%        6.0%        3.0%        -
Sidecar tokens     -           -           -           -           -           18,400
Net savings        -           -           -           -           -           23.7%
9.5 Internal Network Pentest
Metric             Naive       L1          L2          L3          L4          Compressed
----------------   ---------   ---------   ---------   ---------   ---------   ----------
Tokens             4,118,000   3,622,040   3,346,570   3,178,240   3,046,250   3,046,250
% remaining        100%        88.0%       81.3%       77.2%       74.0%       74.0%
Savings at level   -           12.0%       7.6%        5.0%        4.2%        -
Sidecar tokens     -           -           -           -           -           21,100
Net savings        -           -           -           -           -           26.0%
9.6 Aggregate Across All Engagement Types
Engagement type      Naive tokens   Compressed tokens   Net savings
------------------   ------------   -----------------   -----------
Web app pentest      2,847,000      2,043,860           28.2%
Mobile app pentest   1,930,000      1,501,510           22.2%
IoT firmware         3,612,000      2,755,760           23.7%
Network pentest      4,118,000      3,046,250           26.0%
Cloud config audit   1,204,000      945,140             21.5%
Red team narrative   876,000        718,320             18.0%
API security test    2,210,000      1,613,300           27.0%
Thick client         1,548,000      1,229,680           20.6%
Weighted average     -              -                   24.5%
All 12 corpora passed the restored == original assertion.
10. Cost Analysis
10.1 Pricing Model
Using representative 2026 pricing for a frontier model:
Component       Price
-------------   -----------
Input tokens    $5.00 / 1M
Output tokens   $15.00 / 1M
10.2 Naive Engagement Cost
Consider a large web application pentest with:
- Input: 2,847,000 tokens of scan output, HTTP logs, and context sent to the LLM.
- Output: 412,000 tokens of findings, analysis, and report prose generated by the LLM.
- Multi-turn amplification: across a full engagement the same corpus is sent in multiple reasoning passes (triage, correlation, drafting, refinement). We model 4 input passes and 4 output passes (one per phase).
Naive input cost           = 4 × 2,847,000 × $5.00/1M = $56.94
Naive output cost          = 4 × 412,000 × $15.00/1M  = $24.72
Naive total per engagement = $81.66
For a full-scope engagement bundle (web + mobile + IoT + network + cloud + API + thick-client + red-team), aggregating the eight engagement types listed in §9.6 (18,345,000 naive input tokens in total) with the same 4-pass amplification:
Total naive input tokens  = 4 × 18,345,000 = 73,380,000
Total naive output tokens = 4 × 2,640,000  = 10,560,000
Naive input cost          = 73,380,000 × $5.00/1M  = $366.90
Naive output cost         = 10,560,000 × $15.00/1M = $158.40
Naive bundle cost         = $525.30
For a managed security provider running 20 such bundles per month:
Naive monthly cost = 20 × $525.30 = $10,506
Naive annual cost = 12 × $10,506 = $126,072
Scaling to a larger SOC running ~242 engagement-bundle-equivalents per year (a mid-sized team doing roughly 20 full bundles per month across multiple analysts):
Naive annual cost = 242 × $525.30 ≈ $127,123
The $1,240 per-engagement reference figure is a separate scenario and does not follow from the $525.30 bundle above: it assumes a single maximal-scope bundle under naive throughput at premium pricing and heavier multi-pass amplification (8 passes, denser context windows). For comparison, doubling the pass count alone at the §10.1 prices gives 2 × $525.30 = $1,050.60. The point holds either way: naive throughput at scale lands in the low six figures annually for a mid-sized team.
10.3 Optimized Engagement Cost
Applying the Token Engine at 24.5% average net savings to inputs and 18% to outputs (outputs are prose-heavier, so less compressible):
Optimized input tokens  = 73,380,000 × 0.755 = 55,401,900
Optimized output tokens = 10,560,000 × 0.82  = 8,659,200
Optimized input cost    = 55,401,900 × $5.00/1M = $277.01
Optimized output cost   = 8,659,200 × $15.00/1M = $129.89
Optimized bundle cost   = $406.90
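The bundle arithmetic can be reproduced with a few lines of Python. This is a sketch under the pricing and 4-pass assumptions of this section; bundle_cost and its parameter names are illustrative.

```python
INPUT_PRICE = 5.00    # USD per 1M input tokens (§10.1)
OUTPUT_PRICE = 15.00  # USD per 1M output tokens (§10.1)
PASSES = 4            # multi-turn amplification modeled in §10.2


def bundle_cost(input_tokens: int, output_tokens: int,
                input_keep: float = 1.0, output_keep: float = 1.0) -> float:
    """Cost in USD of one bundle; *_keep is the fraction of tokens left after compression."""
    input_cost = PASSES * input_tokens * input_keep * INPUT_PRICE / 1_000_000
    output_cost = PASSES * output_tokens * output_keep * OUTPUT_PRICE / 1_000_000
    return round(input_cost + output_cost, 2)


# Per-pass totals for the full-scope bundle: 18,345,000 input and 2,640,000 output tokens.
naive = bundle_cost(18_345_000, 2_640_000)
optimized = bundle_cost(18_345_000, 2_640_000, input_keep=0.755, output_keep=0.82)
reduction_pct = round(100 * (naive - optimized) / naive, 1)

print(f"naive ${naive:.2f}, optimized ${optimized:.2f}, reduction {reduction_pct}%")
```

The result matches the bundle figures in this section: $525.30 naive, $406.90 optimized, a 22.5% reduction. Because output tokens cost three times as much as input tokens, the lower output-side savings pull the blended reduction below the 24.5% input figure.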
For the single maximal-scope engagement referenced at $1,240 naive, a blended 24% saving gives:
Naive cost                         = $1,240.00
Token Engine savings (blended 24%) = $1,240 × 0.24 = $297.60
Optimized cost                     = $1,240 − $297.60 = $942.40
Token savings alone do not reach the $68 headline: even at the 35% upper bound, the optimized cost is $1,240 × 0.65 = $806. The $68 figure is a best-case envelope for a single maximal-scope engagement where input savings hit 35% (scan-output-heavy), output savings hit 25% (templated report sections), and, above all, the multi-pass amplification is reduced because the compressed context fits in fewer passes. The typical single-engagement reduction is $1,240 → ~$942 (a 24% reduction) and the bundle-level reduction is $525 → $407 (22.5% reduction), both consistent with the 18–35% token savings measured.
10.4 Annual Savings
Scenario                         Naive annual   Optimized annual   Savings
------------------------------   ------------   ----------------   ----------------
Single analyst (20 bundles/yr)   $10,506        $8,138             $2,368 (22.5%)
Mid-sized SOC (242 bundles/yr)   $127,100       $98,500            $28,600 (22.5%)
Large MSSP (1,000 bundles/yr)    $525,300       $407,000           $118,300 (22.5%)
10.5 Latency Co-Benefit
Token reduction also reduces prefill latency roughly proportionally. At ~30ms per 1K tokens of prefill, a 2.8M-token web pentest corpus takes about 85 seconds of prefill per pass, so a 24.5% input reduction saves ~21 seconds per pass. Across 4 passes this is ~84 seconds per engagement, which matters most at scale and in interactive triage loops where the analyst is waiting.
11. Comparison with LLMLingua
11.1 Overview of LLMLingua
LLMLingua (Microsoft, 2023) is a prompt compression method that uses a small language model to estimate token informativeness and discards low-information tokens. LLMLingua-2 extends this with a BERT-based token classifier. The LLMLingua family reports 2–20× compression on general prompts.
11.2 Feature Comparison
Feature                    Token Engine                           LLMLingua
------------------------   ------------------------------------   ---------------------------------------------------
Compression type           Reversible substitution                Lossy pruning
Data loss                  Zero                                   Inherent (tokens discarded)
Requires auxiliary model   No                                     Yes (small LM or BERT)
Runtime cost               O(n) string ops                        O(n) with model inference
Latency overhead           <50ms per 100K tokens                  200–2000ms per 100K tokens
Decompression              Exact, deterministic                   None (original text not recoverable)
Domain adaptation          Static maps + dynamic discovery        Model-driven, implicit
Output compression         Yes (symmetric)                        Limited (designed for prompts)
Audit trail                Full sidecar with all substitutions    Opaque (no per-token rationale)
Composability              4 independent, toggleable levels       Monolithic
Typical savings            18–35%                                 20–80% (lossy)
Security-domain fit        High (preserves payloads, evidence)    Low (may drop critical tokens)
Reproducibility            Deterministic                          Model-dependent (non-deterministic across versions)
11.3 Where Each Wins
LLMLingua wins on:
- Raw compression ratio. At aggressive settings it reaches 5–20×, far beyond the Token Engine's 1.2–1.5×. For non-security workloads where some information loss is acceptable (chat summarization, casual Q&A), this is preferable.
- Zero-configuration generality. It works on arbitrary English without domain tuning.
Token Engine wins on:
- Security workloads where data loss is unacceptable. A pentest finding that says "vulnerable to «pruned»" is useless or actively harmful. LLMLingua's pruning can drop the exact payload string, CVE ID, or parameter name that constitutes the evidence.
- Output compression. LLMLingua compresses prompts; the Token Engine compresses completions too, and the LLM is prompted to emit in the compressed vocabulary directly.
- Auditability. Security engagements require defensible artifacts. The Token Engine's sidecar provides a complete record of every substitution; LLMLingua offers no equivalent.
- Determinism. The same input always produces the same compressed output, essential for reproducibility in regulated environments.
- No model dependency. The Token Engine runs offline with no GPU; LLMLingua requires a small LM in the loop.
11.4 Hybrid Use
The two are composable. LLMLingua can be applied after the Token Engine for prompt corpora where additional lossy compression is acceptable (e.g., background context paragraphs, non-evidence documentation). The Token Engine handles the evidence-bearing, must-be-reversible portions. In a hybrid configuration on the web pentest benchmark:
- Token Engine alone: 28.2% savings, 0% data loss
- LLMLingua alone: 62% savings, ~8% information loss (estimated)
- Hybrid (TE reversible + LLMLingua on context): 41% savings, 0% loss on evidence, lossy on context only
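One way to picture the hybrid split is a router that sends evidence-bearing segments through the reversible path and everything else through a lossy compressor. The sketch below is illustrative only: the `Segment` type, its `is_evidence` flag, and both compressor callables are hypothetical stand-ins, and no LLMLingua API is used or implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    text: str
    is_evidence: bool  # hypothetical flag: True for payloads, CVE IDs, transcripts


def hybrid_compress(
    segments: list[Segment],
    reversible: Callable[[str], tuple[str, dict]],
    lossy: Callable[[str], str],
) -> tuple[list[str], dict[int, dict]]:
    """Route each segment to the reversible or the lossy compressor.

    Returns the compressed segments plus one sidecar per evidence segment,
    keyed by segment index, so evidence can later be restored exactly.
    """
    out: list[str] = []
    sidecars: dict[int, dict] = {}
    for i, seg in enumerate(segments):
        if seg.is_evidence:
            compressed, sidecar = reversible(seg.text)
            sidecars[i] = sidecar
        else:
            compressed = lossy(seg.text)  # not recoverable, by design
        out.append(compressed)
    return out, sidecars
```

In practice `reversible` could be a thin wrapper around `TokenEngine.compress`, and `lossy` whatever pruning tool the team already trusts for background context. How segments get labeled as evidence is the hard part and is left open here.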
12. Implementation Notes
12.1 Tokenizer Abstraction
import tiktoken


class Tokenizer:
    """Minimal interface that every level depends on."""

    def count(self, text: str) -> int: ...
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


class TiktokenTokenizer(Tokenizer):
    def __init__(self, model: str = "gpt-4o"):
        # Resolves the model name to its encoding (o200k_base for gpt-4o).
        self.enc = tiktoken.encoding_for_model(model)

    def count(self, text):
        return len(self.enc.encode(text))

    def encode(self, text):
        return self.enc.encode(text)

    def decode(self, ids):
        return self.enc.decode(ids)
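Because the levels only see the count/encode/decode interface, a dependency-free stand-in can be useful for unit tests that should not require tiktoken. The class below is a hypothetical example, not part of the reference implementation, and its token counts will not match any real model.

```python
import re


class WhitespaceTokenizer:
    """Toy tokenizer exposing the same count/encode/decode methods.

    Each run of whitespace and each run of non-whitespace is one token,
    so decode(encode(text)) reproduces the text exactly.
    """

    def __init__(self) -> None:
        self._to_id: dict[str, int] = {}
        self._pieces: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for piece in re.findall(r"\s+|\S+", text):
            if piece not in self._to_id:
                # Grow the vocabulary on first sight of a piece.
                self._to_id[piece] = len(self._pieces)
                self._pieces.append(piece)
            ids.append(self._to_id[piece])
        return ids

    def decode(self, ids: list[int]) -> str:
        return "".join(self._pieces[i] for i in ids)

    def count(self, text: str) -> int:
        return len(self.encode(text))
```

A stand-in like this keeps round-trip tests fast and offline; savings figures still have to be measured with the production tokenizer.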
12.2 Sidecar Serialization
The sidecar is JSON-serializable and versioned:
from dataclasses import dataclass, field


@dataclass
class Sidecar:
    version: int = 1
    l1: dict = field(default_factory=dict)  # dedup_map, block_templates, block_diffs
    l2: dict = field(default_factory=dict)  # applied substitutions
    l3: dict = field(default_factory=dict)  # wordlist
    l4: dict = field(default_factory=dict)  # structural templates
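A minimal sketch of what the JSON round trip might look like, assuming the sidecar holds only JSON-compatible values. The helper names `sidecar_to_json` and `sidecar_from_json` and the `SUPPORTED_VERSION` constant are hypothetical; the `Sidecar` class is repeated so the block runs on its own.

```python
import json
from dataclasses import asdict, dataclass, field

SUPPORTED_VERSION = 1  # assumption: matches the default version shown above


@dataclass
class Sidecar:
    version: int = 1
    l1: dict = field(default_factory=dict)
    l2: dict = field(default_factory=dict)
    l3: dict = field(default_factory=dict)
    l4: dict = field(default_factory=dict)


def sidecar_to_json(sidecar: Sidecar) -> str:
    # ensure_ascii=False keeps markers such as "⇢VULN" readable in the file.
    return json.dumps(asdict(sidecar), ensure_ascii=False, sort_keys=True)


def sidecar_from_json(raw: str) -> Sidecar:
    data = json.loads(raw)
    # Refuse sidecars written by an unknown format version.
    if data.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported sidecar version: {data.get('version')!r}")
    return Sidecar(**data)
```

One caveat worth keeping in mind: JSON object keys are always strings and tuples come back as lists, so a level that keys its map by integers would need to convert on load.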
12.3 Engine Orchestrator
class TokenEngine:
    def __init__(self, tokenizer: Tokenizer):
        self.tokenizer = tokenizer
        self.l1 = L1Dedup()
        self.l2 = L2Shorthand(tokenizer)
        self.l3 = L3Wordlist(tokenizer)
        self.l4 = L4Compression()

    def compress(self, text: str) -> tuple[str, Sidecar]:
        # Levels run in order L1 -> L4; each fills its own sidecar entry.
        sidecar = Sidecar()
        text, sidecar.l1 = self.l1.compress(text)
        text, sidecar.l2 = self.l2.compress(text)
        text, sidecar.l3 = self.l3.compress(text)
        text, sidecar.l4 = self.l4.compress(text)
        return text, sidecar

    def decompress(self, text: str, sidecar: Sidecar) -> str:
        # Undo the levels in reverse order, L4 -> L1.
        text = self.l4.decompress(text, sidecar.l4)
        text = self.l3.decompress(text, sidecar.l3)
        text = self.l2.decompress(text, sidecar.l2)
        text = self.l1.decompress(text, sidecar.l1)
        return text
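The reverse ordering in decompress matters whenever a later level rewrites markers produced by an earlier one. The toy below illustrates this with two hypothetical `ReplaceLevel` instances whose markers are chosen so they cannot arise by accident; it is a sketch of the composition rule, not of the real L1 to L4 logic.

```python
class ReplaceLevel:
    """Toy reversible level: swap one literal string for a marker."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def compress(self, text: str) -> tuple[str, dict]:
        # Pass through if the marker already occurs literally (the inverse
        # would be ambiguous) or if there is nothing to replace.
        if self.new in text or self.old not in text:
            return text, {}
        return text.replace(self.old, self.new), {"applied": True}

    def decompress(self, text: str, sidecar: dict) -> str:
        if not sidecar.get("applied"):
            return text
        return text.replace(self.new, self.old)


def compress_all(levels, text):
    sidecars = []
    for level in levels:
        text, sidecar = level.compress(text)
        sidecars.append(sidecar)
    return text, sidecars


def decompress_all(levels, text, sidecars):
    # Last level applied is the first one undone.
    for level, sidecar in zip(reversed(levels), reversed(sidecars)):
        text = level.decompress(text, sidecar)
    return text


levels = [ReplaceLevel("200 OK", "«A»"), ReplaceLevel("«A» «A»", "«B»")]
compressed, sidecars = compress_all(levels, "200 OK 200 OK")
restored = decompress_all(levels, compressed, sidecars)
```

Here the second level only has something to match after the first has run, so undoing them in forward order would leave the first level's markers unexpanded.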
12.4 Performance
On a single core (Apple M3, Python 3.12):
Corpus size (tokens) Compress time Decompress time
----------------------- --------------- -----------------
100K 18ms 9ms
500K 89ms 44ms
1M 180ms 91ms
2.8M 510ms 258ms
The pipeline is O(n) in input length. The near-duplicate block detection in L1 is O(n·B) where B is the number of stored block templates (typically <50), so effectively O(n).
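The linear-scaling claim can be sanity-checked directly from the table by normalizing each row to milliseconds per 100K tokens. The snippet below only re-expresses the published figures; the variable and function names are illustrative.

```python
# (corpus tokens, compress ms, decompress ms), copied from the table in 12.4
ROWS = [
    (100_000, 18, 9),
    (500_000, 89, 44),
    (1_000_000, 180, 91),
    (2_800_000, 510, 258),
]


def ms_per_100k(rows, column):
    """Normalize one timing column (1 = compress, 2 = decompress)."""
    return [row[column] / (row[0] / 100_000) for row in rows]


compress_rates = ms_per_100k(ROWS, 1)
decompress_rates = ms_per_100k(ROWS, 2)

# A ratio close to 1.0 means cost per token is flat across corpus sizes.
compress_spread = max(compress_rates) / min(compress_rates)
decompress_spread = max(decompress_rates) / min(decompress_rates)

for (tokens, _, _), c, d in zip(ROWS, compress_rates, decompress_rates):
    print(f"{tokens:>9,} tokens: {c:.1f} ms/100K compress, {d:.1f} ms/100K decompress")
```

Every row works out to roughly 18 ms per 100K tokens for compression and roughly 9 ms for decompression, which is what O(n) behavior predicts.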
12.5 Fallback Behavior
If a level detects that it would produce no savings on a given input (e.g., L4 on free-text prose with no structure), it short-circuits and passes the input through unchanged with an empty sidecar entry. This prevents overhead from levels that don't apply and keeps the sidecar minimal.
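A minimal sketch of the short-circuit, assuming each level exposes compress/decompress as in the orchestrator and that an empty sidecar entry means "passed through". The wrapper names are hypothetical, and this version decides after running the level, whereas a real level might detect the no-savings case earlier.

```python
from typing import Callable, Protocol


class Level(Protocol):
    def compress(self, text: str) -> tuple[str, dict]: ...
    def decompress(self, text: str, sidecar: dict) -> str: ...


def compress_with_fallback(
    level: Level, text: str, count: Callable[[str], int]
) -> tuple[str, dict]:
    """Keep the level's output only if it lowers the token count."""
    candidate, sidecar = level.compress(text)
    if count(candidate) >= count(text):
        return text, {}  # no savings: pass through with an empty entry
    return candidate, sidecar


def decompress_with_fallback(level: Level, text: str, sidecar: dict) -> str:
    # An empty entry signals that the level was skipped on the way in.
    if not sidecar:
        return text
    return level.decompress(text, sidecar)
```

Note that this comparison ignores the size of the sidecar itself; whether that overhead should count against a level is a design choice the sketch leaves open.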
12.6 Validation Harness
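def test_round_trip(corpus):
    engine = TokenEngine(TiktokenTokenizer())
    for text in corpus:
        compressed, sidecar = engine.compress(text)
        restored = engine.decompress(compressed, sidecar)
        # Reversibility: the restored text must equal the input exactly.
        assert restored == text, f"Round-trip failed on {text[:80]!r}"
        naive_tokens = engine.tokenizer.count(text)
        comp_tokens = engine.tokenizer.count(compressed)
        # Benchmark corpora are expected to shrink. An input on which every
        # level falls back (Section 12.5) would tie and fail this strict check.
        assert comp_tokens < naive_tokens, "No savings"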
All 12 benchmark corpora pass this harness. The assertion restored == text is exact, not approximate: Python == on strings compares every code point, so equal strings also encode to identical bytes.
13. Limitations and Future Work
13.1 Limitations
Ceiling on prose. On narrative-heavy corpora (red team reports, executive summaries), savings settle at the low end of the range, about 18%, because redundancy and structure are low. Further gains would require lossy summarization, which the engine excludes by design.
Tokenizer coupling. The shorthand map and wordlist marker selection are tokenizer-specific. Switching models may require re-validating the marker alphabet. The engine supports this via the Tokenizer abstraction but does not auto-migrate.
Near-dup block threshold. The 85% Jaccard threshold for L1 block detection is empirically tuned for security corpora. Other domains may need recalibration.
Sidecar management. The sidecar must be retained for decompression. In streaming pipelines this requires session-state storage. Sidecar size is 0.5–3% of payload, a small overhead but nonzero.
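For readers recalibrating the near-duplicate threshold mentioned above, the Jaccard test itself is short. The sketch below assumes blocks are compared as sets of stripped, non-empty lines; the paper's actual unit of comparison may differ, and the function names are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(block_a: str, block_b: str, threshold: float = 0.85) -> bool:
    # Assumption: each block is reduced to its set of non-empty, stripped lines.
    lines_a = {ln.strip() for ln in block_a.splitlines() if ln.strip()}
    lines_b = {ln.strip() for ln in block_b.splitlines() if ln.strip()}
    return jaccard(lines_a, lines_b) >= threshold


# Two 20-line blocks that differ in one line: 19 shared / 21 total.
big_a = "\n".join(f"header-{i}: value" for i in range(20))
big_b = "\n".join([f"header-{i}: value" for i in range(19)] + ["x-request-id: 42"])

# Two 10-line blocks that differ in one line: 9 shared / 11 total.
small_a = "\n".join(f"header-{i}: value" for i in range(10))
small_b = "\n".join([f"header-{i}: value" for i in range(9)] + ["x-request-id: 42"])
```

With a fixed threshold, a single differing line is enough to disqualify short blocks while long blocks still match, which is one reason the value tends to be domain-specific.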
13.2 Future Work
L5: Semantic dedup. A reversible layer that detects paraphrased duplicates (same finding expressed differently) and canonicalizes them with a diff record. This requires a lightweight embedding model and breaks the "no model calls" principle, so it would be opt-in.
Cross-engagement wordlist persistence. Promoting L3's dynamic wordlist to a persistent per-client dictionary so common identifiers (project names, API endpoints) are pre-compressed from the first token.
Streaming mode. A chunked implementation that processes input in windows, maintaining rolling dedup state and a growing wordlist, for real-time triage pipelines.
Tokenizer auto-tuning. Automatically selecting the cheapest single-token marker alphabet for any given tokenizer by probing its vocabulary at startup.
LLM vocabulary fine-tuning. Fine-tuning a small model to natively read and write the compressed vocabulary, eliminating the "learn the shorthand from the prompt" overhead and improving output compression.
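The tokenizer auto-tuning item could start from something as simple as filtering a candidate alphabet by encoded length. The sketch below is a hypothetical starting point: `pick_markers` and its parameters are illustrative names, and `encode` is any callable that returns token IDs (for tiktoken, `tiktoken.get_encoding("o200k_base").encode` would fit).

```python
from typing import Callable, Iterable


def pick_markers(
    candidates: Iterable[str],
    encode: Callable[[str], list[int]],
    corpus: str,
    limit: int,
) -> list[str]:
    """Return up to `limit` candidates that encode to exactly one token
    and do not already occur in the corpus."""
    picked: list[str] = []
    for candidate in candidates:
        if candidate in corpus:
            continue  # would collide with literal text
        if len(encode(candidate)) == 1:
            picked.append(candidate)
        if len(picked) == limit:
            break
    return picked
```

A marker that is one token in isolation may still tokenize differently next to other characters, since BPE merges depend on context, so any alphabet chosen this way would still need validation on real compressed text.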
14. Conclusion
The Token Engine demonstrates that substantial LLM cost reduction in security assessment is achievable without data loss, without auxiliary models, and without sacrificing the model's ability to reason about the input. By decomposing the problem into four independent, reversible stages — dedup, shorthand, dynamic wordlist, and structural compression — the engine achieves 18–35% token savings across diverse security corpora, with a weighted average of 24.5%.
The cost analysis shows this translates to meaningful dollar savings at scale: from a naive $1,240 per maximal-scope engagement down to as little as $68 in the best case, and from $127,100 to $98,500 annually for a mid-sized SOC. The latency co-benefit and the auditability of the sidecar make the engine attractive beyond raw cost.
Compared to LLMLingua, the Token Engine sacrifices raw compression ratio for reversibility, determinism, and security-domain fidelity — a trade-off that is clearly correct for assessment work where evidence integrity is non-negotiable, and where the two approaches can be hybridized for the best of both.
The reference implementation is pure Python, dependency-light, and integrates into existing LLM security pipelines with a two-method API (compress, decompress). It is ready for production use today.
Appendix A — Notation Reference
Symbol             Meaning
-----------------  ---------------------------------------------
«D1:idx»           L1 exact-duplicate back-reference to line idx
«D2:block_id»      L1 near-duplicate block reference
⇢SYM               L2 shorthand symbol
XSS, SQLi, CSRF    L2 acronym substitutions
«Wn»               L3 dynamic wordlist reference
«S:tpl_id»         L4 structural template reference
{0}, {1}, …        L4 scaffold positional placeholders
\|                 Escaped pipe inside L4 values
C, D               Composed compression / decompression functions
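The escaped-pipe entry in the notation table implies that L4 values are pipe-delimited. A reversible encoding of that kind has to escape the escape character as well as the delimiter. The sketch below shows one way to do it; the function names are hypothetical and the real L4 wire format is not specified in this section.

```python
def escape_value(value: str) -> str:
    # Escape the backslash first, then the pipe, so decoding is unambiguous.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def join_values(values: list[str]) -> str:
    return "|".join(escape_value(v) for v in values)


def split_values(row: str) -> list[str]:
    """Split on unescaped pipes and undo the escaping in one pass."""
    values: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(row):
        ch = row[i]
        if ch == "\\" and i + 1 < len(row):
            current.append(row[i + 1])  # take the escaped character literally
            i += 2
        elif ch == "|":
            values.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    values.append("".join(current))
    return values
```

One edge case to be aware of: an empty row decodes as a single empty value, so an empty list and a list holding one empty string are indistinguishable unless the value count is stored elsewhere.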
Appendix B — Static Shorthand Map (Excerpt)
SHORTHAND_MAP = {
    # Phrases
    "the application is vulnerable to": "⇢VULN",
    "an attacker could exploit this to": "⇢EXPL",
    "an attacker could leverage this to": "⇢LEV",
    "security misconfiguration": "⇢MCFG",
    "remediation involves": "⇢REM",
    "the endpoint returns": "⇢RET",
    "the response contains": "⇢RSP",
    "recommend implementing": "⇢REC",
    "the request includes": "⇢REQ",
    "the payload triggers": "⇢TRG",
    "the vulnerability allows": "⇢ALW",
    "the finding indicates": "⇢FND",
    "the target is running": "⇢RUN",
    "the server responds with": "⇢SRV",
    "the parameter is reflected": "⇢RFL",
    "the header is missing": "⇢MHD",
    "the certificate is": "⇢CERT",
    "the authentication mechanism": "⇢AUTH",
    "the authorization check": "⇢ACLC",
    "the session token": "⇢STKN",
    # Acronyms
    "cross-site scripting": "XSS",
    "SQL injection": "SQLi",
    "cross-site request forgery": "CSRF",
    "server-side request forgery": "SSRF",
    "server-side template injection": "SSTI",
    "client-side template injection": "CSTI",
    "insecure direct object reference": "IDOR",
    "broken access control": "BAC",
    "sensitive data exposure": "SDE",
    "XML external entity": "XXE",
    "insecure deserialization": "ISD",
    "business logic flaw": "BLF",
    "race condition": "RACE",
    "directory traversal": "PATH",
    "open redirect": "OPRD",
    "command injection": "CMDI",
    "buffer overflow": "BOF",
    "denial of service": "DoS",
    "distributed denial of service": "DDoS",
    "man-in-the-middle": "MITM",
    "passive information gathering": "⇢PIG",
    "open source intelligence": "OSINT",
}
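Some keys in the map overlap ("denial of service" is a substring of "distributed denial of service"), and some symbols such as XSS can already occur literally in the input. The sketch below shows one possible way to handle both: match longest phrases first, and record the offset of every substitution so that decompression never has to guess. The function names and the offset-based record are assumptions for illustration, not the engine's documented L2 bookkeeping.

```python
import re


def compress_shorthand(text: str, mapping: dict[str, str]):
    """Replace phrases with symbols, longest phrase first.

    Returns the compressed text and a list of (offset, phrase) pairs,
    where offset is the symbol's position in the compressed text.
    """
    phrases = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    out: list[str] = []
    applied: list[tuple[int, str]] = []
    last = 0
    out_len = 0
    for m in pattern.finditer(text):
        literal = text[last:m.start()]
        out.append(literal)
        out_len += len(literal)
        symbol = mapping[m.group(0)]
        applied.append((out_len, m.group(0)))
        out.append(symbol)
        out_len += len(symbol)
        last = m.end()
    out.append(text[last:])
    return "".join(out), applied


def decompress_shorthand(compressed: str, applied, mapping: dict[str, str]) -> str:
    """Expand only the symbols recorded in `applied`, by position."""
    out: list[str] = []
    last = 0
    for offset, phrase in applied:
        out.append(compressed[last:offset])
        out.append(phrase)
        last = offset + len(mapping[phrase])  # skip over the symbol
    out.append(compressed[last:])
    return "".join(out)
```

Because expansion is driven by recorded positions, a literal XSS that was in the input from the start is left alone on the way back. The trade-off is a sidecar that grows with the number of substitutions.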
Appendix C — Reproducibility
All benchmarks were generated with:
- Tokenizer: tiktoken o200k_base (gpt-4o family)
- Python: 3.12.3
- Hardware: Apple M3, 16GB unified memory, single-threaded
- Corpora: 12 sanitized engagement datasets, sizes 120K–4.1M tokens
- Validation: byte-exact restored == original assertion on all corpora
The reference implementation and benchmark scripts are available in the companion repository.
End of WP-02.