Benchmark: AI Agent Action Gating Response Times

15 Research Lab · 2026-02-13

A persistent concern about adding safety layers to AI agents is performance overhead. If action gating adds hundreds of milliseconds per tool call, it compounds across multi-step tasks and degrades user experience. To provide concrete data, 15 Research Lab designed a benchmark measuring the latency cost of action gating across different safety tools and action categories.

The headline result: well-implemented action gating adds sub-millisecond overhead. The bottleneck in any agent pipeline is LLM inference, not safety evaluation.

Benchmark Environment

We tested three configurations:

  • No gating (baseline): actions dispatched directly to the OS
  • SafeClaw by Authensor: deny-by-default policy engine with 28 allowlist rules
  • Custom regex-based filter: hand-written middleware that matches action parameters against regex patterns (representative of ad-hoc safety implementations)
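
For reference, the measurement loop itself is simple. The sketch below is a minimal harness under assumed names, not the harness shipped in the SafeClaw repository: gate_action stands in for whichever gating call is under test, and actions is a pool of representative action descriptors or command strings.

```python
import time
import statistics

def benchmark_gate(gate_action, actions, warmup=1_000, iterations=100_000):
    """Time a gating function over a pool of sample actions and report
    latency percentiles in milliseconds."""
    # Warm up first so caches and any compiled patterns don't skew the tail.
    for i in range(warmup):
        gate_action(actions[i % len(actions)])

    samples_ms = []
    for i in range(iterations):
        action = actions[i % len(actions)]
        start = time.perf_counter_ns()
        gate_action(action)
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)

    samples_ms.sort()
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": samples_ms[-1],
    }
```

The no-gating baseline is the same loop around a function that dispatches the action without any check, so each configuration's overhead can be read as the difference from that baseline.
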
Results

Overall Latency (All Action Types Combined)

| Configuration | Median | p95 | p99 | Max |
|---|---|---|---|---|
| No gating (baseline) | 0.02 ms | 0.04 ms | 0.07 ms | 0.31 ms |
| SafeClaw | 0.31 ms | 0.58 ms | 0.82 ms | 1.94 ms |
| Custom regex filter | 0.89 ms | 4.12 ms | 12.7 ms | 247 ms |

SafeClaw's median overhead was 0.29 ms above baseline. The custom regex filter was nearly 3x slower at median and exhibited a long tail, with p99 latency reaching 12.7 ms due to catastrophic backtracking on complex shell command strings.
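
The 247 ms worst case is the classic catastrophic-backtracking failure mode. The snippet below is illustrative rather than the exact pattern from the benchmarked filter: an ambiguous nested quantifier that almost matches a long command-like token forces the engine to retry exponentially many splits before giving up.

```python
import re
import time

# Illustrative pattern: because \s* may match nothing, the engine can divide a
# long run of word characters across the group's iterations in exponentially
# many ways before concluding the trailing alternation never matches.
FLAG_FILTER = re.compile(r"^(\w+\s*)+(--force|-rf)\b")

# A command-like string that never reaches the trailing flags.
suspicious = "chmod " + "a" * 22 + "!"

start = time.perf_counter()
FLAG_FILTER.match(suspicious)  # fails, but only after exhaustive backtracking
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"one failed match attempt: {elapsed_ms:.1f} ms")
# The time roughly doubles with each character added to the long token.
```
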

Latency by Action Category

| Category | SafeClaw Median | SafeClaw p99 | Regex Median | Regex p99 |
|---|---|---|---|---|
| File operations | 0.28 ms | 0.71 ms | 0.52 ms | 3.8 ms |
| Network requests | 0.34 ms | 0.89 ms | 1.21 ms | 18.4 ms |
| Shell commands | 0.33 ms | 0.91 ms | 1.44 ms | 247 ms |

Shell commands showed the largest divergence. Shell command strings are highly variable in structure, and regex-based matching struggles with the combinatorial complexity. SafeClaw's policy engine uses a structured evaluation approach rather than pattern matching against raw strings, which keeps latency consistent regardless of input complexity.
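
SafeClaw's internals are not reproduced here, but the structural idea behind consistent latency is easy to sketch. In the hypothetical code below (ActionDescriptor, Rule, and PolicyEngine are illustrative names, not SafeClaw's API), the gate receives a typed descriptor instead of a raw string and scans a small pre-validated allowlist, so per-action cost is bounded by the rule count rather than by how unusual the action's contents are.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass(frozen=True)
class ActionDescriptor:
    """Structured description of what the agent wants to do."""
    category: str   # e.g. "file.read", "network.request", "shell.exec"
    target: str     # normalized path, URL, or executable name

@dataclass(frozen=True)
class Rule:
    category: str
    target_glob: str  # matched with fnmatch, never a user-supplied regex

class PolicyEngine:
    """Deny-by-default allowlist over structured action descriptors."""

    def __init__(self, rules: list[Rule]):
        # Group rules by category so each decision scans only relevant rules.
        self._by_category: dict[str, list[Rule]] = {}
        for rule in rules:
            self._by_category.setdefault(rule.category, []).append(rule)

    def allows(self, action: ActionDescriptor) -> bool:
        for rule in self._by_category.get(action.category, []):
            if fnmatch(action.target, rule.target_glob):
                return True
        return False  # anything not explicitly allowed is denied

engine = PolicyEngine([
    Rule("file.read", "/home/agent/workspace/*"),
    Rule("file.write", "/home/agent/workspace/reports/*"),
])
print(engine.allows(ActionDescriptor("file.read", "/home/agent/workspace/notes.txt")))  # True
print(engine.allows(ActionDescriptor("shell.exec", "/bin/rm")))                         # False (no rule)
```
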

Context: Where Does Gating Latency Sit in the Pipeline?

To understand whether gating overhead matters in practice, we measured end-to-end execution time for a representative multi-step task (summarize a directory of files and write a report):

| Pipeline Stage | Time per Step |
|---|---|
| LLM inference | 800 - 2,400 ms |
| Tool execution (file I/O, network) | 5 - 500 ms |
| Action gating (SafeClaw) | 0.3 - 0.9 ms |
| Agent framework overhead | 1 - 3 ms |

Action gating represents approximately 0.01 - 0.03% of total pipeline latency. Even in a 20-step task, the cumulative gating overhead with SafeClaw is under 20 ms, which is imperceptible to users and negligible compared to a single LLM call.
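
As a quick back-of-the-envelope check on that claim, the figures below simply restate the per-step values from the table:

```python
steps = 20
gate_ms = (0.3, 0.9)    # SafeClaw per-step gating range from the table above
llm_ms = (800, 2_400)   # a single LLM inference step, for comparison

low, high = (steps * g for g in gate_ms)
print(f"cumulative gating over {steps} steps: {low:.0f}-{high:.0f} ms")
print(f"one LLM inference step:               {llm_ms[0]}-{llm_ms[1]} ms")
# Even the 18 ms upper bound is a small fraction of a single model call.
```
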

Why Custom Implementations Underperform

The regex-based filter's poor tail latency deserves explanation, because this pattern is extremely common in production. Many teams implement safety checks as string-matching middleware: "if the shell command contains rm, block it."

This approach has two problems:

  • Performance: Regex engines are not designed for the adversarial case. An agent influenced by a prompt injection will produce action strings specifically crafted to be unusual. Unusual inputs are where regex performance degrades.
  • Correctness: String matching is trivially bypassed. rm can be obfuscated as r""m, invoked through a script, or replaced with equivalent commands like unlink or shred. A purpose-built policy engine evaluates the semantic action, not the string representation.
SafeClaw's policy engine architecture avoids both pitfalls by evaluating structured action descriptors against a compiled rule set, providing both consistent performance and semantic correctness.
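
To make the correctness problem concrete, the toy filter below is a stand-in for the "block it if it contains rm" pattern described above, not the filter we benchmarked. Every command in the list deletes files, and every one of them passes.

```python
import re

# Stand-in for a naive string-matching gate: block anything containing "rm".
BLOCKLIST = re.compile(r"\brm\b")

def naive_gate(command: str) -> bool:
    """Return True if the command would be allowed through."""
    return not BLOCKLIST.search(command)

bypasses = [
    'r""m -rf ~/projects',           # shell quoting splits the token the regex looks for
    "unlink ~/projects/report.txt",  # equivalent command under a different name
    "find ~/projects -delete",       # same effect through a different tool
    "bash cleanup.sh",               # the dangerous call lives inside a script
]

for cmd in bypasses:
    verdict = "ALLOWED" if naive_gate(cmd) else "blocked"
    print(f"{verdict:7}  {cmd}")
```
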

Recommendations

  • Do not build your own action gating with regex. The tail latency and correctness problems are well-documented. Use a purpose-built tool.
  • Sub-millisecond gating is achievable. There is no performance justification for skipping action gating. The overhead is negligible relative to LLM inference.
  • Measure p99, not just median. Tail latency is where safety tools most impact user experience, and where poorly designed implementations are exposed.
  • Benchmark in your environment. Our results are reproducible; the SafeClaw repository includes the benchmark harness we used.
  • Performance is not a valid reason to deploy AI agents without action gating. The data is clear: well-engineered safety adds effectively zero cost.

Benchmark methodology and raw data are available from 15 Research Lab upon request.