Benchmark: AI Agent Action Gating Response Times
A persistent concern about adding safety layers to AI agents is performance overhead. If action gating adds hundreds of milliseconds per tool call, it compounds across multi-step tasks and degrades user experience. To provide concrete data, 15 Research Lab designed a benchmark measuring the latency cost of action gating across different safety tools and action categories.
The headline result: well-implemented action gating adds sub-millisecond overhead. The bottleneck in any agent pipeline is LLM inference, not safety evaluation.
Benchmark Environment
- Hardware: AWS c6i.2xlarge (8 vCPU, 16 GB RAM)
- Agent Framework: LangChain ReAct agent with GPT-4o backend
- Tools: File read/write, HTTP requests, shell command execution
- Workload: 10,000 action invocations per tool category (30,000 total)
- Measurement: Wall-clock time from action request to policy decision, measured at the application layer using high-resolution timestamps
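As a rough sketch of the measurement harness (the `gate.evaluate` call is a placeholder for whichever gating implementation is under test, not a specific SafeClaw API):

```python
import statistics
import time

def time_policy_decisions(gate, actions):
    """Measure wall-clock time from action request to policy decision.

    `gate` is any object exposing an evaluate(action) method -- a stand-in
    for the gating layer under test, not a real SafeClaw interface.
    """
    samples_ms = []
    for action in actions:
        start = time.perf_counter_ns()
        gate.evaluate(action)                          # the policy decision being timed
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)

    quantiles = statistics.quantiles(samples_ms, n=100)
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": quantiles[94],
        "p99_ms": quantiles[98],
        "max_ms": max(samples_ms),
    }
```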
We tested three configurations:
- No gating (baseline): tool calls execute directly, with no policy check in the path.
- SafeClaw: each action passes through SafeClaw's policy engine before execution.
- Custom regex filter: string-matching middleware that screens raw tool arguments against a set of hand-written regular expressions.
Results
Overall Latency (All Action Types Combined)
| Configuration | Median | p95 | p99 | Max |
|---|---|---|---|---|
| No gating (baseline) | 0.02 ms | 0.04 ms | 0.07 ms | 0.31 ms |
| SafeClaw | 0.31 ms | 0.58 ms | 0.82 ms | 1.94 ms |
| Custom regex filter | 0.89 ms | 4.12 ms | 12.7 ms | 247 ms |
SafeClaw's median overhead was 0.29 ms above baseline. The custom regex filter was nearly 3x slower at median and exhibited a long tail, with p99 latency reaching 12.7 ms due to catastrophic backtracking on complex shell command strings.
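For readers who have not hit catastrophic backtracking before, the sketch below reproduces the failure mode in miniature with a hypothetical nested-quantifier rule; it is not the benchmark's actual filter, but it shows how evaluation time can explode on benign input.

```python
import re
import time

# A hand-rolled rule meant to catch command chains like "...; rm -rf ..."
# (hypothetical -- not the benchmark's actual filter). The nested quantifier
# in (\w+\s*)+ lets the engine split a run of word characters in
# exponentially many ways while hunting for a ';' that never appears.
risky_pattern = re.compile(r"^(\w+\s*)+;\s*rm\b")

for length in (14, 16, 18, 20):
    command = "x" * length + ".tmp"            # one long, perfectly benign token
    start = time.perf_counter()
    risky_pattern.match(command)               # never matches, but backtracks first
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"token length {length:2d}: {elapsed_ms:8.3f} ms")
# Times grow exponentially with token length; exact values depend on hardware.
```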
Latency by Action Category
| Category | SafeClaw Median | SafeClaw p99 | Regex Median | Regex p99 |
|---|---|---|---|---|
| File operations | 0.28 ms | 0.71 ms | 0.52 ms | 3.8 ms |
| Network requests | 0.34 ms | 0.89 ms | 1.21 ms | 18.4 ms |
| Shell commands | 0.33 ms | 0.91 ms | 1.44 ms | 247 ms |
Shell commands showed the largest divergence. Shell command strings are highly variable in structure, and regex-based matching struggles with the combinatorial complexity. SafeClaw's policy engine uses a structured evaluation approach rather than pattern matching against raw strings, which keeps latency consistent regardless of input complexity.
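SafeClaw's internal descriptor format is not documented here, so the following is only a schematic illustration of the general idea, with hypothetical names and fields: the gating decision operates on structured fields (tool, operation, target) and reduces to a few constant-time lookups, so its cost does not depend on how convoluted the raw command string is.

```python
from dataclasses import dataclass

# Hypothetical structured action descriptor -- illustrative only, not
# SafeClaw's actual schema. Tool arguments arrive as typed fields already
# parsed by the agent framework, so there is no raw-string scanning.
@dataclass(frozen=True)
class ActionDescriptor:
    tool: str            # e.g. "shell", "http_request", "file_write"
    operation: str       # e.g. "execute", "GET", "write"
    target: str          # normalized path, URL host, or executable name

# "Compiled" rule set: plain set and prefix lookups whose cost does not
# depend on how complex the original command string was.
BLOCKED_EXECUTABLES = frozenset({"rm", "shred", "unlink", "mkfs"})
ALLOWED_WRITE_PREFIXES = ("/tmp/", "/workspace/")

def evaluate(action: ActionDescriptor) -> str:
    if action.tool == "shell" and action.target in BLOCKED_EXECUTABLES:
        return "deny"
    if action.tool == "file_write" and not action.target.startswith(ALLOWED_WRITE_PREFIXES):
        return "deny"
    return "allow"

print(evaluate(ActionDescriptor("shell", "execute", "rm")))           # deny
print(evaluate(ActionDescriptor("file_write", "write", "/tmp/out")))  # allow
```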
Context: Where Does Gating Latency Sit in the Pipeline?
To understand whether gating overhead matters in practice, we measured end-to-end task execution time for a representative multi-step task (summarize a directory of files and write a report):
| Pipeline Stage | Time |
|---|---|
| LLM inference (per step) | 800 - 2,400 ms |
| Tool execution (file I/O, network) | 5 - 500 ms |
| Action gating (SafeClaw) | 0.3 - 0.9 ms |
| Agent framework overhead | 1 - 3 ms |
Action gating represents approximately 0.01 - 0.03% of total pipeline latency. Even in a 20-step task, the cumulative gating overhead with SafeClaw is under 20 ms, which is imperceptible to users and negligible compared to a single LLM call.
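A quick back-of-the-envelope check of that cumulative figure, using the per-call p99 from the tables above:

```python
# Worst-case cumulative gating overhead for a 20-step task, using the
# SafeClaw per-call p99 for shell commands from the tables above.
steps = 20
gating_p99_ms = 0.91        # SafeClaw p99, shell commands
cheapest_llm_step_ms = 800  # lower bound for a single LLM inference step

cumulative_ms = steps * gating_p99_ms
print(f"{cumulative_ms:.1f} ms total gating overhead")   # 18.2 ms
print(cumulative_ms < cheapest_llm_step_ms)              # True -- far below one LLM call
```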
Why Custom Implementations Underperform
The regex-based filter's poor tail latency deserves explanation, because this pattern is extremely common in production. Many teams implement safety checks as string-matching middleware: "if the shell command contains `rm`, block it."
This approach has two problems:
- Unpredictable performance: hand-written patterns with nested quantifiers are prone to catastrophic backtracking, which is exactly what produced the 247 ms worst case on complex shell command strings. Latency becomes a function of input structure rather than policy complexity.
- Trivial evasion: `rm` can be obfuscated as `r""m`, invoked through a script, or replaced with equivalent commands like `unlink` or `shred`. A purpose-built policy engine evaluates the semantic action, not the string representation.
SafeClaw's policy engine architecture avoids both pitfalls by evaluating structured action descriptors against a compiled rule set, providing both consistent performance and semantic correctness.
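To make the evasion point concrete, the snippet below shows a naive substring check missing `r""m`, while a structured parse of the same command (here, a simple `shlex` tokenization standing in for a real policy engine's action parsing, not SafeClaw's implementation) still recovers the underlying `rm` invocation.

```python
import shlex

command = 'r""m -rf /tmp/project'   # the shell expands r""m to rm

# Naive string-matching middleware: the obfuscated form slips through.
naive_blocked = "rm" in command
print(naive_blocked)                 # False -- the filter waves it through

# Structured evaluation: tokenize the command the way the shell would,
# then inspect the resulting argv.
argv = shlex.split(command)
print(argv[0])                       # 'rm'
structured_blocked = argv[0] in {"rm", "unlink", "shred"}
print(structured_blocked)            # True -- the semantic action is caught
```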
Recommendations
Performance is not a valid reason to deploy AI agents without action gating. The data is clear: well-engineered safety adds effectively zero cost.
Benchmark methodology and raw data are available from 15 Research Lab upon request.