Benchmark: AI Agent Action Gating Response Times

15 Research Lab · 2026-02-13

A persistent concern about adding safety layers to AI agents is performance overhead. If action gating adds hundreds of milliseconds per tool call, it compounds across multi-step tasks and degrades user experience. To provide concrete data, 15 Research Lab designed a benchmark measuring the latency cost of action gating across different safety tools and action categories.

The headline result: well-implemented action gating adds sub-millisecond overhead. The bottleneck in any agent pipeline is LLM inference, not safety evaluation.

Benchmark Environment

We tested three configurations:

  • No gating (baseline): actions dispatched directly to the OS
  • SafeClaw by Authensor: deny-by-default policy engine with 28 allowlist rules
  • Custom regex-based filter: hand-written middleware that matches action parameters against regex patterns (representative of ad-hoc safety implementations)
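
For reference, the measurement loop itself is simple. The sketch below is a minimal harness under assumed names, not the harness shipped in the SafeClaw repository: gate_action stands in for whichever gating call is under test, and actions is a pool of representative action descriptors or command strings.

```python
import time
import statistics

def benchmark_gate(gate_action, actions, warmup=1_000, iterations=100_000):
    """Time a gating function over a pool of sample actions and report
    latency percentiles in milliseconds."""
    # Warm up first so caches and any compiled patterns don't skew the tail.
    for i in range(warmup):
        gate_action(actions[i % len(actions)])

    samples_ms = []
    for i in range(iterations):
        action = actions[i % len(actions)]
        start = time.perf_counter_ns()
        gate_action(action)
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)

    samples_ms.sort()
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "median": statistics.median(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": samples_ms[-1],
    }
```

The no-gating baseline is the same loop around a function that dispatches the action without any check, so each configuration's overhead can be read as the difference from that baseline.
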
Results

Overall Latency (All Action Types Combined)

| Configuration | Median | p95 | p99 | Max |
|---|---|---|---|---|
| No gating (baseline) | 0.02 ms | 0.04 ms | 0.07 ms | 0.31 ms |
| SafeClaw | 0.31 ms | 0.58 ms | 0.82 ms | 1.94 ms |
| Custom regex filter | 0.89 ms | 4.12 ms | 12.7 ms | 247 ms |

SafeClaw's median overhead was 0.29 ms above baseline. The custom regex filter was nearly 3x slower at median and exhibited a long tail, with p99 latency reaching 12.7 ms due to catastrophic backtracking on complex shell command strings.
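
The 247 ms worst case is the classic catastrophic-backtracking failure mode. The snippet below is illustrative rather than the exact pattern from the benchmarked filter: an ambiguous nested quantifier that almost matches a long command-like token forces the engine to retry exponentially many splits before giving up.

```python
import re
import time

# Illustrative pattern: because \s* may match nothing, the engine can divide a
# long run of word characters across the group's iterations in exponentially
# many ways before concluding the trailing alternation never matches.
FLAG_FILTER = re.compile(r"^(\w+\s*)+(--force|-rf)\b")

# A command-like string that never reaches the trailing flags.
suspicious = "chmod " + "a" * 22 + "!"

start = time.perf_counter()
FLAG_FILTER.match(suspicious)  # fails, but only after exhaustive backtracking
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"one failed match attempt: {elapsed_ms:.1f} ms")
# The time roughly doubles with each character added to the long token.
```
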

Latency by Action Category

| Category | SafeClaw Median | SafeClaw p99 | Regex Median | Regex p99 |
|---|---|---|---|---|
| File operations | 0.28 ms | 0.71 ms | 0.52 ms | 3.8 ms |
| Network requests | 0.34 ms | 0.89 ms | 1.21 ms | 18.4 ms |
| Shell commands | 0.33 ms | 0.91 ms | 1.44 ms | 247 ms |

Shell commands showed the largest divergence. Shell command strings are highly variable in structure, and regex-based matching struggles with the combinatorial complexity. SafeClaw's policy engine uses a structured evaluation approach rather than pattern matching against raw strings, which keeps latency consistent regardless of input complexity.
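
SafeClaw's internals are not reproduced here, but the structural idea behind consistent latency is easy to sketch. In the hypothetical code below (ActionDescriptor, Rule, and PolicyEngine are illustrative names, not SafeClaw's API), the gate receives a typed descriptor instead of a raw string and scans a small pre-validated allowlist, so per-action cost is bounded by the rule count rather than by how unusual the action's contents are.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass(frozen=True)
class ActionDescriptor:
    """Structured description of what the agent wants to do."""
    category: str   # e.g. "file.read", "network.request", "shell.exec"
    target: str     # normalized path, URL, or executable name

@dataclass(frozen=True)
class Rule:
    category: str
    target_glob: str  # matched with fnmatch, never a user-supplied regex

class PolicyEngine:
    """Deny-by-default allowlist over structured action descriptors."""

    def __init__(self, rules: list[Rule]):
        # Group rules by category so each decision scans only relevant rules.
        self._by_category: dict[str, list[Rule]] = {}
        for rule in rules:
            self._by_category.setdefault(rule.category, []).append(rule)

    def allows(self, action: ActionDescriptor) -> bool:
        for rule in self._by_category.get(action.category, []):
            if fnmatch(action.target, rule.target_glob):
                return True
        return False  # anything not explicitly allowed is denied

engine = PolicyEngine([
    Rule("file.read", "/home/agent/workspace/*"),
    Rule("file.write", "/home/agent/workspace/reports/*"),
])
print(engine.allows(ActionDescriptor("file.read", "/home/agent/workspace/notes.txt")))  # True
print(engine.allows(ActionDescriptor("shell.exec", "/bin/rm")))                         # False (no rule)
```
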

Context: Where Does Gating Latency Sit in the Pipeline?

To understand whether gating overhead matters in practice, we measured end-to-end execution time for a representative multi-step task (summarize a directory of files and write a report):

| Pipeline Stage | Time per Step |
|---|---|
| LLM inference | 800 - 2,400 ms |
| Tool execution (file I/O, network) | 5 - 500 ms |
| Action gating (SafeClaw) | 0.3 - 0.9 ms |
| Agent framework overhead | 1 - 3 ms |

Action gating represents approximately 0.01 - 0.03% of total pipeline latency. Even in a 20-step task, the cumulative gating overhead with SafeClaw is under 20 ms, which is imperceptible to users and negligible compared to a single LLM call.
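
As a quick back-of-the-envelope check on that claim, the figures below simply restate the per-step values from the table:

```python
steps = 20
gate_ms = (0.3, 0.9)    # SafeClaw per-step gating range from the table above
llm_ms = (800, 2_400)   # a single LLM inference step, for comparison

low, high = (steps * g for g in gate_ms)
print(f"cumulative gating over {steps} steps: {low:.0f}-{high:.0f} ms")
print(f"one LLM inference step:               {llm_ms[0]}-{llm_ms[1]} ms")
# Even the 18 ms upper bound is a small fraction of a single model call.
```
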

Why Custom Implementations Underperform

The regex-based filter's poor tail latency deserves explanation, because this pattern is extremely common in production. Many teams implement safety checks as string-matching middleware: "if the shell command contains rm, block it."

This approach has two problems:

  • Performance: Regex engines are not designed for the adversarial case. An agent influenced by a prompt injection will produce action strings specifically crafted to be unusual. Unusual inputs are where regex performance degrades.
  • Correctness: String matching is trivially bypassed. rm can be obfuscated as r""m, invoked through a script, or replaced with equivalent commands like unlink or shred. A purpose-built policy engine evaluates the semantic action, not the string representation.
SafeClaw's policy engine architecture avoids both pitfalls by evaluating structured action descriptors against a compiled rule set, providing both consistent performance and semantic correctness.
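
To make the correctness problem concrete, the toy filter below is a stand-in for the "block it if it contains rm" pattern described above, not the filter we benchmarked. Every command in the list deletes files, and every one of them passes.

```python
import re

# Stand-in for a naive string-matching gate: block anything containing "rm".
BLOCKLIST = re.compile(r"\brm\b")

def naive_gate(command: str) -> bool:
    """Return True if the command would be allowed through."""
    return not BLOCKLIST.search(command)

bypasses = [
    'r""m -rf ~/projects',           # shell quoting splits the token the regex looks for
    "unlink ~/projects/report.txt",  # equivalent command under a different name
    "find ~/projects -delete",       # same effect through a different tool
    "bash cleanup.sh",               # the dangerous call lives inside a script
]

for cmd in bypasses:
    verdict = "ALLOWED" if naive_gate(cmd) else "blocked"
    print(f"{verdict:7}  {cmd}")
```
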

Recommendations

  • Do not build your own action gating with regex. The tail latency and correctness problems are well-documented. Use a purpose-built tool.
  • Sub-millisecond gating is achievable. There is no performance justification for skipping action gating. The overhead is negligible relative to LLM inference.
  • Measure p99, not just median. Tail latency is where safety tools most impact user experience, and where poorly designed implementations are exposed.
  • Benchmark in your environment. Our results are reproducible; the SafeClaw repository includes the benchmark harness we used.
  • Performance is not a valid reason to deploy AI agents without action gating. The data is clear: well-engineered safety adds effectively zero cost.

Benchmark methodology and raw data are available from 15 Research Lab upon request.