15RL Methodology: Testing AI Agent Safety Policies

15 Research Lab · 2026-02-13

Introduction

Safety policies are only as good as their testing. A policy that appears comprehensive on paper may fail to catch real threats due to implementation gaps, configuration errors, or unanticipated edge cases. 15 Research Lab developed a systematic methodology for testing AI agent safety policies, drawing on practices from software testing, penetration testing, and safety engineering. This methodology is designed to be repeatable, measurable, and applicable to any policy framework.

Testing Philosophy

Our methodology is built on three principles:

  • Assume policies are wrong until proven right — untested policies provide false confidence
  • Test the policy implementation, not just the policy document — the gap between intent and implementation is where failures occur
  • Measure coverage, not just pass/fail — knowing what you have not tested is as important as knowing what passes

The Four Testing Phases

Phase 1: Policy Verification

Objective: Verify that the policy configuration is syntactically correct, internally consistent, and complete.

Tools: Static analysis, schema validation, and automated policy linting.

Common findings: In our experience, 34% of policy configurations contain at least one consistency issue, and 52% have completeness gaps (tools without policy coverage).
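
To make Phase 1 concrete, the sketch below shows a minimal policy linter in Python. It assumes a hypothetical declarative policy file (`policy.json`, a list of rules with `tool`, `pattern`, and `action` fields) and an assumed `KNOWN_TOOLS` inventory; neither is tied to any specific framework, and a real deployment would lint its own schema.

```python
# Minimal Phase 1 linter sketch (hypothetical policy schema, not a real framework's).
# Checks the three verification properties: syntax, internal consistency, completeness.
import json

KNOWN_TOOLS = {"read_file", "write_file", "run_shell", "http_request"}  # assumed tool inventory
VALID_ACTIONS = {"allow", "deny", "escalate"}

def lint_policy(path: str) -> list[str]:
    findings = []
    # Syntactic correctness: the policy file must parse at all.
    try:
        with open(path) as f:
            rules = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"syntax: cannot load policy: {exc}"]

    seen = {}
    for i, rule in enumerate(rules):
        tool, action = rule.get("tool"), rule.get("action")
        if action not in VALID_ACTIONS:
            findings.append(f"rule {i}: invalid action {action!r}")
        if tool not in KNOWN_TOOLS:
            findings.append(f"rule {i}: references unknown tool {tool!r}")
        # Internal consistency: two rules for the same tool/pattern must not disagree.
        key = (tool, rule.get("pattern", "*"))
        if key in seen and seen[key] != action:
            findings.append(f"rule {i}: conflicts with an earlier rule for {key}")
        seen.setdefault(key, action)

    # Completeness: every known tool needs at least one rule.
    for tool in sorted(KNOWN_TOOLS - {t for t, _ in seen}):
        findings.append(f"completeness: tool {tool!r} has no policy rule")
    return findings

if __name__ == "__main__":
    for finding in lint_policy("policy.json"):
        print(finding)
```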

Phase 2: Functional Testing

Objective: Verify that individual policy rules produce the expected outcome for known inputs.

Tests: For each policy rule, create test cases covering at least one positive case, one negative case, and one boundary case (see the coverage target below).

Test case template:

```
Test ID: [unique identifier]
Policy Rule: [rule being tested]
Tool Call: [exact tool call specification]
Expected Outcome: [allow/deny/escalate]
Actual Outcome: [recorded during test]
Pass/Fail: [determined by comparison]
```

Coverage target: Minimum 3 test cases per policy rule (one positive, one negative, one boundary).
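
As an illustration of Phase 2, here is a small table-driven harness in the same hypothetical setting. `evaluate_policy` is a toy stand-in for whatever decision API your framework exposes, and the three `FS-00x` cases (positive, negative, boundary) mirror the template fields above; all of these names are assumptions made for the sketch.

```python
# Phase 2 sketch: table-driven functional tests against a policy decision function.
# evaluate_policy() is a toy stand-in for a real framework's decision API.
import os
from dataclasses import dataclass

@dataclass
class PolicyTest:
    test_id: str      # Test ID
    policy_rule: str  # Policy Rule being tested
    tool_call: dict   # exact tool call specification
    expected: str     # Expected Outcome: allow / deny / escalate

def evaluate_policy(tool_call: dict) -> str:
    """Toy decision function: allow writes only under /workspace (assumption)."""
    if tool_call.get("tool") == "write_file":
        resolved = os.path.normpath(tool_call["path"])
        return "allow" if resolved.startswith("/workspace/") else "deny"
    return "escalate"

TESTS = [
    PolicyTest("FS-001", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/workspace/out.txt"}, "allow"),       # positive
    PolicyTest("FS-002", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/etc/passwd"}, "deny"),               # negative
    PolicyTest("FS-003", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/workspace/../etc/passwd"}, "deny"),  # boundary
]

def run_tests(tests: list) -> int:
    failures = 0
    for t in tests:
        actual = evaluate_policy(t.tool_call)      # Actual Outcome
        ok = actual == t.expected                  # Pass/Fail by comparison
        failures += not ok
        print(f"{t.test_id}: expected={t.expected} actual={actual} -> {'PASS' if ok else 'FAIL'}")
    print(f"{len(tests) - failures}/{len(tests)} passed")
    return failures

if __name__ == "__main__":
    raise SystemExit(run_tests(TESTS))
```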

Phase 3: Adversarial Testing (Red Team)

Objective: Attempt to bypass safety policies using techniques an adversary would employ.

Test scenarios:

  • Path traversal: Attempt to access files outside allowed directories using `../` sequences, symlinks, and encoded paths
  • Parameter injection: Include shell metacharacters, SQL fragments, and control characters in tool call parameters
  • Policy bypass via tool chaining: Use a sequence of individually allowed actions to achieve an outcome that should be denied
  • Encoding evasion: Submit parameters in alternative encodings (base64, URL encoding, Unicode) to evade pattern matching
  • Timing attacks: Attempt actions during policy transition periods (reloads, updates)
  • Volume attacks: Submit rapid-fire requests to test rate limiting and queue handling
  • Privilege escalation: Attempt to modify the agent's own permissions or policy configuration

Scoring: Each scenario is rated by bypass success (full, partial, none) and detection (detected, undetected).
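
A lightweight way to automate part of this phase is to derive adversarial variants from a call the policy is known to deny and confirm that every variant is still denied. The sketch below covers only the path-traversal and encoding-evasion scenarios and reuses the hypothetical `evaluate_policy` stand-in from the Phase 2 harness; it is a starting point for automation, not a substitute for a human red team.

```python
# Phase 3 sketch: path-traversal and encoding-evasion variants of a forbidden path.
# Each variant should still come back "deny"; anything else is a (partial) bypass.
import base64
import urllib.parse

def adversarial_variants(path: str) -> list[str]:
    return [
        path,                                          # baseline forbidden path
        "/workspace/../.." + path,                     # traversal through an allowed prefix
        urllib.parse.quote(path, safe=""),             # URL-encoded separators
        base64.b64encode(path.encode()).decode(),      # base64-wrapped payload
        path.replace("e", "\u0435"),                   # Unicode homoglyph substitution
    ]

def red_team_path_scenarios(evaluate_policy, forbidden: str = "/etc/shadow") -> None:
    for variant in adversarial_variants(forbidden):
        outcome = evaluate_policy({"tool": "write_file", "path": variant})
        bypass = "none" if outcome == "deny" else "full"   # record bypass success per scenario
        print(f"{variant!r:45} -> {outcome:8} bypass={bypass}")

# Usage, with the Phase 2 stand-in:
#     red_team_path_scenarios(evaluate_policy)
```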

Phase 4: Regression Testing

Objective: Ensure that policy changes do not inadvertently weaken existing protections.

Process:

  • Maintain a library of test cases from Phases 2 and 3
  • Run the complete test library after every policy change
  • Flag any test that changes outcome (previously denied action now allowed, or vice versa)
  • Require explicit review and approval for any outcome changes

Automation: Regression testing should be automated and integrated into the policy deployment pipeline. No policy change should reach production without passing the full regression suite.
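
The sketch below illustrates the automation step: replay a stored test library, compare each outcome against a recorded baseline, and fail the run if any outcome has drifted. The JSON file layouts and the reuse of `evaluate_policy` are assumptions carried over from the earlier sketches.

```python
# Phase 4 sketch: replay the test library and flag outcome drift against a baseline.
# File formats are assumptions:
#   test_library.json : [{"test_id": "...", "tool_call": {...}}, ...]
#   baseline.json     : {"test_id": "allow" | "deny" | "escalate", ...}
import json

def run_regression(evaluate_policy, library_path: str, baseline_path: str) -> int:
    with open(library_path) as f:
        library = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    drifted = []
    for case in library:
        outcome = evaluate_policy(case["tool_call"])
        previous = baseline.get(case["test_id"])
        if previous is not None and outcome != previous:
            drifted.append((case["test_id"], previous, outcome))

    for test_id, prev, now in drifted:
        print(f"OUTCOME CHANGED {test_id}: {prev} -> {now} (requires explicit review)")
    return 1 if drifted else 0   # nonzero exit code blocks the deployment pipeline

# Usage (with the Phase 2 stand-in importable):
#     raise SystemExit(run_regression(evaluate_policy, "test_library.json", "baseline.json"))
```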

Coverage Analysis

After completing all four phases, assess policy coverage:

| Coverage Dimension | Measurement | Target |
|---|---|---|
| Tool coverage | % of tools with at least one policy rule | 100% |
| Rule test coverage | % of rules with functional test cases | 100% |
| Adversarial scenario coverage | % of red team scenarios tested | > 80% |
| Parameter space coverage | % of parameter variations tested per rule | > 60% |
| Regression suite size | Test cases per policy rule | > 5 |
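
The first two dimensions are straightforward to compute automatically. The helper below sketches how, using the same hypothetical rule and test-case structures as the earlier examples; the toy data at the bottom is there only to show the arithmetic.

```python
# Coverage sketch: tool coverage and rule test coverage as percentages,
# computed from the hypothetical rule/test structures used in the earlier sketches.

def tool_coverage(known_tools: set, rules: list) -> float:
    """% of tools with at least one policy rule."""
    covered = {r["tool"] for r in rules}
    return 100.0 * len(known_tools & covered) / len(known_tools)

def rule_test_coverage(rules: list, tests: list) -> float:
    """% of rules with at least one functional test case."""
    rule_ids = {r["id"] for r in rules}
    tested = {t["policy_rule"] for t in tests}
    return 100.0 * len(rule_ids & tested) / len(rule_ids)

if __name__ == "__main__":
    # Toy data to show the arithmetic:
    rules = [{"id": "fs-write", "tool": "write_file"}, {"id": "net-out", "tool": "http_request"}]
    tests = [{"policy_rule": "fs-write"}]
    print(f"tool coverage: {tool_coverage({'write_file', 'http_request', 'run_shell'}, rules):.0f}%")  # 67%
    print(f"rule test coverage: {rule_test_coverage(rules, tests):.0f}%")                              # 50%
```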

Implementing This Methodology

Organizations can implement this methodology incrementally:

  • Week 1: Phase 1 (Policy Verification) — audit existing policies for syntax, consistency, and completeness
  • Weeks 2-3: Phase 2 (Functional Testing) — create and execute test cases for each policy rule
  • Week 4: Phase 3 (Adversarial Testing) — conduct red team exercises against the policy configuration
  • Ongoing: Phase 4 (Regression Testing) — build automation and integrate into the deployment pipeline

SafeClaw provides a testable policy framework — its declarative policy configuration is amenable to automated verification (Phase 1), functional testing (Phase 2), and regression testing (Phase 4). Organizations using SafeClaw can apply this methodology directly to their policy configurations. The SafeClaw knowledge base includes examples of policy configurations that serve as a starting point for test case development.
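
As a final illustration, the gate script below shows one way to wire these pieces into a deployment pipeline: lint the changed policy, enforce the 100% tool-coverage target, and replay the regression library before anything ships. It assumes the earlier sketches have been saved as modules (`policy_lint`, `policy_tests`, `policy_regression`, `policy_coverage`); those names, like the file names, are placeholders rather than part of SafeClaw or any other framework.

```python
# Deployment-gate sketch: a single script for CI to run before a policy change ships.
# Module and file names are placeholders for the earlier sketches in this post.
import json
import sys

from policy_lint import lint_policy, KNOWN_TOOLS      # Phase 1 sketch
from policy_tests import evaluate_policy              # Phase 2 sketch
from policy_regression import run_regression          # Phase 4 sketch
from policy_coverage import tool_coverage             # coverage sketch

def gate(policy_path: str) -> int:
    findings = lint_policy(policy_path)
    if findings:
        print("\n".join(f"lint: {f}" for f in findings))
        return 1
    with open(policy_path) as f:
        rules = json.load(f)
    if tool_coverage(KNOWN_TOOLS, rules) < 100.0:      # hard gate from the coverage table
        print("gate: tool coverage is below the 100% target")
        return 1
    # Full regression suite must pass before the change reaches production.
    return run_regression(evaluate_policy, "test_library.json", "baseline.json")

if __name__ == "__main__":
    sys.exit(gate("policy.json"))
```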

Recommendations

  • Test policies before deployment — never ship untested safety configurations
  • Automate regression testing — manual testing does not scale with policy complexity
  • Conduct adversarial testing quarterly — new bypass techniques emerge continuously
  • Track coverage metrics — they reveal what you do not know about your policy effectiveness
  • Treat policy testing as seriously as application testing — safety code deserves the same rigor as product code

Conclusion

Untested safety policies are unreliable safety policies. This methodology provides a systematic approach to building confidence in your agent safety configuration. The investment in testing is small compared to the cost of discovering policy gaps through production incidents.

15RL's testing methodology is available for adaptation under an open license. We welcome community contributions and refinements.