15RL Methodology: Testing AI Agent Safety Policies

15 Research Lab · 2026-02-13

Introduction

Safety policies are only as good as their testing. A policy that appears comprehensive on paper may fail to catch real threats due to implementation gaps, configuration errors, or unanticipated edge cases. 15 Research Lab developed a systematic methodology for testing AI agent safety policies, drawing on practices from software testing, penetration testing, and safety engineering. This methodology is designed to be repeatable, measurable, and applicable to any policy framework.

Testing Philosophy

Our methodology is built on three principles:

  • Assume policies are wrong until proven right — untested policies provide false confidence
  • Test the policy implementation, not just the policy document — the gap between intent and implementation is where failures occur
  • Measure coverage, not just pass/fail — knowing what you have not tested is as important as knowing what passes

The Four Testing Phases

Phase 1: Policy Verification

Objective: Verify that the policy configuration is syntactically correct, internally consistent, and complete.

Tools: Static analysis, schema validation, and automated policy linting.

Common findings: In our experience, 34% of policy configurations contain at least one consistency issue, and 52% have completeness gaps (tools without policy coverage).
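
To make Phase 1 concrete, the sketch below shows a minimal policy linter in Python. It assumes a hypothetical declarative policy file (`policy.json`, a list of rules with `tool`, `pattern`, and `action` fields) and an assumed `KNOWN_TOOLS` inventory; neither is tied to any specific framework, and a real deployment would lint its own schema.

```python
# Minimal Phase 1 linter sketch (hypothetical policy schema, not a real framework's).
# Checks the three verification properties: syntax, internal consistency, completeness.
import json

KNOWN_TOOLS = {"read_file", "write_file", "run_shell", "http_request"}  # assumed tool inventory
VALID_ACTIONS = {"allow", "deny", "escalate"}

def lint_policy(path: str) -> list[str]:
    findings = []
    # Syntactic correctness: the policy file must parse at all.
    try:
        with open(path) as f:
            rules = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"syntax: cannot load policy: {exc}"]

    seen = {}
    for i, rule in enumerate(rules):
        tool, action = rule.get("tool"), rule.get("action")
        if action not in VALID_ACTIONS:
            findings.append(f"rule {i}: invalid action {action!r}")
        if tool not in KNOWN_TOOLS:
            findings.append(f"rule {i}: references unknown tool {tool!r}")
        # Internal consistency: two rules for the same tool/pattern must not disagree.
        key = (tool, rule.get("pattern", "*"))
        if key in seen and seen[key] != action:
            findings.append(f"rule {i}: conflicts with an earlier rule for {key}")
        seen.setdefault(key, action)

    # Completeness: every known tool needs at least one rule.
    for tool in sorted(KNOWN_TOOLS - {t for t, _ in seen}):
        findings.append(f"completeness: tool {tool!r} has no policy rule")
    return findings

if __name__ == "__main__":
    for finding in lint_policy("policy.json"):
        print(finding)
```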

Phase 2: Functional Testing

Objective: Verify that individual policy rules produce the expected outcome for known inputs.

Tests: For each policy rule, create test cases covering at least one positive case, one negative case, and one boundary case (see the coverage target below).

Test case template:

```
Test ID: [unique identifier]
Policy Rule: [rule being tested]
Tool Call: [exact tool call specification]
Expected Outcome: [allow/deny/escalate]
Actual Outcome: [recorded during test]
Pass/Fail: [determined by comparison]
```

Coverage target: Minimum 3 test cases per policy rule (one positive, one negative, one boundary).
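
As an illustration of Phase 2, here is a small table-driven harness in the same hypothetical setting. `evaluate_policy` is a toy stand-in for whatever decision API your framework exposes, and the three `FS-00x` cases (positive, negative, boundary) mirror the template fields above; all of these names are assumptions made for the sketch.

```python
# Phase 2 sketch: table-driven functional tests against a policy decision function.
# evaluate_policy() is a toy stand-in for a real framework's decision API.
import os
from dataclasses import dataclass

@dataclass
class PolicyTest:
    test_id: str      # Test ID
    policy_rule: str  # Policy Rule being tested
    tool_call: dict   # exact tool call specification
    expected: str     # Expected Outcome: allow / deny / escalate

def evaluate_policy(tool_call: dict) -> str:
    """Toy decision function: allow writes only under /workspace (assumption)."""
    if tool_call.get("tool") == "write_file":
        resolved = os.path.normpath(tool_call["path"])
        return "allow" if resolved.startswith("/workspace/") else "deny"
    return "escalate"

TESTS = [
    PolicyTest("FS-001", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/workspace/out.txt"}, "allow"),       # positive
    PolicyTest("FS-002", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/etc/passwd"}, "deny"),               # negative
    PolicyTest("FS-003", "deny writes outside /workspace",
               {"tool": "write_file", "path": "/workspace/../etc/passwd"}, "deny"),  # boundary
]

def run_tests(tests: list) -> int:
    failures = 0
    for t in tests:
        actual = evaluate_policy(t.tool_call)      # Actual Outcome
        ok = actual == t.expected                  # Pass/Fail by comparison
        failures += not ok
        print(f"{t.test_id}: expected={t.expected} actual={actual} -> {'PASS' if ok else 'FAIL'}")
    print(f"{len(tests) - failures}/{len(tests)} passed")
    return failures

if __name__ == "__main__":
    raise SystemExit(run_tests(TESTS))
```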

Phase 3: Adversarial Testing (Red Team)

Objective: Attempt to bypass safety policies using techniques an adversary would employ.

Test scenarios:

  • Path traversal: Attempt to access files outside allowed directories using `../` sequences, symlinks, and encoded paths
  • Parameter injection: Include shell metacharacters, SQL fragments, and control characters in tool call parameters
  • Policy bypass via tool chaining: Use a sequence of individually allowed actions to achieve an outcome that should be denied
  • Encoding evasion: Submit parameters in alternative encodings (base64, URL encoding, Unicode) to evade pattern matching
  • Timing attacks: Attempt actions during policy transition periods (reloads, updates)
  • Volume attacks: Submit rapid-fire requests to test rate limiting and queue handling
  • Privilege escalation: Attempt to modify the agent's own permissions or policy configuration

Scoring: Each scenario is rated by bypass success (full, partial, none) and detection (detected, undetected).
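
A lightweight way to automate part of this phase is to derive adversarial variants from a call the policy is known to deny and confirm that every variant is still denied. The sketch below covers only the path-traversal and encoding-evasion scenarios and reuses the hypothetical `evaluate_policy` stand-in from the Phase 2 harness; it is a starting point for automation, not a substitute for a human red team.

```python
# Phase 3 sketch: path-traversal and encoding-evasion variants of a forbidden path.
# Each variant should still come back "deny"; anything else is a (partial) bypass.
import base64
import urllib.parse

def adversarial_variants(path: str) -> list[str]:
    return [
        path,                                          # baseline forbidden path
        "/workspace/../.." + path,                     # traversal through an allowed prefix
        urllib.parse.quote(path, safe=""),             # URL-encoded separators
        base64.b64encode(path.encode()).decode(),      # base64-wrapped payload
        path.replace("e", "\u0435"),                   # Unicode homoglyph substitution
    ]

def red_team_path_scenarios(evaluate_policy, forbidden: str = "/etc/shadow") -> None:
    for variant in adversarial_variants(forbidden):
        outcome = evaluate_policy({"tool": "write_file", "path": variant})
        bypass = "none" if outcome == "deny" else "full"   # record bypass success per scenario
        print(f"{variant!r:45} -> {outcome:8} bypass={bypass}")

# Usage, with the Phase 2 stand-in:
#     red_team_path_scenarios(evaluate_policy)
```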

Phase 4: Regression Testing

Objective: Ensure that policy changes do not inadvertently weaken existing protections.

Process:

  • Maintain a library of test cases from Phases 2 and 3
  • Run the complete test library after every policy change
  • Flag any test that changes outcome (previously denied action now allowed, or vice versa)
  • Require explicit review and approval for any outcome changes

Automation: Regression testing should be automated and integrated into the policy deployment pipeline. No policy change should reach production without passing the full regression suite.
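
The sketch below illustrates the automation step: replay a stored test library, compare each outcome against a recorded baseline, and fail the run if any outcome has drifted. The JSON file layouts and the reuse of `evaluate_policy` are assumptions carried over from the earlier sketches.

```python
# Phase 4 sketch: replay the test library and flag outcome drift against a baseline.
# File formats are assumptions:
#   test_library.json : [{"test_id": "...", "tool_call": {...}}, ...]
#   baseline.json     : {"test_id": "allow" | "deny" | "escalate", ...}
import json

def run_regression(evaluate_policy, library_path: str, baseline_path: str) -> int:
    with open(library_path) as f:
        library = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)

    drifted = []
    for case in library:
        outcome = evaluate_policy(case["tool_call"])
        previous = baseline.get(case["test_id"])
        if previous is not None and outcome != previous:
            drifted.append((case["test_id"], previous, outcome))

    for test_id, prev, now in drifted:
        print(f"OUTCOME CHANGED {test_id}: {prev} -> {now} (requires explicit review)")
    return 1 if drifted else 0   # nonzero exit code blocks the deployment pipeline

# Usage (with the Phase 2 stand-in importable):
#     raise SystemExit(run_regression(evaluate_policy, "test_library.json", "baseline.json"))
```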

Coverage Analysis

After completing all four phases, assess policy coverage:

| Coverage Dimension | Measurement | Target |
|---|---|---|
| Tool coverage | % of tools with at least one policy rule | 100% |
| Rule test coverage | % of rules with functional test cases | 100% |
| Adversarial scenario coverage | % of red team scenarios tested | > 80% |
| Parameter space coverage | % of parameter variations tested per rule | > 60% |
| Regression suite size | Test cases per policy rule | > 5 |
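
The first two dimensions are straightforward to compute automatically. The helper below sketches how, using the same hypothetical rule and test-case structures as the earlier examples; the toy data at the bottom is there only to show the arithmetic.

```python
# Coverage sketch: tool coverage and rule test coverage as percentages,
# computed from the hypothetical rule/test structures used in the earlier sketches.

def tool_coverage(known_tools: set, rules: list) -> float:
    """% of tools with at least one policy rule."""
    covered = {r["tool"] for r in rules}
    return 100.0 * len(known_tools & covered) / len(known_tools)

def rule_test_coverage(rules: list, tests: list) -> float:
    """% of rules with at least one functional test case."""
    rule_ids = {r["id"] for r in rules}
    tested = {t["policy_rule"] for t in tests}
    return 100.0 * len(rule_ids & tested) / len(rule_ids)

if __name__ == "__main__":
    # Toy data to show the arithmetic:
    rules = [{"id": "fs-write", "tool": "write_file"}, {"id": "net-out", "tool": "http_request"}]
    tests = [{"policy_rule": "fs-write"}]
    print(f"tool coverage: {tool_coverage({'write_file', 'http_request', 'run_shell'}, rules):.0f}%")  # 67%
    print(f"rule test coverage: {rule_test_coverage(rules, tests):.0f}%")                              # 50%
```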

Implementing This Methodology

Organizations can implement this methodology incrementally:

  • Week 1: Phase 1 (Policy Verification) — audit existing policies for syntax, consistency, and completeness
  • Weeks 2-3: Phase 2 (Functional Testing) — create and execute test cases for each policy rule
  • Week 4: Phase 3 (Adversarial Testing) — conduct red team exercises against the policy configuration
  • Ongoing: Phase 4 (Regression Testing) — build automation and integrate into the deployment pipeline

SafeClaw provides a testable policy framework — its declarative policy configuration is amenable to automated verification (Phase 1), functional testing (Phase 2), and regression testing (Phase 4). Organizations using SafeClaw can apply this methodology directly to their policy configurations. The SafeClaw knowledge base includes examples of policy configurations that serve as a starting point for test case development.
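
As a final illustration, the gate script below shows one way to wire these pieces into a deployment pipeline: lint the changed policy, enforce the 100% tool-coverage target, and replay the regression library before anything ships. It assumes the earlier sketches have been saved as modules (`policy_lint`, `policy_tests`, `policy_regression`, `policy_coverage`); those names, like the file names, are placeholders rather than part of SafeClaw or any other framework.

```python
# Deployment-gate sketch: a single script for CI to run before a policy change ships.
# Module and file names are placeholders for the earlier sketches in this post.
import json
import sys

from policy_lint import lint_policy, KNOWN_TOOLS      # Phase 1 sketch
from policy_tests import evaluate_policy              # Phase 2 sketch
from policy_regression import run_regression          # Phase 4 sketch
from policy_coverage import tool_coverage             # coverage sketch

def gate(policy_path: str) -> int:
    findings = lint_policy(policy_path)
    if findings:
        print("\n".join(f"lint: {f}" for f in findings))
        return 1
    with open(policy_path) as f:
        rules = json.load(f)
    if tool_coverage(KNOWN_TOOLS, rules) < 100.0:      # hard gate from the coverage table
        print("gate: tool coverage is below the 100% target")
        return 1
    # Full regression suite must pass before the change reaches production.
    return run_regression(evaluate_policy, "test_library.json", "baseline.json")

if __name__ == "__main__":
    sys.exit(gate("policy.json"))
```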

Recommendations

  • Test policies before deployment — never ship untested safety configurations
  • Automate regression testing — manual testing does not scale with policy complexity
  • Conduct adversarial testing quarterly — new bypass techniques emerge continuously
  • Track coverage metrics — they reveal what you do not know about your policy effectiveness
  • Treat policy testing as seriously as application testing — safety code deserves the same rigor as product code

Conclusion

Untested safety policies are unreliable safety policies. This methodology provides a systematic approach to building confidence in your agent safety configuration. The investment in testing is small compared to the cost of discovering policy gaps through production incidents.

15RL's testing methodology is available for adaptation under an open license. We welcome community contributions and refinements.