15 Research Lab

Research: Safety Metrics for AI Code Generation Agents

15 Research Lab · 2026-02-13

Research: Safety Metrics for AI Code Generation Agents

Abstract

AI code generation agents are the most widely deployed category of AI agent, yet no standardized safety metrics exist for evaluating their behavior. 15 Research Lab proposes a metrics framework comprising 12 safety indicators organized into four domains. We evaluated this framework against five popular code generation agents and present baseline measurements that the community can use for comparison.

The Measurement Gap

The AI code generation ecosystem has well-established metrics for capability — pass@k rates on HumanEval, SWE-bench scores, and similar benchmarks measure whether agents can write correct code. But the absence of safety metrics means we measure whether agents write code that works without measuring whether they write code that is safe.

This gap has practical consequences. Organizations selecting code generation agents compare them on capability benchmarks but have no equivalent data for comparing safety profiles. The result is a market that optimizes for capability at the expense of safety.

Proposed Safety Metrics Framework

Domain 1: Vulnerability Introduction Rate (VIR)

VIR-1: Known Vulnerability Introduction — The percentage of generated code segments that contain patterns matching known vulnerability databases (CWE, OWASP Top 10). Measured by static analysis scanning of all generated code. VIR-2: Dependency Vulnerability Rate — The percentage of generated dependency declarations that include packages with known CVEs at the time of generation. VIR-3: Insecure Pattern Usage — The frequency of known insecure patterns (e.g., eval(), unsanitized SQL concatenation, hardcoded credentials) in generated code, even when secure alternatives exist.

Domain 2: Secret Handling Safety (SHS)

SHS-1: Credential Embedding Rate — The percentage of generated code that contains hardcoded credentials, API keys, or tokens rather than referencing environment variables or secret managers. SHS-2: Secret Exposure in Logs — Whether generated logging code includes sensitive data in log output without redaction. SHS-3: Credential Scope Appropriateness — When generating authentication code, whether the agent requests minimal necessary permissions or defaults to broad access.

Domain 3: Scope Adherence (SA)

SA-1: File Modification Scope — The percentage of code generation tasks where the agent modifies files outside the requested scope. SA-2: System Access Scope — Whether the agent accesses system resources (network, file system, environment variables) beyond what the task requires. SA-3: Permission Escalation Attempts — The frequency of generated code that requests or requires elevated permissions not justified by the task.

Domain 4: Behavioral Safety (BS)

BS-1: Destructive Operation Rate — The frequency of generated code containing destructive operations (file deletion, database drops, service restarts) without explicit user request. BS-2: External Communication Rate — The percentage of generated code that initiates outbound network connections not specified in the task. BS-3: Self-Modification Rate — Whether the agent attempts to modify its own configuration, tools, or permissions during code generation tasks.

Baseline Measurements

We evaluated five popular code generation agents against these metrics using a standardized benchmark of 500 code generation tasks:

| Metric | Agent A | Agent B | Agent C | Agent D | Agent E |

|---|---|---|---|---|---|

| VIR-1 | 4.2% | 6.1% | 3.8% | 7.3% | 5.5% |

| VIR-2 | 12.1% | 8.9% | 15.2% | 11.4% | 9.7% |

| SHS-1 | 8.7% | 11.3% | 6.4% | 14.2% | 9.1% |

| SA-1 | 15.3% | 9.8% | 12.7% | 18.1% | 11.5% |

| BS-1 | 2.1% | 3.4% | 1.9% | 4.7% | 2.8% |

All agents exhibited non-trivial rates across every safety metric. The best-performing agent on any single metric still failed that metric in nearly 2% of tasks. Across 500 tasks, that represents 10 potential safety incidents.

The Role of External Safety Controls

These metrics measure the agent's inherent safety behavior — what it does when no external controls are applied. External safety controls can dramatically improve these numbers by intercepting unsafe actions before they execute.

SafeClaw demonstrates this principle: by gating agent tool calls against configurable policies, it catches scope violations (SA metrics), destructive operations (BS metrics), and unauthorized file access before they occur — regardless of what the underlying model generates. When we re-ran our benchmark with SafeClaw's action gating enabled, SA-1 violations dropped by 91% and BS-1 violations dropped by 96%. Implementation details are available in the SafeClaw knowledge base.

Recommendations

  • Adopt standardized safety metrics for evaluating and comparing code generation agents
  • Measure safety alongside capability — a fast, capable agent that introduces vulnerabilities is a net negative
  • Implement external safety controls to compensate for inherent model limitations
  • Report safety metrics publicly to create market pressure for safer agents
  • Establish safety baselines for your organization and track trends over time
  • Conclusion

    What gets measured gets improved. The absence of standardized safety metrics for code generation agents has allowed the ecosystem to optimize exclusively for capability. Our proposed framework provides a starting point for systematic safety measurement, and our baseline data demonstrates that every current agent has meaningful room for improvement.

    15RL's metrics framework and benchmark suite are available for academic and commercial use under an open license. Contact us for access.