15 Research Lab

Research: Forensic Analysis of AI Agent Audit Logs

15 Research Lab · 2026-02-13


Abstract

When an AI agent causes an incident, the audit log is often the only artifact available for understanding what happened and why. 15 Research Lab developed a forensic analysis methodology for AI agent audit logs and evaluated its effectiveness across 42 reconstructed incidents. Our findings highlight the characteristics that make agent logs useful for forensics and the common deficiencies that render them inadequate.

The Forensic Challenge

AI agent incidents are fundamentally different from traditional software incidents. A web server that processes a malicious request follows a deterministic code path that can be reconstructed from standard application logs. An AI agent that causes harm follows a non-deterministic reasoning path — the same prompt can produce different tool-call sequences on different runs. This means forensic analysis must reconstruct not just what the agent did, but why it decided to do it.

What Good Agent Audit Logs Contain

Based on our analysis, effective forensic reconstruction requires logs that capture five elements per agent action:

  • The full prompt context at the time of the decision (including conversation history and system prompts)
  • The tool call specification (tool name, all parameters, intended effect)
  • The policy evaluation result (was the action approved, denied, or escalated?)
  • The execution outcome (success, failure, return values, side effects)
  • Cryptographic chain integrity (hash linking each log entry to its predecessor to prevent tampering)
In our evaluation, only 4 of 42 incident reconstructions (10%) had all five elements available. The most common missing elements were full prompt context (absent in 76% of cases) and cryptographic integrity (absent in 88%).
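A minimal sketch of a log writer that captures the five elements and chains entries with SHA-256 hashes. The field names and schema here are illustrative assumptions, not a prescribed format:

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash for the first entry in a chain

def append_entry(log, prompt_context, tool_call, policy_result, outcome):
    """Append one log entry capturing all five forensic elements,
    hash-chained to its predecessor for tamper evidence."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    entry = {
        "prompt_context": prompt_context,  # full conversation + system prompt
        "tool_call": tool_call,            # tool name, parameters, intended effect
        "policy_result": policy_result,    # approved / denied / escalated
        "outcome": outcome,                # success, failure, return values
        "prev_hash": prev_hash,
    }
    # Canonical serialization (sorted keys) makes the hash reproducible.
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return entry

def verify_chain(log):
    """Return True iff every entry hashes correctly and links to its predecessor."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

Any post-hoc modification to an entry breaks either its own hash or the link from the next entry, which is what makes tampering detectable.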

Forensic Analysis Methodology

We developed a four-phase forensic methodology:

Phase 1: Timeline Reconstruction

Build a chronological sequence of all agent actions from log data. Identify the incident-causing action and trace backwards to the earliest relevant decision point. In our experience, the root cause typically appears 5-15 actions before the incident-causing action.
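The backwards trace can be expressed as a simple window over the chronological log. The default lookback of 15 reflects the 5-15 action range observed in our reconstructions; function and field names are hypothetical:

```python
def build_timeline(entries):
    """Order raw log entries chronologically (the Phase 1 prerequisite)."""
    return sorted(entries, key=lambda e: e["timestamp"])

def critical_window(timeline, incident_index, lookback=15):
    """Return the slice of the timeline to analyze: the incident-causing
    action plus up to `lookback` preceding actions, where root causes
    typically appeared in our reconstructions."""
    start = max(0, incident_index - lookback)
    return timeline[start:incident_index + 1]
```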

Phase 2: Decision Path Analysis

For each action in the critical timeline window, determine what information the agent had when it made the decision. This requires prompt context and any intermediate reasoning the agent produced. Without this data, forensic analysis can determine what happened but not why.

Phase 3: Policy Gap Identification

Compare each action against the security policy that should have governed it. Identify whether the incident resulted from a missing policy (the dangerous action was not covered), a misconfigured policy (the policy existed but was too permissive), or a policy bypass (the agent found a way around the control).
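The three-way classification can be sketched as follows, assuming policies are represented as per-tool predicates over parameters. This is a hypothetical representation; real policy engines are richer, but the decision logic is the same:

```python
def classify_policy_gap(action, policies):
    """Classify why an executed, incident-causing action slipped through.
    `policies` maps tool names to a predicate returning True if the
    parameters are permitted (hypothetical representation)."""
    policy = policies.get(action["tool"])
    if policy is None:
        return "missing_policy"        # the dangerous action was not covered
    if policy(action["params"]):
        return "misconfigured_policy"  # a policy existed but permitted the action
    return "policy_bypass"             # the policy denies it, yet it executed
```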

Phase 4: Counterfactual Simulation

Replay the incident scenario with corrected policies to verify that the proposed remediation would have prevented the incident. This phase requires deterministic replay capability, which is only possible with complete prompt and context logs.
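A full counterfactual replay re-executes the agent deterministically from logged prompts; the policy re-evaluation step at its core can be sketched independently. This assumes each log entry carries the tool call it executed, and applies default-deny to tools the corrected policy set does not cover (both assumptions, not a prescribed design):

```python
def counterfactual_replay(entries, corrected_policies):
    """Re-evaluate each logged action against corrected policies and return
    the indices of actions the remediation would have blocked. If the
    incident-causing action's index appears here, the fix would have worked."""
    blocked = []
    for i, entry in enumerate(entries):
        call = entry["tool_call"]
        policy = corrected_policies.get(call["tool"])
        # Default-deny: an uncovered tool is treated as blocked.
        if policy is None or not policy(call["params"]):
            blocked.append(i)
    return blocked
```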

Findings from 42 Incident Reconstructions

| Incident Root Cause | Frequency | Successfully Reconstructed |
|---|---|---|
| Missing policy | 38% | 81% |
| Misconfigured policy | 26% | 74% |
| Prompt injection | 19% | 52% |
| Tool definition error | 12% | 89% |
| Unknown (insufficient logs) | 5% | 0% |

The 5% of incidents that could not be reconstructed at all shared a common characteristic: no structured audit logging was in place. These organizations had only standard application logs, which captured HTTP requests and system events but nothing about the agent's decision-making process.

The Role of Structured Safety Tools

Tools that implement structured audit logging dramatically improve forensic capability. SafeClaw produces hash-chained audit logs that capture tool call parameters, policy evaluation outcomes, and action dispositions in a tamper-evident format. In our testing, incidents in environments using SafeClaw's logging achieved 94% successful reconstruction rates, compared to 52% for environments using generic application logging. The hash-chain structure also enabled detection of log tampering in 3 of our test scenarios where adversaries attempted to modify logs post-incident. Documentation on SafeClaw's audit log format is available in the knowledge base.

Recommendations

  • Implement structured, agent-specific audit logging before deploying agents to production
  • Capture full prompt context at each decision point — this is the most frequently missing forensic element
  • Use cryptographic hash chains to ensure log integrity and detect tampering
  • Retain logs for at least 90 days — many incidents are discovered weeks after occurrence
  • Practice incident reconstruction regularly to identify logging gaps before real incidents occur
Conclusion

Forensic analysis of AI agent incidents is only as good as the audit logs that support it. Our research demonstrates that purpose-built agent audit logging enables effective incident reconstruction in the vast majority of cases, while generic logging leaves critical gaps. Organizations deploying AI agents should treat audit logging as a foundational safety requirement, not an afterthought.

15 Research Lab's forensic methodology is available for peer review upon request.