15RL Study: The True Cost of Uncontrolled AI Agent Spending
AI agents consume resources every time they make an LLM API call, execute a tool, access a database, or spin up compute. Unlike human-driven workflows, where cost is naturally bounded by the speed of human operators, autonomous agents can consume resources at machine speed, 24 hours a day, without human oversight.
To quantify the financial risk of insufficient cost controls, 15 Research Lab analyzed 156 cost-related incidents reported by organizations deploying AI agents.
Incident Data Overview
Our dataset of 156 cost incidents spans July 2025 through January 2026, collected from partner organizations, public incident reports, and community disclosures.
| Metric | Value |
|---|---|
| Total incidents analyzed | 156 |
| Median cost overrun (per incident) | $2,840 |
| Mean cost overrun (per incident) | $8,420 |
| Maximum single incident | $147,000 |
| Median time to detection | 6.3 hours |
| Incidents exceeding $10,000 | 23 (14.7%) |
The distribution is heavily right-skewed: most incidents cause moderate financial damage, but the tail is severe. The top 10% of incidents accounted for 71% of total costs.
Root Cause Analysis
We classified each incident by its primary root cause:
| Root Cause | Count | % | Median Cost |
|---|---|---|---|
| Infinite retry/loop | 52 | 33.3% | $3,200 |
| Excessive LLM API calls | 38 | 24.4% | $4,100 |
| Unintended cloud resource provisioning | 24 | 15.4% | $12,800 |
| Bulk data processing beyond scope | 21 | 13.5% | $1,900 |
| Third-party API overconsumption | 14 | 9.0% | $2,300 |
| Other | 7 | 4.5% | $1,100 |
Infinite Retry Loops (33.3%)
The most common cause: an agent encounters an error, retries, encounters the same error, and loops indefinitely. Standard retry logic with exponential backoff helps, but agents frequently implement their own retry strategies inside their reasoning loop, bypassing application-level retry controls.
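One mitigation is to enforce retry bounds at the tool-execution layer, where the agent's reasoning loop cannot override them. The sketch below (Python, with a hypothetical `call_api` callable) caps both the attempt count and the total wall-clock time; the specific limits are illustrative assumptions, not recommendations.

```python
import time

MAX_ATTEMPTS = 5          # hard cap on retries, independent of agent reasoning
MAX_ELAPSED_SECONDS = 60  # hard cap on total time spent retrying

def call_with_bounded_retry(call_api, *args, **kwargs):
    """Retry with exponential backoff, but fail permanently once a cap is hit."""
    start = time.monotonic()
    last_exc = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_api(*args, **kwargs)
        except Exception as exc:
            last_exc = exc
            if time.monotonic() - start >= MAX_ELAPSED_SECONDS:
                break
            time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    # Escalate to a human instead of letting the agent improvise more retries.
    raise RuntimeError(f"giving up after {attempt + 1} attempts") from last_exc
```

Because the cap lives in the tool wrapper rather than in the agent's prompt, an agent that decides to "try again" simply receives a terminal error it must report.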
Example: An agent tasked with data extraction from an API received a malformed response. Rather than reporting the error, it retried 840,000 times over 14 hours, generating $18,000 in API charges.
Excessive LLM API Calls (24.4%)
Agents that decompose tasks into many small steps, or that engage in extended internal deliberation, can make far more LLM calls than expected. A task estimated to require 5-10 LLM calls might actually require 50-100 when the agent encounters ambiguity.
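A blunt but effective guard is a per-task call ceiling enforced by the client wrapper itself. The sketch below assumes a hypothetical client exposing a `complete` method; the ceiling of 50 calls is an illustrative assumption.

```python
class CallBudgetExceeded(Exception):
    """Raised when a task hits its LLM-call ceiling."""

class BudgetedLLMClient:
    """Wraps any LLM client and refuses calls beyond a per-task ceiling."""

    def __init__(self, client, max_calls: int = 50):
        self._client = client
        self._max_calls = max_calls
        self._calls_made = 0

    def complete(self, prompt: str, **kwargs):
        if self._calls_made >= self._max_calls:
            # Surface the overrun instead of silently allowing call 51, 52, ...
            raise CallBudgetExceeded(
                f"task made {self._calls_made} LLM calls; "
                f"ceiling is {self._max_calls}"
            )
        self._calls_made += 1
        return self._client.complete(prompt, **kwargs)
```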
Example: A research agent asked to summarize a topic made 347 LLM calls as it recursively expanded its search, following tangential references. The task was estimated at $0.50 in API costs; actual cost was $47.
Unintended Cloud Resource Provisioning (15.4%)
The most expensive category by median cost. Agents with infrastructure access (Terraform, AWS CLI, cloud APIs) created resources that were not intended or not cleaned up. A single misconfigured provisioning action can spin up expensive GPU instances or large database clusters.
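One mitigation, sketched below under assumed policy values, is a fail-closed pre-execution check that holds provisioning requests to a small pre-approved envelope and routes everything else to a human. The instance types and ceilings here are illustrative only; real envelopes are organization-specific.

```python
# Assumed policy values for illustration only.
ALLOWED_INSTANCE_TYPES = {"t3.medium", "t3.large"}  # deliberately excludes GPU types
MAX_NODE_COUNT = 4

def validate_provisioning_request(instance_type: str, node_count: int) -> None:
    """Reject any request outside the pre-approved envelope before it executes."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise PermissionError(
            f"instance type {instance_type!r} is not pre-approved; "
            "route to a human for sign-off"
        )
    if node_count > MAX_NODE_COUNT:
        raise PermissionError(
            f"node count {node_count} exceeds the ceiling of {MAX_NODE_COUNT}"
        )
```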
Example: An infrastructure agent interpreted "set up the staging environment" as provisioning a 16-node GPU cluster. The cluster ran for a weekend before discovery, costing $47,000.
The Detection Lag Problem
The median time to detection of 6.3 hours is perhaps the most concerning finding. During business hours, someone eventually notices the anomaly. Outside business hours, agents can accumulate costs unchecked for 12-36 hours.
| Detection Method | Median Detection Time | % of Incidents |
|---|---|---|
| Cloud provider billing alert | 8.2 hours | 31% |
| Manual observation | 4.1 hours | 28% |
| Application error/crash | 1.8 hours | 22% |
| Automated monitoring | 0.4 hours | 12% |
| Third-party API rate limit | 2.6 hours | 7% |
Only 12% of incidents were caught by automated monitoring, and those had a dramatically lower median detection time (24 minutes vs. 8.2 hours for billing alerts). The data strongly supports investing in real-time cost monitoring.
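As a rough illustration of what such monitoring can look like, the sketch below tracks spend in a sliding window and alerts when a threshold is crossed. The window and threshold values are assumptions, and in production the alert callback would page an on-call human rather than print.

```python
import time
from collections import deque

class SpendRateMonitor:
    """Fires an alert when spend inside a sliding window crosses a threshold."""

    def __init__(self, window_seconds: float = 300.0,
                 max_spend_per_window: float = 5.0, alert=print):
        self._window = window_seconds
        self._max_spend = max_spend_per_window
        self._alert = alert
        self._events = deque()  # (timestamp, cost) pairs within the window

    def record(self, cost: float) -> None:
        now = time.monotonic()
        self._events.append((now, cost))
        # Evict events that have aged out of the window.
        while self._events and self._events[0][0] < now - self._window:
            self._events.popleft()
        total = sum(c for _, c in self._events)
        if total > self._max_spend:
            self._alert(
                f"ALERT: ${total:.2f} spent in the last {self._window:.0f}s "
                f"(threshold ${self._max_spend:.2f})"
            )
```

Each tool execution would call `record()` with its estimated cost, so an anomaly surfaces within minutes rather than at the next billing cycle.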
Cost Control Architecture
Based on our analysis, effective cost control for AI agents requires controls at three levels:
Level 1: Per-Action Limits
Every individual action should have a cost ceiling. An LLM call should have a maximum token limit. An API request should have a cost cap. A compute provisioning action should have a resource ceiling.
This prevents any single action from being catastrophically expensive.
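A minimal sketch of this idea, using assumed action types and dollar values: every action passes a pre-execution check against a static ceiling table, and unknown action types are denied rather than waved through.

```python
# Illustrative ceilings; the dollar values are assumptions, not recommendations.
ACTION_COST_CEILINGS = {
    "llm_call": 0.25,        # max estimated dollars for a single LLM call
    "api_request": 0.10,
    "provision_compute": 5.00,
}

def enforce_action_ceiling(action_type: str, estimated_cost: float) -> None:
    """Deny any single action whose estimated cost exceeds its ceiling."""
    ceiling = ACTION_COST_CEILINGS.get(action_type)
    if ceiling is None:
        # Fail closed: an action type without a ceiling is denied by default.
        raise PermissionError(f"no ceiling defined for {action_type!r}")
    if estimated_cost > ceiling:
        raise PermissionError(
            f"{action_type} estimated at ${estimated_cost:.2f}, "
            f"over the ${ceiling:.2f} per-action ceiling"
        )
```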
Level 2: Per-Session Budgets
Each agent session (task execution) should have a total budget. When the budget is exhausted, the agent should stop and report, not continue attempting the task.
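A stripped-down sketch of the pattern (the class and values here are illustrative assumptions, not any particular product's implementation): each action is charged against the session total before it executes, and exhaustion raises an exception the agent cannot reason its way past.

```python
class BudgetExhausted(Exception):
    """Raised when a session runs out of funds; the agent stops and reports."""

class SessionBudget:
    """Tracks cumulative spend for one agent session against a fixed total."""

    def __init__(self, total_budget: float):
        self._total = total_budget
        self._spent = 0.0

    def charge(self, cost: float) -> None:
        """Charge an action's estimated cost, or refuse it entirely."""
        if self._spent + cost > self._total:
            raise BudgetExhausted(
                f"session budget ${self._total:.2f} would be exceeded "
                f"(spent ${self._spent:.2f}, next action ${cost:.2f})"
            )
        self._spent += cost
```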
SafeClaw implements per-session budget controls as part of its action-gating policy engine. Actions that would exceed the session budget are denied, and the denial is logged with the budget context.
Level 3: Global Rate Limits
Organization-wide rate limits on LLM API calls, cloud resource provisioning, and third-party API consumption provide a backstop when per-session controls fail or are misconfigured.
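A classic way to implement such a backstop is a token bucket shared across all agents. The sketch below is a minimal in-process version for illustration; a real deployment would enforce this in shared infrastructure (an API gateway or proxy) so every agent process draws from the same bucket.

```python
import threading
import time

class TokenBucket:
    """Shared rate limiter: callers back off when the bucket is empty."""

    def __init__(self, rate_per_second: float, capacity: float):
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        with self._lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, up to capacity.
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False

# One bucket shared by every agent, e.g. 10 LLM calls/second across the org.
GLOBAL_LLM_BUCKET = TokenBucket(rate_per_second=10.0, capacity=100.0)
```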
Cost Savings Projection
We modeled the impact of the three-level cost control architecture against our incident dataset:
| Control Level | Incidents Prevented | Cost Prevented |
|---|---|---|
| Per-action limits only | 38% | 42% |
| + Per-session budgets | 74% | 81% |
| + Global rate limits | 91% | 94% |
The full three-level stack would have prevented 91% of incidents and 94% of costs. The remaining incidents involved legitimate actions within budget that were simply not the right actions, a policy correctness problem rather than a cost control problem.
Recommendations
The cost of implementing these controls is measured in engineering hours. The cost of not implementing them is measured in thousands to tens of thousands of dollars per incident.
Financial data has been anonymized. Methodology available from 15 Research Lab upon request.