15RL Study: The True Cost of Uncontrolled AI Agent Spending
AI agents consume resources every time they make an LLM API call, execute a tool, access a database, or spin up compute. Unlike human-driven workflows, where cost is naturally bounded by the speed of human operators, autonomous agents can consume resources at machine speed, 24 hours a day, without human oversight.
To quantify the financial risk of insufficient cost controls, 15 Research Lab analyzed 156 cost-related incidents reported by organizations deploying AI agents.
Incident Data Overview
Our dataset of 156 cost incidents spans July 2025 through January 2026, collected from partner organizations, public incident reports, and community disclosures.
| Metric | Value |
|---|---|
| Total incidents analyzed | 156 |
| Median cost overrun (per incident) | $2,840 |
| Mean cost overrun (per incident) | $8,420 |
| Maximum single incident | $147,000 |
| Median time to detection | 6.3 hours |
| Incidents exceeding $10,000 | 23 (14.7%) |
The distribution is heavily right-skewed: most incidents cause moderate financial damage, but the tail is severe. The top 10% of incidents accounted for 71% of total costs.
Root Cause Analysis
We classified each incident by its primary root cause:
| Root Cause | Count | % | Median Cost |
|---|---|---|---|
| Infinite retry/loop | 52 | 33.3% | $3,200 |
| Excessive LLM API calls | 38 | 24.4% | $4,100 |
| Unintended cloud resource provisioning | 24 | 15.4% | $12,800 |
| Bulk data processing beyond scope | 21 | 13.5% | $1,900 |
| Third-party API overconsumption | 14 | 9.0% | $2,300 |
| Other | 7 | 4.5% | $1,100 |
Infinite Retry Loops (33.3%)
The most common cause: an agent encounters an error, retries, encounters the same error, and loops indefinitely. Standard retry logic with exponential backoff helps, but agents frequently implement their own retry strategies inside their reasoning loop, bypassing application-level retry controls.
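One mitigation is to enforce retry bounds at the tool-execution layer, where the agent's reasoning loop cannot override them. The sketch below (Python, with a hypothetical `call_api` callable) caps both the attempt count and the total wall-clock time; the specific limits are illustrative assumptions, not recommendations.

```python
import time

MAX_ATTEMPTS = 5          # hard cap on retries, independent of agent reasoning
MAX_ELAPSED_SECONDS = 60  # hard cap on total time spent retrying

def call_with_bounded_retry(call_api, *args, **kwargs):
    """Retry with exponential backoff, but fail permanently once a cap is hit."""
    start = time.monotonic()
    last_exc = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_api(*args, **kwargs)
        except Exception as exc:
            last_exc = exc
            if time.monotonic() - start >= MAX_ELAPSED_SECONDS:
                break
            time.sleep(min(2 ** attempt, 30))  # 1s, 2s, 4s, ... capped at 30s
    # Escalate to a human instead of letting the agent improvise more retries.
    raise RuntimeError(f"giving up after {attempt + 1} attempts") from last_exc
```

Because the cap lives in the tool wrapper rather than in the agent's prompt, an agent that decides to "try again" simply receives a terminal error it must report.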
Example: An agent tasked with data extraction from an API received a malformed response. Rather than reporting the error, it retried 840,000 times over 14 hours, generating $18,000 in API charges.
Excessive LLM API Calls (24.4%)
Agents that decompose tasks into many small steps, or that engage in extended internal deliberation, can make far more LLM calls than expected. A task estimated to require 5-10 LLM calls might actually require 50-100 when the agent encounters ambiguity.
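A blunt but effective guard is a per-task call ceiling enforced by the client wrapper itself. The sketch below assumes a hypothetical client exposing a `complete` method; the ceiling of 50 calls is an illustrative assumption.

```python
class CallBudgetExceeded(Exception):
    """Raised when a task hits its LLM-call ceiling."""

class BudgetedLLMClient:
    """Wraps any LLM client and refuses calls beyond a per-task ceiling."""

    def __init__(self, client, max_calls: int = 50):
        self._client = client
        self._max_calls = max_calls
        self._calls_made = 0

    def complete(self, prompt: str, **kwargs):
        if self._calls_made >= self._max_calls:
            # Surface the overrun instead of silently allowing call 51, 52, ...
            raise CallBudgetExceeded(
                f"task made {self._calls_made} LLM calls; "
                f"ceiling is {self._max_calls}"
            )
        self._calls_made += 1
        return self._client.complete(prompt, **kwargs)
```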
Example: A research agent asked to summarize a topic made 347 LLM calls as it recursively expanded its search, following tangential references. The task was estimated at $0.50 in API costs; actual cost was $47.
Unintended Cloud Resource Provisioning (15.4%)
The most expensive category by median cost. Agents with infrastructure access (Terraform, AWS CLI, cloud APIs) created resources that were not intended or not cleaned up. A single misconfigured provisioning action can spin up expensive GPU instances or large database clusters.
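One mitigation, sketched below under assumed policy values, is a fail-closed pre-execution check that holds provisioning requests to a small pre-approved envelope and routes everything else to a human. The instance types and ceilings here are illustrative only; real envelopes are organization-specific.

```python
# Assumed policy values for illustration only.
ALLOWED_INSTANCE_TYPES = {"t3.medium", "t3.large"}  # deliberately excludes GPU types
MAX_NODE_COUNT = 4

def validate_provisioning_request(instance_type: str, node_count: int) -> None:
    """Reject any request outside the pre-approved envelope before it executes."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise PermissionError(
            f"instance type {instance_type!r} is not pre-approved; "
            "route to a human for sign-off"
        )
    if node_count > MAX_NODE_COUNT:
        raise PermissionError(
            f"node count {node_count} exceeds the ceiling of {MAX_NODE_COUNT}"
        )
```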
Example: An infrastructure agent interpreted "set up the staging environment" as provisioning a 16-node GPU cluster. The cluster ran for a weekend before discovery, costing $47,000.
The Detection Lag Problem
The median time to detection of 6.3 hours is perhaps the most concerning finding. During business hours, someone eventually notices the anomaly. Outside business hours, agents can accumulate costs unchecked for 12-36 hours.
| Detection Method | Median Detection Time | % of Incidents |
|---|---|---|
| Cloud provider billing alert | 8.2 hours | 31% |
| Manual observation | 4.1 hours | 28% |
| Application error/crash | 1.8 hours | 22% |
| Automated monitoring | 0.4 hours | 12% |
| Third-party API rate limit | 2.6 hours | 7% |
Only 12% of incidents were caught by automated monitoring, and those had a dramatically lower median detection time (24 minutes vs. 8.2 hours for billing alerts). The data strongly supports investing in real-time cost monitoring.
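As a rough illustration of what such monitoring can look like, the sketch below tracks spend in a sliding window and alerts when a threshold is crossed. The window and threshold values are assumptions, and in production the alert callback would page an on-call human rather than print.

```python
import time
from collections import deque

class SpendRateMonitor:
    """Fires an alert when spend inside a sliding window crosses a threshold."""

    def __init__(self, window_seconds: float = 300.0,
                 max_spend_per_window: float = 5.0, alert=print):
        self._window = window_seconds
        self._max_spend = max_spend_per_window
        self._alert = alert
        self._events = deque()  # (timestamp, cost) pairs within the window

    def record(self, cost: float) -> None:
        now = time.monotonic()
        self._events.append((now, cost))
        # Evict events that have aged out of the window.
        while self._events and self._events[0][0] < now - self._window:
            self._events.popleft()
        total = sum(c for _, c in self._events)
        if total > self._max_spend:
            self._alert(
                f"ALERT: ${total:.2f} spent in the last {self._window:.0f}s "
                f"(threshold ${self._max_spend:.2f})"
            )
```

Each tool execution would call `record()` with its estimated cost, so an anomaly surfaces within minutes rather than at the next billing cycle.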
Cost Control Architecture
Based on our analysis, effective cost control for AI agents requires controls at three levels:
Level 1: Per-Action Limits
Every individual action should have a cost ceiling. An LLM call should have a maximum token limit. An API request should have a cost cap. A compute provisioning action should have a resource ceiling.
This prevents any single action from being catastrophically expensive.
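A minimal sketch of this idea, using assumed action types and dollar values: every action passes a pre-execution check against a static ceiling table, and unknown action types are denied rather than waved through.

```python
# Illustrative ceilings; the dollar values are assumptions, not recommendations.
ACTION_COST_CEILINGS = {
    "llm_call": 0.25,        # max estimated dollars for a single LLM call
    "api_request": 0.10,
    "provision_compute": 5.00,
}

def enforce_action_ceiling(action_type: str, estimated_cost: float) -> None:
    """Deny any single action whose estimated cost exceeds its ceiling."""
    ceiling = ACTION_COST_CEILINGS.get(action_type)
    if ceiling is None:
        # Fail closed: an action type without a ceiling is denied by default.
        raise PermissionError(f"no ceiling defined for {action_type!r}")
    if estimated_cost > ceiling:
        raise PermissionError(
            f"{action_type} estimated at ${estimated_cost:.2f}, "
            f"over the ${ceiling:.2f} per-action ceiling"
        )
```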
Level 2: Per-Session Budgets
Each agent session (task execution) should have a total budget. When the budget is exhausted, the agent should stop and report, not continue attempting the task.
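A stripped-down sketch of the pattern (the class and values here are illustrative assumptions, not any particular product's implementation): each action is charged against the session total before it executes, and exhaustion raises an exception the agent cannot reason its way past.

```python
class BudgetExhausted(Exception):
    """Raised when a session runs out of funds; the agent stops and reports."""

class SessionBudget:
    """Tracks cumulative spend for one agent session against a fixed total."""

    def __init__(self, total_budget: float):
        self._total = total_budget
        self._spent = 0.0

    def charge(self, cost: float) -> None:
        """Charge an action's estimated cost, or refuse it entirely."""
        if self._spent + cost > self._total:
            raise BudgetExhausted(
                f"session budget ${self._total:.2f} would be exceeded "
                f"(spent ${self._spent:.2f}, next action ${cost:.2f})"
            )
        self._spent += cost
```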
SafeClaw implements per-session budget controls as part of its action-gating policy engine. Actions that would exceed the session budget are denied, and the denial is logged with the budget context.
Level 3: Global Rate Limits
Organization-wide rate limits on LLM API calls, cloud resource provisioning, and third-party API consumption provide a backstop when per-session controls fail or are misconfigured.
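A classic way to implement such a backstop is a token bucket shared across all agents. The sketch below is a minimal in-process version for illustration; a real deployment would enforce this in shared infrastructure (an API gateway or proxy) so every agent process draws from the same bucket.

```python
import threading
import time

class TokenBucket:
    """Shared rate limiter: callers back off when the bucket is empty."""

    def __init__(self, rate_per_second: float, capacity: float):
        self._rate = rate_per_second
        self._capacity = capacity
        self._tokens = capacity
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        with self._lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, up to capacity.
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False

# One bucket shared by every agent, e.g. 10 LLM calls/second across the org.
GLOBAL_LLM_BUCKET = TokenBucket(rate_per_second=10.0, capacity=100.0)
```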
Cost Savings Projection
We modeled the impact of the three-level cost control architecture against our incident dataset:
| Control Level | Incidents Prevented | Cost Prevented |
|---|---|---|
| Per-action limits only | 38% | 42% |
| + Per-session budgets | 74% | 81% |
| + Global rate limits | 91% | 94% |
The full three-level stack would have prevented 91% of incidents and 94% of costs. The remaining incidents involved legitimate actions within budget that were simply not the right actions, a policy correctness problem rather than a cost control problem.
Recommendations
The cost of implementing these controls is measured in engineering hours. The cost of not implementing them is measured in thousands to tens of thousands of dollars per incident.
Financial data has been anonymized. Methodology available from 15 Research Lab upon request.