You know your AI bill is too high. You might even know which model is costing you the most. But do you know which requests are burning through your budget — and which ones are returning value?
Most teams cannot answer that question. They see a monthly invoice from OpenAI or Anthropic, wince, and move on. The actual token-level economics of their application remain invisible.
This is the core problem. You cannot reduce what you cannot measure. And the gap between "we use GPT-4o" and "we understand exactly where every token goes" is where 40–60% of most AI budgets quietly disappear.
This guide walks through the full stack of token waste elimination: instrumenting your requests, identifying the biggest offenders, setting hard budget guardrails, and building the feedback loops that keep costs down permanently.
Why Token Bills Are Opaque by Default
LLM providers give you a dashboard showing total usage. OpenAI shows you daily token counts and costs. Anthropic gives you a usage page. But none of them break costs down by:
- Which feature in your app triggered the request
- How many tokens were system prompt vs. user input vs. completion
- Whether the completion was actually used or discarded
- What the cost-per-successful-outcome is
This is like running a SaaS business where your only metric is "total AWS bill." Technically accurate, practically useless.
The first step to reducing AI costs is building visibility into where tokens actually go.
Step 1: Instrument Every Request
Before optimizing anything, wrap your LLM calls with a lightweight telemetry layer. Here's a minimal implementation:
```typescript
interface TokenMetrics {
  requestId: string;
  feature: string; // Which app feature triggered this
  model: string;
  inputTokens: number;
  outputTokens: number;
  systemPromptTokens: number;
  cacheHit: boolean;
  latencyMs: number;
  cost: number; // Calculated from token counts + pricing
  timestamp: number;
}

// Pricing per million tokens (update as providers change rates)
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':        { input: 2.50, output: 10.00 },
  'gpt-4o-mini':   { input: 0.15, output: 0.60 },
  'claude-sonnet': { input: 3.00, output: 15.00 },
  'claude-haiku':  { input: 0.25, output: 1.25 },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING[model];
  if (!price) return 0;
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Message, estimateSystemTokens, and logMetrics are app-level helpers.
async function trackedCompletion(
  feature: string,
  model: string,
  messages: Message[],
  options: { max_tokens?: number } = {},
): Promise<{ result: string; metrics: TokenMetrics }> {
  const start = Date.now();
  const requestId = crypto.randomUUID();

  const response = await openai.chat.completions.create({ model, messages, ...options });
  const usage = response.usage!;

  const metrics: TokenMetrics = {
    requestId,
    feature,
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    systemPromptTokens: estimateSystemTokens(messages),
    cacheHit: (usage.prompt_tokens_details?.cached_tokens ?? 0) > 0,
    latencyMs: Date.now() - start,
    cost: calculateCost(model, usage.prompt_tokens, usage.completion_tokens),
    timestamp: Date.now(),
  };

  // Fire and forget — don't slow down the request
  logMetrics(metrics).catch(console.error);

  return { result: response.choices[0].message.content!, metrics };
}
```
The key insight: tag every request with the feature that triggered it. "Chat" is too broad. You want "chat:summarize", "chat:translate", "support:classify", "support:respond". Granularity here determines whether you can actually find waste later.
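One cheap way to keep that granularity from eroding is to enforce the naming convention mechanically. A minimal sketch, assuming an "area:action" tag format (the `isValidFeatureTag` helper is hypothetical, not part of the telemetry layer above):

```typescript
// Hypothetical guard for an "area:action" feature tag convention.
// Rejects tags that are too coarse to be useful for later cost analysis.
function isValidFeatureTag(tag: string): boolean {
  return /^[a-z][a-z0-9_-]*:[a-z][a-z0-9_-]*$/.test(tag);
}

console.assert(isValidFeatureTag('chat:summarize')); // namespaced: OK
console.assert(!isValidFeatureTag('chat'));          // too broad: rejected
```

Calling this at the top of the telemetry wrapper turns vague tags into an error at development time rather than a gap in your cost reports later.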
Step 2: Find Your Top Token Consumers
Once you have even 48 hours of instrumented data, run an analysis like this one (shown with a 7-day window; tighten the interval to match however much data you have):
```sql
-- Top 10 most expensive features (last 7 days)
SELECT
  feature,
  COUNT(*) AS request_count,
  SUM(cost) AS total_cost,
  AVG(input_tokens) AS avg_input,
  AVG(output_tokens) AS avg_output,
  AVG(system_prompt_tokens) AS avg_system_prompt,
  ROUND(AVG(system_prompt_tokens)::numeric /
        NULLIF(AVG(input_tokens), 0)::numeric * 100, 1) AS system_prompt_pct
FROM token_metrics
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY feature
ORDER BY total_cost DESC
LIMIT 10;
```
In nearly every production app we've analyzed, the results follow a predictable pattern:
- System prompts account for 30–50% of input tokens. The same 2,000-token system prompt gets sent with every single request, and most of it is irrelevant to the specific query.
- One or two features dominate costs. Usually something with a large context window — RAG retrieval, document analysis, or conversation history that grows unbounded.
- Output tokens are surprisingly expensive. A feature that generates verbose responses (like detailed explanations or full code files) can cost 4–10x more per request than one that returns short classifications.
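To see why output tokens dominate, compare two illustrative requests at GPT-4o list rates from the pricing table above (the 800/20/1,200 token counts here are assumed, not measured):

```typescript
// Rough per-request cost at GPT-4o rates ($2.50/M input, $10.00/M output).
const cost = (inTok: number, outTok: number) =>
  (inTok * 2.50 + outTok * 10.00) / 1_000_000;

const classification = cost(800, 20);  // same prompt, short label back
const explanation = cost(800, 1200);   // same prompt, verbose answer back

// Identical input, but output pricing makes the verbose request
// several times more expensive.
console.log({ classification, explanation, ratio: explanation / classification });
```

With these numbers the verbose request costs about 6x the classification — squarely in the 4–10x range, driven entirely by the completion side.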
Step 3: Attack System Prompt Bloat
System prompts are the single most common source of token waste. They tend to grow over time as developers add instructions, examples, and edge case handling — and they never shrink.
Here's how to audit yours:
```typescript
// Analyze system prompt efficiency over the last day's requests.
// countTokens should be a real tokenizer (e.g. encoding_for_model from the
// tiktoken package) — character-count heuristics undercount badly here.
function auditSystemPrompt(systemPrompt: string, lastDayRequests: TokenMetrics[]) {
  const promptTokens = countTokens(systemPrompt);

  const dailyCost = lastDayRequests.reduce((sum, r) => {
    const price = PRICING[r.model];
    return price ? sum + (promptTokens * price.input) / 1_000_000 : sum;
  }, 0);

  return {
    tokenCount: promptTokens,
    dailyRequests: lastDayRequests.length,
    dailyCostFromSystemPrompt: dailyCost,
    monthlyProjection: dailyCost * 30,
    suggestion: promptTokens > 500
      ? 'Consider splitting into base + feature-specific prompts'
      : 'System prompt size is reasonable',
  };
}
```
The Fix: Modular System Prompts
Instead of one monolithic system prompt, split it into composable segments:
```typescript
const BASE_PROMPT = `You are a customer support agent for Acme Corp.
Respond concisely. Use the customer's name when available.`;

const SEGMENTS: Record<string, string> = {
  refund: `Refund policy: Full refunds within 30 days. Partial refunds within 90 days.
Process via Stripe. Always confirm the order ID before proceeding.`,
  technical: `For technical issues, collect: OS, browser version, error message.
Check known issues list before escalating. Link to docs when applicable.`,
  billing: `Billing inquiries: Verify customer identity first (email + last 4 of card).
Never share full payment details. Offer to resend invoices.`,
};

function buildSystemPrompt(feature: string): string {
  const segment = SEGMENTS[feature] || '';
  return segment ? `${BASE_PROMPT}\n\n${segment}` : BASE_PROMPT;
}
```
This approach typically reduces system prompt tokens by 40–60%. At scale, that's significant — a system prompt that drops from 1,200 tokens to 500 tokens saves $0.00175 per request on GPT-4o. At 100,000 requests per day, that's $5,250 per month from this single change.
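The arithmetic behind that projection, spelled out using the GPT-4o input rate from the pricing table above:

```typescript
// Savings from trimming a system prompt from 1,200 to 500 tokens on GPT-4o.
const INPUT_PRICE_PER_M = 2.50; // GPT-4o input, $ per million tokens

const savedTokens = 1200 - 500;                                      // 700 tokens
const savedPerRequest = (savedTokens * INPUT_PRICE_PER_M) / 1_000_000; // $0.00175
const savedPerMonth = savedPerRequest * 100_000 * 30;                  // $5,250

console.log({ savedPerRequest, savedPerMonth });
```

Sub-cent savings per request look negligible in isolation; the volume multiplier is what makes system prompt bloat worth attacking first.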
Step 4: Set Token Budgets Per Request
Most applications have no upper bound on how many tokens a single request can consume. This is how you get surprise $200 days in the middle of an otherwise normal month.
Implement hard ceilings:
```typescript
interface TokenBudget {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxCostPerRequest: number; // Hard dollar ceiling
}

const BUDGETS: Record<string, TokenBudget> = {
  'chat:casual':      { maxInputTokens: 2000,  maxOutputTokens: 500,  maxCostPerRequest: 0.01 },
  'chat:analysis':    { maxInputTokens: 8000,  maxOutputTokens: 2000, maxCostPerRequest: 0.05 },
  'support:classify': { maxInputTokens: 1000,  maxOutputTokens: 100,  maxCostPerRequest: 0.005 },
  'support:respond':  { maxInputTokens: 4000,  maxOutputTokens: 1000, maxCostPerRequest: 0.03 },
  'rag:query':        { maxInputTokens: 16000, maxOutputTokens: 2000, maxCostPerRequest: 0.10 },
};
```
```typescript
// estimateTokens, truncateHistory, and findCheaperModel are app-level helpers.
async function budgetedCompletion(feature: string, model: string, messages: Message[]) {
  const budget = BUDGETS[feature];
  if (!budget) throw new Error(`No budget defined for feature: ${feature}`);

  const estimatedInput = estimateTokens(messages);
  if (estimatedInput > budget.maxInputTokens) {
    // Truncate conversation history, not the system prompt
    messages = truncateHistory(messages, budget.maxInputTokens);
  }

  const estimatedCost = calculateCost(model, estimatedInput, budget.maxOutputTokens);
  if (estimatedCost > budget.maxCostPerRequest) {
    // Downgrade model instead of blocking
    model = findCheaperModel(model, estimatedInput, budget.maxCostPerRequest);
  }

  return trackedCompletion(feature, model, messages, {
    max_tokens: budget.maxOutputTokens,
  });
}
```
The critical design decision: when a request exceeds its budget, downgrade the model instead of failing. Users don't notice the difference between GPT-4o and GPT-4o-mini for most queries. They absolutely notice an error message.
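A sketch of what that downgrade-not-fail helper could look like. This is a hypothetical implementation of the `findCheaperModel` call used above, with the pricing table restated in abbreviated form so it runs standalone, and a hardcoded downgrade chain as an assumed policy:

```typescript
// Abbreviated pricing ($ per million tokens) and a fixed downgrade chain.
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};
const DOWNGRADE: Record<string, string | undefined> = { 'gpt-4o': 'gpt-4o-mini' };

function findCheaperModel(model: string, inputTokens: number, maxCost: number): string {
  let current = model;
  while (true) {
    const p = PRICING[current];
    // Worst-case estimate: assume output roughly equal to input.
    const estimate = p
      ? (inputTokens * p.input + inputTokens * p.output) / 1_000_000
      : Infinity;
    if (estimate <= maxCost) return current;
    const next = DOWNGRADE[current];
    // Already at the cheapest model: accept it and let max_tokens cap the rest.
    if (!next) return current;
    current = next;
  }
}
```

An 8,000-token request with a $0.05 ceiling downgrades to gpt-4o-mini; a 1,000-token request with the same ceiling stays on gpt-4o.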
Step 5: Trim Conversation History Intelligently
Unbounded conversation history is one of the fastest ways to blow through token budgets. Every message in the history gets re-sent with every new request. A 50-message conversation can easily hit 20,000+ input tokens per turn.
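Estimates like that don't require a provider round trip. The code in this guide assumes an `estimateTokens` helper; here is a crude character-based sketch (the ~4 characters per token heuristic is a rough English-text assumption — use a real tokenizer such as the tiktoken package where accuracy matters):

```typescript
interface Message { role: string; content: string }

// Crude token estimate: ~4 characters per token, plus a small fixed
// per-message overhead for role/formatting tokens.
function estimateTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4) + 4,
    0,
  );
}
```

It overshoots on code and undershoots on dense prose, but it is fast enough to run on every request, which is all the budgeting logic needs.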
The naive approach is to keep the last N messages. The smarter approach is to keep relevant messages:
```typescript
function trimConversationHistory(
  messages: Message[],
  maxTokens: number,
  strategy: 'recent' | 'smart' = 'smart',
): Message[] {
  const systemMessages = messages.filter(m => m.role === 'system');
  const conversationMessages = messages.filter(m => m.role !== 'system');

  if (strategy === 'recent') {
    // Simple: keep system prompt + as many recent messages as fit the budget
    const kept: Message[] = [];
    let tokenCount = estimateTokens(systemMessages);
    for (let i = conversationMessages.length - 1; i >= 0; i--) {
      const msgTokens = estimateTokens([conversationMessages[i]]);
      if (tokenCount + msgTokens > maxTokens) break;
      kept.unshift(conversationMessages[i]);
      tokenCount += msgTokens;
    }
    return [...systemMessages, ...kept];
  }

  // Smart: keep the opening exchange (which sets context) and the most
  // recent messages, and replace the middle with a one-line summary.
  const first = conversationMessages.slice(0, 2);
  const recent = conversationMessages.slice(-6);
  const middle = conversationMessages.slice(2, -6);

  if (middle.length > 4) {
    const summary = {
      role: 'system' as const,
      content: `[Previous conversation summary: ${middle.length} messages exchanged covering: ${extractTopics(middle).join(', ')}]`,
    };
    return [...systemMessages, ...first, summary, ...recent];
  }

  // Short conversations pass through unchanged.
  return [...systemMessages, ...conversationMessages];
}
```
This "smart trim" approach keeps the beginning (which sets context), the end (which has the current question), and replaces the middle with a one-line summary. In testing, this preserves response quality while cutting input tokens by 50–70% on long conversations.
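The summary line above leans on an `extractTopics` helper. A hypothetical placeholder: a naive frequency count of capitalized words, standing in for real entity extraction (in practice a keyword extractor or a cheap LLM call does this better):

```typescript
interface Message { role: string; content: string }

// Naive topic extraction: count capitalized words of 3+ letters and return
// the most frequent ones. A stand-in for proper entity extraction.
function extractTopics(messages: Message[], limit = 5): string[] {
  const counts = new Map<string, number>();
  for (const m of messages) {
    for (const word of m.content.match(/\b[A-Z][a-zA-Z]{2,}\b/g) ?? []) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}
```

Even this crude version gives the model enough of a thread ("Stripe", an order ID, a product name) to keep summarized turns anchored to the right subject.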
Step 6: Build a Cost Dashboard
Metrics without visibility are just logs nobody reads. Build a simple dashboard that surfaces the numbers that matter:
```typescript
// Daily cost summary — run this as a cron job
async function dailyCostReport(): Promise<CostReport> {
  const today = await db.query(`
    SELECT
      SUM(cost) AS total_cost,
      COUNT(*) AS total_requests,
      AVG(cost) AS avg_cost_per_request,
      MAX(cost) AS max_single_request,
      SUM(CASE WHEN cost > 0.10 THEN 1 ELSE 0 END) AS expensive_requests,
      SUM(system_prompt_tokens) AS total_system_tokens,
      SUM(input_tokens) AS total_input_tokens,
      ROUND(SUM(system_prompt_tokens)::numeric /
            NULLIF(SUM(input_tokens), 0)::numeric * 100, 1) AS system_prompt_waste_pct
    FROM token_metrics
    WHERE timestamp > NOW() - INTERVAL '24 hours'
  `);

  const byFeature = await db.query(`
    SELECT feature, SUM(cost) AS cost, COUNT(*) AS requests
    FROM token_metrics
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY feature
    ORDER BY cost DESC
    LIMIT 5
  `);

  return {
    summary: today.rows[0],
    topFeatures: byFeature.rows,
    alerts: generateAlerts(today.rows[0]),
  };
}

function generateAlerts(summary: any): string[] {
  const alerts: string[] = [];
  if (summary.total_cost > 50) alerts.push('Daily cost exceeded $50');
  if (summary.system_prompt_waste_pct > 40) alerts.push('System prompts consuming >40% of input tokens');
  if (summary.expensive_requests > 10) alerts.push(`${summary.expensive_requests} requests exceeded $0.10`);
  if (summary.max_single_request > 1.00) alerts.push(`Most expensive single request: $${summary.max_single_request}`);
  return alerts;
}
```
Set up alerts for anomalies. The most useful ones:
- Daily cost exceeds 2x rolling average — something changed, investigate immediately
- Single request exceeds $1.00 — likely a runaway context window or recursive tool call
- System prompt percentage above 40% — your prompts need trimming
- Cache hit rate below 20% — prompt caching isn't working, which doubles your input costs
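The first of those — spend jumping past 2x the rolling average — reduces to a few lines once daily totals are available. A minimal sketch, assuming `dailyCosts` holds the preceding days' totals in dollars:

```typescript
// Flag today's spend if it exceeds `factor` times the rolling average
// of the preceding days. Returns false with no history rather than alerting.
function isCostAnomaly(dailyCosts: number[], today: number, factor = 2): boolean {
  if (dailyCosts.length === 0) return false;
  const avg = dailyCosts.reduce((a, b) => a + b, 0) / dailyCosts.length;
  return today > avg * factor;
}
```

With a week of ~$10 days, a $40 day trips the alert while a $15 day passes quietly — noisy enough to catch a deploy that doubled context size, quiet enough to ignore normal variance.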
Step 7: Automate Waste Reduction With a Proxy
Everything above works, but it requires changes to your application code. If you're running OpenClaw (or any LLM proxy), you can implement most of these optimizations at the proxy layer instead — touching zero application code.
A compression proxy sits between your app and the LLM provider:
```
Your App → Compression Proxy → LLM Provider
                  ↓
           Token Metrics DB
```
The proxy can:
- Compress prompts before they hit the provider (removing redundant tokens, whitespace, and filler)
- Route to cheaper models based on request complexity
- Enforce token budgets at the infrastructure level
- Log detailed metrics without any application changes
- Cache identical requests to avoid paying twice for the same answer
This is exactly what claw.zip does for OpenClaw users. It sits in front of your LLM calls, compresses prompts by up to 70%, routes simple queries to cheaper models, and gives you a dashboard showing exactly where your tokens go — all without changing a single line of your application code.
For teams not using OpenClaw, the same architecture applies. Build a thin HTTP proxy that intercepts /v1/chat/completions requests, applies the optimizations described in this guide, and forwards the compressed request to the real provider.
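A minimal Node sketch of that proxy shape. Everything here is an assumption for illustration: `UPSTREAM_URL` as configuration, and `optimize` as an empty stub where the compression, routing, and budget logic from earlier steps would plug in:

```typescript
import http from 'node:http';

// Assumed configuration: where the real provider lives.
const UPSTREAM_URL = process.env.UPSTREAM_URL ?? 'https://api.openai.com';

// Only chat completion POSTs get the optimization treatment.
function shouldIntercept(method: string | undefined, url: string | undefined): boolean {
  return method === 'POST' && url === '/v1/chat/completions';
}

// Stub: apply prompt compression, model routing, and budgets to the body here.
function optimize(body: unknown): unknown {
  return body;
}

const server = http.createServer(async (req, res) => {
  if (!shouldIntercept(req.method, req.url)) {
    res.writeHead(404);
    res.end();
    return;
  }

  // Read and optimize the request body.
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = optimize(JSON.parse(Buffer.concat(chunks).toString()));

  // Forward to the real provider, passing the caller's auth header through.
  const upstream = await fetch(`${UPSTREAM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      authorization: req.headers.authorization ?? '',
    },
    body: JSON.stringify(body),
  });

  const text = await upstream.text();
  // Parse the usage block out of `text` and log token metrics here.
  res.writeHead(upstream.status, { 'content-type': 'application/json' });
  res.end(text);
});

// server.listen(8787); // start the proxy on port 8787
```

This version doesn't handle streaming responses, which need the upstream body piped through rather than buffered; everything else in this guide applies unchanged.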
Real-World Results: Before and After
Here's what these optimizations look like in practice, based on a production customer support application handling 50,000 requests per day:
| Optimization | Token Reduction | Monthly Savings |
|---|---|---|
| Modular system prompts | -45% input tokens | $2,100 |
| Conversation history trimming | -55% input tokens | $3,400 |
| Token budget enforcement | -20% output tokens | $1,800 |
| Model routing (simple queries) | -70% cost on 60% of requests | $4,200 |
| Prompt compression (claw.zip) | -50% remaining input tokens | $2,500 |
| Combined | -78% total cost | $14,000 |
The order matters. Model routing provides the largest single reduction because it changes the price per token, not just the token count. Prompt compression stacks on top by reducing the tokens that remain after all other optimizations.
The Checklist: Ship This in a Week
If you want to reduce your AI costs systematically, here's the implementation order that gives you the fastest ROI:
Day 1–2: Instrument
- Wrap all LLM calls with the telemetry layer
- Tag each call with the originating feature
- Log to a queryable store (Postgres, ClickHouse, even SQLite)
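For the queryable store, here is a hypothetical DDL matching the TokenMetrics interface from Step 1 and the queries used throughout this guide (Postgres-flavored; adapt the types for SQLite or ClickHouse):

```typescript
// Schema for the token_metrics table assumed by the SQL in this guide.
const CREATE_TOKEN_METRICS = `
  CREATE TABLE IF NOT EXISTS token_metrics (
    request_id           uuid PRIMARY KEY,
    feature              text NOT NULL,
    model                text NOT NULL,
    input_tokens         integer NOT NULL,
    output_tokens        integer NOT NULL,
    system_prompt_tokens integer NOT NULL,
    cache_hit            boolean NOT NULL,
    latency_ms           integer NOT NULL,
    cost                 numeric(12, 8) NOT NULL,
    timestamp            timestamptz NOT NULL DEFAULT now()
  );

  -- The top-consumers and daily-report queries filter on timestamp and
  -- group by feature, so index both together.
  CREATE INDEX IF NOT EXISTS token_metrics_feature_ts
    ON token_metrics (feature, timestamp);
`;
```

The `numeric(12, 8)` cost column keeps sub-cent request costs exact; floats accumulate enough drift at scale to make the daily totals untrustworthy.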
Day 3: Analyze
- Run the top-consumers SQL query
- Identify your biggest system prompt and measure its token count
- Calculate your system-prompt-to-input ratio
Day 4–5: Optimize
- Split your largest system prompt into modular segments
- Implement conversation history trimming
- Set token budgets for each feature
Day 6–7: Automate
- Set up the daily cost report cron
- Configure alerts for anomalies
- Deploy model routing or a compression proxy
Most teams see a 50–70% cost reduction within the first week. The remaining optimizations (prompt compression, advanced caching, output constraining) push that to 80–90% over the following month.
Stop Guessing, Start Measuring
The AI industry has a measurement problem. Teams spend thousands per month on LLM APIs and have less cost visibility than they would for a $50/month Heroku dyno.
Every optimization in this guide starts with the same prerequisite: know where your tokens go. Once you have that visibility, the fixes are straightforward engineering work — the same kind of performance optimization you'd do for any other expensive API.
The tools exist. claw.zip handles compression, routing, and metering automatically for OpenClaw users. For everyone else, the code examples in this guide are enough to build the same capabilities into your existing stack.
Either way, stop paying frontier model prices for simple queries. Stop sending 2,000-token system prompts with every request. Stop letting conversation histories grow without bounds.
Measure first. Cut second. Your AI budget will thank you.