You know your AI bill is too high. You might even know which model is costing you the most. But do you know which requests are burning through your budget — and which ones are returning value?
Most teams cannot answer that question. They see a monthly invoice from OpenAI or Anthropic, wince, and move on. The actual token-level economics of their application remain invisible.
This is the core problem. You cannot reduce what you cannot measure. And the gap between "we use GPT-4o" and "we understand exactly where every token goes" is where 40–60% of most AI budgets quietly disappear.
This guide walks through the full stack of token waste elimination: instrumenting your requests, identifying the biggest offenders, setting hard budget guardrails, and building the feedback loops that keep costs down permanently.
Why Token Bills Are Opaque by Default
LLM providers give you a dashboard showing total usage. OpenAI shows you daily token counts and costs. Anthropic gives you a usage page. But none of them break costs down by:
- Which feature in your app triggered the request
- How many tokens were system prompt vs. user input vs. completion
- Whether the completion was actually used or discarded
- What the cost-per-successful-outcome is
This is like running a SaaS business where your only metric is "total AWS bill." Technically accurate, practically useless.
The first step to reducing AI costs is building visibility into where tokens actually go.
Step 1: Instrument Every Request
Before optimizing anything, wrap your LLM calls with a lightweight telemetry layer. Here's a minimal implementation:
```typescript
interface TokenMetrics {
  requestId: string;
  feature: string; // Which app feature triggered this
  model: string;
  inputTokens: number;
  outputTokens: number;
  systemPromptTokens: number;
  cacheHit: boolean;
  latencyMs: number;
  cost: number; // Calculated from token counts + pricing
  timestamp: number;
}

// Pricing per million tokens (update as providers change rates)
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':        { input: 2.50, output: 10.00 },
  'gpt-4o-mini':   { input: 0.15, output: 0.60 },
  'claude-sonnet': { input: 3.00, output: 15.00 },
  'claude-haiku':  { input: 0.25, output: 1.25 },
};

function calculateCost(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICING[model];
  if (!price) return 0;
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Message, estimateSystemTokens, and logMetrics are app-level helpers.
async function trackedCompletion(
  feature: string,
  model: string,
  messages: Message[],
  options: { max_tokens?: number } = {},
): Promise<{ result: string; metrics: TokenMetrics }> {
  const start = Date.now();
  const requestId = crypto.randomUUID();

  const response = await openai.chat.completions.create({ model, messages, ...options });
  const usage = response.usage!;

  const metrics: TokenMetrics = {
    requestId,
    feature,
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    systemPromptTokens: estimateSystemTokens(messages),
    cacheHit: (usage.prompt_tokens_details?.cached_tokens ?? 0) > 0,
    latencyMs: Date.now() - start,
    cost: calculateCost(model, usage.prompt_tokens, usage.completion_tokens),
    timestamp: Date.now(),
  };

  // Fire and forget — don't slow down the request
  logMetrics(metrics).catch(console.error);

  return { result: response.choices[0].message.content!, metrics };
}
```
The key insight: tag every request with the feature that triggered it. "Chat" is too broad. You want "chat:summarize", "chat:translate", "support:classify", "support:respond". Granularity here determines whether you can actually find waste later.
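One cheap way to keep that granularity from eroding is to enforce the naming convention mechanically. A minimal sketch, assuming an "area:action" tag format (the `isValidFeatureTag` helper is hypothetical, not part of the telemetry layer above):

```typescript
// Hypothetical guard for an "area:action" feature tag convention.
// Rejects tags that are too coarse to be useful for later cost analysis.
function isValidFeatureTag(tag: string): boolean {
  return /^[a-z][a-z0-9_-]*:[a-z][a-z0-9_-]*$/.test(tag);
}

console.assert(isValidFeatureTag('chat:summarize')); // namespaced: OK
console.assert(!isValidFeatureTag('chat'));          // too broad: rejected
```

Calling this at the top of the telemetry wrapper turns vague tags into an error at development time rather than a gap in your cost reports later.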
Step 2: Find Your Top Token Consumers
Once you have even 48 hours of instrumented data, run an analysis like this one (shown with a 7-day window; tighten the interval to match however much data you have):
```sql
-- Top 10 most expensive features (last 7 days)
SELECT
  feature,
  COUNT(*) AS request_count,
  SUM(cost) AS total_cost,
  AVG(input_tokens) AS avg_input,
  AVG(output_tokens) AS avg_output,
  AVG(system_prompt_tokens) AS avg_system_prompt,
  ROUND(AVG(system_prompt_tokens)::numeric /
        NULLIF(AVG(input_tokens), 0)::numeric * 100, 1) AS system_prompt_pct
FROM token_metrics
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY feature
ORDER BY total_cost DESC
LIMIT 10;
```
In nearly every production app we've analyzed, the results follow a predictable pattern:
- System prompts account for 30–50% of input tokens. The same 2,000-token system prompt gets sent with every single request, and most of it is irrelevant to the specific query.
- One or two features dominate costs. Usually something with a large context window — RAG retrieval, document analysis, or conversation history that grows unbounded.
- Output tokens are surprisingly expensive. A feature that generates verbose responses (like detailed explanations or full code files) can cost 4–10x more per request than one that returns short classifications.
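To see why output tokens dominate, compare two illustrative requests at GPT-4o list rates from the pricing table above (the 800/20/1,200 token counts here are assumed, not measured):

```typescript
// Rough per-request cost at GPT-4o rates ($2.50/M input, $10.00/M output).
const cost = (inTok: number, outTok: number) =>
  (inTok * 2.50 + outTok * 10.00) / 1_000_000;

const classification = cost(800, 20);  // same prompt, short label back
const explanation = cost(800, 1200);   // same prompt, verbose answer back

// Identical input, but output pricing makes the verbose request
// several times more expensive.
console.log({ classification, explanation, ratio: explanation / classification });
```

With these numbers the verbose request costs about 6x the classification — squarely in the 4–10x range, driven entirely by the completion side.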
Step 3: Attack System Prompt Bloat
System prompts are the single most common source of token waste. They tend to grow over time as developers add instructions, examples, and edge case handling — and they never shrink.
Here's how to audit yours:
```typescript
// Analyze system prompt efficiency over the last day's requests.
// countTokens should be a real tokenizer (e.g. encoding_for_model from the
// tiktoken package) — character-count heuristics undercount badly here.
function auditSystemPrompt(systemPrompt: string, lastDayRequests: TokenMetrics[]) {
  const promptTokens = countTokens(systemPrompt);

  const dailyCost = lastDayRequests.reduce((sum, r) => {
    const price = PRICING[r.model];
    return price ? sum + (promptTokens * price.input) / 1_000_000 : sum;
  }, 0);

  return {
    tokenCount: promptTokens,
    dailyRequests: lastDayRequests.length,
    dailyCostFromSystemPrompt: dailyCost,
    monthlyProjection: dailyCost * 30,
    suggestion: promptTokens > 500
      ? 'Consider splitting into base + feature-specific prompts'
      : 'System prompt size is reasonable',
  };
}
```
The Fix: Modular System Prompts
Instead of one monolithic system prompt, split it into composable segments:
```typescript
const BASE_PROMPT = `You are a customer support agent for Acme Corp.
Respond concisely. Use the customer's name when available.`;

const SEGMENTS: Record<string, string> = {
  refund: `Refund policy: Full refunds within 30 days. Partial refunds within 90 days.
Process via Stripe. Always confirm the order ID before proceeding.`,
  technical: `For technical issues, collect: OS, browser version, error message.
Check known issues list before escalating. Link to docs when applicable.`,
  billing: `Billing inquiries: Verify customer identity first (email + last 4 of card).
Never share full payment details. Offer to resend invoices.`,
};

function buildSystemPrompt(feature: string): string {
  const segment = SEGMENTS[feature] || '';
  return segment ? `${BASE_PROMPT}\n\n${segment}` : BASE_PROMPT;
}
```
This approach typically reduces system prompt tokens by 40–60%. At scale, that's significant — a system prompt that drops from 1,200 tokens to 500 tokens saves $0.00175 per request on GPT-4o. At 100,000 requests per day, that's $5,250 per month from this single change.
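The arithmetic behind that projection, spelled out using the GPT-4o input rate from the pricing table above:

```typescript
// Savings from trimming a system prompt from 1,200 to 500 tokens on GPT-4o.
const INPUT_PRICE_PER_M = 2.50; // GPT-4o input, $ per million tokens

const savedTokens = 1200 - 500;                                      // 700 tokens
const savedPerRequest = (savedTokens * INPUT_PRICE_PER_M) / 1_000_000; // $0.00175
const savedPerMonth = savedPerRequest * 100_000 * 30;                  // $5,250

console.log({ savedPerRequest, savedPerMonth });
```

Sub-cent savings per request look negligible in isolation; the volume multiplier is what makes system prompt bloat worth attacking first.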
Step 4: Set Token Budgets Per Request
Most applications have no upper bound on how many tokens a single request can consume. This is how you get surprise $200 days in the middle of an otherwise normal month.
Implement hard ceilings:
```typescript
interface TokenBudget {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxCostPerRequest: number; // Hard dollar ceiling
}

const BUDGETS: Record<string, TokenBudget> = {
  'chat:casual':      { maxInputTokens: 2000,  maxOutputTokens: 500,  maxCostPerRequest: 0.01 },
  'chat:analysis':    { maxInputTokens: 8000,  maxOutputTokens: 2000, maxCostPerRequest: 0.05 },
  'support:classify': { maxInputTokens: 1000,  maxOutputTokens: 100,  maxCostPerRequest: 0.005 },
  'support:respond':  { maxInputTokens: 4000,  maxOutputTokens: 1000, maxCostPerRequest: 0.03 },
  'rag:query':        { maxInputTokens: 16000, maxOutputTokens: 2000, maxCostPerRequest: 0.10 },
};
```
```typescript
// estimateTokens, truncateHistory, and findCheaperModel are app-level helpers.
async function budgetedCompletion(feature: string, model: string, messages: Message[]) {
  const budget = BUDGETS[feature];
  if (!budget) throw new Error(`No budget defined for feature: ${feature}`);

  const estimatedInput = estimateTokens(messages);
  if (estimatedInput > budget.maxInputTokens) {
    // Truncate conversation history, not the system prompt
    messages = truncateHistory(messages, budget.maxInputTokens);
  }

  const estimatedCost = calculateCost(model, estimatedInput, budget.maxOutputTokens);
  if (estimatedCost > budget.maxCostPerRequest) {
    // Downgrade model instead of blocking
    model = findCheaperModel(model, estimatedInput, budget.maxCostPerRequest);
  }

  return trackedCompletion(feature, model, messages, {
    max_tokens: budget.maxOutputTokens,
  });
}
```
The critical design decision: when a request exceeds its budget, downgrade the model instead of failing. Users don't notice the difference between GPT-4o and GPT-4o-mini for most queries. They absolutely notice an error message.
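A sketch of what that downgrade-not-fail helper could look like. This is a hypothetical implementation of the `findCheaperModel` call used above, with the pricing table restated in abbreviated form so it runs standalone, and a hardcoded downgrade chain as an assumed policy:

```typescript
// Abbreviated pricing ($ per million tokens) and a fixed downgrade chain.
const PRICING: Record<string, { input: number; output: number }> = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
};
const DOWNGRADE: Record<string, string | undefined> = { 'gpt-4o': 'gpt-4o-mini' };

function findCheaperModel(model: string, inputTokens: number, maxCost: number): string {
  let current = model;
  while (true) {
    const p = PRICING[current];
    // Worst-case estimate: assume output roughly equal to input.
    const estimate = p
      ? (inputTokens * p.input + inputTokens * p.output) / 1_000_000
      : Infinity;
    if (estimate <= maxCost) return current;
    const next = DOWNGRADE[current];
    // Already at the cheapest model: accept it and let max_tokens cap the rest.
    if (!next) return current;
    current = next;
  }
}
```

An 8,000-token request with a $0.05 ceiling downgrades to gpt-4o-mini; a 1,000-token request with the same ceiling stays on gpt-4o.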
Step 5: Trim Conversation History Intelligently
Unbounded conversation history is one of the fastest ways to blow through token budgets. Every message in the history gets re-sent with every new request. A 50-message conversation can easily hit 20,000+ input tokens per turn.
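Estimates like that don't require a provider round trip. The code in this guide assumes an `estimateTokens` helper; here is a crude character-based sketch (the ~4 characters per token heuristic is a rough English-text assumption — use a real tokenizer such as the tiktoken package where accuracy matters):

```typescript
interface Message { role: string; content: string }

// Crude token estimate: ~4 characters per token, plus a small fixed
// per-message overhead for role/formatting tokens.
function estimateTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4) + 4,
    0,
  );
}
```

It overshoots on code and undershoots on dense prose, but it is fast enough to run on every request, which is all the budgeting logic needs.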
The naive approach is to keep the last N messages. The smarter approach is to keep relevant messages:
```typescript
function trimConversationHistory(
  messages: Message[],
  maxTokens: number,
  strategy: 'recent' | 'smart' = 'smart',
): Message[] {
  const systemMessages = messages.filter(m => m.role === 'system');
  const conversationMessages = messages.filter(m => m.role !== 'system');

  if (strategy === 'recent') {
    // Simple: keep system prompt + as many recent messages as fit the budget
    const kept: Message[] = [];
    let tokenCount = estimateTokens(systemMessages);
    for (let i = conversationMessages.length - 1; i >= 0; i--) {
      const msgTokens = estimateTokens([conversationMessages[i]]);
      if (tokenCount + msgTokens > maxTokens) break;
      kept.unshift(conversationMessages[i]);
      tokenCount += msgTokens;
    }
    return [...systemMessages, ...kept];
  }

  // Smart: keep the opening exchange (which sets context) and the most
  // recent messages, and replace the middle with a one-line summary.
  const first = conversationMessages.slice(0, 2);
  const recent = conversationMessages.slice(-6);
  const middle = conversationMessages.slice(2, -6);

  if (middle.length > 4) {
    const summary = {
      role: 'system' as const,
      content: `[Previous conversation summary: ${middle.length} messages exchanged covering: ${extractTopics(middle).join(', ')}]`,
    };
    return [...systemMessages, ...first, summary, ...recent];
  }

  // Short conversations pass through unchanged.
  return [...systemMessages, ...conversationMessages];
}
```
This "smart trim" approach keeps the beginning (which sets context), the end (which has the current question), and replaces the middle with a one-line summary. In testing, this preserves response quality while cutting input tokens by 50–70% on long conversations.
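The summary line above leans on an `extractTopics` helper. A hypothetical placeholder: a naive frequency count of capitalized words, standing in for real entity extraction (in practice a keyword extractor or a cheap LLM call does this better):

```typescript
interface Message { role: string; content: string }

// Naive topic extraction: count capitalized words of 3+ letters and return
// the most frequent ones. A stand-in for proper entity extraction.
function extractTopics(messages: Message[], limit = 5): string[] {
  const counts = new Map<string, number>();
  for (const m of messages) {
    for (const word of m.content.match(/\b[A-Z][a-zA-Z]{2,}\b/g) ?? []) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}
```

Even this crude version gives the model enough of a thread ("Stripe", an order ID, a product name) to keep summarized turns anchored to the right subject.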
Step 6: Build a Cost Dashboard
Metrics without visibility are just logs nobody reads. Build a simple dashboard that surfaces the numbers that matter:
```typescript
// Daily cost summary — run this as a cron job
async function dailyCostReport(): Promise<CostReport> {
  const today = await db.query(`
    SELECT
      SUM(cost) AS total_cost,
      COUNT(*) AS total_requests,
      AVG(cost) AS avg_cost_per_request,
      MAX(cost) AS max_single_request,
      SUM(CASE WHEN cost > 0.10 THEN 1 ELSE 0 END) AS expensive_requests,
      SUM(system_prompt_tokens) AS total_system_tokens,
      SUM(input_tokens) AS total_input_tokens,
      ROUND(SUM(system_prompt_tokens)::numeric /
            NULLIF(SUM(input_tokens), 0)::numeric * 100, 1) AS system_prompt_waste_pct
    FROM token_metrics
    WHERE timestamp > NOW() - INTERVAL '24 hours'
  `);

  const byFeature = await db.query(`
    SELECT feature, SUM(cost) AS cost, COUNT(*) AS requests
    FROM token_metrics
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY feature
    ORDER BY cost DESC
    LIMIT 5
  `);

  return {
    summary: today.rows[0],
    topFeatures: byFeature.rows,
    alerts: generateAlerts(today.rows[0]),
  };
}

function generateAlerts(summary: any): string[] {
  const alerts: string[] = [];
  if (summary.total_cost > 50) alerts.push('Daily cost exceeded $50');
  if (summary.system_prompt_waste_pct > 40) alerts.push('System prompts consuming >40% of input tokens');
  if (summary.expensive_requests > 10) alerts.push(`${summary.expensive_requests} requests exceeded $0.10`);
  if (summary.max_single_request > 1.00) alerts.push(`Most expensive single request: $${summary.max_single_request}`);
  return alerts;
}
```
Set up alerts for anomalies. The most useful ones:
- Daily cost exceeds 2x rolling average — something changed, investigate immediately
- Single request exceeds $1.00 — likely a runaway context window or recursive tool call
- System prompt percentage above 40% — your prompts need trimming
- Cache hit rate below 20% — prompt caching isn't working, which doubles your input costs
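The first of those — spend jumping past 2x the rolling average — reduces to a few lines once daily totals are available. A minimal sketch, assuming `dailyCosts` holds the preceding days' totals in dollars:

```typescript
// Flag today's spend if it exceeds `factor` times the rolling average
// of the preceding days. Returns false with no history rather than alerting.
function isCostAnomaly(dailyCosts: number[], today: number, factor = 2): boolean {
  if (dailyCosts.length === 0) return false;
  const avg = dailyCosts.reduce((a, b) => a + b, 0) / dailyCosts.length;
  return today > avg * factor;
}
```

With a week of ~$10 days, a $40 day trips the alert while a $15 day passes quietly — noisy enough to catch a deploy that doubled context size, quiet enough to ignore normal variance.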
Step 7: Automate Waste Reduction With a Proxy
Everything above works, but it requires changes to your application code. If you're running OpenClaw (or any LLM proxy), you can implement most of these optimizations at the proxy layer instead — touching zero application code.
A compression proxy sits between your app and the LLM provider:
```
Your App → Compression Proxy → LLM Provider
                  ↓
           Token Metrics DB
```
The proxy can:
- Compress prompts before they hit the provider (removing redundant tokens, whitespace, and filler)
- Route to cheaper models based on request complexity
- Enforce token budgets at the infrastructure level
- Log detailed metrics without any application changes
- Cache identical requests to avoid paying twice for the same answer
This is exactly what claw.zip does for OpenClaw users. It sits in front of your LLM calls, compresses prompts by up to 70%, routes simple queries to cheaper models, and gives you a dashboard showing exactly where your tokens go — all without changing a single line of your application code.
For teams not using OpenClaw, the same architecture applies. Build a thin HTTP proxy that intercepts /v1/chat/completions requests, applies the optimizations described in this guide, and forwards the compressed request to the real provider.
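A minimal Node sketch of that proxy shape. Everything here is an assumption for illustration: `UPSTREAM_URL` as configuration, and `optimize` as an empty stub where the compression, routing, and budget logic from earlier steps would plug in:

```typescript
import http from 'node:http';

// Assumed configuration: where the real provider lives.
const UPSTREAM_URL = process.env.UPSTREAM_URL ?? 'https://api.openai.com';

// Only chat completion POSTs get the optimization treatment.
function shouldIntercept(method: string | undefined, url: string | undefined): boolean {
  return method === 'POST' && url === '/v1/chat/completions';
}

// Stub: apply prompt compression, model routing, and budgets to the body here.
function optimize(body: unknown): unknown {
  return body;
}

const server = http.createServer(async (req, res) => {
  if (!shouldIntercept(req.method, req.url)) {
    res.writeHead(404);
    res.end();
    return;
  }

  // Read and optimize the request body.
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = optimize(JSON.parse(Buffer.concat(chunks).toString()));

  // Forward to the real provider, passing the caller's auth header through.
  const upstream = await fetch(`${UPSTREAM_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      authorization: req.headers.authorization ?? '',
    },
    body: JSON.stringify(body),
  });

  const text = await upstream.text();
  // Parse the usage block out of `text` and log token metrics here.
  res.writeHead(upstream.status, { 'content-type': 'application/json' });
  res.end(text);
});

// server.listen(8787); // start the proxy on port 8787
```

This version doesn't handle streaming responses, which need the upstream body piped through rather than buffered; everything else in this guide applies unchanged.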
Real-World Results: Before and After
Here's what these optimizations look like in practice, based on a production customer support application handling 50,000 requests per day:
| Optimization | Token Reduction | Monthly Savings |
|---|---|---|
| Modular system prompts | -45% input tokens | $2,100 |
| Conversation history trimming | -55% input tokens | $3,400 |
| Token budget enforcement | -20% output tokens | $1,800 |
| Model routing (simple queries) | -70% cost on 60% of requests | $4,200 |
| Prompt compression (claw.zip) | -50% remaining input tokens | $2,500 |
| Combined | -78% total cost | $14,000 |
The order matters. Model routing provides the largest single reduction because it changes the price per token, not just the token count. Prompt compression stacks on top by reducing the tokens that remain after all other optimizations.
The Checklist: Ship This in a Week
If you want to reduce your AI costs systematically, here's the implementation order that gives you the fastest ROI:
Day 1–2: Instrument
- Wrap all LLM calls with the telemetry layer
- Tag each call with the originating feature
- Log to a queryable store (Postgres, ClickHouse, even SQLite)
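For the queryable store, here is a hypothetical DDL matching the TokenMetrics interface from Step 1 and the queries used throughout this guide (Postgres-flavored; adapt the types for SQLite or ClickHouse):

```typescript
// Schema for the token_metrics table assumed by the SQL in this guide.
const CREATE_TOKEN_METRICS = `
  CREATE TABLE IF NOT EXISTS token_metrics (
    request_id           uuid PRIMARY KEY,
    feature              text NOT NULL,
    model                text NOT NULL,
    input_tokens         integer NOT NULL,
    output_tokens        integer NOT NULL,
    system_prompt_tokens integer NOT NULL,
    cache_hit            boolean NOT NULL,
    latency_ms           integer NOT NULL,
    cost                 numeric(12, 8) NOT NULL,
    timestamp            timestamptz NOT NULL DEFAULT now()
  );

  -- The top-consumers and daily-report queries filter on timestamp and
  -- group by feature, so index both together.
  CREATE INDEX IF NOT EXISTS token_metrics_feature_ts
    ON token_metrics (feature, timestamp);
`;
```

The `numeric(12, 8)` cost column keeps sub-cent request costs exact; floats accumulate enough drift at scale to make the daily totals untrustworthy.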
Day 3: Analyze
- Run the top-consumers SQL query
- Identify your biggest system prompt and measure its token count
- Calculate your system-prompt-to-input ratio
Day 4–5: Optimize
- Split your largest system prompt into modular segments
- Implement conversation history trimming
- Set token budgets for each feature
Day 6–7: Automate
- Set up the daily cost report cron
- Configure alerts for anomalies
- Deploy model routing or a compression proxy
Most teams see a 50–70% cost reduction within the first week. The remaining optimizations (prompt compression, advanced caching, output constraining) push that to 80–90% over the following month.
Stop Guessing, Start Measuring
The AI industry has a measurement problem. Teams spend thousands per month on LLM APIs and have less cost visibility than they would for a $50/month Heroku dyno.
Every optimization in this guide starts with the same prerequisite: know where your tokens go. Once you have that visibility, the fixes are straightforward engineering work — the same kind of performance optimization you'd do for any other expensive API.
The tools exist. claw.zip handles compression, routing, and metering automatically for OpenClaw users. For everyone else, the code examples in this guide are enough to build the same capabilities into your existing stack.
Either way, stop paying frontier model prices for simple queries. Stop sending 2,000-token system prompts with every request. Stop letting conversation histories grow without bounds.
Measure first. Cut second. Your AI budget will thank you.