You're paying GPT-4 prices for work that GPT-4o Mini could handle. Every developer running LLMs in production does this, and most don't realize how much it costs them.
The fix isn't prompt engineering. It isn't caching. It's model routing — automatically scoring each query by complexity and sending it to the cheapest model that can produce an acceptable result.
This post explains how model routing works, why it produces the largest single cost reduction for most AI applications, and how to implement it — from a basic DIY classifier to a production-grade system.
The Problem: One Model Fits None
Most AI applications are configured with a single model. Every query, regardless of complexity, hits the same endpoint:
```python
# The expensive default
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}]
)
```
This is like hiring a senior engineer to answer every support ticket, including "how do I reset my password."
Here's what real production traffic looks like when you score queries by complexity:
| Complexity Tier | % of Traffic | Typical Queries | Best Model |
|---|---|---|---|
| Simple | 45-55% | Greetings, lookups, formatting, simple Q&A | GPT-4o Mini / Haiku |
| Medium | 25-35% | Summarization, moderate analysis, code explanation | Sonnet / GPT-4o |
| Complex | 10-20% | Multi-step reasoning, code generation, research | Opus / GPT-4 |
| Frontier | 2-5% | Novel reasoning, complex math, creative writing | o1 / Opus |
Most production workloads send 45-55% of their traffic to a frontier model when a model costing 10-20x less would produce identical output. That's not a rounding error. That's the majority of your bill.
The Math
Say you process 100,000 queries per day averaging 2,000 tokens each (input + output combined; for simplicity, everything here is priced at input rates). That's 200 million tokens a day:
- Single model (Opus for everything, at $15 per million tokens): ~$3,000/day → ~$90,000/month
- With routing (50% to Haiku at $0.80/M, 30% Sonnet at $3/M, 20% Opus at $15/M): ~$860/day → ~$25,800/month
That's a roughly 70% reduction just from routing, before any prompt compression. Combine routing with lossless prompt compression and you're looking at 80-93% total savings.
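The same arithmetic is easy to sanity-check for your own traffic. A minimal calculator (model names, prices, and the traffic mix below are illustrative; substitute your provider's current rates):

```python
# Estimate daily LLM spend for a given routing mix.
# Prices are illustrative USD per 1M tokens -- substitute current rates.
PRICES = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}

def daily_cost(queries_per_day: int, avg_tokens: int, mix: dict) -> float:
    """mix maps model name -> fraction of traffic (fractions sum to 1)."""
    total_tokens_m = queries_per_day * avg_tokens / 1_000_000
    return sum(total_tokens_m * share * PRICES[model]
               for model, share in mix.items())

single = daily_cost(100_000, 2_000, {"opus": 1.0})
routed = daily_cost(100_000, 2_000, {"haiku": 0.5, "sonnet": 0.3, "opus": 0.2})
print(f"single: ${single:,.0f}/day, routed: ${routed:,.0f}/day")
```

Plugging in a different mix or token volume immediately shows whether routing pays off for your workload.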
How Model Routing Works
A model router sits between your application and your LLM providers. For each incoming query, it:
- Scores the query complexity (simple, medium, complex, frontier)
- Selects the cheapest model capable of handling that complexity tier
- Forwards the request to the selected model
- Monitors output quality and adjusts routing decisions over time
The critical insight: you don't need to classify queries perfectly. You need to classify the easy ones correctly. If you can reliably identify the 50% of queries that are simple, you've already cut your bill in half.
Approach 1: Rule-Based Routing
The simplest router uses heuristics — token count, keyword matching, and structural analysis:
```python
import re
from enum import Enum

class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

MODEL_MAP = {
    Tier.SIMPLE: "claude-3-5-haiku-20241022",
    Tier.MEDIUM: "claude-sonnet-4-20250514",
    Tier.COMPLEX: "claude-opus-4-20250514",
}

# Cost per 1M input tokens (USD)
COST_MAP = {
    Tier.SIMPLE: 0.80,
    Tier.MEDIUM: 3.00,
    Tier.COMPLEX: 15.00,
}

COMPLEX_SIGNALS = [
    r"(?:analyze|compare|evaluate).*(?:and|versus|between)",
    r"(?:write|generate|create).*(?:code|function|class|script)",
    r"(?:step.by.step|chain.of.thought|reason through)",
    r"(?:multiple|several|all).*(?:options|approaches|solutions)",
]

SIMPLE_SIGNALS = [
    r"^(?:hi|hello|hey|thanks|ok|yes|no)\b",
    r"(?:what (?:is|are) (?:the )?(?:time|date|weather))",
    r"(?:translate|convert|format)\s+\w+",
    r"(?:how (?:do|can) I (?:reset|change|update))",
]

def classify_query(query: str) -> Tier:
    tokens = len(query.split())
    # Short queries are almost always simple
    if tokens < 15:
        for pattern in SIMPLE_SIGNALS:
            if re.search(pattern, query, re.IGNORECASE):
                return Tier.SIMPLE
    # Check for complexity signals
    complex_score = sum(
        1 for p in COMPLEX_SIGNALS
        if re.search(p, query, re.IGNORECASE)
    )
    if complex_score >= 2 or tokens > 500:
        return Tier.COMPLEX
    elif complex_score == 1 or tokens > 100:
        return Tier.MEDIUM
    return Tier.SIMPLE

def route(query: str) -> dict:
    tier = classify_query(query)
    return {
        "model": MODEL_MAP[tier],
        "tier": tier.value,
        "estimated_cost_per_1m": COST_MAP[tier],
    }
```
Pros: Zero latency overhead, no additional API calls, fully deterministic.
Cons: Misclassifies nuanced queries, requires manual rule maintenance.
In practice, rule-based routing correctly classifies 70-80% of queries. That's good enough to save serious money, but you leave savings on the table for the ambiguous middle tier.
Approach 2: Embedding-Based Classification
A more accurate approach uses embeddings to classify queries against labeled examples:
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Pre-computed centroids from labeled training data
# Each centroid is the mean embedding of 100+ labeled queries
CENTROIDS = {
    "simple": np.array([...]),   # Pre-computed
    "medium": np.array([...]),
    "complex": np.array([...]),
}

def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # $0.02 per 1M tokens
        input=text,
    )
    return np.array(response.data[0].embedding)

def classify_by_embedding(query: str) -> str:
    query_emb = get_embedding(query)
    # Cosine similarity against each tier centroid
    similarities = {
        tier: np.dot(query_emb, centroid) / (
            np.linalg.norm(query_emb) * np.linalg.norm(centroid)
        )
        for tier, centroid in CENTROIDS.items()
    }
    return max(similarities, key=similarities.get)
```
Pros: Handles nuanced queries much better, 85-90% accuracy.
Cons: Adds latency (50-100ms per classification), requires labeled training data, costs money (though text-embedding-3-small is dirt cheap at $0.02/1M tokens).
The embedding cost for classification is negligible. Classifying 100,000 queries at an average of 50 tokens each costs $0.10 total. If it saves even 5% more queries from being routed to expensive models, it pays for itself thousands of times over.
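The nearest-centroid step itself is just cosine similarity and needs no libraries at all. A dependency-free sketch, using made-up 3-dimensional vectors in place of real 1,536-dimensional embeddings:

```python
import math

# Toy 3-d vectors standing in for real embedding centroids (hypothetical data)
CENTROIDS = {
    "simple":  [0.9, 0.1, 0.0],
    "medium":  [0.3, 0.8, 0.2],
    "complex": [0.1, 0.2, 0.9],
}

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_tier(embedding: list) -> str:
    """Assign the tier whose centroid is most similar to the query embedding."""
    return max(CENTROIDS, key=lambda t: cosine(embedding, CENTROIDS[t]))
```

The real system only swaps these toy vectors for actual embedding output; the classification logic is identical.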
Approach 3: Small-Model Classifier
Train a dedicated small model (or use a cheap LLM) as the classifier:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

CLASSIFIER_PROMPT = """Score this query's complexity from 1-10.
1-3: Simple lookup, greeting, formatting, basic Q&A
4-6: Summarization, moderate analysis, standard code help
7-9: Multi-step reasoning, complex code gen, research
10: Novel problems requiring frontier reasoning
Query: {query}
Score (number only):"""

async def classify_with_llm(query: str) -> Tier:
    # Tier is the enum defined in the rule-based example above
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap classifier
        messages=[{
            "role": "user",
            "content": CLASSIFIER_PROMPT.format(query=query)
        }],
        max_tokens=3,
        temperature=0,
    )
    score = int(response.choices[0].message.content.strip())
    if score <= 3:
        return Tier.SIMPLE
    elif score <= 6:
        return Tier.MEDIUM
    else:
        return Tier.COMPLEX
```
Pros: Highest accuracy (90-95%), handles edge cases well, adapts to new query patterns.
Cons: Highest latency overhead (200-500ms), costs more than embedding approach.
The classification cost is still tiny compared to what it saves. A GPT-4o Mini classification call costs roughly $0.0001. If it prevents one query from unnecessarily hitting Opus (saving $0.03), it needs to be right only 1 in 300 times to break even. In practice, it's right 90%+ of the time.
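That break-even claim is worth writing down explicitly (the two costs below are the rough figures from the paragraph above, not exact provider prices):

```python
# Rough figures: how often must the cheap classifier catch an unnecessary
# top-tier call for its own cost to pay for itself?
classifier_cost = 0.0001   # ~cost of one GPT-4o Mini classification call
saving_per_catch = 0.03    # ~saving when one query avoids the top tier

# Fraction of classification calls that must be correct downgrades
break_even_rate = classifier_cost / saving_per_catch
print(f"break-even: 1 correct downgrade per "
      f"{saving_per_catch / classifier_cost:.0f} calls")
```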
Production Architecture
A production-grade router combines all three approaches in a cascade:
```
┌────────────────┐
│ Incoming Query │
└──────┬─────────┘
       │
       ▼
┌──────────────┐   HIGH CONFIDENCE
│  Rule-Based  │──────────────────────► Route immediately
│  Classifier  │   (70% of traffic)
└──────┬───────┘
       │ LOW CONFIDENCE
       ▼
┌──────────────┐   HIGH CONFIDENCE
│  Embedding   │──────────────────────► Route
│  Classifier  │   (20% of traffic)
└──────┬───────┘
       │ AMBIGUOUS
       ▼
┌──────────────┐
│  LLM-based   │──────────────────────► Route
│  Classifier  │   (10% of traffic)
└──────────────┘
```
This cascade handles 70% of queries with zero latency, 20% with minimal latency, and only burns an LLM call on the truly ambiguous 10%. Total classification cost stays under $5/month for most workloads.
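The control flow of that cascade is simple. A minimal sketch with stubbed-out classifiers standing in for the real ones (the stubs and the 0.8 confidence threshold are illustrative, not tuned values):

```python
from typing import Callable, Optional, Tuple

# Each stage returns (tier, confidence), or None if it abstains.
Stage = Callable[[str], Optional[Tuple[str, float]]]

def rule_stage(query: str):
    # Stub: only very short queries get a high-confidence rule match
    if len(query.split()) < 5:
        return ("simple", 0.95)
    return None

def embedding_stage(query: str):
    # Stub: pretend mid-length queries are confidently medium
    if len(query.split()) < 50:
        return ("medium", 0.85)
    return None

def llm_stage(query: str):
    # Final fallback always produces an answer
    return ("complex", 0.99)

def cascade(query: str, threshold: float = 0.8) -> str:
    """Try cheap classifiers first; fall through only when unconfident."""
    for stage in (rule_stage, embedding_stage, llm_stage):
        result = stage(query)
        if result is not None and result[1] >= threshold:
            return result[0]
    return "medium"  # conservative default if nothing is confident
```

In production, each stub is replaced by the corresponding classifier from the three approaches above; the cascade structure itself stays this small.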
The Feedback Loop
Static routing leaves money on the table. A closed-loop system monitors actual output quality and adjusts:
```python
import json
from datetime import datetime, timezone

class RoutingFeedback:
    def __init__(self, log_path: str = "routing_log.jsonl"):
        self.log_path = log_path

    def log_result(self, query: str, tier: str, model: str,
                   quality_score: float, tokens_used: int):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query_hash": hash(query) % (10**10),
            "tier": tier,
            "model": model,
            "quality": quality_score,
            "tokens": tokens_used,
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def _get_recent(self, tier: str, window: int) -> list:
        """Return the last `window` logged entries for a tier."""
        try:
            with open(self.log_path) as f:
                entries = [json.loads(line) for line in f]
        except FileNotFoundError:
            return []
        return [e for e in entries if e["tier"] == tier][-window:]

    def should_upgrade(self, tier: str, window: int = 100) -> bool:
        """Check if a tier's quality has dropped below threshold."""
        recent = self._get_recent(tier, window)
        if not recent:
            return False
        avg_quality = sum(r["quality"] for r in recent) / len(recent)
        # If average quality drops below 0.7, upgrade the model
        return avg_quality < 0.7

    def should_downgrade(self, tier: str, window: int = 100) -> bool:
        """Check if a tier could use a cheaper model."""
        recent = self._get_recent(tier, window)
        if not recent:
            return False
        avg_quality = sum(r["quality"] for r in recent) / len(recent)
        # If quality is consistently above 0.95, try cheaper
        return avg_quality > 0.95
```
This is the "closed-loop" in claw.zip's design. Rather than routing based on static rules, the system continuously evaluates whether each model tier is over-serving or under-serving its queries, and adjusts the boundaries accordingly.
Common Mistakes
1. Routing Too Aggressively
The temptation is to route everything to the cheapest model. Don't. A query that produces a bad response costs more than the tokens you saved — the user retries, your application looks unreliable, and you may lose trust.
Rule of thumb: When uncertain, route UP one tier. The cost difference between Haiku and Sonnet on a single query is fractions of a cent. The cost of a bad response is measured in user experience.
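That rule of thumb fits in a few lines. A sketch, assuming the three-tier ordering used throughout this post:

```python
TIERS = ["simple", "medium", "complex"]  # cheapest to most capable

def resolve(tier: str, confidence: float, threshold: float = 0.8) -> str:
    """When the classifier is unsure, route UP one tier rather than down."""
    if confidence >= threshold:
        return tier
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]
```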
2. Ignoring Output Token Costs
Most routing focuses on input complexity, but output tokens are often more expensive. A "simple" query that asks for a long-form response can cost more than a "complex" query that expects a short answer.
```python
def estimate_output_length(query: str) -> str:
    """Rough heuristic for expected output length."""
    long_signals = [
        "explain in detail", "write a", "comprehensive",
        "full analysis", "step by step", "all possible",
    ]
    if any(s in query.lower() for s in long_signals):
        return "long"  # Route to a model with better $/output-token
    return "standard"
```
3. Not Measuring Actual Savings
You can't optimize what you don't measure. Log every routed request with the model used, tokens consumed, and estimated cost. Compare weekly.
```bash
# Quick savings estimate from routing logs
cat routing_log.jsonl | \
  jq -s '
    group_by(.tier) |
    map({
      tier: .[0].tier,
      count: length,
      avg_tokens: (map(.tokens) | add / length),
      model: .[0].model
    })
  '
```
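If jq isn't on hand, the same per-tier rollup is a few lines of Python (the sample log entries below are made up for illustration):

```python
import json
from collections import defaultdict

# Hypothetical sample of routing-log lines (JSONL, one entry per line)
log_lines = [
    '{"tier": "simple", "model": "haiku", "tokens": 300, "quality": 0.9}',
    '{"tier": "simple", "model": "haiku", "tokens": 500, "quality": 0.95}',
    '{"tier": "complex", "model": "opus", "tokens": 4000, "quality": 0.8}',
]

summary = defaultdict(lambda: {"count": 0, "tokens": 0})
for line in log_lines:
    entry = json.loads(line)
    stats = summary[entry["tier"]]
    stats["count"] += 1
    stats["tokens"] += entry["tokens"]

for tier, stats in sorted(summary.items()):
    avg = stats["tokens"] / stats["count"]
    print(f'{tier}: {stats["count"]} queries, avg {avg:.0f} tokens')
```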
How claw.zip Handles Model Routing
claw.zip implements a production-grade version of this architecture with two key additions:
Combined with prompt compression. Every query is compacted before routing. This means the router scores a tighter, more precise version of the query, which improves classification accuracy while also reducing the token cost of whichever model ultimately handles it.
Zero-config setup. Instead of building and maintaining your own classifier, training data, and feedback loops, claw.zip handles routing out of the box. Install it, point it at your API keys, and it starts routing immediately.
The compound effect of compression + routing is why most users see 80-93% savings rather than the 60-70% from routing alone.
```bash
# Install claw.zip — routing + compression in one step
npx claw-zip
```
No code changes. No model switching logic. No classifier training. It sits between your application and your LLM provider, and every query gets compressed and routed automatically.
When NOT to Use Model Routing
Model routing isn't always the right answer:
- Safety-critical applications: If every response must come from the most capable model for liability reasons, don't route down.
- Very low volume: If you make 100 API calls a day, the engineering effort of routing exceeds the savings. Just use a mid-tier model.
- Homogeneous queries: If every query genuinely requires frontier reasoning (advanced math, novel code generation), routing won't find cheaper alternatives.
For everyone else — which is most production AI applications — model routing is the single highest-impact cost optimization you can implement.
Getting Started
If you want to build your own router:
- Start with rules. Implement the rule-based classifier from this post. It takes an hour and immediately saves 30-40%.
- Log everything. Capture the model used, tier assigned, and output quality for every request.
- Add embeddings after 1,000 queries. Use your logged data to compute tier centroids, then add the embedding classifier for ambiguous cases.
- Close the loop. Build the feedback mechanism to auto-adjust tier boundaries based on quality scores.
If you want the result without the work:
```bash
npx claw-zip
```
Routing + compression. One command. See what you'd save →