You're paying GPT-4 prices for work that GPT-4o Mini could handle. Every developer running LLMs in production does this, and most don't realize how much it costs them.
The fix isn't prompt engineering. It isn't caching. It's model routing — automatically scoring each query by complexity and sending it to the cheapest model that can produce an acceptable result.
This post explains how model routing works, why it produces the largest single cost reduction for most AI applications, and how to implement it — from a basic DIY classifier to a production-grade system.
The Problem: One Model Fits None
Most AI applications are configured with a single model. Every query, regardless of complexity, hits the same endpoint:
```python
# The expensive default
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}]
)
```
This is like hiring a senior engineer to answer every support ticket, including "how do I reset my password."
Here's what real production traffic looks like when you score queries by complexity:
| Complexity Tier | % of Traffic | Typical Queries | Best Model |
|---|---|---|---|
| Simple | 45-55% | Greetings, lookups, formatting, simple Q&A | GPT-4o Mini / Haiku |
| Medium | 25-35% | Summarization, moderate analysis, code explanation | Sonnet / GPT-4o |
| Complex | 10-20% | Multi-step reasoning, code generation, research | Opus / GPT-4 |
| Frontier | 2-5% | Novel reasoning, complex math, creative writing | o1 / Opus |
Most production workloads send 45-55% of their traffic to a frontier model when a model costing 10-20x less would produce identical output. That's not a rounding error. That's the majority of your bill.
The Math
Say you process 100,000 queries per day averaging 2,000 tokens each (input + output combined; for simplicity, everything here is priced at input rates). That's 200 million tokens a day:
- Single model (Opus for everything, at $15 per million tokens): ~$3,000/day → ~$90,000/month
- With routing (50% to Haiku at $0.80/M, 30% Sonnet at $3/M, 20% Opus at $15/M): ~$860/day → ~$25,800/month
That's a roughly 70% reduction just from routing, before any prompt compression. Combine routing with lossless prompt compression and you're looking at 80-93% total savings.
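The same arithmetic is easy to sanity-check for your own traffic. A minimal calculator (model names, prices, and the traffic mix below are illustrative; substitute your provider's current rates):

```python
# Estimate daily LLM spend for a given routing mix.
# Prices are illustrative USD per 1M tokens -- substitute current rates.
PRICES = {"haiku": 0.80, "sonnet": 3.00, "opus": 15.00}

def daily_cost(queries_per_day: int, avg_tokens: int, mix: dict) -> float:
    """mix maps model name -> fraction of traffic (fractions sum to 1)."""
    total_tokens_m = queries_per_day * avg_tokens / 1_000_000
    return sum(total_tokens_m * share * PRICES[model]
               for model, share in mix.items())

single = daily_cost(100_000, 2_000, {"opus": 1.0})
routed = daily_cost(100_000, 2_000, {"haiku": 0.5, "sonnet": 0.3, "opus": 0.2})
print(f"single: ${single:,.0f}/day, routed: ${routed:,.0f}/day")
```

Plugging in a different mix or token volume immediately shows whether routing pays off for your workload.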
How Model Routing Works
A model router sits between your application and your LLM providers. For each incoming query, it:
- Scores the query complexity (simple, medium, complex, frontier)
- Selects the cheapest model capable of handling that complexity tier
- Forwards the request to the selected model
- Monitors output quality and adjusts routing decisions over time
The critical insight: you don't need to classify queries perfectly. You need to classify the easy ones correctly. If you can reliably identify the 50% of queries that are simple, you've already cut your bill in half.
Approach 1: Rule-Based Routing
The simplest router uses heuristics — token count, keyword matching, and structural analysis:
```python
import re
from enum import Enum

class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"

MODEL_MAP = {
    Tier.SIMPLE: "claude-3-5-haiku-20241022",
    Tier.MEDIUM: "claude-sonnet-4-20250514",
    Tier.COMPLEX: "claude-opus-4-20250514",
}

# Cost per 1M input tokens (USD)
COST_MAP = {
    Tier.SIMPLE: 0.80,
    Tier.MEDIUM: 3.00,
    Tier.COMPLEX: 15.00,
}

COMPLEX_SIGNALS = [
    r"(?:analyze|compare|evaluate).*(?:and|versus|between)",
    r"(?:write|generate|create).*(?:code|function|class|script)",
    r"(?:step.by.step|chain.of.thought|reason through)",
    r"(?:multiple|several|all).*(?:options|approaches|solutions)",
]

SIMPLE_SIGNALS = [
    r"^(?:hi|hello|hey|thanks|ok|yes|no)\b",
    r"(?:what (?:is|are) (?:the )?(?:time|date|weather))",
    r"(?:translate|convert|format)\s+\w+",
    r"(?:how (?:do|can) I (?:reset|change|update))",
]

def classify_query(query: str) -> Tier:
    tokens = len(query.split())
    # Short queries are almost always simple
    if tokens < 15:
        for pattern in SIMPLE_SIGNALS:
            if re.search(pattern, query, re.IGNORECASE):
                return Tier.SIMPLE
    # Check for complexity signals
    complex_score = sum(
        1 for p in COMPLEX_SIGNALS
        if re.search(p, query, re.IGNORECASE)
    )
    if complex_score >= 2 or tokens > 500:
        return Tier.COMPLEX
    elif complex_score == 1 or tokens > 100:
        return Tier.MEDIUM
    return Tier.SIMPLE

def route(query: str) -> dict:
    tier = classify_query(query)
    return {
        "model": MODEL_MAP[tier],
        "tier": tier.value,
        "estimated_cost_per_1m": COST_MAP[tier],
    }
```
Pros: Zero latency overhead, no additional API calls, fully deterministic.
Cons: Misclassifies nuanced queries, requires manual rule maintenance.
In practice, rule-based routing correctly classifies 70-80% of queries. That's good enough to save serious money, but you leave savings on the table for the ambiguous middle tier.
Approach 2: Embedding-Based Classification
A more accurate approach uses embeddings to classify queries against labeled examples:
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Pre-computed centroids from labeled training data
# Each centroid is the mean embedding of 100+ labeled queries
CENTROIDS = {
    "simple": np.array([...]),   # Pre-computed
    "medium": np.array([...]),
    "complex": np.array([...]),
}

def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # $0.02 per 1M tokens
        input=text,
    )
    return np.array(response.data[0].embedding)

def classify_by_embedding(query: str) -> str:
    query_emb = get_embedding(query)
    # Cosine similarity against each tier centroid
    similarities = {
        tier: np.dot(query_emb, centroid) / (
            np.linalg.norm(query_emb) * np.linalg.norm(centroid)
        )
        for tier, centroid in CENTROIDS.items()
    }
    return max(similarities, key=similarities.get)
```
Pros: Handles nuanced queries much better, 85-90% accuracy.
Cons: Adds latency (50-100ms per classification), requires labeled training data, costs money (though text-embedding-3-small is dirt cheap at $0.02/1M tokens).
The embedding cost for classification is negligible. Classifying 100,000 queries at an average of 50 tokens each costs $0.10 total. If it saves even 5% more queries from being routed to expensive models, it pays for itself thousands of times over.
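The nearest-centroid step itself is just cosine similarity and needs no libraries at all. A dependency-free sketch, using made-up 3-dimensional vectors in place of real 1,536-dimensional embeddings:

```python
import math

# Toy 3-d vectors standing in for real embedding centroids (hypothetical data)
CENTROIDS = {
    "simple":  [0.9, 0.1, 0.0],
    "medium":  [0.3, 0.8, 0.2],
    "complex": [0.1, 0.2, 0.9],
}

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_tier(embedding: list) -> str:
    """Assign the tier whose centroid is most similar to the query embedding."""
    return max(CENTROIDS, key=lambda t: cosine(embedding, CENTROIDS[t]))
```

The real system only swaps these toy vectors for actual embedding output; the classification logic is identical.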
Approach 3: Small-Model Classifier
Train a dedicated small model (or use a cheap LLM) as the classifier:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

CLASSIFIER_PROMPT = """Score this query's complexity from 1-10.
1-3: Simple lookup, greeting, formatting, basic Q&A
4-6: Summarization, moderate analysis, standard code help
7-9: Multi-step reasoning, complex code gen, research
10: Novel problems requiring frontier reasoning
Query: {query}
Score (number only):"""

async def classify_with_llm(query: str) -> Tier:
    # Tier is the enum defined in the rule-based example above
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap classifier
        messages=[{
            "role": "user",
            "content": CLASSIFIER_PROMPT.format(query=query)
        }],
        max_tokens=3,
        temperature=0,
    )
    score = int(response.choices[0].message.content.strip())
    if score <= 3:
        return Tier.SIMPLE
    elif score <= 6:
        return Tier.MEDIUM
    else:
        return Tier.COMPLEX
```
Pros: Highest accuracy (90-95%), handles edge cases well, adapts to new query patterns.
Cons: Highest latency overhead (200-500ms), costs more than embedding approach.
The classification cost is still tiny compared to what it saves. A GPT-4o Mini classification call costs roughly $0.0001. If it prevents one query from unnecessarily hitting Opus (saving $0.03), it needs to be right only 1 in 300 times to break even. In practice, it's right 90%+ of the time.
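That break-even claim is worth writing down explicitly (the two costs below are the rough figures from the paragraph above, not exact provider prices):

```python
# Rough figures: how often must the cheap classifier catch an unnecessary
# top-tier call for its own cost to pay for itself?
classifier_cost = 0.0001   # ~cost of one GPT-4o Mini classification call
saving_per_catch = 0.03    # ~saving when one query avoids the top tier

# Fraction of classification calls that must be correct downgrades
break_even_rate = classifier_cost / saving_per_catch
print(f"break-even: 1 correct downgrade per "
      f"{saving_per_catch / classifier_cost:.0f} calls")
```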
Production Architecture
A production-grade router combines all three approaches in a cascade:
```
┌────────────────┐
│ Incoming Query │
└──────┬─────────┘
       │
       ▼
┌──────────────┐   HIGH CONFIDENCE
│  Rule-Based  │──────────────────────► Route immediately
│  Classifier  │   (70% of traffic)
└──────┬───────┘
       │ LOW CONFIDENCE
       ▼
┌──────────────┐   HIGH CONFIDENCE
│  Embedding   │──────────────────────► Route
│  Classifier  │   (20% of traffic)
└──────┬───────┘
       │ AMBIGUOUS
       ▼
┌──────────────┐
│  LLM-based   │──────────────────────► Route
│  Classifier  │   (10% of traffic)
└──────────────┘
```
This cascade handles 70% of queries with zero latency, 20% with minimal latency, and only burns an LLM call on the truly ambiguous 10%. Total classification cost stays under $5/month for most workloads.
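The control flow of that cascade is simple. A minimal sketch with stubbed-out classifiers standing in for the real ones (the stubs and the 0.8 confidence threshold are illustrative, not tuned values):

```python
from typing import Callable, Optional, Tuple

# Each stage returns (tier, confidence), or None if it abstains.
Stage = Callable[[str], Optional[Tuple[str, float]]]

def rule_stage(query: str):
    # Stub: only very short queries get a high-confidence rule match
    if len(query.split()) < 5:
        return ("simple", 0.95)
    return None

def embedding_stage(query: str):
    # Stub: pretend mid-length queries are confidently medium
    if len(query.split()) < 50:
        return ("medium", 0.85)
    return None

def llm_stage(query: str):
    # Final fallback always produces an answer
    return ("complex", 0.99)

def cascade(query: str, threshold: float = 0.8) -> str:
    """Try cheap classifiers first; fall through only when unconfident."""
    for stage in (rule_stage, embedding_stage, llm_stage):
        result = stage(query)
        if result is not None and result[1] >= threshold:
            return result[0]
    return "medium"  # conservative default if nothing is confident
```

In production, each stub is replaced by the corresponding classifier from the three approaches above; the cascade structure itself stays this small.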
The Feedback Loop
Static routing leaves money on the table. A closed-loop system monitors actual output quality and adjusts:
```python
import json
from datetime import datetime, timezone

class RoutingFeedback:
    def __init__(self, log_path: str = "routing_log.jsonl"):
        self.log_path = log_path

    def log_result(self, query: str, tier: str, model: str,
                   quality_score: float, tokens_used: int):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "query_hash": hash(query) % (10**10),
            "tier": tier,
            "model": model,
            "quality": quality_score,
            "tokens": tokens_used,
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def _get_recent(self, tier: str, window: int) -> list:
        """Return the last `window` logged entries for a tier."""
        try:
            with open(self.log_path) as f:
                entries = [json.loads(line) for line in f]
        except FileNotFoundError:
            return []
        return [e for e in entries if e["tier"] == tier][-window:]

    def should_upgrade(self, tier: str, window: int = 100) -> bool:
        """Check if a tier's quality has dropped below threshold."""
        recent = self._get_recent(tier, window)
        if not recent:
            return False
        avg_quality = sum(r["quality"] for r in recent) / len(recent)
        # If average quality drops below 0.7, upgrade the model
        return avg_quality < 0.7

    def should_downgrade(self, tier: str, window: int = 100) -> bool:
        """Check if a tier could use a cheaper model."""
        recent = self._get_recent(tier, window)
        if not recent:
            return False
        avg_quality = sum(r["quality"] for r in recent) / len(recent)
        # If quality is consistently above 0.95, try cheaper
        return avg_quality > 0.95
```
This is the "closed-loop" in claw.zip's design. Rather than routing based on static rules, the system continuously evaluates whether each model tier is over-serving or under-serving its queries, and adjusts the boundaries accordingly.
Common Mistakes
1. Routing Too Aggressively
The temptation is to route everything to the cheapest model. Don't. A query that produces a bad response costs more than the tokens you saved — the user retries, your application looks unreliable, and you may lose trust.
Rule of thumb: When uncertain, route UP one tier. The cost difference between Haiku and Sonnet on a single query is fractions of a cent. The cost of a bad response is measured in user experience.
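That rule of thumb fits in a few lines. A sketch, assuming the three-tier ordering used throughout this post:

```python
TIERS = ["simple", "medium", "complex"]  # cheapest to most capable

def resolve(tier: str, confidence: float, threshold: float = 0.8) -> str:
    """When the classifier is unsure, route UP one tier rather than down."""
    if confidence >= threshold:
        return tier
    idx = TIERS.index(tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]
```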
2. Ignoring Output Token Costs
Most routing focuses on input complexity, but output tokens are often more expensive. A "simple" query that asks for a long-form response can cost more than a "complex" query that expects a short answer.
```python
def estimate_output_length(query: str) -> str:
    """Rough heuristic for expected output length."""
    long_signals = [
        "explain in detail", "write a", "comprehensive",
        "full analysis", "step by step", "all possible",
    ]
    if any(s in query.lower() for s in long_signals):
        return "long"  # Route to a model with better $/output-token
    return "standard"
```
3. Not Measuring Actual Savings
You can't optimize what you don't measure. Log every routed request with the model used, tokens consumed, and estimated cost. Compare weekly.
```bash
# Quick savings estimate from routing logs
cat routing_log.jsonl | \
  jq -s '
    group_by(.tier) |
    map({
      tier: .[0].tier,
      count: length,
      avg_tokens: (map(.tokens) | add / length),
      model: .[0].model
    })
  '
```
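If jq isn't on hand, the same per-tier rollup is a few lines of Python (the sample log entries below are made up for illustration):

```python
import json
from collections import defaultdict

# Hypothetical sample of routing-log lines (JSONL, one entry per line)
log_lines = [
    '{"tier": "simple", "model": "haiku", "tokens": 300, "quality": 0.9}',
    '{"tier": "simple", "model": "haiku", "tokens": 500, "quality": 0.95}',
    '{"tier": "complex", "model": "opus", "tokens": 4000, "quality": 0.8}',
]

summary = defaultdict(lambda: {"count": 0, "tokens": 0})
for line in log_lines:
    entry = json.loads(line)
    stats = summary[entry["tier"]]
    stats["count"] += 1
    stats["tokens"] += entry["tokens"]

for tier, stats in sorted(summary.items()):
    avg = stats["tokens"] / stats["count"]
    print(f'{tier}: {stats["count"]} queries, avg {avg:.0f} tokens')
```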
How claw.zip Handles Model Routing
claw.zip implements a production-grade version of this architecture with two key additions:
Combined with prompt compression. Every query is compacted before routing. This means the router scores a tighter, more precise version of the query, which improves classification accuracy while also reducing the token cost of whichever model ultimately handles it.
Zero-config setup. Instead of building and maintaining your own classifier, training data, and feedback loops, claw.zip handles routing out of the box. Install it, point it at your API keys, and it starts routing immediately.
The compound effect of compression + routing is why most users see 80-93% savings rather than the 60-70% from routing alone.
```bash
# Install claw.zip — routing + compression in one step
npx claw-zip
```
No code changes. No model switching logic. No classifier training. It sits between your application and your LLM provider, and every query gets compressed and routed automatically.
When NOT to Use Model Routing
Model routing isn't always the right answer:
- Safety-critical applications: If every response must come from the most capable model for liability reasons, don't route down.
- Very low volume: If you make 100 API calls a day, the engineering effort of routing exceeds the savings. Just use a mid-tier model.
- Homogeneous queries: If every query genuinely requires frontier reasoning (advanced math, novel code generation), routing won't find cheaper alternatives.
For everyone else — which is most production AI applications — model routing is the single highest-impact cost optimization you can implement.
Getting Started
If you want to build your own router:
- Start with rules. Implement the rule-based classifier from this post. It takes an hour and immediately saves 30-40%.
- Log everything. Capture the model used, tier assigned, and output quality for every request.
- Add embeddings after 1,000 queries. Use your logged data to compute tier centroids, then add the embedding classifier for ambiguous cases.
- Close the loop. Build the feedback mechanism to auto-adjust tier boundaries based on quality scores.
If you want the result without the work:
```bash
npx claw-zip
```
Routing + compression. One command. See what you'd save →