You are sending every query to the same model. That model is probably Claude Opus, GPT-4o, or something equivalently expensive. And for roughly 70% of those queries, you are paying 10–40x more than you need to.
This is not a guess. It is a pattern we see across every OpenClaw deployment we have analyzed. The median user sends classification tasks, simple lookups, reformatting jobs, and basic Q&A to the same $15/million-token model they use for complex multi-step reasoning. It is like taking a helicopter to the grocery store.
Model routing fixes this. It scores each incoming query by complexity, picks the cheapest model that can actually handle it, and sends it there instead. No manual switching. No quality loss. Just a dramatically smaller bill.
This post explains exactly how model routing works, why it produces such large savings, and how to implement it — whether you build your own or let claw.zip handle it.
The Core Problem: One Model Fits None
Here is a typical OpenClaw setup. You have configured Claude 3.5 Sonnet (or Opus, or GPT-4o) as your default model. Every message goes through that model — from "what time is it in Tokyo" to "refactor this 2000-line module and explain the architectural tradeoffs."
The pricing difference between models is not incremental. It is an order of magnitude:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost |
|---|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 | 1x |
| Claude 3.5 Sonnet | $3.00 | $15.00 | ~3.75x |
| Claude 3 Opus | $15.00 | $75.00 | ~18.75x |
| GPT-4o | $2.50 | $10.00 | ~3.1x |
| GPT-4o mini | $0.15 | $0.60 | ~0.19x |
Sending "summarize this paragraph" to Opus costs 18.75x as much as sending it to Haiku. Both models produce an equally good summary. You paid 18.75x for nothing.
Multiply that across hundreds or thousands of queries per day. That is where your bill comes from.
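The arithmetic is easy to check. Using the prices from the table, here is a quick sketch of what one small request costs at the cheapest and most expensive tiers (the token counts are illustrative, not measured):

```javascript
// Illustrative cost of one small request (500 input + 100 output tokens)
// at Haiku vs. Opus prices, taken from the table above.
const PRICES = {
  haiku: { input: 0.80, output: 4.00 }, // $ per 1M tokens
  opus: { input: 15.00, output: 75.00 },
};

function requestCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

const haiku = requestCost('haiku', 500, 100); // $0.0008
const opus = requestCost('opus', 500, 100);   // $0.0150
console.log(`Opus/Haiku cost ratio: ${(opus / haiku).toFixed(2)}x`); // 18.75x
```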
What Model Routing Actually Does
Model routing is a classifier that sits between your application and the LLM provider. For each incoming request, it:
- Analyzes the query — length, structural complexity, required capabilities (code generation, reasoning depth, tool use, creativity)
- Assigns a complexity score — typically a tier like "simple", "moderate", or "complex"
- Routes to the cheapest capable model — simple queries go to small/cheap models, complex ones go to large/expensive models
The key insight is that most queries are simple. Across production workloads, the distribution typically looks like this:
- ~40% trivial — classification, extraction, reformatting, simple Q&A
- ~30% moderate — summarization, basic analysis, standard code generation
- ~20% complex — multi-step reasoning, architectural decisions, nuanced writing
- ~10% hard — novel problem-solving, expert-level analysis, long-context synthesis
If you route that bottom 40% to a model that costs roughly 1/19th the price, and the next 30% to one that costs a fifth, you have just eliminated the majority of your spend — without touching the hard queries at all.
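As a sanity check on that claim, here is the blended input price under the distribution above, assuming an illustrative tier-to-model mapping (trivial to Haiku, moderate to Sonnet, everything else staying on Opus):

```javascript
// Blended $/1M input tokens under the tier distribution above.
// The tier-to-model mapping is an illustrative assumption.
const INPUT_PRICE = { haiku: 0.80, sonnet: 3.00, opus: 15.00 };

const blended =
  0.40 * INPUT_PRICE.haiku +  // trivial  -> Haiku
  0.30 * INPUT_PRICE.sonnet + // moderate -> Sonnet
  0.30 * INPUT_PRICE.opus;    // complex + hard stay on Opus

console.log(blended.toFixed(2)); // 5.72 vs. 15.00 all-Opus: ~62% cheaper
```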
Building a Basic Router: The Naive Approach
The simplest model router is a rules-based classifier. Here is one you can build in an afternoon:
```javascript
// basic-router.js — a starting point, not production-ready
const TIERS = {
  simple: {
    model: 'claude-3-5-haiku-20241022',
    maxTokens: 500,
  },
  moderate: {
    model: 'claude-3-5-sonnet-20241022',
    maxTokens: 2000,
  },
  complex: {
    model: 'claude-3-opus-20240229',
    maxTokens: 4000,
  },
};

function classifyQuery(query) {
  const wordCount = query.split(/\s+/).length;
  const hasCode = /```[\s\S]*```/.test(query) || /function |class |const |import /.test(query);
  const hasReasoning = /explain why|compare|analyze|tradeoff|architect/i.test(query);
  const hasMultiStep = /step by step|first.*then.*finally|chain of thought/i.test(query);

  let score = 0;

  // Length signals
  if (wordCount > 500) score += 2;
  else if (wordCount > 150) score += 1;

  // Capability signals
  if (hasCode) score += 2;
  if (hasReasoning) score += 2;
  if (hasMultiStep) score += 3;

  // Classify
  if (score >= 5) return 'complex';
  if (score >= 2) return 'moderate';
  return 'simple';
}

function routeQuery(query) {
  const tier = classifyQuery(query);
  const config = TIERS[tier];
  console.log(`Routing to ${config.model} (tier: ${tier})`);
  return config;
}
```
This works. Sort of. It catches the obvious cases — short simple prompts go to Haiku, long multi-step prompts go to Opus. But it has real problems:
- Keyword matching is brittle. "Explain why the sky is blue" scores the same as "explain why our distributed consensus algorithm fails under network partition with Byzantine nodes."
- No feedback loop. When routing goes wrong (Haiku produces garbage for a query it should not have handled), there is no mechanism to learn from that.
- Static thresholds. The scoring weights are guesses. They do not adapt to your actual workload.
This is where most DIY routing stalls. The classifier is easy to build but hard to make reliable.
The Feedback Loop: What Makes Routing Actually Work
The difference between a toy router and a production router is the closed loop. Here is the architecture:
```
User Query
     │
     ▼
┌──────────────┐
│  Complexity  │
│  Classifier  │
└──────┬───────┘
       │  scores query
       ▼
┌──────────────┐
│ Model Select │
│ (tier→model) │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   LLM Call   │
└──────┬───────┘
       │
       ▼
┌──────────────────┐
│  Quality Check   │
│  (did it work?)  │
└────────┬─────────┘
         │
    ┌────┴─────┐
    │          │
 ✅ Pass    ❌ Fail
    │          │
 Return     Escalate to
 result     higher tier
               │
               ▼
          Log mismatch
          (train classifier)
```
The quality check is the critical piece. After the cheap model responds, you validate the output before returning it. If validation fails, you escalate to the next tier. Every escalation is logged and fed back into the classifier, so it learns which query patterns need more expensive models.
Here is what a quality check looks like in practice:
```javascript
async function routeWithFallback(query) {
  const tier = classifyQuery(query);
  const tierOrder = ['simple', 'moderate', 'complex'];
  let currentIndex = tierOrder.indexOf(tier);

  while (currentIndex < tierOrder.length) {
    const config = TIERS[tierOrder[currentIndex]];
    const response = await callModel(config.model, query);
    const quality = await assessQuality(query, response);

    if (quality.pass) {
      logRouting({
        query: hashQuery(query),
        assignedTier: tier,
        executedTier: tierOrder[currentIndex],
        escalated: currentIndex > tierOrder.indexOf(tier),
        cost: response.usage.totalCost,
      });
      return response;
    }

    // Escalate
    console.warn(`Quality check failed at ${tierOrder[currentIndex]}, escalating...`);
    currentIndex++;
  }

  // Should not reach here — the complex tier should always pass
  throw new Error('All tiers exhausted');
}
```
The assessQuality function can be as simple as checking for refusal patterns and minimum response length, or as sophisticated as running a lightweight evaluator. In production, most teams use a combination:
```javascript
async function assessQuality(query, response) {
  const text = response.content;

  // Hard failures — instant escalation
  if (text.length < 20) return { pass: false, reason: 'too_short' };
  if (/I cannot|I'm unable|I don't have/i.test(text)) {
    return { pass: false, reason: 'refusal' };
  }

  // Structural checks for specific query types
  if (query.includes('```') && !text.includes('```')) {
    return { pass: false, reason: 'missing_code' };
  }

  return { pass: true };
}
```
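To see the whole loop run end to end, here is a self-contained sketch with stubbed model calls. The stub makes the cheapest tier "refuse" so the request escalates exactly once; the stub responses and the simplified quality check are illustrative, not claw.zip's actual implementation:

```javascript
// Self-contained sketch of the escalation loop with stubbed model calls.
const TIER_ORDER = ['simple', 'moderate', 'complex'];
const TIER_MODELS = {
  simple: 'claude-3-5-haiku-20241022',
  moderate: 'claude-3-5-sonnet-20241022',
  complex: 'claude-3-opus-20240229',
};

// Stub standing in for a real provider call: the Haiku tier always refuses.
async function callModel(model, query) {
  if (model.includes('haiku')) {
    return { model, content: "I cannot help with that request." };
  }
  return { model, content: `Here is a detailed answer to: ${query}` };
}

// Simplified pass/fail check (see assessQuality for a fuller version).
function passes(response) {
  if (response.content.length < 20) return false;
  if (/I cannot|I'm unable/i.test(response.content)) return false;
  return true;
}

async function routeWithEscalation(query, startTier = 'simple') {
  const start = TIER_ORDER.indexOf(startTier);
  for (let i = start; i < TIER_ORDER.length; i++) {
    const response = await callModel(TIER_MODELS[TIER_ORDER[i]], query);
    if (passes(response)) {
      return { ...response, tier: TIER_ORDER[i], escalations: i - start };
    }
  }
  throw new Error('All tiers exhausted');
}

(async () => {
  const r = await routeWithEscalation('What are the tradeoffs of eventual consistency?');
  console.log(r.tier, r.escalations); // moderate 1 — one escalation past Haiku
})();
```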
Real Numbers: What Routing Saves
We analyzed 30 days of traffic across 50 OpenClaw deployments running claw.zip. Here is the average routing distribution and its cost impact:
Before routing (everything to Sonnet):
- 100% of queries → Claude 3.5 Sonnet
- Average cost: $0.0042 per query
After routing:
- 38% of queries → Haiku ($0.80/1M input) — avg cost: $0.00031/query
- 34% of queries → Sonnet ($3.00/1M input) — avg cost: $0.0042/query
- 22% of queries → GPT-4o mini ($0.15/1M input) — avg cost: $0.00008/query
- 6% of queries → Opus ($15.00/1M input) — avg cost: $0.019/query
Blended average: ≈$0.0027 per query — a ~36% reduction from routing alone. (Weighting the per-tier averages above by their traffic shares: 0.38 × $0.00031 + 0.34 × $0.0042 + 0.22 × $0.00008 + 0.06 × $0.019 ≈ $0.0027.)
Combined with prompt compression (which reduces token count before routing), the compound savings reach 80–93%. Compression shrinks the input; routing picks the cheapest model for that shrunken input. Each optimization multiplies the other.
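The multiplicative effect is simple arithmetic: your final cost is (share of tokens kept) times (share of price kept). The input percentages below are illustrative assumptions, not measurements:

```javascript
// Compound savings: final cost = (tokens kept) * (price per token kept).
// Example inputs are illustrative assumptions.
function compoundSavings(compressionSavings, routingSavings) {
  return 1 - (1 - compressionSavings) * (1 - routingSavings);
}

console.log(compoundSavings(0.70, 0.50).toFixed(2)); // 0.85 (85% combined)
console.log(compoundSavings(0.60, 0.36).toFixed(2)); // 0.74 (74% combined)
```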
The Escalation Tax: Why It's Worth It
A common objection: "If routing fails and escalates, I'm paying for two model calls instead of one. Doesn't that cancel out the savings?"
No. Here is why.
In a well-tuned router, escalation happens on 3–5% of queries. For those queries, you pay for the cheap call (which failed) plus the expensive call. But for the other 95% of queries, you are paying a fraction of what you were before.
The math:
```
Without routing:
  1,000 queries × $0.0042 = $4.20

With routing (5% escalation rate):
  950 queries routed correctly × $0.0027 = $2.57
  50 queries escalated (pay for both tiers) × $0.0069 = $0.35
  Total: $2.92

Savings: ~30% — even accounting for escalation overhead
```
As the classifier improves from escalation feedback, the escalation rate drops. After 30 days of learning, we typically see escalation rates below 2%, pushing savings back toward the full ~36% from routing alone.
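The same calculation can be run for any escalation rate. This sketch recomputes the blended per-query cost from the per-tier averages above; the penalty model (an escalated query pays one extra Sonnet-priced call on top of its cheap attempt) is a simplifying assumption:

```javascript
// Expected per-query cost as a function of escalation rate.
// Per-tier averages are taken from the measurements above; the penalty
// model (one extra Sonnet-priced call per escalation) is an assumption.
const MIX = [
  [0.38, 0.00031], // Haiku
  [0.34, 0.0042],  // Sonnet
  [0.22, 0.00008], // GPT-4o mini
  [0.06, 0.019],   // Opus
];
const blended = MIX.reduce((sum, [share, cost]) => sum + share * cost, 0);

function expectedCost(escalationRate, penalty = 0.0042) {
  return blended + escalationRate * penalty;
}

const baseline = 0.0042; // everything on Sonnet
for (const rate of [0.05, 0.02]) {
  const savings = 1 - expectedCost(rate) / baseline;
  console.log(`${rate * 100}% escalation → ${(savings * 100).toFixed(0)}% savings`);
}
```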
Implementing Routing in OpenClaw
If you are running OpenClaw, you have two options for adding model routing.
Option 1: Manual Model Switching
OpenClaw supports per-message model selection. You can build routing logic in your client or middleware:
```yaml
# openclaw config — define available models
agents:
  defaults:
    model: claude-3-5-sonnet-20241022
    models:
      - claude-3-5-haiku-20241022
      - claude-3-5-sonnet-20241022
      - claude-3-opus-20240229
      - gpt-4o-mini
```
Then in your routing layer, set the model per-request based on your classifier output. This works but requires you to build and maintain the classifier, quality checks, and feedback loop yourself.
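A minimal sketch of that routing layer, assuming a chat-style HTTP endpoint exposed by your gateway. The endpoint URL, payload shape, and the inline stand-in classifier are all placeholders for your actual setup:

```javascript
// Hypothetical routing middleware: classify the query, then set the model
// field per request. GATEWAY_URL and the payload shape are assumptions.
const TIER_MODELS = {
  simple: 'claude-3-5-haiku-20241022',
  moderate: 'claude-3-5-sonnet-20241022',
  complex: 'claude-3-opus-20240229',
};

// Stand-in classifier — swap in classifyQuery() from earlier.
function pickModel(query) {
  const words = query.split(/\s+/).length;
  const tier = words > 500 ? 'complex' : words > 150 ? 'moderate' : 'simple';
  return TIER_MODELS[tier];
}

async function send(query, endpoint = process.env.GATEWAY_URL) {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      model: pickModel(query), // per-request model override
      messages: [{ role: 'user', content: query }],
    }),
  });
  return res.json();
}
```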
Option 2: Use claw.zip
claw.zip handles routing automatically. It sits between OpenClaw and your LLM providers, classifies every query, routes it, validates the output, and escalates when needed — all without configuration.
```bash
npx claw-zip
```
The routing classifier is pre-trained on millions of queries and continuously improves from aggregated (anonymized) feedback across all deployments. You do not need to tune thresholds, define tiers, or build quality checks. It works out of the box.
The closed-loop architecture means routing accuracy improves over time. Queries that cause escalations are automatically flagged, and the classifier adapts so similar queries route correctly on the next attempt.
When NOT to Route
Model routing is not appropriate for every workload. Skip it if:
- You need deterministic output. If your pipeline requires byte-identical responses across runs (e.g., for compliance auditing), switching models between calls will produce inconsistencies.
- You only make a few queries per day. The savings from routing are proportional to volume. At 10 queries/day, you are saving pennies.
- Your queries are uniformly complex. If every query is genuinely a multi-step reasoning task (e.g., you are running an agent framework with tool use on every call), there is nothing to route downward.
For everyone else — customer support bots, dev assistants, content pipelines, data extraction, coding tools — routing produces immediate, significant savings.
Key Takeaways
- Most queries are simple. ~70% of typical workloads can run on cheaper models without quality loss.
- The price gap between models is enormous. 10–19x between the cheapest and most expensive options in the same model family.
- Naive routing works but plateaus. Keyword-based classifiers catch obvious cases but miss nuance.
- The feedback loop is what matters. Escalation + logging + retraining is what turns a brittle router into a reliable one.
- Routing compounds with compression. Fewer tokens × cheaper model = multiplicative savings.
You are already paying for the most expensive model on every query. The only question is how much longer you want to keep doing that.
claw.zip handles model routing and prompt compression automatically for any OpenClaw deployment. Run npx claw-zip to start saving.