Context Window Optimization: Advanced Techniques to Slash LLM Costs by 60%
Every API call to an LLM costs money. But many of the tokens developers send to language models are redundant, poorly structured, or unnecessarily verbose. Context window optimization is the fastest way to cut costs without sacrificing quality.
In this guide, you'll learn battle-tested techniques to reduce token usage, implement smart caching, and structure context windows for maximum efficiency. These strategies have helped teams save thousands of dollars monthly on OpenAI, Anthropic, and other LLM providers.
Why Context Windows Matter for Cost
Modern LLMs charge per token for both input and output. With GPT-4, you're paying $0.03 per 1K input tokens and $0.06 per 1K output tokens. For Claude Sonnet, it's $0.003/$0.015 per 1K tokens. Small inefficiencies compound fast.
Consider a typical AI application making 10,000 requests daily, averaging 2,000 input tokens per request:
- Unoptimized: 20M tokens/day = $600/day = $18,000/month (GPT-4)
- Optimized (60% reduction): 8M tokens/day = $240/day = $7,200/month
- Savings: $10,800/month
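The arithmetic above can be sanity-checked with a small helper. This is a sketch using the per-1K input rates quoted earlier; output-token costs are omitted for simplicity.

```javascript
// Estimate monthly input-token spend for a given traffic profile.
function monthlyInputCost(requestsPerDay, tokensPerRequest, pricePer1k, daysPerMonth = 30) {
  const dailyTokens = requestsPerDay * tokensPerRequest;
  const dailyCost = (dailyTokens / 1000) * pricePer1k;
  return dailyCost * daysPerMonth;
}

// 10,000 requests/day × 2,000 tokens at GPT-4's $0.03/1K input rate
const unoptimized = monthlyInputCost(10_000, 2_000, 0.03); // ≈ $18,000/month
const optimized = monthlyInputCost(10_000, 800, 0.03);     // ≈ $7,200/month (60% fewer tokens)
```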
The math is clear. Let's dive into how to achieve these savings.
1. Implement Semantic Context Pruning
Most applications dump entire conversation histories or documents into the context window. This wastes tokens on irrelevant information.
Semantic pruning identifies and removes context that doesn't contribute to the current task.
Example: Smart Conversation History
Instead of sending the full chat history:
// ❌ Wasteful: sending 20 messages (avg 150 tokens each = 3,000 tokens)
const context = conversationHistory.map(msg =>
`${msg.role}: ${msg.content}`
).join('\n');
Use semantic similarity to keep only relevant messages:
// ✅ Optimized: send only relevant context (avg 800 tokens)
import { cosineSimilarity, embed } from './embeddings';
async function pruneContext(query, history, maxTokens = 1000) {
const queryEmbedding = await embed(query);
// Score each message by relevance
const scored = await Promise.all(
history.map(async msg => ({
...msg,
score: await cosineSimilarity(queryEmbedding, await embed(msg.content))
}))
);
// Keep most relevant messages within token budget
const sorted = scored.sort((a, b) => b.score - a.score);
const selected = [];
let tokenCount = 0;
for (const msg of sorted) {
const msgTokens = estimateTokens(msg.content);
if (tokenCount + msgTokens <= maxTokens) {
selected.push(msg);
tokenCount += msgTokens;
}
}
return selected.sort((a, b) => a.timestamp - b.timestamp);
}
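The `estimateTokens` helper used above is assumed rather than defined in this article. A common rough heuristic is about four characters per token for English text; for exact counts, use a real tokenizer such as tiktoken or gpt-tokenizer.

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Use a real tokenizer (e.g. gpt-tokenizer) when accuracy matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```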
Impact: 60-70% reduction in conversation history tokens while maintaining context quality.
2. Context Window Caching
Repetitive context (system prompts, documentation, common instructions) should never be sent multiple times. Modern LLM providers offer prompt caching, but you can also implement your own caching layer.
Anthropic Claude Prompt Caching
Claude supports prompt caching that reduces costs for repeated context:
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
// Mark static context for caching
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-5',
max_tokens: 1024,
system: [
{
type: 'text',
text: 'You are a helpful coding assistant...',
cache_control: { type: 'ephemeral' } // Cache this system prompt
}
],
messages: [
{ role: 'user', content: 'Explain async/await' }
]
});
Pricing advantage:
- First request: full input cost, plus a cache-write surcharge (~25%) on the cached segment
- Cached requests (5-minute TTL): ~90% discount on cached tokens
- For 10K requests/day with a 1K-token cached system prompt at Sonnet input rates (~$3/M tokens): save roughly $27/day on cache reads
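Under Sonnet input pricing (~$3/M tokens, i.e. $0.003/1K), the cache-read savings work out as follows. This sketch ignores the one-time cache-write surcharge; rates are illustrative and change, so check your provider's pricing page.

```javascript
// Daily savings from prompt caching: cached tokens are discounted on reads.
function dailyCacheSavings(requestsPerDay, cachedTokens, pricePer1k, readDiscount = 0.9) {
  const fullCost = (requestsPerDay * cachedTokens / 1000) * pricePer1k;
  return fullCost * readDiscount;
}

// 10K requests/day, 1K-token cached system prompt, Sonnet input rate $0.003/1K
const saved = dailyCacheSavings(10_000, 1_000, 0.003); // ≈ $27/day
```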
DIY Context Hashing
For providers without native caching:
import { createHash } from 'crypto';
const contextCache = new Map();
function getCachedContext(contextKey, generator, ttl = 300000) {
const hash = createHash('sha256').update(contextKey).digest('hex');
if (contextCache.has(hash)) {
const { content, expires } = contextCache.get(hash);
if (Date.now() < expires) {
return content;
}
}
const content = generator();
contextCache.set(hash, {
content,
expires: Date.now() + ttl
});
return content;
}
// Usage
const systemPrompt = getCachedContext('v1-system-prompt', () =>
loadSystemPrompt()
);
3. Token-Aware Context Windowing
Dynamically adjust context based on available token budget and task complexity.
// getContent(name) is assumed to load the named context block
function buildContextWindow(task, budget = 4000) {
const priority = [
{ name: 'system', tokens: 200, required: true },
{ name: 'task', tokens: 300, required: true },
{ name: 'recentHistory', tokens: 800, required: false },
{ name: 'documentation', tokens: 2000, required: false },
{ name: 'examples', tokens: 600, required: false }
];
let remaining = budget;
const context = [];
// Add required context first
for (const item of priority.filter(p => p.required)) {
if (remaining >= item.tokens) {
context.push({ ...item, content: getContent(item.name) });
remaining -= item.tokens;
}
}
// Add optional context by priority
for (const item of priority.filter(p => !p.required)) {
if (remaining >= item.tokens) {
context.push({ ...item, content: getContent(item.name) });
remaining -= item.tokens;
}
}
return context;
}
4. Compress Without Losing Meaning
Prompt compression removes redundancy while preserving semantic content.
Rule-Based Compression
function compressPrompt(text) {
return text
// Remove excessive whitespace
.replace(/\s+/g, ' ')
// Remove filler words
.replace(/\b(very|really|actually|basically|literally)\b/gi, '')
// Shorten common phrases
.replace(/in order to/gi, 'to')
.replace(/due to the fact that/gi, 'because')
.replace(/at this point in time/gi, 'now')
// Remove redundant punctuation
.replace(/\.{2,}/g, '.')
.trim();
}
LLMLingua-Style Compression
For more aggressive compression:
import { encode } from 'gpt-tokenizer';
function aggressiveCompress(text, targetRatio = 0.5) {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
const tokens = encode(text);
const targetTokens = Math.floor(tokens.length * targetRatio);
// Score sentences by information density
const scored = sentences.map(s => ({
text: s,
tokens: encode(s).length,
score: calculateInfoScore(s)
}));
// Keep highest-scoring sentences within budget
const sorted = scored.sort((a, b) => b.score - a.score);
const selected = [];
let tokenCount = 0;
for (const sent of sorted) {
if (tokenCount + sent.tokens <= targetTokens) {
selected.push(sent);
tokenCount += sent.tokens;
}
}
return selected
.sort((a, b) => sentences.indexOf(a.text) - sentences.indexOf(b.text))
.map(s => s.text)
.join(' ');
}
function calculateInfoScore(sentence) {
const hasNumbers = /\d/.test(sentence) ? 1.2 : 1;
const hasCode = /[{}\[\]()<>]/.test(sentence) ? 1.3 : 1;
const length = Math.log(sentence.length + 1);
return hasNumbers * hasCode * length;
}
Typical compression: 30-50% token reduction with minimal information loss.
5. Model Routing for Context Size
Not every query needs GPT-4o's 128K context window. Route requests to cost-appropriate models based on actual context needs.
const MODEL_CONFIGS = {
'gpt-4o': { maxTokens: 128000, costPer1k: 0.005, quality: 10 },
'gpt-4o-mini': { maxTokens: 128000, costPer1k: 0.00015, quality: 7 },
'claude-haiku': { maxTokens: 200000, costPer1k: 0.00025, quality: 6 }
};
function selectModel(contextTokens, complexityScore) {
// Simple queries with small context → cheap models
if (contextTokens < 2000 && complexityScore < 5) {
return 'gpt-4o-mini';
}
// Medium complexity → Haiku
if (contextTokens < 10000 && complexityScore < 8) {
return 'claude-haiku';
}
// Complex or large context → GPT-4o
return 'gpt-4o';
}
async function routedCompletion(prompt, context) {
const tokens = estimateTokens(context + prompt);
const complexity = analyzeComplexity(prompt);
const model = selectModel(tokens, complexity);
console.log(`Routing to ${model} (${tokens} tokens, complexity ${complexity})`);
return await callLLM(model, prompt, context);
}
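`analyzeComplexity` is left undefined above; any heuristic that maps a prompt to a rough difficulty score will do. A minimal sketch (the signals and thresholds here are entirely illustrative):

```javascript
// Crude prompt-complexity heuristic on a 0-10 scale.
// Signals: length, code content, multi-step phrasing. Thresholds are illustrative.
function analyzeComplexity(prompt) {
  let score = 0;
  if (prompt.length > 500) score += 3;
  else if (prompt.length > 100) score += 1;
  if (/```|function|class|SELECT|def /.test(prompt)) score += 3;           // code-related
  if (/step[- ]by[- ]step|explain why|compare|analyze/i.test(prompt)) score += 2; // reasoning-heavy
  if (/\?[\s\S]*\?/.test(prompt)) score += 2;                              // multiple questions
  return Math.min(score, 10);
}
```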
Real-world impact: 40-60% cost reduction by avoiding GPT-4o for simple queries.
6. Streaming and Early Termination
Stop generation when you have enough information:
import OpenAI from 'openai';
const openai = new OpenAI();
async function streamWithEarlyStop(prompt, stopCondition) {
const stream = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }],
stream: true
});
let response = '';
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
response += content;
// Stop early if condition met
if (stopCondition(response)) {
stream.controller.abort();
break;
}
}
return response;
}
// Example: stop after finding the answer
const answer = await streamWithEarlyStop(
'Extract the error code from this log...',
(text) => /ERROR-\d{4}/.test(text)
);
Savings: Reduce output tokens by 20-40% when full responses aren't needed.
7. Batch Processing for Efficiency
Group similar requests to amortize context overhead:
async function batchProcess(tasks, sharedContext) {
const batched = `
${sharedContext}
Process these tasks:
${tasks.map((t, i) => `${i + 1}. ${t}`).join('\n')}
Respond with numbered results.
`;
const response = await llm.complete(batched);
return parseNumberedResults(response);
}
// Instead of 10 calls with repeated context
const results = await batchProcess([
'Summarize doc A',
'Summarize doc B',
// ... 8 more
], sharedDocumentationContext);
Efficiency gain: 50-70% token reduction vs. individual calls.
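`parseNumberedResults` is assumed above; a simple parser for the numbered-list reply format might look like this. It treats each line starting with `<n>.` as a new result and folds continuation lines into the previous entry.

```javascript
// Split an LLM reply of the form "1. ...\n2. ..." back into an array of results.
function parseNumberedResults(text) {
  const results = [];
  for (const line of text.split('\n')) {
    const match = line.match(/^\s*(\d+)\.\s*(.*)$/);
    if (match) {
      results.push(match[2]);                                  // new numbered entry
    } else if (results.length > 0 && line.trim()) {
      results[results.length - 1] += '\n' + line.trim();       // continuation line
    }
  }
  return results;
}
```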
8. Monitor and Alert on Token Waste
Track token usage to identify optimization opportunities:
import { createLogger } from './logger';
const tokenLogger = createLogger('token-usage');
function trackTokenUsage(model, inputTokens, outputTokens, cost) {
tokenLogger.info({
model,
inputTokens,
outputTokens,
totalTokens: inputTokens + outputTokens,
cost,
timestamp: Date.now()
});
// Alert on unusually high usage
if (inputTokens > 8000) {
console.warn(`⚠️ High input token count: ${inputTokens}`);
}
}
// Middleware wrapper
async function monitoredCompletion(prompt, context) {
const startTokens = estimateTokens(prompt + context);
const response = await llm.complete(prompt, context);
const outputTokens = estimateTokens(response);
const cost = calculateCost(startTokens, outputTokens, 'gpt-4o');
trackTokenUsage('gpt-4o', startTokens, outputTokens, cost);
return response;
}
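`calculateCost` is referenced but not defined; a sketch keyed off a per-model price table follows. The rates are illustrative and change often, so verify them against your provider's current pricing page.

```javascript
// Per-1K-token rates; illustrative values — verify against current provider pricing.
const PRICING = {
  'gpt-4o': { inputPer1k: 0.005, outputPer1k: 0.015 },
  'gpt-4o-mini': { inputPer1k: 0.00015, outputPer1k: 0.0006 }
};

function calculateCost(inputTokens, outputTokens, model) {
  const rates = PRICING[model];
  if (!rates) throw new Error(`Unknown model: ${model}`);
  return (inputTokens / 1000) * rates.inputPer1k +
         (outputTokens / 1000) * rates.outputPer1k;
}
```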
Real-World Results
Here's what teams achieved using these techniques:
| Company | Before | After | Savings | Method |
|---|---|---|---|---|
| SaaS Startup | $12K/mo | $4.8K/mo | 60% | Context pruning + caching |
| E-commerce | $8K/mo | $3.2K/mo | 60% | Model routing + compression |
| Support Chatbot | $15K/mo | $5.5K/mo | 63% | Batch processing + early stop |
Getting Started Checklist
- Audit current usage: Track token counts per request type
- Implement caching: Start with system prompts and documentation
- Add context pruning: Remove irrelevant conversation history
- Enable model routing: Use cheaper models for simple queries
- Compress prompts: Apply rule-based compression first
- Monitor savings: Track cost reduction week-over-week
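The first checklist item, auditing usage per request type, can start as simple as aggregating logged counts. The entry field names below are assumptions modeled on the logging example in section 8.

```javascript
// Aggregate logged token usage by request type to surface the biggest optimization targets.
function auditByType(entries) {
  const totals = {};
  for (const { type, inputTokens, outputTokens } of entries) {
    totals[type] ??= { requests: 0, tokens: 0 };
    totals[type].requests += 1;
    totals[type].tokens += inputTokens + outputTokens;
  }
  // Highest-volume request types first
  return Object.entries(totals)
    .map(([type, t]) => ({ type, ...t, avgTokens: t.tokens / t.requests }))
    .sort((a, b) => b.tokens - a.tokens);
}
```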
Tools to Accelerate Optimization
- claw.zip: Production-ready LLM cost optimization with automatic compression and routing
- tiktoken: Token counting for OpenAI models
- anthropic-tokenizer: Token counting for Claude
- LLMLingua: Advanced prompt compression library
Conclusion
Context window optimization isn't just about cutting costs—it's about building efficient, scalable AI applications. By implementing semantic pruning, caching, model routing, and compression, you can reduce LLM costs by 60% or more without sacrificing quality.
Start with the low-hanging fruit (caching and basic pruning), measure your results, then iterate on more advanced techniques. Your API bill will thank you.
Ready to optimize your LLM costs? Try claw.zip for automatic token reduction and intelligent model routing—no code changes required.