
LLM Cost Optimization and Multi-Model Routing — How to Cut AI Costs by 60%

Ádám Zsolt & AIMY
10 min read

This article is part 3 of the Business Decisions of LLM Model Selection whitepaper series. Other parts: The 2026 LLM market map, Task-based model selection and benchmarks, Security, local models and decision matrix.

Cost Analysis and Optimization

Scenario: AI Assistant for a Service Company

Let's take a real, practical example. A service company operates an AI assistant for customer communication, internal task management and analytics. The usage pattern is as follows:

  • 100 interactions per day, of which 30% are simple (FAQ, reminders), 50% are medium (CRM operations, email generation, calendar management), and 20% are complex (analytics, forecasts, recommendations).
  • Average token usage per interaction: for simple tasks 1,500 input + 800 output tokens, for medium tasks 3,000 + 2,000, and for complex tasks 5,000 + 4,000 tokens.
  • Monthly total: ~3,000 interactions.
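The monthly token volumes implied by these assumptions can be reproduced with a short sketch (the shares and per-interaction token counts are taken directly from the scenario above):

```python
# Monthly token volume per task tier, derived from the scenario above.
INTERACTIONS_PER_MONTH = 3000

TIERS = {
    # tier: (share, input tokens, output tokens per interaction)
    "simple":  (0.30, 1_500, 800),
    "medium":  (0.50, 3_000, 2_000),
    "complex": (0.20, 5_000, 4_000),
}

def monthly_tokens(tier: str) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) per month for a tier."""
    share, tok_in, tok_out = TIERS[tier]
    n = int(INTERACTIONS_PER_MONTH * share)
    return n * tok_in, n * tok_out

for tier in TIERS:
    tok_in, tok_out = monthly_tokens(tier)
    print(f"{tier}: {(tok_in + tok_out) / 1e6:.2f}M tokens/month")
```

This yields roughly 2M, 7.5M and 5.4M tokens per month for the three tiers, the figures the cost tables below are built on.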

The question is: which model (or models) should we use, and how does this affect cost and quality?

A. Single Model Approach

The simpler path: use a single model for everything. Let's see what this means in terms of cost and quality:

| Model | Monthly Cost | Quality |
|---|---|---|
| o3 (for everything) | ~$1,680 | Excellent, but overkill for routine tasks — a sledgehammer to crack a nut |
| GPT-4o (for everything) | ~$105 | Good quality, but unnecessarily expensive for FAQ |
| GPT-4o-mini (for everything) | ~$6.30 | Good for routine tasks, weak on complex analysis |
| Gemini 2.0 Flash (for everything) | ~$2.70 | Very cheap, but tool calling reliability is uncertain |

The dilemma is clear: top-tier models are expensive, while cheap models struggle with complex tasks. What if you didn't have to choose?

B. Task-Based Routing (Optimized)

The solution: assign the most suitable model to each task type. This is the essence of task-based routing.

| Task | Model | Share | Tokens/mo | Monthly Cost |
|---|---|---|---|---|
| Simple (FAQ, reminders) | GPT-4o-mini | 30% | ~2M | ~$0.63 |
| Medium (CRM, email, tool calling) | GPT-4o-mini | 50% | ~7.5M | ~$2.63 |
| Complex (analysis, forecast) | Claude 3.7 Sonnet | 20% | ~5.4M | ~$37.80 |
| Total | | | | ~$41.06 |

The Comparison

Now let's put all approaches side by side so the difference is crystal clear:

| Approach | Monthly Cost | Quality |
|---|---|---|
| Everything with o3 | $1,680 | Maximum, but unnecessary for routine tasks |
| Everything with GPT-4o | $105 | Good, but Claude/o3 would be better for complex tasks |
| Everything with GPT-4o-mini | $6.30 | Good for routine, weak on complex tasks |
| Task-based routing | $41 | Optimal: routine tasks handled cheaply, complex tasks in excellent quality |

Routing is 60% cheaper than uniform GPT-4o, and delivers better quality on complex tasks. The $41/month routed cost is only 39% of the $105 GPT-4o bill, while complex analyses are handled by Claude 3.7 Sonnet — which, according to comparative benchmarks, has stronger reasoning capabilities. We kill two birds with one stone: we save money and get better quality.

Token Optimization Techniques

Beyond model selection, optimizing token usage can also yield dramatic savings. The following techniques can result in 50–90% cost reduction:

| Technique | Savings | Implementation |
|---|---|---|
| Prompt caching (OpenAI, Anthropic) | 50–90% of input cost | Automatic, especially effective with long system prompts |
| Batch API (non-real-time tasks) | 50% | Overnight runs: reports, summaries, batch processing |
| Context pruning (RAG token budget) | 30–50% | Cap context at ~3,000 tokens — don't send the entire knowledge base |
| Streaming (early stopping) | 10–20% | If the first 100 tokens of the response are already sufficient, stop generation |
| Summary-based context | 40–60% | Automatic summarization instead of old messages, reducing context window load |

Among these, prompt caching deserves special attention as one of the simplest to apply and highest-impact techniques. Here's how it works: when an AI agent regularly sends the same long system prompt — for example, a 2,000-token instruction set containing the company's rules, response style and descriptions of available tools — the provider (OpenAI or Anthropic) caches this prefix on the server side. For subsequent requests that start with the same prefix, the cached tokens are billed at a significant discount. OpenAI offers a 50% discount on cached input tokens, while Anthropic provides up to a 90% discount. For an AI agent handling 100 interactions per day with a long system prompt, this can mean 40–70% monthly savings on input token costs alone. Prompt caching is especially valuable in agent-based systems where the system prompt often contains detailed tool descriptions, few-shot examples and behavioral rules — all of which are cacheable.
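As a sketch of what this looks like in practice, the snippet below builds an Anthropic Messages API payload with `cache_control` set on the system block, which is how Anthropic's prompt caching is enabled; the model name and prompt text are placeholders, not values from the article:

```python
# Sketch: marking a long system prompt as cacheable via Anthropic's
# prompt caching (cache_control on a system block). Placeholder values.
def build_cached_request(system_prompt: str, user_message: str) -> dict:
    """Build a Messages API payload whose system prefix is cacheable."""
    return {
        "model": "claude-3-7-sonnet-latest",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Later requests starting with this identical prefix hit
                # the cache and are billed at the discounted cached rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_cached_request(
    "You are the company assistant. <long rule set and tool descriptions>",
    "What are your opening hours?",
)
```

The same payload would then be passed to the Anthropic SDK (e.g. `client.messages.create(**req)`): the first call writes the cache, and subsequent calls with an identical prefix read from it at the discounted rate.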


Multi-Model Architecture — The Routing Strategy

How Does Model Routing Work?

The essence of routing is simple: instead of entrusting every task to a single model, an intermediary layer — the classifier — examines the incoming question, determines its complexity, and directs it to the appropriate model.

                       User question
                           │
                    ┌──────▼──────┐
                    │  Classifier  │
                    │  (what kind  │
                    │   of task?)  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │  Simple  │ │  Medium  │ │ Complex  │
        │          │ │          │ │          │
        │ GPT-4o-  │ │ GPT-4o-  │ │ Claude   │
        │ mini     │ │ mini     │ │ 3.7      │
        │          │ │ + tools  │ │ Sonnet   │
        └──────────┘ └──────────┘ └──────────┘

The classifier's job is to quickly and cheaply determine what type of task we're dealing with. Simple, factual questions (FAQ, opening hours, price list) go to the cheapest model. Medium-complexity tasks — which require tool use, such as CRM queries or email sending — are also routed to a cost-effective model that is reliable at tool calling. Complex, multi-step analyses and recommendations reach the model with the strongest reasoning capabilities. The result: every task gets exactly the "tier" of model it needs — no more, no less.

The 3 Routing Strategies

1. Rule-Based Routing (Simplest)

The most straightforward solution: route questions based on simple rules. No AI is needed for routing — a few keywords and conditions suffice:

If the question contains: "when" / "where" / "price list" / "opening hours"
  → GPT-4o-mini (simple FAQ)

If tool calling is required (CRM, email, calendar)
  → GPT-4o-mini (good tool calling, low cost)

If analysis / recommendation / forecast is needed
  → Claude 3.7 Sonnet (best reasoning)
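These rules translate almost one-to-one into code. A minimal sketch (the keyword lists and the `needs_tools` flag are illustrative assumptions; analysis keywords are checked first so mixed questions fall to the stronger model):

```python
# Minimal rule-based router for the three tiers described above.
FAQ_KEYWORDS = ("when", "where", "price list", "opening hours")
ANALYSIS_KEYWORDS = ("analysis", "recommendation", "forecast", "compare")

def route(question: str, needs_tools: bool = False) -> str:
    q = question.lower()
    # Check analysis first: a mixed FAQ + analysis question should
    # still reach the strongest reasoning model.
    if any(k in q for k in ANALYSIS_KEYWORDS):
        return "claude-3-7-sonnet"   # best reasoning
    if needs_tools or any(k in q for k in FAQ_KEYWORDS):
        return "gpt-4o-mini"         # cheap, reliable tool calling
    return "gpt-4o-mini"             # default to the cheap tier
```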

Advantages: simple implementation, fast (zero extra latency), deterministic and predictable behavior. We know exactly which model will respond — no surprises.

Disadvantages: rigid, doesn't handle edge cases well. If a question contains both FAQ-type and analytical elements, the rules can easily misclassify it. Maintenance-heavy: every new task type requires a new rule.

2. LLM-Based Routing (More Sophisticated)

A small, fast model (such as GPT-4o-mini or Claude 3.5 Haiku) performs the classification. The classifier itself is an LLM, instructed to classify via a short system prompt:

System prompt for the classifier:
"Classify the user's question: SIMPLE, MEDIUM, COMPLEX.
 SIMPLE: factual, short answer, FAQ.
 MEDIUM: requires tool use, CRM/email/calendar.
 COMPLEX: analysis, recommendation, comparison, multi-step."

The classifier's cost is negligible: ~$0.0001 per request (a short input, a single-word output). At 3,000 monthly interactions, that's a total of ~$0.30 — practically zero.
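A sketch of such a classifier, assuming an OpenAI-style chat client (the model name is an assumption, and the API call itself is only shown, not executed here); note that parsing defaults to COMPLEX so an unrecognized answer routes to the stronger model rather than a weaker one:

```python
# Sketch of an LLM-based classifier using the short system prompt above.
CLASSIFIER_PROMPT = (
    "Classify the user's question: SIMPLE, MEDIUM, COMPLEX.\n"
    "SIMPLE: factual, short answer, FAQ.\n"
    "MEDIUM: requires tool use, CRM/email/calendar.\n"
    "COMPLEX: analysis, recommendation, comparison, multi-step.\n"
    "Answer with one word."
)

def parse_label(raw: str) -> str:
    """Normalize the one-word answer; default to COMPLEX so ambiguous
    cases get the stronger model instead of a weaker response."""
    label = raw.strip().upper()
    return label if label in {"SIMPLE", "MEDIUM", "COMPLEX"} else "COMPLEX"

def classify(question: str, client) -> str:
    # A short input and a single-word output keep this call very cheap.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": CLASSIFIER_PROMPT},
                  {"role": "user", "content": question}],
        max_tokens=3,
        temperature=0,
    )
    return parse_label(resp.choices[0].message.content)
```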

Advantages: more flexible than rule-based routing, capable of interpreting the nuances of natural language. Adaptive: if the nature of tasks changes, there's no need to modify code — just fine-tune the classifier's prompt.

Disadvantages: extra latency (~200ms for the classifier's response), and the classifier can also make mistakes. It may classify a complex question as SIMPLE, leading to a weaker response. It's worth logging classifications and regularly checking accuracy.

3. Fallback-Based Routing (Most Robust)

This strategy focuses on availability: if the primary model doesn't respond in time, it automatically switches to a backup model:

1. GPT-4o-mini (primary, best price/performance)
   ↓ if no response within 10 seconds
2. Claude 3.5 Haiku (backup)
   ↓ if no response within 10 seconds
3. Error → human escalation
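The chain above can be sketched in a few lines; `call_model` is a stand-in for the real provider call, which is assumed to raise `TimeoutError` when it doesn't respond in time:

```python
# Sketch of a fallback chain: try each model in order with a timeout,
# escalate to a human if every model fails.
def answer_with_fallback(question: str, call_model, timeout_s: float = 10.0) -> str:
    chain = ["gpt-4o-mini", "claude-3-5-haiku"]  # primary, then backup
    for model in chain:
        try:
            return call_model(model, question, timeout=timeout_s)
        except TimeoutError:
            continue  # try the next model in the chain
    raise RuntimeError("All models timed out, escalate to a human")
```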

Advantages: high availability — a single provider outage doesn't cause a system shutdown. This is especially important for business-critical applications where the AI assistant's responsiveness directly affects business processes.

Disadvantages: the backup model may respond in a different style or quality, which can cause consistency issues. If a user asks the same question twice and gets a response from the primary model once and the backup model the other time, the differing answers can erode trust. This can be partially mitigated with unified system prompts.

In practice, you don't have to commit to a single strategy — the three approaches can be combined:

  1. Rule-based routing for clear-cut cases: FAQ questions → GPT-4o-mini, analysis requests → Claude 3.7 Sonnet. These are fast, cheap and reliable.
  2. LLM classifier for ambiguous cases: if the rules can't definitively categorize a question, a fast classifier model decides on the routing.
  3. Fallback mechanism for handling provider outages: if the primary model doesn't respond, automatic switch to the backup — and as a last resort, human escalation.

This hybrid approach provides the best price-to-performance ratio for most enterprise use cases. The practical implementation is surprisingly simple: in most frameworks (LangChain, Semantic Kernel, custom Node.js/Python solutions), implementing a router takes 50–100 lines of code. Rule-based routing is a simple switch/case or if-else chain, the LLM classifier is an extra API call with the incoming question, and the fallback is a try/catch with a timeout. The ROI is immediate — the routing logic's execution cost is practically zero compared to the model usage savings, and it saves a significant amount from the very first month.

The final part of this series examines security, data residency and local models, and provides a summary decision matrix. Read on: Security, local models and decision matrix. Or see the full whitepaper: Business Decisions of LLM Model Selection.