Intelligent LLM Routing: Spending Compute Where It Matters

How route-switch uses MIPROv2 to automatically select the right model for each query — balancing cost, quality, and latency.

Here is a number worth thinking about. If you send every query in a typical application to GPT-4-class models, your inference costs are roughly 30x higher than if you send the same queries to GPT-3.5-class models. The quality difference is real, but it is not uniformly distributed. For some queries — simple classification, format conversion, straightforward extraction — the smaller model produces equivalent results. For others — complex reasoning, nuanced generation, ambiguous instructions — the larger model is genuinely necessary.

The research question behind route-switch is: can we automatically determine which queries need the expensive model and which do not?

The Cost-Quality Frontier

Every LLM provider offers a spectrum of models at different price points. The relationship between cost and quality is not linear — it follows a frontier curve:

Quality
  ^
  |          * GPT-4 / Claude Opus
  |        *
  |      *   Claude Sonnet / GPT-4o
  |    *
  |  *       GPT-4o-mini / Claude Haiku
  | *
  |*         Small open models
  +---------------------------------> Cost

Each point on this frontier represents a trade-off. Moving right (more expensive) gets you better quality, but with diminishing returns. A model costing 10x more is not 10x better — it might be 20% better on hard tasks and indistinguishable on easy ones.

The optimal strategy is not to pick one point on this curve. It is to pick different points for different queries. Easy queries go to the bottom-left (cheap, fast, good enough). Hard queries go to the top-right (expensive, slower, necessary). The system-level cost is a weighted average, and the system-level quality is determined by whether each query reaches an adequate model.
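The weighted-average arithmetic can be made concrete with a small sketch. The tier costs and the traffic split below are hypothetical, normalised so the large model costs 1.0 per query:

```go
package main

import "fmt"

// systemCost returns the weighted-average per-query cost of a routing
// mix: mix[i] is the fraction of traffic sent to tier i, and cost[i]
// is that tier's relative per-query cost.
func systemCost(mix, cost []float64) float64 {
	total := 0.0
	for i := range mix {
		total += mix[i] * cost[i]
	}
	return total
}

func main() {
	// Hypothetical tiers, normalised so the large model costs 1.0,
	// and an illustrative traffic split of 50/30/20.
	cost := []float64{0.03, 0.20, 1.00} // small, medium, large
	mix := []float64{0.50, 0.30, 0.20}
	fmt.Printf("system cost vs all-large: %.3f\n", systemCost(mix, cost))
	// 0.5*0.03 + 0.3*0.20 + 0.2*1.00 = 0.275
}
```

Under this (made-up) split, the system pays roughly a quarter of the all-large baseline while only a fifth of queries ever see the cheap tier's quality ceiling.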

This is the routing problem.

Why Static Rules Fail

The first approach most teams try is a rule-based router: if the query contains certain keywords, use the big model; if it is below a certain token count, use the small model; if it involves code, use the specialised model.

This works for a while and then breaks in predictable ways.
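A minimal sketch of such a router, with made-up keywords and a made-up token threshold, makes the failure modes easy to see:

```go
package main

import (
	"fmt"
	"strings"
)

// staticRoute is a sketch of the kind of rule-based router described
// above. The keywords and the word-count threshold are hypothetical.
func staticRoute(query string) string {
	if len(strings.Fields(query)) < 8 {
		return "small" // short queries assumed easy
	}
	for _, kw := range []string{"analyse", "compare", "why"} {
		if strings.Contains(strings.ToLower(query), kw) {
			return "large" // keyword heuristic for "hard" queries
		}
	}
	if strings.Contains(query, "```") {
		return "code-specialised"
	}
	return "medium" // everything else falls through
}

func main() {
	fmt.Println(staticRoute("hi")) // small
	// A genuinely hard query phrased without any trigger keyword
	// falls through to the medium tier and gets misrouted:
	fmt.Println(staticRoute("walk me through the trade-offs between these two database designs"))
}
```

The second query needs open-ended reasoning, but because it avoids every keyword in the list, the rules never notice.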

Coverage gaps. Rules encode the author’s intuition about what makes a query “hard.” That intuition is always incomplete. Novel query patterns fall through the rules and get misrouted.

Maintenance burden. As the application evolves and the query distribution shifts, the rules need constant updating. Each new edge case adds a condition. The rule set grows until nobody can reason about it holistically.

No quality signal. Rules are typically written based on input features alone. They do not incorporate feedback about whether the routing decision was actually correct — whether the chosen model produced an adequate result.

We needed an approach that learns from data, adapts to distribution shifts, and optimises for an explicit quality-cost objective.

Enter MIPROv2

MIPROv2 (Multiprompt Instruction PRoposal Optimizer, version 2) is a framework for automatic prompt optimisation developed in the DSPy ecosystem. Its original purpose is to tune prompts and few-shot examples to maximise a given metric. We adapted it for a different purpose: tuning the routing decision itself.

The core insight is that model selection can be framed as a prompt optimisation problem. The “prompt” is the routing instruction — the criteria that determine which model handles a given query. MIPROv2 can optimise this instruction against a quality metric while respecting a cost constraint.

Here is how it works in route-switch:

┌─────────────────────────────────────────────┐
│              Incoming Query                 │
├─────────────────────────────────────────────┤
│           Router (lightweight LLM)          │
│  "Given this query, which model tier        │
│   should handle it? Consider complexity,    │
│   required reasoning depth, and task type." │
├──────────┬──────────────┬───────────────────┤
│  Tier 1  │    Tier 2    │      Tier 3       │
│  Small   │   Medium     │      Large        │
│  (fast,  │  (balanced)  │   (expensive,     │
│  cheap)  │              │   high quality)   │
└──────────┴──────────────┴───────────────────┘

The router itself is a small, fast model (typically GPT-4o-mini or a local 7B model). Its job is to classify the incoming query into a difficulty tier. The routing prompt — the instruction that tells the router how to classify — is optimised by MIPROv2.
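The router step can be sketched as a call to a lightweight completion model followed by defensive parsing. The `Completer` interface, the prompt wording, and the tier labels below are illustrative stand-ins, not route-switch's actual API or its optimised prompt:

```go
package main

import (
	"fmt"
	"strings"
)

// Completer abstracts the lightweight router model; in production this
// would wrap an API client for a small, fast model.
type Completer interface {
	Complete(prompt string) (string, error)
}

// routePrompt is the instruction that MIPROv2 would optimise. This
// wording is a placeholder.
const routePrompt = `Given this query, which model tier should handle it?
Consider complexity, required reasoning depth, and task type.
Answer with exactly one of: TIER1, TIER2, TIER3.

Query: %s`

func route(c Completer, query string) (string, error) {
	out, err := c.Complete(fmt.Sprintf(routePrompt, query))
	if err != nil {
		return "", err
	}
	// Parse defensively: on an unrecognised answer, fall back to the
	// large tier rather than risk under-serving the query.
	switch {
	case strings.Contains(out, "TIER1"):
		return "TIER1", nil
	case strings.Contains(out, "TIER2"):
		return "TIER2", nil
	default:
		return "TIER3", nil
	}
}

// stubCompleter stands in for the real model in this sketch.
type stubCompleter struct{ reply string }

func (s stubCompleter) Complete(string) (string, error) { return s.reply, nil }

func main() {
	tier, _ := route(stubCompleter{"TIER1"}, "rewrite this as a bullet list")
	fmt.Println(tier) // TIER1
}
```

Falling back to the most capable tier on a malformed router answer trades a little cost for safety: a mis-parse degrades the bill, not the response.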

The Optimisation Loop

MIPROv2 optimises the routing prompt through an iterative process:

Step 1: Seed evaluation set. Collect a representative sample of queries, along with gold-standard responses (or quality judgments from a strong model).

Step 2: Initial routing. Run the evaluation set through the current routing prompt. Each query gets assigned to a tier.

Step 3: Quality measurement. For each query, compare the output from its assigned tier against the gold standard. Compute a quality score.

Step 4: Cost measurement. Sum the inference costs across all queries, based on which tier handled each one.

Step 5: Objective computation. The objective function is:

objective = quality_score - lambda * cost

where lambda is a tuneable parameter that controls the cost-quality trade-off. Higher lambda means more aggressive cost reduction; lower lambda means more conservative routing (send more queries to expensive models).
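A worked example, with made-up quality and cost numbers, shows how lambda arbitrates between two candidate routing prompts:

```go
package main

import "fmt"

// objective implements the trade-off above: quality minus a
// lambda-weighted cost penalty. Higher lambda favours cheaper routing.
func objective(quality, cost, lambda float64) float64 {
	return quality - lambda*cost
}

func main() {
	// Illustrative: two candidate prompts evaluated on the same set.
	// B saves cost at a small quality hit; at lambda=0.5 it wins.
	a := objective(0.96, 0.52, 0.5) // 0.96 - 0.26  = 0.700
	b := objective(0.93, 0.35, 0.5) // 0.93 - 0.175 = 0.755
	fmt.Printf("A=%.3f B=%.3f\n", a, b)
}
```

At a smaller lambda (say 0.1), the same arithmetic flips in favour of candidate A, which is exactly the conservative-routing behaviour described above.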

Step 6: Prompt proposal. MIPROv2 generates candidate modifications to the routing prompt — different phrasings, different classification criteria, different examples — and evaluates each against the objective.

Step 7: Selection. The best-performing routing prompt becomes the new active prompt, and the loop continues.

After convergence, the routing prompt encodes a learned decision boundary that balances cost and quality for the specific query distribution of your application.
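Steps 2 through 7 reduce to a propose-evaluate-select skeleton. In the sketch below, the evaluation and proposal steps are injected functions (in route-switch they would run the routed queries and call MIPROv2's proposer, respectively); the toy stand-ins in `main` exist only to keep the skeleton runnable:

```go
package main

import "fmt"

// Candidate pairs a routing prompt with its measured objective.
type Candidate struct {
	Prompt string
	Score  float64
}

// optimise runs the propose-evaluate-select loop for a fixed number of
// rounds, keeping the best-scoring routing prompt seen so far.
//   - eval scores a routing prompt against the evaluation set
//     (quality minus lambda-weighted cost).
//   - prop generates candidate rewrites of the current prompt.
func optimise(seed string, eval func(string) float64,
	prop func(string) []string, rounds int) Candidate {
	best := Candidate{seed, eval(seed)}
	for r := 0; r < rounds; r++ {
		for _, p := range prop(best.Prompt) {
			if s := eval(p); s > best.Score {
				best = Candidate{p, s}
			}
		}
	}
	return best
}

func main() {
	// Toy stand-ins: score prompts by length, propose by appending.
	eval := func(p string) float64 { return float64(len(p)) }
	prop := func(p string) []string { return []string{p + "!", p + "?"} }
	fmt.Println(optimise("route", eval, prop, 3).Prompt) // route!!!
}
```

The real loop terminates on convergence rather than a fixed round count, but the hill-climbing shape is the same: each round's proposals are generated from the best prompt found so far.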

What the Router Learns

The optimised routing prompts are interpretable, which is one advantage of this approach over a neural classifier. Here is an example of a routing instruction that MIPROv2 produced for a customer support application:

Route to Tier 1 (small model) if the query is:
- A greeting or closing statement
- A request for factual information that can be looked up
- A simple format conversion (e.g., "rewrite this as a bullet list")
- A classification task with clear categories

Route to Tier 2 (medium model) if the query is:
- A multi-step request with clear instructions
- A summarisation task on structured input
- A comparison between well-defined options

Route to Tier 3 (large model) if the query is:
- An open-ended analysis or recommendation
- A query requiring reasoning about ambiguous or contradictory information
- A creative task with subjective quality criteria
- A query where factual accuracy is critical and the domain is specialised

This is not a hand-written rule set — it was generated by the optimisation process. But because it is expressed in natural language, it can be reviewed, modified, and version-controlled like any other prompt. If the system makes a routing mistake, you can read the routing criteria and understand why.

Empirical Results

We tested route-switch on three workloads, each with a different query distribution. The results are summarised below. Quality is measured as the percentage of queries where the routed response is rated equivalent to the all-Tier-3 baseline by an LLM judge.

Workload         | All Tier 3  | Static Rules | route-switch
                 | Cost  Qual  | Cost  Qual   | Cost   Qual
-----------------+-------------+--------------+-------------
Customer support | $1.00  100% | $0.42  89%   | $0.35   96%
Code generation  | $1.00  100% | $0.55  82%   | $0.48   94%
Research Q&A     | $1.00  100% | $0.61  78%   | $0.52   93%

(Costs are normalised relative to the all-Tier-3 baseline.)

Two patterns stand out. First, route-switch consistently achieves higher quality than static rules at similar or lower cost. The learned routing criteria are more nuanced than hand-written rules. Second, the savings vary by workload. Customer support has a large fraction of easy queries (greetings, lookups) that can safely go to Tier 1. Research Q&A has fewer easy queries, so the savings are smaller.

Limitations and Open Questions

Router overhead. The router itself consumes tokens and adds latency. For very short queries, the routing cost can exceed the savings from using a cheaper model. We mitigate this by bypassing the router for queries below a token threshold and sending them directly to Tier 1.
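The bypass is a one-line guard in front of the router. The 12-word threshold and the tier label below are illustrative; the real threshold should be tuned against measured routing overhead:

```go
package main

import (
	"fmt"
	"strings"
)

// routeWithBypass skips the router entirely for very short queries,
// where routing overhead would exceed any savings from a cheaper
// model. The threshold here is a hypothetical placeholder.
func routeWithBypass(query string, route func(string) string) string {
	if len(strings.Fields(query)) < 12 { // crude whitespace token count
		return "TIER1"
	}
	return route(query)
}

func main() {
	router := func(q string) string { return "TIER3" } // stand-in router
	fmt.Println(routeWithBypass("thanks!", router))                    // TIER1 (bypassed)
	fmt.Println(routeWithBypass(strings.Repeat("word ", 20), router)) // TIER3 (routed)
}
```

A proper implementation would use the model's tokenizer rather than whitespace splitting, but the structure of the guard is the same.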

Distribution shift. The optimised routing prompt is tuned for a specific query distribution. When that distribution shifts — new features, seasonal patterns, different user populations — the routing accuracy degrades. We re-run MIPROv2 optimisation periodically (weekly in our current setup) to track distribution changes.

Quality measurement is hard. The optimisation loop requires a quality signal. Using an LLM-as-judge introduces its own biases and costs. Using human evaluation is expensive and slow. The quality of the routing optimisation is bounded by the quality of the quality measurement, which is a somewhat uncomfortable recursion.

Cold start. MIPROv2 needs an evaluation set to optimise against. For new applications without historical queries, the initial routing prompt must be hand-written and is unlikely to be optimal. The system improves as data accumulates, but the cold start period may last days or weeks depending on query volume.

Where This Goes

Routing is a necessary component of any system that uses multiple LLMs. The question is whether it is done implicitly (developers hardcoding model choices per endpoint), explicitly but statically (rule-based routers), or adaptively (learned routers like route-switch).

We think the adaptive approach is the right one for most non-trivial applications. The cost savings are significant, the quality trade-off is manageable, and the maintenance burden is lower than hand-tuned rules. The main barrier to adoption is the evaluation infrastructure — you need a way to measure quality to optimise for it. That investment pays for itself, but it is a real upfront cost.

route-switch is written in Go and is available on GitHub. We welcome contributions, particularly around alternative optimisation strategies and evaluation methodologies.