Your agent uses GPT-4o for everything — weather lookups, SQL generation, "what time is it in Tokyo?", and a forty-turn debugging session. Capable models excel at hard tasks and burn budget on trivial ones. Cheaper models save money until a complex turn fails and the user retries twice.
Hardcoding one model is the same mistake as sending all 50 tools every turn: a static default that ignores what this request actually needs.
The one-model problem
Agent workloads are bimodal. Most turns are light: one tool call, short context, straightforward intent. A minority are heavy: long history, many tools, multi-step reasoning, code generation.
| Turn type | Example | Model needed |
|---|---|---|
| Lookup | "Weather in Oslo" | Fast / cheap tier |
| Tool loop | 3-call invoice workflow | Mid tier + good routing |
| Deep reasoning | "Refactor this module step by step" | Capable tier |
| Long session | Turn 40+ with compression | Capable + context headroom |
Payload optimization (tool routing, compression) reduces tokens — see Introducing Orqen. Model routing reduces price per token on the turns that can tolerate it. The two stack multiplicatively.
orqen/auto and siblings
Set model="orqen/auto" (or Bedrock/Anthropic equivalent) and Orqen substitutes a real provider model before forwarding upstream. Your SDK code, response parsing, and tool formats stay unchanged.
# Drop-in model strings; Orqen resolves them before forwarding upstream:
#
#   orqen/auto      Match capability to task complexity (default)
#   orqen/cheap     Cheapest model meeting minimum requirements
#   orqen/fast      Lowest-latency model meeting requirements
#   orqen/capable   Most capable model in your connected pool
#
# Your agent code keeps model="orqen/auto"; the response shape is unchanged.
Orqen picks from models your customer account has connected: Anthropic, OpenAI, Bedrock, Groq, etc. It does not route to providers you have not configured.
Task complexity classification
Orqen scores each request's complexity using fast heuristics — no ML, under 1ms, off the critical path:
# Orqen classifies each request by complexity (1–5):
#
# Signals include context size, tool count, tool-result depth,
# and keywords that suggest hard reasoning or code work.
#
# Complexity maps to capability tiers in your connected pool:
# simple lookups resolve cheaper; agent loops resolve more capable.
- Context size: character count and message count across the full history.
- Tool surface: tools in the request and tool-result messages from prior turns (agentic loop depth).
- Keyword signals: "implement", "debug", "analyze", "proof" bump complexity; simple creative tasks bump moderately.
Complexity maps to capability tiers in your connected pool. A complexity-1 weather query on orqen/auto might resolve to Haiku or GPT-4o-mini; a complexity-4 agent loop might resolve to Sonnet or GPT-4o.
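To make the signals above concrete, here is a minimal sketch of a heuristic scorer. It is not Orqen's actual classifier: the function name `score_complexity`, the thresholds, and the one-point-per-signal weighting are all assumptions chosen for illustration. Only the signal categories and the 1–5 range come from the description above.

```python
from typing import Any

# Keywords named in the article; the matching rule itself is an assumption.
HARD_KEYWORDS = ("implement", "debug", "analyze", "proof")


def score_complexity(messages: list[dict[str, Any]], tools: list[dict[str, Any]]) -> int:
    """Return a complexity score from 1 (trivial) to 5 (heavy).

    All thresholds below are hypothetical placeholders.
    """
    texts = [m["content"] for m in messages if isinstance(m.get("content"), str)]
    total_chars = sum(len(t) for t in texts)
    tool_results = sum(1 for m in messages if m.get("role") == "tool")
    last_user = next(
        (
            m["content"].lower()
            for m in reversed(messages)
            if m.get("role") == "user" and isinstance(m.get("content"), str)
        ),
        "",
    )

    score = 1
    # Context size: character count and message count across the history.
    if total_chars > 8_000 or len(messages) > 20:
        score += 1
    # Tool surface: number of tools offered in the request.
    if len(tools) > 10:
        score += 1
    # Agentic loop depth: tool-result messages from prior turns.
    if tool_results >= 2:
        score += 1
    # Keyword signals in the latest user message.
    if any(keyword in last_user for keyword in HARD_KEYWORDS):
        score += 1
    return min(score, 5)
```

A scorer like this is a handful of length checks and substring tests, which is why a sub-millisecond budget is plausible without any model inference.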
cheap, fast, capable modes
Override the default tradeoff with explicit modes:
- orqen/cheap — minimize cost; acceptable for batch jobs and low-stakes lookups.
- orqen/fast — minimize latency; good for interactive UI where TTFT matters.
- orqen/capable — maximize quality; use when you would otherwise hardcode GPT-4o or Opus for everything.
- orqen/auto — balance per turn based on complexity score.
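The four modes can be read as four different selection rules over the same connected pool. The sketch below illustrates that reading and is not Orqen's implementation: `PoolModel`, `resolve`, the `floor` parameter (standing in for "minimum requirements"), and every number in `EXAMPLE_POOL` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolModel:
    name: str
    cost_per_mtok: float   # hypothetical price, USD per million tokens
    p50_latency_ms: float  # hypothetical median latency
    capability: int        # hypothetical tier, 1 (basic) to 5 (strongest)


def resolve(alias: str, pool: list[PoolModel], complexity: int = 1, floor: int = 1) -> PoolModel:
    """Pick a model from the connected pool for an orqen/* alias."""
    if not pool:
        raise ValueError("no providers connected")
    strongest = max(pool, key=lambda m: m.capability)

    def meeting(level: int) -> list[PoolModel]:
        # Models at or above the required tier; fall back to the strongest.
        return [m for m in pool if m.capability >= level] or [strongest]

    if alias == "orqen/cheap":
        return min(meeting(floor), key=lambda m: m.cost_per_mtok)
    if alias == "orqen/fast":
        return min(meeting(floor), key=lambda m: m.p50_latency_ms)
    if alias == "orqen/capable":
        return strongest
    if alias == "orqen/auto":
        # Cheapest model whose tier covers this turn's complexity score.
        return min(meeting(complexity), key=lambda m: m.cost_per_mtok)
    raise ValueError(f"unknown alias: {alias}")


# Illustrative pool with made-up names and figures.
EXAMPLE_POOL = [
    PoolModel("small", cost_per_mtok=0.5, p50_latency_ms=300, capability=2),
    PoolModel("mid", cost_per_mtok=3.0, p50_latency_ms=250, capability=4),
    PoolModel("large", cost_per_mtok=15.0, p50_latency_ms=900, capability=5),
]
```

With this pool, `resolve("orqen/auto", EXAMPLE_POOL, complexity=1)` picks `small` and the same call with `complexity=4` picks `mid`, while `orqen/capable` always picks `large` regardless of the score.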
Model routing ≠ tool routing. orqen/auto picks which LLM processes the request. Orqen still prunes tools and compresses payload independently — a cheap model with 4 relevant tools beats a capable model with 50.
Performance feedback over time
Orqen records per-model rolling averages on your account:
- Latency (p50 / p95 trends)
- Success rate (HTTP outcomes)
- recall@K when tools are pruned
- Tool call count per request
Over time, the router favors models with proven performance on your workloads — not just static price sheets. If a cheap tier underperforms on a specific workflow, routing adjusts.
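One way to picture this feedback loop is a fixed-size window of recent outcomes per model, with the router skipping any model whose recent success rate drops below a bar. The sketch below assumes exactly that. The names `RollingStats` and `pick_model`, the window size, and the 0.9 threshold are hypothetical, and Orqen's actual weighting is not described in this post.

```python
import math
from collections import deque


class RollingStats:
    """Per-model rolling window of latency and success outcomes."""

    def __init__(self, window: int = 200) -> None:
        # deque(maxlen=...) drops the oldest sample automatically.
        self.latencies_ms = deque(maxlen=window)
        self.outcomes = deque(maxlen=window)

    def record(self, latency_ms: float, ok: bool) -> None:
        self.latencies_ms.append(latency_ms)
        self.outcomes.append(ok)

    def success_rate(self) -> float:
        # Assumption: a model with no data is treated optimistically.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def latency_percentile(self, pct: float) -> float:
        """Nearest-rank percentile over the current window."""
        if not self.latencies_ms:
            raise ValueError("no latency samples recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]


def pick_model(by_price: list[str], stats: dict[str, RollingStats], min_success: float = 0.9) -> str:
    """Return the cheapest model whose rolling success rate clears the bar."""
    for name in by_price:
        if stats[name].success_rate() >= min_success:
            return name
    # Nothing clears the bar: fall back to the best observed success rate.
    return max(by_price, key=lambda n: stats[n].success_rate())
```

Because the window is bounded, a model that was skipped can win back traffic once enough of its recent requests succeed.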
Model stats appear in dashboard analytics alongside payload savings and optimization traces.
Stack with payload optimization
Maximum savings on a production agent:
- model="orqen/auto": right-size the model per turn.
- Route through Orqen: prune tools, compress results, tier history.
- Enable provider caching: stable prefixes discount repeated context (context caching).
Typical tool-heavy agents see 50–70% fewer prompt tokens from optimization alone. Model routing then exploits the 3–10× price spread between tiers: every turn that resolves to a cheaper model pays that much less per token.
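A quick back-of-the-envelope calculation shows how the two effects multiply. The figures below are illustrative only: a 60% token reduction (the midpoint of the range above), a 5× price ratio (inside the 3–10× spread), and an assumed 50% of turns resolving to the cheaper tier. The helper `relative_cost` is hypothetical and models prompt-token spend only.

```python
def relative_cost(token_reduction: float, cheap_share: float, price_ratio: float) -> float:
    """Prompt spend relative to an unoptimized, single-model baseline.

    token_reduction: fraction of prompt tokens removed by payload optimization.
    cheap_share:     fraction of turns routed to the cheaper tier (assumed).
    price_ratio:     how many times cheaper that tier is per token.
    """
    token_factor = 1.0 - token_reduction
    # Blended price per token: full price on some turns, discounted on the rest.
    price_factor = (1.0 - cheap_share) + cheap_share / price_ratio
    return token_factor * price_factor


# 0.4 (tokens) x 0.6 (blended price) = 0.24 of baseline spend.
print(round(relative_cost(0.6, 0.5, 5.0), 2))
```

Under these assumptions, spend falls to about a quarter of the baseline, more than either lever achieves alone (0.4 from tokens, 0.6 from routing).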
Switch one line
# OpenAI SDK
from openai import OpenAI

client = OpenAI(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app/v1",
)
# Same agent code: Orqen picks the upstream model per request.
# `messages` and `tools` are whatever your agent already builds.
response = client.chat.completions.create(
    model="orqen/auto",  # or orqen/cheap, orqen/fast, orqen/capable
    messages=messages,
    tools=tools,
)

# Anthropic SDK
import anthropic

client = anthropic.Anthropic(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app",
)
response = client.messages.create(
    model="orqen/auto",
    max_tokens=4096,
    messages=messages,
    tools=tools,
)

- Connect multiple providers in the dashboard (e.g. Haiku + Sonnet, or GPT-4o-mini + GPT-4o).
- Replace the hardcoded model with orqen/auto.
- Run mixed workloads: lookups and hard tasks in one session.
- Check Usage for the resolved model per request and the cost spread.
Next step: Sign up free · Multi-provider docs · Introducing Orqen