Stop Hardcoding GPT-4o: Task-Aware Model Routing

Your agent uses GPT-4o for everything — weather lookups, SQL generation, "what time is it in Tokyo?", and a forty-turn debugging session. Capable models excel at hard tasks and burn budget on trivial ones. Cheaper models save money until a complex turn fails and the user retries twice.

Hardcoding one model is the same mistake as sending all 50 tools every turn: a static default that ignores what this request actually needs.

The one-model problem

Agent workloads are bimodal. Most turns are moderate: one tool call, short context, straightforward intent. A minority are heavy: long history, many tools, multi-step reasoning, code generation.

Turn type	Example	Model needed
Lookup	"Weather in Oslo"	Fast / cheap tier
Tool loop	3-call invoice workflow	Mid tier + good routing
Deep reasoning	"Refactor this module step by step"	Capable tier
Long session	Turn 40+ with compression	Capable + context headroom

Payload optimization (tool routing, compression) reduces tokens — see Introducing Orqen. Model routing reduces price per token on the turns that can tolerate it. The two stack multiplicatively.

orqen/auto and siblings

Set model="orqen/auto" (or Bedrock/Anthropic equivalent) and Orqen substitutes a real provider model before forwarding upstream. Your SDK code, response parsing, and tool formats stay unchanged.

# Drop-in model strings — Orqen resolves before forwarding upstream:
#
#   orqen/auto      Match capability to task complexity (default)
#   orqen/cheap     Cheapest model meeting minimum requirements
#   orqen/fast      Lowest-latency model meeting requirements
#   orqen/capable   Most capable model in your connected pool
#
# Your agent code keeps model="orqen/auto" — the response shape is unchanged.

Orqen picks from models your customer account has connected — Anthropic, OpenAI, Bedrock, Groq, etc. It does not route to providers you have not configured.

Task complexity classification

Orqen scores each request's complexity using fast heuristics — no ML, under 1ms, off the critical path:

# Orqen classifies each request by complexity (1–5):
#
# Signals include context size, tool count, tool-result depth,
# and keywords that suggest hard reasoning or code work.
#
# Complexity maps to capability tiers in your connected pool —
# simple lookups resolve cheaper; agent loops resolve more capable.

Context size — character count and message count across the full history.
Tool surface — tools in the request and tool-result messages from prior turns (agentic loop depth).
Keyword signals — "implement", "debug", "analyze", "proof" bump complexity; simple creative tasks bump moderately.

Complexity maps to capability tiers in your connected pool. A complexity-1 weather query on orqen/auto might resolve to Haiku or GPT-4o-mini; complexity-4 agent loop might resolve to Sonnet or GPT-4o.

cheap, fast, capable modes

Override the default tradeoff with explicit modes:

orqen/cheap — minimize cost; acceptable for batch jobs and low-stakes lookups.
orqen/fast — minimize latency; good for interactive UI where TTFT matters.
orqen/capable — maximize quality; use when you would otherwise hardcode GPT-4o or Opus for everything.
orqen/auto — balance per turn based on complexity score.

Model routing ≠ tool routing. orqen/auto picks which LLM processes the request. Orqen still prunes tools and compresses payload independently — a cheap model with 4 relevant tools beats a capable model with 50.

Performance feedback over time

Orqen records per-model rolling averages on your account:

Latency (p50 / p95 trends)
Success rate (HTTP outcomes)
recall@K when tools are pruned
Tool call count per request

Over time, the router favors models with proven performance on your workloads — not just static price sheets. If a cheap tier underperforms on a specific workflow, routing adjusts.

Model stats appear in dashboard analytics alongside payload savings and optimization traces.

Stack with payload optimization

Maximum savings on a production agent:

model="orqen/auto" — right-size model per turn.
Route through Orqen — prune tools, compress results, tier history.
Enable provider caching — stable prefixes discount repeated context (context caching).

Typical tool-heavy agents see 50–70% fewer prompt tokens from optimization alone. Model routing adds another 3–10× cost spread between tiers on the turns that resolve to cheaper models.

Switch one line

from openai import OpenAI

client = OpenAI(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app/v1",
)

# Same agent code — Orqen picks the upstream model per request
response = client.chat.completions.create(
    model="orqen/auto",  # or orqen/cheap, orqen/fast, orqen/capable
    messages=messages,
    tools=tools,
)

client = anthropic.Anthropic(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app",
)
response = client.messages.create(
    model="orqen/auto",
    max_tokens=4096,
    messages=messages,
    tools=tools,
)

Connect multiple providers in the dashboard (e.g. Haiku + Sonnet, or GPT-4o-mini + GPT-4o).
Replace hardcoded model with orqen/auto.
Run mixed workloads — lookups and hard tasks in one session.
Check Usage for resolved model per request and cost spread.

Next step: Sign up free · Multi-provider docs · Introducing Orqen

Stop Hardcoding GPT-4o: Task-Aware Model Routing

The one-model problem

orqen/auto and siblings

Task complexity classification

cheap, fast, capable modes

Performance feedback over time

Stack with payload optimization

Switch one line

MCP Gave Your Agent 50 Tools — Now What?

See Orqen optimize your agent payloads