Skip to content
All posts
Guide//4 MIN READ

Stop Hardcoding GPT-4o: Task-Aware Model Routing

One expensive model for every turn wastes money on lookups and underpowers hard tasks. Use orqen/auto and siblings to match model capability to each request.

O

Orqen Team

orqen.app

Your agent uses GPT-4o for everything — weather lookups, SQL generation, "what time is it in Tokyo?", and a forty-turn debugging session. Capable models excel at hard tasks and burn budget on trivial ones. Cheaper models save money until a complex turn fails and the user retries twice.

Hardcoding one model is the same mistake as sending all 50 tools every turn: a static default that ignores what this request actually needs.

The one-model problem

Agent workloads are bimodal. Most turns are moderate: one tool call, short context, straightforward intent. A minority are heavy: long history, many tools, multi-step reasoning, code generation.

Turn typeExampleModel needed
Lookup"Weather in Oslo"Fast / cheap tier
Tool loop3-call invoice workflowMid tier + good routing
Deep reasoning"Refactor this module step by step"Capable tier
Long sessionTurn 40+ with compressionCapable + context headroom

Payload optimization (tool routing, compression) reduces tokens — see Introducing Orqen. Model routing reduces price per token on the turns that can tolerate it. The two stack multiplicatively.

orqen/auto and siblings

Set model="orqen/auto" (or Bedrock/Anthropic equivalent) and Orqen substitutes a real provider model before forwarding upstream. Your SDK code, response parsing, and tool formats stay unchanged.

# Drop-in model strings — Orqen resolves before forwarding upstream:
#
#   orqen/auto      Match capability to task complexity (default)
#   orqen/cheap     Cheapest model meeting minimum requirements
#   orqen/fast      Lowest-latency model meeting requirements
#   orqen/capable   Most capable model in your connected pool
#
# Your agent code keeps model="orqen/auto" — the response shape is unchanged.

Orqen picks from models your customer account has connected — Anthropic, OpenAI, Bedrock, Groq, etc. It does not route to providers you have not configured.

Task complexity classification

Orqen scores each request's complexity using fast heuristics — no ML, under 1ms, off the critical path:

# Orqen classifies each request by complexity (1–5):
#
# Signals include context size, tool count, tool-result depth,
# and keywords that suggest hard reasoning or code work.
#
# Complexity maps to capability tiers in your connected pool —
# simple lookups resolve cheaper; agent loops resolve more capable.
  • Context size — character count and message count across the full history.
  • Tool surface — tools in the request and tool-result messages from prior turns (agentic loop depth).
  • Keyword signals — "implement", "debug", "analyze", "proof" bump complexity; simple creative tasks bump moderately.

Complexity maps to capability tiers in your connected pool. A complexity-1 weather query on orqen/auto might resolve to Haiku or GPT-4o-mini; complexity-4 agent loop might resolve to Sonnet or GPT-4o.

cheap, fast, capable modes

Override the default tradeoff with explicit modes:

  • orqen/cheap — minimize cost; acceptable for batch jobs and low-stakes lookups.
  • orqen/fast — minimize latency; good for interactive UI where TTFT matters.
  • orqen/capable — maximize quality; use when you would otherwise hardcode GPT-4o or Opus for everything.
  • orqen/auto — balance per turn based on complexity score.

Model routing ≠ tool routing. orqen/auto picks which LLM processes the request. Orqen still prunes tools and compresses payload independently — a cheap model with 4 relevant tools beats a capable model with 50.

Performance feedback over time

Orqen records per-model rolling averages on your account:

  • Latency (p50 / p95 trends)
  • Success rate (HTTP outcomes)
  • recall@K when tools are pruned
  • Tool call count per request

Over time, the router favors models with proven performance on your workloads — not just static price sheets. If a cheap tier underperforms on a specific workflow, routing adjusts.

Model stats appear in dashboard analytics alongside payload savings and optimization traces.

Stack with payload optimization

Maximum savings on a production agent:

  1. model="orqen/auto" — right-size model per turn.
  2. Route through Orqen — prune tools, compress results, tier history.
  3. Enable provider caching — stable prefixes discount repeated context (context caching).

Typical tool-heavy agents see 50–70% fewer prompt tokens from optimization alone. Model routing adds another 3–10× cost spread between tiers on the turns that resolve to cheaper models.

Switch one line

from openai import OpenAI

client = OpenAI(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app/v1",
)

# Same agent code — Orqen picks the upstream model per request
response = client.chat.completions.create(
    model="orqen/auto",  # or orqen/cheap, orqen/fast, orqen/capable
    messages=messages,
    tools=tools,
)
client = anthropic.Anthropic(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app",
)
response = client.messages.create(
    model="orqen/auto",
    max_tokens=4096,
    messages=messages,
    tools=tools,
)
  1. Connect multiple providers in the dashboard (e.g. Haiku + Sonnet, or GPT-4o-mini + GPT-4o).
  2. Replace hardcoded model with orqen/auto.
  3. Run mixed workloads — lookups and hard tasks in one session.
  4. Check Usage for resolved model per request and cost spread.
Tagged:model-routingorqen-autollm-costagent-optimizationmulti-model
O

Orqen Team

We build the optimization layer for tool-heavy LLM agents. Our goal is to make agent costs predictable as your tool set grows.

Try Orqen free

250K saved tokens per month. Free forever. Two-line integration.

See your savings in the dashboard within seconds of your first request.