You built a useful agent. You gave it tools, memory, API calls, image inputs, database lookups, calendar access, and email. It works. Then you check your LLM bill.
Every single request can end up carrying too much: every tool schema, old conversation turns, bulky tool results, stale image context, and repeated prompt text. Tool schemas are often the first visible cost. With thirty tools at 150–400 tokens each, that's 4,500–12,000 tokens before the model even looks at history or tool output.
For an agent handling 10,000 requests a month on GPT-4o, unused tool schemas alone can be $1.12–$3.00 in wasted token spend per day. Long history and noisy tool results add their own cost on top.
The problem with growing agent payloads
The standard pattern for building agents is to define tools upfront, keep history around, pass tool results back in, and let the SDK send the whole request shape. The OpenAI, Anthropic, and Bedrock SDKs all make this easy, and it works fine at small scale.
The hidden cost emerges as agents grow. By the time you have 20+ tools, longer sessions, and richer tool output, two things happen:
- Costs compound fast. A 30-tool agent sending 150 tokens per schema adds 4,500 input tokens to every request. At GPT-4o pricing ($2.50/M), that's $0.011 per request — or $11/day at 1,000 req/day, purely from tool definitions the model doesn't use.
- Quality degrades. LLMs perform worse when given irrelevant context. Research from Anthropic and OpenAI consistently shows that tool selection accuracy drops as the tool list grows. More tools means more noise, more hallucinated tool calls, and more "I'll try this random tool" errors.
The fundamental mismatch: agents grow their tool sets over time, but most requests only need a small slice of the full payload. The context the modeldoesn't need is dead weight — you're paying to process it, and it can make the model worse.
What Orqen does
Orqen is a proxy that sits between your agent and your LLM provider. It reads each request, keeps the context the turn actually needs, compacts noisy payload sections, and forwards a smaller validated request to your LLM. The model's response streams straight back to you, unmodified.
From your agent's perspective, integration doesn't change — you point two config values at Orqen and keep writing code exactly as you were. From your LLM provider's perspective, the requests arrive leaner, more focused, and cheaper to process.
Three things happen to the payload on every request:
- Intent-aware request planning. Orqen analyzes the user's goal, budget pressure, recent context, available tools, and recovery signals before it changes the payload. The same plan coordinates tool routing, compression, history tiers, reconstruction, and validation.
- Payload cleanup. Repeated sentences, AI preamble filler, whitespace bloat, exact-duplicate turns, irrelevant tool schemas, old history, stale image context, and bulky tool results are reduced before the request leaves.
- Reconstruction and validation. Orqen assembles the final model-facing request, checks critical IDs, URLs, constraints, and tool schemas, then restores context if validation says something important was lost.
And one thing happens to the routing, but only if you ask for it. Call a real model (claude-sonnet-4-6) and Orqen forwards to exactly that model — it never substitutes a model behind your back. Call an orqen/* alias instead (orqen/auto, orqen/cheap, orqen/fast, orqen/capable) and you're handing Orqen the model choice on purpose: it picks the cheapest model that can handle the turn, the fastest, or the most capable, from the providers you've connected. The mode is yours to set; passthrough keeps your exact model every time.
Benchmark: what we measured
In typical agent workloads with 10–100 tools, we see 50–70% fewer prompt tokens on routing-heavy turns — more when most tools are irrelevant to the query, less when the user needs a wide tool surface. The run below is a reproducible upper end, not a guarantee for every request.
We ran our open weather agent — examples/bedrock_multi_tool_agent.py — with 51 tools against Bedrock Claude Haiku 4.5. Same question, same model, two round-trips each: one direct, one through Orqen. We hold the model fixed on purpose here — this benchmark isolates the payload savings, not model routing. You can rerun it yourself; the script and steps are in the docs.
| Metric | Direct | Via Orqen | This run |
|---|---|---|---|
| Prompt tokens (both calls) | 9,235 | 1,605 | −83% |
| Tools forwarded (round 1) | 51 | 1 | −98% |
| Pipeline overhead | — | <20ms | cached routing |
Reproducible result: same weather question, identical model answer, 7,630 fewer prompt tokens across two calls (9,235 → 1,605). Round 1 sent 1 tool instead of 51. Routing overhead on a warm cache was under 5ms.
Your mileage will vary. Savings depend on tool count, how many tools match each query, model pricing, and call pattern. If your workload looks like this test (~7.6k prompt tokens saved per two-call run on GPT-4o at $2.50/M), 1,000 similar requests/day is roughly $19/day in input-token savings — before compression. Smaller tool sets or broader routing windows mean smaller wins.
Designed for production agents
Orqen's pipeline is designed to be invisible — fast enough that it doesn't affect the latency your users feel. Here's the typical overhead:
Context cleanup < 1ms — strip exact repeats, preamble, whitespace
Tool result trim < 2ms — minify large JSON/HTML in tool responses
Tool routing 5-20ms — score tools, select the relevant subset
Schema trim < 2ms — compress verbose tool definitions after routing
──────
Total ~300ms typical (includes network + Redis roundtrips)Orqen's core routing and cleanup path runs within its own infrastructure — no required third-party calls for normal requests. Optional Pro features may use additional enrichment passes when the optimization plan calls for it; those paths are not part of the default hot path. Tool schemas are analyzed once and remembered, so the system gets faster as your agent accumulates history.
The optimizer improves over time. Orqen tracks which optimizations ran, which tools the LLM actually called, whether validation restored context, and when recovery widened the next request. Useful context gets surfaced more reliably; noisy context gets reduced with more confidence.
Two-line integration
Orqen accepts requests in the native format for Anthropic, OpenAI, and AWS Bedrock SDKs. There's no format translation required on your side.
# Before
client = anthropic.Anthropic(api_key="sk-ant-...")
# After — point at Orqen
client = anthropic.Anthropic(
api_key="sk-orq-YOUR_KEY",
base_url="https://api.orqen.app",
)
# Your messages, tools, and model name stay the sameOrqen stores your provider key (encrypted, AES-128) and decrypts it per request to forward to the actual provider. Your agent's code never changes — all three SDKs keep their native request shapes.
Who it's for
Orqen is built for developers whose agent requests are getting too large, expensive, or noisy as the product grows. Tool-heavy agents are a common fit, but long histories, bulky tool results, multimodal turns, and verbose schemas are part of the same problem.
The sweet spot is agents where:
- You're already paying meaningful LLM costs and noticing context-related noise
- You use Anthropic, OpenAI, or Bedrock (or Groq, Mistral, Google, DeepSeek)
- You want to scale the agent's tools, memory, and workflows without scaling cost linearly
Orqen is not for teams who need to rewrite their agent architecture, swap frameworks, or learn a new API. It's a one-day integration that adds an optimization layer to whatever you already have.
Pricing at a glance
Free forever — no trial clock, no credit card. Pro is a flat $39/month with unlimited optimization. If Orqen doesn't save you money, cancel in two minutes.
| Free | Pro | |
|---|---|---|
| Price | $0 / forever | $39 / month |
| Optimization limit | 250K saved tokens / mo | Unlimited |
| At free limit | Passthrough until reset | No cap — never pauses |
| Includes | Payload optimization, routing quality, sessions | + advanced compression, reranking, email support |
Free resets on the 1st each month. Anti-abuse caps apply on Free: 75K saved tokens per day and 500K per week. When you hit the monthly limit, requests still go through — Orqen stops optimising until the counter resets so your agents never hard-stop.
What's next
We're shipping fast. A few things in active development:
- Better optimization guidance. Orqen already tracks routing quality, compression strategy, validation fallback, and recovery signals. We're adding clearer dashboard suggestions so you can see exactly which payload sections are helping or hurting each agent workflow.
- Richer history tiers. For long agent loops, Orqen is deepening its hot/warm/cold history model so old conversation turns stay useful without occupying the same space as the current task.
- Deeper routing calibration.
orqen/autois available today — we're extending dashboard insights so teams can tune provider pools and recall targets with less guesswork.
We're building in public. Feedback, bug reports, and feature requests via support@orqen.app.
Get started in five minutes
If your agent already sends large tool lists, long history, tool results, images, or verbose schemas through Anthropic, OpenAI, or Bedrock, you can see savings on the first optimized request:
- Create a free account at the dashboard — no credit card.
- Add your provider key (Anthropic, OpenAI, or Bedrock) so Orqen can forward requests on your behalf.
- Copy your Orqen API key (
sk-orq-...) from Settings. - Change two lines in your SDK client:
api_keyandbase_url(see integration section above). - Send one real agent request and open the dashboard — saved tokens and payload savings should show within minutes.
Start now: Sign up free · Read the quickstart · Questions? Email us