Your LLM bill is inflated by tokens your agent doesn't need. Orqen cuts them out.
Every LLM call your agent makes carries tools, schemas, and history the model doesn't need for that turn — and you pay for all of it. Orqen sits between your SDK and the provider: it prunes tools, compresses context, routes each call to the right model for the task, and activates provider caching automatically. Lower bill, better accuracy, zero code changes.
You're paying for tokens the model ignores
Without Orqen
Your agent's full context → LLM
Tools + history + schemas + images + tool results → LLM
→ Prompt grows each turn: stale history, repeated schemas, old results
→ The model sees context it does not need
→ Higher cost. More chances for the model to choose badly.
With Orqen
Your agent's full context → Orqen → LLM
Tools, history, schemas, images → cleaned and routed → LLM
→ 53.9% less tool context · history compressed
→ Schemas deduped · images reduced · tool results trimmed
→ 7,688,105 context tokens removed
→ $30.01 observed provider-bill reduction
Observed live Orqen traffic, May 2026: 437 calls, 379 successful. Recent Bedrock rows went 33 → 15 tools.
This is probably for you if
Your LLM bills are growing and you want to spend less without rewriting your agent
You need to preserve critical IDs, URLs, and constraints while cutting cost
Anthropic SDK, OpenAI SDK, AWS Bedrock, Groq — keep the SDK you know
This is probably not for you if
Your requests are already tiny and never include tools, long history, or multimodal content
You've already handled tool selection, context compression, and cache optimisation in-house — and your bill reflects it
You want to rewrite your agent framework instead of adding a proxy
Sound familiar?
Six reasons your agent costs more than it should
Most teams patch one or two of these. Orqen addresses all six on every request — one plan, before the call leaves your stack, without a single change to your agent code.
“My agent ships every tool on every call.”
Orqen reads what the turn is actually asking for and forwards only the tools it needs — about 53.9% fewer schemas per call, and it still hands the model the tool it would have chosen.
“The prompt grows every turn until it's huge.”
Repeated schemas, verbose tool results, and stale images get compacted; older turns summarize in tiers while recent ones stay verbatim. Prompt size stays flat as the session runs long.
“I'm paying frontier prices for simple tasks.”
Set model to orqen/auto and Orqen classifies each request — data retrieval, code generation, analysis, and more — then routes simple operations to fast, cheap models and complex ones to capable models. Per-request, automatic, no rules to write.
“Provider caching never activates for my agent.”
Orqen auto-injects Anthropic cache_control markers, stabilises the tool prefix across turns so the cache actually hits, and recognises recurring context across sessions so caching works from the very first turn — not just the second.
“I can't tell what I'm actually saving.”
Every request shows what you paid versus what you'd have paid without Orqen — broken down by operation type, model, and provider. Cache savings are tracked separately and never double-counted.
“Every new session starts from scratch.”
Orqen fingerprints your context across sessions. When a new session reuses the same system prompt and tools, it skips redundant processing, activates caching immediately, and protects stable content from re-compression.
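Prefix caching and cross-session fingerprinting both depend on the system prompt and tool list serializing to the same bytes from one turn to the next. The sketch below illustrates that idea only; it is not Orqen's implementation. The helper names `canonical_tools` and `context_fingerprint`, the sample tools, and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
import json


def canonical_tools(tools):
    """Return tools sorted by name so the serialized prefix is order-independent."""
    return sorted(tools, key=lambda tool: tool["name"])


def context_fingerprint(system_prompt, tools):
    """Hash the system prompt plus canonicalized tools into a stable identifier."""
    payload = json.dumps(
        {"system": system_prompt, "tools": canonical_tools(tools)},
        sort_keys=True,         # stable key order inside every schema
        separators=(",", ":"),  # no whitespace differences between runs
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_turn_1 = [
    {"name": "search_orders", "description": "Look up an order", "input_schema": {"type": "object"}},
    {"name": "refund", "description": "Issue a refund", "input_schema": {"type": "object"}},
]
# Same tools, but in a different list order and key order, as a framework might emit them.
tools_turn_2 = [
    {"input_schema": {"type": "object"}, "name": "refund", "description": "Issue a refund"},
    {"description": "Look up an order", "name": "search_orders", "input_schema": {"type": "object"}},
]

fp1 = context_fingerprint("You are a support agent.", tools_turn_1)
fp2 = context_fingerprint("You are a support agent.", tools_turn_2)
print(fp1 == fp2)  # True: the prefix is recognised as unchanged
```

If the tool order or key order leaked into the serialized request, the two turns would hash differently and a provider-side prefix cache would miss even though nothing meaningful changed.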
See your savings
Saved on your LLM bill this month: $30.01 (↓ 53.9% less tool context per call)
Requests: 437 this month
Tools forwarded: 33 → 15 avg per call
Tokens saved: 7.7M this month
Before & after
Fewer tokens sent means a lower provider bill
Without Orqen
With Orqen
53.9% narrower tool context · ~300ms typical overhead
The integration
Switch the key and endpoint — keep your request shape.
Anthropic SDK, OpenAI SDK, and AWS Bedrock SDK can use their usual request shapes. Point them at Orqen with your Orqen key — your prompts, tools, model names, streaming, and multimodal inputs stay familiar.
import anthropic

# Before
# client = anthropic.Anthropic(api_key="sk-ant-...")

# After: point the client at Orqen
client = anthropic.Anthropic(
    api_key="sk-orq-...",  # your Orqen key
    base_url="https://api.orqen.app",
)

# Your request body stays the same
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "..."}],
    tools=[...],  # Orqen optimizes the request before forwarding
)

Works with major providers and agent frameworks
Under the hood
One plan per request — in about 300ms
Every call is intercepted, analyzed, and optimized before it reaches your provider — then forwarded with the answer-critical parts intact.
Your SDK calls Orqen
Orqen sits in front of your LLM provider. Anthropic, OpenAI, and Bedrock SDKs keep their request shapes — point them at Orqen with your key.
Isolate what this turn needs
The engine examines context, tools, schemas, and budget signals to decide what to keep, what to prune, and which model to route to — typically in ~300ms total.
Smaller request, lower bill
The provider receives a smaller request — fewer tools, compressed history, deduped schemas — so you pay for fewer input tokens. Critical IDs and constraints are verified before forwarding.
It stays safe while it gets smarter
Orqen tracks which optimizations ran, which tools the LLM actually called, whether critical terms survived, and when recovery widened the next request. The system learns from live traffic without storing raw prompts.
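The "verify critical terms, otherwise restore" step can be pictured as a simple containment check. This is a minimal sketch under stated assumptions, not Orqen's validator: the two regular expressions, the `ORD-48213` style of ID, and the function names are hypothetical, and a real system would presumably recognise many more kinds of critical content.

```python
import re

# Hypothetical patterns for "critical" terms: URLs and ID-like tokens such as ORD-48213.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
ID_RE = re.compile(r"\b[A-Z]{2,}-\d+\b")


def critical_terms(text):
    """Collect URLs and ID-like tokens that must survive compression."""
    urls = {url.rstrip(".,;:") for url in URL_RE.findall(text)}
    return urls | set(ID_RE.findall(text))


def choose_payload(original, compressed):
    """Forward the compressed text only if every critical term survived."""
    missing = {term for term in critical_terms(original) if term not in compressed}
    if missing:
        return original, missing  # fail open: send the untouched request
    return compressed, set()


original = "Refund order ORD-48213 using https://example.com/refunds and confirm by email."
good = "Refund ORD-48213 via https://example.com/refunds"
bad = "Refund the order via the refunds page"

print(choose_payload(original, good)[0] == good)     # compressed version is safe to send
print(choose_payload(original, bad)[0] == original)  # a term was lost, so the original is sent
```

The design choice worth noting is the direction of failure: when in doubt, the larger original request goes out, so a validation miss costs tokens rather than correctness.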
Real numbers
Real savings from real traffic — not a synthetic benchmark
These numbers come from live agent traffic through Orqen in May 2026: 437 calls, 379 successful. The savings are measured by comparing what your provider actually bills — not estimates.
7.7M
Tokens removed from provider bills
7,688,105 observed this month
82.6%
Lower bill per call
live test — 9,235 → 1,605 billed tokens
$30.01
Saved on the provider bill
based on what the provider actually billed
Recent request rows
Bedrock Opus turns
33 → 15 tools
Saved per recent call
15.0K–28.0K tokens
Observed latency
3.5–10.2s
Observed from recent us.anthropic.claude-opus-4-7 requests in the Orqen dashboard. Actual savings depend on your tools, model prices, traffic, and task mix.
Pricing
Free forever. Pro when you scale.
No trial. Free forever: 250K saved tokens per month, resets monthly. Free includes caps of 75K saved tokens / day and 500K saved tokens / week. Pro is $39 / month: unlimited optimization, no token caps.
Free
$0
forever
Free forever — 250K saved tokens per month, resets monthly. When you hit the limit, your agents keep working — requests forward directly until the next monthly reset.
- 250K saved tokens / month — resets monthly
- Agents keep working at limit (no hard stop)
- Tool and context optimization
- Prompt reconstruction and validation traces
- Routing Quality analysis
- Session tracking
- Community support
Pro
$39
per month
Unlimited optimization, no monthly cap. Cancel any time: if Orqen doesn't save you money, cancelling takes 2 minutes.
- Everything in Free
- Unlimited optimization — no monthly cap
- Advanced compression calibration
- Advanced reranking and intent enrichment
- Email support
Enterprise
Custom
volume pricing
Custom contracts, SLA, and dedicated support.
- Everything in Pro
- Custom pricing & volume discounts
- SSO / SAML (coming soon)
- Dedicated success manager
- Custom data retention
Estimate your savings
Savings calculator
Estimates tool routing, history compression, and deduplication savings on your input token bill.
Selected model
GPT-5.4
$2.50 in · $15.00 out/M
5,000 calls / month
20 tools → 12 pruned (60%)
10 turns → 2 compressed
Monthly savings breakdown
LLM bill reduction
11.7M input tokens × $2.50/M
$29.25/mo
Orqen Pro
$39.00
Flat monthly · cancel anytime
Net gain
-$9.75/mo
Start free — upgrade when savings exceed $39/mo
Input-token pricing from provider docs (May 2026). Tiered models use standard ≤200k context rates. Savings use input $/M only — output and reasoning tokens are not modeled. Assumes ~150 tokens/tool, K=8, ~400 tokens/turn, 60% history reduction beyond 8 turns.
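The calculator's assumptions can be reproduced as plain arithmetic. The sketch below models only tool pruning and history compression, because the page does not give parameters for deduplication, so its result lands slightly below the calculator's 11.7M figure. The function names are hypothetical, and `tools_kept=8` is inferred from "20 tools → 12 pruned" rather than stated outright.

```python
def saved_tokens_per_call(tools, tools_kept, turns, verbatim_turns=8,
                          tokens_per_tool=150, tokens_per_turn=400,
                          history_reduction_pct=60):
    """Estimate input tokens removed from one call (deduplication not modeled)."""
    tool_savings = max(0, tools - tools_kept) * tokens_per_tool
    compressed_turns = max(0, turns - verbatim_turns)
    history_savings = compressed_turns * tokens_per_turn * history_reduction_pct // 100
    return tool_savings + history_savings


def monthly_bill_reduction(calls, per_call_tokens, input_price_per_million):
    """Convert saved tokens into dollars using the input-token price only."""
    return calls * per_call_tokens / 1_000_000 * input_price_per_million


def break_even_tokens(plan_price, input_price_per_million):
    """Saved input tokens per month at which savings equal the plan price."""
    return plan_price / input_price_per_million * 1_000_000


per_call = saved_tokens_per_call(tools=20, tools_kept=8, turns=10)
reduction = monthly_bill_reduction(5_000, per_call, 2.50)
threshold = break_even_tokens(39.00, 2.50)

print(per_call)                        # 1,800 from tools + 480 from history
print(round(reduction, 2))             # dollars saved per month at 5,000 calls
print(round(threshold / 1_000_000, 1)) # millions of saved tokens needed to cover Pro
```

At $2.50 per million input tokens, savings cover the $39 Pro price once roughly 15.6M input tokens are removed per month, which is why the example volume above stays on the free side of the break-even line.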
Frequently asked questions
Plain questions, developer answers — browse the man page below.
Orqen-FAQ(1)                    Orqen Manual                    Orqen-FAQ(1)
Name
orqen-faq - frequently asked questions about cutting your LLM bill
Synopsis
orqen --help [topic]
$ orqen --help "What does Orqen actually do?"
Output
Orqen cuts your LLM bill. Your agent sends tools, history, and schemas on every call — most of which the model doesn't need for that turn. Orqen sits between your SDK and the provider: it prunes irrelevant tools, compresses stale context, auto-routes each request to the right model for the task, and activates provider caching automatically. It also classifies each request by operation type (code generation, data retrieval, analysis, etc.) to show you where your tokens go. Update your API key and endpoint URL; your agent code stays the same.
$ orqen --help "How does Orqen work with provider caching (Anthropic cache_control)?"
Output
Orqen makes caching work better and activates it automatically. For Anthropic and Bedrock, Orqen auto-injects cache_control markers on stable prefixes — you don't need to set them yourself. It stabilises the forwarded tool set across turns so the prefix cache actually hits instead of being invalidated by changing tools. It also fingerprints your context across sessions, so when a new session starts with the same system prompt and tools, caching activates on the very first turn instead of waiting for the second. The dashboard shows your cache hit rate and how much Orqen's injection saved.
$ orqen --help "What if Orqen drops a tool the model needed?"
Output
Orqen snapshots the original request, reconstructs the outbound prompt, validates critical IDs, URLs, constraints, and tool schemas, and restores context if validation fails. For tools, recall@K shows whether every tool the LLM actually used was in the forwarded set. The system widens tool selection on retries and fail-open paths so production agents stay reliable.
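A recall check of this kind is easy to state precisely. The following is a minimal sketch assuming tools are ranked by relevance and the top K are forwarded; the function names and the widening step of 4 are hypothetical and are not taken from Orqen.

```python
def tool_recall(forwarded, called):
    """Fraction of tools the model called that were present in the forwarded set."""
    called = set(called)
    if not called:
        return 1.0  # nothing was called, so nothing could have been missed
    return len(called & set(forwarded)) / len(called)


def widen_on_miss(ranked_tools, k, called, step=4):
    """Return a larger K for the retry when a called tool was not forwarded."""
    if tool_recall(ranked_tools[:k], called) < 1.0:
        return min(len(ranked_tools), k + step)
    return k


ranked = ["search_orders", "refund", "email", "crm_lookup", "calendar"]

print(tool_recall(ranked[:2], ["refund"]))        # 1.0: the called tool was forwarded
print(tool_recall(ranked[:2], ["crm_lookup"]))    # 0.0: the called tool was pruned
print(widen_on_miss(ranked, 2, ["crm_lookup"]))   # 5: retry with the full list
```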
$ orqen --help "How does Orqen detect and recover from quality degradation?"
Output
Orqen monitors each request for signs that compression hurt quality: tool recall misses, HTTP errors, and correction signals in the user's next message (phrases like 'that's wrong', 'you missed', 'as I said'). If a single request shows severe degradation, or two or more bad-quality turns occur within a 10-minute window, the system immediately writes a more conservative compression setting that takes effect on the very next request. It auto-recovers to calibrated levels after roughly two hours. A weekly calibration pass separately sets the long-term aggressiveness baseline for each key based on observed traffic.
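The trigger rule described here (one severe turn, or two bad turns inside ten minutes, then recovery after about two hours) maps onto a small sliding-window tracker. This is an illustrative sketch only: the class name `QualityGuard`, the mode strings, and the use of plain seconds for timestamps are assumptions, and how a turn gets classified as bad or severe is left outside the example.

```python
from collections import deque

WINDOW_SECONDS = 10 * 60        # two bad turns inside this window trigger a fallback
RECOVERY_SECONDS = 2 * 60 * 60  # approximate auto-recovery time from the text


class QualityGuard:
    """Tracks bad-quality turns for one API key and decides the compression mode."""

    def __init__(self):
        self.bad_turns = deque()
        self.conservative_since = None

    def record(self, now, bad, severe=False):
        """Record one turn's quality signal at time `now` (in seconds)."""
        # Drop bad turns that have aged out of the 10-minute window.
        while self.bad_turns and now - self.bad_turns[0] > WINDOW_SECONDS:
            self.bad_turns.popleft()
        if not (bad or severe):
            return
        self.bad_turns.append(now)
        if severe or len(self.bad_turns) >= 2:
            self.conservative_since = now

    def mode(self, now):
        """Return the compression mode that applies to a request at time `now`."""
        if self.conservative_since is None:
            return "calibrated"
        if now - self.conservative_since >= RECOVERY_SECONDS:
            self.conservative_since = None
            return "calibrated"
        return "conservative"


guard = QualityGuard()
guard.record(0, bad=True)
print(guard.mode(1))          # calibrated: one bad turn is not enough
guard.record(300, bad=True)
print(guard.mode(301))        # conservative: two bad turns within 10 minutes
print(guard.mode(300 + RECOVERY_SECONDS))  # calibrated again after recovery
```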
$ orqen --help "What is payload optimization?"
Output
Payload optimization is Orqen's core job: shrink the full agent payload (tools, tool results, history, images, and schemas) while keeping what the model needs for this turn. Orqen snapshots the original request, reconstructs the outbound prompt, validates critical IDs, URLs, constraints, and tool schemas, and restores context if validation fails.
$ orqen --help "What is smart routing?"
Output
Smart routing picks which tools to forward based on what the user is asking for right now — not just keyword matching. Orqen reads the current message and intent, scores your tools, and forwards a smaller relevant set (often 1–4 tools instead of dozens).
$ orqen --help "What is model routing?"
Output
Set model="orqen/auto" and Orqen classifies each request by operation type — code generation, data retrieval, analysis, communication, and more — then picks the right model from your connected providers. Simple operations like file lookups route to fast, cheap models; complex tasks like code generation route to capable ones. You can also force the trade-off with orqen/cheap, orqen/fast, or orqen/capable. The toggle is per-customer in the dashboard.
$ orqen --help "Do I need to change my agent code?"
Output
Update your API key and endpoint URL to point at Orqen. Anthropic SDK, OpenAI SDK, and AWS Bedrock SDK keep their usual request shapes.
$ orqen --help "What's the actual latency cost?"
Output
Orqen adds roughly 300ms to a typical request, mostly network roundtrips and Redis lookups rather than computation. Your LLM call itself takes 2–15 seconds, so the overhead is about 2% of total round-trip time on a 15-second call and up to about 13% on a 2-second one. The pipeline is fail-open: if processing exceeds its budget, it forwards the original payload unchanged.
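Fail-open behaviour under a time budget can be sketched with the standard library alone. The example below is an assumption-laden illustration, not Orqen's pipeline: `optimize_or_passthrough`, the 0.3-second default budget, and the thread-pool approach are all choices made for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def optimize_or_passthrough(payload, optimize, budget_seconds=0.3):
    """Run `optimize(payload)`; on timeout or error, return the original payload."""
    future = _executor.submit(optimize, payload)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        future.cancel()  # best effort; a task that is already running keeps going
        return payload
    except Exception:
        return payload   # an optimizer bug must not block the request


def slow_optimizer(payload):
    time.sleep(0.2)      # simulates an optimizer that blows the budget
    return {"tools": []}


request = {"tools": ["search_orders", "refund", "email"]}
forwarded = optimize_or_passthrough(request, slow_optimizer, budget_seconds=0.05)
print(forwarded is request)  # True: the untouched request is forwarded
```

The caller always gets something it can forward, so the worst case is a request that costs what it would have cost without the proxy.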
$ orqen --help "Does it work with tool-calling frameworks like LangChain or LlamaIndex?"
Output
Yes. If your framework uses the Anthropic SDK, OpenAI SDK, or AWS Bedrock SDK, it should work with the same key and endpoint change. LangChain, LlamaIndex, Haystack, and custom agents are good fits.
$ orqen --help "Does it work with MCP (Model Context Protocol) tools?"
Output
Yes — and MCP actually makes Orqen more valuable, not less. MCP lets agents connect to many tool servers at once, which means payloads grow quickly. Orqen routes each request to the tools that turn needs, compresses verbose schemas after routing, and caches schema analysis so MCP's dynamic nature adds minimal overhead.
$ orqen --help "Does Orqen summarize long conversations?"
Output
Yes. Orqen manages conversation history in three tiers: recent turns are kept verbatim (hot), older turns are compressed and deduplicated (warm), and for long sessions Orqen summarizes earlier chunks into compact summaries (cold). For conversations with 100+ turns, summaries are merged in a hierarchical pass so no single LLM call sees unbounded input and early context is never silently truncated. This keeps prompt sizes from growing unboundedly as agent sessions run longer — without storing raw prompts.
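The hot/warm/cold split and the hierarchical merge can be sketched in a few lines. The tier sizes (`hot=4`, `warm=6`), the `fan_in` of 4, and the function names are hypothetical, and `summarize` stands in for whatever summarization call is actually used; here it is a simple string join so the example runs on its own.

```python
def tier_history(turns, hot=4, warm=6):
    """Split a list of turns (oldest first) into cold, warm, and hot tiers."""
    n = len(turns)
    hot_start = max(0, n - hot)
    warm_start = max(0, hot_start - warm)
    return {
        "cold": turns[:warm_start],           # candidates for summarization
        "warm": turns[warm_start:hot_start],  # compressed and deduplicated
        "hot": turns[hot_start:],             # kept verbatim
    }


def merge_summaries(summaries, summarize, fan_in=4):
    """Merge summaries level by level so no call sees more than `fan_in` inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    while len(summaries) > 1:
        summaries = [
            summarize(summaries[i:i + fan_in])
            for i in range(0, len(summaries), fan_in)
        ]
    return summaries[0] if summaries else ""


tiers = tier_history([f"turn {i}" for i in range(15)])
print(len(tiers["cold"]), len(tiers["warm"]), len(tiers["hot"]))  # 5 6 4

# Stand-in summarizer: joins its inputs instead of calling a model.
merged = merge_summaries([f"s{i}" for i in range(10)], lambda chunk: "+".join(chunk))
print(merged)  # s0+s1+s2+s3+s4+s5+s6+s7+s8+s9
```

With ten chunk summaries and a fan-in of 4, the merge runs in two levels (10 → 3 → 1), so each call handles at most four inputs regardless of session length.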
$ orqen --help "Are my provider API keys safe?"
Output
You store them once in your dashboard. They're encrypted with AES-128 before being written to our database. Each request decrypts them in memory to forward to your provider.
$ orqen --help "What happens when I hit the free monthly limit?"
Output
Your agents keep working. When you reach the 250K saved tokens / month limit, requests still forward directly to your LLM provider; payload optimization pauses until the 1st of next month, when the allowance resets. There is no trial clock and no hard cutoff: the free tier is free forever.
$ orqen --help "What is passthrough mode?"
Output
When you've used your free saved-token allowance for the month, Orqen stops optimizing payloads but still forwards requests to your LLM provider normally. Your agent doesn't break — you just don't get savings until the monthly reset. Pro is $39/month for unlimited optimization with no monthly cap.
$ orqen --help "What's included on the free plan?"
Output
Free forever — 250K saved tokens per month, resets monthly. You get payload optimization, model routing, session tracking, and Routing quality in the dashboard. Pro is $39/month — unlimited optimization, no token caps.
$ orqen --help "What if Orqen goes down?"
Output
For production-critical agents, keep the same provider fallback you would use with any API dependency. Orqen preserves your native request shape, so falling back to your provider directly is straightforward if you need it.
Orqen v1.0                      2026-05-14                      Orqen-FAQ(1)
Your next agent request will cost more than it needs to.
Orqen fixes that before it hits the provider.
Sign up, connect your provider keys, route your agent through Orqen. Free forever — 250K saved tokens per month, resets monthly, no credit card required.
Get started free
250K saved tokens / month · Free forever · No credit card · Upgrade when ready