Your agent sends the same system prompt, tool schemas, and conversation history on every turn. The provider processes most of that context again and again — and bills you for it at full input-token rates. Context caching is the feature that says: if this prefix hasn't changed, charge me less.
Anthropic, OpenAI, and AWS Bedrock all support some form of prompt caching today. The discounts are large — often 50–90% off cached input tokens. Yet most production agents never turn it on, or they turn it on and then accidentally invalidate the cache every turn by reshuffling tools or compressing protected messages.
What context caching is
Context caching (also called prompt caching) lets an LLM provider store a stable prefix of your request — system instructions, tool definitions, earlier conversation turns — and reuse it on subsequent calls instead of re-processing those tokens from scratch.
Providers implement this differently:
- Anthropic and Bedrock (Claude) use explicit
cache_controlmarkers. You place{"type": "ephemeral"}on content blocks to define cache breakpoints. The first request writes the cache; later requests that match the prefix read from it. - OpenAI caches automatically via longest-prefix matching. No marker field — if the start of your prompt is byte-identical to a recent request, cached tokens show up in
usage.prompt_tokens_details.cached_tokens.
# Anthropic requires explicit cache_control markers
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=[
{
"type": "text",
"text": long_system_prompt, # static across turns
"cache_control": {"type": "ephemeral"},
}
],
messages=messages,
tools=tools,
)
# cache_read_input_tokens in usage → ~90% cheaper on cache hitsThe mental model is simple: split each request into a stable prefix (everything that stays the same across turns) and a new suffix (the latest user message, fresh tool results). Caching makes the prefix cheap on repeat visits.
Cache invalidation is strict. Change a single character in the cached prefix — reorder tools, edit the system prompt, compress an old message — and the provider treats the whole prefix as new. You pay write costs again and lose the read discount until the cache rebuilds.
Why caching matters for agents
Agents are the ideal caching workload. Unlike one-shot chat, an agent loop sends the same system prompt and tool catalog on turn 1, turn 5, and turn 20. Conversation history grows, but everything before the latest user message is identical to what you sent one request ago.
Three benefits compound in production:
- Lower input cost. Cached tokens bill at a fraction of fresh input. On Anthropic Claude models, cache reads are typically ~90% cheaper than uncached input — the single largest per-token discount most providers offer.
- Lower latency. Providers skip re-processing cached prefixes. On long agent sessions with heavy tool schemas, time-to-first-token can drop noticeably on cache hits.
- Better economics at scale. Cost scales with turns, not just users. A support agent handling 500 multi-turn sessions/day multiplies any per-turn waste by thousands of requests.
Caching does not replace payload optimization. It discounts tokens you still send. If you're shipping 50 tool schemas on every turn, caching makes those 50 schemas cheaper — but routing 3 relevant tools still wins on both cost and model quality. The two levers stack.
How much money it saves
Savings depend on how much of each request is stable and how many turns reuse it. A rough illustration for a Claude Sonnet agent with an 8K-token stable prefix across 20 turns:
# Rough input-token pricing (illustrative — check your provider)
#
# Provider Fresh input Cached input Savings on hit
# Anthropic Claude $3.00 / 1M $0.30 / 1M ~90%
# OpenAI GPT-4o $2.50 / 1M $1.25 / 1M ~50%
#
# A 20-turn agent with 8K stable prefix tokens/turn:
# Without caching: 20 × 8K × $3/M ≈ $0.48 input
# With caching: 8K write + 19 × 8K read at $0.30/M ≈ $0.05 input| Scenario | Stable prefix | Turns | Typical effect |
|---|---|---|---|
| Short chat, no tools | <1K tokens | 1–3 | Below minimum — little benefit |
| Tool-heavy agent | 6K–15K tokens | 10–50 | Large — prefix is mostly tools + history |
| Long RAG session | 20K+ tokens | 5–20 | Very large — if prefix stays stable |
Anthropic requires a minimum cacheable prefix (~1,024 tokens). OpenAI's automatic cache also favors longer stable prefixes. Single-turn requests and tiny prompts won't see meaningful savings — multi-turn agents will.
Why teams skip it anyway
If the economics are this good, why isn't every agent using caching?
- Anthropic requires explicit setup.
cache_controlmarkers aren't on by default. SDK examples rarely show them. Teams ship agents without ever adding breakpoints. - Tool lists change every turn. Dynamic tool routing, MCP server reconnects, and "send all tools always" patterns reshuffle the prefix. Even a one-tool difference invalidates the cache.
- Compression fights caching. Middleware that rewrites old messages or trims system prompts saves tokens once — then breaks the prefix match forever. You trade a small compression win for a large cache miss.
- OpenAI looks automatic, so nobody checks. Prefix caching works silently, but only when the prompt start is identical. Mutating the system message between turns zeroes out hits.
The result: teams pay full input rates on context they've already paid to process — sometimes on every turn of a 30-turn agent loop.
What Orqen does with caching
Orqen treats provider prompt caching as a first-class concern — not an afterthought. When you route through Orqen, caching and payload optimization are coordinated so they don't work against each other.
Auto-injection for Anthropic and Bedrock
If your request goes to Anthropic or Bedrock without cache_control markers, Orqen can inject them automatically. It finds stable breakpoints — typically the end of system messages and the last assistant turn before the current user message — and adds ephemeral cache markers before the request reaches the provider.
# Point at Orqen — no cache_control required on your side
client = anthropic.Anthropic(
api_key="sk-orq-YOUR_KEY",
base_url="https://api.orqen.app",
)
# Orqen detects stable prefixes and injects breakpoints
# when the conversation is large enough (Anthropic/Bedrock).
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
messages=messages, # same code you already have
tools=all_tools,
)Injection only runs when:
- You haven't set
cache_controlyourself (Orqen won't override your markers) - The provider is Anthropic or Bedrock
- The conversation has at least two messages
- The cacheable prefix is large enough (~1,024 tokens — Anthropic's minimum)
Orqen logs these as cache_injection in the optimization trace. The dashboard tracks how many requests Orqen enabled caching on and how many tokens flowed through provider cache reads.
Cache-aware compression
Orqen's compression pipeline knows which messages are cache-protected:
- Any message with
cache_controlmarkers — and every message before the last cached breakpoint — is left untouched. Modifying them would invalidate the provider cache and cost more than compression saves. - On OpenAI models, system messages are always protected. OpenAI matches caches from the prompt start; changing the system prompt invalidates the entire prefix.
Cache-stable tool routing
Tool routing is where most agents accidentally break caching. Send 51 tools on turn 1 and 3 tools on turn 2 — the prefix changed, cache miss.
When Orqen detects cache_control in a request, it amplifies a cache-stable bonus during tool selection: tools that were in the previous turn's pruned set get a score boost so the forwarded subset stays consistent across turns. Combined with session hints (tools the LLM already called get boosted regardless of caching), routing stability improves without ignoring relevance.
Complementary, not competing: caching discounts repeated tokens; Orqen shrinks the tokens you send and improves tool selection quality. Orqen also increases prefix-cache hit rates by keeping tool subsets stable instead of reshuffling the full catalog every turn.
Requests that aren't caching yet
Not every request hits the provider cache. Here's what Orqen does in each case — and what you can expect.
When you already set cache_control
Orqen detects your markers before the optimization pipeline runs. It does not inject additional markers. Compression skips cache-protected messages. Tool routing applies a stronger cache-stability boost so your prefix stays valid across turns.
When auto-injection doesn't apply
Some requests won't get Orqen-injected markers:
- OpenAI and other providers — caching is automatic; Orqen focuses on keeping the system prompt stable and protected from compression instead of injecting markers.
- Short conversations — below ~1,024 cacheable tokens, Anthropic won't cache anyway, so Orqen skips injection.
- Single-message requests — no turn boundary to mark.
These requests still run the full Orqen pipeline: tool pruning, schema compression, deduplication, and the rest. You get payload savings even when provider caching isn't active yet.
When the provider cache misses on a later turn
A cache miss means the prefix changed — new tools, edited system prompt, middleware mutation, or the ephemeral cache expired (Anthropic's default TTL is five minutes of inactivity). Orqen doesn't treat that as a failure. The request forwards normally; optimization still applies on the fresh prefix.
Honest savings attribution
Orqen tracks provider_cached_tokens from upstream usage responses (Anthropic cache_read_input_tokens, OpenAI cached_tokens, Bedrock cacheReadInputTokenCount). When you compress tokens that the provider would have cached anyway, Orqen subtracts the overlap so savings aren't double-counted.
Exception: if Orqen injected the cache markers (cache_injected=True), the provider cache exists because of Orqen — so compression and cache savings are both credited honestly.
Passthrough mode is separate. If you hit your free-tier saved-token limit, Orqen stops optimizing but still forwards requests. That's a billing-cap behavior, not a caching behavior — your agent keeps running either way.
Stack caching with payload optimization
The highest-leverage setup for a multi-turn, tool-heavy agent:
- Route through Orqen so tool lists shrink per turn and stay stable when caching is active.
- Let Orqen inject cache markers (Anthropic/Bedrock) or keep system prompts identical (OpenAI) — don't mutate the prefix between turns unless you mean to.
- Check the dashboard for cache read tokens, cache-injected requests, and tools in → tools out per session.
Typical agent workloads with 10–100 tools see 50–70% fewer prompt tokens from routing and compression on top of whatever caching discounts the provider applies to the stable prefix that remains. The two savings multiply rather than replace each other.
For deeper context on tool routing and MCP sprawl, see MCP Gave Your Agent 50 Tools — Now What? and the sessions docs for how Orqen groups turns and applies session-consistent routing.
Try it on your agent
If you're running multi-turn agents on Anthropic, OpenAI, or Bedrock and haven't checked your cache hit rate, you're likely overpaying on input tokens you send every turn anyway:
- Create a free Orqen account and add your provider key.
- Point your SDK at
https://api.orqen.app— no changes to your message format required. - Run a multi-turn agent session with your real system prompt and tool list.
- Open Usage and look for provider cached tokens and cache-injected requests alongside payload savings.
Already using cache_control? Keep it — Orqen detects your markers, protects cached content from compression, and stabilizes tool routing around them. New to caching? Orqen injects breakpoints on qualifying Anthropic and Bedrock requests so you don't have to wire markers yourself.
Next step: Sign up free · Sessions & cache-stable routing · Introducing Orqen