Context Caching: The LLM Cost Lever Most Agents Skip

Your agent sends the same system prompt, tool schemas, and conversation history on every turn. The provider processes most of that context again and again — and bills you for it at full input-token rates. Context caching is the feature that says: if this prefix hasn't changed, charge me less.

Anthropic, OpenAI, and AWS Bedrock all support some form of prompt caching today. The discounts are large — often 50–90% off cached input tokens. Yet most production agents never turn it on, or they turn it on and then accidentally invalidate the cache every turn by reshuffling tools or compressing protected messages.

What context caching is

Context caching (also called prompt caching) lets an LLM provider store a stable prefix of your request — system instructions, tool definitions, earlier conversation turns — and reuse it on subsequent calls instead of re-processing those tokens from scratch.

Providers implement this differently:

Anthropic and Bedrock (Claude) use explicit cache_control markers. You place {"type": "ephemeral"} on content blocks to define cache breakpoints. The first request writes the cache; later requests that match the prefix read from it.
OpenAI caches automatically via longest-prefix matching. No marker field — if the start of your prompt is byte-identical to a recent request, cached tokens show up in usage.prompt_tokens_details.cached_tokens.

# Anthropic requires explicit cache_control markers
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,  # static across turns
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=messages,
    tools=tools,
)
# cache_read_input_tokens in usage → ~90% cheaper on cache hits

The mental model is simple: split each request into a stable prefix (everything that stays the same across turns) and a new suffix (the latest user message, fresh tool results). Caching makes the prefix cheap on repeat visits.

Cache invalidation is strict. Change a single character in the cached prefix — reorder tools, edit the system prompt, compress an old message — and the provider treats the whole prefix as new. You pay write costs again and lose the read discount until the cache rebuilds.

Why caching matters for agents

Agents are the ideal caching workload. Unlike one-shot chat, an agent loop sends the same system prompt and tool catalog on turn 1, turn 5, and turn 20. Conversation history grows, but everything before the latest user message is identical to what you sent one request ago.

Three benefits compound in production:

Lower input cost. Cached tokens bill at a fraction of fresh input. On Anthropic Claude models, cache reads are typically ~90% cheaper than uncached input — the single largest per-token discount most providers offer.
Lower latency. Providers skip re-processing cached prefixes. On long agent sessions with heavy tool schemas, time-to-first-token can drop noticeably on cache hits.
Better economics at scale. Cost scales with turns, not just users. A support agent handling 500 multi-turn sessions/day multiplies any per-turn waste by thousands of requests.

Caching does not replace payload optimization. It discounts tokens you still send. If you're shipping 50 tool schemas on every turn, caching makes those 50 schemas cheaper — but routing 3 relevant tools still wins on both cost and model quality. The two levers stack.

How much money it saves

Savings depend on how much of each request is stable and how many turns reuse it. A rough illustration for a Claude Sonnet agent with an 8K-token stable prefix across 20 turns:

# Rough input-token pricing (illustrative — check your provider)
#
# Provider          Fresh input    Cached input    Savings on hit
# Anthropic Claude  $3.00 / 1M     $0.30 / 1M      ~90%
# OpenAI GPT-4o     $2.50 / 1M     $1.25 / 1M      ~50%
#
# A 20-turn agent with 8K stable prefix tokens/turn:
#   Without caching: 20 × 8K × $3/M  ≈ $0.48 input
#   With caching:    8K write + 19 × 8K read at $0.30/M ≈ $0.05 input

Scenario	Stable prefix	Turns	Typical effect
Short chat, no tools	<1K tokens	1–3	Below minimum — little benefit
Tool-heavy agent	6K–15K tokens	10–50	Large — prefix is mostly tools + history
Long RAG session	20K+ tokens	5–20	Very large — if prefix stays stable

Anthropic requires a minimum cacheable prefix (~1,024 tokens). OpenAI's automatic cache also favors longer stable prefixes. Single-turn requests and tiny prompts won't see meaningful savings — multi-turn agents will.

Why teams skip it anyway

If the economics are this good, why isn't every agent using caching?

Anthropic requires explicit setup. cache_control markers aren't on by default. SDK examples rarely show them. Teams ship agents without ever adding breakpoints.
Tool lists change every turn. Dynamic tool routing, MCP server reconnects, and "send all tools always" patterns reshuffle the prefix. Even a one-tool difference invalidates the cache.
Compression fights caching. Middleware that rewrites old messages or trims system prompts saves tokens once — then breaks the prefix match forever. You trade a small compression win for a large cache miss.
OpenAI looks automatic, so nobody checks. Prefix caching works silently, but only when the prompt start is identical. Mutating the system message between turns zeroes out hits.

The result: teams pay full input rates on context they've already paid to process — sometimes on every turn of a 30-turn agent loop.

What Orqen does with caching

Orqen treats provider prompt caching as a first-class concern — not an afterthought. When you route through Orqen, caching and payload optimization are coordinated so they don't work against each other.

Auto-injection for Anthropic and Bedrock

If your request goes to Anthropic or Bedrock without cache_control markers, Orqen can inject them automatically. It finds stable breakpoints — typically the end of system messages and the last assistant turn before the current user message — and adds ephemeral cache markers before the request reaches the provider.

# Point at Orqen — no cache_control required on your side
client = anthropic.Anthropic(
    api_key="sk-orq-YOUR_KEY",
    base_url="https://api.orqen.app",
)

# Orqen detects stable prefixes and injects breakpoints
# when the conversation is large enough (Anthropic/Bedrock).
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=messages,  # same code you already have
    tools=all_tools,
)

Injection only runs when:

You haven't set cache_control yourself (Orqen won't override your markers)
The provider is Anthropic or Bedrock
The conversation has at least two messages
The cacheable prefix is large enough (~1,024 tokens — Anthropic's minimum)

Orqen logs these as cache_injection in the optimization trace. The dashboard tracks how many requests Orqen enabled caching on and how many tokens flowed through provider cache reads.

Cache-aware compression

Orqen's compression pipeline knows which messages are cache-protected:

Any message with cache_control markers — and every message before the last cached breakpoint — is left untouched. Modifying them would invalidate the provider cache and cost more than compression saves.
On OpenAI models, system messages are always protected. OpenAI matches caches from the prompt start; changing the system prompt invalidates the entire prefix.

Cache-stable tool routing

Tool routing is where most agents accidentally break caching. Send 51 tools on turn 1 and 3 tools on turn 2 — the prefix changed, cache miss.

When Orqen detects cache_control in a request, it amplifies a cache-stable bonus during tool selection: tools that were in the previous turn's pruned set get a score boost so the forwarded subset stays consistent across turns. Combined with session hints (tools the LLM already called get boosted regardless of caching), routing stability improves without ignoring relevance.

Complementary, not competing: caching discounts repeated tokens; Orqen shrinks the tokens you send and improves tool selection quality. Orqen also increases prefix-cache hit rates by keeping tool subsets stable instead of reshuffling the full catalog every turn.

Requests that aren't caching yet

Not every request hits the provider cache. Here's what Orqen does in each case — and what you can expect.

When you already set cache_control

Orqen detects your markers before the optimization pipeline runs. It does not inject additional markers. Compression skips cache-protected messages. Tool routing applies a stronger cache-stability boost so your prefix stays valid across turns.

When auto-injection doesn't apply

Some requests won't get Orqen-injected markers:

OpenAI and other providers — caching is automatic; Orqen focuses on keeping the system prompt stable and protected from compression instead of injecting markers.
Short conversations — below ~1,024 cacheable tokens, Anthropic won't cache anyway, so Orqen skips injection.
Single-message requests — no turn boundary to mark.

These requests still run the full Orqen pipeline: tool pruning, schema compression, deduplication, and the rest. You get payload savings even when provider caching isn't active yet.

When the provider cache misses on a later turn

A cache miss means the prefix changed — new tools, edited system prompt, middleware mutation, or the ephemeral cache expired (Anthropic's default TTL is five minutes of inactivity). Orqen doesn't treat that as a failure. The request forwards normally; optimization still applies on the fresh prefix.

Honest savings attribution

Orqen tracks provider_cached_tokens from upstream usage responses (Anthropic cache_read_input_tokens, OpenAI cached_tokens, Bedrock cacheReadInputTokenCount). When you compress tokens that the provider would have cached anyway, Orqen subtracts the overlap so savings aren't double-counted.

Exception: if Orqen injected the cache markers (cache_injected=True), the provider cache exists because of Orqen — so compression and cache savings are both credited honestly.

Passthrough mode is separate. If you hit your free-tier saved-token limit, Orqen stops optimizing but still forwards requests. That's a billing-cap behavior, not a caching behavior — your agent keeps running either way.

Stack caching with payload optimization

The highest-leverage setup for a multi-turn, tool-heavy agent:

Route through Orqen so tool lists shrink per turn and stay stable when caching is active.
Let Orqen inject cache markers (Anthropic/Bedrock) or keep system prompts identical (OpenAI) — don't mutate the prefix between turns unless you mean to.
Check the dashboard for cache read tokens, cache-injected requests, and tools in → tools out per session.

Typical agent workloads with 10–100 tools see 50–70% fewer prompt tokens from routing and compression on top of whatever caching discounts the provider applies to the stable prefix that remains. The two savings multiply rather than replace each other.

For deeper context on tool routing and MCP sprawl, see MCP Gave Your Agent 50 Tools — Now What? and the sessions docs for how Orqen groups turns and applies session-consistent routing.

Try it on your agent

If you're running multi-turn agents on Anthropic, OpenAI, or Bedrock and haven't checked your cache hit rate, you're likely overpaying on input tokens you send every turn anyway:

Create a free Orqen account and add your provider key.
Point your SDK at https://api.orqen.app — no changes to your message format required.
Run a multi-turn agent session with your real system prompt and tool list.
Open Usage and look for provider cached tokens and cache-injected requests alongside payload savings.

Already using cache_control? Keep it — Orqen detects your markers, protects cached content from compression, and stabilizes tool routing around them. New to caching? Orqen injects breakpoints on qualifying Anthropic and Bedrock requests so you don't have to wire markers yourself.

Next step: Sign up free · Sessions & cache-stable routing · Introducing Orqen

Context Caching: The LLM Cost Lever Most Agents Skip

What context caching is

Why caching matters for agents

How much money it saves

Why teams skip it anyway

What Orqen does with caching

Auto-injection for Anthropic and Bedrock

Cache-aware compression

Cache-stable tool routing

Requests that aren't caching yet

When you already set cache_control

When auto-injection doesn't apply

When the provider cache misses on a later turn

Honest savings attribution

Stack caching with payload optimization

Try it on your agent

Turn 47 Hit the Context Window. Now What?

See Orqen optimize your agent payloads