Skip to content
All posts
Guide//5 MIN READ

Turn 47 Hit the Context Window. Now What?

Long agent sessions outgrow the context window. Why naive truncation drops task context, and how Orqen uses fill-ratio gating, hot/warm/cold history, and summarization.

O

Orqen Team

orqen.app

The agent has been running for forty-seven turns. Tool results, user corrections, and intermediate reasoning fill the context window. The next request either gets truncated, rejected by the provider, or silently drops the task definition from turn three. The model forgets what it was doing.

Long sessions are where agent costs and quality problems converge. You are paying to re-process everything every turn — and if you "fix" it by chopping the oldest messages, you often delete the very constraints the task depends on.

The turn 47 problem

Agent loops grow context in three directions:

  • Conversation history — every user message, assistant reply, and tool call adds tokens.
  • Tool results — API responses, database rows, and HTML pages accumulate in role: tool messages.
  • Tool schemas — still present every turn unless you route per turn (see MCP tool sprawl).

A coding agent, support bot, or research assistant can hit 80–120K estimated tokens within a single session. Models advertise 128K or 200K windows, but quality degrades well before the hard limit — and input pricing scales linearly with everything you send.

Context limits are soft and hard. Soft: model quality drops as irrelevant history grows. Hard: provider rejects or truncates the request. Good compression targets both.

Why naive truncation fails

The obvious fix — drop the oldest 50% of messages — is fast and wrong:

  • Task definitions live in early turns. "Use the Acme Corp template and don't touch prod" was turn 2. Turn 47 is a follow-up.
  • IDs and constraints scatter. Invoice INV-8842, repo acme/api, and "approved by legal" appear once each across thirty turns.
  • Tool results hold state. The model "remembers" a lookup because the JSON is still in history — delete it and the agent re-fetches or hallucinates.
  • Cache invalidation. Mutating cached prefixes destroys provider prompt-cache savings (see context caching).

What you need is tiered retention: keep recent turns verbatim, extract durable facts from the middle, and summarize the distant past — gated so light sessions pay zero compression latency.

Fill-ratio gating

Orqen does not run heavy compression on every request. The optimization_plan computes fill_ratio — estimated tokens divided by the model's context window — and sets urgency:

# fill_ratio = estimated tokens / model context window
#
# Orqen gates compression by how full the window is:
#   Low fill     → skip heavy compression (not worth latency)
#   Moderate     → tool result trim, schema compression
#   High         → conversation summary + warm fact extraction
#   Very high    → aggressive telegraphic + semantic compression
UrgencyContext fillTypical stages enabled
SkipLowDedup only; routing as usual
LightModerateTool result trim, schema compression
ModerateHighWarm facts + conversation summary
AggressiveVery highTelegraphic + semantic compression

The plan is built once per request and logged in latency_breakdown.optimization_plan so you can see exactly which stages ran and why — without storing raw prompts.

Hot, warm, and cold history

Orqen treats history in three tiers:

  • Hot turns — recent assistant turns stay verbatim. The model sees recent tool calls, errors, and user corrections exactly as they happened.
  • Warm facts — durable signals extracted from middle history: IDs, percentages, error keywords, and decisions. Injected as a compact system block — not a user message.
  • Cold history — everything older than the hot window gets summarized or dropped, depending on urgency.

Relevance-weighted selection can keep a high-scoring turn from turn 3 even when turn 45 is current — so task definitions survive without keeping all 45 turns verbatim.

Conversation summarization

Orqen summarizes cold history when turn count and context pressure warrant it. Key behaviors:

  • Incremental segments. Long histories are chunked and summarized per segment, then combined — avoiding single-pass truncation that loses most content.
  • Summary as system message. The model treats summarized context as ground truth, not as a prior user query to re-answer.
  • Existing summary preservation. If a prior request already injected a summary, it is incorporated rather than discarded.
  • Fail silent. If the internal summarizer is unavailable, original messages pass through unchanged — no failed requests.

Summarization uses Orqen's internal LLM providers, not the customer's key. It runs only when the optimization plan enables run_conversation_summary.

Telegraphic compression

At high fill ratios, Orqen shortens verbose assistant and user prose into telegraphic form — preserving named entities, code blocks, and structured data while cutting filler.

It skips messages that look like JSON, HTML, or stack traces. It respects preserve_manifest terms from the optimization plan (IDs, product names, constraints flagged as critical). Failures return the original text.

Telegraphic compression is off by default and only runs when urgency is aggressive — typically the sessions where you would otherwise hit the window wall.

Cache-aware compression

Compression and provider caching must cooperate. Orqen:

  • Never modifies messages with cache_control markers or anything before the last cache breakpoint.
  • Protects OpenAI system messages (prefix match starts there).
  • Uses cache-stable tool routing so pruned subsets stay consistent across turns when caching is active.

The goal: shrink the new suffix without breaking the stable prefix you already paid to cache. For the full caching picture, see Context Caching: The LLM Cost Lever Most Agents Skip.

Try it on long sessions

If your agent regularly exceeds 20+ turns or you see quality cliff past turn 30:

  1. Sign up for Orqen (Pro unlocks advanced compression tiers).
  2. Route through Orqen with your real long-running workflow.
  3. Check Usage for compression_tokens_saved, compression_techniques, and fill_ratio in the optimization trace.
  4. Compare answer quality on turn 40+ with and without Orqen on the same session shape.
Tagged:context-windowagent-optimizationcompressionllm-costhistory
O

Orqen Team

We build the optimization layer for tool-heavy LLM agents. Our goal is to make agent costs predictable as your tool set grows.

Try Orqen free

250K saved tokens per month. Free forever. Two-line integration.

See your savings in the dashboard within seconds of your first request.