The agent has been running for forty-seven turns. Tool results, user corrections, and intermediate reasoning fill the context window. The next request either gets truncated, rejected by the provider, or silently drops the task definition from turn three. The model forgets what it was doing.
Long sessions are where agent costs and quality problems converge. You are paying to re-process everything every turn — and if you "fix" it by chopping the oldest messages, you often delete the very constraints the task depends on.
The turn 47 problem
Agent loops grow context in three directions:
- Conversation history — every user message, assistant reply, and tool call adds tokens.
- Tool results — API responses, database rows, and HTML pages accumulate in
role: toolmessages. - Tool schemas — still present every turn unless you route per turn (see MCP tool sprawl).
A coding agent, support bot, or research assistant can hit 80–120K estimated tokens within a single session. Models advertise 128K or 200K windows, but quality degrades well before the hard limit — and input pricing scales linearly with everything you send.
Context limits are soft and hard. Soft: model quality drops as irrelevant history grows. Hard: provider rejects or truncates the request. Good compression targets both.
Why naive truncation fails
The obvious fix — drop the oldest 50% of messages — is fast and wrong:
- Task definitions live in early turns. "Use the Acme Corp template and don't touch prod" was turn 2. Turn 47 is a follow-up.
- IDs and constraints scatter. Invoice
INV-8842, repoacme/api, and "approved by legal" appear once each across thirty turns. - Tool results hold state. The model "remembers" a lookup because the JSON is still in history — delete it and the agent re-fetches or hallucinates.
- Cache invalidation. Mutating cached prefixes destroys provider prompt-cache savings (see context caching).
What you need is tiered retention: keep recent turns verbatim, extract durable facts from the middle, and summarize the distant past — gated so light sessions pay zero compression latency.
Fill-ratio gating
Orqen does not run heavy compression on every request. The optimization_plan computes fill_ratio — estimated tokens divided by the model's context window — and sets urgency:
# fill_ratio = estimated tokens / model context window
#
# Orqen gates compression by how full the window is:
# Low fill → skip heavy compression (not worth latency)
# Moderate → tool result trim, schema compression
# High → conversation summary + warm fact extraction
# Very high → aggressive telegraphic + semantic compression| Urgency | Context fill | Typical stages enabled |
|---|---|---|
| Skip | Low | Dedup only; routing as usual |
| Light | Moderate | Tool result trim, schema compression |
| Moderate | High | Warm facts + conversation summary |
| Aggressive | Very high | Telegraphic + semantic compression |
The plan is built once per request and logged in latency_breakdown.optimization_plan so you can see exactly which stages ran and why — without storing raw prompts.
Hot, warm, and cold history
Orqen treats history in three tiers:
- Hot turns — recent assistant turns stay verbatim. The model sees recent tool calls, errors, and user corrections exactly as they happened.
- Warm facts — durable signals extracted from middle history: IDs, percentages, error keywords, and decisions. Injected as a compact system block — not a user message.
- Cold history — everything older than the hot window gets summarized or dropped, depending on urgency.
Relevance-weighted selection can keep a high-scoring turn from turn 3 even when turn 45 is current — so task definitions survive without keeping all 45 turns verbatim.
Conversation summarization
Orqen summarizes cold history when turn count and context pressure warrant it. Key behaviors:
- Incremental segments. Long histories are chunked and summarized per segment, then combined — avoiding single-pass truncation that loses most content.
- Summary as system message. The model treats summarized context as ground truth, not as a prior user query to re-answer.
- Existing summary preservation. If a prior request already injected a summary, it is incorporated rather than discarded.
- Fail silent. If the internal summarizer is unavailable, original messages pass through unchanged — no failed requests.
Summarization uses Orqen's internal LLM providers, not the customer's key. It runs only when the optimization plan enables run_conversation_summary.
Telegraphic compression
At high fill ratios, Orqen shortens verbose assistant and user prose into telegraphic form — preserving named entities, code blocks, and structured data while cutting filler.
It skips messages that look like JSON, HTML, or stack traces. It respects preserve_manifest terms from the optimization plan (IDs, product names, constraints flagged as critical). Failures return the original text.
Telegraphic compression is off by default and only runs when urgency is aggressive — typically the sessions where you would otherwise hit the window wall.
Cache-aware compression
Compression and provider caching must cooperate. Orqen:
- Never modifies messages with
cache_controlmarkers or anything before the last cache breakpoint. - Protects OpenAI system messages (prefix match starts there).
- Uses cache-stable tool routing so pruned subsets stay consistent across turns when caching is active.
The goal: shrink the new suffix without breaking the stable prefix you already paid to cache. For the full caching picture, see Context Caching: The LLM Cost Lever Most Agents Skip.
Try it on long sessions
If your agent regularly exceeds 20+ turns or you see quality cliff past turn 30:
- Sign up for Orqen (Pro unlocks advanced compression tiers).
- Route through Orqen with your real long-running workflow.
- Check Usage for
compression_tokens_saved,compression_techniques, and fill_ratio in the optimization trace. - Compare answer quality on turn 40+ with and without Orqen on the same session shape.
Next step: Sign up free · Introducing Orqen · Tool result bloat