Claude's Context Compaction API: Infinite Conversations with One Parameter

The Problem: Token Limits Kill Long-Running Agents
If you've built anything with an LLM API — a chatbot, a coding agent, an n8n workflow, anything with a loop — you've hit this wall. Your agent is 45 minutes into a complex task, 100,000+ tokens deep. Every single turn, you're re-sending the entire conversation history. Turn one's data is still riding along at turn 50.
The token count doesn't just grow — it compounds. Then one of two things happens:
- Hard crash: You hit the 200K wall and the API returns an error. Your agent dies mid-task. Game over, start from scratch.
- Silent degradation: The model loses track of decisions it made 30 turns ago and starts contradicting itself. It starts hallucinating, and you don't even notice until the output is garbage.
The workaround most developers use looks like this:
// The old hack
messages = messages.slice(-10);
You're literally praying that the last 10 messages contain enough context for the agent to keep going. Sometimes they do. Most of the time they don't. You're throwing away critical decisions, tool call results, user preferences — all gone. There's no intelligence in this truncation. It's a hack, and every developer building with these APIs knows it.
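For concreteness, here's the shape of the loop that creates the problem. This is a minimal sketch using the standard @anthropic-ai/sdk client; the turn handler is illustrative, and the model id is just borrowed from the examples later in this post:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();
const messages = [];

// Every turn appends to `messages` and re-sends the entire array,
// so the input token count grows with every call until you hit the limit.
async function runTurn(userText) {
  messages.push({ role: "user", content: userText });

  const response = await anthropic.messages.create({
    model: "claude-opus-4-6-20260205",
    max_tokens: 1024,
    messages, // the full history rides along on every call
  });

  messages.push({ role: "assistant", content: response.content });
  return response;
}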
How Context Compaction Works
Context Compaction is a server-side API that handles this problem elegantly. Here's how it works:
- You set a token threshold (e.g., 100,000 tokens)
- When your conversation approaches that limit, the API itself (not your code) generates a structured summary of everything that came before
- It drops all raw messages before that summary
- It continues seamlessly from the compacted context
The key: this cycles. When it hits the threshold again, it compacts again — and again, and again. Anthropic calls this "effectively infinite conversations." That's a direct quote from their docs.
You're not building a RAG pipeline. You're not setting up a vector database. You're not writing your own summarization prompts and hoping they're good enough. You configure a trigger threshold and the API handles the rest.
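To make the cycle concrete, here's an illustrative client-side analogue of what the server does; countTokens and summarize are hypothetical stand-ins, not part of the API:

// Illustrative pseudocode of the compaction cycle. The real work happens
// server-side; this only sketches the logic described above.
function contextForNextTurn(history, threshold) {
  if (countTokens(history) < threshold) {
    return history; // under the trigger: send the raw history unchanged
  }
  // over the trigger: a structured summary replaces everything so far,
  // and the raw messages before it are dropped
  const summary = summarize(history);
  return [{ role: "user", content: `[Summary of earlier conversation]\n${summary}` }];
}
// The cycle repeats every time the rebuilt context approaches the threshold again.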
It's currently in beta (header: compact-2026-12). It works across the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry. And here's the kicker: the compaction itself adds no extra charge. You're paying normal token costs, and since you're sending fewer tokens per call, you actually save money.
Why This Works Now
Server-side summarization isn't a new idea. It works now because three things converged:
1. Model Quality
Opus 4.6 scores 76% on the MRC-R v2 benchmark, a needle-in-a-haystack test across 1 million tokens of context. The previous best (Sonnet 4.5) scored 18.5%, so this is roughly a 4x improvement. Practically, it means the summaries that compaction generates are actually high quality: the model can find and preserve the important details buried in massive context.
2. Runtime Stability
Between February 10th and 19th, Anthropic shipped a wave of fixes to Claude Code, addressing unbounded WeakMap memory growth, O(n²) message accumulation bugs, and multiple memory leaks in circular buffers and child processes. Before these fixes, even if compaction worked perfectly, the runtime would crash under sustained load. Now it doesn't.
3. No Competition
OpenAI has no equivalent server-side compaction API. Google Gemini gives you a million tokens of context (which is great), but there's no automatic summarization when you exceed it — you still hit a wall, it's just a bigger wall. This is currently a Claude-only feature.
The Benchmark
Here's a real benchmark from Anthropic's own customer service eval:
| Metric | Without Compaction | With Compaction |
|---|---|---|
| Input tokens (multi-ticket workflow) | 200,400 | 82,000 |
| Token reduction | — | 58.6% |
| Max conversation length | ~200K tokens | 10 million+ tokens |
Same quality. Fewer tokens. Lower cost.
Implementation: Three Levels
Level 1: Basic Integration (30 seconds)
This is your existing messages.create call. The only changes are two additions:
const response = await anthropic.messages.create({
  model: "claude-opus-4-6-20260205",
  max_tokens: 8096,
  // Add the beta header
  betas: ["compact-2026-12"],
  // Add context management
  context_management: {
    enabled_tools: [{
      type: "compact",
      trigger: {
        type: "input_tokens",
        threshold: 100000
      }
    }]
  },
  messages: conversationHistory
});
That's the entire implementation. When your conversation exceeds 100K tokens, the API automatically generates a structured summary, drops the old messages, and keeps going.
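One easy way to confirm it's working is to log billed input tokens on each turn; the usage field below is part of the standard Messages API response, and the drop after a compaction is what the description above predicts:

// Log billed input tokens per turn. If compaction fired, this number
// drops sharply on the next call instead of climbing past the threshold.
console.log(`input tokens this turn: ${response.usage.input_tokens}`);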
Level 2: Long-Running Agent
Two key additions for production agents:
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    // Pause after compaction so you can react
    pause_after_compaction: true
  }]
}
With pause_after_compaction: true, the API pauses after it compacts so you get a chance to react. You check if the stop reason is "compaction" and can track how many compactions have happened.
You can also set a total token budget (e.g., 3 million) and inject a graceful shutdown message when exceeded — instead of a hard crash.
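Putting both ideas together, here's a hedged sketch of the surrounding agent loop. The "compaction" stop reason follows the description above, and buildRequest and taskIsDone are hypothetical helpers wrapping the Level 2 config and your own completion check:

const TOKEN_BUDGET = 3_000_000; // total input-token budget for the whole run
let compactions = 0;
let totalInputTokens = 0;

// Inside an async function:
while (true) {
  const response = await anthropic.messages.create(buildRequest(conversationHistory));
  totalInputTokens += response.usage.input_tokens;

  if (response.stop_reason === "compaction") {
    // The API paused right after compacting; note it and resume the task.
    compactions += 1;
    console.log(`compaction #${compactions}, ${totalInputTokens} input tokens so far`);
    continue;
  }

  conversationHistory.push({ role: "assistant", content: response.content });

  if (totalInputTokens > TOKEN_BUDGET) {
    // Over budget: inject a graceful shutdown message instead of crashing mid-task.
    conversationHistory.push({
      role: "user",
      content: "Token budget exhausted. Summarize your progress and stop.",
    });
  }

  if (taskIsDone(response)) break;
}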
Level 3: Custom Instructions
You can tell the compaction engine what to prioritize:
context_management: {
  enabled_tools: [{
    type: "compact",
    trigger: {
      type: "input_tokens",
      threshold: 100000
    },
    instruction: "Focus on preserving code snippets, variable names, and technical decisions."
  }]
}
Now when it summarizes, it knows to keep the code and drop the chit-chat. You can customize this for any use case: customer service, research, coding agents, whatever your agent does.
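For example, a customer-support agent might steer the summary differently; the instruction text here is purely illustrative:

// Illustrative instruction for a customer-service agent
instruction:
  "Preserve the customer's account details, open ticket IDs, promised follow-ups, " +
  "and the current resolution status. Drop greetings and small talk."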
Key Facts
- Server-side API — no client-side logic needed
- Reduced tokens by 58.6% in Anthropic's benchmarks
- Enables conversations up to 10 million tokens
- Works across Claude API, Bedrock, Vertex, and Foundry
- ZDR eligible for enterprise
- One parameter change to implement
- Beta header: compact-2026-12