Deep documentation on Anthropic's 1 million token context window for Claude Opus 4.6 and Sonnet 4.6 — architecture, degradation, cost, and strategies
Anthropic announced that the 1 million token context window is now generally available for both Claude Opus 4.6 and Claude Sonnet 4.6. Previously available as a limited beta, the expanded context window is now accessible to all API users and Claude Pro/Max subscribers.
The announcement carried several details that distinguish it from competing offerings: general availability on both flagship models rather than a gated beta, access for API users and Claude Pro/Max subscribers alike, and, most notably, the removal of the long-context pricing surcharge.
This move is widely interpreted as a competitive response to the broader industry trend toward massive context windows. OpenAI's GPT-4.5 offers a 1M token context, and Google's Gemini models have advertised 1M-2M token windows since mid-2024. However, raw window size alone is insufficient — what matters is how well the model actually uses the full context, a distinction that becomes critically important in the Context Degradation section below.
Anthropic's decision to remove the long-context surcharge is particularly significant for agentic workloads like Claude Code, where autonomous coding sessions can easily consume hundreds of thousands of tokens over the course of an investigation. The elimination of the premium makes long sessions financially viable in a way they were not before.
Large language models (LLMs) do not process raw text directly. Instead, text is broken into tokens — subword units that typically represent 3-4 characters of English text. A token might be a complete short word ("the", "cat"), a word fragment ("un-", "-tion"), or a single character for rare symbols. On average, 1 token is approximately 0.75 words in English, or conversely, 1 word is roughly 1.3 tokens.
The context window (also called context length or context size) is the maximum number of tokens a model can process in a single interaction. It includes everything: the system prompt, the user's message, all conversation history, any documents or code included in the prompt, and the model's own output. Think of it as the model's working memory — everything it can "see" and reason about at once.
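These ratios are rough heuristics rather than exact values for any particular tokenizer, but they are good enough for budgeting. A minimal sketch of the arithmetic, assuming the roughly-4-characters-per-token average quoted above:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb."""
    return max(1, len(text) // 4)


def context_report(system_prompt: str, history: list[str],
                   window: int = 1_000_000, reserved_output: int = 8_000) -> dict:
    """Estimate how much of the window is consumed.

    The window must hold the system prompt, the conversation history, and the
    model's own reply, so space for the reply is reserved up front.
    """
    used = estimate_tokens(system_prompt) + sum(estimate_tokens(m) for m in history)
    return {
        "estimated_tokens_used": used,
        "remaining_for_input": window - reserved_output - used,
        "utilization": round(used / window, 3),
    }


print(context_report("You are a coding assistant.", ["x" * 200_000]))
# -> roughly 50K tokens used, ~5% utilization of a 1M window
```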
A context window is like a desk. A small desk (4K tokens) lets you work with a single page. A large desk (1M tokens) lets you spread out an entire filing cabinet. But just because you can spread out 3,000 pages doesn't mean you'll remember the detail on page 1,487 while writing on page 2,903. The quality of attention across that desk matters enormously.
Context windows have grown exponentially over just a few years, from a few thousand tokens in early GPT-3-era models to the 100K-200K windows of 2023-2024 and now to the 1M-2M windows of the current generation.
Different use cases require different context sizes:
| Use Case | Typical Tokens Needed | Why It Matters |
|---|---|---|
| Simple Q&A | 1K - 5K | Any model handles this. Context size irrelevant. |
| Document summarization | 10K - 50K | Full papers, contracts, or reports in one pass. |
| Codebase analysis | 50K - 200K | Understanding cross-file dependencies, refactoring patterns. |
| Agentic coding sessions | 100K - 500K | Multi-step investigations accumulate tool calls, file reads, and reasoning. |
| Book-length analysis | 200K - 800K | Analyzing entire novels, legislation, or patent portfolios. |
| Full repository + history | 500K - 1M | Large monorepo analysis with git history and documentation. |
A larger context window does not automatically mean better results. As documented extensively in the Context Degradation section, models exhibit significant quality loss as context fills up. The 1M window is best understood as a maximum capacity, not an optimal operating point.
This is the most critical topic in the entire 1M context discussion. A larger context window is only valuable if the model can actually attend to and reason about its contents. Research and practical experience demonstrate significant degradation as context fills up.
The "Lost in the Middle" phenomenon, first documented in a landmark 2023 paper by Liu et al. (Stanford/UC Berkeley), describes a fundamental limitation of transformer-based language models: they attend most strongly to information at the beginning and end of their context window, while progressively ignoring information in the middle.
Transformer models use a mechanism called self-attention to determine which tokens in the context are relevant to generating the next token. The attention weights form a distribution across all input tokens, and empirical analysis shows this distribution is not uniform: tokens near the beginning and end of the window receive substantially more attention than tokens buried in the middle.
The practical consequence is a "middle dead zone": a stretch of the context that receives little attention, and one that expands as the window fills with more tokens.
Beyond the academic "Lost in the Middle" research, practitioners have identified a broader pattern called the "Dumb Zone" — a noticeable drop in overall model intelligence (not just recall) once context exceeds certain thresholds.
Dex Horthy published influential research suggesting that model performance begins to degrade meaningfully once context utilization exceeds approximately 40% of the window. For a 200K window, this means quality starts declining around 80K tokens. For a 1M window, the threshold is roughly 400K tokens.
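As a quick illustration of the arithmetic (the 40% figure is Horthy's observed heuristic, not an official Anthropic limit):

```python
DEGRADATION_FRACTION = 0.40  # Horthy's observed threshold, not an official limit


def degradation_threshold(window_tokens: int) -> int:
    """Token count beyond which quality reportedly starts to decline."""
    return int(window_tokens * DEGRADATION_FRACTION)


def past_threshold(tokens_used: int, window_tokens: int) -> bool:
    """True once context utilization crosses the heuristic threshold."""
    return tokens_used >= degradation_threshold(window_tokens)


print(degradation_threshold(200_000))    # 80000  (matches the ~80K figure above)
print(degradation_threshold(1_000_000))  # 400000 (matches the ~400K figure above)
print(past_threshold(120_000, 200_000))  # True: past the heuristic limit for a 200K window
```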
Horthy proposed the RPI Framework (Research, Plan, Implement) as a mitigation strategy — a disciplined workflow that involves frequent intentional compaction to keep context lean rather than letting it grow unchecked. The framework is detailed in the Community Strategies section.
A crucial finding: degradation is determined by how many tokens have been consumed, not by the size of the window. 90,000 tokens consumed produces the same quality degradation whether you are in a 100K window or a 1M window. The larger window gives you more room, but it does not make the model smarter at any given fill level.
Based on community observations and testing, context can be roughly divided into three quality zones: a full-quality zone at low utilization, a degraded zone beginning at roughly 40% of the window, and a severely degraded zone near the top of the usable range.
The community remains divided on whether recent model updates have addressed the dumb zone; reports from Hacker News and the wider Claude community paint a nuanced picture rather than a clear verdict.
Compaction (also called context summarization or context pruning) is the process by which a model or tool summarizes older conversation history to free up context space. It is the primary mechanism for extending sessions beyond the context limit, but it comes with significant trade-offs.
The 1M window means compaction triggers far less frequently. Many coding sessions complete entirely within 400-700K tokens, avoiding compaction altogether. When compaction does trigger, there is more room for a generous summary.
For truly long sessions (multi-hour autonomous coding), even 1M tokens will eventually be consumed. The compaction quality issue remains, just deferred. Additionally, the cost of operating at 700K+ tokens is substantial (see Cost Analysis).
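A minimal sketch of how a compaction step might be implemented against the Messages API. The threshold, model id, and summarization prompt are illustrative choices, not Anthropic's built-in compaction logic:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"      # placeholder; substitute the model you actually use


def rough_tokens(messages: list[dict]) -> int:
    """Cheap ~4-characters-per-token estimate; use the token counting endpoint
    if exact numbers matter."""
    return sum(len(str(m["content"])) for m in messages) // 4


def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Summarize everything except the most recent turns into a single message."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2_000,
        messages=[{
            "role": "user",
            "content": "Summarize this coding session so far, preserving file "
                       "names, decisions made, failed approaches, and open "
                       "tasks:\n\n" + transcript,
        }],
    ).content[0].text
    # Note: if the first retained message is also a user turn, the summary may
    # need to be merged into it to keep user/assistant roles alternating.
    return [{"role": "user", "content": "[Summary of earlier work]\n" + summary}] + recent


def maybe_compact(messages: list[dict], threshold_tokens: int = 400_000) -> list[dict]:
    """Compact only once the estimated context size crosses the threshold."""
    return compact(messages) if rough_tokens(messages) > threshold_tokens else messages
```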
While the removal of the long-context premium is welcome, operating at high context utilization remains expensive simply due to the volume of tokens being processed.
| Metric | Rate | At 700K Context |
|---|---|---|
| Input tokens | $15.00 / 1M tokens | $10.50 per call |
| Cache reads | $1.50 / 1M tokens | $1.05 per call |
| Output tokens | $75.00 / 1M tokens | $0.30 per call (4K output) |
| Cache writes | $18.75 / 1M tokens | $13.13 (first call only) |
There has been confusion in the community about whether cache reads count as "paid input tokens." Per Anthropic's prompt caching documentation: cache reads are charged at a reduced rate ($1.50/1M for Opus vs. $15.00/1M for uncached input). This represents a 90% discount. In agentic workflows where the same large context is reused across many tool calls, the majority of context hits the cache, making the effective per-call cost much lower than the raw input rate.
However, even at the cache rate, 700K tokens per call at $1.50/1M = $1.05 per call. Over 100 tool calls in a session, that is still $105 in cache reads alone, before accounting for output tokens.
For Claude Max subscribers using Claude Code, user reports indicate that the 1M context window burns through subscription usage limits rapidly.
Unlike many computational costs that benefit from economies of scale, LLM context costs scale linearly with size. An 800K token call costs exactly 8x what a 100K token call costs. There is no volume discount within a single call. This makes the economic argument for disciplined context management compelling: keeping context at 100-200K tokens through strategic trimming can reduce costs by 4-8x compared to letting context grow to fill the window.
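A small sketch of that arithmetic, using the Opus-class rates from the table above and ignoring cache writes and subscription accounting. It assumes the full context is served from the prompt cache on each call and stays at a fixed size, which is a simplification of a real agentic loop:

```python
INPUT_PER_M = 15.00        # $ per 1M uncached input tokens
CACHE_READ_PER_M = 1.50    # $ per 1M cached input tokens
OUTPUT_PER_M = 75.00       # $ per 1M output tokens


def call_cost(context_tokens: int, output_tokens: int = 4_000, cached: bool = True) -> float:
    """Cost of a single call at a given context size."""
    in_rate = CACHE_READ_PER_M if cached else INPUT_PER_M
    return context_tokens / 1e6 * in_rate + output_tokens / 1e6 * OUTPUT_PER_M


def session_cost(context_tokens: int, calls: int) -> float:
    """Total for a session of repeated calls over a cached context of fixed size."""
    return calls * call_cost(context_tokens)


print(call_cost(700_000))                 # ~$1.35  ($1.05 cache read + $0.30 output)
print(session_cost(700_000, calls=100))   # ~$135   for 100 calls at 700K context
print(session_cost(150_000, calls=100))   # ~$52.50 a lean context cuts the bill sharply
```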
The developer community has developed sophisticated strategies for working effectively with large context windows. These approaches recognize that the best use of a 1M window is often to not fill it all at once.
Developer jeremychone popularized a code mapping strategy that dramatically reduces context requirements through intelligent summarization:
# module/auth/handler.rs
## Summary
Handles HTTP authentication requests including login,
logout, and token refresh endpoints.
## When to Use
Reference this file when working on authentication
flows, session management, or JWT token handling.
## Public Types
- AuthRequest { username: String, password: String }
- AuthResponse { token: String, expires_at: DateTime }
- AuthError { code: u16, message: String }
## Public Functions
- handle_login(req: AuthRequest) -> Result<AuthResponse>
- handle_logout(token: &str) -> Result<()>
- refresh_token(token: &str) -> Result<AuthResponse>
- validate_token(token: &str) -> Result<Claims>
This approach reduced 381 context files (1.62 MB) to 5 files (27.90 KB) — a 98% reduction. Typical context usage dropped to 30-80K tokens even for large codebases. The code maps use pure markdown formatting, which LLMs interpret naturally without requiring any sed/grep tooling.
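A sketch of how such maps might be generated automatically. The section layout mirrors the example above, but the prompt wording, file selection, and output location are illustrative assumptions rather than jeremychone's actual tooling:

```python
import pathlib
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder model id

MAP_PROMPT = (
    "Produce a markdown code map for the file below with exactly these sections: "
    "Summary, When to Use, Public Types, Public Functions. Keep it under 40 lines "
    "and omit all implementation details.\n\nFile: {path}\n\n{source}"
)


def build_code_maps(src_dir: str, pattern: str = "**/*.rs",
                    out_dir: str = ".codemaps") -> None:
    """Write one <file>.map.md per source file; sessions load these instead of raw code."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in pathlib.Path(src_dir).glob(pattern):
        reply = client.messages.create(
            model=MODEL,
            max_tokens=1_000,
            messages=[{"role": "user",
                       "content": MAP_PROMPT.format(path=path, source=path.read_text())}],
        )
        (out / (path.name + ".map.md")).write_text(reply.content[0].text)


# build_code_maps("module/auth")  # e.g. handler.rs -> .codemaps/handler.rs.map.md
```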
Dex Horthy's Research, Plan, Implement framework structures AI-assisted development into distinct phases with intentional context boundaries:
- Research: The AI scans relevant files, reads documentation, and builds an understanding of the problem space. This is where context accumulates most rapidly.
- Plan: Findings are documented in a detailed, step-by-step plan file (PLAN.md). This serves as the "external memory" that survives context clearing.
- Implement: Clear context, then start a fresh session with only the plan file. The AI implements each step with a clean, undegraded context.
- After implementation, review the work and document learnings. This creates institutional knowledge for future sessions.
A broader practice that emerged from the RPI framework involves using persistent plan files as a core part of the development workflow: the plan, not the conversation transcript, is treated as the durable memory that each session reads from and updates.
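A sketch of the fresh-session pattern this implies: only the plan file, not the previous transcript, seeds the next session. The file name and prompt framing simply apply the conventions described above:

```python
import pathlib


def fresh_session_messages(plan_path: str = "PLAN.md") -> list[dict]:
    """Build the opening messages for a new session seeded only by the plan file.

    Nothing from the previous session's transcript is carried over; the plan is
    the sole "external memory" that survives clearing the context.
    """
    plan = pathlib.Path(plan_path).read_text()
    return [{
        "role": "user",
        "content": (
            "Below is the implementation plan produced in an earlier research and "
            "planning session. Execute the next unchecked step, reading only the "
            "files the plan identifies as relevant.\n\n" + plan
        ),
    }]
```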
Several tools have emerged to help manage context more effectively:
- A Rust-based runtime with Lua scripting for creating "AI packs": curated, compressed representations of codebases. It automates the code mapping approach with configurable summarization rules.
- An open-source alternative to Claude Code with configurable compaction. Users can define when and how compaction occurs, set custom summarization prompts, and control what gets preserved versus discarded.
- A VSCode extension that leverages the Language Server Protocol (LSP) to provide intelligent context selection. It understands code structure and can automatically include only the relevant symbols, types, and functions.
- A tool that uses Tree-sitter AST parsing to create "skeleton" representations of code files, preserving structure (function signatures, type definitions, class hierarchies) while stripping implementation details. It achieves 80-95% compression depending on the language.
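The skeleton idea can be approximated in a few lines with Python's standard ast module. This is a simplified, Python-only stand-in for the Tree-sitter approach; real tools support many languages, handle multi-line signatures, and preserve more structure:

```python
import ast


def skeleton(source: str) -> str:
    """Keep class/function signature lines and the first line of each docstring,
    dropping implementation bodies entirely."""
    src_lines = source.splitlines()
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            out.append(src_lines[node.lineno - 1].strip())  # the `class ...:` / `def ...:` line
            doc = ast.get_docstring(node)
            if doc:
                out.append('    """' + doc.splitlines()[0] + '"""')
    return "\n".join(out)


SAMPLE = '''
class AuthHandler:
    """Handles HTTP authentication requests."""

    def handle_login(self, req):
        """Validate credentials and issue a token."""
        token = issue_token(req)
        return token
'''

print(skeleton(SAMPLE))
# class AuthHandler:
#     """Handles HTTP authentication requests."""
# def handle_login(self, req):
#     """Validate credentials and issue a token."""
```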
Claude Code — Anthropic's official CLI tool for AI-assisted software development — is one of the primary beneficiaries of the 1M context window. The tool operates as an autonomous agent, reading files, running commands, and building up context over the course of a session.
Despite the expanded window, the core challenge remains: the model must maintain coherence after 200K tokens. Claude Code's effectiveness depends not just on having the context available, but on the model actively using it — remembering file contents it read 300 exchanges ago, maintaining consistency between its plan and its implementation, and not repeating failed approaches.
For most Claude Code sessions, the optimal strategy remains keeping context lean (100-200K tokens) through disciplined file reading, clear prompting, and occasional manual compaction. The 1M window provides a safety net for sessions that grow unexpectedly large, but it should not be the target operating point.
The 1M context landscape includes several competitors, but their real-world performance varies dramatically from their marketed capabilities.
| Model | Advertised Window | Usable Range | Long-Context Premium | Community Rating |
|---|---|---|---|---|
| Claude Opus 4.6 | 1,000,000 | ~700K effective | None | Best long-context quality |
| Claude Sonnet 4.6 | 1,000,000 | ~500-600K effective | None | Best value for long tasks |
| GPT-4.5 | 1,000,000 | ~300-400K effective | Tiered pricing | Mixed reports |
| Gemini 2.0 Pro | 2,000,000 | ~200-400K effective | Premium tier | Significant degradation |
| Codex 5.2 | 200,000 | ~150K effective | None | Strong post-compaction |
Model comparisons are inherently subjective and depend heavily on the specific task, prompt structure, and how context is organized. These assessments reflect aggregate community sentiment as of early 2026 and may not reflect your specific use case. Always benchmark with your own workloads.
Community reaction to the 1M context window has been largely positive but nuanced, and three distinct camps have emerged. The first camp emphasizes the practical benefits of reduced compaction friction and the ability to tackle larger problems. The second argues that large context windows are unnecessary for disciplined workflows and may actually harm quality. The third acknowledges both the benefits and limitations, advocating for a middle path.
The trade-offs between context size, quality, and cost explored above reduce to a handful of practical guidelines:
- Generate per-file summaries (summary, when_to_use, public_types, public_functions) for your codebase, and load these maps instead of raw source files when starting a session. Only pull in full file contents when the model needs to edit or deeply understand specific code.
- Structure work into Research, Plan, Implement phases with context clearing between them. Write detailed PLAN.md files during the planning phase, and start implementation with a fresh context that loads only the plan.
- Stay aware of how much context you have consumed. If you notice the model starting to repeat itself, forget earlier instructions, or produce lower-quality output, it may be time to compact or restart with a focused context.
- Remember that cost scales linearly with context size. For routine tasks, keep context lean (50-100K tokens); reserve deep context for genuinely complex, cross-cutting investigations where the accumulated understanding is irreplaceable.
- Don't wait for automatic compaction. When you finish a phase of work, manually summarize the key findings and start fresh; this gives you control over what the model remembers versus what it forgets.
- Place the most important information (system prompts, architectural guidelines, critical constraints) at the very beginning of the context, where attention is strongest, and put reference material that only needs occasional lookup in the middle, as in the sketch below.
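A minimal sketch of that ordering for an agentic coding request; the section labels are illustrative, and the system field is simply the earliest part of the context the model sees:

```python
def assemble_request(system_rules: str, constraints: str,
                     reference_docs: list[str], task: str) -> dict:
    """Order the context to match the attention curve: critical guidance first,
    occasional-lookup reference in the middle, the immediate task at the end."""
    return {
        "system": system_rules + "\n\nHard constraints:\n" + constraints,  # start: strongest attention
        "messages": [{
            "role": "user",
            "content": (
                "Reference material (lookup only):\n"
                + "\n\n".join(reference_docs)           # middle: weakest attention
                + "\n\nCurrent task:\n" + task          # end: attention recovers
            ),
        }],
    }
```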
| Scenario | Recommended Approach | Target Context Size |
|---|---|---|
| Simple bug fix | Minimal context — relevant file + error | 10-30K tokens |
| Feature implementation | Code maps + plan + relevant files | 50-100K tokens |
| Cross-cutting refactor | Code maps + RPI framework | 100-200K tokens |
| Complex investigation | Let context grow naturally, compact at ~400K | 200-400K tokens |
| Full codebase migration | Phased approach with persistent plans | Multiple sessions, 200K each |
| Deep autonomous session | Full context with monitoring | Up to 700K tokens |
Looking ahead, context windows will almost certainly continue to grow, but the gap between advertised capacity and the capacity models can actually use well is likely to remain the defining issue.
The 1M token context window for Claude Opus 4.6 and Sonnet 4.6, combined with the elimination of long-context surcharges, is a meaningful advance that removes friction and enables new workflows. However, it does not eliminate the fundamental challenges of attention degradation and cost scaling. The developers who benefit most will be those who treat the 1M window as headroom rather than a target — using disciplined context management while appreciating the safety net of extra capacity when complex problems demand it.