Deep documentation on Anthropic's 1 million token context window for Claude Opus 4.6 and Sonnet 4.6 — architecture, degradation, cost, and strategies
Anthropic announced that the 1 million token context window is now generally available for both Claude Opus 4.6 and Claude Sonnet 4.6. Previously available as a limited beta, the expanded context window is now accessible to all API users and Claude Pro/Max subscribers.
The announcement carried several details that distinguish it from competing offerings: general availability on both flagship models rather than a gated beta, access for API users and Claude Pro/Max subscribers alike, and, most notably, the removal of the long-context pricing surcharge.
This move is widely interpreted as a competitive response to the broader industry trend toward massive context windows. OpenAI's GPT-4.5 offers a 1M token context, and Google's Gemini models have advertised 1M-2M token windows since mid-2024. However, raw window size alone is insufficient — what matters is how well the model actually uses the full context, a distinction that becomes critically important in the Context Degradation section below.
Anthropic's decision to remove the long-context surcharge is particularly significant for agentic workloads like Claude Code, where autonomous coding sessions can easily consume hundreds of thousands of tokens over the course of an investigation. The elimination of the premium makes long sessions financially viable in a way they were not before.
Large language models (LLMs) do not process raw text directly. Instead, text is broken into tokens — subword units that typically represent 3-4 characters of English text. A token might be a complete short word ("the", "cat"), a word fragment ("un-", "-tion"), or a single character for rare symbols. On average, 1 token is approximately 0.75 words in English, or conversely, 1 word is roughly 1.3 tokens.
The context window (also called context length or context size) is the maximum number of tokens a model can process in a single interaction. It includes everything: the system prompt, the user's message, all conversation history, any documents or code included in the prompt, and the model's own output. Think of it as the model's working memory — everything it can "see" and reason about at once.
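These ratios are rough heuristics rather than exact values for any particular tokenizer, but they are good enough for budgeting. A minimal sketch of the arithmetic, assuming the roughly-4-characters-per-token average quoted above:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters-per-token rule of thumb."""
    return max(1, len(text) // 4)


def context_report(system_prompt: str, history: list[str],
                   window: int = 1_000_000, reserved_output: int = 8_000) -> dict:
    """Estimate how much of the window is consumed.

    The window must hold the system prompt, the conversation history, and the
    model's own reply, so space for the reply is reserved up front.
    """
    used = estimate_tokens(system_prompt) + sum(estimate_tokens(m) for m in history)
    return {
        "estimated_tokens_used": used,
        "remaining_for_input": window - reserved_output - used,
        "utilization": round(used / window, 3),
    }


print(context_report("You are a coding assistant.", ["x" * 200_000]))
# -> roughly 50K tokens used, ~5% utilization of a 1M window
```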
A context window is like a desk. A small desk (4K tokens) lets you work with a single page. A large desk (1M tokens) lets you spread out an entire filing cabinet. But just because you can spread out 3,000 pages doesn't mean you'll remember the detail on page 1,487 while writing on page 2,903. The quality of attention across that desk matters enormously.
Context windows have grown exponentially over just a few years, from a few thousand tokens in early GPT-3-era models to the 100K-200K windows of 2023-2024 and now to the 1M-2M windows of the current generation.
Different use cases require different context sizes:
| Use Case | Typical Tokens Needed | Why It Matters |
|---|---|---|
| Simple Q&A | 1K - 5K | Any model handles this. Context size irrelevant. |
| Document summarization | 10K - 50K | Full papers, contracts, or reports in one pass. |
| Codebase analysis | 50K - 200K | Understanding cross-file dependencies, refactoring patterns. |
| Agentic coding sessions | 100K - 500K | Multi-step investigations accumulate tool calls, file reads, and reasoning. |
| Book-length analysis | 200K - 800K | Analyzing entire novels, legislation, or patent portfolios. |
| Full repository + history | 500K - 1M | Large monorepo analysis with git history and documentation. |
A larger context window does not automatically mean better results. As documented extensively in the Context Degradation section, models exhibit significant quality loss as context fills up. The 1M window is best understood as a maximum capacity, not an optimal operating point.
This is the most critical topic in the entire 1M context discussion. A larger context window is only valuable if the model can actually attend to and reason about its contents. Research and practical experience demonstrate significant degradation as context fills up.
The "Lost in the Middle" phenomenon, first documented in a landmark 2023 paper by Liu et al. (Stanford/UC Berkeley), describes a fundamental limitation of transformer-based language models: they attend most strongly to information at the beginning and end of their context window, while progressively ignoring information in the middle.
Transformer models use a mechanism called self-attention to determine which tokens in the context are relevant to generating the next token. The attention weights form a distribution across all input tokens, and empirical analysis shows this distribution is not uniform: tokens near the beginning and end of the window receive substantially more attention than tokens buried in the middle.
The practical consequence is a "middle dead zone": a stretch of the context that receives little attention, and one that expands as the window fills with more tokens.
Beyond the academic "Lost in the Middle" research, practitioners have identified a broader pattern called the "Dumb Zone" — a noticeable drop in overall model intelligence (not just recall) once context exceeds certain thresholds.
Dex Horthy published influential research suggesting that model performance begins to degrade meaningfully once context utilization exceeds approximately 40% of the window. For a 200K window, this means quality starts declining around 80K tokens. For a 1M window, the threshold is roughly 400K tokens.
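As a quick illustration of the arithmetic (the 40% figure is Horthy's observed heuristic, not an official Anthropic limit):

```python
DEGRADATION_FRACTION = 0.40  # Horthy's observed threshold, not an official limit


def degradation_threshold(window_tokens: int) -> int:
    """Token count beyond which quality reportedly starts to decline."""
    return int(window_tokens * DEGRADATION_FRACTION)


def past_threshold(tokens_used: int, window_tokens: int) -> bool:
    """True once context utilization crosses the heuristic threshold."""
    return tokens_used >= degradation_threshold(window_tokens)


print(degradation_threshold(200_000))    # 80000  (matches the ~80K figure above)
print(degradation_threshold(1_000_000))  # 400000 (matches the ~400K figure above)
print(past_threshold(120_000, 200_000))  # True: past the heuristic limit for a 200K window
```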
Horthy proposed the RPI Framework (Research, Plan, Implement) as a mitigation strategy — a disciplined workflow that involves frequent intentional compaction to keep context lean rather than letting it grow unchecked. The framework is detailed in the Community Strategies section.
A crucial finding: degradation is determined by how many tokens have been consumed, not by the size of the window. 90,000 tokens consumed produces the same quality degradation whether you are in a 100K window or a 1M window. The larger window gives you more room, but it does not make the model smarter at any given fill level.
Based on community observations and testing, context can be roughly divided into three quality zones: a full-quality zone at low utilization, a degraded zone beginning at roughly 40% of the window, and a severely degraded zone near the top of the usable range.
The community remains divided on whether recent model updates have addressed the dumb zone; reports from Hacker News and the wider Claude community paint a nuanced picture rather than a clear verdict.
Compaction (also called context summarization or context pruning) is the process by which a model or tool summarizes older conversation history to free up context space. It is the primary mechanism for extending sessions beyond the context limit, but it comes with significant trade-offs.
The 1M window means compaction triggers far less frequently. Many coding sessions complete entirely within 400-700K tokens, avoiding compaction altogether. When compaction does trigger, there is more room for a generous summary.
For truly long sessions (multi-hour autonomous coding), even 1M tokens will eventually be consumed. The compaction quality issue remains, just deferred. Additionally, the cost of operating at 700K+ tokens is substantial (see Cost Analysis).
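A minimal sketch of how a compaction step might be implemented against the Messages API. The threshold, model id, and summarization prompt are illustrative choices, not Anthropic's built-in compaction logic:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"      # placeholder; substitute the model you actually use


def rough_tokens(messages: list[dict]) -> int:
    """Cheap ~4-characters-per-token estimate; use the token counting endpoint
    if exact numbers matter."""
    return sum(len(str(m["content"])) for m in messages) // 4


def compact(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Summarize everything except the most recent turns into a single message."""
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2_000,
        messages=[{
            "role": "user",
            "content": "Summarize this coding session so far, preserving file "
                       "names, decisions made, failed approaches, and open "
                       "tasks:\n\n" + transcript,
        }],
    ).content[0].text
    # Note: if the first retained message is also a user turn, the summary may
    # need to be merged into it to keep user/assistant roles alternating.
    return [{"role": "user", "content": "[Summary of earlier work]\n" + summary}] + recent


def maybe_compact(messages: list[dict], threshold_tokens: int = 400_000) -> list[dict]:
    """Compact only once the estimated context size crosses the threshold."""
    return compact(messages) if rough_tokens(messages) > threshold_tokens else messages
```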
While the removal of the long-context premium is welcome, operating at high context utilization remains expensive simply due to the volume of tokens being processed.
| Metric | Rate | At 700K Context |
|---|---|---|
| Input tokens | $15.00 / 1M tokens | $10.50 per call |
| Cache reads | $1.50 / 1M tokens | $1.05 per call |
| Output tokens | $75.00 / 1M tokens | $0.30 per call (4K output) |
| Cache writes | $18.75 / 1M tokens | $13.13 (first call only) |
There has been confusion in the community about whether cache reads count as "paid input tokens." Per Anthropic's prompt caching documentation: cache reads are charged at a reduced rate ($1.50/1M for Opus vs. $15.00/1M for uncached input). This represents a 90% discount. In agentic workflows where the same large context is reused across many tool calls, the majority of context hits the cache, making the effective per-call cost much lower than the raw input rate.
However, even at the cache rate, 700K tokens per call at $1.50/1M = $1.05 per call. Over 100 tool calls in a session, that is still $105 in cache reads alone, before accounting for output tokens.
For Claude Max subscribers using Claude Code, user reports indicate that the 1M context window burns through subscription usage limits rapidly.
Unlike many computational costs that benefit from economies of scale, LLM context costs scale linearly with size. An 800K token call costs exactly 8x what a 100K token call costs. There is no volume discount within a single call. This makes the economic argument for disciplined context management compelling: keeping context at 100-200K tokens through strategic trimming can reduce costs by 4-8x compared to letting context grow to fill the window.
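A small sketch of that arithmetic, using the Opus-class rates from the table above and ignoring cache writes and subscription accounting. It assumes the full context is served from the prompt cache on each call and stays at a fixed size, which is a simplification of a real agentic loop:

```python
INPUT_PER_M = 15.00        # $ per 1M uncached input tokens
CACHE_READ_PER_M = 1.50    # $ per 1M cached input tokens
OUTPUT_PER_M = 75.00       # $ per 1M output tokens


def call_cost(context_tokens: int, output_tokens: int = 4_000, cached: bool = True) -> float:
    """Cost of a single call at a given context size."""
    in_rate = CACHE_READ_PER_M if cached else INPUT_PER_M
    return context_tokens / 1e6 * in_rate + output_tokens / 1e6 * OUTPUT_PER_M


def session_cost(context_tokens: int, calls: int) -> float:
    """Total for a session of repeated calls over a cached context of fixed size."""
    return calls * call_cost(context_tokens)


print(call_cost(700_000))                 # ~$1.35  ($1.05 cache read + $0.30 output)
print(session_cost(700_000, calls=100))   # ~$135   for 100 calls at 700K context
print(session_cost(150_000, calls=100))   # ~$52.50 a lean context cuts the bill sharply
```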
The developer community has developed sophisticated strategies for working effectively with large context windows. These approaches recognize that the best use of a 1M window is often to not fill it all at once.
Developer jeremychone popularized a code mapping strategy that dramatically reduces context requirements through intelligent summarization:
# module/auth/handler.rs
## Summary
Handles HTTP authentication requests including login,
logout, and token refresh endpoints.
## When to Use
Reference this file when working on authentication
flows, session management, or JWT token handling.
## Public Types
- AuthRequest { username: String, password: String }
- AuthResponse { token: String, expires_at: DateTime }
- AuthError { code: u16, message: String }
## Public Functions
- handle_login(req: AuthRequest) -> Result<AuthResponse>
- handle_logout(token: &str) -> Result<()>
- refresh_token(token: &str) -> Result<AuthResponse>
- validate_token(token: &str) -> Result<Claims>
This approach reduced 381 context files (1.62 MB) to 5 files (27.90 KB) — a 98% reduction. Typical context usage dropped to 30-80K tokens even for large codebases. The code maps use pure markdown formatting, which LLMs interpret naturally without requiring any sed/grep tooling.
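A sketch of how such maps might be generated automatically. The section layout mirrors the example above, but the prompt wording, file selection, and output location are illustrative assumptions rather than jeremychone's actual tooling:

```python
import pathlib
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder model id

MAP_PROMPT = (
    "Produce a markdown code map for the file below with exactly these sections: "
    "Summary, When to Use, Public Types, Public Functions. Keep it under 40 lines "
    "and omit all implementation details.\n\nFile: {path}\n\n{source}"
)


def build_code_maps(src_dir: str, pattern: str = "**/*.rs",
                    out_dir: str = ".codemaps") -> None:
    """Write one <file>.map.md per source file; sessions load these instead of raw code."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in pathlib.Path(src_dir).glob(pattern):
        reply = client.messages.create(
            model=MODEL,
            max_tokens=1_000,
            messages=[{"role": "user",
                       "content": MAP_PROMPT.format(path=path, source=path.read_text())}],
        )
        (out / (path.name + ".map.md")).write_text(reply.content[0].text)


# build_code_maps("module/auth")  # e.g. handler.rs -> .codemaps/handler.rs.map.md
```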
Dex Horthy's Research, Plan, Implement framework structures AI-assisted development into distinct phases with intentional context boundaries:
- Research: The AI scans relevant files, reads documentation, and builds an understanding of the problem space. This is where context accumulates most rapidly.
- Plan: Findings are documented in a detailed, step-by-step plan file (PLAN.md). This serves as the "external memory" that survives context clearing.
- Implement: Clear context, then start a fresh session with only the plan file. The AI implements each step with a clean, undegraded context.
- After implementation, review the work and document learnings. This creates institutional knowledge for future sessions.
A broader practice that emerged from the RPI framework involves using persistent plan files as a core part of the development workflow: the plan, not the conversation transcript, is treated as the durable memory that each session reads from and updates.
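A sketch of the fresh-session pattern this implies: only the plan file, not the previous transcript, seeds the next session. The file name and prompt framing simply apply the conventions described above:

```python
import pathlib


def fresh_session_messages(plan_path: str = "PLAN.md") -> list[dict]:
    """Build the opening messages for a new session seeded only by the plan file.

    Nothing from the previous session's transcript is carried over; the plan is
    the sole "external memory" that survives clearing the context.
    """
    plan = pathlib.Path(plan_path).read_text()
    return [{
        "role": "user",
        "content": (
            "Below is the implementation plan produced in an earlier research and "
            "planning session. Execute the next unchecked step, reading only the "
            "files the plan identifies as relevant.\n\n" + plan
        ),
    }]
```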
Several tools have emerged to help manage context more effectively:
- A Rust-based runtime with Lua scripting for creating "AI packs": curated, compressed representations of codebases. It automates the code mapping approach with configurable summarization rules.
- An open-source alternative to Claude Code with configurable compaction. Users can define when and how compaction occurs, set custom summarization prompts, and control what gets preserved versus discarded.
- A VSCode extension that leverages the Language Server Protocol (LSP) to provide intelligent context selection. It understands code structure and can automatically include only the relevant symbols, types, and functions.
- A tool that uses Tree-sitter AST parsing to create "skeleton" representations of code files, preserving structure (function signatures, type definitions, class hierarchies) while stripping implementation details. It achieves 80-95% compression depending on the language.
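The skeleton idea can be approximated in a few lines with Python's standard ast module. This is a simplified, Python-only stand-in for the Tree-sitter approach; real tools support many languages, handle multi-line signatures, and preserve more structure:

```python
import ast


def skeleton(source: str) -> str:
    """Keep class/function signature lines and the first line of each docstring,
    dropping implementation bodies entirely."""
    src_lines = source.splitlines()
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            out.append(src_lines[node.lineno - 1].strip())  # the `class ...:` / `def ...:` line
            doc = ast.get_docstring(node)
            if doc:
                out.append('    """' + doc.splitlines()[0] + '"""')
    return "\n".join(out)


SAMPLE = '''
class AuthHandler:
    """Handles HTTP authentication requests."""

    def handle_login(self, req):
        """Validate credentials and issue a token."""
        token = issue_token(req)
        return token
'''

print(skeleton(SAMPLE))
# class AuthHandler:
#     """Handles HTTP authentication requests."""
# def handle_login(self, req):
#     """Validate credentials and issue a token."""
```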
Claude Code — Anthropic's official CLI tool for AI-assisted software development — is one of the primary beneficiaries of the 1M context window. The tool operates as an autonomous agent, reading files, running commands, and building up context over the course of a session.
Despite the expanded window, the core challenge remains: the model must maintain coherence after 200K tokens. Claude Code's effectiveness depends not just on having the context available, but on the model actively using it — remembering file contents it read 300 exchanges ago, maintaining consistency between its plan and its implementation, and not repeating failed approaches.
For most Claude Code sessions, the optimal strategy remains keeping context lean (100-200K tokens) through disciplined file reading, clear prompting, and occasional manual compaction. The 1M window provides a safety net for sessions that grow unexpectedly large, but it should not be the target operating point.
The 1M context landscape includes several competitors, but their real-world performance varies dramatically from their marketed capabilities.
| Model | Advertised Window | Usable Range | Long-Context Premium | Community Rating |
|---|---|---|---|---|
| Claude Opus 4.6 | 1,000,000 | ~700K effective | None | Best long-context quality |
| Claude Sonnet 4.6 | 1,000,000 | ~500-600K effective | None | Best value for long tasks |
| GPT-4.5 | 1,000,000 | ~300-400K effective | Tiered pricing | Mixed reports |
| Gemini 2.0 Pro | 2,000,000 | ~200-400K effective | Premium tier | Significant degradation |
| Codex 5.2 | 200,000 | ~150K effective | None | Strong post-compaction |
Model comparisons are inherently subjective and depend heavily on the specific task, prompt structure, and how context is organized. These assessments reflect aggregate community sentiment as of early 2026 and may not reflect your specific use case. Always benchmark with your own workloads.
Community reaction to the 1M context window has been largely positive but nuanced, and three distinct camps have emerged. The first camp emphasizes the practical benefits of reduced compaction friction and the ability to tackle larger problems. The second argues that large context windows are unnecessary for disciplined workflows and may actually harm quality. The third acknowledges both the benefits and limitations, advocating for a middle path.
The trade-offs between context size, quality, and cost explored above reduce to a handful of practical guidelines:
- Generate per-file summaries (summary, when_to_use, public_types, public_functions) for your codebase, and load these maps instead of raw source files when starting a session. Only pull in full file contents when the model needs to edit or deeply understand specific code.
- Structure work into Research, Plan, Implement phases with context clearing between them. Write detailed PLAN.md files during the planning phase, and start implementation with a fresh context that loads only the plan.
- Stay aware of how much context you have consumed. If you notice the model starting to repeat itself, forget earlier instructions, or produce lower-quality output, it may be time to compact or restart with a focused context.
- Remember that cost scales linearly with context size. For routine tasks, keep context lean (50-100K tokens); reserve deep context for genuinely complex, cross-cutting investigations where the accumulated understanding is irreplaceable.
- Don't wait for automatic compaction. When you finish a phase of work, manually summarize the key findings and start fresh; this gives you control over what the model remembers versus what it forgets.
- Place the most important information (system prompts, architectural guidelines, critical constraints) at the very beginning of the context, where attention is strongest, and put reference material that only needs occasional lookup in the middle, as in the sketch below.
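A minimal sketch of that ordering for an agentic coding request; the section labels are illustrative, and the system field is simply the earliest part of the context the model sees:

```python
def assemble_request(system_rules: str, constraints: str,
                     reference_docs: list[str], task: str) -> dict:
    """Order the context to match the attention curve: critical guidance first,
    occasional-lookup reference in the middle, the immediate task at the end."""
    return {
        "system": system_rules + "\n\nHard constraints:\n" + constraints,  # start: strongest attention
        "messages": [{
            "role": "user",
            "content": (
                "Reference material (lookup only):\n"
                + "\n\n".join(reference_docs)           # middle: weakest attention
                + "\n\nCurrent task:\n" + task          # end: attention recovers
            ),
        }],
    }
```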
| Scenario | Recommended Approach | Target Context Size |
|---|---|---|
| Simple bug fix | Minimal context — relevant file + error | 10-30K tokens |
| Feature implementation | Code maps + plan + relevant files | 50-100K tokens |
| Cross-cutting refactor | Code maps + RPI framework | 100-200K tokens |
| Complex investigation | Let context grow naturally, compact at ~400K | 200-400K tokens |
| Full codebase migration | Phased approach with persistent plans | Multiple sessions, 200K each |
| Deep autonomous session | Full context with monitoring | Up to 700K tokens |
Looking ahead, context windows will almost certainly continue to grow, but the gap between advertised capacity and the capacity models can actually use well is likely to remain the defining issue.
The 1M token context window for Claude Opus 4.6 and Sonnet 4.6, combined with the elimination of long-context surcharges, is a meaningful advance that removes friction and enables new workflows. However, it does not eliminate the fundamental challenges of attention degradation and cost scaling. The developers who benefit most will be those who treat the 1M window as headroom rather than a target — using disciplined context management while appreciating the safety net of extra capacity when complex problems demand it.