I've been writing about context management for over a month now. How context loss quietly breaks AI sessions. How externalizing state into .md files gives you reliable cross-session memory. How those same patterns map onto formal memory architectures. How multi-agent teams need explicit context boundaries. And most recently, how context custodianship is becoming an organizational function.

All of that was about what we as practitioners can do. This post is about what the platforms are doing — because the infrastructure underneath is changing fast, and the choices each provider makes about context shape what's possible for everyone building on top.

I've spent the last few weeks going deep on how Gemini, ChatGPT, Claude and Grok handle context windows, caching, memory persistence, and agent orchestration as of March 2026. The differences are more significant than most people realize, and — if you're based in Europe like I am — some of them are deal-breakers.

The numbers have converged. The architectures haven't.

Three years ago, GPT-4 launched with an 8k token window. Today, every major platform offers at least 200k, and the leaders are at 1–2 million. That's a 250× expansion in three years. The context window arms race is effectively over — everyone has "enough" raw capacity for most workflows.

But raw capacity was never the real problem. I wrote about this in my first context post: having a large window and using it effectively are completely different things. The "lost in the middle" phenomenon means models prioritize information at the start and end of the context, while everything in the middle fades. Google's own Needle-in-Haystack testing on Gemini 1.5 Pro showed impressive 99.7% recall at 1M tokens, but that's a synthetic benchmark. In practice, with messy multi-turn conversations and mixed tool outputs, effective recall is always lower than theoretical capacity.

What actually differentiates the platforms now is how they manage the context they have. Caching strategies, compaction, memory persistence, awareness of remaining budget — the architectural decisions underneath the token count.

Here's where each platform stands.

Gemini: the raw capacity leader

Google has positioned Gemini as the context-first platform since 1.5 Pro introduced a million tokens in 2024. By March 2026, Gemini 3 Pro offers 1M natively, with the older 1.5 Pro still available at 2M via API — the largest window of any mainstream model. Flash variants trade window size for speed: Gemini 3 Flash runs at 200k but is significantly faster for agentic pipelines.

The real Gemini story, though, is caching. Google introduced context caching in May 2024 and added implicit caching a year later. The combination is the most developer-friendly in the industry. Implicit caching works automatically — if your request shares a common prefix with a recent one, you get a 75% token discount with zero code changes. Explicit caching lets you lock in the discount by pre-caching content like a large document or system prompt, with retention up to 24 hours.
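To make the prefix mechanics concrete, here's a minimal simulation — not the Gemini API, just a sketch with a crude word-count token proxy and the 75% discount mentioned above — of why implicit caching rewards keeping volatile content at the end of the prompt:

```python
def billable_tokens(prev_request: list[str], new_request: list[str],
                    discount: float = 0.75) -> float:
    """Estimate billable input tokens under an implicit prefix cache:
    leading blocks shared with a recent request get discounted."""
    shared = 0
    for prev, new in zip(prev_request, new_request):
        if prev != new:
            break                      # caching stops at the first divergence
        shared += len(prev.split())    # crude token proxy: whitespace words
    total = sum(len(block.split()) for block in new_request)
    return (total - shared) + shared * (1 - discount)

# Stable prefix (system prompt + corpus) first, volatile question last:
doc = ["You are a contract reviewer.", "<corpus text here>"]
ask1 = doc + ["Is clause 4 enforceable?"]
ask2 = doc + ["Summarise the indemnity terms."]
```

Put the question before the corpus instead and the shared prefix vanishes, along with the discount — which is why prompt ordering is a first-class cost decision on Gemini.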

In March 2026, Google shipped cross-tool context circulation, which is a significant step for agentic workflows. Every tool call and its response now stays in the model's context, so subsequent steps can reason over prior tool outputs without re-fetching. Combined with the ability to mix built-in tools (Google Search, Google Maps) with custom function declarations in a single request, this reduces orchestration overhead considerably.

The gap is reliability. In my experience and in consistent developer reports, Gemini's tool-calling behaviour can be unpredictable in complex chains. It's smart on paper, and I've seen impressive results, but you still want a fallback strategy for production agent workflows. If Gemini 3 Pro closes this gap, Google is in a very strong position.

ChatGPT: memory over window size

OpenAI took a different path. Rather than chasing raw token records, they focused on practical memory and making the existing window work harder.

GPT-5.4 defaults to 272k tokens with an opt-in path to 1M at double the cost. That's more than enough for most workflows. The more interesting move was in April 2025, when ChatGPT began indexing all past conversations for Plus and Pro subscribers — not just explicit "saved memories" but the actual content of every prior chat. Memory now operates on three layers: saved memories (explicit facts you tell it to remember), full conversation recall (automatic, Plus/Pro only), and project-scoped memory (isolated per project to prevent cross-contamination).

For long-running agents like OpenAI Codex, the platform uses auto-compaction: when context hits roughly 90% usage, it rewrites history into an initial context block plus a handoff summary, then resumes seamlessly. This is essentially the automated version of the HANDOVER.md pattern I've been doing manually — and it's nice to see it validated at the platform level.
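The shape of that compaction step is easy to sketch. This is my own toy version, not OpenAI's implementation; the summarizer is a stub where production would make a model call:

```python
def maybe_compact(history: list[str], budget: int, threshold: float = 0.9,
                  summarize=lambda turns: f"[summary of {len(turns)} turns]"):
    """Compaction sketch: past the usage threshold, fold older turns into
    a handoff summary and keep the most recent turns verbatim."""
    used = sum(len(turn.split()) for turn in history)  # crude token estimate
    if used < threshold * budget:
        return history                 # plenty of room, leave history alone
    older, recent = history[:-2], history[-2:]
    return [summarize(older)] + recent
```

The point is the trigger, not the summary: compaction fires automatically near the limit instead of waiting for the session to fail.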

Prompt caching works automatically for requests over 1,024 tokens, with up to 90% cost reduction and 80% latency improvement. Extended caching with the prompt_cache_key parameter pushes retention to 24 hours and can dramatically improve cache hit rates.

But here's the thing that matters most to me personally: full conversation memory is unavailable in the EU, UK, Switzerland, Norway, Iceland and Liechtenstein. GDPR constraints mean European users only get the explicit saved memories feature. For a Swedish practitioner, that's the headline — the most compelling feature of OpenAI's memory architecture is the one I can't use.

Claude: the context-aware agent

Anthropic's approach stands out architecturally. Rather than maximizing raw token count, Claude is designed to understand its own remaining context and act on that awareness.

Claude Opus 4.6 and Sonnet 4.6 both offer 1M native context. But the differentiator is the token budget system. At conversation start, the model receives its total budget. After each tool call, it gets a live update of remaining tokens. This trains the model to pace its work — using context precisely rather than padding responses or losing track of progress. For long-running agent sessions, this matters enormously. It's the difference between an agent that "runs until it hits the wall" and one that knows the wall is coming and adjusts.
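A budget-aware loop looks roughly like this. This is a simulation of the behaviour, not Anthropic's mechanism; the costs and the 10% reserve are hypothetical:

```python
def run_agent(planned_calls, budget: int, reserve_frac: float = 0.1):
    """Budget-aware loop sketch: after each tool call the agent sees the
    remaining budget and skips work that would eat its wrap-up reserve."""
    remaining = budget
    reserve = int(budget * reserve_frac)   # tokens held back for the final answer
    log = []
    for name, cost in planned_calls:       # (tool_name, estimated token cost)
        if remaining - cost < reserve:
            log.append(f"skipped {name}: only {remaining} tokens left")
            continue
        remaining -= cost
        log.append(f"ran {name}: {remaining} tokens left")
    return log, remaining
```

An agent without the live updates runs the third call anyway and truncates mid-answer; one with them skips it and still has room to conclude.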

The context management toolkit is the most sophisticated of the four platforms. Server-side compaction (in beta for Opus and Sonnet 4.6) automatically summarizes earlier turns to extend conversations beyond the window. Tool result clearing reclaims context budget by discarding old tool outputs. Thinking block management strips extended thinking from prior turns to save overhead. And interleaved thinking — reasoning between consecutive tool calls rather than only before the first one — enables multi-hop chains: think, call tool, think about the result, call the next tool, synthesize.
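Tool result clearing, in particular, is simple to picture. A toy version (my own sketch, not Anthropic's API; it assumes a flat message list with `role` fields):

```python
def clear_old_tool_results(messages, keep_last: int = 2,
                           placeholder: str = "[tool output cleared]"):
    """Clearing sketch: blank all but the newest tool outputs so the
    tokens they occupied can be reclaimed from the context budget."""
    tool_turns = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    for i in tool_turns[:-keep_last]:      # assumes keep_last >= 1
        messages[i] = {"role": "tool", "content": placeholder}
    return messages
```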

For cross-session memory, Claude takes a file-based approach: agents write to a /memories directory, and future sessions load relevant memories on demand. It's more developer-controlled than ChatGPT's automatic recall, but critically, it's GDPR-compliant and available to EU users. As someone who builds workflows where context needs to survive between sessions, this is the approach I can actually rely on from Sweden.
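The pattern is easy to reproduce outside Claude too. A toy equivalent, with plain Markdown files in a local `memories/` directory standing in for `/memories` (the function names are mine, not Anthropic's):

```python
from pathlib import Path

MEMORY_DIR = Path("memories")    # local stand-in for the /memories directory

def save_memory(topic: str, note: str) -> None:
    """Append one remembered fact to a per-topic Markdown file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    with (MEMORY_DIR / f"{topic}.md").open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def load_memories(topic: str) -> list[str]:
    """A future session loads only the topic it needs, on demand."""
    path = MEMORY_DIR / f"{topic}.md"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [line[2:].rstrip("\n") for line in f if line.startswith("- ")]
```

Because the memories are just files, they're inspectable, editable, and portable between tools — the property that makes this approach auditable in a way opaque recall isn't.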

Anthropic's engineering blog has also published the most explicit treatment of agentic context design among the four platforms — including the structured note-taking pattern where agents maintain their own memory files. That's essentially what I've been doing with .md files since February, so I may be biased, but the philosophy resonates.

Grok: the price disruptor

xAI entered the context race late but went straight for the extremes. Grok 4.1 Fast, launched in November 2025, offers a 2 million token context window at $0.20 per million input tokens. That's roughly 15× cheaper than Claude Sonnet and 12× cheaper than GPT-5.4 for input tokens. The economics are almost absurd.

Grok 4 itself has a confirmed 256k window at $3.00/M — competitive but unremarkable. The value proposition is really about 4.1 Fast for workloads where raw capacity and cost matter more than architectural sophistication.

The trade-off is in context management. Grok uses a sliding window approach: when the session approaches the limit, oldest tokens get discarded. No summarization, no semantic compression — just a first-in, first-out queue. Critical instructions from early in a long session can fall off the window silently. The recommended practice is to periodically re-introduce key context, which is functional but primitive compared to what the other platforms offer.
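The pinning workaround amounts to this — a FIFO window simulation (my sketch, not xAI's implementation) where critical instructions are re-injected on every read so eviction can't drop them:

```python
from collections import deque

class SlidingContext:
    """FIFO window sketch: old turns fall off silently, so critical
    instructions are pinned and re-injected on every render."""
    def __init__(self, max_turns: int, pinned: str):
        self.turns = deque(maxlen=max_turns)   # oldest entries evicted first
        self.pinned = pinned                   # instructions that must survive

    def add(self, turn: str) -> None:
        self.turns.append(turn)

    def render(self) -> list[str]:
        return [self.pinned] + list(self.turns)
```

It works, but note what it costs you: the pinned block is re-sent every turn, and everything unpinned is disposable by definition.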

Grok's multi-agent capability orchestrates parallel sub-agents for research tasks, with encrypted state returned only on explicit request. It's impressive for research workflows but currently doesn't support custom function calling in multi-agent mode — only built-in tools. That's a meaningful constraint for production agent systems.

And the same GDPR limitation applies: Grok's persistent memory (launched April 2025) is unavailable in the EU and UK.

The caching economics that change everything

Caching has gone from a nice-to-have to a business model shift. For applications that repeatedly pass large system prompts, codebases, or reference documents, the savings are transformative.

Consider a realistic enterprise scenario: a legal review tool that re-sends a 600k-token document corpus with every query, at about 1,000 queries per month. Without caching, Gemini 2.5 Pro costs roughly $2,100/month on input tokens alone. With explicit caching, that drops to around $525. Claude Sonnet 4.6 with prompt caching goes from about $1,800 to $180. Grok 4.1 Fast, even without caching, costs $120/month at that volume — less than any other platform's cached price.
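The arithmetic is simple enough to sanity-check yourself. A back-of-envelope calculator (the prices in the usage note are illustrative placeholders, not quotes):

```python
def monthly_input_cost(corpus_tokens: int, queries_per_month: int,
                       price_per_m: float, cache_discount: float = 0.0,
                       hit_rate: float = 1.0) -> float:
    """Back-of-envelope input cost: every query re-sends the corpus,
    and a fraction hit_rate of that volume gets the cache discount."""
    tokens = corpus_tokens * queries_per_month
    effective = tokens * (1 - cache_discount * hit_rate)
    return effective * price_per_m / 1_000_000
```

At a hypothetical $3.00/M, a 600k corpus queried 1,000 times a month costs $1,800 uncached and $180 with a 90% discount fully hit; drop the hit rate and the savings shrink proportionally, which is why cache retention windows matter as much as discount depth.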

The mechanisms differ in important ways. Gemini offers the most flexibility with implicit plus explicit caching. OpenAI's auto-caching is the most hands-off. Claude's prompt caching gives the deepest discount (90% on cached input tokens) but requires explicit setup and has shorter retention — up to an hour versus 24 hours for Gemini and OpenAI's extended caching. Grok's caching is session-based and less well-documented.

Monthly input cost comparison across platforms, with and without caching

For my own use cases — relatively short sessions with heavy system prompts — the short cache lifetime matters less than the discount depth. But for enterprise workloads with continuous queries against stable document sets, Gemini's 24-hour explicit caching or OpenAI's extended caching are more practical.

The EU problem

This deserves its own section because it's shaping my platform choices more than any technical feature.

As of March 2026, cross-session memory — the ability for the AI to remember your preferences, project context, and past decisions from one conversation to the next — works differently depending on where you live. Both ChatGPT's full conversation recall and Grok's persistent memory are unavailable in the EEA (which includes Norway, Iceland, and Liechtenstein), the UK, and Switzerland due to GDPR constraints. The only explicit memory you get is manual "saved memories" in ChatGPT.

Gemini Memory (in beta) and Claude's file-based memory tool are the two GDPR-compliant options for cross-session persistence. For AI practitioners in Europe who need agents that remember context between sessions — which is basically anyone doing serious agentic work — this narrows the field considerably.

It's a frustrating position. The best memory experience (ChatGPT's full recall) is geographically locked. The most architecturally sound approach (Claude's file-based memory) works everywhere but requires more developer involvement. Gemini's beta sits somewhere in between.

Regulatory clarity could change this. But "waiting for regulatory clarity" is not a deployment strategy, so for now, Claude and Gemini are the platforms I can build persistent workflows on from Sweden.

The comparison at a glance

For reference, here's where the four platforms stand across the dimensions that matter most for agent workflows.

| Dimension | Gemini | ChatGPT | Claude | Grok |
| --- | --- | --- | --- | --- |
| Max context | 2M (1.5 Pro) / 1M (3 Pro) | 1M (opt-in) / 272k default | 1M (Opus/Sonnet 4.6) | 2M (4.1 Fast) / 256k (4) |
| Cache discount | 75% (implicit + explicit) | Up to 90% (auto) | 90% (prompt caching) | 75% (API) |
| Cache retention | Up to 24h | Up to 24h (extended) | Up to 1h | Session-based |
| Cross-session memory | Beta, GDPR-compliant | Full recall (non-EU only) | File-based, GDPR-compliant | Persistent (non-EU only) |
| Agent context mgmt | Cross-tool circulation | Auto-compaction | Token budget awareness + context editing | Sliding window |
| Cheapest input cost | $1.25/M (2.5 Flash) | $0.15/M (5.4 mini) | $3.00/M (Sonnet 4.6) | $0.20/M (4.1 Fast) |

What I'm actually choosing

For my own work — coding projects, multi-agent pipelines, this blog — Claude remains my primary tool. The token budget awareness alone makes it the most reliable agent for long-running tasks. When I'm in a four-hour session building features, I want the model to know it's running low on context, not just silently degrade. The interleaved thinking between tool calls produces better multi-step reasoning than any other platform I've tested. And the memory approach, while more manual, works in my jurisdiction.

For research and large-document analysis, I reach for Gemini. The 1M–2M window with implicit caching is hard to beat when you're loading entire codebases or long technical documents. The March 2026 tooling updates for cross-tool context make it increasingly viable for agent workflows too, as long as you build in reliability guardrails.

Grok 4.1 Fast is the wildcard. At $0.20/M input tokens, it's the obvious choice for cost-sensitive batch workloads where you need massive context but don't need sophisticated memory management. I haven't used it extensively for agent workflows, but for "load everything and query it" use cases, the economics are compelling.

ChatGPT I use primarily as a conversational interface — the product experience is polished and the memory features are excellent. If I were based outside the EU, the full conversation recall would make it my default for ad-hoc work.

What to watch

The next six months will be telling. Gemini 3 Pro's tool-calling reliability is the single most important variable — if Google closes that gap, the combination of context capacity, caching economics, and multimodal depth makes Gemini the dominant agentic platform. Claude's 1M context is currently restricted to newer model tiers; broader access would change enterprise adoption significantly. Grok's 2M window needs independent Needle-in-Haystack validation at full context before anyone should commit to it for production. And EU regulatory developments could unlock ChatGPT and Grok memory for European users, which would reshape the competitive landscape overnight.

The deeper trend is clear though: raw token counts are converging. The differentiation has shifted to how intelligently each platform manages the context it has — through caching, compaction, budget awareness, and persistence. Claude leads on architectural sophistication for agent workflows. Gemini leads on scale and multimodal depth. OpenAI leads on memory usability outside the EU. Grok leads on capacity-per-dollar.

None of these advantages is permanent. But the underlying principle — the one I keep coming back to in every post in this series — isn't going anywhere either: context is a resource that needs to be managed, not a bucket that needs to be bigger. The platforms that internalize this will win. And the practitioners who understand these trade-offs will make better choices about which platform to build on for which job.