Someone sent me a piece on context rot this week, and reading it was a strange experience — like finding a clean schematic for a machine I'd been operating by feel for four months.

The article makes a tight argument: context rot — performance degradation as context lengthens — is a real, structural phenomenon, not something a bigger window fixes. It names three mechanisms behind it and puts actual numbers on where it starts to bite. I've been writing about exactly this since February, but always from the practitioner's side: here's what broke, here's the habit that stopped it breaking. The article supplies the part I never had — why it breaks, stated precisely. So this post is a reflection: where the analysis confirms what I found the hard way, where it sharpens it, and where my months of living with the problem push back on it.

The thing I called "context rot" in February now has mechanisms

I used the phrase "context rot" in my first post on this back in February, almost casually, to describe the drift I kept seeing: by session three or four, the AI suggests a library I ruled out on day one, generates a migration that conflicts with a schema decision from two days ago, stops following the naming conventions I established. My explanation at the time was hand-wavy — "lost in the middle," attention shifting to recent messages. True as far as it went, but it was a description dressed up as an explanation.

The article does the real work of separating the mechanisms, and the distinction is more useful than my single bucket:

Attention dilution. Transformer attention spreads across all tokens, so every token gets proportionally less weight as the context grows. Early critical information doesn't vanish — it gets outvoted by sheer volume of later content. This is the one I'd gestured at.

Conflicting instructions. Older instructions don't disappear when newer ones supersede them. They sit there as latent contradictions that surface unpredictably. This one I'd felt but never named — and naming it explains a failure mode that always confused me. It's not that the model forgot my constraint. It's that the constraint is still in there, now competing with three later messages that softened or contradicted it, and the model is adjudicating a conflict I didn't know I'd created.

Coherence degradation. Synthesizing across dozens of prior exchanges gets harder as they pile up. Responses become "locally sensible but globally inconsistent" — each individual answer is fine, the trajectory drifts. That phrase is the cleanest description I've seen of the slow-motion failure I described in my legal-work example: an AI five months into a case producing a motion that's technically competent and strategically wrong, because it's reasoning soundly over a degraded picture of the whole.

The reframe that matters: context rot is not forgetting. The information is right there in the window. The model is confused by accumulation, not starved of data. I knew this empirically — it's why I kept insisting the fix isn't "remind it harder" — but I didn't have the three-way decomposition to explain why reminding harder sometimes makes things worse (you've just added another instruction to the conflicting pile).

The thresholds match my gut, and that's worth saying out loud

The article offers practical, explicitly non-binding thresholds from production agent tasks: under ~20K tokens is generally fine; 20K–50K shows noticeable degradation on complex multi-constraint tasks; past ~50K, rot becomes a real reliability problem.

In my second post I wrote about the "maximum effective context window" — the gap between a model's advertised window and the point where its performance actually holds up — and said I wished I'd had that concept six months earlier. What I didn't have was numbers. I could tell you a fresh session with a 500-word handover note beat a four-hour session where everything was technically "in context," but I couldn't tell you where the cliff was.

These thresholds line up with my experience closely enough that I trust them as a starting heuristic — with one caveat the article is careful to flag and I want to amplify: they're task-shaped, not universal. My CSS-debugging sessions tolerate far more accumulation than my multi-constraint architecture sessions, because the former is a sequence of locally-scoped problems and the latter demands global coherence — exactly the dimension that degrades first. The number that actually matters isn't tokens, it's how many live constraints the current task must hold simultaneously. A 60K-token session reading mostly inert reference material can be fine; a 30K-token session juggling eight interacting decisions is already in trouble.

So I'd reframe the thresholds as: they're a proxy for constraint load, and they hold until your task gets constraint-dense, at which point the cliff arrives earlier than the token count predicts.

Where the four mitigations are exactly my .md files

The article's engineering recommendations — compression, pruning, session splitting, structured curation — are, almost line for line, the system I described building out of markdown files in February. I find this more reassuring than redundant: two people reasoning from different directions (it from mechanism, me from scar tissue) converging on the same four moves suggests the moves are right.

  • Context compression → my HANDOVER.md. A two-hour session with 150 exchanges distilled into a structured state document at the session boundary. The article's "summarize earlier exchanges" is the automated cousin of the thing I ask Claude to do by hand at the end of every session.
  • Context pruning → "know when to reset." The most counterintuitive habit I have, and the one that helps most: when a session starts making subtle errors, the best move is to end it, generate a handover, and start clean. That's pruning taken to its logical extreme — prune everything, keep only the distilled state.
  • Session splitting → "modularize aggressively." Break the work into discrete tasks with clear interface contracts, finish one, save the result externally, start the next with only the relevant context. The next module doesn't need the debugging history of the last one — it needs the current contract.
  • Structured context management → the entire premise of the starter prompt. Actively curate what enters context instead of letting it grow organically.

The one thing I'd add from experience: compression quality is the whole game, and it's lossy in a way that's easy to get wrong. A lazy handover ("worked on CSS and deployment") is nearly useless. A good one ("changed header border to gradient; deployed via Portainer; discovered Docker volume overlays content images — fix: copy to public/images") lets the next session start instantly. The article treats compression as a strategy; I'd treat it as a skill with a sharp quality gradient. The decisions and constraints have to survive; the debugging tangents have to die. Automated compression that summarizes by recency or token-proportion rather than by durability of the information will compress the wrong things — it'll keep the recent CSS noise and drop the load-bearing schema decision from day one. That's the same prioritization failure that causes context rot, just relocated into the tool that's supposed to fix it.

Where I'd push the argument further

The article is excellent on the single-session problem. The last four months convinced me the more interesting frontier is what happens when you stop fighting accumulation and start architecting around it.

Context rot is the reason agent teams work. Once you accept that a single context degrades as it accumulates constraints, the case for decomposition stops being about parallelism and starts being about rot avoidance. I argued in March that agent teams need explicit context boundaries the way human teams do. The mechanism the article describes is why: a discovery agent that only scans and inventories never accumulates the constraint load that produces coherence degradation, because its context stays scoped to one concern. The monolithic "do everything" agent isn't bad because it's slow — it's bad because it's a context-rot generator by construction. Splitting concerns across agents is session-splitting applied to the org chart. The article's fourth mitigation, scaled up, is the multi-agent architecture.

The platforms are now absorbing every one of these mitigations. When I wrote about Claude Fable 5 last week, the thing that struck me was that the new tier didn't grow the context window — it shipped the management layer. Task budgets (the model pacing itself against a known limit). Server-side compaction (my HANDOVER.md pattern, automated). A 3× jump on long-horizon tasks when given file-based memory. The four mitigations in this article are exactly the capabilities the frontier labs are building into the infrastructure. The article frames context governance as something you must do; I'd add that it's becoming something the platform increasingly does with you — and the practitioners who internalized the discipline by hand are the ones who'll actually know how to drive those features, because they understand what the features are for.

Observability is still the missing piece. The one place neither the article nor the platforms have caught up to where I wish they were: I still can't see what's actually in the model's context at any given moment. When a session goes sideways, I can't inspect which mechanism fired — was a constraint diluted, did it conflict with a later instruction, or did coherence just decay? I diagnose by symptom and reach for the same reset either way. The three-mechanism decomposition is genuinely useful here, because it gives me a vocabulary for what to look for — but until tools surface context composition the way they surface a stack trace, context engineering stays a craft practiced partly blind.

The line that hasn't changed

I've ended several posts in this series on the same sentence, and this new piece doesn't move me off it — it reinforces it from the mechanism side: context is a resource that needs to be managed, not a bucket that needs to be bigger.

What the article adds is the why, decomposed cleanly enough to be actionable. What four months of doing it by hand adds is the texture: the thresholds bend with constraint density, the compression is a skill with a steep quality gradient, the single-session mitigations scale up into agent architecture, and the platforms are racing to absorb all of it. Same conclusion, arrived at from opposite ends. When the theory and the scar tissue agree this precisely, it's usually because the thing is true.