Context Rot: Why a Bigger Context Window Won't Save Your LLM Feature

There's a short post on the Hacker News front page this week that finally puts a name on something I'd been feeling for months. Garrit Franke splits an LLM's context window into two zones: a smart zone, where the model is sharp, and a dumb zone, where attention drops off and the model starts forgetting what you told it five minutes ago. The boundary sits somewhere around 100,000 tokens — no matter whether the box says "200k," "1M," or "2M."

If you build LLM features for production, rather than just pasting things into ChatGPT, this isn't a philosophical aside. It touches the architecture of your RAG pipeline, your agent loops, and your prompt budget directly. I work mostly in .NET, so let me show what context rot actually means for a C# codebase and how to build around it pragmatically.

For the record, the source: Garrit's post on garrit.xyz and the accompanying HN discussion.

The number on the box is marketing

Let's clear up the misconception first. A one-million-token context window does not mean you can reliably work with a million tokens. It means the model technically won't crash if you feed it that many. That's a very different claim from "the model uses all of that information equally well."

Two pieces of research are worth naming here. The RULER benchmark showed back in 2024 that a model's effective context length is often a fraction of the advertised number. And Chroma's context rot report (July 2025) tested eighteen frontier models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3 — and found that every single one degrades as input length grows. Not just when the window fills up: the decline is gradual and starts early. For models with a 1M window, Chroma observed a clearly measurable effect around 300,000–400,000 tokens, with the steepest degradation between 100k and 500k.

The crux: attention isn't a free lunch. The more tokens compete for the model's attention, the less sharply it attends to any individual one. The architecture behind long context windows works, but it papers over a problem the underlying attention mechanism doesn't really solve. The number on the box gets bigger every release. The usable part doesn't keep up.
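A toy calculation makes the dilution concrete. This is a sketch under a strong simplifying assumption, not a description of how any production model computes attention: one relevant token with a fixed logit competes in a single softmax against irrelevant tokens that all have logit 0. The class and method names are made up for illustration.

```csharp
using System;

public static class AttentionDilution
{
    // Softmax weight of a single "relevant" token with logit relevantLogit
    // when the other (tokenCount - 1) tokens all have logit 0.
    // weight = e^s / (e^s + (n - 1)), which shrinks as n grows.
    public static double RelevantWeight(double relevantLogit, int tokenCount)
    {
        if (tokenCount < 1)
            throw new ArgumentOutOfRangeException(nameof(tokenCount));

        var e = Math.Exp(relevantLogit);
        return e / (e + (tokenCount - 1));
    }
}
```

Calling RelevantWeight with the same logit and a growing tokenCount shows the weight on the relevant token falling steadily, even though nothing about the token itself changed. Real models have many heads and layers, so treat this as intuition only.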

What this means for your code

If you're building a chatbot that answers a single question from a short prompt, you'll barely notice this. It only becomes painful in exactly the scenarios we builders tend to land in:

- A RAG pipeline that sends the top-50 chunks "to be safe" instead of the top-5.
- An agent loop that piles up file reads, tool output, and a long debugging session until you're at 100k before lunch.
- A summarization feature that stuffs an entire 80-page report into one prompt.

In all three cases it feels like you're giving the model more to work with. In reality you're pushing it into the dumb zone.

The temptation is strong, because it's easier to send everything than to choose. But "more context" and "better context" are not the same thing, and past a certain point they're enemies.

Treat your context window as a budget

The practical mindset I've adopted: the context window is a budget, not storage. You have a limited number of "sharp" tokens, and anything you don't strictly need is crowding out something you do.

In .NET I start by measuring explicitly. Now that Microsoft.Extensions.AI has become the common abstraction, I work against IChatClient and keep my own budget. A rough token estimator is enough to make decisions — you don't need to reproduce the model's exact tokenizer to know that 50 chunks is too many.

using Microsoft.Extensions.AI;

public sealed class ContextBudget
{
    private readonly int _maxTokens;
    private int _used;

    public ContextBudget(int maxTokens) => _maxTokens = maxTokens;

    public int Remaining => _maxTokens - _used;

    // Rough estimate: ~4 characters per token for English.
    // Good enough to decide on, not to bill on.
    public static int Estimate(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);

    public bool TryAdd(string text)
    {
        var cost = Estimate(text);
        if (cost > Remaining) return false;
        _used += cost;
        return true;
    }
}

Note the important detail: I do not set _maxTokens to the advertised window. I set it to the zone where I believe the model stays sharp, often something on the order of 30k to 60k for a retrieval task. That is well below the roughly 100k mark where Garrit draws the boundary, and since the research describes a gradual decline rather than a cliff, staying clear of that mark is a deliberate, conservative choice. I'll take a tight, sharp context over a roomy, fuzzy one every time.
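The budget also has to cover more than retrieved chunks. As a rough sketch, the helper below subtracts the instructions, the question, and a reservation for the answer from the sharp-zone budget and reports what is left for context. BudgetPlan, BudgetPlanner, and the default numbers are my own illustrative assumptions, and it assumes the answer counts against the same window as the prompt.

```csharp
using System;

// Hypothetical helper: splits a "sharp zone" budget into fixed parts
// and whatever remains for retrieved context.
public sealed record BudgetPlan(
    int SharpZoneTokens,
    int InstructionTokens,
    int QuestionTokens,
    int ReservedForAnswer)
{
    // Tokens left for retrieved chunks; never negative.
    public int ContextTokens =>
        Math.Max(0, SharpZoneTokens - InstructionTokens - QuestionTokens - ReservedForAnswer);
}

public static class BudgetPlanner
{
    // Same rough ~4 characters per token heuristic used elsewhere in this post.
    public static int Estimate(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);

    // The defaults are placeholders, not recommendations for any specific model.
    public static BudgetPlan Plan(
        string instructions,
        string question,
        int sharpZoneTokens = 40_000,
        int reservedForAnswer = 2_000) =>
        new(sharpZoneTokens, Estimate(instructions), Estimate(question), reservedForAnswer);
}
```

The ContextTokens value is what you would hand to the retrieval step as its budget, so a long system prompt visibly shrinks the room left for chunks.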

Retrieval: choose, don't hoard

The biggest win is in your retrieval step. The anti-pattern I see in the wild is a vector search that returns the top-N on pure cosine similarity, after which someone sets N to 40 "because that's more complete." That's exactly the wrong reflex.

Better is a budget-driven selection: sort by relevance and add chunks in that order until the budget is spent, skipping any chunk that no longer fits. That way you drop the mediocre chunks instead of spending your sharp tokens on them.

public sealed record Chunk(string Text, float Score);

public static IReadOnlyList<Chunk> SelectWithinBudget(
    IEnumerable<Chunk> ranked,
    int budgetTokens)
{
    var selected = new List<Chunk>();
    var used = 0;

    foreach (var chunk in ranked.OrderByDescending(c => c.Score))
    {
        var cost = ContextBudget.Estimate(chunk.Text);
        if (used + cost > budgetTokens)
            continue; // skip; a smaller later chunk might still fit

        selected.Add(chunk);
        used += cost;
    }

    return selected;
}

A subtlety that shows up in the Chroma report: distractors — chunks that look relevant but aren't — hurt performance more as context grows. In other words, a large pile of "almost right" chunks is worse than a small set of truly relevant ones. That's an extra argument for being strict when you select, rather than generous.
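One way to act on this is to filter out weak matches before the budget selection ever sees them. The sketch below is an assumption on my part rather than anything the Chroma report prescribes: it keeps a chunk only if it clears an absolute score floor and sits close enough to the best match. DistractorFilter, minScore, and maxDropFromBest are hypothetical names, and sensible thresholds depend entirely on your embedding model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Repeated here so the snippet compiles on its own;
// drop this line if Chunk is already defined in your project.
public sealed record Chunk(string Text, float Score);

public static class DistractorFilter
{
    // Keeps chunks that score at least minScore and are no more than
    // maxDropFromBest below the top-scoring chunk. Result is sorted best first.
    public static IReadOnlyList<Chunk> KeepStrongMatches(
        IEnumerable<Chunk> candidates,
        float minScore,
        float maxDropFromBest)
    {
        var ranked = candidates.OrderByDescending(c => c.Score).ToList();
        if (ranked.Count == 0)
            return ranked;

        var best = ranked[0].Score;
        return ranked
            .Where(c => c.Score >= minScore && best - c.Score <= maxDropFromBest)
            .ToList();
    }
}
```

Run the filter first and pass its result to the budget selection. If it leaves you with two chunks instead of forty, that is often the honest answer about how much relevant material you actually have.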

Chunking and position

Beyond how much you send, where it sits matters too. Models have a well-documented bias toward the beginning and end of the context (the "lost in the middle" finding). If you combine a handful of chunks with a system prompt and a question, don't bury the most important material deep in the middle.

In practice I build the prompt in a fixed order: instructions at the top, the most relevant context right below them, less relevant context after that, and the concrete question at the very bottom, just before generation. It costs nothing and it measurably helps.

public static string BuildPrompt(
    string systemInstructions,
    IReadOnlyList<Chunk> context,
    string question)
{
    var sb = new System.Text.StringBuilder();
    sb.AppendLine(systemInstructions);
    sb.AppendLine();
    sb.AppendLine("## Relevant context");

    // Most relevant first; the model reads the opening most sharply.
    foreach (var chunk in context.OrderByDescending(c => c.Score))
    {
        sb.AppendLine(chunk.Text);
        sb.AppendLine("---");
    }

    sb.AppendLine();
    sb.AppendLine("## Question");
    sb.AppendLine(question);
    return sb.ToString();
}

Agents: move information out of the session

For agent workflows, Garrit's advice lands: leave a breadcrumb. Instead of relying on auto-compaction — where the summary is produced by a model that's already in the dumb zone — write a spec yourself and start a fresh session. That's a higher-signal handoff than any automated summary, because you get to decide what matters going forward.
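As a sketch of what such a breadcrumb could look like in C#, the record below captures a hand-written handoff and renders it as plain text. The shape (goal, decisions, open questions, next steps) is my own assumption about what is worth carrying over, not a format from Garrit's post, and HandoffSpec is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical shape for a hand-written handoff note between sessions.
public sealed record HandoffSpec(
    string Goal,
    IReadOnlyList<string> Decisions,
    IReadOnlyList<string> OpenQuestions,
    IReadOnlyList<string> NextSteps)
{
    // Renders the spec as short plain text suitable for a fresh session.
    public string ToText()
    {
        var sb = new StringBuilder();
        sb.AppendLine("Goal: " + Goal);
        AppendSection(sb, "Decisions made", Decisions);
        AppendSection(sb, "Open questions", OpenQuestions);
        AppendSection(sb, "Next steps", NextSteps);
        return sb.ToString();
    }

    // Empty sections are omitted so the note stays as short as possible.
    private static void AppendSection(StringBuilder sb, string title, IReadOnlyList<string> items)
    {
        if (items.Count == 0) return;

        sb.AppendLine();
        sb.AppendLine(title + ":");
        foreach (var item in items)
            sb.AppendLine("- " + item);
    }
}
```

Writing the result to disk with File.WriteAllText and pasting it into a new session is enough. The point is that you choose the contents, not a model that is already deep in a long context.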

In .NET terms: treat your agent's intermediate results as artifacts that you explicitly pull out of the live session. A plan, a PRD, a task list — stored outside the context window, and loaded back in only when needed. The pattern that projects like obra/superpowers and mattpocock/skills formalize comes down to this: keep the working session in the smart zone by deliberately moving information out into small, named artifacts.

A minimal version in code: have each step write its output to a store, and give the next step only the summary plus the relevant artifacts — not the full history.

public interface IArtifactStore
{
    Task SaveAsync(string key, string content, CancellationToken ct = default);
    Task<string?> LoadAsync(string key, CancellationToken ct = default);
}

public sealed class AgentStep
{
    private readonly IChatClient _client;
    private readonly IArtifactStore _store;

    public AgentStep(IChatClient client, IArtifactStore store)
    {
        _client = client;
        _store = store;
    }

    // outputKey is chosen by the caller so a later step can load this result by name.
    public async Task<string> RunAsync(
        string instruction,
        IReadOnlyList<string> artifactKeys,
        string outputKey,
        CancellationToken ct = default)
    {
        var messages = new List<ChatMessage>
        {
            new(ChatRole.System, "You are a focused sub-agent. Work only on the task.")
        };

        // Load only what this step needs, not the entire history.
        foreach (var key in artifactKeys)
        {
            var artifact = await _store.LoadAsync(key, ct);
            if (artifact is not null)
                messages.Add(new(ChatRole.User, $"Artifact {key}:\n{artifact}"));
        }

        messages.Add(new(ChatRole.User, instruction));

        var response = await _client.GetResponseAsync(messages, cancellationToken: ct);
        var output = response.Text ?? string.Empty;

        // Persist under the caller's key; a random key could never be looked up again.
        await _store.SaveAsync(outputKey, output, ct);
        return output;
    }
}

The value here isn't the elegance of the code, it's the discipline it enforces: every step gets a tight, purpose-built context instead of a growing snowball.
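To try the pattern without any infrastructure, an in-memory store is enough. The sketch below is one possible implementation of the IArtifactStore interface from above, backed by a ConcurrentDictionary. InMemoryArtifactStore is a name I made up, and nothing here survives a process restart, so it suits tests and experiments rather than production.

```csharp
#nullable enable
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Repeated here so the snippet compiles on its own;
// drop it if the interface is already defined in your project.
public interface IArtifactStore
{
    Task SaveAsync(string key, string content, CancellationToken ct = default);
    Task<string?> LoadAsync(string key, CancellationToken ct = default);
}

public sealed class InMemoryArtifactStore : IArtifactStore
{
    private readonly ConcurrentDictionary<string, string> _items = new();

    public Task SaveAsync(string key, string content, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        _items[key] = content; // last write wins for a given key
        return Task.CompletedTask;
    }

    public Task<string?> LoadAsync(string key, CancellationToken ct = default)
    {
        ct.ThrowIfCancellationRequested();
        // Missing keys return null instead of throwing.
        return Task.FromResult<string?>(_items.TryGetValue(key, out var value) ? value : null);
    }
}
```

Pass an instance to the AgentStep constructor and every step reads and writes named artifacts in memory. Swapping in a file-based or database-backed store later only means implementing the same two methods.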

Test it; don't trust your gut

The tricky thing about context rot is that it's gradual. Your feature doesn't break; it slowly gets vaguer. That's why the single most important measure is an evaluation set that treats context size as a variable. Build a handful of questions with known answers, and run the same questions with different amounts of context: just the top-5 chunks, the top-10, the top-40. If the top-40 variant scores worse than the top-5 (and that happens more often than you'd think), you have your answer.

public sealed record EvalCase(string Question, string ExpectedSubstring);

public static async Task<double> ScoreAsync(
    IChatClient client,
    IReadOnlyList<EvalCase> cases,
    Func<string, IReadOnlyList<Chunk>> retrieve,
    int budgetTokens)
{
    var correct = 0;

    foreach (var c in cases)
    {
        var ranked = retrieve(c.Question);
        var context = SelectWithinBudget(ranked, budgetTokens);
        var prompt = BuildPrompt("Answer based on the context.", context, c.Question);

        var response = await client.GetResponseAsync(prompt);
        if ((response.Text ?? "").Contains(c.ExpectedSubstring, StringComparison.OrdinalIgnoreCase))
            correct++;
    }

    // Guard against an empty eval set, which would otherwise yield NaN.
    return cases.Count == 0 ? 0.0 : (double)correct / cases.Count;
}

Run this with different budgetTokens and you have a curve instead of a hunch. In my experience that curve often peaks early and then flattens or even drops — exactly what the literature predicts.
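The sweep itself can be a few lines. In the sketch below, scoreAtBudget stands in for a call to the scoring function above with everything except the budget already bound. BudgetSweep, BudgetScore, and the 0.02 tolerance are hypothetical choices of mine. The second method picks the smallest budget that scores within the tolerance of the best one, on the assumption that a cheaper context that scores about as well is the better default.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed record BudgetScore(int BudgetTokens, double Score);

public static class BudgetSweep
{
    // Scores each budget in ascending order and returns the resulting curve.
    public static async Task<IReadOnlyList<BudgetScore>> RunAsync(
        IEnumerable<int> budgets,
        Func<int, Task<double>> scoreAtBudget)
    {
        var curve = new List<BudgetScore>();
        foreach (var budget in budgets.OrderBy(b => b))
        {
            // Sequential on purpose: keeps cost and rate limits predictable.
            curve.Add(new BudgetScore(budget, await scoreAtBudget(budget)));
        }
        return curve;
    }

    // Smallest budget whose score is within tolerance of the best score.
    public static BudgetScore SmallestNearBest(
        IReadOnlyList<BudgetScore> curve,
        double tolerance = 0.02)
    {
        if (curve.Count == 0)
            throw new ArgumentException("Curve is empty.", nameof(curve));

        var best = curve.Max(p => p.Score);
        return curve
            .Where(p => best - p.Score <= tolerance)
            .OrderBy(p => p.BudgetTokens)
            .First();
    }
}
```

With a small eval set the scores are noisy, so treat the chosen budget as a starting point and rerun the sweep when you change models or retrieval.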

The trade-off, stated honestly

There's a counterargument, and it deserves a place. Long context windows aren't useless. For tasks where you genuinely need a lot of material at once — understanding a large source file, summarizing a long document you have no good chunking for — a roomy window is a real improvement over the alternative, which is not being able to include the material at all. And models keep improving: the boundary of the smart zone moves out with every generation. Anyone who hard-codes everything around a fixed 100k limit today may be baking in an unnecessary constraint by next year.

So the nuance is: use long context where it genuinely adds value, but don't lean on it by default as a substitute for good retrieval and context curation. "Just throw everything in" isn't an architecture; it's a deferral of the choice you have to make anyway.
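One way to avoid baking in a fixed limit is to keep the sharp-zone budget as per-model configuration that you update when your evals say so. The sketch below is a minimal version of that idea. SharpZoneBudgets is a hypothetical class, and the model ids and numbers in the usage example are placeholders rather than measured values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup: sharp-zone budget per model id, with a conservative fallback.
public sealed class SharpZoneBudgets
{
    private readonly Dictionary<string, int> _byModel;
    private readonly int _fallbackTokens;

    public SharpZoneBudgets(IDictionary<string, int> byModel, int fallbackTokens)
    {
        // Case-insensitive so "Model-A" and "model-a" resolve to the same entry.
        _byModel = new Dictionary<string, int>(byModel, StringComparer.OrdinalIgnoreCase);
        _fallbackTokens = fallbackTokens;
    }

    // Returns the configured budget, or the fallback for unknown models.
    public int For(string modelId) =>
        _byModel.TryGetValue(modelId, out var tokens) ? tokens : _fallbackTokens;
}

public static class SharpZoneBudgetsExample
{
    // Placeholder ids and numbers; replace them with results from your own evals.
    public static SharpZoneBudgets Create() =>
        new(new Dictionary<string, int>
        {
            ["model-a"] = 60_000,
            ["model-b"] = 40_000,
        }, fallbackTokens: 30_000);
}
```

When a new model generation moves the smart zone outward, the change is one number in configuration instead of a constant scattered through the codebase.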

Takeaway

For me as a builder, the message of this trending HN post isn't pessimism — it's a design guideline. Treat the context window as a budget, not a bucket. Choose your context deliberately instead of generously. Put the most important material where the model reads most sharply. Move information out of the live session and into artifacts. And measure the effect with an eval set that includes context size as a variable, so you base decisions on data rather than on the number on the box.

The number on the box gets bigger every release. The usable part grows more slowly. Build as if that's true, and you won't be caught out.

Source: "Don't trust large context windows" — Garrit's Notes and the Hacker News discussion. Research: RULER and Chroma's context rot report.