<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>William Lubelski</title><description>Essays and writing by William Lubelski.</description><link>https://william.lubel.ski/</link><item><title>On Compaction</title><link>https://william.lubel.ski/writing/2026-03-10-on-compaction/</link><guid isPermaLink="true">https://william.lubel.ski/writing/2026-03-10-on-compaction/</guid><description>How quickly we forget</description><pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Forgetting&lt;/h2&gt;
&lt;p&gt;Forty minutes into a complex task, an agent starts repeating itself,&lt;br&gt;re-derives old conclusions, proposes an approach it already tried and abandoned.&lt;/p&gt;
&lt;p&gt;Old findings get compressed into a summary, and the summary loses the subtle
reasoning. The sharp edges get sanded off. A limit was reached, and so an
algorithm guessed at what was important to retain. The agent carries on with a
lossy copy of its own thinking, but the work continues.&lt;/p&gt;
&lt;p&gt;As users on the outside, sometimes we power through it, sometimes we throw our
hands up and start a new session. Until that session degrades too.&lt;/p&gt;
&lt;p&gt;The standard read on this: context windows will get bigger. Models will get
better at long-range attention. Compaction algorithms will get smarter. The
implicit assumption: the architecture is fine, the scarce resource is context.
Make that resource less scarce and the problem goes away.&lt;/p&gt;
&lt;p&gt;Maybe.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Context preservation techniques&lt;/h2&gt;
&lt;p&gt;The community continues to evolve lots of clever techniques for managing context.&lt;/p&gt;
&lt;h3&gt;Sub-agents&lt;/h3&gt;
&lt;p&gt;Instead of burning the main agent&amp;#39;s context on every sub-task, you spin up a
smaller agent with its own context, let it do the work, and return a conclusion.
The main agent&amp;#39;s context burns slower because it only absorbs summaries, not the
full reasoning process.&lt;/p&gt;
&lt;p&gt;This helps. It&amp;#39;s a real improvement. But the main agent is still degrading. You
slowed the burn rate, but you didn&amp;#39;t change what&amp;#39;s burning.&lt;/p&gt;
&lt;h3&gt;The Checklist&lt;/h3&gt;
&lt;p&gt;The second fix is the checklist loop. Write the plan to a file. Give the agent
one task at a time. Reload context fresh for each task. Externalize everything
to disk so there&amp;#39;s nothing in the context window that needs to survive.&lt;/p&gt;
&lt;p&gt;This is genuinely good engineering. It treats the context window as volatile
scratch space — you keep emptying it, so rot can&amp;#39;t accumulate. It&amp;#39;s thermostatic
control: read the state, compare to the goal, take the next action, repeat. A
thermostat doesn&amp;#39;t understand heat transfer. It reads a number and flips a
switch. And for a surprising range of real work, that&amp;#39;s sufficient.&lt;/p&gt;
&lt;p&gt;But someone has to write the checklist. Hard engineering problems don&amp;#39;t arrive
pre-decomposed. When the work reveals that the plan was wrong — when you
discover mid-task that your decomposition was about the wrong thing — the
checklist can&amp;#39;t adapt. It can change what the agent does next. It can&amp;#39;t change
what the agent understands.&lt;/p&gt;
&lt;h3&gt;Preemptive compaction&lt;/h3&gt;
&lt;p&gt;As the agent approaches its context limit, have it write up its current state
and pick-up instructions. Then reboot fresh and have the new agent pick up where
the old one left off. More thoughtful than algorithmic compaction. More adaptive
than a fixed checklist.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;The necessary coordinator&lt;/h2&gt;
&lt;p&gt;Each of these is a genuine improvement on the one before. Sub-agents burn the
context slower. The checklist avoids burning it at all for simple tasks.
Preemptive compaction manages the burn more gracefully.&lt;/p&gt;
&lt;p&gt;Each still assumes there is a prime agent, and that notion still holds the
agent&amp;#39;s context in special, vaunted status. The necessary coordinator. Without
that coordination, the work cannot continue. And so everything else is a
strategy for keeping that agent alive and functioning as long as possible.&lt;/p&gt;
&lt;p&gt;The context window is finite, so you manage the finite resource.&lt;/p&gt;
&lt;p&gt;The prime agent is the thread of reasoning, so you protect the thread.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Fire, uncontained and contained&lt;/h2&gt;
&lt;p&gt;Fire burning in the open is useful. You can warm yourself by it, cook over it,
see by it. But it&amp;#39;s ambient. The heat goes everywhere. You manage it by tending
it. More fuel, less fuel, a ring of stones, a cleared area so the sparks don&amp;#39;t
catch. Every improvement is a better way of tending the same combustion.&lt;/p&gt;
&lt;p&gt;But fire sculpted by a mechanism becomes something else. It is, in one of the
oldest senses of the word, &amp;#39;an engine&amp;#39;.&lt;/p&gt;
&lt;p&gt;The heat goes where the machine directs it. The mechanism doesn&amp;#39;t make the fire
hotter or more efficient — it makes the fire&amp;#39;s output &lt;em&gt;structural&lt;/em&gt;. The fire
does the same thing it always did. The machine is what changed.&lt;/p&gt;
&lt;p&gt;In the early 1700s, Thomas Newcomen built one of the first machines to do this
with steam. His atmospheric engine pumped water out of coal mines by injecting
steam into a cylinder, then injecting cold water to condense it — the
condensation created a vacuum, the atmosphere pushed the piston down, water got
pumped. It worked for sixty years. But it was roughly 1% thermally efficient,
because the cold water cooled the cylinder on every stroke, and most of the fuel
went to reheating what had just been cooled. The mechanism that did the work
also destroyed the conditions for doing more work.&lt;/p&gt;
&lt;p&gt;In the 1760s, James Watt was repairing a model Newcomen engine at the University
of Glasgow. He wasn&amp;#39;t trying to build a new kind of engine. He was trying to
understand why the model used so much steam. And he noticed where the waste was
going: into reheating the cylinder.&lt;/p&gt;
&lt;p&gt;His fix was not &amp;quot;make a better cylinder.&amp;quot; It was: stop condensing in the
cylinder. Move the condensation to a separate vessel — a condenser — that stays
cold while the cylinder stays hot. Each component does one job, in the
conditions suited to that job. Efficiency roughly tripled. Not from a better
version of Newcomen&amp;#39;s engine, but from a different machine that happened to use
the same steam.&lt;/p&gt;
&lt;p&gt;Heating and condensation were fighting over the same vessel.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Coordination&lt;/h2&gt;
&lt;p&gt;Inference and memory are fighting over the same vessel.&lt;/p&gt;
&lt;p&gt;The context window is good at inference: hot, expensive, high-bandwidth, the
place where reasoning actually happens. It is bad at memory: finite, degrading,
lossy under compression. We keep trying to make it do both.&lt;/p&gt;
&lt;p&gt;Compaction is the cold water injection. It preserves a lossy version of the
agent&amp;#39;s state so inference can continue — but it degrades the context that makes
inference valuable. The agent spends tokens re-deriving conclusions,
re-orienting in territory it already mapped, reconstructing judgment from
summaries of judgment. Every compaction cycle means reheating a cooled cylinder.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sub-agents coordinate multiple cylinders in parallel.&lt;/li&gt;
&lt;li&gt;Checklists coordinate multiple cylinders in series.&lt;/li&gt;
&lt;li&gt;Preemptive compaction more elegantly times the cooling.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;What would this look like?&lt;/h2&gt;
&lt;p&gt;If you stop trying to preserve the cylinder — if you accept that the context
window is scratch space, not storage — the architecture changes shape on its
own.&lt;/p&gt;
&lt;p&gt;There&amp;#39;s no prime agent. There&amp;#39;s no single thread of reasoning to protect.&lt;/p&gt;
&lt;p&gt;Instead there&amp;#39;s a back-and-forth:&lt;/p&gt;
&lt;p&gt;An assessment step reads the original prompt, reads the current state of the
work, reads the log of what&amp;#39;s been done. Makes a judgment: what needs to happen
next? Dispatches work.&lt;/p&gt;
&lt;p&gt;A work step receives a task, does the work, writes its findings somewhere
persistent, logs what it did, and yields its final notes to exit.&lt;/p&gt;
&lt;p&gt;It doesn&amp;#39;t need to explain its entire reasoning back to the assessor. That&amp;#39;s in
the report. The assessor can double-check the agent&amp;#39;s homework if that seems
prudent; otherwise it can proceed with more of the high-level task.&lt;/p&gt;
&lt;p&gt;This next assessment &lt;em&gt;could&lt;/em&gt; be a resumption of the previous assessor&amp;#39;s context.
But it doesn&amp;#39;t have to be. It could be a fresh agent that reads the prompt, the
current working state, and the log as needed. Fresh context, full fidelity,
reading the current state of the world.&lt;/p&gt;
&lt;p&gt;Like a shift change on a navy boat:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You know the standing orders.&lt;/li&gt;
&lt;li&gt;You check the notes from the last shift.&lt;/li&gt;
&lt;li&gt;You check the state of the world.&lt;/li&gt;
&lt;li&gt;Then you get to planning what needs to be done next.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &amp;quot;thread of reasoning&amp;quot; isn&amp;#39;t in any agent&amp;#39;s head. It&amp;#39;s in the files.&lt;/p&gt;
&lt;p&gt;This is not recursive work burning down a single finite resource. It&amp;#39;s a
trampoline — a back and forth where each participant starts fresh and reads the
current state of the work. Context isn&amp;#39;t precious. It&amp;#39;s a fresh cylinder, heated
and ready. What&amp;#39;s scarce is something else entirely: good judgment about what to
do next. Discipline. And good judgment comes from clear state and full history,
not from a degrading memory of having been there.&lt;/p&gt;
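&lt;p&gt;For concreteness, the trampoline above can be sketched in a few lines of Python. This is a shape, not an implementation: &lt;code&gt;run_assessor&lt;/code&gt; and &lt;code&gt;run_worker&lt;/code&gt; are hypothetical stand-ins for fresh LLM sessions, and the file layout is invented for illustration.&lt;/p&gt;

```python
# Sketch of the trampoline: fresh assessors and fresh workers bounce
# back and forth, and everything durable lives on disk, not in any one
# agent's context. run_assessor/run_worker are hypothetical callables.
import json
from pathlib import Path

STATE = Path("work")          # partitioned working state
LOG = STATE / "log.jsonl"     # append-only record of what happened

def read_log():
    """Full-fidelity history: every entry ever appended, in order."""
    if not LOG.exists():
        return []
    return [json.loads(line) for line in LOG.read_text().splitlines()]

def append_log(entry):
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def trampoline(prompt, run_assessor, run_worker, max_steps=50):
    """Bounce between fresh contexts until the assessor declares done."""
    STATE.mkdir(exist_ok=True)
    for _ in range(max_steps):
        # Fresh context every bounce: the assessor reads the standing
        # orders (prompt), the state of the world, and the shift notes.
        decision = run_assessor(prompt, STATE, read_log())
        if decision["done"]:
            return decision["summary"]
        # The worker also starts fresh, does one task, and externalizes
        # its findings before exiting.
        report = run_worker(decision["task"], STATE)
        append_log({"task": decision["task"], "report": report})
    raise RuntimeError("step budget exhausted")
```

&lt;p&gt;Nothing in the loop survives between bounces except the files: kill the process mid-task and a new one resumes from the same prompt, state, and log.&lt;/p&gt;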
&lt;hr&gt;
&lt;h2&gt;The agents are the steam&lt;/h2&gt;
&lt;p&gt;Watt didn&amp;#39;t just optimize the Newcomen design, but neither did he reinvent fire
or steam. He changed which parts did what. Which parts stay hot, which parts
stay cold, and where the work accumulates.&lt;/p&gt;
&lt;p&gt;Agents are the steam. You don&amp;#39;t waste them; you choose the right kind for the
job. But you don&amp;#39;t design the whole machine around keeping one batch of steam
alive. You design the machine so the steam can do its work and be replaced by
fresh steam, and the work persists in the parts that were built to hold it.&lt;/p&gt;
&lt;p&gt;I don&amp;#39;t know that any of this is right. I&amp;#39;m messing around with a Claude Max subscription
just like lots of other people. I am not an AI researcher and I don&amp;#39;t have
benchmarks or even a working proof of concept yet.&lt;/p&gt;
&lt;p&gt;It&amp;#39;s a premise, a feeling even. It&amp;#39;s a shape I keep noticing where the current
approaches all seem to be optimizing within a design that might have the seam in
the wrong place.&lt;/p&gt;
&lt;p&gt;But maybe the move now is the same as it was in 1765.&lt;/p&gt;
&lt;p&gt;Separate the condenser.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Appendix&lt;/h2&gt;
&lt;details&gt;

&lt;summary&gt;A Condenser Shape&lt;/summary&gt;

&lt;br/&gt;

&lt;p&gt;My hunch about what the condenser looks like:&lt;/p&gt;
&lt;p&gt;[A] partitioned shared working state&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partitioned meaning multiple agents can access and work on different parts without needing to read in the entire corpus&lt;/li&gt;
&lt;li&gt;Shared meaning any agent can write to any portion (efficient sharing is policy, not mechanism)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;[B] a queryable append-only log of what happened&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An immutable log lets the agents and the humans see if someone isn&amp;#39;t playing well with others&lt;/li&gt;
&lt;li&gt;Queryable means that if a new finding challenges an old assumption, old work can be reexamined at full fidelity (like a detective reading cold case files)&lt;/li&gt;
&lt;/ul&gt;
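&lt;p&gt;As a sketch only (all names and layout invented), [A] and [B] together might be as small as a directory of partition files plus one append-only JSONL log:&lt;/p&gt;

```python
# Hypothetical condenser: partitioned shared state [A] plus a
# queryable append-only log [B]. Layout and names are invented.
import json
from pathlib import Path

class Workspace:
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.log = self.root / "log.jsonl"

    # [A] Partitioned and shared: any agent may write any partition.
    # Staying in your lane is policy, not mechanism.
    def write(self, partition, text):
        (self.root / partition).write_text(text)
        self.append({"event": "write", "partition": partition})

    def read(self, partition):
        # Agents read only the partition they need, not the whole corpus.
        return (self.root / partition).read_text()

    # [B] Append-only: events are only ever added, never rewritten.
    def append(self, entry):
        with self.log.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def query(self, **match):
        """Reopen the cold case files: every event matching the filter."""
        if not self.log.exists():
            return []
        events = [json.loads(line) for line in self.log.read_text().splitlines()]
        return [e for e in events if all(e.get(k) == v for k, v in match.items())]
```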
&lt;/details&gt;

&lt;br/&gt;</content:encoded></item><item><title>Addressable Agents</title><link>https://william.lubel.ski/writing/2026-03-02-addressable-agents/</link><guid isPermaLink="true">https://william.lubel.ski/writing/2026-03-02-addressable-agents/</guid><description>What will it take for agents to feel like coworkers and not power tools?</description><pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenClaw had a moment and other companies are releasing similar features (Notion scheduled agents, Perplexity computer use, etc).&lt;/p&gt;
&lt;p&gt;But they&amp;#39;re all missing the same thing (it&amp;#39;s very McLuhan).&lt;/p&gt;
&lt;p&gt;They&amp;#39;re all trying to invent &amp;quot;the next computer&amp;quot; — they want AI to be an interface paradigm shift like terminal → GUI, desktop → laptop → mobile. (After mobile, VR/AR/XR was the theoretical successor, with middling to bad results.)&lt;/p&gt;
&lt;p&gt;It feels like the thing that&amp;#39;s going to make AI actually useful is giving it an &lt;em&gt;addressable identity&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;And maybe, at least to start, that just means &amp;quot;an email account&amp;quot;.&lt;/p&gt;
&lt;p&gt;The truly useful agent can&amp;#39;t just be in one app — it has to be in any app. It can&amp;#39;t just act &lt;em&gt;as&lt;/em&gt; me, it has to &lt;em&gt;be&lt;/em&gt; something else.&lt;/p&gt;
&lt;p&gt;In every traditional UI shift, it&amp;#39;s still me doing the action, using a new surface.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I sit at my desktop and buy a plane ticket.&lt;/li&gt;
&lt;li&gt;Then I sit on my couch and buy a plane ticket.&lt;/li&gt;
&lt;li&gt;Then I&amp;#39;m on the train and I buy a plane ticket.&lt;/li&gt;
&lt;li&gt;Now everyone is obsessed with... my browser magically buys a plane ticket for me?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No. &amp;quot;My agent buys a plane ticket &lt;em&gt;on behalf of&lt;/em&gt; me&amp;quot;. Say it out loud to yourself. This is a 40 year old solved problem. When a human travel agent in 1995 buys a plane ticket for me, they do so not by impersonating my voice. They do so &amp;quot;on behalf&amp;quot; of me. The system knows the provenance of the fact that they did this. If I call up and buy a ticket myself, the final result is the same, but the records reflect the difference.&lt;/p&gt;
&lt;p&gt;Every one of those shifts gave &lt;em&gt;us&lt;/em&gt; a better tool (and specifically, expanded &lt;em&gt;where&lt;/em&gt; we could use them). A laptop is a portable desktop. A phone is a pocket laptop. Each one widens the surface over which we can coordinate whatever it is we need to get done.&lt;/p&gt;
&lt;p&gt;Each of these were &lt;strong&gt;power tools&lt;/strong&gt;. The gas chainsaw was powerful. Then the electric hedge clipper let us garden more often and with less fuss. We didn&amp;#39;t throw away the chainsaw — we just had more tools in the shed. Each new paradigm is additive: it captures some use cases and opens new ones, but the old surface sticks around. But what if the next shift isn&amp;#39;t another tool for the shed?&lt;/p&gt;
&lt;h2&gt;The Jetsons&amp;#39; Robot Gardener&lt;/h2&gt;
&lt;p&gt;We don&amp;#39;t interact with it in specific new AI channels. We just use the existing plumbing for how we coordinate with human actors. The difference isn&amp;#39;t just capability, it&amp;#39;s also &lt;em&gt;relationship&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Notion and Perplexity and everyone can&amp;#39;t crack the nut of the next big thing in their apps because the next big thing isn&amp;#39;t going to be in &lt;em&gt;one&lt;/em&gt; app. OpenClaw scratched the surface of this but got sidetracked with signal as a &amp;quot;control scheme&amp;quot;.&lt;/p&gt;
&lt;p&gt;These are all going to need one thing: an email address. Or, more generally, a &lt;strong&gt;&lt;em&gt;unique addressable identity&lt;/em&gt;&lt;/strong&gt;. Once we give these things stable addressable identities, I think the floodgates are going to rip open.  &lt;/p&gt;
&lt;p&gt;(Email addresses aren&amp;#39;t cool on their own.  The agent doesn&amp;#39;t even need write access to the email account.  It exists to enable accepting the invite email from Linear, GitHub, Slack, etc: to participate in the systems humans already use, without those systems needing to be rearchitected as &amp;quot;AI Native&amp;quot;)&lt;/p&gt;
&lt;p&gt;Your &amp;quot;coworker,&amp;quot; your &amp;quot;assistant,&amp;quot; whatever its scope and mandate — it&amp;#39;s an email address, a set of memory files, an event bus (message received → run prompt), and a cron job (every 10 min, run prompt — usually go right back to sleep).&lt;/p&gt;
&lt;p&gt;That&amp;#39;s it. That&amp;#39;s the baseline: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;[A] Persistent unique addressable identity (&amp;quot;an email address&amp;quot;)&lt;/li&gt;
&lt;li&gt;[B] Thoughts can span interactions (memory files)&lt;/li&gt;
&lt;li&gt;[C] Can respond to your queries (event bus)&lt;/li&gt;
&lt;li&gt;[D] Can take actions proactively (&amp;quot;cron job&amp;quot;)&lt;/li&gt;
&lt;/ul&gt;
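&lt;p&gt;A toy rendering of that baseline, to show how little machinery it actually demands. Nothing here is a real product API: the address format, memory layout, and wake logic are all invented for illustration.&lt;/p&gt;

```python
# Toy ABCD agent: an address, memory files, a message handler, a tick.
# All names here are hypothetical; none map to a real product API.
import json
from pathlib import Path

class Agent:
    def __init__(self, address, memory_dir):
        self.address = address              # [A] addressable identity
        self.memory = Path(memory_dir)      # [B] memory spans interactions
        self.memory.mkdir(parents=True, exist_ok=True)

    def remember(self, key, value):
        (self.memory / key).write_text(json.dumps(value))

    def recall(self, key, default=None):
        path = self.memory / key
        return json.loads(path.read_text()) if path.exists() else default

    # [C] event bus: message received, run prompt, reply
    def on_message(self, sender, body):
        seen = self.recall("seen", 0) + 1
        self.remember("seen", seen)
        return {"to": sender, "body": "ack %d: %s" % (seen, body)}

    # [D] cron: wake on a schedule, usually go right back to sleep
    def on_tick(self):
        pending = self.recall("pending", [])
        if not pending:
            return None
        self.remember("pending", pending[1:])
        return pending[0]
```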
&lt;p&gt;Every current AI product gets some but not all of these (and executes each with varying quality):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Siri, Alexa, etc.: C only, nothing else&lt;/li&gt;
&lt;li&gt;Claude Code: B, C, a little D, no A&lt;/li&gt;
&lt;li&gt;Notion Agents: D, C, kind of B, no A&lt;/li&gt;
&lt;li&gt;OpenClaw: B, C, D, flickers of A&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is textbook Clay Christensen disruption. Painfully so. Apple, Google, etc. desperately want AI to be a feature that fits into their existing platforms.&lt;/p&gt;
&lt;p&gt;They&amp;#39;re all trying to put the power of Lt. Commander Data into a text input, but none of them wants to ship Lt. Commander Data. (Claude &lt;em&gt;coworker&lt;/em&gt;?  Come on, it&amp;#39;s right there)&lt;/p&gt;
&lt;p&gt;(Now obviously on the show, Data can stand in the ready room and give his status report. Today that part is a robotics problem. But almost any meeting in the ready room on TNG could have been a space-Zoom call, or honestly just a space-email. The point is that &lt;em&gt;the agent can converse, task, and be tasked — same as any other participant.&lt;/em&gt;)&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;A Clay Christensen style analysis&lt;/h2&gt;
&lt;p&gt;Disruption theory says incumbents fail not because they&amp;#39;re stupid but because they&amp;#39;re &lt;strong&gt;rational&lt;/strong&gt;. They listen to their best customers, invest in sustaining innovations, and rationally ignore the low-end or new-market footholds where disruptors start. The disruption happens when the disruptor improves along a trajectory that eventually meets mainstream needs — at which point the incumbent can&amp;#39;t respond.&lt;/p&gt;
&lt;p&gt;Map the ABCD framework onto the competitive landscape:&lt;/p&gt;
&lt;h3&gt;Google&lt;/h3&gt;
&lt;p&gt;Best structural position of any incumbent. They already give identities to everything (Workspace accounts, service accounts). They already have the event bus (Pub/Sub, Cloud Functions). They already have the cron (Cloud Scheduler). Gmail is literally THE identity layer of the internet for a billion people. A Google AI agent with its own &lt;code&gt;agent-for-bob@workspace.google.com&lt;/code&gt; that can send email, read calendar, book meetings, file expenses — they have every piece.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Why they might blow it:&lt;/em&gt; Co-option. Google&amp;#39;s business model is ads, and ads require &lt;em&gt;you&lt;/em&gt; looking at screens. An agent that acts on your behalf means fewer eyeballs. Every incentive pushes toward making the agent a feature of Gmail/Docs/Search (sustaining innovation) rather than an independent entity that reduces screen time. They can&amp;#39;t rationally cannibalize their own attention economy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Microsoft/OpenAI&lt;/h3&gt;
&lt;p&gt;Second best structural position. Azure AD already has identity and delegation primitives. Exchange has had delegate mailboxes for 25 years. They understand &amp;quot;on behalf of.&amp;quot;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Why they might blow it:&lt;/em&gt; Same co-option, different flavor. &amp;quot;Copilot in Teams,&amp;quot; &amp;quot;Copilot in Excel,&amp;quot; &amp;quot;Copilot in Word.&amp;quot; The agent transcends any single product; Microsoft wants it trapped in M365. Their enterprise customers are &lt;em&gt;asking&lt;/em&gt; for &amp;quot;Copilot in my app&amp;quot; — and Christensen says listening to your best customers is exactly how you miss the disruption.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apple&lt;/h3&gt;
&lt;p&gt;They have the identity (Apple ID), the device graph, and the most intimate user relationship. They could give an agent an iCloud email address tomorrow.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Why they&amp;#39;ll almost certainly blow it:&lt;/em&gt; Apple&amp;#39;s entire philosophy is that the human is in control and nothing happens without explicit user action. An autonomous agent with its own identity that sends emails you didn&amp;#39;t individually approve violates Apple&amp;#39;s DNA. It&amp;#39;s not a technical limitation, it&amp;#39;s a &lt;em&gt;constitutional&lt;/em&gt; one. And their best customers — privacy-conscious consumers — are explicitly asking them NOT to do this.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Anthropic/Claude&lt;/h3&gt;
&lt;p&gt;B and C, a little D, no A. Claude Code is the closest thing to the &amp;quot;coworker&amp;quot; framing, but it has no persistent identity in the world. It can&amp;#39;t receive an email. It can&amp;#39;t be addressed by other systems. Each session is an amnesiac (modulo memory files, which are B but fragile B).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Where Anthropic could win:&lt;/em&gt; They&amp;#39;re the least encumbered by a business model that requires eyeballs or lock-in. No ad business (Google), no enterprise suite (Microsoft), no hardware-control philosophy (Apple). Their business model is API calls, and an agent with an email address that autonomously interacts with the world makes &lt;em&gt;more&lt;/em&gt; API calls, not fewer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Why they might blow it:&lt;/em&gt; Safety culture could become resistance. &amp;quot;We can&amp;#39;t give the agent an email address because it might send something harmful&amp;quot; could calcify into a constitutional objection to A and D. Also, they don&amp;#39;t own any of the surfaces — they&amp;#39;d need partnerships with the incumbents who have incentives to block them.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Newcomers / Disruptors&lt;/h3&gt;
&lt;p&gt;The over-served market: knowledge workers who want AI to help them be more productive &lt;strong&gt;in their existing tools&lt;/strong&gt;. Every incumbent is fighting over this.&lt;/p&gt;
&lt;p&gt;The underserved market: small businesses and solo operators who need &lt;em&gt;delegation&lt;/em&gt;, not assistance. A freelancer doesn&amp;#39;t want &amp;quot;Copilot in Excel.&amp;quot; They want someone to handle their invoicing. A 5-person startup doesn&amp;#39;t want &amp;quot;AI in Notion.&amp;quot; They want a back-office person who doesn&amp;#39;t exist yet because they can&amp;#39;t afford one. That&amp;#39;s a &lt;strong&gt;delegation&lt;/strong&gt; relationship, not a &lt;strong&gt;tool&lt;/strong&gt; relationship.&lt;/p&gt;
&lt;p&gt;The disruptor is probably someone currently building a &amp;quot;toy&amp;quot; that incumbents dismiss. It&amp;#39;ll look like &amp;quot;an email address connected to an LLM with a cron job&amp;quot; and the first reaction from Google/Microsoft will be &amp;quot;that&amp;#39;s cute, but it doesn&amp;#39;t have enterprise security features.&amp;quot; By the time it does, it&amp;#39;ll be too late.&lt;/p&gt;
&lt;p&gt;Interestingly, customer-facing agent companies (Intercom, Zendesk) are already building agents with their own identities — email address, memory, actions. They just haven&amp;#39;t generalized beyond customer support. If one of them realizes they&amp;#39;ve already built ABCD for one domain and generalizes it...&lt;/p&gt;
&lt;p&gt;The dark horse: someone who builds the &amp;quot;agent identity provider.&amp;quot; Not the agent itself, but the identity layer. The way Okta/Auth0 became the identity layer for SaaS apps, someone could become the identity layer for AI agents. Issue the agent an identity, manage its permissions, handle the &amp;quot;on behalf of&amp;quot; delegation chain. Every agent builder would use them because building identity is a distraction from building the agent.&lt;/p&gt;
&lt;p&gt;The winners will be whoever recognizes that the set of things worth doing has expanded, rather than doing the old things but fancier:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google, Microsoft, Apple will make agents a feature of their existing products. They&amp;#39;ll bolt an agent onto &amp;quot;user interacts with our app.&amp;quot; This is using the word processor to grade the same test.&lt;/li&gt;
&lt;li&gt;Anthropic has the best shot among established AI companies, but only if they make a bet that feels irresponsible by current safety norms — giving agents persistent identity and autonomy.&lt;/li&gt;
&lt;li&gt;The actual winner probably starts with the identity primitive, not the intelligence primitive. Everyone is competing on who has the smartest model. The disruption will come from whoever figures out that &lt;em&gt;intelligence is becoming commoditized&lt;/em&gt; and the scarce resource is the identity and delegation infrastructure that lets intelligence act in the world.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not &amp;quot;smarter AI.&amp;quot; Not &amp;quot;AI in every app.&amp;quot; Addressable AI entities with provenance.&lt;/p&gt;
</content:encoded></item><item><title>All Bets Are Off</title><link>https://william.lubel.ski/writing/2026-02-03-all-bets-are-off/</link><guid isPermaLink="true">https://william.lubel.ski/writing/2026-02-03-all-bets-are-off/</guid><description>Code got real cheap. What does good software still cost?</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;details&gt;
&lt;summary&gt;Outline&lt;/summary&gt;

&lt;br/&gt;

&lt;p&gt;&lt;strong&gt;§1&lt;/strong&gt; — Zero marginal cost of production.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The cost of production is trending towards zero. Old measurements were bets; the odds moved.
They&amp;#39;ve decoupled from what they tracked. So what now?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;§2&lt;/strong&gt; — Coherence.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Therefore:&lt;/em&gt; Coherence is the new scarce resource. The value of externalizing it went up; the cost
went down. Now we can write it all down.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;§3&lt;/strong&gt; — Leverage.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;But:&lt;/em&gt; Documentation is the foundation. What possibilities does that open up? Automating taste,
the divergent-convergent loop, frequency vs. amplitude. The process is recursive: a ratchet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;§4&lt;/strong&gt; — &amp;quot;Too impractical.&amp;quot;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Therefore:&lt;/em&gt; Here&amp;#39;s what amplitude looks like in practice. Not faster — more thorough. The things
that were never worth doing are now worth doing. These gains compound. Your competitors have the
same tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;§5&lt;/strong&gt; — Path dependence.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;But:&lt;/em&gt; Most orgs will get this wrong. Resistance, co-option, thrash: three ways of refusing to
place new bets.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;§6&lt;/strong&gt; — New bets.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Therefore:&lt;/em&gt; The old bets were real, and now they&amp;#39;re off. Choose wisely.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/details&gt;

&lt;hr&gt;
&lt;h2&gt;Zero marginal cost of production (§1)&lt;/h2&gt;
&lt;p&gt;The naive read of AI coding tools is that we&amp;#39;ll take what we used to do in a month and now we&amp;#39;ll do
it in two weeks, or eventually a week, and eventually a day. Maybe, but the implicit assumption here
is that typing speed was already your limiting factor.&lt;/p&gt;
&lt;p&gt;But AI isn&amp;#39;t changing every aspect of a business in the same way and the same amount. Most of your
business still runs at the speed of business. (Regulatory and compliance still run at the speed of
government.)&lt;/p&gt;
&lt;h3&gt;What the numbers used to mean&lt;/h3&gt;
&lt;p&gt;A codebase with a million lines used to be worth something. Not because a million lines is
inherently valuable, but because someone had to write them. Who would have spent all that time if
the thing didn&amp;#39;t work? A million lines was evidence of a million decisions.&lt;/p&gt;
&lt;p&gt;The million-line codebase is the most dramatic example, but the same thing happened everywhere:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lines of code used to mean effort.&lt;/li&gt;
&lt;li&gt;Coverage used to mean diligence.&lt;/li&gt;
&lt;li&gt;Velocity used to mean capacity.&lt;/li&gt;
&lt;li&gt;A 5K-line PR used to mean something had gone wrong.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these mean the opposite now. A codebase with high coverage might still reflect genuine
diligence. A team with steady velocity might still be well-coordinated. But you can&amp;#39;t tell by
looking at the number anymore. The number doesn&amp;#39;t confirm and it doesn&amp;#39;t deny. It just stopped being
evidence.&lt;/p&gt;
&lt;p&gt;If we haven&amp;#39;t already, we&amp;#39;ll see some series A &amp;quot;scam&amp;quot; acquisitions where a flashy startup gets
acquired and once the purchaser does due diligence, they find that most of the repo is the
scrawlings of a madman. A million lines, generated in weeks, signifying nothing. Worth less than
nothing.&lt;/p&gt;
&lt;h3&gt;Why they all broke at the same time&lt;/h3&gt;
&lt;p&gt;Every one of those was an indirect measure — a bet that the thing you could measure would track the
thing you couldn&amp;#39;t. Lines of code tracked effort. Coverage tracked diligence. Velocity tracked
capacity. These were never the real thing. They were stand-ins.&lt;/p&gt;
&lt;p&gt;They worked because you couldn&amp;#39;t hit the number without doing the work. Writing a thousand lines of
coherent code required understanding the problem. Achieving 85% coverage required thinking about
edge cases. Shipping consistently required genuine team coordination.&lt;/p&gt;
&lt;p&gt;The indirect measures and the real things were linked by production cost. The cost was the
authenticator.&lt;/p&gt;
&lt;p&gt;When production cost drops, every indirect measure authenticated by that cost breaks at the same
time. Not because anyone is gaming the system — in a healthy org, nobody is. But the numbers that
used to require the underlying work no longer do. You can hit every metric on the dashboard and have
done none of the thinking.&lt;/p&gt;
&lt;p&gt;&amp;quot;LOC tells us something useful&amp;quot; was a bet. &amp;quot;Coverage means the code is solid&amp;quot; was a bet. The cost
structure made those safe bets. The cost structure has changed.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Coherence (§2)&lt;/h2&gt;
&lt;p&gt;The goal was and remains: a high-quality product you can efficiently maintain and change over time.&lt;/p&gt;
&lt;p&gt;So what correlates with that now?&lt;/p&gt;
&lt;p&gt;Coherence.&lt;/p&gt;
&lt;p&gt;By coherence I mean the structural property that makes the next change obvious. Not easy,
necessarily — but obvious. You look at the existing patterns and you know where the new code goes,
what it should be called, how it should behave.&lt;/p&gt;
&lt;p&gt;That property exists at every level of the system. At the top it&amp;#39;s an architecture that maps cleanly
to the business domain. In the middle it&amp;#39;s consistent patterns — one way to handle errors, one way
to structure a service. At the bottom it&amp;#39;s naming conventions and file structure that don&amp;#39;t make you
guess.&lt;/p&gt;
&lt;p&gt;If &lt;em&gt;code&lt;/em&gt; trends to zero marginal cost, then well-defined &lt;em&gt;features&lt;/em&gt; start to trend to zero marginal
cost as well. Coherence is what makes a feature well-defined. We&amp;#39;ll come back to why that matters.&lt;/p&gt;
&lt;h3&gt;Incoherence for humans&lt;/h3&gt;
&lt;p&gt;When humans do all the work, coherence lives in two places: the artifacts and the people. The code,
the docs, the tests — and then everything the team just knows.&lt;/p&gt;
&lt;p&gt;That second category is bigger than most teams realize. It&amp;#39;s not just &amp;quot;who understands the billing
service.&amp;quot; It&amp;#39;s the shared scar tissue. &amp;quot;We tried event sourcing in payments and it was a nightmare,
so we use simple CRUD everywhere now.&amp;quot; Nobody documented that as an architectural decision record.
It&amp;#39;s just something the right people know, and they steer new work away from it instinctively. You
don&amp;#39;t document flinches. The space of things you decided not to do is infinite.&lt;/p&gt;
&lt;p&gt;Externalizing all of this — writing it down, keeping it current, making sure it reaches every
engineer who needs it — was a real cost that competed with building. Teams made rational tradeoffs
about how much to externalize.&lt;/p&gt;
&lt;h3&gt;Incoherence for machines&lt;/h3&gt;
&lt;p&gt;LLMs do not have 1:1s with your coworkers. LLMs do not even have memory. Their long-term memory is
the artifacts.&lt;/p&gt;
&lt;p&gt;So now two tradeoffs have fundamentally changed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;em&gt;value from&lt;/em&gt; encoding coherent business thinking into the artifacts goes up.&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;cost of&lt;/em&gt; encoding coherent business thinking into the artifacts goes down.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Take the event sourcing example. When the humans wrote all the code, the three people who remembered
the payments disaster would steer new work away from it. An LLM has no scar tissue. If nothing in
the codebase or the docs says &amp;quot;we don&amp;#39;t do event sourcing in payments,&amp;quot; the LLM will cheerfully
propose it — and generate a clean, well-structured, completely institutionally incorrect
implementation. The value of having that decision written down went from &amp;quot;nice to have&amp;quot; to &amp;quot;the
difference between useful output and output you throw away.&amp;quot;&lt;/p&gt;
&lt;p&gt;And the cost of writing it down dropped. The same tool that can&amp;#39;t intuit the flinch can help you
externalize it. Point the LLM at the payments service, tell it the history, and ask for an
architecture decision record. Five minutes. The document that nobody was going to spend an afternoon
writing now costs almost nothing to produce.&lt;/p&gt;
&lt;p&gt;Now we can write it all down. It&amp;#39;s cheaper than it&amp;#39;s ever been, and it matters more than it ever
has.&lt;/p&gt;
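&lt;p&gt;A minimal sketch of that externalization step. The prompt wording and section names below are illustrative assumptions, not a standard; in practice you hand this, plus the code, to whatever model you use:&lt;/p&gt;

```python
# Sketch: turn the undocumented flinch into an Architecture Decision Record.
# The section names and wording are illustrative assumptions, not a standard.
def adr_prompt(service_path, oral_history):
    """Assemble a prompt asking a model to draft an ADR that records a
    negative decision: the thing the team decided NOT to do, and why."""
    return (
        "Read the service at " + service_path + ".\n"
        "Background from the team: " + oral_history + "\n"
        "Draft an Architecture Decision Record with sections Context, "
        "Decision, and Consequences. Record what we decided NOT to do, "
        "and why, so future work is steered away from it."
    )

prompt = adr_prompt(
    "services/payments/",
    "We tried event sourcing here and it was a nightmare; simple CRUD only.",
)
print(prompt)
```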
&lt;hr&gt;
&lt;h2&gt;Leverage (§3)&lt;/h2&gt;
&lt;p&gt;Documentation is the foundation. What possibilities does that open up?&lt;/p&gt;
&lt;h3&gt;Automating taste&lt;/h3&gt;
&lt;p&gt;Once coherence is externalized, the next question is whether you can measure it. Nobody has a
coherence score for a codebase today. But we&amp;#39;re close.&lt;/p&gt;
&lt;p&gt;Static analysis was the first generation of automated judgment. It could measure what&amp;#39;s mechanically
computable, like cyclomatic complexity. It was a blunt instrument, but it was an honest attempt to
automate taste.&lt;/p&gt;
&lt;p&gt;The next generation is already visible: an LLM running on every CI pipeline, assessing the fuzzier
qualities that previously required a senior engineer&amp;#39;s eye. Does this PR introduce a new pattern
where an existing one would do? Is the naming consistent with the rest of the module? How far has
the actual code drifted from the documented architecture?&lt;/p&gt;
&lt;p&gt;These assessments get scored per-PR, trended over time, and used as guardrails. The coherence score
becomes correctness infrastructure.&lt;/p&gt;
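&lt;p&gt;As a rough sketch of the plumbing such a check needs. The rubric, the prompt wording, and the JSON reply shape here are all assumptions for illustration, not any vendor&amp;#39;s API:&lt;/p&gt;

```python
import json

# Rubric dimensions a reviewer-model is asked to score, 0-10 each.
# The dimensions and the JSON reply shape are illustrative assumptions.
RUBRIC = [
    "pattern_reuse",       # does the PR reuse an existing pattern where one would do?
    "naming_consistency",  # is naming consistent with the rest of the module?
    "doc_drift",           # how far has the code drifted from the documented architecture?
]

def build_prompt(diff_text):
    """Assemble the review prompt sent to the model alongside the diff."""
    dims = ", ".join(RUBRIC)
    return (
        "Score this diff 0-10 on each of: " + dims + ". "
        "Reply with a single JSON object, keys as named, integer values.\n\n"
        + diff_text
    )

def parse_scores(reply):
    """Parse the model's JSON reply, keeping only the rubric keys we asked for."""
    raw = json.loads(reply)
    return {k: int(raw[k]) for k in RUBRIC if k in raw}

def coherence_score(scores):
    """Collapse per-dimension scores into one trendable number (the mean)."""
    return sum(scores.values()) / len(scores)

# In CI you would send build_prompt(diff) to your model of choice;
# here a canned reply stands in to show the plumbing.
canned = '{"pattern_reuse": 8, "naming_consistency": 9, "doc_drift": 6}'
scores = parse_scores(canned)
print(coherence_score(scores))  # mean of the three rubric scores
```

&lt;p&gt;The mean gets logged per PR and trended; the rubric grows as more conventions get written down.&lt;/p&gt;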
&lt;h3&gt;Frequency vs Amplitude&lt;/h3&gt;
&lt;p&gt;In systems design there&amp;#39;s a pattern called divergent-convergent thinking.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Divergent thinking is spitballing. &amp;quot;No bad ideas.&amp;quot;&lt;/li&gt;
&lt;li&gt;Convergent thinking is analysis and verification. &amp;quot;Doing the homework.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The design process is framed as a repeating pattern of divergent → convergent → divergent →
convergent.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   /----------\      /----------\      /----------\
  /            \    /            \    /            \
-*              *--*              *--*              *--&amp;gt;
  \            /    \            /    \            /
   \----------/      \----------/      \----------/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&amp;#39;s often visualized as a diamond: the scope widens, winnows to the practical, then repeats.&lt;/p&gt;
&lt;p&gt;LLMs are comically good spitballers. But they&amp;#39;re mediocre verifiers, almost by definition. Coherence
and correctness infrastructure are investments in guiding the divergent phase and implementing the
convergent phase.&lt;/p&gt;
&lt;p&gt;One option is to use LLMs to run this process at a higher frequency. But doing the same thing you were doing before, only faster... that&amp;#39;s only so interesting, and kind of exhausting.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;   /\  /\  /\  /\  /\  /\  /\  /\
  /  \/  \/  \/  \/  \/  \/  \/  \
──                                ──&amp;gt;
  \  /\  /\  /\  /\  /\  /\  /\  /
   \/  \/  \/  \/  \/  \/  \/  \/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing we all do with a new tool is replicate what we were already doing. That&amp;#39;s not wrong; it&amp;#39;s the natural first step. The task is to not mistake it for the whole of the unforeseeable new options that will open up.&lt;/p&gt;
&lt;p&gt;But frequency isn&amp;#39;t the only dial.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Amplitude&lt;/em&gt;: Could we do drastically more in one cycle than we used to, because doing so
in the old world would have been cost prohibitive or downright impossible?&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;         /----\           /----\           /----\
        /      \         /      \         /      \
       /        \       /        \       /        \
      /          \     /          \     /          \
     |            |   |            |   |            |
     |            |   |            |   |            |
     |            |   |            |   |            |
     |            |   |            |   |            |
    /              \ /              \ /              \
---*                *                *                *---&amp;gt;
    \              / \              / \              /
     |            |   |            |   |            |
     |            |   |            |   |            |
     |            |   |            |   |            |
     |            |   |            |   |            |
      \          /     \          /     \          /
       \        /       \        /       \        /
        \      /         \      /         \      /
         \----/           \----/           \----/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Wider divergent phase: more options explored per cycle, more approaches prototyped, more ideas
tested against reality before committing to one. Wider convergent phase: more thorough verification,
denser correctness infrastructure, the kind of rigor that was always valuable but never budgeted
for.&lt;/p&gt;
&lt;p&gt;And here&amp;#39;s the thing that makes this more than a one-time trick: the process is recursive. The
machine helps you build the convergent infrastructure — the tests, the lint rules, the architecture
docs — and that infrastructure constrains and improves the machine&amp;#39;s next round of divergent output.
Better generation means better infrastructure gets built on top of it. The widening isn&amp;#39;t a single
gesture. It compounds.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Don&amp;#39;t do what you were doing before, but faster. Do the things that were always valuable but
never justifiable under the old cost assumptions.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The engineers and teams that get the most out of this shift won&amp;#39;t be the ones shipping the same
roadmap at higher velocity. They&amp;#39;ll be the ones who recognize that the entire set of things &amp;quot;worth
doing&amp;quot; has expanded, and are systematically exploiting the new tradeoffs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;&amp;quot;Too impractical&amp;quot; (§4)&lt;/h2&gt;
&lt;p&gt;&amp;quot;That refactor isn&amp;#39;t worth it right now.&amp;quot; &amp;quot;Good enough for v1.&amp;quot; &amp;quot;Nobody&amp;#39;s going to write 200 test
cases for that edge case.&amp;quot;&lt;/p&gt;
&lt;p&gt;Every engineering team has a version of these sentences. Here&amp;#39;s what happens when they stop being
true.&lt;/p&gt;
&lt;h3&gt;Not faster. More thorough.&lt;/h3&gt;
&lt;p&gt;I needed to build mock APIs with realistic backing data. In the old world, an engineer spends a day,
writes maybe 2000 lines, covers the happy path and a few known edge cases. That&amp;#39;s a 90% solution,
and everyone agrees it&amp;#39;s good enough, because going further means another day of tedious
hand-written data and there&amp;#39;s other work to do.&lt;/p&gt;
&lt;p&gt;With an LLM, the 90% solution takes an hour. But the other bottlenecks haven&amp;#39;t moved. Code review
still takes the time it takes. Integration still takes the time it takes. Alignment with the team
still takes the time it takes. So raw speed on the implementation isn&amp;#39;t the constraint worth
optimizing.&lt;/p&gt;
&lt;p&gt;The real move is to spend the same half-day you would have spent before, but instead of a 90% mock,
you produce a mock with comprehensive test scenarios, realistic edge cases, failure modes, varied
data shapes — the kind of thoroughness that nobody would have budgeted for previously. Not 90%
faster. 500% more thorough in the same time envelope.&lt;/p&gt;
&lt;p&gt;And that thoroughness isn&amp;#39;t just nice to have. That mock data becomes correctness infrastructure.
Every feature built on top of it now has a richer, more realistic environment to be tested against.&lt;/p&gt;
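&lt;p&gt;A sketch of what that thoroughness looks like in practice: enumerate the awkward data shapes instead of hand-writing the happy path. Every field and value choice below is illustrative:&lt;/p&gt;

```python
import itertools

# Hypothetical mock-API fixture generator: instead of hand-writing the happy
# path, take the cross product of deliberately awkward values so every
# consumer gets exercised against them. All field choices are illustrative.
NAMES = ["Ada", "", "O'Brien", "名前", "x" * 255]           # empty, quote, unicode, max-length
BALANCES = [0, -1, 10_000_000_000, 0.1 + 0.2]               # zero, negative, huge, float noise
STATUSES = ["active", "suspended", "UNKNOWN_FUTURE_VALUE"]  # includes a forward-compat surprise

def make_fixtures():
    """One mock user per combination of awkward values."""
    fixtures = []
    for i, (name, balance, status) in enumerate(
        itertools.product(NAMES, BALANCES, STATUSES)
    ):
        fixtures.append(
            {"id": i, "name": name, "balance": balance, "status": status}
        )
    return fixtures

fixtures = make_fixtures()
print(len(fixtures))  # 5 * 4 * 3 = 60 scenarios from a couple dozen lines
```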
&lt;h3&gt;Exploration collapses into proof&lt;/h3&gt;
&lt;p&gt;Exploration used to have to be budgeted and managed in steps. A proposal, a time allocation, a
research spike, a partial implementation, a review — a multi-week process before anyone sees
concrete results.&lt;/p&gt;
&lt;p&gt;Instead, a proposal doc can arrive with a reference PR attached, quite possibly a full working implementation.&lt;/p&gt;
&lt;p&gt;If the proposal gets refined based on feedback... regenerate the PR. The exploration and the proof
collapse into one artifact.&lt;/p&gt;
&lt;p&gt;This workflow has no analogue in the old world. It&amp;#39;s not a faster version of the old process. It&amp;#39;s a
different process.&lt;/p&gt;
&lt;p&gt;The old world separated &amp;quot;should we do this?&amp;quot; from &amp;quot;can we do this?&amp;quot; because answering the second
question was expensive. &lt;em&gt;When it&amp;#39;s cheap, you just answer both at once.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Let&amp;#39;s get weird&lt;/h3&gt;
&lt;p&gt;Those examples are conservative. Where could the &amp;quot;too impractical&amp;quot; calculus go from there?&lt;/p&gt;
&lt;p&gt;Some things that wouldn&amp;#39;t have survived a planning conversation six months ago:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-healing CLAUDE.md.&lt;/strong&gt; Use the LLM to write a CI job that uses the LLM to analyze a PR for
divergences between the new code and existing CLAUDE.md files. When a PR changes a pattern that
contradicts a CLAUDE.md, generate a proposed CLAUDE.md update and a proposed code revert. Let the
reviewer pick: did we change the convention, or did we violate it? Forces the decision to be
explicit either way.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Convention extraction from code review comments.&lt;/strong&gt; Mine your team&amp;#39;s PR review history for
recurring feedback patterns. &amp;quot;We always ask people to use the error wrapper.&amp;quot; &amp;quot;We always flag
direct database access outside the repository layer.&amp;quot; Generate lint rules from the things humans
keep repeating. Your reviewers have been writing a spec for years — it&amp;#39;s just trapped in GitHub
comments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invariant mining.&lt;/strong&gt; Point the LLM at your test suite and ask it to infer implicit invariants —
things that are true across every test but never stated as a rule. Then generate lint rules or
property tests that enforce them explicitly. The tests knew something the codebase didn&amp;#39;t say out
loud.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test generation from prod incidents.&lt;/strong&gt; When a bug hits production, have the LLM write a
regression test, but also have it scan for structurally similar code paths and generate
speculative tests for those too. The incident becomes a pattern detector, not just a point fix.
Every bug you find makes the next bug harder to ship.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PR-to-PR pattern drift.&lt;/strong&gt; Track the patterns introduced across the last N merged PRs. Flag when
the same problem is being solved three different ways across three PRs by three people (or three
LLM sessions). Nobody sees drift in real time. An LLM reading across PRs can.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture doc staleness detector.&lt;/strong&gt; LLM reads the actual code, reads the architecture docs,
flags divergence. &amp;quot;The docs say payments uses REST, but there are three gRPC endpoints now.&amp;quot;
Reverse the usual flow — instead of updating docs from decisions, update docs from reality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mutation testing on steroids.&lt;/strong&gt; Have the LLM generate semantically meaningful mutations — not
random bit flips, but plausible mistakes an LLM might actually make. &amp;quot;What if someone used
optimistic locking here instead of pessimistic?&amp;quot; If the test suite doesn&amp;#39;t catch it, that&amp;#39;s a real
gap, not a synthetic one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependency impact simulation.&lt;/strong&gt; Before upgrading a dependency, have the LLM read the changelog
and your usage of the library, then generate a set of &amp;quot;things that might break&amp;quot; as test cases. Run
them before you upgrade. Turn the changelog into a pre-flight checklist.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
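&lt;p&gt;The convention-extraction idea is the easiest to sketch. A real version would pull comments from the GitHub API and cluster them with an LLM; here, crude bigram counting stands in to show the shape:&lt;/p&gt;

```python
import re
from collections import Counter

# Filler words dropped before fingerprinting; the list is illustrative.
STOP = {"please", "nit", "typo", "the", "a", "an", "of", "to", "see", "here", "instead"}

def fingerprints(comment):
    """Content-word bigrams of a review comment, with inline code spans
    and filler words stripped out."""
    text = re.sub(r"`[^`]*`", " ", comment.lower())
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOP]
    return list(zip(words, words[1:]))

def recurring_feedback(comments, top=3):
    """The most-repeated bigrams across review comments: the candidates
    worth promoting into lint rules."""
    counts = Counter()
    for c in comments:
        counts.update(fingerprints(c))
    return counts.most_common(top)

comments = [
    "Please use the error wrapper here",
    "use the error wrapper, see `errors.Wrap`",
    "Use the error wrapper instead of raw raise",
    "nit: typo",
]
print(recurring_feedback(comments))
```

&lt;p&gt;Three differently-worded comments collapse onto the same fingerprint. The spec was in the comments all along; counting just surfaces it.&lt;/p&gt;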
&lt;p&gt;Every one of these was a non-starter under the old cost structure. Not because they were bad ideas, but because the implementation hours dwarfed the payoff.&lt;/p&gt;
&lt;p&gt;And none of them is an isolated win. Each one produces infrastructure that the others consume. The system&amp;#39;s fabric gets stronger with every piece you add.&lt;/p&gt;
&lt;p&gt;These gains compound. Each piece of infrastructure improves the next rounds of generation and
verification, which means the next piece of infrastructure lands better too.&lt;/p&gt;
&lt;p&gt;Your competitors have the same tools. The question is whether they&amp;#39;re investing in this coherence infrastructure or just trying to turn the crank faster.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Path dependence (§5)&lt;/h2&gt;
&lt;h3&gt;The honest response&lt;/h3&gt;
&lt;p&gt;Everything in §1-§4 requires changing how you work. A natural response to that, when you&amp;#39;ve spent
years getting good at the old way, is: no. That&amp;#39;s not irrational. It&amp;#39;s protective.&lt;/p&gt;
&lt;p&gt;When React arrived in 2013, it violated every established best practice in frontend engineering.
Some developers called it a fad. They were wrong, but their skepticism wasn&amp;#39;t stupid — it was
calibrated to a world where those best practices had been genuinely correct. What changed wasn&amp;#39;t the
quality of their judgment. What changed were the constraints their judgment was calibrated to.&lt;/p&gt;
&lt;p&gt;That&amp;#39;s resistance. It says: this threatens something real, and I&amp;#39;m not ready to let go of it.&lt;/p&gt;
&lt;h3&gt;The worse failure mode&lt;/h3&gt;
&lt;p&gt;There&amp;#39;s another response, and it&amp;#39;s more dangerous. Call it co-option. This is where you technically
adopt the new thing but use it to preserve every existing structure. Same org chart, same process,
same estimation methods, same job descriptions — now with a subscription.&lt;/p&gt;
&lt;p&gt;You can already see it. Jira integrations that auto-generate status updates. Sprint retrospective
summarizers. AI-powered ticket estimation. Same work, same structure, same assumptions — now with a
chatbot bolted on. And co-option is self-reinforcing: the tooling creates jobs, the jobs create
advocates, the advocates entrench the tooling. This will happen at massive scale.&lt;/p&gt;
&lt;h3&gt;The worst failure mode&lt;/h3&gt;
&lt;p&gt;There&amp;#39;s a third response, and it&amp;#39;s the most destructive. Call it thrash. This is where someone fully
embraces the new tool, points it at everything, and generates at full speed with no spec, no
architecture, no convergence infrastructure — just output. PRs pile up. Code ships. Activity is
visible on every dashboard. And the codebase gets worse on every merge, because volume without
direction isn&amp;#39;t progress. It&amp;#39;s the politician&amp;#39;s syllogism applied to engineering: AI is
transformative; I am using AI; therefore I am transforming.&lt;/p&gt;
&lt;p&gt;Resistance preserves the old structure by refusing the new tool. Co-option preserves the old
structure by absorbing the new tool. Thrash destroys the old structure and replaces it with nothing.
All three end up in the same place: no coherence, no compounding, no infrastructure that makes the
next cycle better. These are three ways of refusing to place new bets.&lt;/p&gt;
&lt;h3&gt;Banning the word calculator vs. using GPT to grade the same old assignment&lt;/h3&gt;
&lt;p&gt;An education parallel captures the first two failure modes cleanly. Banning GPT essays is resistance
— honest, protective, ultimately a losing move because the word calculator isn&amp;#39;t going away. Using
GPT to auto-grade the same five-paragraph essays is co-option: technically it&amp;#39;s &amp;quot;adopting AI&amp;quot;, but it preserves the exact measurement that stopped measuring what it was supposed to measure.&lt;/p&gt;
&lt;p&gt;The hand-written essay was an indirect measure of critical thinking. If the measure is dead, the right move is neither banning the tool nor automating the old measure. It&amp;#39;s raising the bar: teaching critical thinking with the tools students will actually encounter in the world.&lt;/p&gt;
&lt;h3&gt;The distinction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Resistance is at least honest about the stakes.&lt;/li&gt;
&lt;li&gt;Co-option pretends the stakes don&amp;#39;t exist.&lt;/li&gt;
&lt;li&gt;Thrash pretends the work is the stakes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of the three, co-option and thrash are harder to fight, because both look like progress.&lt;/p&gt;
&lt;p&gt;All three are cultural problems wearing technical clothes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Resistance is an identity problem.&lt;/li&gt;
&lt;li&gt;Co-option is a bureaucratic self-preservation problem.&lt;/li&gt;
&lt;li&gt;Thrash is a leadership problem.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The technical prescription — coherence and correctness infrastructure, compounding — is necessary but not sufficient. The organizational self-reflection required to actually adopt it is a different essay.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;New bets (§6)&lt;/h2&gt;
&lt;p&gt;Everyone views the new thing in the lens of the old. It&amp;#39;s the only lens we have to start. The
question is whether you&amp;#39;re going to get stuck there, or whether you can start to acquire new lenses.&lt;/p&gt;
&lt;p&gt;The engineers who called React a fad weren&amp;#39;t wrong about their craft. The teachers banning AI essays
aren&amp;#39;t wrong about critical thinking.&lt;/p&gt;
&lt;p&gt;Every heuristic, every indirect measure, every definition of &amp;quot;worth doing&amp;quot; was a bet placed against
a specific cost structure. Lines of code measured effort because effort was expensive. &amp;quot;Good enough
for v1&amp;quot; was rational because thoroughness cost more than it saved. Estimation worked because
implementation was the bottleneck. These &lt;em&gt;were&lt;/em&gt; all good bets. They paid off for years.&lt;/p&gt;
&lt;p&gt;But the cost structure is being rewritten, and now they might not be.&lt;/p&gt;
&lt;p&gt;The indirect measures decoupled from what they measured. The set of things worth doing expanded past
what the old calculus can see. And the most dangerous response isn&amp;#39;t refusing to adapt — it&amp;#39;s
adopting the new tools to preserve the old assumptions.&lt;/p&gt;
&lt;p&gt;All bets are off. New table. No limit. Choose wisely.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Appendix&lt;/h2&gt;
&lt;details&gt;
&lt;summary&gt;Related thoughts&lt;/summary&gt;

&lt;h3&gt;Craftsmanship&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/The_Woodwright%27s_Shop&quot;&gt;The Woodwright&amp;#39;s Shop&lt;/a&gt; (1979-2017)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/The_New_Yankee_Workshop&quot;&gt;The New Yankee Workshop&lt;/a&gt; (1989-2009)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We have all been writing code for 50 years like Roy Underhill. Those who decide to retain their
artisanal path but adopt the new tools will go the way of Norm Abram. Those who choose to follow the
path of scale will have to learn some patterns that may feel a lot like the advent of the modern
factory. It&amp;#39;s bittersweet to see the twilight of the golden age.&lt;/p&gt;
&lt;h3&gt;Sea Change&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=fij_ixfjiZE&amp;t=329s&quot;&gt;Margin Call&lt;/a&gt; (2011)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a sea change comes, it is not obvious to most until the time for meaningful action has long passed.&lt;/p&gt;
&lt;/details&gt;

&lt;br/&gt;</content:encoded></item></channel></rss>