The irony of the LLM treadmill
A strange new burden has crept into software teams: the LLM treadmill. Many models retire within months, so developers now continuously migrate features they only just shipped.
Our team feels this new pain sharply. We support many LLM-based features, which together handle massive token volumes each month. But I suspect even teams with only a small LLM dependency feel this frustration.
Example: A migration affected by the “jagged frontier”
You can treat these migrations like any other software version bump. But users dislike adapting to change that is only mostly better. And since LLMs are weird, and their upgrades lately jagged, simply bumping the version can get quite messy.
Consider a common scenario: a feature in your product is powered by a clever, “vibe-based” prompt. It worked surprisingly well on a popular model, so you shipped it and iterated on it when users gave feedback.
Then came the model’s deprecation notice. Time to migrate. When you migrated the same feature last year, the version bump was a clear and easy win. Hopefully again!
Only this time, the new model makes the feature feel different. It’s sometimes better, sometimes worse; the prior model had a special knack for the task. You worry about forcing your users to adapt.
This pushes you to graduate. You formalize the task, annotate high-quality examples, and fine-tune a replacement model. You now have a more robust solution with much-improved quality, all because the treadmill forced you to build it right.
Was all that necessary?
Migrations are risky opportunities.
A recent example is ChatGPT’s move to GPT-5. The chat got smarter but lacked 4o’s personality. Many users were unhappy and wanted it rolled back [1]. It’ll take another migration to fix properly [2].
So what should you do when your vibe-prompted LLM is to be sunset?
If a newer model makes a mediocre feature feel great, take it quickly.
Otherwise, move beyond feel. Really break down what people like in your feature.
And this takes serious effort … just to migrate.
But it pays for itself. Once the nuances of “good” are measurable, you can make your feature even better. In the above scenario, the new smarter model is often also 10x cheaper and 2x faster. And it’ll be easier to migrate next time, as you already have your nicely annotated dataset.
My team does this often. We ship a v1 with prompting. A model gets deprecated. We nail down “good” → measure it → kick off an optimization loop.
We try new prompts, alternative models, and sometimes tune our own. We usually end up faster, cheaper, sturdier, and consistently higher quality than the vendor’s version bump that forced the whole process.
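To make that loop concrete, here is a minimal sketch of what “nail down ‘good’ → measure it” can look like. Everything in it is illustrative rather than from our actual setup: the examples.jsonl dataset, the must_include rubric standing in for whatever “good” means for your feature, the placeholder model names, and the assumption that each candidate is reachable through an OpenAI-compatible chat endpoint.

```python
# A minimal sketch of the measure-then-compare loop described above. All of the
# specifics here are illustrative: the examples.jsonl file, the "must_include"
# rubric standing in for your definition of "good", the model names, and the
# assumption that candidates sit behind an OpenAI-compatible chat endpoint.
import json
from statistics import mean

from openai import OpenAI

client = OpenAI()  # assumes an API key (or a compatible base_url) is configured


def call_model(model: str, prompt: str) -> str:
    """Run one prompt through one candidate model."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""


def score_output(output: str, example: dict) -> float:
    """Toy rubric: 1.0 if the output contains every annotated must-have phrase.
    Replace with whatever actually captures "good" for your feature."""
    return float(all(phrase in output for phrase in example["must_include"]))


def evaluate(model: str, examples: list[dict]) -> float:
    """Average rubric score of a model over the annotated dataset."""
    return mean(score_output(call_model(model, ex["prompt"]), ex) for ex in examples)


if __name__ == "__main__":
    # One annotated example per line: {"prompt": ..., "must_include": [...]}
    with open("examples.jsonl") as f:
        examples = [json.loads(line) for line in f]

    # The deprecated model, the vendor's suggested replacement, and whatever
    # else the optimization loop is trying (other vendors, a fine-tune).
    for candidate in ["old-model", "suggested-replacement", "our-fine-tune"]:
        print(candidate, evaluate(candidate, examples))
```

Once a harness like this exists, the next deprecation notice becomes a rerun of the comparison rather than another round of vibe-checking.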
Awkwardly for the vendor, that often churns spend away from them.
And that’s the irony of the LLM treadmill: short model lifespans force even happy API customers to keep reconsidering their vendor. And the better the customer (the more features they’ve built and maintain), the more forceful this push away becomes.
Seems like a hard way to do business.
OpenAI’s push is gentler
Each big AI lab [3] forces a different “treadmill”, with OpenAI’s so far offering the most self-determination.
Google’s Gemini models retire one year from release [4]. New releases are often priced quite differently (↕) [5].
Anthropic’s retirements can occur with just 60 days’ notice [6] once a model is a year past release. Pricing has been flat since Claude 3’s release (except for Haiku ↑).
And OpenAI’s treadmill is more developer friendly:
Models are supported for longer (even GPT-3.5-Turbo and Davinci are still supported).
Upgrades often arrive with a lower price ↓.
Labs’ diverging focus
The contrast between OpenAI and Anthropic becomes clearer when you look at how they position their models.
At their recent DevDay, OpenAI showcased a long list of top customers. What struck me was how varied that list is: all kinds of unicorns (consumer, business platforms, productivity, developer tools, and more) seem to be heavily using OpenAI’s API.
This was consistent with how OpenAI positioned GPT-5 upon release: as a model intended to tackle a broad range of tasks.
Anthropic, in contrast, appears to be specializing. In their Claude 3 announcement, Anthropic touted a wide array of uses. By their Claude Sonnet 4.5 release, they more narrowly positioned Claude as the best “coding model”. And code tools are reportedly an increasingly large part of their revenue.
I don’t think it’s a coincidence that the vendor with the friendlier treadmill has kept a wider base of software built on its models. I also wonder if this is self-reinforcing in how the big labs iterate on their product-market fit.
Where I think this is headed
I think software teams will keep following their incentives. If model migrations cost more than they deliver, those teams will grow tired. They’ll reclaim control over quality and roadmap prioritization, either by self-hosting models or by moving to labs with friendlier policies.
That said, I’m also optimistic that the big AI labs will see this and fix the underlying driver. I hope they’ll commit to long-term support of their models. The pain today might just be a growing pain of a new industry. The LLM treadmill may, in time, disappear.
The exception to all of this is coding tools. That crowd sees pure upside from each new model, so Anthropic’s focused bet on code will likely continue to compound. But for the rest of us building AI-powered app features, navigating the treadmill has become a very real and pressing problem.
Find this work interesting? We’re hiring!
You can also follow me on X.
[3] Not limited to the closed labs. E.g. last week a unicorn open-model inference provider gave just weeks’ notice for three deprecations; one of the deprecated models had itself been the suggested “migrate to” target just 85 days earlier. Open models can at least be moved between vendors, though.
[4] We’ve experienced capacity issues in the months ahead of model retirement, so out of an abundance of caution we now treat these as 10-month releases rather than 12-month.
[5] E.g. the leap from Gemini Pro 1.0 → 1.5, or the leap from Flash 1.5 → 2 → 2.5.
[6] As experienced with their popular Sonnet 3.5 models.

