Armin is wrong and here's why

2025-11-22

The offending post.

I consider Armin a friend, so the following is me just rambling at him like I would over a coffee in real life. I most likely just misunderstood his blog post. Read his post first to make any sense of mine. Here's a shitty TL;DR: He claims LLM APIs are secretly a state synchronization problem because providers support prefix caching and hide state from you, and proposes we look at local-first/CRDT-style solutions.

Armin, I'm gonna wall of text you. Think of me as one of your slightly demented, contrarian HN-style blog post bot reviewers.

The reality of messages

My first contention is with your claim that the message abstraction is different from the underlying reality:

"The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we've standardized around."

"The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time."

But the messages abstraction is baked into the model weights via the prompt format. Token sequences you feed to the LLM directly have to conform to it. Messages map directly to the prompt format. Yes, providers might inject additional tokens, but we don't have access to those anyway, so it doesn't matter (this will be a recurring theme). This is the underlying reality, not an abstraction hiding something simpler. I'm doubtful you could design a cleaner API than a trivial JSON transform.

Here's the Harmony response format gpt-oss was trained on:

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|>

<|start|>developer<|message|># Instructions

Always respond in riddles

# Tools

## functions

namespace functions {

// Gets the location of the user.
type get_location = () => any;

// Gets the current weather in the provided location.
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string,
format?: "celsius" | "fahrenheit", // default: celsius
}) => any;

} // namespace functions<|end|>

<|start|>user<|message|>What is the weather like in SF?<|end|>

<|start|>assistant

System, developer, tools, user, assistant. It's the messages API with angle brackets (special tokens, really, which are part of the LLM's vocabulary).
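
To underline how thin the "trivial JSON transform" really is, here's a simplified sketch in Python (it skips channels, tool schemas, and the date handling shown above):

def render_harmony(messages):
    # Wrap each {role, content} message in the special tokens the model was
    # trained on, then open an assistant turn for the model to complete.
    parts = [f"<|start|>{m['role']}<|message|>{m['content']}<|end|>" for m in messages]
    parts.append("<|start|>assistant")
    return "".join(parts)

prompt = render_harmony([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the weather like in SF?"},
])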

Where's your state sync problem at?

In its initial iterations, the messages abstraction was just that: a nice way to avoid dealing with chat template formats. The provider could add extra info behind your back without sending state back to you. You just appended the LLM response. All was nice and stateless server-side (from the endpoint user's perspective). You send your system prompt, tools, and user message, append the assistant response, then send the new user message along with the full history, rinse, repeat. Easy peasy.
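
In code, that stateless loop is just this (sketched against an OpenAI-style chat completions client, error handling omitted):

from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def turn(user_text):
    # Append the new user message, resend the entire history, then append
    # the assistant response. The server keeps nothing between calls that
    # we depend on.
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply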

Then providers started committing crimes against statelessness: stuffing stateful info into the perfectly fine stateless messages abstraction. Opaque thinking blobs. Server-side tool result blobs or IDs. Some providers make you explicitly mark segments for prefix caching. All to match your inputs against cached server state and protect their IP.

But this is... fine, actually? Yes, you now have to retain opaque data sent from the server as part of your client-side state. But that's no different than appending the LLM response. You still have fully replayable state from your perspective on your end.

Can the server reconstruct thinking traces or tool results from the opaque blobs after a long delay? Unclear. But it doesn't matter: you never had access to that hidden state anyway. This is what we get, and it's unlikely to change. Providers are protecting their IP.

As far as you're concerned, the server is still essentially stateless. You're not constructing opaque blobs and syncing them back. You're just echoing what the server gave you. Like an HTTP-only cookie: client-side storage of server state or cache keys, sent back with each request so the server can reconstruct the full "hidden" state. The only state you manage is what you can observe and modify: your messages and the LLM's responses. Cache markers are optimization hints, not shared mutable state that requires synchronization across multiple peers. Everything else is conforming to an API contract. That's not a sync problem in the local-first sense.
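
To make "just echoing what the server gave you" concrete, a provider-agnostic sketch (the block shapes are illustrative, not any particular provider's schema):

def record_assistant_turn(messages, response_blocks):
    # Store the assistant turn verbatim, including any opaque blocks the
    # provider returned (thinking signatures, server-side tool result IDs,
    # and so on). We never parse or mutate them; we only echo them back on
    # the next request, like a browser resending an HTTP-only cookie.
    messages.append({"role": "assistant", "content": response_blocks})

def build_next_request(messages, user_text):
    # The opaque blocks ride along untouched inside the message history.
    messages.append({"role": "user", "content": user_text})
    return {"messages": messages}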

Switching providers

Of course, storing these opaque blobs makes your local state provider-specific. If you switch from Provider A to Provider B mid-session, you need to transform your state: drop A's opaque blobs, reformat for B's API. Provider B can't reconstruct A's hidden thinking traces or server-side tool results. But you never had access to those anyway. Providers wouldn't share those with each other either. Similarly, Provider A's KV cache can't be transferred to Provider B. From your perspective, you're still in fully replayable client-side-only state land, just with some unavoidable data loss. There's no interoperability for this stuff, and there likely never will be.

A local-first/CRDT solution won't help with that either.
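
If you do have to make that mid-session switch, the transform is mundane and lossy by construction. A rough sketch (the block types and helper are hypothetical stand-ins for whatever Provider A marks as opaque):

def to_provider_b(messages):
    # Keep everything observable (text, tool calls, tool results), drop
    # Provider A's opaque blobs that Provider B can't interpret anyway.
    converted = []
    for message in messages:
        content = message["content"]
        blocks = content if isinstance(content, list) else [{"type": "text", "text": content}]
        kept = [b for b in blocks if not is_opaque_provider_a_block(b)]
        if kept:
            converted.append({"role": message["role"], "content": kept})
    return converted

def is_opaque_provider_a_block(block):
    # Hypothetical predicate: match whatever block types Provider A treats
    # as opaque (thinking signatures, server-side tool references, cache keys).
    return block.get("type") in ("thinking", "server_tool_ref")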

So what's the actual problem?

There are real issues here though. Armin points out:

"In completion-style APIs, each new turn requires resending the entire prompt history. ... the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step."

Yep, that's terrible. Providers have created API endpoints to upload large attachments like images or files, reducing the data you send when your session updates with new messages. Text and tool results are tiny compared to attachments, so even with quadratic growth, the actual bytes are manageable.

Is this a sync problem? No. The server has the file, you have a reference. If the file expires, you re-upload. Server is authoritative. Standard resource management.
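
A sketch of that resource management, with provider.upload_file and provider.file_exists standing in for whatever file endpoints your provider actually exposes:

def ensure_uploaded(provider, local_path, file_ids):
    # The provider is authoritative for the uploaded copy; we keep the file
    # locally so we can re-upload if the provider's reference has expired.
    file_id = file_ids.get(local_path)
    if file_id is None or not provider.file_exists(file_id):
        file_id = provider.upload_file(local_path)
        file_ids[local_path] = file_id
    return file_id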

But then we meet our old friend: lack of interoperability. If you use those upload APIs and want to switch providers, you're in for a world of hurt. Well, actually it's not that bad. With completions you have to store the entire session state, including the files, on your end anyway. So all you need to do is re-upload everything to the other provider. And if you switch multiple times within the same session, well, good luck. But again, not a local-first style sync problem. For it to become one, providers would need to be interoperable, which is not going to happen.

Armin also brings up the Responses API:

"One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server"

While that's not quite the reason OpenAI themselves give for introducing the Responses API, I concur fully with Armin's critique of the implementation. It's not great. And the documentation is severely lacking, especially concerning edge cases like network splits. Which is probably why you can set store: false, in which case you are in charge of managing the full observable state again, just like with completions (with the thinking blob caveat).

But is the Responses API a sync problem? The server is authoritative. If something goes wrong, you query the state... well, you would, if the API provided that functionality. Since it doesn't, you can either rely on thoughts and prayers, or do a full restart.
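
For completeness, here is what the store: false escape hatch looks like in practice, sketched against the OpenAI Python SDK (the exact item shapes may differ; the point is that the full history lives on your side and is resent each turn):

from openai import OpenAI

client = OpenAI()
history = []  # the full observable session state, kept by you

def turn(user_text):
    history.append({"role": "user", "content": user_text})
    response = client.responses.create(
        model="gpt-4o",
        input=history,
        store=False,  # the server keeps no session we could query anyway
    )
    # Keep the returned output items verbatim so opaque bits (e.g. reasoning
    # items) ride along on the next request.
    history.extend(response.output)
    return response.output_text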

One problem Armin does not mention in his blog post, which I've butted my head against: some providers give LLMs access to VMs where they can store files and state. It's essentially all server-side tools backed by server-side containers for the LLM to play with: it creates files, code, and artifacts, modifies the container environment, and relies on all of that to evolve the session state turn by turn. That's the biggest possible state fuckery imaginable. There's no good way for you to extract that full state, just like with thinking traces.

But thinking traces are a nice-to-have comparatively. Your agent using a container to build up state it requires to fulfill its task is a whole new ballgame. A loss of that hidden state is actually catastrophic for that agent run, while a loss of thinking traces can likely be survived. And again, interop is not a thing and never will be. Solution? Just use the model, manage containers and artifact storage yourself as part of your "LLM-client-side" state.
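
What "manage containers and artifact storage yourself" can look like, as a minimal sketch (a local workspace directory standing in for a real sandbox or container):

import os
import subprocess

class LocalSandbox:
    # The execution environment the model's tools operate on lives on your
    # side: its working state is just files in a directory you can snapshot,
    # version, and replay like any other client-side state.
    def __init__(self, workspace):
        self.workspace = workspace
        os.makedirs(workspace, exist_ok=True)

    def write_file(self, relative_path, content):
        full_path = os.path.join(self.workspace, relative_path)
        os.makedirs(os.path.dirname(full_path) or self.workspace, exist_ok=True)
        with open(full_path, "w") as f:
            f.write(content)
        return f"wrote {relative_path}"

    def run(self, command):
        result = subprocess.run(command, shell=True, cwd=self.workspace,
                                capture_output=True, text=True, timeout=60)
        return result.stdout + result.stderr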

Armin is right actually, but

Local-first principles can't realistically govern a provider's internal state as long as they keep it hidden. And for closed SaaS LLMs, they will. Exposing full internal state would leak proprietary signals, make it easier for competitors to clone or fine-tune against their models, and lock providers into internal architectures they can't freely change. Wishing for "local-first friendly" APIs where all hidden state is exportable is nice in theory. It's just not going to happen with closed providers.

The best closed LLM providers can do is expose a simulacrum or key identifier for that hidden state. You can include those identifiers in your local canonical state, but they only let you resume on the provider as long as it keeps the backing data around. Under a local-first lens that is not enough. You would need to be able to replay that state from your own data if the provider lost everything. With opaque identifiers, you cannot do that. If the backing data is deleted or expires, there is nothing you can reconstruct yourself. That is the situation with thinking traces (depending on the provider), VM and container state, and likely other hidden state on the provider side that we never get to see.

However, it is useful to look at your side of this from a local-first perspective. You need to construct your systems so that hidden provider state does not matter for your tasks. Under that lens, thinking traces are just remote, transient scratchpad state the provider uses to produce answers, not part of the document you are actually syncing. It is unlikely that you will experience catastrophic failure if you cannot replay those hidden traces when you recover a session that errored out. For your local-first canonical state, you should treat thinking traces as a nice-to-have, not as an essential part. What ultimately counts are the actual LLM outputs and tool results that you store in your own canonical state.

A Responses-style, fully server-side managed session store goes against local-first as long as providers do not expose that state in full to you, which is currently the case for OpenAI's Responses API. From a local-first perspective, you have to treat that server session as a cache and store the full session state on your side instead. That means taking the egress hit from the quadratic cumulative growth of data you send on each turn. In practice this is manageable if you also use the provider's file APIs.

The prefix cache is derived state in the local-first lens and does not need any special treatment. You should treat it as a pure performance optimization. Submitting cache checkpoint markers as part of your messages JSON is maybe a bit icky, but it is simple enough to ignore as a real problem.
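
For reference, this is roughly what such a checkpoint marker looks like in Anthropic's Messages API: a content block tagged so that everything up to and including it becomes the cacheable prefix. It's a hint, not state you have to reconcile.

cached_user_message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "<long document or context goes here>",
            # hint: everything up to and including this block may be prefix-cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
}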

Which brings me to the final issue: provider-managed execution environments. There is simply no hope that this can fit into a local-first approach where you, as the endpoint consumer, get enough information from the provider to do a full replay. The VM or container state is opaque and you cannot export or reconstruct it from your own data. Under a local-first lens, your only real option is to manage execution environments yourself and make their state part of your own canonical state.

In short, local-first is still a useful lens, but only on your side of the wire. For closed SaaS LLMs you have to treat the provider as a black box that computes over your canonical log and your data, with a pile of opaque, lossy caches and execution state on their end. Anything that actually matters for correctness or recovery has to live with you.

All of this also circles back to the messages abstraction. I do not think messages are the problem here, and I doubt they are going away. The models are literally trained on a chat template, and the thing you actually keep in your local canonical state is a list of messages, tool calls, and file references. Under a local-first lens that list is your document. Everything else on the provider side, from prefix caches and thinking traces to session stores and VMs, is derived scratchpad state. Throwing out the messages abstraction in favour of some "purer" token or state API would not fix any of the hidden state issues with closed SaaS LLM providers. It would just move the complexity around. The only place local-first really applies is how you store and replay your own messages and data, not how the provider arranges its internal guts.
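
Put differently, the "document" you own could be as small as this (a sketch; the field names are mine):

from dataclasses import dataclass, field

@dataclass
class Session:
    # The local-first document: everything needed to replay the session.
    messages: list = field(default_factory=list)        # user/assistant/tool turns
    file_refs: dict = field(default_factory=dict)        # local path -> provider file id, re-uploadable
    provider_hints: dict = field(default_factory=dict)   # opaque blobs, cache markers: droppable on provider switch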

Bonus point: if you apply local-first on your end, you'll also have an easier time moving between providers.

In conclusion

I have too much disposable time after putting the boy to bed, and I think Armin is wrong on the internet.
