The infrastructure gap in agent systems

Demo agents vs production agents

A demo agent is easy. You chain a few LLM calls, write some tool definitions, pipe the output somewhere, and it works. Once, in isolation, with a forgiving user watching. The demo is impressive. Then someone puts it in front of real traffic.

Production is different. Production means: the job runs at 3am with nobody watching. The external API times out halfway through. The model returns something subtly malformed. A downstream step fails and the whole thing needs to retry, but only from the step that failed, not from the beginning. The run took 40 seconds and your user wants to know what happened.

Most agent stacks aren't built for any of that.

What breaks first

The first thing that breaks is state. Demo agents are stateless: each invocation starts fresh. But real workflows need memory. What did this agent do last time? What did the retrieval step return? If you restart a failed run, where do you resume?

The second thing that breaks is durability. Lambda-style execution (fire and forget, hope it completes) falls apart the moment your workflow outlives a single request window or a single process. Temporal addressed this for general-purpose workflows by persisting execution state so a run can survive a process failure. The same model applies to agents: execution needs to be durable, checkpointed, and resumable.
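
To make "checkpointed and resumable" concrete, here is a minimal sketch in Python. It is not how Temporal or any particular product implements durability. The names run_workflow and Step, and the JSON checkpoint file, are assumptions made up for illustration. The idea is only that state is persisted after every completed step, so a rerun skips work that already finished.

```python
import json
import os
from typing import Any, Callable

# Hypothetical step type: receives the accumulated state, returns keys to merge into it.
Step = Callable[[dict[str, Any]], dict[str, Any]]


def run_workflow(steps: list[tuple[str, Step]], checkpoint_path: str) -> dict[str, Any]:
    """Run named steps in order, persisting state after each one completes."""
    # Load the checkpoint left behind by an earlier attempt, if there is one.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
    else:
        saved = {"completed": [], "state": {}}

    for name, step in steps:
        if name in saved["completed"]:
            continue  # finished in an earlier attempt, so do not repeat it
        # If the step raises, nothing below runs and the last checkpoint stays intact.
        saved["state"].update(step(saved["state"]))
        saved["completed"].append(name)
        # Write to a temp file and rename it, so a crash cannot leave a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(saved, f)
        os.replace(tmp_path, checkpoint_path)

    return saved["state"]
```

Calling run_workflow a second time with the same checkpoint path after a failure resumes at the step that failed. A real system would also need to handle non-JSON state, concurrent runs, and steps with side effects that are not safe to repeat, none of which this sketch attempts.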

The third thing that breaks is observability. When a workflow fails in production, you need a trace. Not a log, but a structured record of every node that ran, every input it received, every output it produced, and exactly where the chain broke.
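
As a rough illustration of the difference between a log and a trace, the sketch below records one structured entry per node: its input, its output or error, and how long it took. Trace, NodeRecord, and run_node are hypothetical names invented for this example, not an existing library's API.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class NodeRecord:
    """One entry per node execution."""
    node: str
    input: Any
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


@dataclass
class Trace:
    records: list[NodeRecord] = field(default_factory=list)

    def run_node(self, name: str, fn: Callable[[Any], Any], value: Any) -> Any:
        """Run fn(value) and record what went in, what came out, and how long it took."""
        record = NodeRecord(node=name, input=value)
        self.records.append(record)
        start = time.perf_counter()
        try:
            record.output = fn(value)
            return record.output
        except Exception as exc:
            # Keep the failure on the record, then let the caller decide what to do.
            record.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            record.duration_s = time.perf_counter() - start

    def failed_node(self) -> Optional[str]:
        """Return the name of the first node that raised, or None if none did."""
        for record in self.records:
            if record.error is not None:
                return record.node
        return None
```

With this shape, "where did the chain break?" becomes a lookup (trace.failed_node()) rather than a search through log lines. Token counts, tool calls, and retries would be further fields on the record.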

What reliable infrastructure actually looks like

The inspiration stack is Temporal x n8n x Lambda, but designed from scratch for agents:

Durable execution: workflows checkpoint at every node. A crash mid-run resumes from the last checkpoint, not the start.
Typed node I/O: inputs and outputs are schemas, not raw strings. Failures are caught at the node boundary, not buried in model output.
Isolated runtimes: each agent node runs in a sandboxed environment. One bad tool call can't corrupt the rest of the run.
Structured traces: every run produces a complete execution graph with timing, token counts, tool calls, and retries.
Human-in-the-loop gates: workflows can pause and wait for approval before continuing. Not a hack, a first-class node type.
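
Of the properties above, typed node I/O is the easiest to show in a few lines. The sketch below is one way it could look, using only the Python standard library. TicketTriage, parse_triage, and NodeOutputError are hypothetical names for an imagined triage node, not part of any real framework. The raw model output is parsed and validated before the next node ever sees it.

```python
import json
from dataclasses import dataclass


class NodeOutputError(ValueError):
    """Raised when a node's raw output does not match its declared schema."""


@dataclass(frozen=True)
class TicketTriage:
    # Hypothetical output schema for an illustrative triage node.
    category: str
    priority: int


def parse_triage(raw: str) -> TicketTriage:
    """Turn raw model text into a TicketTriage, or fail loudly at the boundary."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise NodeOutputError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise NodeOutputError("expected a JSON object")

    category = data.get("category")
    priority = data.get("priority")
    if not isinstance(category, str):
        raise NodeOutputError("category must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool):
        raise NodeOutputError("priority must be an integer")
    if not 1 <= priority <= 5:
        raise NodeOutputError("priority must be between 1 and 5")
    return TicketTriage(category=category, priority=priority)
```

A subtly malformed response, such as a priority of "high" where an integer is expected, raises NodeOutputError at this node instead of surfacing three steps later as a confusing downstream failure.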

None of this is glamorous. It's the same work that made cloud computing reliable. It just hasn't been done yet for agents.

That's what we're building.