Intuition engineering for agentic pipelines.

Everything I needed to run an autonomous AI dev pipeline had already been invented — by a database team, by the saga pattern, by Netflix. The AI era didn't make those ideas obsolete. It made them matter more.

BY Marcus Sánchez

TOPIC agentic AI · observability

STACK saga-pipeline · DST · replay

STATUS PUBLISHED

TL;DR

A pipeline of LLM agents is non-deterministic, expensive in ways you only see on the bill, and opaque. None of those problems are new — distributed systems solved their shapes years ago. I borrowed four old ideas (deterministic simulation, the saga pattern, intuition engineering, real-time agent visibility) and combined them into a dashboard that replays the pipeline's real history as a living flow you can scrub through time. The video above is it, playing a real day.

I run a development pipeline that works while I sleep. Issues come in; an agent enriches them with acceptance criteria and edge cases; another agent picks one up, writes the code, opens a pull request; a panel of reviewer agents argues about security, performance, architecture, and test quality; a merger agent waits for green CI and lands it. Then it does the next one. Nobody is watching most of the time. That's the point.

It's wonderful when it works. The problem is the other times — and the other times are different, in kind, from any system I've operated before.

01. The invoice is not a debugger

When a normal service misbehaves, you have decades of habits to fall back on: read the log, attach a debugger, reproduce it locally, bisect the change. A pipeline of language models breaks those habits one by one.

It's non-deterministic: the same issue, run twice, takes two different paths. "It worked yesterday" is not a figure of speech anymore — it's a literal, maddening fact. It's expensive in a way you don't feel until later: two agents can fall into a polite, infinite argument — review asks for a change, dev makes it, review asks for it back — and the only place that livelock shows up is the token bill at the end of the month. And it's opaque: by the time you notice something is wrong, the moment is gone. There's no breakpoint you can set on "the vibe drifted at 3 a.m."

So I did what I usually do when a problem feels new: I assumed it wasn't. I went looking for the people who had already solved the shape of it, in some older world, and asked what they'd built.

02. Four old ideas, borrowed

I found four. None of them are mine, and none of them are new. That's exactly why I trust them.

Determinism you can replay — FoundationDB

In 2014, Will Wilson gave a talk about how FoundationDB tested a distributed database ("Testing Distributed Systems w/ Deterministic Simulation"). The trick is audacious: rather than leaning on tests against real hardware, they built a deterministic simulation of an entire cluster (every disk, every network packet, every clock) driven by a single seeded random number generator. Because every source of randomness flows from that one seed, the whole run is reproducible. A failure isn't a ghost; it's a number you can type back in to replay the exact same catastrophe, frame for frame. They even injected faults on purpose (a macro called BUGGIFY) to provoke the dangerous orderings real systems eventually hit.

The joke inside FoundationDB was that they spent more effort on the simulator than on the database. The database was almost a side effect of having something worth testing.

Now hold that next to an agentic pipeline, which is far more non-deterministic than any database. If determinism and replay were worth that much effort for software that at least tries to be predictable, what are they worth for software whose entire job is to improvise? Replay isn't a nice-to-have here. It's oxygen. The ability to take "it did something weird last night" and turn it into a seed — a run I can play back and watch — is the difference between tuning a pipeline and superstitiously poking at it.
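To make "a failure is a number you can type back in" concrete, here is a minimal Python sketch. It is not FoundationDB's simulator and not the pipeline's real code: the agent names, the 10% fault rate, and both function names are assumptions invented for illustration. The only point it demonstrates is that when every random choice flows from one seeded generator, the same seed reproduces the same run.

```python
import random
from typing import Optional

AGENTS = ["enrich", "dev", "review", "merge"]  # hypothetical step names


def simulate_pipeline(seed: int, steps: int = 20, fault_rate: float = 0.1) -> list[str]:
    """Run a toy pipeline whose only source of randomness is `seed`."""
    rng = random.Random(seed)  # one private RNG; never touch the global one
    trace = []
    for step in range(steps):
        agent = rng.choice(AGENTS)
        # BUGGIFY-inspired: deliberately inject a fault some of the time.
        outcome = "FAULT" if rng.random() < fault_rate else "ok"
        trace.append(f"{step}:{agent}:{outcome}")
    return trace


def first_failing_seed(max_seed: int = 1000) -> Optional[int]:
    """Search seeds until one produces a run that contains a fault."""
    for seed in range(max_seed):
        if any(entry.endswith("FAULT") for entry in simulate_pipeline(seed)):
            return seed
    return None


if __name__ == "__main__":
    seed = first_failing_seed()
    first, second = simulate_pipeline(seed), simulate_pipeline(seed)
    print(f"failing seed: {seed}, identical replay: {first == second}")
```

A real LLM call is not reproducible from a seed alone, so in practice the replayable part is usually the recorded inputs and outputs around each agent step. That is a design assumption of this sketch, not something the talk prescribes.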

A backbone that can fail halfway — the saga pattern

A multi-agent task is a long-running, multi-step transaction that can fail at any step — and unlike a database transaction, you can't just "roll back" three agents and a merged commit. This is the exact problem the saga pattern was invented for decades ago, and it's why AWS now prescribes saga orchestration for agentic AI: a central orchestrator that decomposes the goal, delegates to specialized agents, and — crucially — "maintains context and execution flow," handling retries, timeouts, failures, and compensation when a later step makes an earlier one wrong.
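The core of the pattern fits in a few lines. The sketch below is a generic saga orchestrator in Python, written under assumptions: the `Step` type, `run_saga`, and the three example steps are hypothetical names, not AWS's API and not the pipeline's actual orchestrator. Each step carries a compensating action, and when a later step fails the orchestrator undoes the completed steps in reverse order.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # does the work, may raise
    compensate: Callable[[dict], None]  # undoes the work if a later step fails


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: list[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["log"].append(f"{step.name} failed: {exc}")
            for prev in reversed(done):
                prev.compensate(ctx)  # a real system must also handle failures here
                ctx["log"].append(f"compensated {prev.name}")
            return False
    return True


def _red_ci(ctx: dict) -> None:
    raise RuntimeError("CI red")


def demo() -> tuple[bool, dict]:
    """Hypothetical three-step run in which the last step fails."""
    ctx = {"state": [], "log": []}
    steps = [
        Step("create_branch",
             lambda c: c["state"].append("branch"),
             lambda c: c["state"].remove("branch")),
        Step("open_pr",
             lambda c: c["state"].append("pr"),
             lambda c: c["state"].remove("pr")),
        Step("merge", _red_ci, lambda c: None),
    ]
    return run_saga(steps, ctx), ctx


if __name__ == "__main__":
    ok, ctx = demo()
    print(ok, ctx["state"], ctx["log"])
```

Retries and timeouts are left out to keep the sketch short; they would wrap `step.action` before the orchestrator gives up and starts compensating.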

My pipeline is a saga. Each agent is a step; the orchestrator coordinates them and knows how to make things right when one of them doesn't. Saga gives the pipeline a spine. But a spine you can't see is just an assumption — which brings me to the part I actually care about.

Intuition you can't put in a threshold — Netflix

In 2015, Netflix wrote about a tool called Flux (later open-sourced as Vizceral) and gave the discipline a name I've never forgotten: intuition engineering. Their insight was that some states of a large system can't be reduced to a number or an alert threshold — the interesting situations are precisely the ones "too onerous to create a heuristic" for, the ones that require "an intuition that can't be codified." So instead of another dashboard of gauges, they drew the whole system as living, moving traffic and let the operator's visual cortex — the part of us built to read a savanna at a glance — do what it does best. Requests flow as dots; healthy is one color, trouble is red; you feel the system tip before you could ever articulate why.

You don't read intuition off a chart. You earn it by watching the thing move, over and over, until "normal" has a shape and "wrong" announces itself.

An agentic pipeline is exactly the kind of system Netflix was describing — more so. There is no single number for "is my fleet of agents healthy right now." But there is a shape.

Seeing the agents at all — ClawLibrary

The last idea I learned by building, not reading. I was a contributor to ClawLibrary, a playful little tool that renders what your agents are doing as a 2D, almost game-like world. I added the multi-agent tracking — several agents on screen at once — and a small thing I'm still fond of: an HP-bar-style meter for each agent's context usage, so you could see, at a glance, who was running out of room to think. I also made it reachable across the local network, so I could pull out my phone on the couch and watch my agents work from across the house.
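The context meter is simple enough to sketch in text form. This is not ClawLibrary's code, which draws the bar graphically; the function name, the bar width, and the character choices are assumptions for illustration. Like an HP bar, it shows how much room is left rather than how much is used.

```python
def context_meter(used_tokens: int, limit_tokens: int, width: int = 20) -> str:
    """Render remaining context as an HP-style text bar."""
    if limit_tokens <= 0:
        raise ValueError("limit_tokens must be positive")
    # Clamp so a miscounted or overflowing agent still renders a valid bar.
    used_fraction = min(max(used_tokens / limit_tokens, 0.0), 1.0)
    remaining = 1.0 - used_fraction
    filled = round(remaining * width)
    return "[" + "#" * filled + "." * (width - filled) + f"] {remaining:.0%} left"


if __name__ == "__main__":
    print(context_meter(50_000, 200_000))
```

With 50,000 of 200,000 tokens used, the bar renders as `[###############.....] 75% left`.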

That last bit was supposed to be a convenience. It turned into the whole lesson. Once the agents were watchable — once "what is it doing right now?" had a picture — I started to understand them. I'd catch a stuck agent the way you catch a kettle about to boil, from the corner of your eye. Watching is not a luxury feature bolted onto monitoring. Watching is how the understanding gets in.

03. So I built a board

Put those four together and the thing to build is obvious. Not another metrics page — a replayable, living picture of the pipeline. So that's what the dashboard is. (I named it, with no shame at all, "IntuitionOps." It's playing at the top of this page.)

It replays the pipeline's real history — every pull request, every CI run, every token spent — as a flow you can scrub through time, minute by minute, the way you'd scrub a video. It can replay the deterministic simulation of the pipeline too, so "tested in sim" and "watched on the board" are the same picture. It treats token cost as a first-class signal, sitting right next to throughput — so that polite infinite argument between two agents lights up as what it is (cost climbing while merges stay flat) instead of hiding on an invoice. And because the history is recorded, last night's weirdness isn't gone. It's a moment I can scrub back to and watch unfold.
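Two of those ideas, scrubbing through recorded history and reading cost next to throughput, can be sketched over a plain event log. This is an illustration under assumptions, not the dashboard's implementation: the `Event` shape, the `state_at` and `cost_without_merges` names, the sample data, and the token threshold are all hypothetical. The board itself makes this pattern visible rather than reducing it to a threshold; the threshold here only stands in for "cost climbing while merges stay flat."

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    minute: int  # minutes since the start of the recording
    kind: str    # "tokens" or "merge" in this sketch
    value: int   # tokens spent; ignored for merges


def state_at(events: list[Event], minute: int) -> dict:
    """Fold every event up to and including `minute` into a snapshot.

    Assumes `events` is sorted by minute, as an append-only log would be.
    """
    cutoff = bisect_right([e.minute for e in events], minute)
    snapshot = {"tokens": 0, "merges": 0}
    for event in events[:cutoff]:
        if event.kind == "tokens":
            snapshot["tokens"] += event.value
        elif event.kind == "merge":
            snapshot["merges"] += 1
    return snapshot


def cost_without_merges(events: list[Event], start: int, end: int,
                        token_threshold: int) -> bool:
    """True if tokens kept burning in (start, end] while nothing merged."""
    before, after = state_at(events, start), state_at(events, end)
    spent = after["tokens"] - before["tokens"]
    merged = after["merges"] - before["merges"]
    return spent >= token_threshold and merged == 0


SAMPLE_EVENTS = [  # made-up data, sorted by minute
    Event(0, "tokens", 1000),
    Event(5, "merge", 0),
    Event(10, "tokens", 4000),
    Event(20, "tokens", 5000),
    Event(30, "tokens", 6000),
]

if __name__ == "__main__":
    print(state_at(SAMPLE_EVENTS, 5))
    print(cost_without_merges(SAMPLE_EVENTS, 5, 30, token_threshold=10_000))
```

Scrubbing is then just calling `state_at` for whichever minute the slider points to; because the snapshot is derived from the log, any past moment can be rebuilt on demand.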

I'm deliberately not going to walk through how it's wired together. That's a different article, and honestly the how is the least interesting part. The point is the why: you can only tune what you can see and replay. Replay turns a ghost into a reproducible run. Determinism turns a flaky failure into a failing seed. Visual intuition turns twelve thousand events into a glance. Together they turn an autonomous pipeline from something you pray over into something you actually operate.

04. Why this matters more now, not less

Here's the part I want to be careful about, because it would be easy to oversell. I'm not claiming this is the only way to run agents, and I'm certainly not claiming I invented anything. Every idea here is borrowed and old: the deterministic simulation talk is from 2014, the saga pattern dates to the 1980s, and intuition engineering is from 2015.

What I am claiming is narrower and, I think, more interesting: the AI era doesn't retire these ideas, it promotes them. The conditions that made deterministic replay and visual intuition valuable in distributed systems — non-determinism, long-running multi-step work, failures that hide, behavior too holistic to threshold — are all more true of agentic pipelines, not less. We took systems that merely tolerated some randomness and replaced their core with a component whose entire purpose is to improvise. Of course the old disciplines come back. They were waiting for exactly this.

This is how I build and manage pipelines: I assume the hard parts have been solved before, in some adjacent world, and my job is mostly to recognize the old shape inside the new problem and bring the right paradigm forward. A child of ideas already in existence — given, in the AI era, a reason to matter even more.

If you could build anything you like, what would you build?

★ Sources & inspiration

This piece is a child of four older ideas. In order of appearance:

Deterministic simulation testing. Will Wilson, "Testing Distributed Systems w/ Deterministic Simulation" — FoundationDB (Strange Loop, 2014).
Saga orchestration for agentic AI. AWS Prescriptive Guidance — Saga orchestration patterns.
Intuition engineering. Netflix, "Flux: A New Approach to System Intuition" (later open-sourced as Vizceral).
Real-time agent visibility. ClawLibrary — I contributed the multi-agent tracking and the per-agent context-usage meter.