Shahzad Bhatti Welcome to my ramblings and rants!

June 19, 2026

Making Bad State Impossible: A Practical Guide to ADTs and Algebraic Effects

Filed under: Computing,Concurrency — admin @ 9:46 pm

I. Introduction

Debugging a production incident is much harder in a system with complex state management. For example, you might see a worker node that is simultaneously “draining” and “upgrading” while flagged as “ready to restart,” or a heartbeat buffer that filled with 100,000 metrics and silently dropped the overflow. In other cases, a config deployment shows “success” in the database but never actually deployed because the error got swallowed by .catch(NOOP) somewhere. I’ve seen this in most legacy codebases I’ve worked on. In one system I found:

  • 441 instances of .catch(NOOP) — errors silently swallowed
  • 506 mode checks scattered everywhere — if (isLeader)... else if (isWorker)...
  • 64 possible boolean combinations for worker state, of which only 5 are valid
  • Race conditions in shared state with no synchronization
  • 816 files coupled to global singletons

Here is the core thesis: most production incidents aren’t algorithmic bugs. They’re states that shouldn’t exist. The system entered a configuration nobody intended, no test covered, and no monitoring caught. Algebraic Data Types (ADTs) and Algebraic Effects are the tools that make those impossible states unrepresentable in code. Not “less likely.” Not “caught by tests.” Literally impossible to express.


II. What Are Algebraic Data Types?

Forget the word “algebraic” for a moment. It just means “composed of parts using AND and OR.” That’s it.

Product Types: AND

A product type is a structure where ALL fields must be present at the same time. You use these every day:

struct WorkerConnection {
    id: String,
    address: String,
    port: u16,
    last_heartbeat: Instant,
}

Every WorkerConnection has an id AND an address AND a port AND a last_heartbeat. It’s called “product” because the number of possible values is the product of each field’s possibilities.

Sum Types: OR

A sum type is a value that is ONE of several variants. This is the powerful one most codebases miss:

enum TrafficLight {
    Red,
    Yellow,
    Green,
}

A traffic light is Red OR Yellow OR Green. It is never Red AND Green at the same time. It’s called “sum” because the number of possible values is the sum of each variant. The critical feature is exhaustiveness checking. When you pattern-match on a sum type, the compiler forces you to handle every variant. Add a new one and the compiler shows you every place that needs updating:

fn action(light: &TrafficLight) -> &str {
    match light {
        TrafficLight::Red => "stop",
        TrafficLight::Yellow => "caution",
        TrafficLight::Green => "go",
        // Add FlashingRed and this won't compile until you handle it here
    }
}

Why This Matters: Making Illegal States Unrepresentable

Here’s the practical payoff. Look at actual legacy code managing worker nodes:

// Legacy pattern
struct WorkerNode {
    current_action: Option<ExclusiveAction>,
    reconfig_in_progress: Option<ClusterRequest>,
    upgrade_in_progress: bool,
    draining: bool,
    allow_restart: bool,
    restart_on_exit: bool,
}

Six independent fields: four booleans and two Options. Even if you count each Option only as set or unset, that’s 2^6 = 64 possible combinations. But the system only has about 5 valid states: idle, configuring, upgrading, draining, or restarting. The other 59 combinations are bugs waiting to happen. What does upgrade_in_progress = true AND draining = true AND reconfig_in_progress = Some(request) mean? Nobody knows, and no test covers it. Now the same thing as a Rust enum:

enum WorkerState {
    Idle,
    Configuring { request: ClusterRequest },
    Upgrading { version: String },
    Draining { reason: String },
    Restarting,
}

Five states. The 59 invalid combinations cannot be expressed: there is no value of WorkerState that is both Upgrading and Draining, so code that tries to build one won’t compile. This isn’t about “good practice.” It’s about making an entire class of bugs impossible at compile time. The compiler becomes your 24/7 code reviewer, rejecting every invalid state before the code ever runs.
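Legacy flags rarely disappear overnight, so the enum usually sits behind a conversion at the boundary. Here is a minimal sketch of that idea, assuming a simplified LegacyFlags struct with four booleans (the real struct above has six fields) and a one-flag-per-state mapping that I made up for illustration. Contradictory combinations are rejected once, at the edge, instead of being interpreted differently at every call site:

```rust
/// Simplified stand-in for the legacy struct: four flags instead of six fields.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct LegacyFlags {
    pub reconfig_in_progress: bool,
    pub upgrade_in_progress: bool,
    pub draining: bool,
    pub restarting: bool,
}

/// Payload-free version of the enum from the text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WorkerState {
    Idle,
    Configuring,
    Upgrading,
    Draining,
    Restarting,
}

/// Returned when the flags describe a state the system never intended.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ContradictoryFlags(pub LegacyFlags);

/// Assumed mapping: at most one flag may be set, and each flag names one state.
pub fn from_flags(f: LegacyFlags) -> Result<WorkerState, ContradictoryFlags> {
    match (f.reconfig_in_progress, f.upgrade_in_progress, f.draining, f.restarting) {
        (false, false, false, false) => Ok(WorkerState::Idle),
        (true, false, false, false) => Ok(WorkerState::Configuring),
        (false, true, false, false) => Ok(WorkerState::Upgrading),
        (false, false, true, false) => Ok(WorkerState::Draining),
        (false, false, false, true) => Ok(WorkerState::Restarting),
        _ => Err(ContradictoryFlags(f)),
    }
}

/// Walks all 2^4 flag combinations and counts how many convert cleanly.
pub fn count_valid_combinations() -> usize {
    (0u8..16)
        .filter(|&bits| {
            let flags = LegacyFlags {
                reconfig_in_progress: bits & 0b0001 != 0,
                upgrade_in_progress: bits & 0b0010 != 0,
                draining: bits & 0b0100 != 0,
                restarting: bits & 0b1000 != 0,
            };
            from_flags(flags).is_ok()
        })
        .count()
}
```

With four flags there are 16 combinations, and count_valid_combinations() returns 5: the other 11 are turned into an explicit error the moment they enter the typed part of the system.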

It Costs Real Money

  • Double settlement in banking: A payment system tracks settlement with isAuthorized, isSettled, isReversed. A race condition sets both isSettled = true and isReversed = true at the same time. Result: the same transaction is both settled and reversed so money moves twice. With a sum type (Authorized | Settled | Reversed | Disputed), that combination cannot exist.
  • Ghost billing in telecom: A session tracker uses isActive, isBilled, isTerminated. A network glitch terminates the session but the billing flag was set a millisecond before termination. Result: terminated sessions generate charges for hours. With a sum type (Active { startTime } | Terminated { endTime } | Billed { amount, endTime }), a terminated session cannot be in a billable state.

These aren’t hypothetical. They’re the kind of bugs that cost millions in reconciliation and regulatory fines. The root cause is always the same: boolean flags that allow impossible combinations.
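To make the banking case concrete, here is a small sketch of the payment lifecycle as a sum type. The variant payloads and the transition rules (only an authorized payment can settle; authorized or settled payments can be reversed) are my assumptions for illustration, not rules from any real payment system:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Payment {
    Authorized { amount_cents: u64 },
    Settled { amount_cents: u64 },
    Reversed { amount_cents: u64 },
    Disputed { amount_cents: u64, reason: String },
}

/// Describes a transition that the lifecycle does not allow.
#[derive(Debug, PartialEq)]
pub struct InvalidTransition {
    pub from: Payment,
    pub action: &'static str,
}

/// Assumed rule: only an authorized payment can be settled.
pub fn settle(p: Payment) -> Result<Payment, InvalidTransition> {
    match p {
        Payment::Authorized { amount_cents } => Ok(Payment::Settled { amount_cents }),
        other => Err(InvalidTransition { from: other, action: "settle" }),
    }
}

/// Assumed rule: authorized or settled payments can be reversed, nothing else.
pub fn reverse(p: Payment) -> Result<Payment, InvalidTransition> {
    match p {
        Payment::Authorized { amount_cents } | Payment::Settled { amount_cents } => {
            Ok(Payment::Reversed { amount_cents })
        }
        other => Err(InvalidTransition { from: other, action: "reverse" }),
    }
}
```

Because each function consumes the old value and returns a new one, there is never a moment where a payment is “settled and reversed.” It is one or the other, and settling a reversed payment comes back as an InvalidTransition.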

Immutability Makes This Even Better

When state is immutable, you can’t accidentally corrupt it from another part of the code. But how do you “change” immutable data? You copy it:

fn update_progress(state: &JobState, new_progress: u8) -> JobState {
    JobState {
        progress: new_progress,
        updated_at: Instant::now(),
        ..state.clone()  // copy everything else
    }
}

let state1 = JobState { phase: Phase::Running, progress: 50, worker_id: "w-1".into(), updated_at: Instant::now() };
let state2 = update_progress(&state1, 75);
// state1.progress is still 50 — no other code sees a half-updated state

In Rust, the borrow checker backs this up: you can have either one mutable reference OR many immutable references to a value, never both. Data races on shared state become a compile error in safe Rust, not a runtime bug.


III. ADTs Applied to Real Problems

Problem 1: Mode Detection Hell

Production systems support multiple deployment modes: leader, worker, edge, standalone. The result in the legacy codebase? Mode checks everywhere:

// 500+ instances of this scattered throughout
const configHelperMode = ProcessInfo.isConfigHelperMode();
const workerProcessMode = ProcessInfo.isWorkerMode();
const apiProcessMode = !configHelperMode && !workerProcessMode;

if (configHelperMode) { return runConfigHelper(...); }
if (workerProcessMode) { return ProcessMgr.initWorkerProcess(...); }
if (ServiceInfo.isService(role)) { return Service.initServiceProcess(...); }
if (isProxyNode(distMode)) { /* ... */ }
if (isSearchSupervisor(distMode)) { /* ... */ }
if (isLeader) { /* ... */ }
else if (isManaged(distMode)) { /* ... */ }
else if (isStandalone(distMode)) { /* ... */ }

The problems: adding a new mode requires finding and updating all 506 sites, missing one means silent incorrect behavior, and it’s easy to create contradictory states (isLeader && isWorker). The fix: one decision point at startup, exhaustive matching everywhere else:

enum AppMode {
    Leader { config: LeaderConfig },
    Worker { leader_id: String },
    Edge { leader_id: String },
    Standalone,
    ConfigHelper { group_id: String },
    SearchSupervisor { cluster_id: String },
}

// ONE place where mode is determined — at startup
fn determine_mode(env: &Environment) -> AppMode { ... }

// EVERYWHERE else — exhaustive matching
fn bootstrap(mode: AppMode) -> Application {
    match mode {
        AppMode::Leader { config } => bootstrap_leader(config),
        AppMode::Worker { leader_id } => bootstrap_worker(&leader_id),
        AppMode::Edge { leader_id } => bootstrap_edge(&leader_id),
        AppMode::Standalone => bootstrap_standalone(),
        AppMode::ConfigHelper { group_id } => bootstrap_config_helper(&group_id),
        AppMode::SearchSupervisor { cluster_id } => bootstrap_search(&cluster_id),
    }
}

Add a new mode and the compiler immediately shows you every match that needs a new arm. Miss one? Compilation fails. This is what “compiler-guided refactoring” means in practice.
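The single decision point can be sketched like this. It is a reduced example with three modes, and the variable names APP_MODE and LEADER_ID are hypothetical, as is the choice to default to standalone when nothing is set:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum AppMode {
    Leader,
    Worker { leader_id: String },
    Standalone,
}

#[derive(Debug, PartialEq)]
pub enum ModeError {
    MissingLeaderId,
    UnknownMode(String),
}

/// The one place where raw configuration becomes a typed mode.
/// APP_MODE and LEADER_ID are made-up names for this sketch.
pub fn determine_mode(env: &HashMap<String, String>) -> Result<AppMode, ModeError> {
    match env.get("APP_MODE").map(String::as_str) {
        // Assumed default: no mode configured means standalone.
        None | Some("standalone") => Ok(AppMode::Standalone),
        Some("leader") => Ok(AppMode::Leader),
        // A worker without a leader is rejected here, not discovered later.
        Some("worker") => env
            .get("LEADER_ID")
            .cloned()
            .map(|leader_id| AppMode::Worker { leader_id })
            .ok_or(ModeError::MissingLeaderId),
        Some(other) => Err(ModeError::UnknownMode(other.to_string())),
    }
}
```

Everything downstream receives an AppMode and never looks at the environment again, which is what removes the scattered mode checks.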

Problem 2: Operations That Partially Succeed

One of the most dangerous patterns I’ve seen: multi-step operations without atomic boundaries.

// Config updated BEFORE deployment succeeds
groupConf.configVersion = hash;   // Step 1: mutate config
await this.update(groupConf);      // Step 2: persist to database
await cm.deploy();                 // Step 3: actually deploy

// If step 3 fails: database says "deployed" but nothing deployed.
// State is permanently inconsistent. Nobody notices until 2am.

Another version of the same problem:

// Package manager — loop continues after failure
for (const op of ops) {
    try {
        switch (op.type) {
            case 'install': await this.install(op.pack); break;
            case 'uninstall': await this.uninstall(op.pack); break;
        }
    } catch(e) {
        errors.push(e);  // collect error but CONTINUE the loop
    }
}
await this.save();  // save regardless — partially applied state!

The typestate pattern uses types to enforce operation ordering. Each step produces a different type, and the next step only accepts the correct input type:

// Each phase is a distinct type — not an enum, separate structs
struct Planned { operations: Vec<Operation> }
struct Validated { operations: Vec<ValidOperation>, checks: Vec<CheckResult> }
struct Applied { results: Vec<OperationResult> }
struct Committed { hash: String, timestamp: Instant }

// Functions consume one type, return the next
fn validate(tx: Planned) -> Result<Validated, Vec<ValidationError>> { ... }
fn apply(tx: Validated) -> Result<Applied, ApplyError> { ... }
fn commit(tx: Applied) -> Result<Committed, CommitError> { ... }

// You cannot call commit() on a Planned transaction.
// The types won't allow it.
// And because validate() CONSUMES Planned, you can't reuse the old value.

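If apply fails, you get an ApplyError instead of an Applied, so there is nothing to pass to commit. There’s no half-committed state because the type system won’t let you call commit without a successful apply. One caveat: apply consumes the Validated value, so if you want to retry, either clone it first or have ApplyError hand it back.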
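Here is a compact version of the pipeline that actually compiles, as a sketch. Operations are plain strings, and the rule that an operation starting with "fail" fails to apply is a stand-in I invented so the failure path can be exercised:

```rust
#[derive(Debug)]
pub struct Planned {
    pub operations: Vec<String>,
}

#[derive(Debug)]
pub struct Validated {
    operations: Vec<String>,
}

#[derive(Debug)]
pub struct Applied {
    results: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct Committed {
    pub applied_count: usize,
}

#[derive(Debug, PartialEq)]
pub enum TxError {
    EmptyPlan,
    ApplyFailed { operation: String },
}

/// Consumes the plan; an empty plan never becomes Validated.
pub fn validate(tx: Planned) -> Result<Validated, TxError> {
    if tx.operations.is_empty() {
        return Err(TxError::EmptyPlan);
    }
    Ok(Validated { operations: tx.operations })
}

/// Stops at the first failing operation, so no Applied value is produced.
pub fn apply(tx: Validated) -> Result<Applied, TxError> {
    let mut results = Vec::new();
    for op in tx.operations {
        if op.starts_with("fail") {
            return Err(TxError::ApplyFailed { operation: op });
        }
        results.push(format!("ok:{op}"));
    }
    Ok(Applied { results })
}

/// Only reachable with an Applied value, which only apply() can create.
pub fn commit(tx: Applied) -> Committed {
    Committed { applied_count: tx.results.len() }
}

/// Chains the phases; `?` exits at the first phase that fails.
pub fn run(ops: &[&str]) -> Result<Committed, TxError> {
    let planned = Planned { operations: ops.iter().map(|s| s.to_string()).collect() };
    let validated = validate(planned)?;
    let applied = apply(validated)?;
    Ok(commit(applied))
}
```

Compare this with the package-manager loop above: there is no path where a failed operation is followed by a save, because commit is never given anything to save.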

Problem 3: Silently Swallowed Errors

441 instances of .catch(NOOP) in production. Each one is a failure that nobody notices until the system is in an inconsistent state:

this.reconcileLbIfStandalone(req.body).catch(NOOP);  // load balancer fails silently
unlink(bundlePath).catch(NOOP);                       // file deletion fails silently
dest.connect().catch(NOOP);                           // connection fails silently

The problem isn’t laziness. Promise/exception-based error handling makes it easy to ignore errors and hard to handle them consistently. Rust’s Result type inverts this: handling errors is the default path, and ignoring them requires explicit effort:

// Every operation returns Result — no hidden exceptions
async fn reconcile_lb(body: &Request) -> Result<LbState, ReconcileError> {
    let state = do_reconcile(body).await
        .map_err(|e| classify_error(e))?;  // ? propagates errors up — visible in the code
    Ok(state)
}

// Caller MUST handle the Result
let lb_state = reconcile_lb(&req.body).await?;
// If we reach this line, it succeeded. Guaranteed.

// Want to explicitly ignore? You have to WRITE that intention:
let _ = reconcile_lb(&req.body).await;  // "I know this can fail and I don't care"

The key insight: with Result, ignoring an error requires writing code to ignore it (drop a Result on the floor and the compiler warns that a value that must be used was not). With exceptions, ignoring an error requires writing nothing. Defaults matter enormously. The ? operator makes propagating errors as easy as typing one character: no try/catch boilerplate, no .catch(NOOP) temptation.
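The classify_error step above is where low-level errors become domain errors. A minimal, self-contained sketch of the same idea uses a From impl so that ? does the conversion for you. ConfigError and parse_port are hypothetical names for this example:

```rust
use std::fmt;
use std::num::ParseIntError;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    MissingField(&'static str),
    BadPort(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingField(name) => write!(f, "missing field: {name}"),
            ConfigError::BadPort(why) => write!(f, "bad port: {why}"),
        }
    }
}

impl std::error::Error for ConfigError {}

// Lets `?` turn a ParseIntError into our domain error automatically.
impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> Self {
        ConfigError::BadPort(e.to_string())
    }
}

/// Both failure modes are visible in the signature; neither can be swallowed silently.
pub fn parse_port(raw: Option<&str>) -> Result<u16, ConfigError> {
    let raw = raw.ok_or(ConfigError::MissingField("port"))?;
    let port: u16 = raw.trim().parse()?; // ParseIntError -> ConfigError via From
    Ok(port)
}
```

The caller sees exactly two ways this can fail, and both of them show up in the type rather than in a comment.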

Problem 4: Swapped Arguments and Primitive Obsession

The legacy codebase uses raw strings and numbers for everything: IDs, tokens, keys. Nothing stops you from passing arguments in the wrong order:

// 4,000+ uses of untyped parameters
fn send_request_to_worker(wid: &str, req: &str, body: &[u8]) { ... }
// What stops you from passing (request_id, worker_id, body)? Nothing: both IDs are plain &str.

Rust newtypes create distinct types with zero runtime cost:

struct WorkerId(String);
struct RequestId(String);
struct AuthToken(String);

fn send_request(worker_id: &WorkerId, request_id: &RequestId, body: &RequestBody) { ... }

// Now the compiler catches this:
send_request(&request_id, &worker_id, &body);  // COMPILE ERROR
// expected `&WorkerId`, found `&RequestId`

And smart constructors validate at the boundary, so the type carries the guarantee everywhere:

impl WorkerId {
    pub fn new(raw: &str) -> Result<Self, ValidationError> {
        if !WORKER_ID_PATTERN.is_match(raw) {
            return Err(ValidationError::InvalidFormat("worker ID"));
        }
        Ok(WorkerId(raw.to_string()))
    }
}
// Once you have a WorkerId, you KNOW it's valid. No re-validation needed anywhere.

Problem 5: Every Process Carries Everything

The legacy system scaled by spawning full OS processes because there was no type-safe way to separate workloads:

// Every worker loads the FULL binary — all 150 connectors, all modes
// Even edge nodes carry leader code they'll never use
// Default: 2GB heap per worker
this.env.NODE_OPTIONS = `--max-old-space-size=${heapSizeMB || 2048}`;

// 4 workers × 2GB = 8GB minimum. Plus API process, services...
// Competitors: Fluent Bit (10-30MB), Vector (30-50MB)

With typed resource boundaries, each workload declares exactly what it needs:

enum WorkloadProfile {
    IoBound { connections: usize, buffer_size: Bytes },
    CpuBound { parallelism: usize, memory_budget: Bytes },
    Mixed { io_weight: f32, cpu_weight: f32 },
}

enum ResourceClaim {
    Lightweight { max_memory_mb: u32, max_cpu_cores: f32 },
    Standard { max_memory_mb: u32, max_cpu_cores: f32 },
    Heavy { max_memory_mb: u32, max_cpu_cores: f32 },
}

fn resources_for(pipeline: &PipelineConfig) -> ResourceClaim {
    match analyze_workload(pipeline) {
        WorkloadProfile::IoBound { .. } =>
            ResourceClaim::Lightweight { max_memory_mb: 64, max_cpu_cores: 0.5 },
        WorkloadProfile::CpuBound { .. } =>
            ResourceClaim::Heavy { max_memory_mb: 2048, max_cpu_cores: 4.0 },
        WorkloadProfile::Mixed { .. } =>
            ResourceClaim::Standard { max_memory_mb: 512, max_cpu_cores: 2.0 },
    }
}

Instead of “every process gets everything,” each workload gets what it declares. Resource requirements are now visible and auditable in one place, and the exhaustive match guarantees that every workload profile maps to an explicit claim.

Problem 6: Inheritance Hierarchies Nobody Understands

The legacy codebase had class hierarchies 7 levels deep:

BaseServiceable                // 100+ subclasses, forces EventEmitter
  --> BaseInput
    --> TcpInput
      --> FramedProtocol         // Framing, auth, metrics, load balancing — all mixed
        --> ControlListener
          --> ProxyListener      // 760 lines of proxy logic inheriting ~4,500 lines it doesn't use

Reading ProxyListener meant understanding 6 parent classes first. And there were 12 cloud storage subclasses that were entirely empty: they inherited ~5K lines and added exactly zero:

export class ProviderAOut extends CloudStorageOutput {}  // empty
export class ProviderBOut extends CloudStorageOutput {}  // empty
export class ProviderCOut extends CloudStorageOutput {}  // empty

The fix: composition with enums instead of inheritance:

enum S3Provider {
    Aws { region: String },
    Storj { gateway: String },
    Backblaze { account_id: String },
    Wasabi { region: String },
    Minio { endpoint: String },
}

fn create_s3_client(provider: &S3Provider) -> S3Client {
    match provider {
        S3Provider::Aws { region } => S3Client::new().region(region),
        S3Provider::Storj { gateway } => S3Client::new().endpoint(gateway),
        S3Provider::Backblaze { account_id } =>
            S3Client::new().endpoint(&format!("s3.{account_id}.backblazeb2.com")),
        S3Provider::Wasabi { region } => S3Client::new().endpoint(&format!("s3.{region}.wasabisys.com")),
        S3Provider::Minio { endpoint } => S3Client::new().endpoint(endpoint),
    }
}

No inheritance. No empty subclasses. Adding a new provider means adding a variant to the enum and the compiler shows you every match that needs a new arm.


IV. ADTs Applied to Concurrency

Race Conditions in Shared Mutable State

Here’s actual production code where multiple async operations read and write the same map:

private conns: { [key: string]: Connection } = {};

// Called by the service loop (runs periodically)
private async _service() {
    const values = Object.values(this.conns);
    for (const conn of values) {
        if (conn.isStale()) {
            delete this.conns[conn.key];  // Mutate while potentially being read elsewhere
        }
    }
}

// Called when a new node connects (can happen any time)
private addConnection(connKey: string, data: INodeEntry): boolean {
    this.conns[connKey] = conn;  // Race with _service()!
    this.assignToGroup(conn)
        .catch(LOG_ERR(logger, 'failed to assign'));
    return true;
}

And the classic read-modify-write race:

prevState = await this.getState(key);       // Process A reads state
// ... Process B also reads state here ...
// ... Process A modifies and writes ...
await this.store.set(key, newState);         // Process B writes — A's changes LOST

The fix: a single owner of state, communicating through typed messages:

enum ConnectionCommand {
    Add { key: String, conn: Connection },
    Remove { key: String },
    RemoveStale,
    GetAll { reply: oneshot::Sender<Vec<Connection>> },
}

// Single owner — only this task can access `conns`
async fn connection_manager(mut inbox: mpsc::Receiver<ConnectionCommand>) {
    let mut conns: HashMap<String, Connection> = HashMap::new();

    while let Some(cmd) = inbox.recv().await {
        match cmd {
            ConnectionCommand::Add { key, conn } => { conns.insert(key, conn); }
            ConnectionCommand::Remove { key } => { conns.remove(&key); }
            ConnectionCommand::RemoveStale => { conns.retain(|_, conn| !conn.is_stale()); }
            ConnectionCommand::GetAll { reply } => {
                let _ = reply.send(conns.values().cloned().collect());
            }
        }
    }
}

No mutexes. No locks. No data races on the map. Rust’s ownership system guarantees conns is owned by exactly one task. Other tasks communicate through the channel; they cannot touch the HashMap directly because they don’t own it.
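The same single-owner pattern works without an async runtime. Below is a minimal sketch using only the standard library: a thread owns the map and std::sync::mpsc carries the commands. The trimmed-down command set and the bool standing in for a real Connection (true means stale) are simplifications of mine:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

pub enum ConnectionCommand {
    Add { key: String, stale: bool },
    RemoveStale,
    Count { reply: Sender<usize> },
}

/// Spawns the owner thread. The map lives inside the closure and is never shared.
pub fn spawn_manager() -> (Sender<ConnectionCommand>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<ConnectionCommand>();
    let handle = thread::spawn(move || {
        let mut conns: HashMap<String, bool> = HashMap::new();
        // The loop ends when every Sender has been dropped.
        for cmd in rx {
            match cmd {
                ConnectionCommand::Add { key, stale } => {
                    conns.insert(key, stale);
                }
                ConnectionCommand::RemoveStale => conns.retain(|_, stale| !*stale),
                ConnectionCommand::Count { reply } => {
                    // The asker may have gone away; that is not the owner's problem.
                    let _ = reply.send(conns.len());
                }
            }
        }
    });
    (tx, handle)
}

/// Request/response helper: sends Count and waits for the answer.
pub fn count(tx: &Sender<ConnectionCommand>) -> usize {
    let (reply, answer) = mpsc::channel();
    tx.send(ConnectionCommand::Count { reply }).expect("manager stopped");
    answer.recv().expect("manager dropped the reply")
}
```

Commands from one sender are processed in the order they were sent, so an Add followed by a Count always sees the new entry. There is no interleaving to reason about inside the owner.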

Backpressure: Making Buffer Overflow Impossible to Ignore

The legacy heartbeat system silently dropped metrics when its buffer filled:

add(metric: MetricPacket, doNotDrop: boolean): void {
    if (this.hbMetrics.length > this.maxHbMetrics) {
        this.packetCounter.onDroppedMetric();  // Increment a counter nobody watches
        return;  // Data gone forever. No error. No signal to sender.
    }
    this.hbMetrics.push(metric);
}

The sender had no idea data was being lost. It kept sending happily while the system silently degraded. With Rust’s bounded channels, backpressure is built in. When the buffer is full, you must decide what to do:

match tx.try_send(metric) {
    Ok(()) => { /* sent */ }
    Err(TrySendError::Full(metric)) => {
        // Channel is full — you MUST decide:
        // Option 1: wait (applies backpressure to sender)
        tx.send(metric).await?;
        // Option 2: spill to disk
        // disk_buffer.write(metric)?;
        // Option 3: drop with explicit acknowledgment
        // warn!("Metric dropped due to backpressure");
    }
    Err(TrySendError::Closed(_)) => {
        error!("Metrics channel closed unexpectedly");
        return Err(ChannelError::Closed);
    }
}

The type system forces the conversation: “What should happen when the buffer is full?” You can’t drop data by accident; dropping it takes explicit code.
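You can try this with the standard library’s bounded channel, no async runtime needed. In this sketch metrics are plain u64 values and a Vec stands in for the disk spill; both are placeholders. Note that the standard library names the closed case Disconnected:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// What happened to the metric; the caller can log or count each case.
#[derive(Debug, PartialEq)]
pub enum Enqueue {
    Sent,
    Spilled,
    ReceiverGone,
}

/// Tries the bounded channel first and spills on overflow instead of dropping.
/// `spill` is an in-memory stand-in for a disk buffer.
pub fn enqueue(tx: &SyncSender<u64>, spill: &mut Vec<u64>, metric: u64) -> Enqueue {
    match tx.try_send(metric) {
        Ok(()) => Enqueue::Sent,
        // Full hands the value back, so nothing is lost unless we choose to lose it.
        Err(TrySendError::Full(metric)) => {
            spill.push(metric);
            Enqueue::Spilled
        }
        Err(TrySendError::Disconnected(_)) => Enqueue::ReceiverGone,
    }
}
```

With a channel of capacity 2, the third metric lands in the spill buffer and the caller is told so. Compare that with the legacy add(), where the sender never finds out.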

Event Sourcing: Eliminating Lost Updates

Instead of mutable state that can be overwritten by concurrent operations, event sourcing treats state as a derived value from an append-only log:

enum JobEvent {
    Created { job_id: String, config: JobConfig, at: Instant },
    Started { worker_id: String, at: Instant },
    Progressed { percentage: u8, at: Instant },
    Completed { result: JobResult, at: Instant },
    Failed { error: ErrorInfo, retryable: bool, at: Instant },
}

// State is derived — never directly mutated
fn derive_state(events: &[JobEvent]) -> JobState {
    events.iter().fold(initial_state(events), apply_event)
}

fn apply_event(state: JobState, event: &JobEvent) -> JobState {
    match (state, event) {
        (JobState::Pending { .. }, JobEvent::Started { worker_id, .. }) =>
            JobState::Running { worker_id: worker_id.clone(), progress: 0 },
        (JobState::Running { worker_id, .. }, JobEvent::Progressed { percentage, .. }) =>
            JobState::Running { worker_id, progress: *percentage },
        (state, _) => state,  // Invalid transition — state unchanged
    }
}

No lost updates because events are appended, never overwritten. Invalid transitions are no-ops: the reduce function simply ignores events that don’t make sense for the current state.
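To see the replay behave, here is a trimmed, runnable version of the reducer. I dropped the timestamps and most payloads and added a Done state, so treat the exact shapes as assumptions:

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum JobEvent {
    Started { worker_id: String },
    Progressed { percentage: u8 },
    Completed,
}

#[derive(Debug, Clone, PartialEq)]
pub enum JobState {
    Pending,
    Running { worker_id: String, progress: u8 },
    Done { worker_id: String },
}

/// Pure transition function: old state plus one event gives the new state.
pub fn apply_event(state: JobState, event: &JobEvent) -> JobState {
    match (state, event) {
        (JobState::Pending, JobEvent::Started { worker_id }) => {
            JobState::Running { worker_id: worker_id.clone(), progress: 0 }
        }
        (JobState::Running { worker_id, .. }, JobEvent::Progressed { percentage }) => {
            JobState::Running { worker_id, progress: *percentage }
        }
        (JobState::Running { worker_id, .. }, JobEvent::Completed) => JobState::Done { worker_id },
        // Any other pairing makes no sense for the current state and is ignored.
        (state, _) => state,
    }
}

/// Replays the log from the beginning; the same log always yields the same state.
pub fn derive_state(events: &[JobEvent]) -> JobState {
    events.iter().fold(JobState::Pending, apply_event)
}
```

Feed it a log where progress arrives before the job starts, or where a second worker tries to start a running job, and those events simply fall through the last arm while the valid ones still apply.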

Message Ordering: Protocol State Machines

The legacy system sent commands from leader to worker with no ordering guarantees:

// Leader sends: 1. configure, 2. upgrade
// Worker may RECEIVE: 1. upgrade, 2. configure (reversed!)
// Result: config applied AFTER upgrade — potential data corruption

// Current "fix": reject conflicting operations
private failOnConflictingOperation() {
    if (this.currentAction) {
        throw new ConflictingActionError();  // Command REJECTED, not queued!
    }
}
// No command queue. No ordering. No acknowledgment.
// Leader has NO WAY to know if the worker processed the command.

A typed protocol state machine rejects invalid command sequences explicitly instead of applying them:

enum NodePhase { Idle, Configured, Upgrading, Draining }

fn apply_command(state: ProtocolState, cmd: &Command) -> Result<ProtocolState, ProtocolError> {
    let seq = cmd.seq();
    if seq != state.last_applied_seq + 1 {
        return Err(ProtocolError::OutOfOrder { expected: state.last_applied_seq + 1, got: seq });
    }
    // Every accepted command records its sequence number; otherwise the next
    // command would be compared against a stale last_applied_seq and rejected.
    match (&state.phase, cmd) {
        (NodePhase::Idle | NodePhase::Configured, Command::Configure { .. }) =>
            Ok(ProtocolState { phase: NodePhase::Configured, last_applied_seq: seq, ..state }),
        (NodePhase::Configured, Command::Upgrade { .. }) =>
            Ok(ProtocolState { phase: NodePhase::Upgrading, last_applied_seq: seq, ..state }),
        (NodePhase::Idle | NodePhase::Configured, Command::Drain { .. }) =>
            Ok(ProtocolState { phase: NodePhase::Draining, last_applied_seq: seq, ..state }),
        (phase, cmd) =>
            Err(ProtocolError::InvalidTransition { from: phase.clone(), command: cmd.name() }),
    }
}

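The system cannot apply an upgrade before configuration because the match on (current phase, command) rejects it, and a command that arrives out of sequence is refused before the match is even reached. The match is exhaustive, so there’s no way to accidentally leave a case unhandled: anything not listed lands in the catch-all arm and becomes an explicit error.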
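Rejecting an early command is safe, but the legacy notes also complain that there is no command queue. One way to pair the state machine with a queue is a small reorder buffer that holds early arrivals until the gap is filled. This is my own sketch of how the design could be completed, not something from the original system, and ReorderBuffer is a hypothetical name:

```rust
use std::collections::BTreeMap;

/// Holds commands that arrived early and releases them in sequence order.
pub struct ReorderBuffer<T> {
    next_seq: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> ReorderBuffer<T> {
    pub fn new(first_seq: u64) -> Self {
        ReorderBuffer { next_seq: first_seq, pending: BTreeMap::new() }
    }

    /// Accepts one command and returns every command that is now deliverable, in order.
    pub fn accept(&mut self, seq: u64, cmd: T) -> Vec<T> {
        // Anything below next_seq was already delivered: treat it as a duplicate.
        // For a sequence number that is already buffered, the first copy wins.
        if seq >= self.next_seq {
            self.pending.entry(seq).or_insert(cmd);
        }
        let mut ready = Vec::new();
        while let Some(cmd) = self.pending.remove(&self.next_seq) {
            ready.push(cmd);
            self.next_seq += 1;
        }
        ready
    }
}
```

In the reversed-delivery scenario above, “upgrade” (seq 2) is held until “configure” (seq 1) shows up, and then both are released in the right order for the state machine to apply. A production version would also cap how much it is willing to buffer.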

RAII: Locks That Can’t Leak

The legacy system used file-based locks with no timeouts or heartbeats:

// If the process crashes while holding this lock, it's stuck forever
static async acquireConfigUpdateLock(dir: string): Promise<void> {
    if (!(await acquireLock(dir, CONFIG_UPDATE_LOCK_NAME))) {
        throw new AppError('Failed to acquire config update lock.');
    }
    // No timeout. No heartbeat. Crash = lock held forever.
}

In Rust, RAII (Resource Acquisition Is Initialization) ties the release to the guard’s lifetime, so forgetting to unlock on an early return, an error, or an unwinding panic stops being possible:

struct ConfigLock {
    path: PathBuf,
    acquired_at: Instant,
    ttl: Duration,
}

impl Drop for ConfigLock {
    fn drop(&mut self) {
        // Automatically called when ConfigLock goes out of scope — even on panic!
        let _ = std::fs::remove_file(&self.path);
    }
}

async fn with_config_lock<T, F>(resource: &str, ttl: Duration, f: F) -> Result<T, LockError>
where F: FnOnce(&ConfigLock) -> Result<T, LockError>
{
    let lock = acquire_lock(resource, ttl).await?;
    f(&lock)
    // lock dropped here automatically — file released no matter what
}

let result = with_config_lock("config-update", Duration::from_secs(30), |_lock| {
    extract_bundle(&dir)?;
    save_system(&dir)?;
    Ok("deployed")
}).await?;
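// Lock released here, even if a step returned early or panicked (unwinding)

The lock cannot leak on any normal exit path because Drop::drop() runs when the guard goes out of scope, including during panic unwinding. It does not run if the process is killed or aborts, and that case is presumably why the guard carries acquired_at and ttl: a stale lock file can be detected and reclaimed once it expires.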
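Here is a small synchronous sketch you can run to watch the guard do its job. It uses create_new so that only one holder can create the lock file, and it leaves out the TTL handling entirely. FileLock and with_lock are illustrative names:

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Guard for a lock file: holding the value means holding the lock.
pub struct FileLock {
    path: PathBuf,
}

impl FileLock {
    /// create_new fails if the file already exists, so a second holder is refused.
    pub fn acquire(path: &Path) -> io::Result<FileLock> {
        OpenOptions::new().write(true).create_new(true).open(path)?;
        Ok(FileLock { path: path.to_path_buf() })
    }
}

impl Drop for FileLock {
    fn drop(&mut self) {
        // Runs on scope exit and during panic unwinding; errors are ignored on purpose.
        let _ = fs::remove_file(&self.path);
    }
}

/// Runs `f` while the lock is held and releases it afterwards, whatever `f` does.
pub fn with_lock<T>(path: &Path, f: impl FnOnce() -> T) -> io::Result<T> {
    let _guard = FileLock::acquire(path)?;
    Ok(f())
}
```

Inside with_lock the file exists; after it returns the file is gone, and the same holds when the closure panics and the stack unwinds. What this sketch cannot cover is a hard kill, which is where an expiry check on the lock file has to take over.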

Serialization: Schema Evolution as an ADT

The legacy heartbeat system used JSON serialization for 100,000+ metrics per heartbeat:

// JSON.parse for 100K metrics: ~500ms–1s
// With a 10s heartbeat interval, serialization alone eats 5–10% of your cycle time
// And there's no versioning — if the schema changes, old and new nodes break silently

With Rust enums, the protocol schema is defined once and versioning is a first-class concern:

enum HeartbeatMessage {
    V1 { metrics: Vec<MetricV1> },
    V2 { metrics: Vec<MetricV2>, deltas: Vec<DeltaMetric> },  // added delta support
}

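// Schema evolution is an enum: every version must be explicitly handled
fn parse_heartbeat(data: &[u8]) -> Result<HeartbeatMessage, ParseError> {
    // split_first avoids the panic that data[0] would cause on an empty buffer.
    // ParseError::Empty is an assumed variant.
    let (&version, payload) = data.split_first().ok_or(ParseError::Empty)?;
    match version {
        1 => parse_v1(payload),
        2 => parse_v2(payload),
        _ => Err(ParseError::UnknownVersion(version)),
        // Add V3 to HeartbeatMessage and the compiler flags every match on the enum.
        // This byte-level dispatch is the one place you must remember to update by hand.
    }
}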

With a binary format such as Protobuf or FlatBuffers (FlatBuffers also supports zero-copy reads), deserialization can run 10–100x faster than JSON. And schema evolution is no longer an afterthought: the enum ensures every protocol version is explicitly handled.
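To make the versioning idea tangible, here is a toy wire format that runs as-is. The layout is entirely made up for this sketch: the first byte is the version, V1 treats the rest as one-byte metrics, and V2 adds a count byte followed by metrics and then signed deltas:

```rust
#[derive(Debug, PartialEq)]
pub enum HeartbeatMessage {
    V1 { metrics: Vec<u8> },
    V2 { metrics: Vec<u8>, deltas: Vec<i8> },
}

#[derive(Debug, PartialEq)]
pub enum ParseError {
    Empty,
    Truncated,
    UnknownVersion(u8),
}

/// Assumed layout: [version, payload...]. V2 payload: [metric_count, metrics..., deltas...].
pub fn parse_heartbeat(data: &[u8]) -> Result<HeartbeatMessage, ParseError> {
    let (&version, body) = data.split_first().ok_or(ParseError::Empty)?;
    match version {
        1 => Ok(HeartbeatMessage::V1 { metrics: body.to_vec() }),
        2 => {
            let (&count, rest) = body.split_first().ok_or(ParseError::Truncated)?;
            let count = count as usize;
            if rest.len() < count {
                return Err(ParseError::Truncated);
            }
            let (metrics, deltas) = rest.split_at(count);
            Ok(HeartbeatMessage::V2 {
                metrics: metrics.to_vec(),
                // Reinterpret each remaining byte as a signed delta.
                deltas: deltas.iter().map(|&b| b as i8).collect(),
            })
        }
        other => Err(ParseError::UnknownVersion(other)),
    }
}

/// A consumer of the enum: adding a V3 variant makes this match fail to compile.
pub fn metric_count(msg: &HeartbeatMessage) -> usize {
    match msg {
        HeartbeatMessage::V1 { metrics } => metrics.len(),
        HeartbeatMessage::V2 { metrics, .. } => metrics.len(),
    }
}
```

An old node that receives a version it has never heard of gets UnknownVersion back instead of misreading the bytes, which is the “break loudly, not silently” behavior the JSON version lacked.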


V. What Are Algebraic Effects?

ADTs solve the problem of representing valid states. Algebraic Effects solve a different but related problem: how to separate what code needs from how those needs are fulfilled without forcing that separation to infect every caller in the chain.

The Intuition: Exceptions That Can Resume

You already understand exceptions: when you throw, execution stops and the stack unwinds:

function getName() {
    throw new Error("need a name");  // Execution stops. Stack unwinds. Gone.
}

try {
    getName();
} catch (e) {
    // We're here, but getName() is DEAD. We can't go back.
}

Now imagine if, instead of killing getName(), the handler could answer the question and let it continue:

function getName() {
    const name = perform AskUser("What's your name?");  // Pause, don't die
    return `Hello, ${name}`;  // Continues after handler responds!
}

handle(getName(), {
    AskUser: (question, resume) => {
        const answer = prompt(question);
        resume(answer);  // Jump BACK into getName() with the answer
    }
});

That’s algebraic effects in one phrase: exceptions that can resume. The code that performs an effect doesn’t die; it pauses, gets an answer, and continues where it left off. Think of it this way: regular exceptions are like quitting your job when you have a question. Effects are like asking your manager: you pause, they answer, you continue.

The Function Coloring Problem

Here’s why effects matter for real systems. Once a function is async, everything that calls it must also be async:

async function getConfig(): Promise<Config> { ... }
async function processEvent(e: Event): Promise<void> {  // must be async because getConfig is
    const config = await getConfig();
    // ...
}
async function handleRequest(req: Request): Promise<Response> {  // must be async because processEvent is
    await processEvent(req.body);
    // ...
}

One async function forces asyncness through the entire call stack. This is generally called “function coloring”: async and sync functions are different “colors,” and they can’t mix freely. The same problem applies to error handling (once you use Result, every caller must handle it), to dependencies (once you need config, every caller must thread it through), and to logging (once you need a logger, every intermediate function must pass it along). Effects solve this by separating what a function needs from who provides it. Intermediate functions stay uncolored:

// With effects (conceptual syntax):
function getConfig(): Effect<ConfigService, Config> {
    return perform GetConfig;
}

function processEvent(e: Event): Effect<ConfigService, void> {
    const config = getConfig();  // NOT async! Just performs an effect.
    transform(e, config);
}

// Only the TOP-LEVEL handler knows how config is provided:
handle(processEvent(event), {
    GetConfig: (resume) => {
        const config = loadFromDisk();  // or from env, or hardcoded for tests
        resume(config);
    }
});

processEvent doesn’t know or care whether config comes from disk, network, or a test fixture. The handler at the boundary decides. Intermediate functions don’t need to thread the dependency through.

You Already Use Effects

If you use React, you’re already working with algebraic effects in disguise. React Hooks are effects:

function Counter() {
    const [count, setCount] = useState(0);  // "perform GetState" — component doesn't manage storage
    useEffect(() => { ... });               // "perform ScheduleSideEffect"
    const data = use(fetchData());          // "perform Suspend"
    return <div>{count}</div>;
}

useState doesn’t tell the component where state lives. It performs an effect (“I need state”), and the React runtime acts as the handler that provides it. The component doesn’t know if state is in memory, in a reducer, or synced to a server. React Suspense is literally “throw, then resume”:

// Simplified React Suspense:
function fetchData() {
    if (!cache.has(key)) {
        throw promise;  // "perform Suspend" — throws a Promise UP the tree
        // React catches it, shows fallback, waits for promise to resolve,
        // then RE-RENDERS the component — effectively "resuming" it with data
    }
    return cache.get(key);
}

This closely mirrors the algebraic effects pattern: code performs an effect (throws a Promise), a handler catches it (the Suspense boundary), and the code is resumed with the result. The difference is that React re-runs the component from the top instead of continuing from the throw site. React couldn’t add real algebraic effects to JavaScript, so they simulated them with throw/re-render.

Everything Is the Same Control Flow Mechanism

Look at these seemingly different language features:

Feature             “Perform”        “Handle”            “Resume”
Exceptions          throw error      try/catch           No (can’t resume)
Async/Await         await promise    Runtime scheduler   Resolves with value
Generators          yield value      for..of consumer    .next(value)
React Hooks         useState()       React runtime       Re-render with state
DI Container        @Inject          Container config    Constructor call
Algebraic Effects   perform effect   handle block        resume(value)

They’re all the same pattern: (1) code declares “I need something,” (2) something up the call stack provides it, (3) execution continues with the provided value. Algebraic effects are just the general version that unifies all the others. The historical arc of control flow in programming languages tells the same story:

goto --> structured control (if/while) --> exceptions --> continuations --> algebraic effects

Each step gives more structured, more composable control over program flow.

The Monad Infection Problem

If you’ve used functional languages, you know what happens once you use Result, Option, Future, or IO: every function in the chain must return that type:

fn get_config() -> Result<Config, Error> { ... }
fn parse_event(config: &Config) -> Result<Event, Error> { ... }
fn validate(event: &Event) -> Result<ValidEvent, Error> { ... }

fn process() -> Result<Output, Error> {
    let config = get_config()?;
    let event = parse_event(&config)?;
    let valid = validate(&event)?;
    Ok(transform(valid))
}

Once one function returns Result<T, E>, everything up the chain must acknowledge it. This is the same coloring problem as async, just with error types. Effects solve this: the function just performs the effect, and a single handler at the top decides what to do. Intermediate functions stay clean.

For example, Jane Street’s hardware simulation team switched from monads to OCaml 5’s algebraic effects for exactly this reason. Their testbench code had to synchronize threads stepping through clock cycles. With monads, every function needed special let%bind syntax and couldn’t use normal OCaml features. With effects:

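(* Effect declaration: Step carries no payload and resumes with unit *)
open Effect
open Effect.Deep

type _ Effect.t += Step : unit Effect.t
let step () = perform Step

(* Business logic is PLAIN OCaml: no special syntax *)
let run_testbench () =
    let clk = read_signal clock in
    step ();                           (* "perform Step": suspend until next clock cycle *)
    let data = read_signal data_bus in
    assert (data = expected);
    step ();                           (* Step again: handler resumes us at next cycle *)
    write_signal reset 1

(* Handler provides the simulation scheduler *)
let simulate circuit testbench =
    match_with testbench () {
        retc = (fun result -> result);     (* Testbench finished normally *)
        exnc = raise;                      (* Re-raise testbench exceptions *)
        effc = (fun (type a) (eff : a Effect.t) ->
            match eff with
            | Step -> Some (fun (k : (a, _) continuation) ->
                advance_circuit circuit;   (* Tick the simulated hardware *)
                continue k ()              (* Resume testbench at next line *)
              )
            | _ -> None                    (* Not ours: let an outer handler decide *)
        )
    }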

The testbench reads like sequential code without monadic boilerplate. The step() call suspends execution, the handler advances the simulated hardware clock, and execution resumes.

Effects in Languages You Use Today

You don’t need OCaml 5 or Koka. Effects can be approximated in any language. In TypeScript using generator functions:

function* processEvent(event: RawEvent) {
    const config = yield { effect: 'getConfig' };           // "perform GetConfig"
    const enabled = yield { effect: 'checkFlag', flag: 'v2' }; // "perform CheckFlag"
    yield { effect: 'log', msg: 'processing' };             // "perform Log"
    return transform(event, config);
}

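// An effect request: the handler key plus any payload fields
type EffectRequest = { effect: string; [field: string]: unknown };
type Handlers = Record<string, (request: EffectRequest) => unknown>;

// Handler interprets the effects
function runWithHandler<T>(gen: Generator<EffectRequest, T, unknown>, handlers: Handlers): T {
    let result = gen.next();
    while (!result.done) {
        const request = result.value;
        const handler = handlers[request.effect];
        if (!handler) throw new Error(`Unhandled effect: ${request.effect}`);
        result = gen.next(handler(request));  // "resume with value"
    }
    return result.value;
}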

// Production vs test — trivially swapped
const prodResult = runWithHandler(processEvent(event), productionHandlers);
const testResult = runWithHandler(processEvent(event), testHandlers);

In Python using context variables:

from contextvars import ContextVar

config_effect: ContextVar[Config] = ContextVar('config')
metrics_effect: ContextVar[MetricsCollector] = ContextVar('metrics')

def process_event(event):
    config = config_effect.get()      # "perform GetConfig"
    metrics = metrics_effect.get()    # "perform GetMetrics"
    return transform(event, config)

# Handler provides implementations at the boundary
config_effect.set(production_config)
metrics_effect.set(prometheus_collector)
result = process_event(event)

VI. Algebraic Effects Applied to Real Problems

Problem 1: Dependency Injection Without a Framework

The legacy codebase had 816 files coupled to global singletons:

// Configuration.instance() called in 858 files
// ProcessInfo singleton accessed in 320 files
// GlobalMetrics singleton in 200+ files
// FeatureFlags singleton in 186 files

class WorkerConnection {
    async configure() {
        const config = Configuration.instance();     // Hidden dependency
        const metrics = GlobalMetrics.instance();    // Hidden dependency
        const flags = FeatureFlags.instance();       // Hidden dependency
        const env = process.env.DEPLOYMENT_MODE;     // Hidden dependency (488 files!)
    }
}

You can’t test this without the real singleton. You can’t run different configurations in the same process. And the dependencies are invisible because you discover them at runtime via crashes. Effects-style DI (approximated with the Reader pattern in TypeScript):

type AppDeps = {
    config: IConfigProvider;
    metrics: IMetricsCollector;
    flags: IFeatureFlags;
    clock: IClock;
};

// Business logic is a pure function of its dependencies
function configurePipeline(deps: AppDeps) {
    return (pipeline: PipelineConfig): Result<ConfiguredPipeline, ConfigError> => {
        const features = deps.flags.getEnabled(pipeline.namespace);
        const stages = pipeline.stages
            .filter(s => features.includes(s.requiredFeature))
            .map(s => buildStage(s, deps.config));
        return { ok: true, value: { stages, configuredAt: deps.clock.now() } };
    };
}

// Production wiring — one place, at startup
const production = configurePipeline({
    config: new FileConfigProvider('/etc/app/config.yaml'),
    metrics: new PrometheusCollector(),
    flags: new LaunchDarklyFlags(apiKey),
    clock: SystemClock,
});

// Tests — zero mocking frameworks needed
const test = configurePipeline({
    config: { get: (key) => testDefaults[key] },
    metrics: new NoOpCollector(),
    flags: { getEnabled: () => ['all-features'] },
    clock: { now: () => new Date('2024-01-01') },
});

In languages with native effect support (OCaml 5, Koka, Eff), this becomes even cleaner as intermediate functions don’t need to accept or pass deps at all. They just perform GetConfig and the handler provides the value.

Problem 2: Multiple Metrics Implementations

The legacy system had multiple parallel metrics implementations built by different teams, each with stringly-typed dimensions:

// different ways to record metrics, scattered across 17+ files
IMetricsStore
GlobalMetrics
IoMetricsMgr
DataInsightsMetricsMgr
LocalSearchMetricsReporter

// Plus per-class ad-hoc metrics: PeriodicStats, ConnectionMetrics, PacketReducer...

// Stringly-typed dimensions — typos produce SILENT missing metrics:
metrics.record(['id', prefixId, 'route', routeId]);  // Swap any string? Silent wrong data.

With a single metrics effect:

type MetricEffect =
    | { kind: 'counter', name: MetricName, value: number, tags: MetricTags }
    | { kind: 'gauge', name: MetricName, value: number, tags: MetricTags }
    | { kind: 'histogram', name: MetricName, value: number, tags: MetricTags };

// Branded types prevent typos
type MetricName = string & { __brand: 'MetricName' };
type MetricTags = Record<TagKey, TagValue>;  // Also branded

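// Business logic performs the effect; it doesn't know WHERE metrics go
// (pseudocode: `perform` is not TypeScript syntax, and MetricName(...) stands for a branding helper)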
function processRoute(event: Event, route: Route): ProcessedEvent {
    perform { kind: 'counter', name: MetricName('events.processed'), value: 1, tags: { route: route.id } };
    const result = transform(event, route);
    perform { kind: 'histogram', name: MetricName('events.latency_ms'), value: elapsed(), tags: { route: route.id } };
    return result;
}

// Handler decides: Prometheus? StatsD? Both? Test collector? All swappable.

Five implementations and seventeen files collapse into one typed effect that the compiler validates.
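
Since TypeScript has no perform keyword, one way to approximate the metric effect today is to pass the handler in as a plain function. The following is a minimal sketch under that assumption; metricName, MetricHandler, collectingHandler, and the simplified processRoute signature are hypothetical names introduced for illustration, not part of the legacy system.

```typescript
// Branded name: a plain string literal is rejected unless it goes through metricName().
type MetricName = string & { readonly __brand: 'MetricName' };
const metricName = (raw: string): MetricName => raw as MetricName;

type MetricTags = Readonly<Record<string, string>>;

type MetricEffect =
    | { kind: 'counter'; name: MetricName; value: number; tags: MetricTags }
    | { kind: 'gauge'; name: MetricName; value: number; tags: MetricTags }
    | { kind: 'histogram'; name: MetricName; value: number; tags: MetricTags };

// The "handler": business logic only ever sees this function type.
type MetricHandler = (effect: MetricEffect) => void;

// Metric names are declared once, so a typo shows up in one place.
const EVENTS_PROCESSED = metricName('events.processed');

// Hypothetical business function: emits a metric, does not know where it goes.
function processRoute(payload: string, routeId: string, perform: MetricHandler): string {
    perform({ kind: 'counter', name: EVENTS_PROCESSED, value: 1, tags: { route: routeId } });
    return payload.toUpperCase();
}

// Test handler: collects effects in memory instead of sending them anywhere.
function collectingHandler(): { handler: MetricHandler; seen: MetricEffect[] } {
    const seen: MetricEffect[] = [];
    return { handler: (effect) => { seen.push(effect); }, seen };
}

const collector = collectingHandler();
const processed = processRoute('hello', 'route-1', collector.handler);
```

A production handler would have the same MetricHandler type and forward each effect to whichever backend is in use, so swapping backends never touches processRoute.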

Problem 3: Auth Tokens Anyone Can Forge

The legacy system used a single shared HS256 symmetric token for ALL workers:

// All workers share the same symmetric auth secret
// HS256 symmetric means: every worker can FORGE admin tokens!
// No per-node identity. No revocation without rotating for ALL.

const isValid = authToken === this.masterAuthToken;  // Raw secret comparison

With branded types, per-worker tokens become type-enforced:

type WorkerToken = string & { __brand: 'WorkerToken', workerId: WorkerId, scope: TokenScope };
type LeaderToken = string & { __brand: 'LeaderToken' };

type TokenScope =
    | { kind: 'control_plane', permissions: ControlPermission[] }
    | { kind: 'data_plane', routes: RouteId[] }
    | { kind: 'metrics_only' };

// Functions declare what token scope they require
function deployConfig(token: WorkerToken & { scope: { kind: 'control_plane' } }): Result<...> {
    // Can ONLY be called with a control-plane scoped token
    // Data-plane tokens won't typecheck here
}

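Now a compromised worker’s data-plane token fails to typecheck wherever a control-plane token is required. The brand is a compile-time label only, though: it stops forgery at runtime only if tokens are branded in one place, after cryptographic verification (for example, with per-worker keys instead of the shared HS256 secret).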
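
Branded types are erased at runtime, so the guarantee depends on where the brand is created. A minimal sketch of that boundary follows, assuming the actual signature check is supplied as a function; VerifiedToken, Claims, Verifier, verifyToken, and stubVerify are hypothetical names, and the cryptography itself is out of scope here.

```typescript
type ScopeKind = 'control_plane' | 'data_plane' | 'metrics_only';

// The brand is phantom: it exists only for the type checker.
type VerifiedToken<S extends ScopeKind> = {
    readonly raw: string;
    readonly workerId: string;
    readonly scope: S;
    readonly __brand: 'VerifiedToken';
};

// Hypothetical result of a signature check (for example against a per-worker key).
type Claims = { workerId: string; scope: ScopeKind };
type Verifier = (raw: string) => Claims | null;

// The only place a VerifiedToken is created: after verification, for one required scope.
function verifyToken<S extends ScopeKind>(
    raw: string,
    required: S,
    verify: Verifier,
): VerifiedToken<S> | null {
    const claims = verify(raw);
    if (claims === null || claims.scope !== required) return null;
    return { raw, workerId: claims.workerId, scope: required, __brand: 'VerifiedToken' };
}

// Accepts only control-plane tokens; passing a data-plane token is a compile error.
function deployConfig(token: VerifiedToken<'control_plane'>): string {
    return `deploying as ${token.workerId}`;
}

// Stub verifier for the example: it knows exactly one token.
const stubVerify: Verifier = (raw) =>
    raw === 'good-token' ? { workerId: 'worker-7', scope: 'control_plane' } : null;

const accepted = verifyToken('good-token', 'control_plane', stubVerify);
const rejected = verifyToken('forged', 'control_plane', stubVerify);
const wrongScope = verifyToken('good-token', 'data_plane', stubVerify);
```

The types keep callers honest about scope; the verifier is what keeps attackers out. Both are needed.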

Problem 4: Control Flow Disguised as Errors

The legacy codebase used exceptions for control flow:

for (const event of events) {
    try {
        processEvent(event);
    } catch (e) {
        if (e instanceof SkipEventError) continue;    // Control flow disguised as error!
        if (e instanceof AppError) logger.warn(e);
        if (e instanceof PipelineError) { ... }
        // Unknown errors fall through and are silently swallowed
    }
}

There were multiple error hierarchies (AppError, RESTError, RpcError, PipelineError) with no unified classification. With effects, control flow signals and failures are distinct and handled separately:

type ControlEffect =
    | { kind: 'skip', reason: string }
    | { kind: 'retry', after: Duration }
    | { kind: 'terminate', gracefully: boolean };

type FailureEffect =
    | { kind: 'transient', error: Error, retryable: true }
    | { kind: 'permanent', error: Error, retryable: false }
    | { kind: 'validation', field: string, message: string };

// Business logic declares intent; it doesn't decide policy
// (pseudocode: `perform`, `handle`, and the Effect<...> type are not built into TypeScript)
function processEvent(event: RawEvent): Effect<ControlEffect | FailureEffect, ProcessedEvent> {
    if (!isRelevant(event)) {
        return perform { kind: 'skip', reason: 'irrelevant event type' };
    }
    const validated = validate(event);
    if (!validated.ok) {
        return perform { kind: 'validation', field: validated.field, message: validated.message };
    }
    return transform(validated.value);
}

// Handler decides policy — completely separate from business logic
const withPolicy = handle(processEvent(event), {
    skip: (effect, resume) => { metrics.increment('skipped'); resume(null); },
    transient: (effect, resume) => { queue.requeue(event); resume(null); },
    permanent: (effect, resume) => { deadLetter.send(event, effect.error); resume(null); },
    validation: (effect, resume) => { logger.warn('Validation failed', effect); resume(null); },
});

The business logic says “this event should be skipped” or “this operation failed transiently.” It doesn’t decide whether to retry, log, or dead-letter. That’s the handler’s job and handlers can be swapped independently.
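
The same separation can be approximated in plain TypeScript by returning the signal as data and letting one policy function interpret it. This is a sketch, not the handler mechanism itself: nothing is resumed, and Outcome, Action, decide, and the simplified RawEvent shape are hypothetical names for illustration.

```typescript
type Signal =
    | { kind: 'skip'; reason: string }
    | { kind: 'transient'; error: Error }
    | { kind: 'permanent'; error: Error }
    | { kind: 'validation'; field: string; message: string };

type Outcome<T> = { kind: 'done'; value: T } | Signal;

type RawEvent = { type: string; payload: string };

// Business logic: states what happened, never what to do about it.
function processEvent(event: RawEvent): Outcome<string> {
    if (event.type !== 'order') return { kind: 'skip', reason: 'irrelevant event type' };
    if (event.payload === '') return { kind: 'validation', field: 'payload', message: 'empty' };
    return { kind: 'done', value: event.payload.trim() };
}

// Policy: one place decides what each signal means operationally.
type Action = 'forward' | 'drop' | 'requeue' | 'dead-letter';

function decide(outcome: Outcome<string>): Action {
    switch (outcome.kind) {
        case 'done': return 'forward';
        case 'skip': return 'drop';
        case 'validation': return 'drop';
        case 'transient': return 'requeue';
        case 'permanent': return 'dead-letter';
        default: {
            // Compile error here if a new kind is added without a policy.
            const unreachable: never = outcome;
            return unreachable;
        }
    }
}

const forwarded = decide(processEvent({ type: 'order', payload: ' widget ' }));
const dropped = decide(processEvent({ type: 'ping', payload: 'x' }));
const requeued = decide({ kind: 'transient', error: new Error('timeout') });
```

Swapping policy means swapping decide; processEvent is untouched and can be tested without any queue or logger.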

Problem 5: No Circuit Breakers

The legacy system had no circuit breakers. When a downstream service failed, requests piled up until the process crashed:

dest.connect().catch(NOOP);  // If it fails, try again next time. Or don't. Who knows.

// Retry with infinite loop and no idempotency:
while (true) {
    try {
        await writeToFile(...);
        callback();
        break;
    } catch {
        await delay(1000);  // Retry forever. No backoff. No limit. No idempotency check.
    }
}

With effects, retry and circuit-breaking become composable middleware:

type RetryPolicy =
    | { kind: 'none' }
    | { kind: 'fixed', attempts: number, delay: Duration }
    | { kind: 'exponential', maxAttempts: number, baseDelay: Duration, maxDelay: Duration }
    | { kind: 'circuitBreaker', failureThreshold: number, resetAfter: Duration };

// Circuit breaker itself is a state machine: an ADT!
type CircuitState =
    | { kind: 'closed', failureCount: number }
    | { kind: 'open', openedAt: Date, failureCount: number }
    | { kind: 'halfOpen', failureCount: number };   // one trial request is let through

type CircuitEvent = { kind: 'success' } | { kind: 'failure' };

// Assumed names: cfg carries the thresholds, now is passed in so the function stays pure
type CircuitConfig = { failureThreshold: number, resetAfterMs: number };

function circuitTransition(
    state: CircuitState, event: CircuitEvent, cfg: CircuitConfig, now: Date,
): CircuitState {
    switch (state.kind) {
        case 'closed':
            if (event.kind === 'failure') {
                const newCount = state.failureCount + 1;
                if (newCount >= cfg.failureThreshold) return { kind: 'open', openedAt: now, failureCount: newCount };
                return { ...state, failureCount: newCount };
            }
            return { kind: 'closed', failureCount: 0 };
        case 'open':
            // Events are ignored while open; only elapsed time moves us to halfOpen
            if (now.getTime() - state.openedAt.getTime() > cfg.resetAfterMs) {
                return { kind: 'halfOpen', failureCount: state.failureCount };
            }
            return state;
        case 'halfOpen':
            if (event.kind === 'success') return { kind: 'closed', failureCount: 0 };
            return { kind: 'open', openedAt: now, failureCount: state.failureCount };
    }
}

Notice: the circuit breaker itself is modeled as an ADT with exhaustive state transitions. ADTs model the state. Effects separate the retry policy from the code that needs retrying. Together they create systems that are both correct and composable.
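
To make "separating the retry policy from the code that needs retrying" concrete, here is a minimal sketch that splits retry into a pure calculation (which delays to wait) and a thin action (running the operation). The field names delayMs, baseDelayMs, and maxDelayMs, and the functions retryDelays and withRetry, are assumptions for illustration; the circuit-breaker variant is left out to keep the example short.

```typescript
type RetryPolicy =
    | { kind: 'none' }
    | { kind: 'fixed'; attempts: number; delayMs: number }
    | { kind: 'exponential'; maxAttempts: number; baseDelayMs: number; maxDelayMs: number };

// Pure calculation: the waits between attempts, in milliseconds.
function retryDelays(policy: RetryPolicy): number[] {
    switch (policy.kind) {
        case 'none':
            return [];
        case 'fixed': {
            const { attempts, delayMs } = policy;
            return Array.from({ length: Math.max(0, attempts - 1) }, () => delayMs);
        }
        case 'exponential': {
            const { maxAttempts, baseDelayMs, maxDelayMs } = policy;
            // Double the delay on each retry, capped at maxDelayMs.
            return Array.from({ length: Math.max(0, maxAttempts - 1) }, (_unused, i) =>
                Math.min(baseDelayMs * 2 ** i, maxDelayMs));
        }
    }
}

// Action: runs op, waiting between failed attempts. Total attempts = delays + 1.
async function withRetry<T>(policy: RetryPolicy, op: () => Promise<T>): Promise<T> {
    for (const delayMs of retryDelays(policy)) {
        try {
            return await op();
        } catch {
            await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
        }
    }
    return op(); // final attempt: a failure here propagates to the caller
}

const exponentialDelays = retryDelays({ kind: 'exponential', maxAttempts: 4, baseDelayMs: 100, maxDelayMs: 300 });
const fixedDelays = retryDelays({ kind: 'fixed', attempts: 3, delayMs: 50 });
const noDelays = retryDelays({ kind: 'none' });
```

Unlike the legacy while (true) loop, the number of attempts is bounded by the policy, and the schedule can be unit-tested without waiting on real timers.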


VII. Design Thinking: Transformations Over Entities

Here’s an insight that ties everything together: design the transformations first, then the things being transformed. A system’s architecture is defined by how data flows, not by what objects exist.

The God Class Problem: Architecture You Can’t See

// A pipeline manager — 1,300+ lines, 80+ methods
class PipelineManager {
    process(event: any) {
        if (this.shouldFilter(event)) return;     // filtering concern
        this.metrics.increment('processed');       // observability concern
        const result = this.transform(event);     // transformation concern
        this.route(result);                       // routing concern
        this.metrics.recordLatency(start);        // observability again
    }
}

The architecture is invisible. Everything is tangled. You can’t test transformation without routing. You can’t add observability without modifying the pipeline. When you model the same thing as typed functions, the architecture becomes visible:

// Each stage is a typed function with a clear input/output contract
fn parse(raw: RawEvent) -> Result<ParsedEvent, ParseError> { ... }
fn validate(parsed: ParsedEvent) -> Result<ValidEvent, ValidationError> { ... }
fn enrich(valid: ValidEvent) -> Result<EnrichedEvent, EnrichError> { ... }
fn route(enriched: &EnrichedEvent) -> RoutingDecision { ... }

// Composition IS the architecture — visible, testable, reorderable
fn process_event(raw: RawEvent) -> Result<EnrichedEvent, PipelineError> {
    let parsed = parse(raw)?;
    let valid = validate(parsed)?;
    let enriched = enrich(valid)?;
    Ok(enriched)
}

// Cross-cutting concerns are separate composable wrappers
let pipeline = WithMetrics::new("pipeline", process_event);
let pipeline = WithFilter::new(filter_config, pipeline);
let pipeline = WithRouting::new(route_table, pipeline);

Each stage is independently testable. Adding observability doesn’t touch business logic. Reordering is just reordering function composition. The types document the flow: RawEvent --> ParsedEvent --> ValidEvent --> EnrichedEvent. This is what “the arrows are the architecture” means: the transformations between types are the system’s behavior.

Rust’s ? Is Railway-Oriented Programming Built In

Think of data processing as a railway with two tracks: success and failure. Data flows along the success track until something goes wrong; then it switches to the failure track and skips all remaining stages:

// Each ? is a branch point onto the failure track
fn process_event(raw: RawEvent) -> Result<ClassifiedEvent, PipelineError> {
    let parsed = parse(raw)?;         // fails? switch to error track
    let valid = validate(parsed)?;    // fails? switch to error track
    let enriched = enrich(valid)?;    // fails? switch to error track
    let classified = classify(enriched)?;
    Ok(classified)
}

// Each piece tested in isolation:
#[test]
fn parse_handles_malformed_json() {
    let result = parse(RawEvent::new("not json"));
    assert!(matches!(result, Err(PipelineError::MalformedInput { .. })));
}

Rust’s ? operator is this pattern built into the language syntax. No special library, no monadic boilerplate: the language itself is railway-oriented.

Thinking in Transformations

Not all transformations are the same. Knowing which kind you’re building helps you choose the right pattern:

  • One-to-one (parsing, validation): every input produces exactly one output. These compose directly: parse >> validate >> enrich.
  • One-to-many (fan-out, splitting): one input produces multiple outputs. Use flatMap or stream splitting; for example, one log line becomes multiple metrics events.
  • Many-to-one (aggregation): multiple inputs combine into one. Use a windowed reduce; for example, 1000 metric samples become a single P99 value.
  • Reversible (encoding, encryption): can be undone without loss. Good for serialization boundaries where you need to cross system edges.
  • Self-directed (state transitions): transforms a value into another of the same type. State machines are exactly this, e.g., State --> State. An ADT enum is the natural representation.

The legacy PipelineManager muddled all five together in one class. Separating them makes each stage’s contract explicit and independently testable.
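
The five kinds listed above can each be shown as a one-line function. This is an illustrative sketch only; every function name and the sample data are made up for the example.

```typescript
// One-to-one: every input yields exactly one output.
const parseLevel = (line: string): string => line.split(' ')[0] ?? '';

// One-to-many: one log line fans out into several metric events.
const toMetrics = (line: string): string[] => line.split(' ').map((word) => `seen:${word}`);

// Many-to-one: a window of samples reduces to a single value.
const maxOf = (samples: number[]): number => samples.reduce((a, b) => Math.max(a, b), -Infinity);

// Reversible: decode(encode(x)) returns x.
const encode = (text: string): string => encodeURIComponent(text);
const decode = (text: string): string => decodeURIComponent(text);

// Self-directed: State --> State.
type Light = 'red' | 'green' | 'yellow';
const nextLight = (light: Light): Light =>
    light === 'red' ? 'green' : light === 'green' ? 'yellow' : 'red';

const lines = ['WARN disk full', 'INFO ok'];
const levels = lines.map(parseLevel);          // one-to-one: same number of outputs as inputs
const metrics = lines.flatMap(toMetrics);      // one-to-many: more outputs than inputs
const worst = maxOf([12, 48, 7]);              // many-to-one: a single output
const roundTrip = decode(encode('a b&c'));     // reversible: back where we started
const afterTwo = nextLight(nextLight('red'));  // self-directed: same type in and out
```

The signatures already tell the kinds apart: T => U, T => U[], T[] => U, a pair of inverse functions, and T => T.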

Measuring Coupling Through Connections

Here’s a concrete way to see how much a legacy architecture costs. Count the connections:

Point-to-point (legacy): N services = N × (N-1) / 2 connections
  10 services  =    45 connections
  20 services  =   190 connections
  50 services  = 1,225 connections  <-- quadratic growth

Data-oriented: N services = N connections (each talks to a shared typed data layer)
  10 services  =  10 connections
  20 services  =  20 connections
  50 services  =  50 connections   <-- linear growth

The legacy system’s 125+ endpoints each know about each other implicitly through shared singletons, events, and direct calls. Adding endpoint #126 means understanding what it might break in endpoints #1–125.

With a data-oriented approach, each component only needs to understand the shared data schema instead of every other component. The tradeoff: schema design becomes your hardest decision. Data outlives code. You can rewrite a service in a weekend, but migrating a billion records takes months. Get the ADTs right before committing.

Stratified Design: Layers by Rate of Change

Within the functional core, code should be layered by how often it changes:

Layer 4 (changes weekly):    Business rules, feature flags, pricing logic
Layer 3 (changes monthly):   Domain logic, validation, workflow orchestration
Layer 2 (changes quarterly): Framework utilities, pipeline combinators, retry policies
Layer 1 (changes yearly):    Language extensions, data structures, core types

Each layer only calls downward. A change in Layer 4 (a new pricing rule) cannot break Layer 1 (your Result type), so churn in the volatile layers does not cascade into the stable ones.

// Layer 1: Stable foundation (built into the language)
// Result<T, E>, Option<T>, Traits: From, Into, TryFrom

// Layer 2: Domain-specific combinators
async fn with_retry<T, F, Fut>(policy: &RetryPolicy, f: F) -> Result<T, Error>
where F: Fn() -> Fut, Fut: Future<Output = Result<T, Error>> { ... }
async fn with_circuit_breaker<T, F, Fut>(state: &CircuitState, f: F) -> Result<T, Error>
where F: Fn() -> Fut, Fut: Future<Output = Result<T, Error>> { ... }

// Layer 3: Business domain
fn validate_pipeline(config: &PipelineConfig) -> Result<ValidPipeline, Vec<ValidationError>> { ... }
fn route_event(event: &ValidEvent, table: &RouteTable) -> RoutingDecision { ... }

// Layer 4: Configuration and policies (changes frequently)
let route_table: RouteTable = load_config("routes.yaml")?;
let retry_policy = RetryPolicy::Exponential { max_attempts: 3, base_delay_ms: 100 };

Replace Imperative Loops with Pipelines

The legacy codebase had hundreds of imperative accumulation loops:

// Legacy: imperative accumulation (hundreds of instances)
const results = [];
for (const worker of workers) {
    if (worker.isActive()) {
        const metrics = await worker.getMetrics();
        if (metrics.cpuUsage > threshold) {
            results.push({ workerId: worker.id, cpu: metrics.cpuUsage });
        }
    }
}

Iterator combinators express the same thing as a pipeline in which each step is independently readable and testable:

// Declare WHAT, not HOW
let results: Vec<_> = workers.iter()
    .filter(|w| w.is_active())
    .filter_map(|w| {
        let metrics = w.get_metrics();
        (metrics.cpu_usage > threshold).then(|| OverloadedWorker {
            worker_id: w.id.clone(),
            cpu: metrics.cpu_usage,
        })
    })
    .collect();

You can add or remove a stage without restructuring any loop. Each step in the chain has a clear type. And for a 1,200-line initialization sequence, the same idea applies:

// Instead of 1,200 lines of sequential initialization with implicit ordering:
let server = ServerBuilder::new(env)
    .with_logging()?
    .with_metrics()?
    .with_storage()?
    .load_pipelines()?
    .with_health_check()?
    .bind_endpoints()?
    .build();
// Each method returns the next builder phase.
// Ordering is explicit in the chain — not hidden at line 847.
// ? propagates errors cleanly — no nested try/catch.

Reactive Patterns: Derived State That Can’t Go Stale

The legacy codebase had derived values that went stale because updates were manually tracked:

class Dashboard {
    private totalEvents = 0;      // must remember to update
    private avgLatency = 0;       // must remember to update
    private activeWorkers = 0;    // must remember to update

    onMetric(metric) {
        this.totalEvents++;
        // avgLatency updated... somewhere else. Maybe. If someone remembers.
    }
}

The reactive pattern (the same idea behind React, Redux, and spreadsheets) makes derived values automatic:

// Source cells (the inputs you can change)
const events = createCell<EventLog>([]);
const workers = createCell<Worker[]>([]);

// Derived formulas (automatically recompute when inputs change)
const totalEvents = formula(() => events.get().length);
const activeWorkers = formula(() => workers.get().filter(w => w.isActive()).length);
const avgLatency = formula(() => {
    const recent = events.get().slice(-1000);
    if (recent.length === 0) return 0;   // avoid 0 / 0 = NaN before any events arrive
    return recent.reduce((sum, e) => sum + e.latency, 0) / recent.length;
});

// Can NEVER be stale — recomputes automatically when inputs change
// "Forgot to update" bugs are impossible

These two primitives, ValueCell (a mutable input) and FormulaCell (a derived computation), sit behind every reactive system from spreadsheets to React.
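
For readers wondering what createCell and formula might look like underneath, here is a minimal sketch with automatic dependency tracking. It is an assumption-laden illustration, not a real library: it recomputes eagerly, only tracks formulas that read cells directly (not other formulas), and never unregisters dependents.

```typescript
type Listener = () => void;

// The formula currently being evaluated, so cells can record who reads them.
let activeFormula: Listener | null = null;

interface Cell<T> { get(): T; set(value: T): void; }
interface Formula<T> { get(): T; }

function createCell<T>(initial: T): Cell<T> {
    let value = initial;
    const dependents = new Set<Listener>();
    return {
        get() {
            // A read during formula evaluation registers that formula as a dependent.
            if (activeFormula !== null) dependents.add(activeFormula);
            return value;
        },
        set(next: T) {
            value = next;
            // Copy first: recomputing re-registers dependents while we iterate.
            for (const notify of Array.from(dependents)) notify();
        },
    };
}

function formula<T>(compute: () => T): Formula<T> {
    let cached!: T; // assigned by the first recompute() below
    const recompute: Listener = () => {
        const previous = activeFormula;
        activeFormula = recompute;
        try {
            cached = compute();
        } finally {
            activeFormula = previous;
        }
    };
    recompute(); // first run registers this formula with every cell it reads
    return { get: () => cached };
}

const workerIds = createCell<string[]>(['a']);
const workerCount = formula(() => workerIds.get().length);
const countBefore = workerCount.get();
workerIds.set(['a', 'b', 'c']);
const countAfter = workerCount.get();
```

Nothing in the calling code updates workerCount by hand; setting the cell is the only write, which is the property the Dashboard class lacked.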


VIII. The Bigger Framework: Actions, Calculations, Data

Everything covered so far fits into a simple three-way classification from Eric Normand’s book Grokking Simplicity:

Data: Inert facts. Immutable. Serializable. Safe to copy, share, store, send.

type WorkerState = { kind: 'idle' } | { kind: 'configuring', request: ClusterRequest };
type JobEvent = { kind: 'started', workerId: string, at: Date };

Calculations: Pure functions. Same input always produces the same output. No side effects. Safe to call anywhere, anytime, as many times as you want.

function deriveState(events: JobEvent[]): JobState { ... }
function validate(event: RawEvent): Result<ValidEvent, ValidationError> { ... }

Actions: Depend on when or how often they run. I/O. Time. Network. The dangerous stuff.

async function saveToDatabase(state: JobState): Promise<void> { ... }
async function sendMetrics(metrics: Metric[]): Promise<void> { ... }

The legacy system had roughly 80% Actions, 15% Mixed (calculations that accidentally touched singletons or Date.now()), and 5% pure Calculations. The target is the Functional Core, Imperative Shell pattern:

The core is pure: no I/O, no time, no randomness. It takes Data in and produces Data out. It’s trivially testable, trivially parallelizable (no shared state), and trivially composable. The shell is thin: it translates between the real world and the pure core. Every antipattern in the legacy codebase came from violating this boundary: singletons injecting Actions into Calculations, mutable state making “pure” functions depend on timing, mixed I/O making business logic untestable without the full system running.
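
A small sketch of the boundary, under simplified assumptions: the JobEvent and JobState shapes are reduced versions of the types above, and recordFinish, now, and save are hypothetical names. The core never reads a clock or performs a write; the shell does both and delegates everything else.

```typescript
// Data
type JobEvent =
    | { kind: 'started'; at: number }
    | { kind: 'finished'; at: number };
type JobState = { status: 'pending' | 'running' | 'done'; durationMs: number | null };

// Calculation (core): same events in, same state out. No clock, no I/O.
function deriveState(events: JobEvent[]): JobState {
    let startedAt: number | null = null;
    let state: JobState = { status: 'pending', durationMs: null };
    for (const event of events) {
        if (event.kind === 'started') {
            startedAt = event.at;
            state = { status: 'running', durationMs: null };
        } else if (startedAt !== null) {
            state = { status: 'done', durationMs: event.at - startedAt };
        }
    }
    return state;
}

// Action (shell): reads the clock, calls the core, performs the write.
// The clock and the save function are passed in so tests can replace them.
function recordFinish(
    events: JobEvent[],
    now: () => number,
    save: (state: JobState) => void,
): JobState {
    const state = deriveState([...events, { kind: 'finished', at: now() }]);
    save(state);
    return state;
}

const saved: JobState[] = [];
const finalState = recordFinish(
    [{ kind: 'started', at: 1000 }],
    () => 1250,
    (state) => { saved.push(state); },
);
```

deriveState can be tested with plain arrays; only recordFinish needs a fake clock and a fake store, and it contains almost no logic to get wrong.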

Consistent API Responses as Typed Envelopes

The legacy system had 125+ endpoints with inconsistent response formats:

GET /system/inputs  --> { items: IInput[] }
GET /system/outputs --> IOutput[]                    // No wrapper!
GET /jobs           --> PaginatedListResults<IJob>   // Different wrapper!

// Error formats inconsistent too:
throw new RESTError(JSON.stringify(data), code);   // JSON string as message!
throw new RESTError('Not found', 404);
throw new RESTError('Not found', 400);             // Wrong status code!

A typed response envelope makes inconsistency a compile error:

type ApiResponse<T> =
    | { ok: true, data: T, meta?: PaginationMeta }
    | { ok: false, error: ApiError };

type ApiError = {
    code: ErrorCode;       // Typed enum, not arbitrary string
    message: string;
    details?: FieldError[];
    traceId: TraceId;      // Branded — always present for debugging
};

// Both return the same shape. Always. Compiler enforces it.
function listInputs(req: Request): ApiResponse<Input[]> { ... }
function listOutputs(req: Request): ApiResponse<Output[]> { ... }
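
As a sketch of how the envelope might be used in practice, the constructors and status mapping below keep both the response shape and the HTTP status in one place. The ErrorCode values, ok, fail, STATUS, httpStatus, and the boolean parameter on listOutputs are assumptions made for a short, runnable example.

```typescript
type ErrorCode = 'NOT_FOUND' | 'INVALID_INPUT' | 'INTERNAL';
type ApiError = { code: ErrorCode; message: string; traceId: string };
type ApiResponse<T> =
    | { ok: true; data: T }
    | { ok: false; error: ApiError };

// Constructors keep every endpoint on the same two shapes.
const ok = <T>(data: T): ApiResponse<T> => ({ ok: true, data });
// ApiResponse<never> is assignable to ApiResponse<T> for any T: a failure carries no data.
const fail = (code: ErrorCode, message: string, traceId: string): ApiResponse<never> =>
    ({ ok: false, error: { code, message, traceId } });

// One mapping from error code to HTTP status, so 'Not found' cannot become a 400.
const STATUS: Record<ErrorCode, number> = { NOT_FOUND: 404, INVALID_INPUT: 400, INTERNAL: 500 };

function httpStatus<T>(response: ApiResponse<T>): number {
    return response.ok ? 200 : STATUS[response.error.code];
}

// Simplified endpoint: the flag stands in for a real lookup.
function listOutputs(configured: boolean): ApiResponse<string[]> {
    return configured
        ? ok(['s3', 'kafka'])
        : fail('NOT_FOUND', 'No outputs configured', 'trace-1');
}

const found = listOutputs(true);
const missing = listOutputs(false);
```

Adding a new ErrorCode without a STATUS entry is a compile error, which is the kind of inconsistency the legacy RESTError calls allowed.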

IX. Let the Compiler Work for You

The compiler catches bugs in seconds. Tests catch them in minutes. Staging catches them in hours. Production catches them over days of incident response, root cause analysis, and post-mortems. The math is simple. Investing time in better types eliminates entire categories of bugs that would each cost 10-100x more downstream.


X. When NOT to Use This

These patterns aren’t universally optimal.

  • Don’t use ADTs when you’re still exploring. When you don’t know yet what the valid states ARE, encoding them as sum types locks you in prematurely. Start with loose types, discover the states through testing, then lock them down.
  • Don’t use ADTs for simple CRUD with few states. A blog post with {title, body, published} doesn’t need Draft | Published | Archived. If the state space is small and obvious, a boolean is fine.
  • Don’t use full effects systems in hot paths. Effect handlers add indirection. In inner loops processing millions of events per second, direct function calls beat effect dispatch. Use effects at the boundary, direct calls in the hot path.
  • Don’t adopt effects before your team understands them. If your team has never seen algebraic effects, introducing them when new Service(deps) works fine creates confusion without proportional benefit. The approximations (Reader pattern, context variables) are a gentler on-ramp.

The adoption gradient, from easiest to hardest:

Easy (adopt today):
  Boolean pairs --> sum types            (just types, zero learning curve)
  .catch(NOOP) --> explicit handling     (mindset shift only)

Medium (team discussion needed):
  Singletons --> parameter injection     (changes constructor signatures)
  Imperative loops --> map/filter/reduce (functional style shift)

Hard (architectural decision):
  Shared state --> actors/channels       (concurrency model change)
  Mixed I/O --> functional core/shell    (structural refactor)
  Full effect systems                    (new paradigm)

Start at the top. Each level delivers value independently. You don’t need to reach the bottom to benefit.


XI. The Migration Path (Incremental, Not Big Bang)

You don’t need to rewrite your system. Here’s the step-by-step path.

  • Step 1: Boolean pairs --> sum types (minutes per instance)
// Before: four combinations, including "authenticated but not connected"
struct Connection {
    is_connected: bool,
    is_authenticated: bool,
}

// After
enum ConnectionState {
    Disconnected,
    Connected { socket: TcpStream },
    Authenticated { socket: TcpStream, token: AuthToken },
}
  • Step 2: Find every .catch(NOOP) and make a decision: Each one is a decision point: should it retry, log, propagate, or recover? At minimum, log it. Better: make it a Result so callers know.
  • Step 3: Singletons --> constructor parameters (one file at a time): Pick one singleton-using class. Pass the dependency as a constructor parameter instead of hunting for it globally. Test it with a stub.
  • Step 4: Centralize mode checks before eliminating them: Before you can replace 506 scattered mode checks, you need mode determination in ONE place:
// 4a: Create the union type
type AppMode = { kind: 'leader', ... } | { kind: 'worker', ... } | ...;

// 4b: Determine mode ONCE at startup
const mode: AppMode = determineMode(process.env);

// 4c: Pass mode to subsystems, then replace checks one at a time
  • Step 5: Shared mutable state --> channels (one boundary at a time): Identify shared mutable state accessed by multiple async operations. Introduce a channel wrapper and don’t rewrite everything at once.
  • Step 6: New features go in first (pure core, then I/O): For every new feature, write the business logic as pure functions. Push all I/O to the boundaries.
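
Step 4 can be sketched as follows. The variants and fields of AppMode, the LEADER_URL and WORKER_COUNT variables, and describeMode are hypothetical; DEPLOYMENT_MODE is the variable from the legacy example. The point is that once the mode is a union, each former if/else chain becomes a switch the compiler checks for completeness.

```typescript
type AppMode =
    | { kind: 'leader'; workerCount: number }
    | { kind: 'worker'; leaderUrl: string }
    | { kind: 'standalone' };

// Called once at startup; invalid combinations are rejected here, not 506 times later.
function determineMode(env: Record<string, string | undefined>): AppMode {
    const leaderUrl = env['LEADER_URL'];
    switch (env['DEPLOYMENT_MODE']) {
        case 'leader':
            return { kind: 'leader', workerCount: Number(env['WORKER_COUNT'] ?? '0') };
        case 'worker':
            if (leaderUrl === undefined) throw new Error('worker mode requires LEADER_URL');
            return { kind: 'worker', leaderUrl };
        default:
            return { kind: 'standalone' };
    }
}

// Reaching this at compile time means a variant was not handled.
function assertNever(value: never): never {
    throw new Error(`Unhandled mode: ${JSON.stringify(value)}`);
}

// A former `if (isLeader) ... else if (isWorker) ...` check, now exhaustive.
function describeMode(mode: AppMode): string {
    switch (mode.kind) {
        case 'leader': return `leading ${mode.workerCount} workers`;
        case 'worker': return `working for ${mode.leaderUrl}`;
        case 'standalone': return 'standalone';
        default: return assertNever(mode);
    }
}

const leaderSummary = describeMode(determineMode({ DEPLOYMENT_MODE: 'leader', WORKER_COUNT: '3' }));
const defaultSummary = describeMode(determineMode({}));
```

Adding a fourth variant to AppMode turns every switch that forgot it into a compile error at the assertNever line, which is how the scattered checks get found.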

What’s Available in Your Language Today

Language     | Sum Types                   | Exhaustiveness     | Result Type       | Pattern Matching
Rust         | enum (first-class)          | Built-in, enforced | Result<T, E> + ?  | match (exhaustive)
TypeScript   | Discriminated unions        | never check        | Custom or fp-ts   | switch + narrowing
Swift        | enum with associated values | Built-in           | Result<T, E>      | switch
Kotlin       | Sealed classes              | when exhaustive    | Result / Either   | when
Java 17+     | Sealed interfaces + records | Switch expressions | Custom or vavr    | Pattern matching (21+)
Python 3.10+ | @dataclass unions           | match (partial)    | Custom or returns | match statement
Go           | Interface + type switch     | No built-in        | (T, error) tuple  | Type assertions

Rust stands out because it was designed around these patterns: first-class ADTs, mandatory exhaustive matching, built-in Result/Option with the ? operator, ownership-based concurrency safety, and zero-cost newtypes. But you can apply these ideas in any language as the patterns are about thinking, not syntax.


XII. The Three Laws

All of this comes down to three principles:

  • If it can’t be represented, it can’t happen. Illegal states that don’t exist in the type system are bugs that don’t exist in production.
  • If it must be handled, it will be handled. When the compiler forces you to address every variant, every error, and every edge case, nothing slips through.
  • If it’s composed from tested parts, the composition is tested. Pure functions that individually work correctly compose into pipelines that work correctly. No emergent failure modes from unexpected interactions.

Conclusion: Architecture as Enforcement

The legacy system I analyzed had documentation describing its intended architecture. It had design reviews. It had coding guidelines. None of it prevented 441 silent error swallows, 64-state boolean explosions, race conditions in shared mutable state, 5 redundant metrics implementations, or a shared auth token that let any worker forge admin credentials. Documentation describes intent. Tests verify behavior at a point in time. But types enforce invariants continuously on every line of code, in every file, for every developer, for the entire lifetime of the codebase.

ADTs make impossible states unrepresentable. Algebraic effects separate mechanism from policy. Together, they transform architecture from aspiration into enforcement. The compiler doesn’t take vacations. It doesn’t forget edge cases. In a world of distributed systems, concurrent operations, and ever-growing complexity, that’s not just good engineering practice, it’s the only approach that scales.


Related Blogs

  1. From Big Ball of Mud to Functional Pipeline
  2. The Reusability Trap: When DRY Becomes a Liability

June 18, 2026

Applying Formal Verification to Guard AI-Generated Code

Filed under: Agentic AI — admin @ 4:16 pm

How automated reasoning with Dafny and TLA+ reduces review burden, catches subtle bugs, and gives you a principled way to resist the pressure to ship without thinking


The Problem Keeps Getting Worse

Over the past year I’ve written about agentic coding from several angles such as how to keep design ownership with engineers, how to use TLA+ for executable specifications, how to apply property-based and fuzz testing for microservices, and how to structure the entire delivery process through SDLC skills that force AI to operate within human-defined constraints. Each of those things helps but none of them fully solves the core problem.

The core problem: AI-generated code is probabilistic, and at scale, probability catches up with you. Based on Brooks’ Mythical Man-Month breakdown, coding itself is roughly 14% of the software delivery process. Agentic AI has largely solved that 14%. It writes clean, well-formatted, plausible code faster than any human. But plausible is not the same as correct. And when you generate 10× more code, the other 86% of your pipeline (design, specification, review, testing, deployment) doesn’t automatically scale with it. I keep watching three failure modes play out:

  • Hallucinations scale with complexity. An AI writing a 50-line function gets it right most of the time. An AI building a major feature in a large codebase with dozens of modules operates more probabilistically. It produces shallow modules instead of deep ones, duplicates logic, and makes locally correct decisions that violate global invariants. The code looks fine at the file level but the system breaks at the integration level.
  • Review becomes the bottleneck. When one engineer’s code output multiplies by 10×, review bandwidth doesn’t scale with it. I’ve watched teams respond in two ways: slow everything down to match review capacity, or cut the review process to maintain throughput. Amazon learned what cutting review does to production reliability. It’s not a lesson you want to repeat.
  • AI-generated code is harder to review than messy code. This is the counterintuitive one. Bertrand Meyer’s article AI for Software Engineering: From Probable to Provable names it precisely: clean, well-structured AI code creates a psychological safety bias. You stop reading as carefully. The concurrency bug in elegant code is harder to spot than the same bug in obviously messy code.

The Pressure to Abandon Quality

There’s a harder problem underneath all of this: the organizational pressure to treat 10× code output as “just faster developers” and to cut the review, specification, and verification processes accordingly. I’ve seen executive pressure to eliminate code review entirely, to lay off senior engineers who “just do reviews,” to skip integration testing because “the AI tested it.”

This is exactly backwards. When code output increases 10×, the need for rigorous verification increases proportionally; it does not decrease. Joe Mager’s Monte Carlo simulation of agentic coding pipelines quantifies this: at a defect rate of 1-in-40 commits with a 12-hour pipeline, you get 0.7% deployment success, essentially deadlock. He calls the safe zone the “valley of calm”: the region where defect rate × pipeline duration stays well below 1.

Formal verification is the tool that keeps you in the valley. It doesn’t slow the generative side down; the AI still generates code fast. It gates the output mathematically, so you catch invariant violations before they reach production rather than after. The practical solution is a dual-engine pipeline: a generative engine (the LLM) and a verification engine. The LLM generates fast. The verifier proves correctness. Human engineers own the specifications because you can’t outsource thinking. This post shows how to build that pipeline using a real RBAC system as the example. The companion repository is at github.com/bhatti/automated-reasoning.


From Logic AI to LLMs and Back

Modern LLMs work by predicting the next token: statistical, probabilistic pattern-matching at scale. But AI didn’t start here. The dominant AI paradigm from the 1970s through the 1990s was symbolic and logical: knowledge representation, inference engines, expert systems, formal reasoning. We went from logic to probability. Now we need both.

timeline
    title AI Paradigms and Verification Approaches
    section Logic Era (1970s–1990s)
        1972 : Prolog — logic programming and knowledge representation
        1979 : Boyer-Moore theorem prover
        1986 : Eiffel introduces Design by Contract
        1987 : TLA created by Leslie Lamport
    section Hybrid Era (2000s–2010s)
        2005 : Alloy model finder
        2007 : Z3 SMT solver (Microsoft Research)
        2009 : Dafny created (Microsoft Research)
        2014 : TLA+ used at AWS for S3 and DynamoDB
    section LLM Era (2020s)
        2022 : ChatGPT and Copilot — probabilistic code generation goes mainstream
        2024 : Agentic coding — 10× code throughput becomes normal
        2025 : Spec-driven development movement emerges
        2026 : Formal verification as AI guardrail

Understanding this history matters for a practical reason: the tools from the logic era didn’t disappear when LLMs arrived. They got faster, more automated, and better integrated into real development workflows. The question today isn’t “logic or probability?”, it’s “how do we combine them?”

Prolog and Logic Programming

Prolog (1972) represents knowledge as facts and rules, then uses unification and backtracking to answer queries. For authorization policy, you write the what, not the how:

has_role(alice, viewer).
has_role(bob, editor).
role_inherits(editor, viewer).
role_grants(viewer, read, docs).
role_grants(editor, write, docs).

can_access(User, Action, Resource) :-
    has_role(User, Role),
    role_grants(Role, Action, Resource).
can_access(User, Action, Resource) :-
    has_role(User, Role),
    role_inherits(Role, Parent),
    role_grants(Parent, Action, Resource).

?- can_access(bob, read, docs).
% Yes — bob is editor, editor inherits viewer, viewer grants read:docs.
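
For readers who don’t use Prolog, the same two rules can be sketched in Python. The set-based encoding and the `can_access` helper are illustrative assumptions, not code from the companion repository. Like the Prolog rules above, the sketch follows only one level of inheritance.

```python
# Facts, mirroring the Prolog database above.
HAS_ROLE = {("alice", "viewer"), ("bob", "editor")}
ROLE_INHERITS = {("editor", "viewer")}  # (child, parent)
ROLE_GRANTS = {("viewer", "read", "docs"), ("editor", "write", "docs")}


def can_access(user: str, action: str, resource: str) -> bool:
    """Return True if either of the two Prolog rules would succeed."""
    roles = {role for (u, role) in HAS_ROLE if u == user}
    # Rule 1: a role held directly grants the action.
    if any((role, action, resource) in ROLE_GRANTS for role in roles):
        return True
    # Rule 2: a direct parent of a held role grants the action.
    parents = {p for (child, p) in ROLE_INHERITS if child in roles}
    return any((p, action, resource) in ROLE_GRANTS for p in parents)


if __name__ == "__main__":
    print(can_access("bob", "read", "docs"))     # bob is editor, editor inherits viewer
    print(can_access("alice", "write", "docs"))  # alice is only a viewer
```

The difference from Prolog is that the search strategy is written out by hand here, whereas Prolog derives it from the rules through unification and backtracking.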

Design by Contract: Eiffel (1986)

Bertrand Meyer’s Eiffel language introduced Design by Contract (DbC): every method carries a formal contract (preconditions, postconditions, and class invariants) that the runtime checks. I’ve been a fan of this approach for a long time, because it encodes intent alongside code rather than hoping a test suite happens to cover the right cases. DbC influenced:

  • Clojure: pre/post condition maps on functions
  • Ada/SPARK: formal proof obligations on subprograms
  • Java/C++: assert statements (though almost nobody enables them in production, which defeats the point)
  • Go: convention-based precondition checks that panic or return errors
  • Dafny: compile-time verification of contracts

The key DbC insight that gets lost in most production codebases: assertions should always be enabled in production. They’re not test-time scaffolding. They’re executable specifications that catch invariant violations the moment they occur, including input combinations no test ever anticipated.
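
A minimal sketch of always-on contracts, written in Python for brevity. The `contract` decorator, `ContractError`, and the `withdraw` example are hypothetical names introduced for illustration. The checks raise explicitly instead of using `assert`, because Python strips `assert` statements when run with `-O`, which is the "disabled in production" failure mode described above.

```python
import functools


class ContractError(Exception):
    """Raised when a precondition or postcondition is violated."""


def contract(pre=None, post=None):
    """Attach pre/post checks that stay active even under `python -O`."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ContractError(f"precondition of {fn.__name__} violated")
            result = fn(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise ContractError(f"postcondition of {fn.__name__} violated")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda balance, amount: 0 < amount <= balance,
    post=lambda result, balance, amount: result == balance - amount and result >= 0,
)
def withdraw(balance: int, amount: int) -> int:
    """Hypothetical example: the caller must not overdraw the balance."""
    return balance - amount
```

Calling `withdraw(10, 50)` raises `ContractError` at the call boundary, so the violation is reported where it happens rather than surfacing later as a negative balance.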

The Verification Spectrum

Here’s how I think about the tools available, from informal to formally proven:

Approach | What it guarantees | Effort | Tools
--- | --- | --- | ---
Unit tests | Specific inputs pass | Low | JUnit, Go testing
BDD/Gherkin | Named scenarios pass | Low–Medium | Cucumber, Godog
Property-based testing | Random inputs satisfy properties | Medium | gopter, QuickCheck, Hypothesis
Fuzz testing | Mutated inputs don’t crash | Medium | go-fuzz, AFL, libFuzzer
Contract testing | API boundaries respected | Medium | Pact, api-mock-service
Static analysis | Type safety, null checks | Low | go vet, Rust compiler
Design by Contract | Pre/post/invariants checked at runtime | Medium | Eiffel, Clojure pre/post, assertions
Model checking | All reachable states are safe | High | TLA+, SPIN, Alloy
Deductive verification | Mathematical proof of correctness | High | Dafny, Lean, Coq

The progression is from probabilistic to provable. The top rows test specific cases and find bugs. The bottom rows prove properties over all possible inputs and make entire classes of bugs impossible. AI-generated code needs both sides of this spectrum. Tests give you practical coverage fast. Proofs give you guarantees that no test suite can match. But here’s the catch I keep running into: when tests are also generated by AI, they may test the wrong thing because they optimize for passing, not for correctness. Formal specifications are the antidote. They state what correct is, mathematically, so even wrongly generated tests get caught when they conflict with the spec.


Automated Reasoning: The Technical Foundation

Before diving into code, let me explain the term. Automated reasoning means using software to answer mathematical questions about other software, without running it. Three activities matter here:

  1. Control flow analysis: what execution paths can the code take?
  2. Invariant discovery: what conditions hold regardless of which path it takes?
  3. Property verification: given a specification, does the code satisfy it for all inputs?

The critical distinction from testing: testing checks that specific inputs produce expected outputs. Automated reasoning proves that a property holds for every input the program could ever receive.

SAT: Boolean Satisfiability

The foundation is SAT (Boolean Satisfiability): given a formula with boolean variables, can you assign true/false values to satisfy all constraints simultaneously?

Example: (A ∨ B) ∧ (¬A ∨ C) ∧ (¬B ∨ ¬C)
SAT solver: A=true, B=false, C=true (satisfiable)

SAT is NP-complete in theory but often fast in practice with modern CDCL (Conflict-Driven Clause Learning) solvers. Industrial solvers handle millions of variables routinely.
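
To make the example formula concrete, here is a brute-force check in Python. It is only a sketch of what "satisfiable" means: it tries all eight assignments, whereas a real CDCL solver prunes the search. The function names are illustrative.

```python
from itertools import product


def formula(a: bool, b: bool, c: bool) -> bool:
    """(A or B) and (not A or C) and (not B or not C)."""
    return (a or b) and ((not a) or c) and ((not b) or (not c))


def satisfying_assignments():
    """Enumerate all 2^3 assignments and keep the ones that satisfy the formula."""
    return [bits for bits in product([False, True], repeat=3) if formula(*bits)]


if __name__ == "__main__":
    for a, b, c in satisfying_assignments():
        print(f"A={a}, B={b}, C={c}")
```

The enumeration finds two satisfying assignments: the one shown above (A=true, B=false, C=true) and A=false, B=true, C=false. A SAT solver only needs to return one of them.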

SMT: Satisfiability Modulo Theories

SMT extends SAT with theories: first-class reasoning about integers, real numbers, arrays, bitvectors, and strings. Where SAT works with booleans, SMT works with the kinds of values programs actually use:

(declare-const x Int)
(declare-const y Int)
(assert (= (+ x y) 10))
(assert (> x 3))
(assert (> y 3))
(check-sat)
(get-model)
; --> sat; one possible model is x=4, y=6

AWS runs SMT at extraordinary scale. Their Zelkova system runs a billion SMT queries per day to analyze IAM access policies. Zelkova encodes IAM policies as logical formulas and feeds them to Z3 and CVC4. The FMCAD 2018 paper describes how policies translate to first-order logic with string theories and how incremental SMT solving makes this practical at scale.

Constraint Logic Programming (CLP)

CLP extends logic programming with constraint domains. Rather than enumerating solutions by hand, you declare variables, domains, and constraints, and the solver searches:

from ortools.sat.python import cp_model
model = cp_model.CpModel()
x = model.new_int_var(0, 10, 'x')
y = model.new_int_var(0, 10, 'y')
model.add(x + y == 10)
model.add(x > 3)
model.add(y > 3)
solver = cp_model.CpSolver()
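status = solver.solve(model)  # x=4, y=6 is one feasible answer; the solver may return another
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print(solver.value(x), solver.value(y))  # read back the values the solver chose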

Google’s CP-SAT scheduler uses this approach and outperforms integer programming for VM migration scheduling because interval variables natively model time-continuous constraints.
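
As a cross-check that needs no solver installed, the three constraints from the example above (x + y = 10, x > 3, y > 3 over the domain 0 to 10) can be enumerated with the standard library. This is plain brute force, not constraint propagation; the `solutions` helper is an illustrative name.

```python
def solutions(lo: int = 0, hi: int = 10):
    """Enumerate every (x, y) in [lo, hi]^2 with x + y == 10, x > 3, y > 3."""
    return [
        (x, y)
        for x in range(lo, hi + 1)
        for y in range(lo, hi + 1)
        if x + y == 10 and x > 3 and y > 3
    ]


if __name__ == "__main__":
    print(solutions())
```

The enumeration yields (4, 6), (5, 5), and (6, 4), so x=4, y=6 is one of three valid answers. A constraint solver reaches the same set by narrowing domains instead of visiting all 121 candidate pairs.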

The Formal Verification Landscape

Where to apply each:

  • Distributed systems with concurrency: TLA+ (exhaustive state space exploration)
  • Algorithm correctness and data structure invariants: Dafny (deductive proof)
  • Access policy analysis: SMT (Zelkova, Cedar, Z3 directly)
  • Memory safety: Rust’s type system, Verus, or Dafny ghost state

One finding from AWS’s work that surprises most people: formal verification often makes systems faster, not just safer. Their IAM authorization engine got a 50% performance improvement after verification because the process of proving correctness forced developers to eliminate redundant computation and latent bugs that happened to be performance bottlenecks. The S3 index subsystem moved from quarterly to monthly releases after applying automated reasoning. This directly addresses the organizational pushback: verification doesn’t slow you down.


Spec-Driven Development: The Movement Behind the Tools

The insight that specifications instead of code should drive AI development has gained serious traction. Projects like OpenSpec and Spec-Kit formalize this workflow. My own you-got-skills SDLC skills set encodes it: structured workflows for PRD refinement, TRD review, architecture, work breakdown, implementation, and formal QA where AI operates within human-defined constraints rather than inventing its own.

The spec-driven philosophy is: make invalid implementations unrepresentable. You can do this through types (Rust, Haskell), contracts (Eiffel, Dafny), or formal models (TLA+). When your specification is precise enough, AI hallucinations become immediately visible as verification failures rather than subtle production bugs. The verifier catches them instead of the reviewer or the on-call engineer at 2am. Formal verification shifts time from debugging production incidents to writing specifications, which makes the rest of the process faster and more predictable.


TLA+ for Concurrency

I covered TLA+ extensively in my earlier post about a year ago. Here I’ll show a targeted example specific to RBAC: the concurrent policy update problem.

The scenario: two admins simultaneously assign roles to the same principal. Without coordination, a check-then-act race violates Separation of Duty (SoD):

Admin A: check(submitter) --> no conflict, intend to assign
Admin B: check(approver)  --> no conflict, intend to assign
Admin A: assign(submitter) OK
Admin B: assign(approver)  OK  <-- SoD violated: both roles now held

The TLA+ spec models an optimistic-locking protocol and asks TLC to exhaustively check that SoD is never violated:

(* Safety: SoD never violated *)
SoDOK ==
  ~(\E r1, r2 \in assigned : r1 # r2 /\ <<r1,r2>> \in Conflicts)

(* Liveness: every pending assignment eventually completes *)
Liveness ==
  \A a \in Admins :
    [](phase[a] = "checking" => <>(phase[a] \in {"done", "idle"}))

With 2 admins and 3 roles ({Submitter, Approver, Viewer}), TLC explores 1,046 distinct states and finds no violations:

Model checking completed. No error has been found.
2349 states generated, 1046 distinct states found, 0 states left on queue.

Full spec is in tla/RBACPolicyChange.tla in the companion repo. Run it with:

java -jar ~/tla2tools.jar -config tla/RBACPolicyChange.cfg tla/RBACPolicyChange.tla
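
To get a feel for what exhaustive exploration means without installing TLC, here is a toy enumeration in Python. It is not the TLA+ spec from the repository: it assumes each admin performs exactly two steps (check, then assign), and it assumes "optimistic locking" means a version number is read at check time and a stale writer aborts instead of retrying. All names are illustrative.

```python
from itertools import permutations

CONFLICTS = {frozenset({"submitter", "approver"})}


def sod_ok(assigned):
    """True if no conflicting pair of roles is held together."""
    return not any(pair <= assigned for pair in CONFLICTS)


def run(schedule, roles, optimistic):
    """Execute one interleaving; each admin does 'check' and then 'assign'."""
    assigned, version = set(), 0
    seen = {}  # admin -> (check passed, version observed at check time)
    step = {admin: 0 for admin in roles}
    for admin in schedule:
        role = roles[admin]
        if step[admin] == 0:  # check phase
            seen[admin] = (sod_ok(assigned | {role}), version)
        else:  # assign phase
            passed, observed = seen[admin]
            stale = optimistic and observed != version
            if passed and not stale:
                assigned.add(role)
                version += 1
        step[admin] += 1
    return assigned


def violations(optimistic):
    """Return every interleaving of the four steps that ends with SoD violated."""
    roles = {"A": "submitter", "B": "approver"}
    # An admin's first occurrence in a schedule is its check, the second its assign.
    schedules = set(permutations(["A", "A", "B", "B"]))
    return [s for s in schedules if not sod_ok(run(s, roles, optimistic))]


if __name__ == "__main__":
    print("naive:", len(violations(optimistic=False)), "violating interleavings")
    print("optimistic:", len(violations(optimistic=True)), "violating interleavings")
```

Under these assumptions, the naive protocol violates SoD in every interleaving where both checks run before either assign, and the version check removes all of them. TLC does the same kind of search over the real spec, with far more states and with liveness checking that this sketch does not attempt.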

For the rest of this post I focus on Dafny, since I already covered TLA+ in depth and Dafny is where I spend most of my verification time now.


Dafny: Practical Deductive Verification

Dafny is a verification-aware programming language from Microsoft Research. It sits in the practical sweet spot: more powerful than static analyzers, far less manual effort than Coq or Lean. Dafny uses Z3 under the hood and verifies many programs automatically without manual proof steps. Importantly for this post, Dafny compiles to Go, so you write your specifications in Dafny and your implementations in Go, and the types and contracts carry through naturally.

The Amazon Dafny curriculum describes three roles Dafny plays simultaneously:

  1. Programming language: loops, classes, generics, standard data structures
  2. Proof assistant: write lemmas and Dafny proves them automatically
  3. Program verifier: attach requires/ensures to methods and Dafny proves they hold for all possible inputs

Design by Contract in Dafny

method Divide(a: int, b: int) returns (result: int)
  requires b != 0           // precondition: caller must ensure this
  ensures a == result * b + a % b   // postcondition: callee guarantees this (exact for integer division with remainder)
{
  return a / b;
}

If you call Divide(10, 0), Dafny rejects the call at verification time, before the program ever runs. That’s the shift from “testing catches bugs” to “bugs can’t be expressed.”


The Example: An RBAC System

I chose RBAC because it’s rich enough to demonstrate real verification value without being contrived. The companion project is a simplified version of my saas_rbac project. The domain model has six entities.

Why RBAC? It exhibits four bug classes that AI-generated code routinely gets wrong and each maps cleanly to a formal property:

  • Type safety: dangling references, e.g., a principal in org A assigned a role that references a resource in org B
  • Structural safety: role hierarchy must be a DAG, because cycles cause infinite loops during claim resolution
  • Security safety: policy evaluation must be sound (no phantom permissions) and complete (no missed permissions)
  • Conflict safety: Separation of Duty must hold after every role assignment, including the symmetric direction AI almost always misses

Each of these is a property I state once in Dafny and prove once rather than hoping a test suite happens to exercise the right edge cases.


Step 1: Types and the System-Wide Invariant (rbac_types.dfy)

The first thing I write isn’t any method, it’s the ValidStore predicate: the system-wide invariant that every operation must preserve. Writing it out forces you to articulate what “correct state” actually means before writing a single line of logic.

predicate ValidStore(s: RBACStore) {
  // No dangling principal references
  && (forall id :: id in s.principals ==>
        s.principals[id].orgId in s.orgs)
  // Tenant isolation: role parents must be in same org
  && (forall rid :: rid in s.roles ==>
        (forall pid :: pid in SeqToSet(s.roles[rid].parentIds) ==>
           s.roles[pid].orgId == s.roles[rid].orgId))
  // Claim resources must exist in the store
  && (forall rid :: rid in s.roles ==>
        (forall c :: c in s.roles[rid].claims ==>
           c.resourceId in s.resources))
  // ... (8 more invariants)
}

A lemma proves the empty store satisfies it and Dafny verifies this automatically with no manual proof steps:

lemma EmptyStoreIsValid()
  ensures ValidStore(EmptyStore())
{}

The value here isn’t the lemma. It’s the discipline the predicate imposes. When you have to state every invariant precisely before writing code, the class of bugs you can introduce narrows dramatically. Every subsequent method carries requires ValidStore(s) and ensures ValidStore(result) and Dafny enforces this chain automatically.
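
The same discipline can be mirrored at runtime. Below is a minimal Python sketch of the three ValidStore clauses shown above; the `Store` and `Role` shapes are simplified assumptions (claims are reduced to the resource ids they name), and the remaining invariants are not covered. The check that a parent id exists before its org is compared is an addition in this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    org_id: str
    parent_ids: list = field(default_factory=list)
    claim_resources: list = field(default_factory=list)  # resource ids named by claims


@dataclass
class Store:
    orgs: set = field(default_factory=set)
    principals: dict = field(default_factory=dict)  # principal id -> org id
    roles: dict = field(default_factory=dict)       # role id -> Role
    resources: set = field(default_factory=set)


def valid_store(s: Store) -> bool:
    """Runtime analogue of the three ValidStore clauses shown above."""
    no_dangling_principals = all(org in s.orgs for org in s.principals.values())
    parents_same_org = all(
        pid in s.roles and s.roles[pid].org_id == role.org_id  # existence check is an assumption
        for role in s.roles.values()
        for pid in role.parent_ids
    )
    claims_resolve = all(
        res in s.resources
        for role in s.roles.values()
        for res in role.claim_resources
    )
    return no_dangling_principals and parents_same_org and claims_resolve
```

A runtime predicate like this can only report a broken store after the fact. The Dafny version rules the broken store out before any code runs, which is the point of writing it as a verified invariant.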


Step 2: Policy Evaluation (rbac_policy.dfy)

The two most critical properties of any authorization system:

SOUNDNESS:    If Evaluate returns Allow, a valid claim chain EXISTS.
              No phantom permissions. No false positives.

COMPLETENESS: If a valid claim chain exists, Evaluate returns Allow.
              No missed permissions. No false negatives.

I write the ground-truth specification as a pure, non-executable predicate, then verify that the executable method matches it exactly:

// The specification — states what "correct" means mathematically
predicate PolicySpec(req: Request, store: RBACStore, ctx: EvalContext) {
  var principal := store.principals[req.principalId];
  exists c :: c in PrincipalClaims(principal, store.roles) &&
              ClaimGrants(c, req.action, req.resourceId, ctx)
}

// The implementation — Dafny proves it matches PolicySpec for all inputs
method Evaluate(req: Request, store: RBACStore, ctx: EvalContext)
    returns (decision: Decision)
  requires ValidStore(store)
  requires req.principalId in store.principals
  requires store.principals[req.principalId].orgId == req.orgId
  ensures decision == Allow ==> PolicySpec(req, store, ctx)    // SOUNDNESS
  ensures decision == Deny  ==> !PolicySpec(req, store, ctx)   // COMPLETENESS

Dafny verifies the loop implementation with a loop invariant that tracks “no match found in claims[0..i]”:

while i < |claimSeq|
  invariant decision == Deny ==>
    forall j :: 0 <= j < i ==>
      !ClaimGrants(claimSeq[j], req.action, req.resourceId, ctx)
  decreases |claimSeq| - i
{
  if ClaimGrants(claimSeq[i], req.action, req.resourceId, ctx) {
    decision := Allow;
    return;
  }
  i := i + 1;
}

The decreases clause proves termination: Dafny guarantees no infinite loops, for any input. When AI generates the implementation and introduces a subtle loop-condition bug, Dafny catches it immediately rather than during a production incident. A bonus lemma proves monotonicity, meaning that adding claims can never turn an Allow into a Deny:

lemma MoreClaimsMonotonic(req, store1, store2, ctx)
  requires store1 has subset of claims of store2
  ensures PolicySpec(req, store1, ctx) ==> PolicySpec(req, store2, ctx)
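
The relationship between the specification and the loop can be illustrated with a bounded exhaustive check in Python. This is a sketch under simplifying assumptions: a claim is just an (action, resource) pair with no constraints, and the check covers only claim lists of length 0 to 3 over a tiny domain. It is a test, not a proof. The names are illustrative.

```python
from itertools import product

ACTIONS = ["read", "write"]
RESOURCES = ["docs", "billing"]
ALL_CLAIMS = list(product(ACTIONS, RESOURCES))  # a claim is (action, resource)


def policy_spec(claims, action, resource):
    """Specification: some claim grants the request."""
    return any(c == (action, resource) for c in claims)


def evaluate(claims, action, resource):
    """Implementation: linear scan with early exit, shaped like the Dafny loop."""
    i = 0
    while i < len(claims):
        if claims[i] == (action, resource):
            return "Allow"
        i += 1
    return "Deny"


def check_all():
    """Compare spec and implementation on every claim list of length 0..3."""
    checked = 0
    for n in range(4):
        for claims in product(ALL_CLAIMS, repeat=n):
            for action, resource in ALL_CLAIMS:
                allowed = evaluate(list(claims), action, resource) == "Allow"
                if allowed != policy_spec(claims, action, resource):
                    raise ValueError(f"mismatch for {claims} / {action}:{resource}")
                checked += 1
    return checked


if __name__ == "__main__":
    print(check_all(), "cases agree")
```

The bounded check covers 340 cases and stops there. The Dafny loop invariant covers every claim sequence of every length, which is the gap between the testing rows and the proof rows of the verification spectrum.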

Step 3: Role Hierarchy with No Cycles (rbac_role_hierarchy.dfy)

AddParent proves that cycles can never be introduced, regardless of what sequence of operations an API caller attempts:

method AddParent(child: RoleId, parent: RoleId, roles: map<RoleId, Role>)
    returns (result: map<RoleId, Role>, ok: bool)
  requires NoCycles(roles)
  ensures ok  ==> NoCycles(result)    // DAG invariant always preserved
  ensures !ok ==> result == roles     // rejection leaves the store unchanged
{
  var wouldCycle := child in Ancestors(parent, roles, |roles|);
  if wouldCycle { return roles, false; }
  // safe to add the parent edge
}

Ancestors computes the full ancestor set with bounded recursion: a fuel of |roles| is sufficient for any valid DAG, because no acyclic path can visit more nodes than there are roles. This is a property that’s easy to state but extremely hard to test exhaustively: you’d have to enumerate all possible role graph topologies. Dafny proves it once, for all possible graphs.
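
A minimal Python sketch of the fuel idea, assuming the graph is a dict that maps every role id to its list of parent ids (so `len(parents)` plays the part of |roles|). The explicit self-loop check and the helper names are assumptions of this sketch, not taken from the Dafny source.

```python
def ancestors(role, parents, fuel):
    """Ancestor set of `role`, following at most `fuel` levels of parent edges."""
    if fuel == 0:
        return set()
    result = set()
    for p in parents.get(role, []):
        result.add(p)
        result |= ancestors(p, parents, fuel - 1)
    return result


def add_parent(child, parent, parents):
    """Return (graph, ok); reject any edge that would close a cycle."""
    if child == parent or child in ancestors(parent, parents, len(parents)):
        return parents, False  # rejection leaves the graph unchanged
    updated = {r: list(ps) for r, ps in parents.items()}
    updated.setdefault(child, []).append(parent)
    updated.setdefault(parent, [])  # keep every role present as a key
    return updated, True


if __name__ == "__main__":
    graph = {"viewer": [], "editor": ["viewer"], "admin": ["editor"]}
    print(add_parent("viewer", "admin", graph)[1])  # would close a cycle
    print(add_parent("admin", "viewer", graph)[1])  # adds a shortcut edge, still a DAG
```

The fuel parameter exists so that the recursion visibly terminates: each call strictly decreases it, which is the same argument a `decreases` clause makes to the verifier.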


Step 4: Separation of Duty (rbac_separation_of_duty.dfy)

SoD says certain role pairs must never be co-assigned: you can’t be both the invoice submitter and the invoice approver. The subtle bug AI code routinely misses is the symmetric case: checking (existing, new) but not (new, existing). This is exactly the kind of one-sided semantic error that looks correct on inspection and only surfaces in edge-case inputs.

predicate SoDSatisfied(assignedRoles: set<RoleId>, conflicts: ConflictSet) {
  forall a, b ::
    a in assignedRoles && b in assignedRoles && a != b ==>
      (a, b) !in conflicts
}

method AssignRole(principal, newRole, conflicts)
  requires SoDSatisfied(SeqToSet(principal.roleIds), conflicts)
  ensures ok  ==> SoDSatisfied(SeqToSet(updated.roleIds), conflicts)
  ensures !ok ==> exists existing ::
    existing in SeqToSet(principal.roleIds) &&
    ((existing, newRole) in conflicts ||
     (newRole, existing) in conflicts)   // proof witness for why it was rejected, in either direction

Here’s what Dafny outputs when an AI generates the broken version that only checks one direction:

rbac_separation_of_duty.dfy(42,4): Error: a postcondition could not be proved
  ensures ok ==> SoDSatisfied(SeqToSet(updated.roleIds), conflicts)

Counterexample:
  principal.roleIds = ["approver"]
  newRole = "submitter"
  conflicts = {("submitter", "approver")}  <-- (new, existing) direction missed

That counterexample pinpoints a concrete input that violates the contract. Without Dafny, catching this requires a carefully targeted test case; otherwise it shows up in production when someone discovers SoD can be bypassed by listing conflict pairs in reverse order.
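
The counterexample can be replayed in a few lines of Python. This sketch assumes conflicts are stored as ordered pairs, as in the counterexample, and the function names are illustrative.

```python
def sod_satisfied(held, conflicts):
    """Mirror of SoDSatisfied: no ordered pair of distinct held roles conflicts."""
    return all((a, b) not in conflicts for a in held for b in held if a != b)


def assign_one_direction(held, new_role, conflicts):
    """Buggy variant: only looks up (existing, new)."""
    if any((existing, new_role) in conflicts for existing in held):
        return held, False
    return held | {new_role}, True


def assign_symmetric(held, new_role, conflicts):
    """Checks both (existing, new) and (new, existing)."""
    for existing in held:
        if (existing, new_role) in conflicts or (new_role, existing) in conflicts:
            return held, False
    return held | {new_role}, True


if __name__ == "__main__":
    held = {"approver"}
    conflicts = {("submitter", "approver")}
    buggy, _ = assign_one_direction(held, "submitter", conflicts)
    fixed, _ = assign_symmetric(held, "submitter", conflicts)
    print(sod_satisfied(buggy, conflicts), sod_satisfied(fixed, conflicts))
```

The one-direction variant accepts the assignment and leaves the principal holding both roles. The symmetric variant rejects it and returns the original role set unchanged.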


Step 5: Constraint Monotonicity (rbac_constraints.dfy)

Constraints such as time windows, geo fences, and usage quotas make RBAC dynamic. The key properties to prove are:

// Adding constraints can only reduce access, never increase it
lemma AddingConstraintReducesAccess(base, extra, ctx)
  ensures AllConstraintsHold(base + [extra], ctx) ==>
          AllConstraintsHold(base, ctx)

// Empty constraint list always passes (vacuous truth — no constraints = no restrictions)
lemma EmptyConstraintsAlwaysHold(ctx)
  ensures AllConstraintsHold([], ctx)
{}

// Higher usage makes quota constraints harder to satisfy
lemma HigherUsageHarder(limit, usage1, usage2, ctx)
  requires usage1 <= usage2
  ensures ConstraintHolds(MaxUsage(limit), ctx[usage:=usage2]) ==>
          ConstraintHolds(MaxUsage(limit), ctx[usage:=usage1])

These seem obvious. They are exactly the properties that break when AI generates constraint evaluation with subtle off-by-one errors or a flipped inequality direction (> instead of <). Proving them once means an implementation error in the Go translation surfaces as a failed test rather than as an access-control bypass in production.
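
A small Python sketch shows which mistakes the quota monotonicity property can and cannot detect. It assumes the correct check is `usage < limit` and compares it with two hypothetical buggy variants on a small grid. The names are illustrative.

```python
def quota_ok(limit, usage):
    """Correct check: usage must stay strictly below the limit."""
    return usage < limit


def quota_flipped(limit, usage):
    """Buggy variant with the inequality pointing the wrong way."""
    return usage > limit


def quota_off_by_one(limit, usage):
    """Buggy variant that admits one extra use."""
    return usage <= limit


def monotone(check, max_value=20):
    """HigherUsageHarder on a grid: passing at `hi` implies passing at every lo <= hi."""
    return all(
        check(limit, lo)
        for limit in range(1, max_value)
        for hi in range(max_value)
        if check(limit, hi)
        for lo in range(hi + 1)
    )


if __name__ == "__main__":
    for check in (quota_ok, quota_flipped, quota_off_by_one):
        print(check.__name__, monotone(check))
```

The flipped inequality fails the monotonicity property immediately. The `<=` variant is still monotone, so it passes this property and has to be caught by a boundary case such as usage equal to the limit. That is one reason to pair each lemma with targeted boundary tests in the translated code.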


The Go Implementation: Verified by Construction

The Go implementation translates the Dafny specifications directly. Every design decision traces back to a proved property.

Types Mirror Dafny Datatypes

// go/pkg/types/types.go

type Claim struct {
    ID          ClaimID
    Action      string
    ResourceID  ResID
    Constraints []Constraint   // empty = always passes (vacuous truth, proved by EmptyConstraintsAlwaysHold)
}

// NewTimeWindow enforces the Dafny precondition ValidConstraint at construction time
func NewTimeWindow(start, end int) (Constraint, error) {
    if start >= end || end > 24 {
        return Constraint{}, fmt.Errorf("invalid time window: start < end <= 24 required")
    }
    return Constraint{Kind: TimeWindowKind, StartHour: start, EndHour: end}, nil
}

// NewGeoFence — Dafny requires |regions| > 0
func NewGeoFence(regions []string) (Constraint, error) {
    if len(regions) == 0 {
        return Constraint{}, fmt.Errorf("geo fence requires at least one region")
    }
    return Constraint{Kind: GeoFenceKind, Regions: regions}, nil
}

// NewMaxUsage — Dafny requires limit > 0
func NewMaxUsage(limit int) (Constraint, error) {
    if limit <= 0 {
        return Constraint{}, fmt.Errorf("max usage limit must be positive")
    }
    return Constraint{Kind: MaxUsageKind, MaxCount: limit}, nil
}

Constraint Evaluation Maps Directly to Dafny

// go/pkg/constraints/constraints.go

// Holds mirrors Dafny's ConstraintHolds predicate exactly.
// Every case corresponds to a branch in the Dafny match expression.
func Holds(c types.Constraint, ctx types.EvalContext) bool {
    switch c.Kind {
    case types.TimeWindowKind:
        // Dafny: ctx.currentHour >= c.startHour && ctx.currentHour < c.endHour
        return ctx.CurrentHour >= c.StartHour && ctx.CurrentHour < c.EndHour
    case types.GeoFenceKind:
        // Dafny: ctx.currentRegion in c.allowedRegions
        for _, r := range c.Regions {
            if r == ctx.CurrentRegion {
                return true
            }
        }
        return false
    case types.MaxUsageKind:
        // Dafny: ctx.currentUsage < c.maxCount
        return ctx.CurrentUsage < c.MaxCount
    default:
        return false
    }
}

// AllHold evaluates a conjunction of constraints.
// Dafny proved: AllConstraintsHold([], ctx) == true (EmptyConstraintsAlwaysHold)
// Dafny proved: AllConstraintsHold(base + [extra], ctx) ==> AllConstraintsHold(base, ctx)
func AllHold(cs []types.Constraint, ctx types.EvalContext) bool {
    for _, c := range cs {
        if !Holds(c, ctx) {
            return false
        }
    }
    return true
}

Role Hierarchy: BFS with Proven Cycle Detection

// go/pkg/hierarchy/hierarchy.go

// HasCycle returns true if adding parent to child would create a cycle.
// Mirrors Dafny: child in Ancestors(parent, roles, |roles|)
func (r *Resolver) HasCycle(child, parent types.RoleID) bool {
    visited := map[types.RoleID]bool{}
    queue := []types.RoleID{parent}
    for len(queue) > 0 {
        current := queue[0]
        queue = queue[1:]
        if current == child {
            return true
        }
        if visited[current] {
            continue
        }
        visited[current] = true
        if role, ok := r.roles[current]; ok {
            queue = append(queue, role.ParentIDs...)
        }
    }
    return false
}

// AddParent adds a parent role with cycle guard.
// Mirrors Dafny: requires NoCycles, ensures NoCycles preserved or store unchanged.
func (r *Resolver) AddParent(child, parent types.RoleID) error {
    if r.HasCycle(child, parent) {
        return fmt.Errorf("adding parent %s to %s would create a cycle", parent, child)
    }
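    role, ok := r.roles[child]
    if !ok {
        // Without this guard a missing child would be silently created as a zero-valued role.
        return fmt.Errorf("child role %s not found", child)
    }
    role.ParentIDs = append(role.ParentIDs, parent)
    r.roles[child] = role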
    return nil
}

// TransitiveClaims collects all claims reachable through the role hierarchy.
// BFS bounded by number of roles — same as Dafny's fuel parameter.
func TransitiveClaims(roleID types.RoleID, roles map[types.RoleID]types.Role) []types.Claim {
    var claims []types.Claim
    visited := map[types.RoleID]bool{}
    queue := []types.RoleID{roleID}
    for len(queue) > 0 {
        current := queue[0]
        queue = queue[1:]
        if visited[current] {
            continue
        }
        visited[current] = true
        role, ok := roles[current]
        if !ok {
            continue
        }
        claims = append(claims, role.Claims...)
        queue = append(queue, role.ParentIDs...)
    }
    return claims
}

Store: Invariant Enforcement at Every Write

// go/pkg/store/store.go

// AssignRole mirrors Dafny AssignRole:
//   requires SoDSatisfied(current roles, conflicts)
//   ensures  SoDSatisfied(updated roles, conflicts) OR rejection with witness
func (s *Store) AssignRole(principalID types.PrinID, roleID types.RoleID) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    principal, ok := s.principals[principalID]
    if !ok {
        return fmt.Errorf("principal %s not found", principalID)
    }
    role, ok := s.roles[roleID]
    if !ok {
        return fmt.Errorf("role %s not found", roleID)
    }
    // Tenant isolation — from Dafny ValidStore predicate
    if role.OrgID != principal.OrgID {
        return fmt.Errorf("tenant isolation: role org %s != principal org %s",
            role.OrgID, principal.OrgID)
    }
    // SoD conflict check — checks BOTH directions, per Dafny SoDSatisfied predicate
    for _, existingRoleID := range principal.RoleIDs {
        if s.hasConflict(existingRoleID, roleID) {
            return fmt.Errorf("separation of duty: role %q conflicts with existing role %q",
                roleID, existingRoleID)
        }
    }
    // Safe to assign — SoD preserved (Dafny ensures clause holds)
    principal.RoleIDs = append(principal.RoleIDs, roleID)
    s.principals[principalID] = principal
    return nil
}

Policy Engine Encodes the Soundness/Completeness Contract

// go/pkg/policy/policy.go

// Evaluate decides Allow or Deny for a request.
// Preconditions from Dafny requires clauses: request fields valid, principal exists, tenant matches.
// Postconditions from Dafny ensures clauses: Allow iff valid claim chain exists.
func (e *Engine) Evaluate(req types.Request, ctx types.EvalContext) (types.Decision, error) {
    if err := req.Validate(); err != nil {
        return types.Deny, fmt.Errorf("invalid request: %w", err)
    }
    principal, ok := e.store.GetPrincipal(req.PrincipalID)
    if !ok {
        return types.Deny, nil  // deny-by-default — proved by DenyByDefault lemma
    }
    if principal.OrgID != req.OrgID {
        return types.Deny, fmt.Errorf("tenant isolation violated")
    }
    // Walk transitive claims — same BFS algorithm as Dafny Evaluate method
    claims := hierarchy.PrincipalClaims(principal, e.store.AllRoles())
    for _, c := range claims {
        if constraints.ClaimGrants(c, req.Action, req.ResourceID, ctx) {
            return types.Allow, nil
        }
    }
    return types.Deny, nil
}

Property-Based Tests: The Bridge from Provable to Probable

Each Dafny lemma gets a matching gopter property-based test. Dafny proves for all possible inputs; gopter fires hundreds of random inputs and catches bugs in the Go translation where the Dafny spec is correct but the Go implementation diverges.

// go/pkg/policy/policy_test.go

// Mirrors: DenyByDefault lemma in rbac_invariants.dfy
func TestProp_NoRolesAlwaysDenied(t *testing.T) {
    props := gopter.NewProperties(gopter.DefaultTestParameters())
    props.Property("principal with no roles is always denied", prop.ForAll(
        func(action, resource string) bool {
            s := store.New()
            _ = s.AddOrg(types.Organization{ID: "org", Name: "org"})
            _ = s.AddResource(types.Resource{ID: resource, Name: resource, Kind: "api"})
            _ = s.AddPrincipal(types.Principal{ID: "p", OrgID: "org", Name: "P"})
            engine := policy.New(s)
            req := types.Request{OrgID: "org", PrincipalID: "p",
                Action: action, ResourceID: resource}
            decision, _ := engine.Evaluate(req, types.EvalContext{CurrentHour: 12})
            return decision == types.Deny
        },
        gen.AlphaString(), gen.AlphaString(),
    ))
    props.TestingRun(t, gopter.NewFormatedReporter(false, 80, os.Stdout))
}

// Mirrors: AddingConstraintReducesAccess lemma in rbac_constraints.dfy
func TestProp_ConstraintMonotonicity(t *testing.T) {
    props := gopter.NewProperties(gopter.DefaultTestParameters())
    props.Property("extended constraint list passing implies base list passes", prop.ForAll(
        func(hour int, region string, usage int) bool {
            ctx := types.EvalContext{
                CurrentHour:   abs(hour) % 24,
                CurrentRegion: region,
                CurrentUsage:  abs(usage) % 100,
            }
            tw, _ := types.NewTimeWindow(9, 17)
            base := []types.Constraint{tw}
            geo, _ := types.NewGeoFence([]string{"us-east-1"})
            extended := append(base, geo)
            // If the extended (stricter) set passes, the base set MUST also pass
            if constraints.AllHold(extended, ctx) {
                return constraints.AllHold(base, ctx)
            }
            return true
        },
        gen.Int(), gen.AnyString(), gen.Int(),
    ))
    props.TestingRun(t, gopter.NewFormatedReporter(false, 80, os.Stdout))
}

// Mirrors: HigherUsageHarder lemma in rbac_constraints.dfy
func TestProp_QuotaMonotonicity(t *testing.T) {
    props := gopter.NewProperties(gopter.DefaultTestParameters())
    props.Property("higher usage never turns Deny into Allow for quota", prop.ForAll(
        func(limit int, lo int, delta int) bool {
            limit = abs(limit)%100 + 1
            lo = abs(lo) % 200
            hi := lo + abs(delta)%100   // hi >= lo guaranteed

            quota, _ := types.NewMaxUsage(limit)
            ctxLo := types.EvalContext{CurrentUsage: lo}
            ctxHi := types.EvalContext{CurrentUsage: hi}

            // If quota passes at HIGHER usage, it MUST pass at lower usage
            if constraints.Holds(quota, ctxHi) {
                return constraints.Holds(quota, ctxLo)
            }
            return true
        },
        gen.Int(), gen.Int(), gen.Int(),
    ))
    props.TestingRun(t, gopter.NewFormatedReporter(false, 80, os.Stdout))
}

// Mirrors: OwnClaimsIncluded lemma in rbac_role_hierarchy.dfy
func TestProp_OwnClaimsAlwaysIncluded(t *testing.T) {
    props := gopter.NewProperties(gopter.DefaultTestParameters())
    props.Property("role's own claims always appear in transitive claims", prop.ForAll(
        func(claimCount int) bool {
            claimCount = abs(claimCount)%5 + 1
            var claims []types.Claim
            for i := 0; i < claimCount; i++ {
                claims = append(claims, types.Claim{
                    ID:         types.ClaimID(fmt.Sprintf("c%d", i)),
                    Action:     fmt.Sprintf("action%d", i),
                    ResourceID: "res1",
                })
            }
            roles := map[types.RoleID]types.Role{
                "role1": {ID: "role1", OrgID: "org1", Claims: claims},
            }
            transitive := hierarchy.TransitiveClaims("role1", roles)
            for _, c := range claims {
                found := false
                for _, tc := range transitive {
                    if tc.ID == c.ID {
                        found = true
                        break
                    }
                }
                if !found {
                    return false
                }
            }
            return true
        },
        gen.Int(),
    ))
    props.TestingRun(t, gopter.NewFormatedReporter(false, 80, os.Stdout))
}

Running the full suite:

+ empty constraint list always passes: OK, passed 200 tests.
+ hour in [start,end) passes, hour outside fails: OK, passed 500 tests.
+ subset of constraints passing implies prefix passes: OK, passed 300 tests.
+ higher usage never turns Deny into Allow for quota: OK, passed 300 tests.
+ role's own claims always appear in transitive claims: OK, passed 200 tests.
+ adding a parent never reduces transitive claims: OK, passed 200 tests.
+ unknown principal is always denied: OK, passed 200 tests.
+ principal with no roles is always denied: OK, passed 200 tests.
PASS — 2300+ property-based test cases executed.

Each line corresponds to a Dafny lemma. The property tests don’t replace the proofs; they catch bugs in the Go translation that the Dafny verifier can’t see.
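
To make the mechanics concrete, here is a minimal hand-rolled sketch of what a property-based test does, written in Python with only the standard library. It is not gopter and not the companion repo's code: `holds_max_usage`, `quota_is_monotone`, and `check_property` are hypothetical names, and the quota rule (usage strictly below the limit) is assumed from the MaxUsage case of the Dafny spec.

```python
import random

def holds_max_usage(limit: int, usage: int) -> bool:
    """Quota rule assumed from the Dafny spec: usage must stay below the limit."""
    return usage < limit

def quota_is_monotone(limit: int, lo: int, hi: int) -> bool:
    """If the quota passes at the higher usage, it must pass at the lower one."""
    if holds_max_usage(limit, hi):
        return holds_max_usage(limit, lo)
    return True  # vacuously true when the premise fails

def check_property(prop, runs: int = 300, seed: int = 0):
    """Run `prop` on random inputs; return the first counterexample or None."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(runs):
        limit = rng.randint(1, 100)
        lo = rng.randint(0, 199)
        hi = lo + rng.randint(0, 99)  # hi >= lo by construction
        if not prop(limit, lo, hi):
            return (limit, lo, hi)
    return None

counterexample = check_property(quota_is_monotone)
print("counterexample:", counterexample)  # counterexample: None
```

A real library adds input shrinking and better generators on top of this loop, but the core idea is the same: random inputs, one boolean property, and a reported counterexample when it fails.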


Adding a Feature the Verified Way

Let me walk through adding a RateLimit constraint from scratch. This is the exact workflow for extending a formally verified system, and it shows why the upfront cost is much lower than it looks.

Step 1: Add the datatype in Dafny

datatype ConstraintKind =
    | TimeWindow(startHour: nat, endHour: nat)
    | GeoFence(allowedRegions: seq<string>)
    | MaxUsage(maxCount: nat)
    | RateLimit(requestsPerMinute: nat)  // NEW

predicate ValidConstraint(c: ConstraintKind) {
    match c
    case TimeWindow(s, e) => 0 <= s < e <= 24
    case GeoFence(regions) => |regions| > 0
    case MaxUsage(max) => max > 0
    case RateLimit(rpm) => rpm > 0   // must be positive
}

Step 2: Define evaluation semantics

predicate ConstraintHolds(c: ConstraintKind, ctx: EvalContext) {
    match c
    case TimeWindow(s, e) => s <= ctx.currentHour < e
    case GeoFence(regions) => ctx.currentRegion in regions
    case MaxUsage(max) => ctx.currentUsage < max
    case RateLimit(rpm) => ctx.currentRequestRate < rpm   // NEW
}

Step 3: Write and prove a monotonicity lemma

// Lower rate limit is harder to satisfy — proved automatically
lemma LowerRateLimitHarder(rpm1: nat, rpm2: nat, ctx: EvalContext)
    requires rpm1 <= rpm2
    requires rpm1 > 0 && rpm2 > 0
    ensures ConstraintHolds(RateLimit(rpm1), ctx) ==>
            ConstraintHolds(RateLimit(rpm2), ctx)
{
    // Dafny proves this in under a second: if rate < rpm1 <= rpm2 then rate < rpm2
}

Step 4: Verify

$ dafny verify dafny/rbac_constraints.dfy
Dafny program verifier finished with 12 verified, 0 errors

Step 5: Implement in Go

func NewRateLimit(rpm int) (Constraint, error) {
    if rpm <= 0 {
        return Constraint{}, fmt.Errorf("rate limit must be positive")
    }
    return Constraint{Kind: RateLimitKind, MaxCount: rpm}, nil
}

// Add to Holds() in constraints.go:
case types.RateLimitKind:
    return ctx.CurrentRequestRate < c.MaxCount

Step 6: Write the matching property test

func TestProp_LowerRateLimitHarder(t *testing.T) {
    props := gopter.NewProperties(gopter.DefaultTestParameters())
    props.Property("lower rate limit is harder to satisfy", prop.ForAll(
        func(rpm1, rpm2, rate int) bool {
            rpm1 = abs(rpm1)%100 + 1
            rpm2 = rpm1 + abs(rpm2)%100   // rpm2 >= rpm1 guaranteed
            ctx := types.EvalContext{CurrentRequestRate: abs(rate) % 200}
            c1, _ := types.NewRateLimit(rpm1)
            c2, _ := types.NewRateLimit(rpm2)
            if constraints.Holds(c1, ctx) {
                return constraints.Holds(c2, ctx)
            }
            return true
        },
        gen.Int(), gen.Int(), gen.Int(),
    ))
    props.TestingRun(t, gopter.NewFormatedReporter(false, 80, os.Stdout))
}

That full loop (datatype –> predicate –> lemma –> verify –> implement –> test) takes maybe 20 minutes for a new constraint type. The result ships with a machine-checked proof of the specification and a property test that checks the Go implementation against it.
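
Before handing a new lemma to the verifier, it can help to enumerate a small bounded domain and look for counterexamples. The sketch below does that for the rate-limit monotonicity statement in plain Python. The function names are hypothetical, and the rule `current_rate < rpm` is copied from the RateLimit case of ConstraintHolds above.

```python
from itertools import product

def rate_limit_holds(rpm: int, current_rate: int) -> bool:
    """Mirrors the RateLimit case of ConstraintHolds: the rate must be below rpm."""
    return current_rate < rpm

def find_counterexample(bound: int = 30):
    """Search every (rpm1, rpm2, rate) in the bounded domain for a violation."""
    for rpm1, rpm2, rate in product(range(1, bound), range(1, bound), range(bound)):
        if rpm1 > rpm2:
            continue  # lemma precondition: rpm1 <= rpm2
        if rate_limit_holds(rpm1, rate) and not rate_limit_holds(rpm2, rate):
            return (rpm1, rpm2, rate)
    return None

print(find_counterexample())  # None
```

A bounded search is only evidence for the small domain it covers. The Dafny proof is what extends the claim to every natural number.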


The Complete Pipeline

The workflow in the companion repo ties everything together. Running the full pipeline:

# Step 1: Verify formal specs
$ make verify-dafny
[DAFNY] Verifying rbac_types.dfy...              ✓ PASS
[DAFNY] Verifying rbac_policy.dfy...             ✓ PASS
[DAFNY] Verifying rbac_role_hierarchy.dfy...     ✓ PASS
[DAFNY] Verifying rbac_separation_of_duty.dfy... ✓ PASS
[DAFNY] Verifying rbac_constraints.dfy...        ✓ PASS
[DAFNY] Verifying rbac_invariants.dfy...         ✓ PASS
All 6 Dafny files verified successfully.

# Step 2: Model check concurrent protocol
$ make check-tla
[TLA+] Model checking RBACPolicyChange...
Model checking completed. No error has been found.
  2349 states generated, 1046 distinct states found.

# Step 3: Run property-based tests
$ make test
+ empty constraint list always passes: OK, passed 200 tests.
+ hour in [start,end) passes, hour outside fails: OK, passed 500 tests.
+ subset of constraints passing implies prefix passes: OK, passed 300 tests.
+ higher usage never turns Deny into Allow for quota: OK, passed 300 tests.
+ role's own claims always appear in transitive claims: OK, passed 200 tests.
+ adding a parent never reduces transitive claims: OK, passed 200 tests.
+ unknown principal is always denied: OK, passed 200 tests.
+ principal with no roles is always denied: OK, passed 200 tests.
PASS — 2300+ property-based test cases executed.

# Step 4: Smoke-test the API
$ make run &
Server starting on :9090...

# alice can read docs (viewer role, time window 9-17, currently hour 10)
$ curl -s localhost:9090/evaluate -d '{
  "org_id":"acme", "principal_id":"alice",
  "action":"read", "resource_id":"docs",
  "hour":10, "region":"us-east-1", "usage":0
}' | jq .decision
"Allow"

# alice denied at 10pm — time constraint blocks access outside 9-17
$ curl -s localhost:9090/evaluate -d '{
  "org_id":"acme", "principal_id":"alice",
  "action":"read", "resource_id":"reports",
  "hour":22, "region":"us-east-1", "usage":0
}' | jq .decision
"Deny"

# bob can write docs from US (editor role + geo constraint us-east-1)
$ curl -s localhost:9090/evaluate -d '{
  "org_id":"acme", "principal_id":"bob",
  "action":"write", "resource_id":"docs",
  "hour":12, "region":"us-east-1", "usage":0
}' | jq .decision
"Allow"

# bob denied from EU — geo constraint blocks non-US regions
$ curl -s localhost:9090/evaluate -d '{
  "org_id":"acme", "principal_id":"bob",
  "action":"write", "resource_id":"docs",
  "hour":12, "region":"eu-west-1", "usage":0
}' | jq .decision
"Deny"

# SoD blocks carol (finance role) from also being submitter
$ curl -s localhost:9090/principals/carol/roles -d '{"role_id":"submitter"}'
{"error":"separation of duty: role \"submitter\" conflicts with existing role \"finance\""}

Who Writes the Specs? The Human-AI Division

This is the question I get asked most. Here’s how I answer it.

Humans must own:

  • What the invariants are, e.g., SoD, tenant isolation, deny-by-default, referential integrity
  • What the formal properties mean in the problem domain
  • Reviewing counterexamples from the verifier and refining specs accordingly
  • Architecture decisions: which tool for which problem, which 20% of the codebase to verify

AI can assist:

  • Dafny syntax: LLMs can generate valid Dafny from English property descriptions, much as the TLA+ for the LLM Era article demonstrates for TLA+
  • Boilerplate Go translated from Dafny type definitions
  • Test scaffolding for gopter properties
  • Translating Dafny lemmas into property test outlines

The feedback loop in practice:

  1. Human writes ValidStore capturing tenant isolation
  2. AI generates the AddPrincipal method in Go
  3. Dafny verifies the spec — or produces a counterexample
  4. If counterexample: human understands the bug (usually a missed invariant direction), refines the spec
  5. AI regenerates from the refined spec
  6. Repeat until proof succeeds

Marc Brooker’s analysis of what AI agents find easy versus hard makes the point precisely: agents succeed on tasks with good automated feedback and struggle on tasks without it. Formal verification is exactly that kind of feedback: mathematical, precise, and automatable. It turns the review loop into a tight iteration between human specifier and verifier, rather than a bottleneck where an engineer reads 1,000 lines of plausible AI code hoping to spot a subtle invariant violation. This directly counters the “cut review to go faster” argument: you don’t cut review; you replace line-by-line code review with specification review, which is faster, higher leverage, and catches the bugs that matter.


When to Apply Formal Verification

Not every line of code needs formal verification. Here’s how I decide where to apply it:

| Apply formal verification | Skip it |
|---|---|
| Authorization and access control | UI rendering logic |
| Cryptographic protocols | CRUD boilerplate |
| Distributed consensus | Simple data transformations |
| Financial calculations | User-facing text content |
| Schema migration validators | Logging and metrics |
| Safety-critical state machines | Configuration defaults |

The pattern: apply it where bugs are expensive (security, correctness, data integrity) and where the specification can be stated mathematically. For the critical 20% of a codebase where correctness failures are severe, formal verification pays for itself on the first prevented production incident. For the other 80%, tests and code review are the right tools. This targeting also addresses the organizational pressure argument directly. You don’t need to formally verify everything, which would be impractical. You verify the parts where the cost of being wrong is highest. That’s a defensible, scoped investment that produces measurable risk reduction.


Getting Started with Dafny: Three Steps

  • Step 1: Start with types. Write ValidXxx predicates before writing any methods. This forces you to articulate what “correct state” means before writing code that’s supposed to produce it. The predicates are small, incremental, and require no theorem-proving expertise.
  • Step 2: Add contracts to one critical method. Pick the authorization check. Add requires/ensures. Let Dafny fail. Understand why. Add lemmas. The first proof is the hardest; subsequent ones follow the same pattern.
  • Step 3: Mirror each lemma with a property test. This catches translation bugs in the Go implementation and keeps spec and code in sync as the system evolves.

// Dafny lemma, proved by the verifier
lemma EmptyConstraintsAlwaysHold(ctx: EvalContext)
  ensures AllConstraintsHold([], ctx)
{}

// Matching gopter property: catches Go translation bugs
props.Property("empty constraint list always passes", prop.ForAll(
    func(hour int, region string) bool {
        // abs() keeps the hour in [0, 24); Go's % can return a negative value
        ctx := types.EvalContext{CurrentHour: abs(hour) % 24, CurrentRegion: region}
        return constraints.AllHold(nil, ctx)
    },
    gen.Int(), gen.AnyString(),
))

Conclusion: You Can’t Outsource Thinking

The AI era has come full circle. In the 1980s, AI meant logic: Prolog, expert systems, formal inference. The 2020s flipped to the probabilistic: statistical token prediction that generates plausible code at extraordinary speed. But plausible was never the goal. Correct is the goal. And correctness was always logic’s domain.

Bertrand Meyer’s article From Probable to Provable captures the shift precisely: the engineering role moves from writing code to writing specifications, from debugging via console.log to managing verification pipelines, and from reviewing AI-generated code line by line to reviewing the specs the verifier checks against.

The division of labor looks like this:

| Layer | Owner | Activity |
|---|---|---|
| Specification | Human engineer | Defines invariants, contracts, correctness properties |
| Code generation | AI (LLM) | Produces candidate implementations fast |
| Verification | Formal tools (Dafny, TLA+, Z3) | Proves or refutes correctness mathematically |
| Testing | Property-based + fuzz | Catches translation bugs between spec and implementation |
| Review | Human engineer | Reviews counterexamples, refines specifications |

This is the answer to the organizational pressure to skip review, reduce verification, and just ship. The pressure comes from observing 10× code output and concluding that verification overhead is blocking throughput. The data says the opposite: organizations that remove verification to increase throughput move from the valley of calm to the plateau of misery. They ship more code with lower reliability. The pipeline stalls.

Formal verification, applied selectively to the critical 20% of your codebase, keeps the defect rate low enough that the rest of the pipeline flows. It shifts human effort from reading AI-generated code line by line to writing the specifications that make wrong implementations immediately visible. That’s a higher-leverage use of engineering time and a better argument to make to management than “we need more review bandwidth.”

The tools are practical today:

  • Dafny verifies all 6 RBAC specification files in under 30 seconds on a laptop
  • TLC model-checks the concurrent update protocol in under a second
  • gopter runs 2,300+ property tests in under a second
  • Total upfront overhead: roughly 20% more time spent writing specs rather than debugging production

Mager’s valley of calm stays wide when your defect rate stays low. Formal verification is the most effective tool I’ve found for keeping it there.

Everything in this post runs from the companion repository: github.com/bhatti/automated-reasoning.


References

  1. AWS: An Unexpected Discovery – Automated Reasoning Often Makes Systems More Efficient
  2. CACM: Systems Correctness Practices at Amazon Web Services
  3. CACM: AI for Software Engineering – From Probable to Provable
  4. Dafny: Teaching Program Verification at Amazon
  5. Galois: Automated Lean Proofs for Every Type
  6. Jane Street: Formal Methods
  7. Brooker: What’s Easy, What’s Hard for AI Agents
  8. Mager: The Valley of Calm
  9. Mager: The New Calculus of AI-Based Coding
  10. AI Writes Code, You Own the Design
  11. Beyond Vibe Coding – TLA+ with Claude
  12. Contract Testing for REST APIs
  13. TLA+ for the LLM Era
  14. Use Prolog to Improve LLM Reasoning
  15. Google OR-Tools CP-SAT for Scheduling
  16. you-got-skills: SDLC Skills for AI-Assisted Development
  17. ProVerB: Program Verification Book

June 17, 2026

Building a Self-Improving AI Agent with Durable Actors: MiniHermes

Filed under: Agentic AI — admin @ 8:25 pm

What Is Hermes Agent?

Hermes Agent from Nous Research is a very capable open agent built around three ideas that reinforce each other:

  • Structured system prompt with function-calling discipline. The system prompt teaches the model when to call a tool versus when to answer directly, how to format tool inputs as JSON, and how to interpret results and loop forward. The model learns that end_turn means the task is finished. This discipline makes Hermes far more reliable than agents running open-ended prompts.
  • Multi-step tool loop. After each LLM response, the agent checks: did the model request a tool? If yes, execute it, append the result, and call the LLM again up to a configured limit. This is what lets Hermes chain steps like “search –> read –> summarise” without the user driving each step by hand.
  • Self-critique and skill accumulation. After a complex task, Hermes reflects on the conversation and extracts a reusable skill, a named, structured description of the steps it took. The next time it encounters a similar request, it injects that skill into context and executes faster, without re-discovering the procedure from scratch.

These three properties make Hermes genuinely useful. But the reference implementation is a monolithic Python process. One crash loses every in-flight session. There is no distribution, no tenant isolation, no scheduled automation, and no provider failover. It is excellent research code and a fragile foundation for anything beyond a single-user demo.

MiniHermes keeps all three Hermes ideas and rebuilds the execution model on PlexSpaces, an actor-based distributed runtime. The result compiles to a single WASM binary, runs 12 actors under supervision, and adds durable state, fault isolation, distributed cron, context compression, and guardrails without changing how the core agent loop reasons.


The Problem: Stateless vs. Stateful Monolith

Most AI agents fall into one of two camps, and both have real problems.

  • Stateless agents are easy to deploy but forget everything between requests. You can’t reuse a procedure the agent learned last Tuesday. You can’t track that the user prefers metric units. Every conversation starts from zero. The workarounds, such as external caches and vector stores, turn the agent into infrastructure glue rather than an intelligent system.
  • Stateful monoliths like the Hermes reference implementation go the other direction: one process owns everything. That’s clean for development, but fragile under load. When the process crashes, every active session vanishes. A bug in skill extraction can corrupt the memory that session management depends on.

The actor model offers a third path. Decompose the system into many small actors, each owning exactly one responsibility, communicating only through messages. When one crashes, the supervisor restarts just that actor. The others keep running.


PlexSpaces Primitives: The Foundation

Before walking through the actors, it helps to understand the primitives every actor has access to inside the WASM sandbox. These are the only operations available: no filesystem, no global state, no raw sockets. This constraint is deliberate: it is part of what makes the system auditable and safe.

KV: Durable Point Lookup

# Persist and restore session history across restarts
host.kv_put(f"session_history:{session_id}", json.dumps(messages))
raw = host.kv_get(f"session_history:{session_id}")
messages = json.loads(raw) if raw else []

KV stores anything keyed by an exact string: session history, skill metadata, cron job state, provider configuration. The durability facet checkpoints it automatically, so a restarted actor picks up exactly where it left off.
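
The restart behavior can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PlexSpaces implementation: `InMemoryKV` and `SessionActor` are hypothetical names, the store is a dict that simply outlives the actor object, and the checkpoint is written explicitly on every append instead of by a durability facet.

```python
import json

class InMemoryKV:
    """Hypothetical stand-in for the host KV store; a dict that outlives actors."""
    def __init__(self):
        self._data = {}

    def kv_put(self, key: str, value: str) -> None:
        self._data[key] = value

    def kv_get(self, key: str):
        return self._data.get(key)

class SessionActor:
    """Toy actor that reloads its history from KV every time it starts."""
    def __init__(self, kv: InMemoryKV, session_id: str):
        self.kv = kv
        self.key = f"session_history:{session_id}"
        raw = kv.kv_get(self.key)
        self.messages = json.loads(raw) if raw else []

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        # Explicit checkpoint on every write (assumption made for this sketch).
        self.kv.kv_put(self.key, json.dumps(self.messages))

kv = InMemoryKV()
first = SessionActor(kv, "sess-001")
first.append("user", "hello")
del first  # simulate a crash: the actor object is gone, the store is not

restarted = SessionActor(kv, "sess-001")
print(restarted.messages)  # [{'role': 'user', 'content': 'hello'}]
```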

TupleSpace: Pattern-Matched Coordination

TupleSpace is not KV. Rather than point lookups, it supports wildcard queries:

# Index a skill under multiple trigger keywords
host.ts.write(["skill_trigger", "csv",         "skill-001"])
host.ts.write(["skill_trigger", "spreadsheet", "skill-001"])
host.ts.write(["skill_trigger", "pivot",       "skill-001"])

# Find every skill that might match — None is a wildcard
all_triggers = host.ts.read_all(["skill_trigger", None, None])
# → [["skill_trigger","csv","skill-001"], ["skill_trigger","spreadsheet","skill-001"], ...]

# Audit log: all events of a specific type
events = host.ts.read_all(["audit", "tool_executed", None, None])

# Health snapshots: last N polls
snapshots = host.ts.read_all(["health_snapshot", None, None])

TupleSpace powers skill indexes, memory tiers, audit logs, and health snapshots: anything where you scan across many entries rather than fetching one by ID.

Design tradeoff. TupleSpace pattern matching scales well for hundreds to thousands of entries but is not a replacement for a vector database or SQL at large scale. For this POC it removes an external dependency entirely; a production system with millions of skills would add an embedding-based index alongside it.
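
For intuition, wildcard matching of this kind fits in a dozen lines. The sketch below is a hypothetical in-memory `TinyTupleSpace`, not the PlexSpaces one. It assumes a pattern matches only tuples of the same length and that `None` matches any field.

```python
from typing import List, Optional, Sequence

class TinyTupleSpace:
    """Hypothetical in-memory tuple store with None-as-wildcard matching."""
    def __init__(self):
        self._tuples: List[List[str]] = []

    def write(self, tup: Sequence[str]) -> None:
        self._tuples.append(list(tup))

    def read_all(self, pattern: Sequence[Optional[str]]) -> List[List[str]]:
        """Return every tuple of the same length whose fields match the pattern."""
        return [
            t for t in self._tuples
            if len(t) == len(pattern)
            and all(p is None or p == field for p, field in zip(pattern, t))
        ]

ts = TinyTupleSpace()
ts.write(["skill_trigger", "csv", "skill-001"])
ts.write(["skill_trigger", "pivot", "skill-001"])
ts.write(["skill_trigger", "csv", "skill-007"])
ts.write(["audit", "tool_executed", "calculator", "ok"])

csv_skills = [t[2] for t in ts.read_all(["skill_trigger", "csv", None])]
print(csv_skills)  # ['skill-001', 'skill-007']
```

Every query here is a linear scan over all tuples, which is why the approach is comfortable at thousands of entries and a poor fit for millions.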

BlobStorage: Large, Opaque Content

# Skill procedures can be several paragraphs — too large for KV values
host.blob.upload(f"skill_procedure_{skill_id}", procedure_text.encode())
procedure = host.blob.download(f"skill_procedure_{skill_id}").decode()

BlobStorage handles the full procedure text that would be awkward as a KV value and wasteful to pass in message payloads.

Channel: At-Least-Once Delivery

# Cron scheduler enqueues a job
host.channel.send("", "cron:pending", "cron_job", job_payload)

# Agent receives, processes, then acks — message redelivered if agent crashes before ack
msg, ok, _ = host.channel.receive("", "cron:pending", timeout_ms=5000)
if ok:
    # ... process the job ...
    host.channel.ack("", "cron:pending", msg["msg_id"])
    # or: host.channel.nack("", "cron:pending", msg["msg_id"], True)  # requeue

Channel provides the durability that host.send() does not. If the consuming actor crashes between receive and ack, the message is redelivered on restart. This is what makes recurring tasks survive node failures without a separate message broker.
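
The receive/ack contract can be modeled in a few lines. The sketch below is a hypothetical single-process `TinyChannel`, not the PlexSpaces channel: a message moves to an in-flight table on receive and only disappears on ack. The `recover()` method is my stand-in for whatever the runtime does when a consumer restarts.

```python
import itertools
from collections import deque

class TinyChannel:
    """Hypothetical single-queue channel: messages stay in flight until acked."""
    def __init__(self):
        self._pending = deque()
        self._in_flight = {}
        self._ids = itertools.count(1)

    def send(self, payload) -> None:
        self._pending.append({"msg_id": next(self._ids), "payload": payload})

    def receive(self):
        """Hand out the next message, remembering it until ack or nack."""
        if not self._pending:
            return None, False
        msg = self._pending.popleft()
        self._in_flight[msg["msg_id"]] = msg
        return msg, True

    def ack(self, msg_id: int) -> None:
        self._in_flight.pop(msg_id, None)  # done: never delivered again

    def nack(self, msg_id: int, requeue: bool = True) -> None:
        msg = self._in_flight.pop(msg_id, None)
        if msg is not None and requeue:
            self._pending.appendleft(msg)

    def recover(self) -> None:
        """Simulate a consumer restart: everything unacked goes back on the queue."""
        for msg_id in sorted(self._in_flight, reverse=True):
            self._pending.appendleft(self._in_flight.pop(msg_id))

ch = TinyChannel()
ch.send({"job": "daily_report"})
msg, ok = ch.receive()     # consumer takes the job...
ch.recover()               # ...and crashes before acking
again, ok = ch.receive()   # the same message is delivered again
ch.ack(again["msg_id"])
print(msg["msg_id"] == again["msg_id"], ch.receive())  # True (None, False)
```

The flip side of at-least-once delivery is that a job can run twice, so the consumer should be idempotent.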

DistributedLock: Cluster-Wide Leader Election

// Go — CronSchedulerActor.tick()
// TryAcquire returns false immediately if another node holds the lock
// TTL of 90s is longer than the 60s tick interval, preventing gaps
acquired, _ := host.Lock().TryAcquire("minihermes", "cron_leader", 90000)
if !acquired {
    return // another node is the leader this cycle
}
// Safe to fire jobs — only this node runs this block right now

Without DistributedLock, every node in a three-node cluster would fire every cron job simultaneously. The lock ensures at most one node schedules jobs per tick.
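
Here is a minimal sketch of the lease logic behind a TTL lock, assuming the current holder may renew its own lease and that time is passed in explicitly so the example is deterministic. `TinyLockService` is a hypothetical single-process model; a real distributed lock needs a replicated store behind it.

```python
class TinyLockService:
    """Hypothetical lock table: one holder per key until the TTL expires."""
    def __init__(self):
        self._locks = {}  # key -> (holder, expires_at_ms)

    def try_acquire(self, key: str, holder: str, ttl_ms: int, now_ms: int) -> bool:
        current = self._locks.get(key)
        if current is not None:
            owner, expires_at = current
            if now_ms < expires_at and owner != holder:
                return False  # someone else holds a live lease
        self._locks[key] = (holder, now_ms + ttl_ms)  # take or renew the lease
        return True

nodes = ("node-a", "node-b", "node-c")
locks = TinyLockService()
tick0 = [locks.try_acquire("cron_leader", n, 90_000, now_ms=0) for n in nodes]
tick1 = [locks.try_acquire("cron_leader", n, 90_000, now_ms=60_000) for n in nodes]
# node-a stops ticking; its lease runs out 90s after its last renewal at t=60s
later = [locks.try_acquire("cron_leader", n, 90_000, now_ms=150_000) for n in nodes[1:]]
print(tick0, tick1, later)
# [True, False, False] [True, False, False] [True, False]
```

Because the 90s TTL outlasts the 60s tick, a healthy leader renews before its lease expires, and a dead leader is replaced after the lease runs out.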

SendAfter: Actor-Managed Timers

@init_handler
def on_init(self, config: dict) -> None:
    host.process_groups.join("svc:health_monitor")
    # Arm the first tick — no external cron daemon needed
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

@handler("poll_tick", "cast")
def poll_tick(self) -> None:
    # ... do poll work ...
    # Re-arm: each tick schedules the next
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

send_after replaces external schedulers for periodic work inside an actor. The actor manages its own timeline.
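
The self re-arming pattern is easy to see with a virtual clock. The sketch below uses a hypothetical `TinyScheduler` built on `heapq` in place of the actor runtime, so the names and the callback-style `send_after` signature are assumptions made for illustration.

```python
import heapq

class TinyScheduler:
    """Hypothetical virtual-time scheduler standing in for the actor runtime."""
    def __init__(self):
        self.now_ms = 0
        self._queue = []  # entries are (fire_at_ms, seq, callback)
        self._seq = 0     # tie-breaker so callbacks are never compared

    def send_after(self, delay_ms: int, callback) -> None:
        heapq.heappush(self._queue, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_until(self, end_ms: int) -> None:
        """Deliver timers in order until virtual time reaches end_ms."""
        while self._queue and self._queue[0][0] <= end_ms:
            fire_at, _, callback = heapq.heappop(self._queue)
            self.now_ms = fire_at
            callback()
        self.now_ms = end_ms

class HealthMonitor:
    def __init__(self, scheduler: TinyScheduler, poll_interval_ms: int):
        self.scheduler = scheduler
        self.poll_interval_ms = poll_interval_ms
        self.polls = []
        scheduler.send_after(poll_interval_ms, self.poll_tick)  # arm the first tick

    def poll_tick(self) -> None:
        self.polls.append(self.scheduler.now_ms)  # the "poll work"
        self.scheduler.send_after(self.poll_interval_ms, self.poll_tick)  # re-arm

sched = TinyScheduler()
monitor = HealthMonitor(sched, poll_interval_ms=1000)
sched.run_until(3500)
print(monitor.polls)  # [1000, 2000, 3000]
```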

Ask vs. Send: Request-Reply vs. Fire-and-Forget

# host.ask() — blocks until a response arrives (or timeout)
llm_resp = host.ask(llm_id, "completion",
                    {"messages": messages, "tools": tools},
                    timeout_ms=30000)

# host.send() — returns immediately, caller never waits
host.send(audit_id, "log_event",
          {"event_type": "tool_executed", "detail": f"tool={name}"})

This distinction matters for latency. Audit events and async skill learning always use send(). The calling actor never waits for them. LLM completions and tool results use ask() because the outcome is needed before continuing.

IncrCounter: Lightweight Metrics

# Increment a named counter — visible to monitoring without any external metrics system
host.incr_counter("llm_completions_total", 1)
host.incr_counter("tool_executions_total", 1)
host.incr_counter(f"tool_{name}_total", 1)
host.incr_counter("skill_matches_total", len(matched_ids))

Every key operation in MiniHermes emits a counter. Aggregated across actors, these give a metrics dashboard without Prometheus or a separate telemetry pipeline.


Architecture: 12 Actors, One WASM Binary

MiniHermes compiles to a single WASM binary. The PlexSpaces supervisor boots 12 actors from it at startup, each with its own state, crash domain, and message contract.

The four actor behaviors map to four different runtime contracts:

| Behavior | Actors | What It Provides |
|---|---|---|
| GenServer | Agent, LLM, Tools, Skills, Memory, Compressor, Cron, Session, Health | Synchronous request-reply with durable state |
| GenFSM | GuardrailsGate | Validated state machine; invalid transitions are rejected at runtime |
| GenEvent | AuditEvent | Fire-and-forget event delivery; callers never block |
| Workflow | SkillExtractionWorkflow | Durable multi-step execution with per-step checkpoints and cancel/query signals |

Fault isolation. A bug in SkillStoreActor cannot corrupt AgentActor’s session history. If SkillExtractionWorkflow crashes mid-extraction, it resumes from its last checkpoint without restarting the conversation. The one_for_one supervisor strategy restarts only the failed actor; everything else keeps running.

# app-config.toml
[supervisor]
strategy = "one_for_one"           # restart ONLY the crashed child
max_restarts = 10
max_restart_window_seconds = 60    # if 10 crashes in 60s, escalate to parent supervisor
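
The restart limit is a sliding-window check. The sketch below shows one way to implement it, reading the config comment literally: the tenth crash inside 60 seconds escalates. `RestartWindow` is a hypothetical name, and the real supervisor may count slightly differently.

```python
from collections import deque

class RestartWindow:
    """Hypothetical restart-intensity tracker for one supervisor."""
    def __init__(self, max_restarts: int = 10, window_seconds: float = 60.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._crashes = deque()  # timestamps of recent crashes

    def on_crash(self, now: float) -> str:
        """Return 'restart' for the crashed child, or 'escalate' past the limit."""
        self._crashes.append(now)
        # Forget crashes that fell out of the sliding window.
        while self._crashes and now - self._crashes[0] > self.window_seconds:
            self._crashes.popleft()
        if len(self._crashes) >= self.max_restarts:
            return "escalate"
        return "restart"

steady = RestartWindow(max_restarts=10, window_seconds=60)
slow = [steady.on_crash(now=t * 10.0) for t in range(12)]   # one crash every 10s

burst = RestartWindow(max_restarts=10, window_seconds=60)
fast = [burst.on_crash(now=float(t)) for t in range(10)]    # ten crashes in 10s

print(set(slow), fast[-1])  # {'restart'} escalate
```

Occasional crashes are absorbed indefinitely, while a crash loop trips the limit quickly and hands the decision to the parent supervisor.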

Latency tradeoff. Each actor boundary costs one ask() call instead of an in-process function call. For an LLM agent this is negligible as LLM round-trips dominate at 100ms to 10s. The isolation and recoverability benefits far outweigh the sub-millisecond message overhead.


The Supervisor Tree and the Let-It-Crash Philosophy

Monolithic agent frameworks force every developer to write defensive error handling around every tool call, every LLM request, every memory write. MiniHermes takes the Erlang philosophy instead: let actors crash, and let supervisors restart them in a clean state.

When ToolExecutorActor crashes due to a bad tool payload, a timeout, or a WASM trap, the supervisor restarts it with clean state. The AgentActor’s in-flight request receives a timeout error and can retry. Every other actor continues running. The audit trail, the cron scheduler, the skill store, and the LLM gateway never learn that a crash happened.

This is the opposite of a monolith, where one bad tool call can corrupt the process heap and take the entire agent down.


Security: WASM, Firecracker, and Actor Isolation

Security in MiniHermes comes from three concentric layers, not from application-level checks.

  • Layer 1: Actor message isolation. Each actor owns its state exclusively. No shared memory, no global variables. Communication happens only through host.ask() and host.send(). Even if a prompt injection tricks AgentActor into misbehaving, it cannot read LLMGatewayActor’s stored API credentials or SkillStoreActor’s procedure data, because those live in separate actor state.
  • Layer 2: WASM linear memory sandbox. Every actor compiles to a WebAssembly module. The WIT (Wasm Interface Type) definition explicitly lists every operation the actor can call:
// wit/plexspaces-actor/host.wit
// Actors can ONLY call these imports — nothing else is accessible
interface host {
    send:       func(to: string, msg-type: string, payload: payload) -> result<_, actor-error>;
    ask:        func(to: string, msg-type: string, payload: payload, timeout-ms: u64) -> result<payload, actor-error>;
    kv-get:     func(key: string) -> result<payload, actor-error>;
    kv-put:     func(key: string, value: payload) -> result<_, actor-error>;
    http-fetch: func(link-name: string, method: string, path: string, request: payload) -> result<payload, actor-error>;
    ts-write:   func(tuple: list<string>) -> result<_, actor-error>;
    ts-read-all:func(pattern: list<option<string>>) -> result<list<list<string>>, actor-error>;
    // No filesystem. No env vars. No raw network. No process exec.
}

A malicious tool payload cannot exfiltrate environment variables or write to the filesystem because the sandbox exposes no host imports for those operations.

  • Layer 3: Firecracker. In a production deployment, each WASM runtime runs inside a Firecracker microVM. Firecracker is a lightweight virtual machine monitor built on KVM that provides hardware-enforced memory and I/O isolation between tenants. A compromise in one tenant’s actor cannot affect another tenant’s data or execution even if the WASM sandbox were bypassed.

Tenant isolation. Every PlexSpaces operation propagates tenant context automatically. KV keys, TupleSpace tuples, process groups, and object registry entries are all scoped by tenant and namespace:

# Framework-enforced key scoping — no application code can bypass this
KV:          tenant-acme:prod:session_history:sess-001
TupleSpace:  tenant-acme:prod:["skill_trigger", "csv", "skill-001"]
PG:          tenant-acme:prod:svc:agent

Tenant acme cannot retrieve a session belonging to tenant globex. The framework rejects the request before it reaches any actor.
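
The core of this scoping is that the caller never gets to write the prefix. The sketch below shows the idea with a hypothetical `ScopedKV` wrapper over a shared dict. It is a simplification: the real framework rejects cross-tenant requests outright, while this toy version simply makes the other tenant's keys unreachable.

```python
class ScopedKV:
    """Hypothetical tenant-scoped view over a shared store."""
    def __init__(self, backing: dict, tenant: str, namespace: str):
        self._backing = backing
        self._prefix = f"tenant-{tenant}:{namespace}:"

    def put(self, key: str, value: str) -> None:
        self._backing[self._prefix + key] = value

    def get(self, key: str):
        # The caller never supplies the prefix, so it cannot name another tenant's key.
        return self._backing.get(self._prefix + key)

shared = {}
acme = ScopedKV(shared, "acme", "prod")
globex = ScopedKV(shared, "globex", "prod")

acme.put("session_history:sess-001", "[...]")
print(acme.get("session_history:sess-001"))    # [...]
print(globex.get("session_history:sess-001"))  # None
print(list(shared))  # ['tenant-acme:prod:session_history:sess-001']
```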


The Agent Loop

AgentActor drives the core conversation. When it receives a chat message, here is the full sequence:

User: "calculate 42 * 17 and remember the result"

  1. Restore session history from KV (survives restarts)
  2. Ask ContextCompressorActor: token budget > 75%?
     --> Yes: summarize the middle, keep the recent tail, archive original
  3. Ask SkillStoreActor: known procedures for "calculate" + "memory_store"?
     --> Found: inject skill into system prompt
  4. Ask ToolExecutorActor: list current tool schemas
  5. LOOP (max 8 iterations):
     a. Ask LLMGatewayActor: complete with these messages + tools
     b. stop_reason = tool_use:
        --> GuardrailsGate.check("calculator")   --> allow
        --> ToolExecutor.execute("calculator", {expr: "42*17"}) --> {result: 714}
        --> GuardrailsGate.check("memory_store") --> allow
        --> ToolExecutor.execute("memory_store", {key: "last_calc", value: "714"})
        --> Append results; continue loop
     c. stop_reason = end_turn --> break
  6. KV.put("session_history:sess-001", messages)   --  durable checkpoint
  7. send (fire-and-forget): SkillStoreActor.evaluate_for_learning
  8. send (fire-and-forget): AuditEventActor.log_event
  == "42 × 17 = 714. I've stored the result in your memory."

The Python implementation:

@actor
class AgentActor:
    system_prompt: str = state(default="You are a helpful AI assistant with access to tools.")
    messages: list     = state(default_factory=list)
    max_iterations: int = state(default=8)
    token_budget: int   = state(default=4096)

    @init_handler
    def on_init(self, config: dict) -> None:
        args = config.get("args", {})
        self.system_prompt = args.get("system_prompt", self.system_prompt)
        host.process_groups.join("svc:agent")
        # Publish capabilities for registry-based discovery
        host.registry.register(ctx="", object_type="actor", object_id=config["actor_id"],
                                object_category="agent",
                                capabilities=["chat", "tool_use", "memory"])

    @handler("chat")
    def chat(self, message: str = "", session_id: str = "") -> dict:
        # 1. Restore durable session
        if session_id:
            raw = host.kv_get(f"session_history:{session_id}")
            if raw:
                self.messages = json.loads(raw)
        self.messages.append({"role": "user", "content": message})

        # 2. Compress if over token budget
        comp_id, _ = pg_first("svc:context_compressor")
        if comp_id:
            resp = ask(comp_id, "check_and_compress",
                       {"messages": self.messages, "token_budget": self.token_budget})
            if resp and resp.get("compressed"):
                self.messages = resp["messages"]

        # 3. Inject matching skills
        skill_id, _ = pg_first("svc:skill_store")
        skill_context = ""
        if skill_id:
            resp = ask(skill_id, "match_skills", {"query": message})
            if resp and resp.get("skills"):
                skill_context = self._format_skills(resp["skills"])

        # 4. Get live tool schemas
        tool_exec_id, _ = pg_first("svc:tool_executor")
        tools = []
        if tool_exec_id:
            resp = ask(tool_exec_id, "list_tools", {})
            tools = resp.get("tools", []) if resp else []

        system = self.system_prompt
        if skill_context:
            system += f"\n\n## Relevant Skills\n{skill_context}"

        # 5. The tool loop — max_iterations prevents runaway execution
        final_response = ""
        for iteration in range(self.max_iterations):
            llm_id, _ = pg_first("svc:llm_gateway")
            llm_resp = ask(llm_id, "completion",
                           {"messages": [{"role": "system", "content": system}] + self.messages,
                            "tools": tools},
                           timeout_ms=30000)

            response    = llm_resp.get("response", {})
            stop_reason = response.get("stop_reason", "end_turn")
            self.messages.append({"role": "assistant",
                                   "content": response.get("content", ""),
                                   "stop_reason": stop_reason})

            if stop_reason == "end_turn":
                final_response = response.get("content", "")
                break

            if stop_reason == "tool_use":
                guard_id, _ = pg_first("svc:guardrails")
                audit_id, _ = pg_first("svc:audit")  # resolved here; may be None
                for tc in response.get("tool_calls", []):
                    # Every tool call clears the guardrail first
                    if guard_id:
                        check = ask(guard_id, "check_tool",
                                    {"tool_name": tc["name"], "input": tc["input"]})
                        if check and check.get("decision") == "deny":
                            self.messages.append({"role": "tool",
                                                   "tool_call_id": tc["id"],
                                                   "content": f"[denied: {tc['name']}]"})
                            continue
                    result = ask(tool_exec_id, "execute",
                                 {"name": tc["name"], "input": tc["input"]})
                    self.messages.append({"role": "tool",
                                          "tool_call_id": tc["id"],
                                          "content": json.dumps(result)})
                    # Fire-and-forget audit: never blocks the tool loop
                    if audit_id:
                        host.send(audit_id, "log_event",
                                  {"event_type": "tool_executed",
                                   "detail": f"tool={tc['name']} session={session_id}"})
                    host.incr_counter("tool_executions_total", 1)

        # 6. Checkpoint session — durable across restarts
        if session_id:
            host.kv_put(f"session_history:{session_id}", json.dumps(self.messages))

        # 7+8. Async learning and audit — never block the response
        if skill_id:
            host.send(skill_id, "evaluate_for_learning",
                      {"messages": self.messages, "user_intent": message})
        host.incr_counter("agent_chats_total", 1)
        return {"status": "ok", "response": final_response, "session_id": session_id}

Step 7 uses host.send() rather than a blocking ask(). Skill learning never adds latency to the response; it happens in the background while the user reads the answer.


The LLM Gateway: Hot-Swap and Circuit Breaker

LLMGatewayActor is the single point through which all LLM calls flow. It can switch providers at runtime without restarting, and it protects downstream actors from a flaky provider with a built-in circuit breaker.

# Switch from Ollama to Anthropic — takes effect immediately, no restart
curl -X POST http://localhost:8091/api/v1/actors/llm_gateway/switch_provider \
  -d '{"provider":"anthropic","model":"claude-opus-4-8"}'

# Or to OpenAI
curl -X POST http://localhost:8091/api/v1/actors/llm_gateway/switch_provider \
  -d '{"provider":"openai","model":"gpt-4o"}'

The circuit breaker lives in the actor’s durable state, so it survives restarts:

@actor
class LLMGatewayActor:
    provider:              str  = state(default="ollama")
    model:                 str  = state(default="llama3.2")
    circuit_open:          bool = state(default=False)
    consecutive_failures:  int  = state(default=0)
    total_completions:     int  = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:llm_gateway")
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})

    @handler("completion")
    def completion(self, messages: list = None, tools: list = None) -> dict:
        if self.circuit_open:
            # Fail fast — don't queue work behind a broken provider
            return {"status": "ok", "response": self._simulated_response(),
                    "circuit_open": True}
        try:
            result = self._call_provider(messages or [], tools or [])
            self.consecutive_failures = 0
            self.total_completions += 1
            host.incr_counter("llm_completions_total", 1)
            return {"status": "ok", "response": result}
        except Exception as e:
            self.consecutive_failures += 1
            if self.consecutive_failures >= 3:
                self.circuit_open = True
                host.warn(f"LLM circuit opened after {self.consecutive_failures} failures")
                host.incr_counter("llm_circuit_opens_total", 1)
            return {"error": str(e), "response": self._simulated_response()}

    @handler("timer_tick", "cast")
    def timer_tick(self) -> None:
        # Gradual recovery: one fault cleared per 30s tick
        # 3 faults --> 90s before the circuit closes again; prevents flapping
        if self.circuit_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.circuit_open = False
                host.info("LLM circuit closed — provider available again")
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})

    @handler("switch_provider")
    def switch_provider(self, provider: str = "", model: str = "") -> dict:
        self.provider = provider
        self.model    = model
        # Switching resets the circuit — assume the new provider is healthy
        self.circuit_open         = False
        self.consecutive_failures = 0
        return {"status": "ok", "provider": provider, "model": model}

    def _call_provider(self, messages: list, tools: list) -> dict:
        if self.provider == "ollama":
            resp = host.http_fetch("ollama", "POST", "/api/chat",
                                   {"model": self.model, "messages": messages, "stream": False})
        elif self.provider == "anthropic":
            resp = host.http_fetch("anthropic", "POST", "/v1/messages",
                                   {"model": self.model, "messages": messages,
                                    "tools": tools, "max_tokens": 4096})
        elif self.provider == "openai":
            resp = host.http_fetch("openai", "POST", "/v1/chat/completions",
                                   {"model": self.model, "messages": messages, "tools": tools})
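        else:
            # Unknown provider: raise so completion() records it as a failure
            raise ValueError(f"unsupported provider: {self.provider}")
        return self._normalize(resp)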

Every provider response normalizes to the same format before leaving the gateway:

{
  "content":    "42 × 17 = 714",
  "stop_reason": "end_turn",
  "tool_calls": [],
  "usage":      {"input_tokens": 112, "output_tokens": 18}
}

AgentActor never knows which provider answered. Switching providers is transparent to the rest of the system.
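
The post does not show _normalize, so the following is only a minimal sketch of what such a mapping could look like. The function name normalize is hypothetical, the response shapes are simplified from the providers' public chat APIs, and deriving stop_reason from the presence of tool calls is an assumption made to keep the example short.

```python
import json


def normalize(provider: str, raw: dict) -> dict:
    """Map a provider-specific chat response onto the gateway's common shape."""
    if provider == "anthropic":
        # Anthropic returns a list of typed content blocks.
        blocks = raw.get("content", [])
        text = "".join(b.get("text", "") for b in blocks if b.get("type") == "text")
        calls = [{"id": b["id"], "name": b["name"], "input": b.get("input", {})}
                 for b in blocks if b.get("type") == "tool_use"]
        usage = raw.get("usage", {})
        tokens_in = usage.get("input_tokens", 0)
        tokens_out = usage.get("output_tokens", 0)
    elif provider == "openai":
        # OpenAI nests the message under choices[0]; arguments are a JSON string.
        msg = raw["choices"][0]["message"]
        text = msg.get("content") or ""
        calls = [{"id": c["id"], "name": c["function"]["name"],
                  "input": json.loads(c["function"]["arguments"] or "{}")}
                 for c in (msg.get("tool_calls") or [])]
        usage = raw.get("usage", {})
        tokens_in = usage.get("prompt_tokens", 0)
        tokens_out = usage.get("completion_tokens", 0)
    elif provider == "ollama":
        # Ollama tool calls carry no id, so a positional one is synthesized here.
        msg = raw.get("message", {})
        text = msg.get("content", "")
        calls = [{"id": f"call-{i}", "name": c["function"]["name"],
                  "input": c["function"].get("arguments", {})}
                 for i, c in enumerate(msg.get("tool_calls") or [])]
        tokens_in = raw.get("prompt_eval_count", 0)
        tokens_out = raw.get("eval_count", 0)
    else:
        raise ValueError(f"unknown provider: {provider}")

    return {"content": text,
            "stop_reason": "tool_use" if calls else "end_turn",
            "tool_calls": calls,
            "usage": {"input_tokens": tokens_in, "output_tokens": tokens_out}}


anthropic_raw = {"content": [{"type": "text", "text": "42 × 17 = 714"}],
                 "stop_reason": "end_turn",
                 "usage": {"input_tokens": 112, "output_tokens": 18}}
normalized = normalize("anthropic", anthropic_raw)
```

Because the agent loop only ever reads content, stop_reason, and tool_calls, adding a fourth provider under this scheme means adding one more branch here and nothing elsewhere.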

Design tradeoff. The circuit breaker in this POC uses a simple failure count threshold. A production implementation would add per-provider backoff, budget caps, and latency-based degradation.
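
To see the recovery timing without running the actor, the breaker logic can be isolated into a plain class. This is a sketch that mirrors the threshold of 3 and the one-fault-per-tick decay from the code above; the CircuitBreaker and simulate names are made up for illustration, and each tick stands in for one 30-second timer_tick.

```python
class CircuitBreaker:
    """Failure-count breaker with tick-based gradual recovery (illustrative)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.is_open = False

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.is_open = True

    def tick(self) -> None:
        # One fault is forgiven per tick, but only while the circuit is open.
        if self.is_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.is_open = False


def simulate(failures: int, ticks: int) -> list:
    """Return the open/closed flag after each tick following `failures` faults."""
    breaker = CircuitBreaker()
    for _ in range(failures):
        breaker.record_failure()
    timeline = []
    for _ in range(ticks):
        breaker.tick()
        timeline.append(breaker.is_open)
    return timeline


timeline = simulate(failures=3, ticks=3)  # [True, True, False]
```

The circuit stays open for the first two ticks and closes on the third, which is the 90-second window mentioned in the handler comment.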


Skill Learning: The Self-Improvement Loop

This is what separates MiniHermes from every standard agent loop. When the agent uses three or more tools in a single turn, it asynchronously extracts a reusable skill. The next time the user asks something similar, the agent injects that skill into the system prompt and skips the re-discovery phase entirely.

The Durable Extraction Workflow

SkillExtractionWorkflow uses the @workflow_actor behavior, which checkpoints state after each step. A node crash during step 2 of 3 resumes from step 2, not the beginning:

@workflow_actor
class SkillExtractionWorkflow:

    @run_handler
    def run(self, payload: dict = None) -> dict:
        payload       = payload or {}   # guard against a missing payload
        user_intent   = payload.get("user_intent", "")
        tool_sequence = payload.get("tool_sequence", [])
        domain        = payload.get("domain", "general")
        llm_id        = payload.get("llm_id", "")

        # Three focused LLM passes; each optimizes for a different extraction goal.
        # Python runs them sequentially (shared LLM budget).
        # Go runs them in true parallel goroutines for lower latency.
        name_result      = self._analyse_name(llm_id, user_intent, tool_sequence)
        # ^ workflow checkpoints here; crash-safe from this point

        procedure_result = self._analyse_procedure(llm_id, user_intent, tool_sequence)
        # ^ checkpoint

        trigger_result   = self._analyse_triggers(llm_id, user_intent, domain)
        # ^ checkpoint

        skill_id = f"skill-{host.now_ms()}"
        skill_store_id, _ = pg_first("svc:skill_store")
        if skill_store_id:
            ask(skill_store_id, "propose_skill", {
                "skill_id":        skill_id,
                "name":            name_result.get("name", "unnamed-skill"),
                "description":     name_result.get("description", ""),
                "procedure":       procedure_result.get("procedure", ""),
                "tags":            trigger_result.get("tags", []),
                "trigger_patterns": trigger_result.get("patterns", []),
            })
        return {"status": "ok", "skill_id": skill_id}

    @signal_handler("cancel")
    def cancel(self) -> None:
        # In-flight extraction can be cancelled without crashing the actor
        host.info("SkillExtraction cancelled")

    @query_handler("status")
    def query_status(self) -> dict:
        return {"task_id": self.task_id, "status": self.status, "progress": self.progress}

Three Storage Layers for Three Access Patterns

@handler("propose_skill")
def propose_skill(self, skill_id: str = "", name: str = "",
                  description: str = "", procedure: str = "",
                  tags: list = None, trigger_patterns: list = None) -> dict:

    # KV: metadata — fast exact-key lookup when the ID is known
    meta = {"skill_id": skill_id, "name": name, "description": description,
            "status": "active", "usage_count": 0,
            "created_at": host.now_ms(), "last_used_at": host.now_ms()}
    host.kv_put(f"skill_meta:{skill_id}", json.dumps(meta))

    # BlobStorage: full procedure text — potentially several paragraphs
    host.blob.upload(f"skill_procedure_{skill_id}", procedure.encode())

    # TupleSpace: keyword indexes — pattern scan at query time, no SQL needed
    for tag in (tags or []):
        host.ts.write(["skill_tag", tag, skill_id, name])
    for pattern in (trigger_patterns or []):
        host.ts.write(["skill_trigger", pattern, skill_id])

    host.incr_counter("skills_created_total", 1)
    return {"status": "ok", "skill_id": skill_id}

Why three layers? KV answers “give me skill X” in O(1). TupleSpace answers “which skills match this query?” without an index build step. BlobStorage keeps large procedure text out of both KV values and message payloads.
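
The pattern scan is easier to reason about with a toy model. The class below is an in-memory stand-in, not the PlexSpaces TupleSpace: the InMemoryTupleSpace name is hypothetical, and the rule that a pattern only matches tuples of the same length is an assumption. It shows how None acts as a wildcard in read_all.

```python
from typing import Any, List, Optional, Sequence


class InMemoryTupleSpace:
    """Toy tuple space: append-only writes, wildcard reads (illustrative only)."""

    def __init__(self) -> None:
        self._tuples: List[tuple] = []

    def write(self, tpl: Sequence[Any]) -> None:
        self._tuples.append(tuple(tpl))

    def read_all(self, pattern: Sequence[Optional[Any]]) -> List[tuple]:
        # A field matches when the pattern slot is None (wildcard) or equal.
        return [t for t in self._tuples
                if len(t) == len(pattern)
                and all(p is None or p == v for p, v in zip(pattern, t))]


ts = InMemoryTupleSpace()
ts.write(["skill_trigger", "calculate", "skill-1"])
ts.write(["skill_trigger", "remember", "skill-2"])
ts.write(["skill_tag", "math", "skill-1", "calc-and-store"])

triggers = ts.read_all(["skill_trigger", None, None])
math_tags = ts.read_all(["skill_tag", "math", None, None])
```

Writes need no schema or index definition up front; the cost moves to read time, where every stored tuple is compared against the pattern.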

Skill Matching at Query Time

@handler("match_skills")
def match_skills(self, query: str = "") -> dict:
    query_words = set(query.lower().split())

    # Scan all trigger entries — None is a wildcard
    all_triggers = host.ts.read_all(["skill_trigger", None, None])

    matched_ids = set()
    for tpl in all_triggers:
        pattern = tpl[1].lower()
        if pattern in query_words or any(w in pattern for w in query_words):
            matched_ids.add(tpl[2])

    skills = []
    for skill_id in matched_ids:
        meta_json = host.kv_get(f"skill_meta:{skill_id}")
        if not meta_json:
            continue
        meta = json.loads(meta_json)
        if meta.get("status") != "active":
            continue
        # Track usage for lifecycle decisions. Persist before attaching the
        # procedure so the large text never lands in the KV value.
        meta["usage_count"]  = meta.get("usage_count", 0) + 1
        meta["last_used_at"] = host.now_ms()
        host.kv_put(f"skill_meta:{skill_id}", json.dumps(meta))
        # Load the full procedure only for matched, active skills
        meta["procedure"] = host.blob.download(f"skill_procedure_{skill_id}").decode()
        skills.append(meta)

    host.incr_counter("skill_matches_total", len(skills))
    return {"status": "ok", "skills": skills}

Skills Age Out Automatically

Skills that go unused for 30 days transition to stale. After 90 more days they become archived. A daily send_after tick drives this, so no external scheduler is needed:

@handler("timer_tick", "cast")
def timer_tick(self) -> None:
    now             = host.now_ms()
    thirty_days_ms  = 30 * 24 * 60 * 60 * 1000
    ninety_days_ms  = 90 * 24 * 60 * 60 * 1000

    all_tags = host.ts.read_all(["skill_tag", None, None, None])
    seen     = set()
    for t in all_tags:
        skill_id = t[2]
        if skill_id in seen:
            continue
        seen.add(skill_id)
        meta_json = host.kv_get(f"skill_meta:{skill_id}")
        if not meta_json:
            continue
        meta = json.loads(meta_json)
        age  = now - meta.get("last_used_at", now)
        if meta["status"] == "active" and age > thirty_days_ms:
            meta["status"] = "stale"
            host.kv_put(f"skill_meta:{skill_id}", json.dumps(meta))
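        # age counts from last use, so "90 more days" after going stale is 120 days in total
        elif meta["status"] == "stale" and age > thirty_days_ms + ninety_days_ms: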
            meta["status"] = "archived"
            host.kv_put(f"skill_meta:{skill_id}", json.dumps(meta))

    host.send_after(24 * 60 * 60 * 1000, "timer_tick", {"op": "timer_tick"})

active  --> (30 days unused) -->  stale  --> (90 more days) -->  archived

This prevents the skill store from accumulating noise from one-off tasks that will never recur.
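
The aging rule can be checked in isolation as a pure function. This sketch reads "90 more days" as 120 days since the skill was last used; that interpretation, along with the next_status name and the constants, is an assumption for illustration.

```python
DAY_MS = 24 * 60 * 60 * 1000
STALE_AFTER_MS = 30 * DAY_MS                      # active --> stale
ARCHIVE_AFTER_MS = STALE_AFTER_MS + 90 * DAY_MS   # stale --> archived


def next_status(status: str, idle_ms: int) -> str:
    """Return the status a skill should have after `idle_ms` without use.

    At most one transition happens per call, matching a once-a-day tick.
    """
    if status == "active" and idle_ms > STALE_AFTER_MS:
        return "stale"
    if status == "stale" and idle_ms > ARCHIVE_AFTER_MS:
        return "archived"
    return status


after_31_days = next_status("active", 31 * DAY_MS)    # "stale"
after_100_days = next_status("stale", 100 * DAY_MS)   # still "stale"
after_121_days = next_status("stale", 121 * DAY_MS)   # "archived"
```

Keeping the rule free of storage calls makes the boundaries easy to unit test before wiring it into the tick handler.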


Memory: Three Tiers, One Actor

MemoryActor manages three memory tiers with different durability and retrieval characteristics. The Hermes reference implementation stores facts in flat files; MiniHermes uses KV + TupleSpace + BlobStorage, with each tier mapped to a storage layer.

@actor
class MemoryActor:
    memory_count: int = state(default=0)

    @handler("store_memory")
    def store_memory(self, key: str = "", value: str = "",
                     scope: str = "global", tier: str = "reachable",
                     agent_id: str = "", session_id: str = "") -> dict:
        if not key:
            return {"error": "key required"}
        scoped_key = self._scoped_key(scope, agent_id, session_id, key)

        if tier == "deep":
            # BlobStorage: large, rarely needed, not scanned by default
            host.blob.upload(f"deep_memory_{scoped_key}", value.encode())
        else:
            # KV: durable point lookup
            host.kv_put(scoped_key, str(value))

        # TupleSpace index: queryable by scope and tier regardless of storage layer
        host.ts.write(["memory", scope, tier, key, str(value)[:64]])
        self.memory_count += 1
        return {"status": "ok", "key": key, "scope": scope, "tier": tier}

    @handler("recall_memory")
    def recall_memory(self, key: str = "", scope: str = "global",
                      agent_id: str = "", session_id: str = "") -> dict:
        scoped_key = self._scoped_key(scope, agent_id, session_id, key)
        value = host.kv_get(scoped_key)
        if not value:
            # Try deep tier
            try:
                value = host.blob.download(f"deep_memory_{scoped_key}").decode()
            except Exception:
                pass
        return {"status": "ok", "key": key, "value": value, "found": bool(value)}

    @handler("list_memories")
    def list_memories(self, scope: str = "global", tier: str = None) -> dict:
        pattern = ["memory", scope, tier or None, None, None]
        tuples  = host.ts.read_all(pattern)
        memories = [{"key": t[3], "value": t[4], "tier": t[2]}
                    for t in tuples if len(t) >= 5]
        return {"status": "ok", "memories": memories, "count": len(memories)}

    def _scoped_key(self, scope: str, agent_id: str, session_id: str, key: str) -> str:
        if scope == "agent"   and agent_id:   return f"mem:agent:{agent_id}:{key}"
        if scope == "session" and session_id: return f"mem:session:{session_id}:{key}"
        return f"mem:global:{key}"

The three scopes (global, agent, session) determine which facts survive which boundaries: session memories disappear with the session, agent memories persist across sessions, global memories are shared across all agents.


Distributed Cron: Recurring Tasks That Survive Node Failures

“Summarize my tasks every morning” is a natural request. Making it work reliably across a cluster requires solving three problems at once: who fires the job when there are three nodes, what happens if the firing node crashes mid-delivery, and how duplicate execution is prevented. MiniHermes solves all three with two primitives, a distributed lock with a TTL and an at-least-once channel:

// Go — CronSchedulerActor
func (a *CronSchedulerActor) tick() {
    // TryAcquire returns false immediately if another node holds the lock.
    // TTL of 90s exceeds the 60s tick interval, preventing leader gaps.
    acquired, _ := host.Lock().TryAcquire("minihermes", "cron_leader", 90000)
    if !acquired {
        return // another node leads this cycle — nothing to do
    }

    now := host.NowMs()
    for _, jobID := range a.JobIDs {
        job := a.loadJob(jobID)
        if now-job.LastRunAt >= job.IntervalMs {
            payload := map[string]interface{}{
                "job_id": job.JobID, "prompt": job.Prompt, "session_id": job.SessionID,
            }
            // Channel: at-least-once. If agent crashes before ack, job redelivers.
            host.Ch().Send("", "cron:pending", "cron_job", payload)
            job.LastRunAt = now
            a.saveJob(job)
        }
    }
}
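
The interplay between the 90-second TTL and the 60-second tick is easier to follow in a small simulation. The sketch below is in Python for consistency with the rest of the examples and models the lock in memory; TtlLock and FakeClock are hypothetical names, and treating a re-acquire by the current holder as a renewal is an assumption about the lock's semantics.

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self) -> None:
        self.now_ms = 0

    def __call__(self) -> int:
        return self.now_ms


class TtlLock:
    """In-memory lease: one holder at a time, released automatically on expiry."""

    def __init__(self, clock) -> None:
        self._clock = clock
        self._holder = None
        self._expires_at = 0

    def try_acquire(self, node: str, ttl_ms: int) -> bool:
        now = self._clock()
        free = self._holder is None or now >= self._expires_at
        if free or self._holder == node:   # holder re-acquiring renews the lease
            self._holder = node
            self._expires_at = now + ttl_ms
            return True
        return False                        # non-blocking: fail immediately


clock = FakeClock()
lock = TtlLock(clock)
results = []

results.append(lock.try_acquire("node-a", 90_000))  # t=0s: node-a becomes leader
results.append(lock.try_acquire("node-b", 90_000))  # t=0s: node-b backs off
clock.now_ms = 60_000
results.append(lock.try_acquire("node-a", 90_000))  # t=60s: lease renewed to 150s
clock.now_ms = 120_000                               # node-a has crashed by now
results.append(lock.try_acquire("node-b", 90_000))  # t=120s: lease still valid
clock.now_ms = 180_000
results.append(lock.try_acquire("node-b", 90_000))  # t=180s: lease expired, takeover
```

In this model a crashed leader delays failover by at most one TTL, and because the TTL is longer than the tick interval, a healthy leader never loses the lease between ticks.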

The agent runs each cron job in an isolated session context so the job never bleeds into the user’s live conversation:

@handler("process_cron_job", "cast")
def process_cron_job(self, job_id: str = "", prompt: str = "",
                     session_id: str = "") -> None:
    cron_session = f"cron:{session_id}"

    # Stash the current interactive conversation
    saved_messages = self.messages[:]
    try:
        # Load the cron session's own history, completely separate from user sessions
        raw = host.kv_get(f"session_history:{cron_session}")
        self.messages = json.loads(raw) if raw else []

        self._run_agent_loop(prompt, tools=[])

        host.kv_put(f"session_history:{cron_session}", json.dumps(self.messages))
    finally:
        # Restore the user conversation even if the cron run raises
        self.messages = saved_messages

    audit_id, _ = pg_first("svc:audit")
    if audit_id:
        host.send(audit_id, "log_event",
                  {"event_type": "cron_executed", "detail": f"job_id={job_id}"})

Creating a recurring task takes one API call:

curl -X POST http://localhost:8091/api/v1/actors/cron_scheduler/create_job \
  -d '{
    "job_id":     "daily-digest",
    "prompt":     "Summarize today'\''s tasks and send a digest email",
    "schedule":   "every_24h",
    "session_id": "cron-digest"
  }'

Context Compression: Long Conversations Without Truncation

Every LLM agent eventually exceeds the model’s context window. The reference Hermes implementation truncates, dropping the oldest messages and losing context. MiniHermes compresses instead: ContextCompressorActor summarizes the middle of the conversation, keeps the recent tail intact, and archives the full original.

@handler("check_and_compress")
def check_and_compress(self, messages: list = None, token_budget: int = 4096) -> dict:
    messages        = messages or []
    estimated_tokens = sum(len(str(m)) // 4 for m in messages)

    if estimated_tokens < token_budget * 0.75:
        return {"compressed": False, "messages": messages}

    system_msgs  = [m for m in messages if m.get("role") == "system"]
    other_msgs   = [m for m in messages if m.get("role") != "system"]
    recent_count = max(4, len(other_msgs) // 3)
    middle       = other_msgs[:-recent_count]
    recent       = other_msgs[-recent_count:]

    if len(middle) < 2:
        return {"compressed": False, "messages": messages}

    # Archive the full original before compression — preserves audit trail
    if self.session_id:
        host.kv_put(f"full_history_archive:{self.session_id}", json.dumps(messages))

    llm_id, _ = pg_first("svc:llm_gateway")
    summary_resp = ask(llm_id, "completion", {
        "messages": [
            {"role": "system",
             "content": "Summarize this conversation history concisely. "
                        "Preserve key facts, tool results, and decisions."},
            {"role": "user", "content": json.dumps(middle)}
        ],
        "tools": []
    })

    summary_text = summary_resp.get("response", {}).get("content", "")
    summary_msg  = {"role": "assistant",
                    "content": f"[Conversation summary: {summary_text}]",
                    "is_summary": True}

    compressed = system_msgs + [summary_msg] + recent
    host.incr_counter("context_compressions_total", 1)
    return {"compressed": True, "messages": compressed,
            "original_count": len(messages), "compressed_count": len(compressed)}

Design tradeoff. LLM-based summarization costs tokens and adds latency to the turn that triggers it. In exchange, the compressed context is semantically richer than simple truncation: the model retains the meaning of earlier turns, not just the most recent N messages. For a task-focused agent this matters because a calculation result from turn 3 is still relevant at turn 50.
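
The split itself can be exercised without an LLM. The following sketch lifts the token estimate and the middle/recent partition out of the handler into pure functions; the names estimate_tokens and split_for_compression are made up here, and the 4-characters-per-token heuristic is the same rough estimate the handler uses.

```python
def estimate_tokens(messages: list) -> int:
    """Rough token estimate: about 4 characters per token."""
    return sum(len(str(m)) // 4 for m in messages)


def split_for_compression(messages: list, token_budget: int = 4096,
                          threshold: float = 0.75):
    """Return (system, middle, recent), or None when no compression is needed."""
    if estimate_tokens(messages) < token_budget * threshold:
        return None
    system = [m for m in messages if m.get("role") == "system"]
    others = [m for m in messages if m.get("role") != "system"]
    # Keep at least 4 messages, or the most recent third, untouched.
    recent_count = max(4, len(others) // 3)
    middle, recent = others[:-recent_count], others[-recent_count:]
    if len(middle) < 2:
        return None   # too little to summarize
    return system, middle, recent


long_chat = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": "x" * 400} for _ in range(12)
]
system, middle, recent = split_for_compression(long_chat, token_budget=1000)
short_result = split_for_compression(long_chat[:3])
```

With twelve non-system messages, the eight oldest go to the summarizer and the four newest pass through verbatim; the three-message conversation is well under budget and is left alone.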


Guardrails: Per-Tool Policy Enforcement Without Redeployment

GuardrailsGateActor implements a GenFSM that sits between every tool call and execution. Every call passes through it. Policies update at runtime via a single message — no redeploy, no restart.

@fsm_actor(states=["allow", "review", "approved", "denied"], initial="allow")
class GuardrailsGateActor:
    # tool_name --> "allow" | "deny" | "review"
    policies: dict = state(default_factory=dict)
    deny_count: int = state(default=0)

    @handler("check_tool")
    def check_tool(self, tool_name: str = "", input: dict = None) -> dict:
        policy = self.policies.get(tool_name, "allow")
        audit_id, _ = pg_first("svc:audit")  # resolved here; may be None

        if policy == "deny":
            self.deny_count += 1
            host.incr_counter("tool_denials_total", 1)
            if audit_id:
                host.send(audit_id, "log_event",
                          {"event_type": "tool_denied", "detail": f"tool={tool_name}"})
            return {"decision": "deny", "reason": f"{tool_name} is blocked by policy"}

        if policy == "review":
            # FSM transitions to review; observable by operators via get_state
            self.fsm_state = "review"
            if audit_id:
                host.send(audit_id, "log_event",
                          {"event_type": "tool_review", "detail": f"tool={tool_name}"})
            # Production: pause here and await human approval via Channel
            self.fsm_state = "approved"
            return {"decision": "allow", "reviewed": True}

        return {"decision": "allow"}

    @handler("set_policy")
    def set_policy(self, tool_name: str = "", decision: str = "allow") -> dict:
        self.policies[tool_name] = decision
        audit_id, _ = pg_first("svc:audit")
        if audit_id:
            host.send(audit_id, "log_event",
                      {"event_type": "policy_set",
                       "detail": f"tool={tool_name} decision={decision}"})
        return {"status": "ok", "tool_name": tool_name, "decision": decision}

    @handler("get_state")
    def get_state(self) -> dict:
        return {"fsm_state": self.fsm_state, "policies": self.policies,
                "deny_count": self.deny_count}

# Block a dangerous tool immediately; applies to every check_tool call processed after this message
curl -X POST http://localhost:8091/api/v1/actors/guardrails/set_policy \
  -d '{"tool_name":"delete_file","decision":"deny"}'

# Route a sensitive tool through human review
curl -X POST http://localhost:8091/api/v1/actors/guardrails/set_policy \
  -d '{"tool_name":"send_email","decision":"review"}'

The GenFSM behavior validates every transition at runtime. Attempting allow --> approved without going through review first is rejected by the framework so that bugs in the policy logic cannot produce invalid states.
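
How a framework might enforce this can be sketched with a transition table. This is not the GenFSM implementation: the ALLOWED table, GuardedState, and InvalidTransition are hypothetical, and the set of legal edges is an assumption chosen to match the four states declared above.

```python
# Assumed legal edges between the four declared states.
ALLOWED = {
    "allow":    {"review", "denied"},
    "review":   {"approved", "denied"},
    "approved": {"allow"},
    "denied":   {"allow"},
}


class InvalidTransition(Exception):
    """Raised when a state change is not listed in the transition table."""


class GuardedState:
    def __init__(self, initial: str = "allow") -> None:
        self.state = initial

    def transition(self, target: str) -> None:
        if target not in ALLOWED.get(self.state, set()):
            raise InvalidTransition(f"{self.state} --> {target}")
        self.state = target


gate = GuardedState()
try:
    gate.transition("approved")      # skips review, so it is rejected
    rejected = False
except InvalidTransition:
    rejected = True

gate.transition("review")            # the legal path
gate.transition("approved")
```

The point of the table is that the check lives in one place, so handler code cannot reach approved by accident no matter how the policy branches are written.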


Tools: Runtime Registration and HTTPFetch Execution

Tools are not compiled in. Any HTTP endpoint can become a tool at runtime without redeploying the binary:

# Register a weather API as a tool — takes effect immediately
curl -X POST http://localhost:8091/api/v1/actors/tool_executor/register_tool \
  -d '{
    "name":        "weather",
    "description": "Get current weather for a city",
    "input_schema": {"type":"object","properties":{"city":{"type":"string"}}},
    "handler_type": "service_link",
    "handler_config": {"link_name":"openweather","path":"/data/2.5/weather","method":"GET"}
  }'

ToolExecutorActor dispatches registered tools via host.http_fetch(), the only way to make outbound network calls from within the WASM sandbox:

@actor
class ToolExecutorActor:
    tools: dict     = state(default_factory=dict)   # name --> spec
    exec_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        self.tools = {t["name"]: t for t in _BUILTIN_TOOLS}
        host.process_groups.join("svc:tool_executor")

    @handler("register_tool")
    def register_tool(self, name: str = "", description: str = "",
                      input_schema: dict = None, handler_type: str = "builtin",
                      handler_config: dict = None) -> dict:
        self.tools[name] = {
            "name": name, "description": description,
            "input_schema": input_schema or {},
            "handler_type": handler_type,
            "handler_config": handler_config or {}
        }
        return {"status": "ok", "name": name}

    @handler("execute")
    def execute(self, name: str = "", input: dict = None) -> dict:
        input = input or {}
        if name not in self.tools:
            return {"error": f"unknown tool: {name}"}
        self.exec_count += 1
        host.incr_counter(f"tool_{name}_total", 1)

        spec = self.tools[name]
        if spec.get("handler_type") == "service_link":
            cfg  = spec.get("handler_config", {})
            resp = host.http_fetch(cfg["link_name"], cfg.get("method","GET"),
                                   cfg["path"], input)
            return {"result": resp}

        # Built-in handlers
        if name == "calculator":
            expr = input.get("expression", "0")
            try:
                result = eval(expr, {"__builtins__": {}})  # demo only — see gaps section
                return {"result": str(result)}
            except Exception as e:
                return {"error": str(e)}
        if name == "memory_store":
            mem_id, _ = pg_first("svc:memory")
            if mem_id:
                return ask(mem_id, "store_memory", input) or {}
        if name == "memory_recall":
            mem_id, _ = pg_first("svc:memory")
            if mem_id:
                return ask(mem_id, "recall_memory", input) or {}

        return {"result": f"[simulated] {name} executed"}
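
The calculator handler above marks its eval call as demo only. One common alternative, shown here as a sketch rather than as the author's approach, is to parse the expression with Python's ast module and evaluate only a whitelist of arithmetic nodes. The safe_calculate name is hypothetical, and exponentiation is deliberately left out to avoid very expensive inputs.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.FloorDiv: operator.floordiv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str):
    """Evaluate +, -, *, /, //, % over numeric literals; reject everything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only plain int/float literals (bool and str constants are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


product = safe_calculate("42 * 17")        # 714
nested = safe_calculate("(1 + 2) * -3")    # -9

try:
    safe_calculate("__import__('os').system('ls')")
    blocked = False
except ValueError:
    blocked = True                          # calls and attribute access are rejected
```

Names, calls, and attribute access never reach an evaluator, so the function cannot touch anything outside the expression. Malformed input still raises SyntaxError and division by zero raises ZeroDivisionError, so a handler would wrap the call in the same try/except it already uses.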

Service Discovery: Process Groups vs. Object Registry

MiniHermes demonstrates both discovery patterns side by side.

Process Groups — simple, built-in, zero configuration:

# Every actor announces itself on startup
host.process_groups.join("svc:agent")

# Callers find the first available member — location-transparent
agent_id, err = pg_first("svc:agent")
result = ask(agent_id, "chat", {"message": "Hello"})

// Go version: same pattern
agentID, err := host.PG().First("svc:agent")

Object Registry — richer, capability-aware, preferred for production:

# On startup — declare what this actor can do
host.registry.register(ctx="", object_type="actor",
                        object_id=self.actor_id,
                        object_category="skill_store",
                        capabilities=["match_skills", "propose_skill", "lifecycle"])

# Caller — find an actor that specifically supports skill matching
actors = host.registry.discover(ctx="", object_type="actor",
                                 object_category="skill_store",
                                 required_capability="match_skills")
skill_id = actors[0]["object_id"] if actors else None

// Go: capability-aware lookup
agentID, err := registryFirst("agent", "svc:agent", "tool_use")

Process groups answer “is there anyone in this group?” Registry answers “is there anyone in this group who can do this?” The registry is the better choice when multiple actor versions may be deployed simultaneously, or when different instances offer different capabilities.


Audit Trail and Health Monitoring

Non-Blocking Audit with GenEvent

AuditEventActor uses the GenEvent behavior. Senders call host.send(), which is fire-and-forget, so audit logging never adds latency to the critical path:

@event_actor
class AuditEventActor:
    event_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:audit")

    @handler("log_event", "cast")  # "cast" = fire-and-forget, no reply
    def log_event(self, event_type: str = "", detail: str = "",
                  timestamp: int = 0) -> None:
        ts = timestamp or host.now_ms()
        host.ts.write(["audit", event_type, ts, detail])
        self.event_count += 1

    @handler("query_events")
    def query_events(self, event_type: str = None) -> dict:
        pattern = ["audit", event_type or None, None, None]
        events  = host.ts.read_all(pattern)
        return {"status": "ok", "events": events, "count": len(events)}

The TupleSpace audit log is append-only by construction because there is no ts.delete() in the sandbox. Every tool call, policy change, skill creation, cron execution, and circuit event lands here and stays queryable by event type.

Health Monitor with SendAfter Polling

HealthMonitorActor never subscribes to membership change events. It polls every service group on a fixed interval and writes a snapshot to TupleSpace:

_SERVICE_GROUPS = [
    "svc:llm_gateway", "svc:tool_executor", "svc:agent",
    "svc:skill_store", "svc:guardrails", "svc:audit",
    "svc:cron_scheduler", "svc:session_manager", "svc:memory",
    "svc:context_compressor", "svc:health_monitor",
]

@actor
class HealthMonitorActor:
    poll_count:      int  = state(default=0)
    last_poll_ms:    int  = state(default=0)
    group_health:    dict = state(default_factory=dict)
    poll_interval_ms: int = state(default=5000)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:health_monitor")
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("poll_tick", "cast")
    def poll_tick(self) -> None:
        health = {}
        for grp in _SERVICE_GROUPS:
            try:
                members      = host.process_groups.members(grp)
                health[grp]  = len(members)
            except Exception:
                health[grp] = 0

        self.group_health  = health
        self.poll_count   += 1
        self.last_poll_ms  = host.now_ms()

        host.ts.write(["health_snapshot", self.last_poll_ms, json.dumps(health)])
        # Each tick reschedules the next — no external scheduler
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("get_health")
    def get_health(self) -> dict:
        degraded = [g for g, c in self.group_health.items() if c == 0]
        return {
            "status":       "ok" if not degraded else "degraded",
            "group_health": self.group_health,
            "healthy":      len(self.group_health) - len(degraded),
            "degraded":     degraded,
        }

Polling converges to the true state on every tick regardless of event ordering: the snapshot is eventually consistent and never stale for more than one poll interval.


Primitives Scorecard

MiniHermes uses 16 distinct PlexSpaces primitives across 12 actors:

| Primitive | Where Used | What It Enables |
|---|---|---|
| KV.Get/Put | All actors | Session history, skill metadata, cron jobs, provider config |
| TupleSpace.Write/ReadAll | Skills, Memory, Audit, Health | Tag index, memory tiers, audit log, health snapshots |
| BlobStorage.Upload/Download | Skills, Memory | Skill procedures, deep memory archives |
| Channel.Send/Receive/Ack | Cron | At-least-once job delivery; redelivers on crash |
| DistributedLock.TryAcquire | Cron | Single scheduler leader per cluster |
| ProcessGroups.Join/First | All actors | Location-transparent svc:* discovery |
| ObjectRegistry.Register/Discover | Agent, Skills, Session, Health | Capability-aware routing |
| SendAfter | LLM, Cron, Health, Skills | Self-scheduling tick loops; replaces external cron |
| HTTPFetch | LLM, Tools | Outbound calls to Ollama, OpenAI, Anthropic, tool APIs |
| Ask | Agent, Tools, Compressor | Request-reply across actor boundaries |
| Send | Agent, Cron, Audit | Fire-and-forget: audit events, async skill learning |
| IncrCounter | All actors | Metrics on every key operation |
| Workflow (run/signal/query) | SkillWorkflow | Durable parallel skill extraction with cancel/query |
| Durability (checkpoint_interval) | All stateful actors | State persistence across crashes and restarts |
| GenFSM | Guardrails | Validated state machine; invalid transitions rejected |
| GenEvent | Audit | Non-blocking event delivery; callers never wait |

Known Gaps

MiniHermes is a proof of concept, not a production system. The same disclaimer applies here as in the MiniClaw post: the point is to demonstrate what the architecture can support, not to ship something you should run in production today.

  • Skill quality and safety. The extraction workflow uses LLM reflection without any validation layer. Extracted skills can be incorrect, subtly wrong, or even harmful if the original task involved a bad assumption. A production system needs automated skill evaluation, human review for high-impact skills, and version history with rollback.
  • Calculator eval. The built-in calculator tool uses Python’s eval() with empty builtins. This is a demo shortcut. In production, replace it with an AST-based evaluator or a sandboxed tool actor in its own WASM module with no outbound capabilities at all.
  • Skill matching at scale. TupleSpace keyword matching works well up to thousands of skills. For a large skill store, keyword overlap produces too many false positives. The fix is an embedding-based vector index for semantic similarity, but that requires an embedding model and an external vector store.
  • Context compression quality. The compressor summarizes the middle of the conversation with a generic prompt. It does not distinguish between a casual exchange and a chain of tool results that the later part of the conversation depends on. Poor summarization can cause the agent to “forget” a result it needs. Production compression needs to identify load-bearing context and exclude it from summarization.
  • No per-session actor instances. AgentActor stores self.messages as actor state, which all chat calls within one actor share. This is safe when there is one actor per session, but the POC maps many sessions to one actor instance. A production deployment should either run one actor per session or explicitly key all state by session_id.
  • No prompt injection defense. Tool results flow back into the conversation without any sanitization. A malicious tool response could attempt to override the system prompt. Production systems need input/output validation and possibly an LLM-as-judge layer between tool results and the next LLM call.
  • Circuit breaker threshold is fixed. Three consecutive failures open the circuit. A slow provider that times out 20% of the time can go long stretches without three failures in a row, so the breaker trips late and unpredictably. Production needs adaptive thresholds based on error rate windows, not just consecutive failure counts.
  • No credential management. The LLM gateway reads provider API keys from service link configuration, which in this POC are stored in app-config.toml. A production system needs the phantom-token pattern from MiniClaw: the gateway resolves a real key from actor-private KV and never echoes it in any response or log.
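For the calculator gap above, an AST-based evaluator can be sketched with Python's standard ast module. This is a minimal illustration, not the MiniHermes implementation: the names safe_calculate and _eval_node, the operator whitelist, and the length cap are my assumptions. The idea is to parse the expression, walk the tree, and reject every node type that is not explicitly allowed.

```python
import ast
import operator

# Whitelisted operators: anything not listed here is rejected.
# Exponentiation is left out on purpose so "9**9**9" cannot burn CPU.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str, max_length: int = 200) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    if len(expression) > max_length:
        raise ValueError("expression too long")
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"invalid expression: {exc.msg}") from exc
    return _eval_node(tree.body)


def _eval_node(node: ast.AST) -> float:
    # Numeric literals only; bool is excluded because it subclasses int.
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
        raise ValueError("only numeric literals are allowed")
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Names, calls, attribute access, power, and so on all land here.
    raise ValueError(f"unsupported syntax: {type(node).__name__}")
```

With this sketch, safe_calculate("42 * 17") returns 714, while an input such as "__import__('os')" raises ValueError because a function call is not on the whitelist. Division by zero still raises ZeroDivisionError, so the calling tool needs to catch that as well.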

MiniHermes vs. MiniClaw: Complementary, Not Competing

| Dimension | MiniClaw | MiniHermes |
|---|---|---|
| Primary focus | Security and multi-tenant isolation | Self-improvement and operational resilience |
| Agent topology | Multi-agent orchestration with sub-tasks | Single self-improving long-lived agent |
| Session model | Ephemeral per-request | Long-lived with LLM-based compression |
| Skill learning | None (static tool catalog) | Automatic from conversation, durable workflow |
| Scheduling | None | Distributed cron with DistLock + Channel |
| LLM integration | Simulated only | Real Ollama + OpenAI + Anthropic, hot-swap |
| Provider management | None | Hot-swap + gradual circuit breaker |
| Memory tiers | Single KV scope | Core / Reachable / Deep across three storage layers |
| Guardrails | WASM + actor isolation (structural) | GenFSM gate with per-tool runtime policies |
| Credential handling | Phantom token in actor-private KV | Service link config (see gaps) |
| Observability | TupleSpace audit, health polling | Same, plus IncrCounter metrics on every operation |

MiniClaw establishes the security foundation: WASM isolation, tenant enforcement, credential proxying, and blast-radius containment. MiniHermes builds on that same foundation to add learning, resilience, and operational flexibility. A production system would combine both.


Building and Running

Prerequisites

Go implementation:

brew tap tinygo-org/tools && brew install tinygo
cargo install wasm-tools
npm install -g @bytecodealliance/jco

Python implementation:

pip install -e path/to/sdks/python

Ollama (optional — falls back to simulated LLM):

brew install ollama
ollama run llama3.2   # pulls ~2GB on first run

All tests pass without any LLM running. When Ollama is available, LLMGatewayActor switches automatically from the simulated fallback to real inference.

Build and Test

# Python
cd examples/python/apps/minihermes
./build.sh                       # componentize-py → WASM Component Model binary
pytest test_minihermes.py -v     # unit tests, no live node required

# Go
cd examples/go/apps/minihermes
./build.sh                       # TinyGo → wasm-tools → component binary
go test ./... -v                 # unit tests, no live node required

Integration Tests Against a Live Node

# Start a PlexSpaces node first — see docs/getting-started.md
cd examples/go/apps/minihermes
./test.sh 8091                   # 21 steps, roughly 2 minutes

The test script covers the full actor tree:

# Basic agent chat
ask "agent" '{"op":"chat","message":"Hello","session_id":"test-1"}'

# Tool use — triggers guardrail check before execution
ask "agent" '{"op":"chat","message":"Calculate 42 * 17","session_id":"test-1"}'

# Hot-swap LLM provider
ask "llm_gateway" '{"op":"switch_provider","provider":"anthropic","model":"claude-opus-4-8"}'

# Register a new tool at runtime
ask "tool_executor" '{
  "op":"register_tool","name":"weather",
  "description":"Get weather for a city",
  "input_schema":{"type":"object","properties":{"city":{"type":"string"}}},
  "handler_type":"service_link",
  "handler_config":{"link_name":"openweather","path":"/data/2.5/weather","method":"GET"}
}'

# Create a cron job
ask "cron_scheduler" '{
  "op":"create_job","job_id":"morning-digest",
  "prompt":"Summarize pending tasks","schedule":"every_24h","session_id":"cron-main"
}'

# Block a tool via guardrails
ask "guardrails" '{"op":"set_policy","tool_name":"delete_file","decision":"deny"}'

# Query health across all service groups
ask "health_monitor" '{"op":"get_health"}'

# Query audit trail for tool executions
ask "audit_event" '{"op":"query_events","event_type":"tool_executed"}'

Future Enhancements

These patterns extend naturally once the actor foundation is in place:

  • Vector memory. Replace TupleSpace keyword matching in match_skills with embedding-based similarity search. The interface stays the same and SkillStoreActor still answers match_skills messages.
  • Multi-agent skill sharing. When one agent extracts a skill, broadcast the skill ID via TupleSpace to all other agent instances. Each agent loads the skill on the next match. The fleet improves together.
  • Streaming responses. Replace ask() for LLM completions with chunked Channel delivery. The agent sends each token back to the user as it arrives instead of buffering the entire response.
  • Skill versioning. Store each procedure update as a new BlobStorage object with a version suffix. SkillStoreActor tracks the current version in KV and can roll back if a skill causes regressions.
  • TypeScript implementation. The same 12-actor pattern written in TypeScript and compiled to WASM, which would be useful for teams already working in the Node ecosystem.
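To make the vector memory idea concrete, here is a rough sketch of similarity-based matching using only the standard library. Everything in it is assumed for illustration: the skill_index dictionary stands in for whatever vector store backs the actor, the three-dimensional vectors are toy values rather than real embeddings, and the top_k and min_score parameters are hypothetical. A real deployment would get vectors from an embedding model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_skills(query_vec: Sequence[float],
                 skill_index: Dict[str, Sequence[float]],
                 top_k: int = 3,
                 min_score: float = 0.5) -> List[str]:
    """Return up to top_k skill IDs, best match first, scoring at least min_score."""
    scored = [(cosine_similarity(query_vec, vec), skill_id)
              for skill_id, vec in skill_index.items()]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill_id for _, skill_id in scored[:top_k]]


# Toy index: hypothetical skill IDs mapped to made-up vectors.
skill_index = {
    "deploy_rollback": [0.9, 0.1, 0.0],
    "parse_csv": [0.0, 0.2, 0.9],
    "restart_service": [0.8, 0.3, 0.1],
}
matches = match_skills([1.0, 0.2, 0.0], skill_index)
```

Because only the scoring changes, a handler that answers match_skills messages could swap keyword overlap for a function like this without changing what callers send or receive.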

Conclusion

MiniHermes is a proof of concept, not a production agent platform. What it demonstrates is a way of thinking about agent systems that is different from the standard monolith approach. The Hermes Agent design from Nous Research gives us three powerful ideas: prompt discipline, multi-step tool loops, and skill accumulation. Those ideas work whether the agent runs in one Python process or across 12 actors. What changes is everything else, e.g., what happens when a component crashes, how you update a policy without restarting, how you prevent one tenant’s data from touching another’s, and how you keep conversations going past the model’s context limit.

The actor model with PlexSpaces provides a set of primitives (KV, TupleSpace, BlobStorage, Channel, DistributedLock, SendAfter, GenFSM, GenEvent, Workflow) that map directly onto the operational problems an agent system faces. State durability, fault isolation, leader election, non-blocking audit, validated state machines, durable workflows: each is one primitive. The full source for both Python and Go implementations lives at github.com/bhatti/PlexSpaces. The architecture is meant to be a starting point, not a finished product.



June 16, 2026

Growing as a Software Engineer in the Age of Agentic Coding

Filed under: Computing — admin @ 10:14 am

A self-guided path for junior and mid-level engineers whose core skills are quietly eroding


I have observed a stark productivity gap between a senior engineer using an agentic coding tool and a junior engineer using the same tool. The senior engineer moves faster, catches more problems, and produces better outcomes, while the junior engineer often ships code that looks finished but quietly breaks at the seams. The reason is not the tool; it is what each engineer brings to the tool.

Senior engineers are more effective with agentic AI because they have already built the skills that make AI useful: writing precise specifications, designing systems that hold together under real load, spotting code smells in a diff, understanding trade-offs between correctness and performance, and maintaining the conceptual integrity of a codebase across hundreds of changes. These skills aren’t separate from coding experience; they are products of it, built up over years of writing code, breaking things, debugging production incidents, and internalizing the consequences of design decisions.

Junior and mid-level engineers haven’t built those skills yet, which used to be fine because the path to building them was clear: you wrote code, made mistakes, got reviewed by someone who caught what you missed, and learned. Repeat for several years. The trouble is that agentic coding short-circuits exactly that path. When an agent generates the code, a junior engineer faces a key problem: they cannot reliably distinguish between code that is actually correct and code that is plausibly correct. Agent-generated code looks right in the narrow context where it was generated: the function compiles, the tests pass, the logic seems sound. But a system is not a collection of locally correct functions. It is a web of interacting decisions, constraints, and invariants that only hold together if someone understands the whole. Senior engineers have built that whole-system mental model through years of implementation.

Some companies have responded to this by stopping junior engineer hiring entirely, reasoning that agents can now fill entry-level roles. This is a serious mistake, and a slow-moving disaster. It optimizes for short-term output while eliminating the pipeline through which every senior and principal engineer is eventually produced. Today’s junior engineers are tomorrow’s architects. When companies stop hiring and developing them, they are consuming the seed corn and they will feel it in three to five years when there are no experienced engineers left to review what the agents produce.

The risk doesn’t stop at junior engineers. Senior engineers face a subtler version of the same problem. When you stop writing code regularly, the skills built through writing it (the intuition for design, the eye for code smells, the ability to hold a large system in your head) begin to decay. Specification writing becomes abstract rather than grounded. Architecture decisions lose their connection to implementation reality. Code review gets shallower because you’re no longer maintaining the mental model of how things fit together. The most important thing being lost is not any individual skill but shared understanding, what Fred Brooks called conceptual integrity: the coherence of design philosophy across an entire system that only exists when the people building it have deeply internalized how it works.

This post is about what to do about all of it. How to deliberately build the skills that agentic coding doesn’t hand you. How to maintain the skills you’ve built, as the nature of the work changes. And how to grow from junior to senior to principal in an era when the traditional feedback loop between design and implementation has been broken.


How Engineers Used to Grow

For decades, the career path followed a recognizable arc. You joined at entry level, wrote code, broke things, got your code torn apart in review, made better mistakes, and gradually developed what researchers call tacit understanding, the ability to look at a system and feel what is wrong before you can fully articulate why.

The Dreyfus model of skill acquisition describes this progression across five stages:

| Stage | Characteristics | Decision-making | Knowledge |
|---|---|---|---|
| Novice | Follows rules rigidly, no situational judgment | None | Context-free |
| Advanced Beginner | Recognizes patterns, treats all aspects equally | Without context | Limited |
| Competent | Plans consciously, sees longer-term goals | Analytical | In context |
| Proficient | Grasps situations holistically, uses maxims | Analytical → Intuitive | Holistic |
| Expert | No rules needed, intuitive grasp of situations | Intuitive | Deep tacit |

The Japanese martial arts concept of Shu Ha Ri mirrors this progression. First you follow the form faithfully (Shu: learn the rules). Then you find the exceptions and break with tradition (Ha: question the rules). Then form dissolves into natural action (Ri: transcend the rules). You cannot skip stages. The competent engineer who writes their first distributed system will make mistakes the expert would never make because they haven’t yet built the mental model that only comes from doing the work and suffering the consequences.

What made this work was consequence. Writing code gave you direct feedback. A missing lock caused a race condition. A clever abstraction became unmaintainable by the third person to touch it. A shared base class six levels deep broke four products when a parent changed. These lessons were visceral, and they stuck. The Dreyfus model would say you accumulated the situational exposure that moves you from rule-following to intuition.

Books codified what masters had learned. The Pragmatic Programmer showed how to develop craft. Code Complete provided the vocabulary for code quality. A Philosophy of Software Design showed what makes modules deep or shallow. The Mythical Man-Month showed why adding engineers to a late project makes it later: coordination cost, not coding hours, drives timelines. Research quoted there put coding at roughly 14% of total project effort. Requirements, design, testing, debugging, coordination, documentation, and operations consumed the rest. Agentic coding has compressed that 14% toward zero. The other 86% remains entirely human.


The Disruption: Design and Build Are No Longer Learned Together

In traditional software development, design and build were not two separate activities. They were one activity experienced from two angles. When you wrote the code yourself, you felt every consequence of your design decisions in real time. A bad abstraction made your own implementation painful. A missing transaction boundary caused a bug you personally had to trace to its source at midnight. You didn’t just observe these consequences. You lived them, and that is what made the lessons stick.

Agentic coding severs this connection. The engineer writes a specification, the agent produces code, and the engineer reviews the result. This workflow feels productive. It is often highly productive for engineers who already have the judgment to specify well and review rigorously. But for engineers who are still building that judgment, it removes the primary mechanism that builds it.

The specific failure mode for junior and mid-level engineers is the plausibility trap. Agent-generated code looks correct in the narrow local context: functions are clean, tests pass, the logic holds for the cases the spec described. What the code often lacks is correctness at the system level, the consistency guarantees that span service boundaries, the failure modes that only appear under concurrent load, the invariants that hold only if you understand the domain well enough to define them. A senior engineer reviewing that code has a whole-system mental model built through years of implementation experience. They feel when something is off even before they can articulate why. A junior engineer doesn’t have that model yet, and reviewing agent-generated code without it is like trying to spot a structural flaw in a building you’ve never seen the blueprints for.

Bertrand Meyer, in his analysis in Communications of the ACM, makes this point precisely: AI-generated code creates a dangerous psychological bias because it is significantly harder to spot a subtle logical flaw in well-structured generated code than in the messy human-written code reviewers are used to. Cleanliness produces false confidence. Agents write plausible code. Plausible is not the same as correct, and the gap between the two is exactly where junior engineers without deep mental models get stuck.

For senior engineers, the risk is different but equally real. Specification, design, architecture, code review, and debugging are not static skills you acquire once and keep forever. They are maintained through practice: the practice of building systems, not just reviewing them. When senior engineers stop writing code regularly, their design intuitions gradually lose contact with implementation reality. Architectural decisions start floating free from the constraints that make them achievable. Specifications become abstract rather than grounded in how things actually work. Code review gets shallower because the reviewer is no longer holding a live mental model of how the system fits together under pressure. The decay is slow and invisible until it isn’t. I had previously seen this decay when principal and staff engineers stopped writing code and became pure architects, but agentic coding is making it more prevalent.

In AI Writes Code. You Own the Design. Here’s How to Keep It That Way, I described how AI agents resemble offshore teams more than co-located colleagues: they have a narrow context window, they lack shared understanding of your codebase, they produce locally correct work that misses the bigger picture, and they have no memory between sessions. Every session starts from zero. Amazon AWS teams learned this the hard way: AI-generated code looked right, passed review, and then caused production incidents. Their response was to significantly tighten review policies. When a production incident costs customers millions or exposes a security breach, you cannot file a bug against Cursor or Claude Code. The engineer who approved the change is accountable.

What’s at stake, underneath all of this, is shared understanding. Fred Brooks called it conceptual integrity, the coherence of design philosophy that runs through an entire system. Conceptual integrity doesn’t live in a document. It lives in the heads of engineers who have thought deeply about the system, implemented parts of it themselves, debugged its failures, and built up a shared mental model of how the pieces fit. That shared model is what gets lost when design and implementation are permanently separated. It is also the most important and hardest-to-recover thing a team can lose. Code can be rewritten. Conceptual integrity, once gone, takes years to rebuild. In the meantime, the system accumulates exactly what Brooks warned against: many locally reasonable decisions that don’t add up to a coherent whole.

We cannot give up on junior and mid-level engineers developing real skills, even as agents handle more of the typing. Code reviews by senior engineers help, but we mostly learn by doing, not by watching. Engineers still need to understand how code works, how it fits into the larger system, and what the trade-offs mean under real conditions. What we build and what we understand are not separate. Pulling them apart entirely is a quality risk the whole team will pay for, slowly at first and then all at once.


The Two Skill Trees

Engineering growth has always required two parallel tracks. Agentic coding affects each differently.

Hard skills are the technical capabilities: designing systems, understanding trade-offs, debugging complex failures, recognizing code smells, reasoning about correctness under concurrency, and mastering both functional and non-functional requirements. The traditional path built these incrementally through writing code, breaking things, and fixing them. Agentic coding removes that feedback loop without replacing it.

Soft skills are the interpersonal and organizational capabilities: writing clearly, building consensus, managing ambiguity, estimating honestly, communicating with non-technical stakeholders, mentoring others, and owning outcomes across an entire project lifecycle. These have always mattered for growth from mid-level to senior and from senior to principal. Agentic coding hasn’t reduced their importance; it has raised the bar, because the differentiating value of a senior engineer shifts away from code production and toward judgment, communication, and design thinking.

Both tracks require deliberate practice. The diagram below shows the full skill landscape across career levels, mapped to the hard/soft split.


The Career Levels in Detail

Before getting to specific advice, it helps to have a clear picture of what each level actually requires.

Junior: You own software components and work on well-defined problems. You produce high-quality code under guidance and learn from review feedback. You collaborate across the full development lifecycle, including code, tests, deployment, and documentation, but you rely on peers and managers for guidance on design. You are expected to deliver reliably within a clear scope, and you actively seek to learn.

Mid-level: You are an autonomous contributor, owning features, not just components. You design software solutions for difficult problems, though you still seek guidance on architectural strategy. You coach junior engineers. You make priority trade-offs between feature work and operational work. You participate meaningfully in code reviews, not just catching bugs but providing direction.

Senior: You lead multi-engineer projects and own team-level architecture. You work on complex problems with multiple conflicting constraints. You write for both technical and non-technical audiences. You solve problems that don’t yet have a defined technology strategy. You balance short-term delivery against long-term architectural health. You become a force multiplier and your presence should make everyone meaningfully better.

Staff / Principal: You lead across an organization, not just a team. You define technical strategy and roadmaps that span multiple teams. You take on intrinsically hard problems like major bottlenecks and undefined high-impact opportunities. You align teams toward cohesive technical visions. You earn influence through credibility and results, not title.

Junior engineers are largely tactical. Seniors span tactical and operational. Staff and principal engineers are primarily operational and strategic.


Hard Skills: What to Build Deliberately

1. Learn to Write Specifications

The most immediately practical hard skill in the agentic era is writing precise specifications. Agents produce what you specify. Vague specifications produce code that fills gaps with training-data assumptions, which may or may not match your domain. This is a learnable craft.

I wrote the you-got-skills framework to demonstrate how to build specification skills using RFC 2119 discipline. Write a one-page spec before prompting an agent, every time. For significant features, use the full design document structure I previously shared: problem statement, proposal with trade-offs, alternatives considered, non-functional requirements, and rollout plan. The act of writing this forces you to make decisions you were previously leaving implicit.

2. Build a Design Sense

One of the clearest failure patterns in junior and mid-level engineers using agentic coding is an underdeveloped design sense. They can describe what they want. They struggle to explain why one design is better than another, or to recognize when generated code silently violates conceptual integrity, which is identified in The Mythical Man-Month as the single most important property of a well-designed system.

Build this sense deliberately. Read A Philosophy of Software Design and practice the deletion test: if you deleted this module, where would the complexity go? Deep modules with small interfaces earn their place. Shallow pass-throughs add indirection without value. AI defaults to shallow modules, lots of small classes, each delegating to the next. Learning to recognize this pattern and push back on it is a concrete skill you can develop right now.

In Applying Domain-Driven Design and Clean/Hexagonal Architecture to MicroServices, I shared how Domain-Driven Design can be employed for an application architecture. When AI generates code for your domain, it has no idea what your domain means. Practice making invalid states unrepresentable: sum types that enumerate valid states, state machines that encode valid transitions, parse-don’t-validate at boundaries. These design patterns matter more in the agentic era because the compiler becomes your code reviewer when humans can’t catch everything. When AI generates code within a well-typed system, category errors that would slip through casual review become compile errors.
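As a small illustration of those three patterns, here is a sketch in Python. The deployment domain, the class names, and the Percent type are all hypothetical, chosen only to show the shape. Python has no compiler to enforce this, so a static type checker such as mypy plays the reviewer role, and the runtime checks cover the rest.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Sum type: a deployment is in exactly one of these states, and each state
# carries only the fields that make sense for it.
@dataclass(frozen=True)
class Pending:
    pass


@dataclass(frozen=True)
class Running:
    started_at_ms: int


@dataclass(frozen=True)
class Succeeded:
    started_at_ms: int
    finished_at_ms: int


@dataclass(frozen=True)
class Failed:
    started_at_ms: int
    error: str


DeployState = Union[Pending, Running, Succeeded, Failed]


def start(state: DeployState, now_ms: int) -> Running:
    """State machine edge: Pending -> Running is the only way to start."""
    if not isinstance(state, Pending):
        raise ValueError(f"cannot start from {type(state).__name__}")
    return Running(started_at_ms=now_ms)


def finish(state: DeployState, now_ms: int,
           error: Optional[str] = None) -> Union[Succeeded, Failed]:
    """Running -> Succeeded or Failed; every other transition is rejected."""
    if not isinstance(state, Running):
        raise ValueError(f"cannot finish from {type(state).__name__}")
    if error is not None:
        return Failed(started_at_ms=state.started_at_ms, error=error)
    return Succeeded(started_at_ms=state.started_at_ms, finished_at_ms=now_ms)


# Parse, don't validate: convert raw input into a typed value once, at the
# boundary, so the rest of the code never has to re-check it.
@dataclass(frozen=True)
class Percent:
    value: int

    def __post_init__(self) -> None:
        if not 0 <= self.value <= 100:
            raise ValueError(f"percent out of range: {self.value}")

    @classmethod
    def parse(cls, raw: str) -> "Percent":
        return cls(int(raw.strip()))  # int() raises ValueError on junk input
```

There is no way to build a Succeeded value without a finish time, or a Failed value without an error, which is the point: a "success" record that never actually ran cannot be expressed. A function that accepts a Percent can assume the range check already happened.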

Study the Stable Dependencies Principle: depend in the direction of stability. As illustrated in the reusability trap analysis, the most expensive bugs often don’t come from duplicated code; they come from code shared prematurely. Recognizing when DRY has become a liability is a senior-engineer skill that requires real practice to develop.

3. Develop a Nose for Code Smells and Code Review

Reviewing AI-generated code is not casual reading. Agents write clean, plausible code. The bugs that slip through are not obvious: missing idempotency tokens, race conditions that appear only under concurrent load, enum values that propagate without being handled by all consumers.

Build a structured review practice. Apply two explicit passes. The first pass looks for correctness and security: logic errors (off-by-one, null handling, TOCTOU races), security holes (injection, missing auth checks, hardcoded secrets), data loss risks, and error swallowing. The second pass looks for design: are modules deep or shallow? Are invalid states representable in the type system? Does this code separate commands from queries? Is the complexity justified by the actual problem, or has the agent added abstractions for a feature used by twelve people?

Practice this on every pull request you review, whether AI-generated or human-written. The structured passes build the intuition that experienced engineers call a “nose for code smells”.

4. Master Non-Functional Requirements

Most junior engineers understand functional requirements. Senior engineers understand non-functional requirements: how reliably, under what conditions, and with what failure behavior the system must work. This is arguably the most important distinction on the path from mid-level to senior.

When you read a feature request, train yourself to immediately ask: what is the latency budget, and at what percentile? What is the consistency model between these two data stores? What happens if this operation half-succeeds? What’s the blast radius if this component fails completely? What happens at 10x current load? These questions are what agents cannot answer from a vague prompt. In Failures in MicroService Architecture, I shared a number of production issues that I experienced with distributed systems. You can apply the outbox pattern, circuit breaker, retry with jitter, bulkheads and other patterns to remedy common production issues.
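Of the patterns just listed, retry with jitter is the easiest to show in a few lines. The sketch below uses exponential backoff with "full jitter" (sleep a random amount between zero and the current backoff cap). The function name, its parameters, and the default exception types are my assumptions for illustration; the sleep and rng arguments exist only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call operation(), retrying transient failures with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The backoff cap doubles each attempt, bounded by max_delay_s.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Full jitter: spread retries so clients do not retry in lockstep.
            sleep(rng.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

Two review questions fall straight out of this sketch: is the wrapped operation idempotent (a retry after a half-succeeded write is exactly the failure mode asked about above), and which exceptions are truly transient? Anything outside retry_on, such as a validation error, propagates immediately instead of being retried.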

5. Keep Your Hands in the Code

Agentic coding creates pressure to delegate implementation entirely. Resist it. You do not build judgment about systems you have never built yourself. Write the spike yourself before committing to full design. Implement the critical path at least once, even if an agent later handles the boilerplate. Trace the execution of generated code in a debugger until you understand what it actually does before approving it for production.

This matters most for debugging. When something fails in production, the mental model you’ve built through implementation is what lets you form hypotheses quickly. Engineers who have only reviewed AI-generated code without deeply understanding it will struggle to diagnose the failures that code produces. The CACM analysis shows that AI-generated code introduces logical and concurrency bugs in clean-looking code that humans find harder to spot than equivalent bugs in messy human-written code.

Furthermore, as agentic coding produces more and more code we don’t fully own mentally, understanding decay sets in. Storey calls this cognitive debt. A team can have low technical debt while sitting on a mountain of cognitive debt where no one can confidently predict the impact of a change. Over time, no single engineer holds the complete picture of how the system works. This makes production incidents progressively harder to diagnose. Keeping your hands in the code, owning critical-path implementations, and using agents to explain generated code you don’t immediately understand all help keep that debt in check.

6. Learn Formal Methods Basics

One underappreciated direction for junior/mid-level engineers is to begin learning specification and verification techniques, not as academic exercises but as practical tools for the agentic era. I shared my experience applying TLA+ for specifications in Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude. You don’t need to begin with a full specification language, though. Start with property-based testing: instead of writing examples, write invariants that your system must maintain regardless of input. Use static analysis tools and learn to interpret what they find. Write explicit pre-conditions and post-conditions for complex functions, even as comments. These habits build the specification discipline that makes your agent prompts more precise and your reviews more effective.
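To make the property-based idea concrete, here is a minimal sketch that uses no library: a small seeded generator produces random inputs, and each check asserts an invariant rather than a specific example. The function under test (dedupeSorted) and all helper names are hypothetical; a real project would more likely reach for a dedicated property-testing library.

```typescript
// Hypothetical function under test: sort numbers and drop duplicates.
function dedupeSorted(xs: number[]): number[] {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted.filter((x, i) => i === 0 || x !== sorted[i - 1]);
}

// Small deterministic PRNG (linear congruential) so failures are reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Random array of small integers; the narrow value range forces duplicates.
function randomArray(rng: () => number): number[] {
  const length = Math.floor(rng() * 20);
  return Array.from({ length }, () => Math.floor(rng() * 10) - 5);
}

// Run a property against many generated inputs; return the first counterexample.
function checkProperty(
  property: (input: number[]) => boolean,
  runs = 500,
  seed = 42,
): number[] | null {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const input = randomArray(rng);
    if (!property(input)) return input;
  }
  return null;
}

// Invariant 1: the output is strictly increasing (sorted, no duplicates).
const strictlyIncreasing = (xs: number[]): boolean =>
  dedupeSorted(xs).every((x, i, out) => i === 0 || out[i - 1] < x);

// Invariant 2: the output holds exactly the distinct values of the input.
const sameValueSet = (xs: number[]): boolean => {
  const out = dedupeSorted(xs);
  return xs.every((x) => out.indexOf(x) !== -1) && out.every((x) => xs.indexOf(x) !== -1);
};

// Invariant 3: applying the function twice changes nothing (idempotence).
const idempotent = (xs: number[]): boolean =>
  JSON.stringify(dedupeSorted(dedupeSorted(xs))) === JSON.stringify(dedupeSorted(xs));

console.log(checkProperty(strictlyIncreasing), checkProperty(sameValueSet), checkProperty(idempotent));
```

A null result means no counterexample was found in the generated runs. That is evidence, not proof, which is why the invariants themselves are the valuable artifact.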


Soft Skills: What Separates Mid-Level from Senior from Principal

7. Write with Precision and Clarity

Writing is the highest-leverage soft skill for any engineer who wants to grow. Design documents, post-mortems, and stakeholder communications all require the same underlying capability: translating technical thinking into prose that creates shared understanding.

Practice this deliberately. Write a design document for every significant thing you build, using the full structure described in How Not to Write a Design Document: problem statement, proposal, trade-offs, alternatives considered, non-functional requirements, rollout plan. Show it to a senior engineer. Ask what questions it fails to answer. Design documents are also how you develop design skills. A bad design doc does exactly what a bad design does: it makes the solution sound inevitable, skips trade-offs, and pushes hard questions into implementation. That feels fast until production starts collecting interest on every shortcut.

8. Bring Clarity to Ambiguity

The most important skill a senior engineer develops is the ability to look at a fuzzy problem and make it concrete. This is the single most valued contribution that humans still provide in the agentic era: not the code, but the thinking that makes the code correct.

Ambiguity reduction works in both directions. On the problem side: understand the actual customer need before finalizing a solution, push back on specs that describe a solution rather than a problem, ask what the real constraint is. On the solution side: identify which design decisions are reversible versus which are one-way doors. Practice this in every design review, every planning discussion, every incident retrospective.

9. Build Alignment and Consensus

The transition from proficient to expert in any domain requires operating at the social level: building consensus among people with competing interests, aligning a technical direction through an organization that has other priorities, and navigating disagreements constructively. The trust equation from Maister et al. shows that trust has four components: credibility, reliability, intimacy (safety), and self-orientation (does it serve the system or you?). Engineers who lose influence at the senior and principal level almost always fail on the fourth element. Proposals that come across as serving “my architecture” rather than “our actual problem” collapse trust fast.

Build alignment by listening before proposing. Spend time understanding what actually hurts the team before advocating for a technical direction. Frame proposals in terms of reduced toil and reduced uncertainty instead of architectural purity. Find a long-standing pain and solve it visibly. The Aikido principle from Jerry Weinberg applies here: center, enter, turn. First be aware of yourself and what you want to accomplish. Then enter the world of the other person. Then together turn the energy in a more effective direction.

10. Communicate Upward in Business Terms

Translating technical decisions into business impact separates senior engineers from principals. It is the ability to tell a VP, concisely, what the risk is and what it costs to address it. Learn the metrics that matter to leadership: revenue impact, customer retention, incident cost, deployment frequency, engineer productivity. Practice expressing technical proposals in those terms. The opposite is the failure mode highlighted in How Senior Engineers Lose Trust: communicating technical complexity without translating it into business impact, and focusing on engineering outputs.

11. Estimate Honestly and Decompose Work Well

Engineers who consistently underestimate erode trust. Engineers who consistently overestimate become known as blockers. Honest estimation with explicit uncertainty ranges, clear assumptions, and candid identification of the biggest risks is a key skill.

Three practices make estimation better. First, decompose into vertical slices, not horizontal layers. A vertical slice cuts through all layers and produces something independently demoable. Horizontal slicing delays feedback because you don’t know if the feature works until the last layer is complete. Second, use three-point estimation for commitments: (Best + 4×MostLikely + Worst) / 6, and present ranges rather than single numbers. Third, remember that capacity is never 100%: budget explicitly for KTLO (keep-the-lights-on) work such as operations, incident response, and technical debt.
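The arithmetic is simple enough to sketch. The weighted mean below is the formula from the text; the spread of (Worst - Best) / 6 is the conventional PERT companion and is an addition here, as are the function names and the 0.75 capacity figure used in the example.

```typescript
interface ThreePointEstimate {
  best: number;       // optimistic case, in working days
  mostLikely: number; // realistic case
  worst: number;      // pessimistic case
}

// Weighted mean from the formula above: (Best + 4×MostLikely + Worst) / 6.
function expectedDays(e: ThreePointEstimate): number {
  return (e.best + 4 * e.mostLikely + e.worst) / 6;
}

// Conventional PERT spread (an assumption, not from the text): (Worst - Best) / 6.
function spreadDays(e: ThreePointEstimate): number {
  return (e.worst - e.best) / 6;
}

// Present a range, stretched by the share of time actually available for project work.
function commitmentRange(e: ThreePointEstimate, capacity: number): { low: number; high: number } {
  if (capacity <= 0 || capacity > 1) throw new Error('capacity must be in (0, 1]');
  const mean = expectedDays(e);
  const spread = spreadDays(e);
  return { low: (mean - spread) / capacity, high: (mean + spread) / capacity };
}

// Illustrative numbers only: 3 days best case, 5 most likely, 13 worst case.
const estimate: ThreePointEstimate = { best: 3, mostLikely: 5, worst: 13 };
const range = commitmentRange(estimate, 0.75);
console.log(expectedDays(estimate), range.low.toFixed(1), range.high.toFixed(1));
```

With these inputs the weighted mean is 6 days, noticeably above the most likely value of 5, because the long pessimistic tail pulls it up. At 75% capacity the range you would communicate is roughly 6 to 10 days.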

12. Own Outcomes Beyond Your Code

The clearest signal of an engineer ready for senior responsibility is willingness to own the work nobody wants to do: the failing test that has been skipped for months, the runbook that was never written, the technical debt accumulating in the corner nobody touches, the onboarding documentation that every new hire struggles with. This is what some call being the janitor, taking responsibility for team health and code health. It builds organizational trust faster than any individual feature. Own incidents that aren’t yours. When a production problem occurs on your team, treat it as your problem regardless of who wrote the code. In Writing Post Mortems That Actually Make You Better: A Practitioner’s Guide, I explained how to use the Five Whys and the Swiss Cheese model for documenting incident post-mortems.

13. Become a Go-To Person

Focused expertise builds the kind of reputation that earns you higher-impact work. The path to being a go-to person has three branches: project ownership, technology expertise, and domain expertise. Pick one to start. Host a learning session on something you know well. Write about it internally. Help others who are stuck on it.

14. Mentor Others

Teaching is one of the fastest ways to consolidate your own understanding. When you explain a design decision to a junior engineer, you discover exactly what you do and don’t understand. When you give code review feedback that helps someone see a flaw they missed, you sharpen your own eye.

In the agentic era, junior engineers need mentorship more than ever because the traditional mechanism of learning through building and breaking code is less available. Senior engineers who help juniors understand why AI-generated code works the way it does, how to critique it structurally, and how to reason about trade-offs are providing something genuinely important. The psychological safety research from Google’s Project Aristotle applies here: teams where members feel safe raising concerns, asking questions, and challenging designs outperform teams where they don’t. You build that culture one mentoring conversation at a time.


The T-Shape and Broken Comb Model

The most useful framework for thinking about hard skill investment is the T-shape: one area of genuine depth (the vertical bar) combined with broad familiarity across adjacent areas (the horizontal bar). As engineers progress toward principal level, the shape often becomes what practitioners call a broken comb: multiple verticals of depth across different domains, connected by broad horizontal understanding. A principal engineer might go deep in distributed systems, in observability, and in the security model of their specific domain, while maintaining enough breadth to lead design conversations across the full stack.


A Concrete Self-Guided Growth Plan

Here is a practical, time-bounded path for engineers at each stage.

If you are a junior engineer (0–3 years):

  • Write a one-page spec before prompting an agent. Compare what the agent produced to what you specified.
  • Ask to implement at least one non-trivial feature entirely yourself, even if it takes longer.
  • Read The Pragmatic Programmer and Code Complete.
  • Request structured feedback on every code review you submit.
  • Use agents to explain generated code you don’t understand.

If you are a mid-level engineer (3–6 years):

  • Write a full design document for the next significant feature you build. Share it with a senior engineer and ask specifically what questions it fails to answer.
  • Own one domain on your team completely: its documentation, its monitoring, its failure modes, its onboarding.
  • Start hosting one internal learning session per quarter on something you know well. Write it up afterward.
  • Apply a structured two-pass review to every pull request you review. Track what you catch over a month.
  • Read A Philosophy of Software Design. Apply the deletion test and bounded context thinking to your current codebase.
  • Write one post-mortem per incident using the Five Whys structure.

If you are a senior engineer aiming for staff/principal:

  • Lead one project that coordinates work across multiple engineers. Own the design and run the design review. Drive the post-project retrospective.
  • Translate one technical proposal into business impact language: metrics, incident cost, customer effect.
  • Mentor junior engineers specifically in how to critically evaluate AI-generated code.
  • Identify the most painful systemic problem on your team, the thing everyone complains about and nobody fixes. Fix it, document it, and share what you learned.

Ongoing, at every level:

Keep your hands in the code. The fraction of code that engineers write themselves will keep shrinking, but understanding what the code does, how it fits the larger system, and what its failure modes are requires someone who can read it critically, reason about it deeply, and debug it under pressure.


What We Cannot Give Up

There is real pressure in many organizations to reduce engineering involvement in requirements, design, and review to automate the entire lifecycle. This deserves serious assessment. Agents accelerate delivery but they do not absorb accountability. When code fails in production, the customer doesn’t care whether the bug was introduced by a human or a model. The engineer who approved it is responsible. Code review, even when partially automated, still requires human engineers who understand the system well enough to know what they’re reviewing. Junior engineers who bypass the developmental stages that build that understanding will produce reviews that miss what matters. Organizations that accept this trade-off in exchange for short-term velocity will eventually pay compounding interest.

The goal is not to resist agentic coding. The productivity gains are real and the trend is irreversible. The goal is to keep all three in check: technical debt in the code, cognitive debt in the team’s shared understanding, and intent debt in the artifacts. Agentic coding, used carelessly, accelerates all three simultaneously.


June 13, 2026

The Reusability Trap: When DRY Becomes a Liability

Filed under: Computing,Technology — admin @ 11:32 am

Reusability sounds like an obvious good practice. Write it once, use it everywhere. The Don’t Repeat Yourself (DRY) principle was popularized by The Pragmatic Programmer, and every senior developer preaches it. But the most expensive production bugs I’ve seen didn’t come from code that was duplicated. They came from code that was shared when it shouldn’t have been. This post is about what happens when reusability becomes an obsession. I’ll show you the patterns that cause the most damage, and what to do instead. And I’ll end with a new angle that I think is underappreciated: why agentic AI coding assistants work dramatically better on well-designed, modular codebases and how the reusability trap actively makes them worse.


The Prophets Already Warned Us

The software industry has been here before. Fred Brooks warned about over-engineering in The Mythical Man-Month (1975):

“The general tendency is to over-design the second system, using all the ideas and frills that were cautiously sidetracked on the first one.”

Brooks also observed something cutting about reuse in practice: barriers to reuse sit on the consumer side, not the producer side. Yourdon estimated that reusable components require twice the effort of a one-shot component. Brooks put the multiplier at three. Parnas put it plainly:

“Reuse is something that is far easier to say than to do. Doing it requires both good design and very good documentation. Even when we see good design, which is still infrequently, we won’t see the components reused without good documentation.”

More recently, Sandi Metz landed on the same truth from a different angle:

“Duplication is far cheaper than the wrong abstraction.”

And Rob Pike, in the Go Proverbs:

“A little copying is better than a little dependency.”

These aren’t arguments against sharing code. They’re arguments against sharing code prematurely, before the right abstraction reveals itself. The cost of the wrong abstraction is front-loaded with apparent savings and back-loaded with compounding debt.


Part 1: Inheritance, The Reuse That Keeps on Costing

The Promise vs. The Reality

One of the pillars of object-oriented languages is inheritance for reuse. Two classes share behavior? Extract a base class, done. Here’s the actual cost breakdown:

| Approach | Cost to Create | Cost to Change | Bug Blast Radius |
| Duplicated code (2 copies) | | 2× (independent) | Local |
| Shared base class (inheritance) | 0.8× | 5–20× (understand all subclasses) | Cascading |
| Composition | 1.2× | 1× (swap implementation) | Local |

The savings of inheritance are front-loaded. Every future change requires understanding the entire hierarchy. In a system with 100+ subclasses, that’s not a 20% savings, it’s a 2000% tax on every modification.

Anti-Pattern: The Fragile Base Class

I worked on a system where a senior executive was obsessed with reusability. The result was inheritance chains 10 levels deep. The worst example: a control-plane listener that inherited from a data-plane input class, just to reuse TCP socket handling.

WorkerListener --> TcpDataInput --> BaseTcpIn --> BaseInput --> BaseStatusReporter --> Serviceable --> EventEmitter

The listener’s actual job was: accept TCP connections from workers, validate auth tokens, register workers, distribute config bundles, and receive heartbeats. But it inherited an event processing pipeline it never used, IP whitelisting via regex it never used, proxy protocol support it never used, and socket idle timeouts that could kill healthy long-lived worker connections.

This nested hierarchy was a continuous source of bugs: changes made in the parent classes broke the products that inherited them unexpectedly.

The Stability Trap

There’s a design principle that explains exactly why the fragile base class is so dangerous. The Stable Dependencies Principle (SDP), from Agile Software Development, says:

Depend in the direction of stability. A component is stable when many things depend on it and few things it depends on can change underneath it. A component is instable when few things depend on it and it changes frequently. The principle gives you a metric for this:

I = Ce / (Ca + Ce)

Where Ca is the number of components that depend on the component (afferent couplings, things that would break if you changed it), and Ce is the number of components it depends on (efferent couplings, things that could change and break it). I = 0 means maximally stable (everyone depends on it, it depends on nothing). I = 1 means maximally instable (nothing depends on it, it depends on everything).

The SDP rule: if component A depends on component B, then B’s instability score should be lower than A’s. You should depend on things that are more stable than you are, never less. Now look at what inheritance actually does to these scores.
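The metric and the rule are easy to compute from a dependency graph. The sketch below does it for a small, made-up graph: the component names echo the example above, but the edges (and the RegexLib, SocketLib, and ConfigLib nodes) are invented for illustration, and it only flags edges where the dependency is strictly less stable than its dependent.

```typescript
// Component name -> names of the components it depends on.
type DependencyGraph = Record<string, string[]>;

// I = Ce / (Ca + Ce); a component with no couplings at all is treated as stable (0).
function instability(graph: DependencyGraph, name: string): number {
  const ce = (graph[name] ?? []).length; // efferent: what it depends on
  const ca = Object.keys(graph).filter(
    (other) => other !== name && (graph[other] ?? []).indexOf(name) !== -1,
  ).length; // afferent: what depends on it
  return ca + ce === 0 ? 0 : ce / (ca + ce);
}

// Flag every edge A -> B where B is less stable (higher I) than A.
function sdpViolations(graph: DependencyGraph): Array<{ from: string; to: string }> {
  const violations: Array<{ from: string; to: string }> = [];
  for (const from of Object.keys(graph)) {
    for (const to of graph[from] ?? []) {
      if (instability(graph, to) > instability(graph, from)) violations.push({ from, to });
    }
  }
  return violations;
}

// Hypothetical graph: three consumers share TcpDataInput, which leans on a volatile helper.
const graph: DependencyGraph = {
  WorkerListener: ['TcpDataInput'],
  SyslogInput: ['TcpDataInput'],
  HttpInput: ['TcpDataInput'],
  TcpDataInput: ['ProxyProtocol'],
  ProxyProtocol: ['RegexLib', 'SocketLib', 'ConfigLib'],
  RegexLib: [],
  SocketLib: [],
  ConfigLib: [],
};

console.log(instability(graph, 'TcpDataInput')); // Ca = 3, Ce = 1
console.log(sdpViolations(graph));
```

In this graph TcpDataInput scores 0.25 and ProxyProtocol scores 0.75, so the edge between them is the one violation reported. Note what the metric cannot see: how often a component actually changes, which is the gap the next paragraphs describe.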

TcpDataInput in the example above has many consumers: it’s a shared base class used across the data plane. High Ca. That makes it look stable. But it’s also an actively maintained class that changes as data-plane requirements evolve: new connectors, security patches, protocol changes. Every change is a potential breaking change for every class that inherits it.

Inheritance creates a hidden stability inversion. By the metric, the base class looks stable (high Ca, so a low I score), but in practice it is volatile: it changes for reasons its consumers don’t control. Every class that inherits from it therefore secretly depends on something unstable.

This is why the principle matters beyond just “don’t change base classes carelessly.” The architecture itself needs to route dependencies in the direction of stability. Abstract interfaces are maximally stable (I = 0 by definition — they contain no implementation to change). Concrete implementations are instable. So:

  • Stable components should depend on abstract interfaces, not concrete implementations.
  • Instable components (leaf classes, frequently changing logic) should sit at the edge, depending inward toward stable abstractions.

The LSP Smell Test

The Liskov Substitution Principle (LSP) says: if S is a subtype of T, you should be able to substitute S anywhere T is expected without breaking anything. You’re violating LSP, and inheritance is the wrong tool, when you find yourself:

  • Overriding methods just to disable inherited behavior
  • Checking instanceof in calling code
  • Adding if (this instanceof ChildClass) in the parent
  • Setting this.checkDiskUsage = new NOOPDiskUsageChecker() in the constructor

I’ve seen a RingBufferOut that extended FileSystemOutput and used approximately 200 lines of it, a 5% utilization rate. It disabled disk usage checking, eliminated staging/upload separation, disabled orphan file reconciliation, and completely overrode bucket naming and retention logic. The ring buffer carried 2,700 lines of dead weight: cloud upload logic, parquet format support, staging directory management, none of which it used. The “savings” from inheritance were illusory. The dead weight made every change a minefield.

The rule of thumb: if you override more than 30% of inherited methods, or disable features in your constructor, you want composition, not inheritance.

Anti-Pattern: The Serviceable Base That Taxes Everything

A “Serviceable” base class forced EventEmitter onto 102 subclasses:

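import { EventEmitter } from 'events';

class Serviceable extends EventEmitter {
  private static INSTANCES: Serviceable[] = []; // Global tracking
  private serviceInterval: ReturnType<typeof setInterval>;

  constructor(interval: number) {
    super(); // EVERY subclass is now an EventEmitter, whether it emits or not
    Serviceable.INSTANCES.push(this);
    this.serviceInterval = setInterval(() => this.service(), interval);
  }

  protected service(): void {} // periodic work; subclasses override this

  destroy(): void { clearInterval(this.serviceInterval); }

  static destroyAll(): void {
    Serviceable.INSTANCES.forEach(s => s.destroy()); // kills everything, all at once
  }
}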

// Result: 102 classes inherit this. Many NEVER emit events:
class DiskUsageReporter extends Serviceable {}  // never emits
class BackupManager extends Serviceable {}       // never emits
class HealthMonitor extends Serviceable {}       // never emits
class MetricsBatcher extends Serviceable {}      // never emits

The reasoning was: many components need a periodic timer, and EventEmitter is useful, let’s put both in a base class for reusability. The result: 102 classes carry EventEmitter’s overhead regardless of whether they ever emit a single event. Worse, the static INSTANCES array creates hidden coupling between all 102 subclasses. A destroyAll() call kills backup managers, metric batchers, and health monitors indiscriminately, no lifecycle ordering, no dependency-aware shutdown.

Fix it with composition:

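// Timer is a composable utility, not an inheritance tax
class ServiceTimer {
  private handle?: ReturnType<typeof setInterval>;

  constructor(private callback: () => Promise<void>, private intervalMs: number) {}
  start(): void {
    // Log callback failures instead of leaving unhandled promise rejections
    this.handle = setInterval(() => { this.callback().catch(console.error); }, this.intervalMs);
  }
  stop(): void { clearInterval(this.handle); }
}

class MetricsBatcher {
  private timer: ServiceTimer;

  constructor(interval: number) {
    this.timer = new ServiceTimer(() => this.flush(), interval);
  }

  start(): void { this.timer.start(); }
  stop(): void { this.timer.stop(); }

  private async flush(): Promise<void> { /* send the buffered metrics */ }
  // No EventEmitter. No global instance tracking. No forced API surface.
}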

Each class composes only what it needs. Lifecycle is explicit. Testing is trivial.

Anti-Pattern: Depth-5 Inheritance for a Simple HTTP POST

The SaaS observability output in one system needed to POST metrics to a single endpoint with an API key and gzip compression. Reasonable enough. But it inherited from a 5-level chain:

BaseOutputter (~1K LOC) --> HTTPOut (~2K LOC) --> HTTPLoadBalancedOut (~400 lines)
  --> BatchedHTTPOut (~200 lines) --> BaseSaaSOut --> VendorMetricsOut

Total inherited before any vendor-specific code: ~4K lines. What the SaaS output actually needed: POST to one endpoint, one API key header, gzip compression, retry on 429/5xx. What it actually inherited: DNS resolution, endpoint health tracking, weighted routing, full request construction across TLS and proxy, cookie management, pipeline wiring, and backpressure signaling. Developers knew it was wrong. A TODO in production code said:

// TODO: create new class that handles multiple HTTP destinations
// instead of cascading inheritance chain

But inheritance makes fixing it prohibitively expensive. Every existing subclass depends on the hierarchy. The wrong abstraction becomes load-bearing. Fix it with a middleware stack (decorator pattern):

type HttpMiddleware = (req: HttpRequest, next: NextFn) => Promise<HttpResponse>;

const retrying: HttpMiddleware = (req, next) => retryWithBackoff(next, req, { maxRetries: 3 });
const compressing: HttpMiddleware = (req, next) =>
  next({ ...req, body: gzip(req.body), headers: { ...req.headers, 'Content-Encoding': 'gzip' }});
const authenticating = (apiKey: string): HttpMiddleware =>
  (req, next) => next({ ...req, headers: { ...req.headers, 'DD-API-KEY': apiKey }});

class SaaSMetricOutput {
  private transport: HttpTransport;

  constructor(config: SaaSOutputConfig) {
    // Build transport as middleware — no 3,350-line inheritance
    this.transport = buildTransport([
      authenticating(config.apiKey),
      compressing,
      retrying,
    ]);
  }
}

The SaaS output shrinks to ~100 lines. Adding a new vendor requires composing the right middleware, not reading a 5-level hierarchy.
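The snippet above calls buildTransport without showing it. One plausible shape, sketched under assumptions, is a right-to-left fold of the middleware list around a terminal send function. The request and response shapes, the explicit send parameter, the X-Api-Key header name, and the simplified retry (no backoff delay) are all illustrative choices, not the original implementation.

```typescript
interface HttpRequest { url: string; headers: Record<string, string>; body: string; }
interface HttpResponse { status: number; }
type NextFn = (req: HttpRequest) => Promise<HttpResponse>;
type HttpMiddleware = (req: HttpRequest, next: NextFn) => Promise<HttpResponse>;

// Wrap the terminal sender with each middleware; the first entry in the list runs outermost.
function buildTransport(middlewares: HttpMiddleware[], send: NextFn): NextFn {
  return middlewares.reduceRight<NextFn>(
    (next, middleware) => (req) => middleware(req, next),
    send,
  );
}

// Adds an API key header to every request (header name is a placeholder).
const authenticating = (apiKey: string): HttpMiddleware =>
  (req, next) => next({ ...req, headers: { ...req.headers, 'X-Api-Key': apiKey } });

// Retries on 429 and 5xx responses, up to maxRetries extra attempts.
const retrying = (maxRetries: number): HttpMiddleware => async (req, next) => {
  let res = await next(req);
  for (let attempt = 0; attempt < maxRetries && (res.status === 429 || res.status >= 500); attempt++) {
    res = await next(req);
  }
  return res;
};

// Fake sender for the example: fails twice, then succeeds, recording what it saw.
const seen: HttpRequest[] = [];
const statuses = [503, 429, 200];
const fakeSend: NextFn = async (req) => {
  seen.push(req);
  return { status: statuses[seen.length - 1] ?? 200 };
};

const transport = buildTransport([authenticating('demo-key'), retrying(3)], fakeSend);
const pending = transport({ url: 'https://example.invalid/metrics', headers: {}, body: '{}' });
pending.then((res) => console.log(res.status, seen.length));
```

Because each middleware only sees a request and a next function, the whole stack can be exercised against a fake sender like the one above, with no sockets, TLS, or DNS involved.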

Anti-Pattern: Empty Subclasses as Configuration

A system had 12 subclasses of an S3-compatible output. Seven were empty:

export class StorjS3Out extends S3Output {}        // 3 lines
export class CloudflareR2Out extends S3Output {}   // 3 lines
export class AlibabaCloudS3Out extends S3Output {} // 3 lines
// Each carries 4,500+ lines: local staging, orphan reconciliation,
// parquet writing, dead letter dirs — for cloud providers that need none of it

Each existed only for type registration in a factory map. Variant behavior is configuration, not subclasses:

const S3_PROVIDERS: Record<string, S3ProviderConfig> = {
  storj:         { pathStyle: true, region: 'global' },
  cloudflare_r2: { pathStyle: true, region: 'auto' },
  alibaba:       { pathStyle: false, endpoint: '{region}.aliyuncs.com' },
};
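To show how such a table could stand in for the subclasses, here is a minimal sketch of a single output class driven by provider config. The S3ProviderConfig shape is inferred from the entries above; the class body, the createS3Output factory, and the host and URL helpers are hypothetical, and the hostnames in the example are placeholders.

```typescript
interface S3ProviderConfig {
  pathStyle: boolean;
  region?: string;
  endpoint?: string; // may contain a {region} placeholder
}

const S3_PROVIDERS: Record<string, S3ProviderConfig> = {
  storj:         { pathStyle: true, region: 'global' },
  cloudflare_r2: { pathStyle: true, region: 'auto' },
  alibaba:       { pathStyle: false, endpoint: '{region}.aliyuncs.com' },
};

// One class for every provider; the differences live in data, not in subclasses.
class S3Output {
  constructor(private config: S3ProviderConfig, private bucket: string) {}

  // Use the endpoint template when the provider defines one, else the supplied host.
  host(region: string, fallbackHost: string): string {
    return this.config.endpoint ? this.config.endpoint.replace('{region}', region) : fallbackHost;
  }

  // Path-style puts the bucket in the path; virtual-hosted style puts it in the hostname.
  objectUrl(host: string, key: string): string {
    return this.config.pathStyle
      ? `https://${host}/${this.bucket}/${key}`
      : `https://${this.bucket}.${host}/${key}`;
  }
}

// The factory replaces the registry of empty subclasses.
function createS3Output(provider: string, bucket: string): S3Output {
  const config = S3_PROVIDERS[provider];
  if (!config) throw new Error(`Unknown S3 provider: ${provider}`);
  return new S3Output(config, bucket);
}

const r2 = createS3Output('cloudflare_r2', 'logs');
console.log(r2.objectUrl(r2.host('auto', 'gateway.example.invalid'), 'a/b.json'));
```

Adding a provider becomes a one-line change to the table, and a reviewer can see every provider difference in one place.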

The Fix: Composition with Focused Interfaces

Each composed dependency has a focused interface. You can swap IWorkerAuth for mTLS without touching transport. You can test connection tracking with a fake server. A bug fix in data-plane TLS cannot reach WorkerListener.
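As a minimal sketch of what that composition could look like for the WorkerListener from the fragile base class example: IWorkerAuth is named in the text, while the other interface names, the method signatures, and the test doubles are assumptions made for illustration.

```typescript
interface IWorkerAuth {
  validate(workerId: string, token: string): boolean;
}

interface IConnectionTracker {
  register(workerId: string): void;
  isConnected(workerId: string): boolean;
}

interface IncomingConnection {
  workerId: string;
  token: string;
  close(): void;
}

interface IWorkerTransport {
  onConnection(handler: (conn: IncomingConnection) => void): void;
}

// The listener owns only control-plane logic; sockets and TLS live behind IWorkerTransport.
class WorkerListener {
  constructor(
    private transport: IWorkerTransport,
    private auth: IWorkerAuth,
    private tracker: IConnectionTracker,
  ) {}

  start(): void {
    this.transport.onConnection((conn) => {
      if (!this.auth.validate(conn.workerId, conn.token)) {
        conn.close(); // reject unauthenticated workers
        return;
      }
      this.tracker.register(conn.workerId);
    });
  }
}

// Test doubles: no real TCP server is needed to exercise the listener.
class FakeTransport implements IWorkerTransport {
  private handler: (conn: IncomingConnection) => void = () => {};
  onConnection(handler: (conn: IncomingConnection) => void): void { this.handler = handler; }
  simulate(conn: IncomingConnection): void { this.handler(conn); }
}

class TokenTableAuth implements IWorkerAuth {
  constructor(private tokens: Map<string, string>) {}
  validate(workerId: string, token: string): boolean { return this.tokens.get(workerId) === token; }
}

class InMemoryTracker implements IConnectionTracker {
  private connected = new Set<string>();
  register(workerId: string): void { this.connected.add(workerId); }
  isConnected(workerId: string): boolean { return this.connected.has(workerId); }
}

const fakeTransport = new FakeTransport();
const tracker = new InMemoryTracker();
new WorkerListener(fakeTransport, new TokenTableAuth(new Map([['w1', 't1']])), tracker).start();
fakeTransport.simulate({ workerId: 'w1', token: 't1', close: () => {} });
console.log(tracker.isConnected('w1'));
```

Swapping TokenTableAuth for an mTLS-backed implementation would touch only the constructor call, since nothing in WorkerListener knows how validation is performed.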


Part 2: Cyclomatic Complexity, The Tax on Reused Code

When a class serves five different purposes, every execution path has to be guarded. When a module supports four modes, the mode checks spread like mold into every file that imports it. In one real system: 320 files contained topology checks (isLeader, isWorker, isEdge). 186 files checked feature flags deep in domain logic. 488 files accessed process.env directly. This is the direct consequence of reusing the same codebase to serve incompatible purposes.

// This pattern, scattered across hundreds of files:
if (ProcessInfo.isLeaderMode()) {
  this.startDistributedLeader();
  if (FeatureFlags.check('search-v2')) { /* ... */ }
  if (license.tier === 'enterprise') { /* ... */ }
} else if (ProcessInfo.isWorkerMode()) {
  this.connectToLeader();
  if (ProcessInfo.isRunningInCloud()) { /* ... */ }
} else if (ProcessInfo.isEdgeMode()) {
  this.startMinimalPipeline();
  if (FeatureFlags.check('edge-metrics')) { /* ... */ }
}

Every new mode requires touching 20+ files. You cannot test one mode without loading all mode code. Cyclomatic complexity of a single bootstrap method exceeds 20. Adding a deployment mode means auditing hundreds of files for hidden conditionals.

Anti-Pattern: Feature Flags as Global Conditionals

The same problem appears with feature flags. When they’re scattered inline across 186+ files, they become indistinguishable from mode checks, entitlement checks, and license checks, all mixed together:

if (FeatureFlags.check('auth-token')) {
  const { TokenStore } = require('./auth/TokenStore');
  rpc.register(new TokenStore(conf), TokenStore.ID);
}
if (FeatureFlags.check('data-insights') && Product.isWorker(mode)) { /* ... */ }
if (FeatureFlags.check('search') && license.tier === 'enterprise') { /* ... */ }

The fix is to resolve capabilities once at startup and inject them as either real implementations or no-ops:

interface ISearchCapability {
  registerEndpoints(router: Router): void;
  executeQuery(query: Query): Promise<Results>;
}

class NoOpSearch implements ISearchCapability {
  registerEndpoints(): void { /* no-op */ }
  async executeQuery(): Promise<Results> { return Results.empty(); }
}

// Resolve ONCE at startup — never scattered inline
function resolveCapabilities(flags: FeatureFlags, license: License): AppCapabilities {
  return {
    search: flags.check('search') && license.allows('search')
      ? new SearchModule(config)
      : new NoOpSearch(),
  };
}

// Boot is clean
async function boot(caps: AppCapabilities, router: Router): Promise<void> {
  caps.search.registerEndpoints(router); // dead code path simply doesn't exist
}

The Fix: Strategy Pattern + Policy Injection

Define behavior as strategy interfaces. Create one implementation per mode. Resolve the policy once at startup, everything else receives it:

class NodePolicyFactory {
  static create(role: NodeRole, license: License): NodePolicy {
    // THIS is the ONLY place that mode-switches
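    switch (role) {
      case 'leader': return {
        processing: { maxWorkers: 0, enableSearch: true },
        behavior: new LeaderBehavior(),
      };
      case 'edge': return {
        processing: { maxWorkers: 1, maxHeapMB: 512, enableSearch: false },
        behavior: new EdgeBehavior(),
      };
      // Other roles are omitted here; fail loudly instead of returning undefined
      default:
        throw new Error(`Unsupported node role: ${role}`);
    }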
  }
}

// All other code receives the policy — zero mode checks
class PipelineEngine {
  constructor(private policy: NodePolicy) {}
  async start(): Promise<void> {
    const workerCount = this.policy.processing.maxWorkers; // no if-else
  }
}

The number of conditional paths the rest of the code has to handle goes from O(modes × flags × tiers) to O(1): the branching happens once, in the factory.


Part 3: The God Class, Reuse at the Wrong Granularity

When developers try to build a “reusable” class that serves many purposes, they often produce a God Class: a single class that does everything so it can serve everyone. One system had classes like:

| File | Lines | Responsibilities |
| ApplicationServer | ~2K LOC | Bootstrap, mode detection, process spawning, metrics, REST startup, shutdown |
| FileSystemOutput | ~3K LOC | Staging, upload, cleanup, metrics, parquet, reconciliation |
| ProcessManager | ~1.5K LOC | Process lifecycle, metrics init, license, git, config helpers, warm pool |
| HttpBaseInput | ~2K LOC | HTTP server, TLS, health, auth, parsing, compression, routing, proxy |
| RemoteConnection | ~2.5K LOC | Worker lifecycle, config push, metrics, commands, upgrades |

The problem isn’t the line count. It’s that every responsibility changes for different reasons at different times. When the metrics subsystem needs a change, you’re editing the same file that controls TLS configuration. When a new output format is added, you’re touching the same class that manages staging directories.

HttpBaseInput is a good example of the architectural layer problem. It mixed transport (TCP socket management, TLS), protocol (NDJSON parsing, compression), authentication (token validation, auth state machine), application logic (field extraction, time parsing), metrics (request counts, latency histograms), and load balancing, all in one class. Every HTTP-based input (Splunk HEC, OTLP, Elastic, Datadog) inherited all ~2K lines. Changing the TLS configuration risked disrupting field extraction. Adding a health endpoint risked breaking authentication middleware. Fix it by separating layers:

// Each layer is independent — compose at construction time
class SplunkHecInput {
  constructor(
    private transport: IHttpServer,        // Layer 1: socket, TLS
    private auth: IAuthenticator,          // Layer 2: token validation
    private protocol: ISplunkHecParser,    // Layer 3: /services/collector format
    private pipeline: IEventSink,          // Layer 4: deliver events downstream
    private metrics: IInputMetrics,        // Cross-cutting: counters, latency
  ) {}
}
// Changing TLS (transport) cannot break Splunk parsing (protocol)
// Testing protocol parsing requires NO HTTP server — just pass mock events

Part 4: Missing Layers, REST Endpoints Doing Direct I/O

Here’s a less obvious form of the same problem. REST handlers that reach directly into the filesystem:

class AppsEndpoint {
  async handlePut(req: Request): Promise<Response> {
    await writeFile(targetPath, req.body);          // direct fs
    await mkdir(artifactDir, { recursive: true });
    const files = await readdir(configDir);
    // No abstraction, no transaction, no testability
  }
}

This prevents swapping storage backends, adding transaction semantics, unit testing without filesystem mocks, and centralized corruption detection. The application layer reached through the persistence layer, a layer violation that makes both layers impossible to change independently. The fix is a persistence abstraction:

interface IConfigStore {
  read(path: ConfigPath): Promise<Buffer>;
  write(path: ConfigPath, data: Buffer): Promise<void>;
  transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T>;
}

class AppsEndpoint {
  constructor(private store: IConfigStore) {}

  async handlePut(req: Request): Promise<Response> {
    await this.store.transaction(async (tx) => {
      await tx.write(targetPath, req.body);
      await tx.write(metadataPath, metadata);
      // Atomic: both succeed or both roll back
    });
  }
}
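One payoff of the abstraction is that tests can run against an in-memory store. The sketch below is one way such a test double might look; it uses string payloads instead of Buffer to stay runtime-neutral, and the IConfigTransaction shape, the staging approach, and the has helper are assumptions rather than the original design.

```typescript
type ConfigPath = string;

interface IConfigTransaction {
  write(path: ConfigPath, data: string): Promise<void>;
}

interface IConfigStore {
  read(path: ConfigPath): Promise<string>;
  write(path: ConfigPath, data: string): Promise<void>;
  transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T>;
}

class InMemoryConfigStore implements IConfigStore {
  private files = new Map<ConfigPath, string>();

  async read(path: ConfigPath): Promise<string> {
    const data = this.files.get(path);
    if (data === undefined) throw new Error(`Not found: ${path}`);
    return data;
  }

  async write(path: ConfigPath, data: string): Promise<void> {
    this.files.set(path, data);
  }

  // Writes are staged and applied only if fn completes; a throw discards them all.
  async transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T> {
    const staged = new Map<ConfigPath, string>();
    const tx: IConfigTransaction = {
      write: async (path, data) => { staged.set(path, data); },
    };
    const result = await fn(tx);
    staged.forEach((data, path) => this.files.set(path, data));
    return result;
  }

  has(path: ConfigPath): boolean { return this.files.has(path); }
}

const store = new InMemoryConfigStore();
const failed = store
  .transaction(async (tx) => {
    await tx.write('apps/demo/app.json', '{}');
    throw new Error('metadata validation failed');
  })
  .then(() => 'committed', () => 'rolled back');
failed.then((outcome) => console.log(outcome, store.has('apps/demo/app.json')));
```

The handler under test never learns whether it is talking to a filesystem, a database, or this map, which is what lets transaction semantics be added or changed in one place.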

Part 5: CRUD as Architecture, Generic APIs That Serve Nobody

CRUD generators are another form of pathological reuse. One model, one handler, one UI pattern for everything. They deliver APIs optimized for the database schema rather than user intent.

// "Reusable" CRUD generator applied to 40+ resources
createCrudEndpoints('workers', workerSchema, workerStore);
createCrudEndpoints('pipelines', pipelineSchema, pipelineStore);

// PUT /workers/:id demands ALL 10 fields, even though:
//   "Rename a worker" only needs { description }
//   "Move to a group" only needs { group }
//   "Scale up" only needs { maxProcesses, heapSizeMB }

Callers must research which fields matter for their specific operation. Concurrent callers doing GET --> modify one field --> PUT back create race conditions, because each PUT overwrites every field with that caller’s stale copy. The fix is to model what users actually do, not what the database stores.
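A minimal sketch of an intent-based alternative, assuming three operations taken from the comments above (rename, move to a group, scale). The command names, the WorkerService class, and the route mentioned in the comment are hypothetical, and only a subset of the worker fields is modeled.

```typescript
// Subset of the worker record; the real resource has more fields.
interface Worker {
  id: string;
  description: string;
  group: string;
  maxProcesses: number;
  heapSizeMB: number;
}

// One command per user intent; each carries only the fields that intent needs.
type WorkerCommand =
  | { kind: 'rename'; description: string }
  | { kind: 'moveToGroup'; group: string }
  | { kind: 'scale'; maxProcesses: number; heapSizeMB: number };

// Each command touches only its own fields, so unrelated changes cannot clobber each other.
function applyCommand(current: Worker, cmd: WorkerCommand): Worker {
  switch (cmd.kind) {
    case 'rename':
      return { ...current, description: cmd.description };
    case 'moveToGroup':
      return { ...current, group: cmd.group };
    case 'scale':
      if (cmd.maxProcesses < 1 || cmd.heapSizeMB < 1) throw new Error('scale values must be positive');
      return { ...current, maxProcesses: cmd.maxProcesses, heapSizeMB: cmd.heapSizeMB };
  }
}

class WorkerService {
  private workers = new Map<string, Worker>();

  add(worker: Worker): void { this.workers.set(worker.id, { ...worker }); }

  get(id: string): Worker {
    const worker = this.workers.get(id);
    if (!worker) throw new Error(`Unknown worker: ${id}`);
    return { ...worker };
  }

  // Could back a route such as POST /workers/:id/commands (hypothetical).
  apply(id: string, cmd: WorkerCommand): Worker {
    const next = applyCommand(this.get(id), cmd);
    this.workers.set(id, next);
    return next;
  }
}

const service = new WorkerService();
service.add({ id: 'w1', description: 'old name', group: 'default', maxProcesses: 2, heapSizeMB: 512 });
service.apply('w1', { kind: 'rename', description: 'ingest node' });
service.apply('w1', { kind: 'moveToGroup', group: 'edge' });
console.log(service.get('w1'));
```

Two callers issuing a rename and a group move no longer race on a full-record PUT: the server applies each change to the current state, so both survive regardless of ordering.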


Part 6: npm and the Dependency Chain Problem

Inheritance abuse at the code level has a direct analog at the package level. I used Perl’s CPAN extensively in the 1990s with the Mason web templating system. It worked beautifully until it didn’t. Then came Maven, pip, npm, RubyGems, Cargo. Each language built its own package ecosystem. Each package could depend on other packages, creating dependency trees that look like fractals. We never developed mature patterns for managing these at scale. The npm ecosystem exemplifies the chaos. In 2016, a developer unpublished left-pad, an 11-line function that padded strings with spaces. Thousands of projects broke overnight. Babel, React, and countless applications depended on it through layers of transitive dependencies. This pattern repeats. I’ve seen production applications import packages for:

  • is-odd / is-even: check if a number is odd (n % 2 === 1)
  • is-array: check array type (JavaScript has Array.isArray() built-in)
  • string-split: split text

The MIT Sloan Management Review and ACM both document the risks of software reuse at scale. The core finding: reuse shifts risk from “building the wrong thing” to “inheriting the wrong dependency chain.” A single Go project might pull in hundreds of transitive dependencies, each a potential security vulnerability. Both costs are real. Only the first one gets measured.


Part 7: Reusing Security Tokens, The Shared Blast Radius

The most dangerous form of reuse isn’t in code. It’s in credentials.

class InstanceSettings {
  // One token — shared by every worker in the fleet of thousands
  authToken: string = crypto.randomBytes(16).toString('hex');
}

if (req.headers['x-auth-token'] !== this.authToken) {
  return res.status(401).json({ error: 'Unauthorized' });
}

A single compromised worker exposes every worker. Revoking one worker’s access requires rotating the shared secret for the entire fleet, a coordinated operation that takes the whole fleet offline simultaneously. In one system, we shared the same token between the control plane and the data plane in the name of reuse. This caused innumerable bugs when the control plane changed its token scheme from opaque tokens to JWT. The fix is per-identity tokens with short TTLs:

class WorkerTokenIssuer {
  async issueToken(identity: WorkerIdentity): Promise<AccessToken> {
    return this.mint({
      sub: identity.clientId,           // unique per worker
      scopes: identity.scopes,           // minimal privilege
      exp: Date.now() + this.tokenTTLMs, // short-lived
      jti: ulid(),                        // unique — enables revocation
    });
  }

  async revokeWorker(clientId: string): Promise<void> {
    await this.revocationList.add(clientId);
    // Other 9,999 workers unaffected
  }
}

Systems that manage thousands of agents at scale, such as Datadog, Prometheus exporters, and Kubernetes kubelets, issue per-agent certificates or short-lived tokens. Shared credentials aren’t a cost saving. They’re a single blast radius for your entire fleet.


Part 8: Shared Modules, How Common Code Slows Teams

Shared or “common” modules feel like the right call. One place for utilities, helpers, shared models. Every team uses the same battle-tested code. No duplication. In practice, these modules become the most contested real estate in the codebase.

Team A needs a small change to a shared validation function. They open a PR. But the common module is owned by a platform team that maintains a release cadence. Team A waits for the next release window. Team B is blocked on a different change to the same file. Both PRs conflict. The platform team spends a sprint mediating merge conflicts they didn’t create. I’ve seen this pattern repeat at multiple companies:

  • A common module starts as a home for genuinely shared utilities, timestamp parsing, config validation, ID generation.
  • Teams start adding features to it because “it’s already shared.” Team A adds a flag to change behavior for their use case. Team B adds a different flag. The module grows a conditional for every team’s edge case.
  • The module that was supposed to prevent duplication becomes the largest source of complexity, merge conflicts, and broken builds in the codebase.

Brooks identified the organizational dimension of this in The Mythical Man-Month: corporate-level reuse “implies changes in project accounting and measurement practices to give credit for reusability.” Teams get credit for shipping features, not for investing in shared infrastructure. The incentives push toward adding to common quickly, and away from the expensive work of designing a proper stable interface. The result is that common gets additions but rarely deletions, refinements, or principled breaking changes.

What works instead:

  • Narrow, stable libraries: utilities with pure functions (parseTimestamp, generateId), no state, no side effects. These can be shared safely because they have no behavior to conflict over.
  • Published interfaces, not shared implementations: agree on the contract, let each team implement. If two teams share an interface rather than a class, their implementations evolve independently.
  • Internal packages with semantic versioning: treat shared code like a real library. Pin versions per team. Break changes intentionally and explicitly. Don’t silently couple release trains.
  • Copy for divergence: if Team A and Team B both need slightly different behavior from a shared function, copy it. Let each version evolve toward its actual use case. The right abstraction will reveal itself only after divergence, not before.

Part 9: The Monolithic Binary, Inheritance Made Physical

Inheritance abuse has a physical consequence: it makes separation architecturally impossible. When WorkerListener extends TcpDataInput, you cannot compile WorkerListener without the entire data-plane input hierarchy. You cannot deploy the leader without bundling all input connector code. When HeartbeatSender extends TcpSender, you cannot deploy a worker without bundling all output connector code. The result in one system: a single binary exceeding 200MB containing all modes, all 150 connectors, and all feature code, regardless of which node role deployed it.

| System | Architecture | Agent Size |
|---|---|---|
| Monolithic inheritance system | Single binary, all modes | 200–400MB |
| Datadog Agent | Go binary, plugin-based | ~50MB |
| Fluent Bit | C binary, plugin-based | 10–30MB |
| Vector | Rust binary, feature-flagged | 30–50MB |
| Telegraf | Go binary, registry pattern | ~60MB |

The inheritance chain creates a compile-time dependency graph that makes separation physically impossible. Even if you wanted a “leader-only” binary, the import chain through inheritance pulls in every connector. Competitors use composition-based plugin architectures from the start:

// Telegraf: no class inherits from another — each plugin is independent
func init() {
    inputs.Add("kafka", func() telegraf.Input { return &KafkaInput{} })
}
// Adding a plugin: add one file. No core file modified.
// Building a minimal binary: don't compile that file.

Each module declares its activation events. The kernel loads only modules matching the current role and entitlements. A bug in the Kafka connector cannot affect S3. Adding a connector requires zero changes to core.
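The role-based loading described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Telegraf's or any other product's actual mechanism: the `register` decorator, the role names, and the two plugin classes are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

# name -> (roles that activate the plugin, factory that builds it)
_REGISTRY: Dict[str, Tuple[FrozenSet[str], Callable[[], object]]] = {}


def register(name: str, roles: Iterable[str]):
    """Decorator: a plugin registers itself, so no core file changes."""
    def decorator(factory):
        _REGISTRY[name] = (frozenset(roles), factory)
        return factory
    return decorator


@register("kafka", roles={"worker"})
class KafkaInput:
    """Hypothetical data-plane connector, only needed on workers."""


@register("cluster-status", roles={"leader"})
class ClusterStatus:
    """Hypothetical control-plane module, only needed on the leader."""


def load_for_role(role: str) -> Dict[str, object]:
    """Instantiate only the plugins whose declared roles include this role."""
    return {
        name: factory()
        for name, (roles, factory) in _REGISTRY.items()
        if role in roles
    }
```

With this shape, `load_for_role("leader")` never constructs `KafkaInput`. In a compiled language the same registry lets a build exclude the file entirely.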


Part 10: Shared Mutable State, The Singleton Tax

In one codebase, I counted 474 singletons:

Configuration.instance().loadSystem('app');
GitMgr.instance().ignore();
AuthTokenAuthority.instance().createToken(claims);
InputMgr.instance().getInput(id);
// ... 20+ more

Every singleton creates invisible coupling: any code can access any singleton without declaring the dependency. Creation and destruction order is undefined. Tests cannot provide mocks without manipulating global state. Request-scoped, group-scoped, and process-scoped data all use the same pattern. Module-level mutable state is the same problem in a different form. One system had 30+ pipeline functions with module-level variables:

let _primaryCache = new Map();
let _numEventsReceived = 0;

exports.process = (event) => {
  const key = _expression.evalOn(event);
  _primaryCache.get(key).count++;  // global mutation in hot path
  _numEventsReceived++;
};

There’s no isolation between pipeline instances sharing the same module, and race conditions emerge the moment processing is parallelized. The fix is closure-encapsulated state — state is local to the instance, not the module:

function createProcessor(config: ProcessorConfig): Processor {
  const primaryCache = new Map<string, CacheEntry>(); // local to THIS instance

  return {
    process(event) {
      const key = config.keyExpr.evalOn(event);
      let entry = primaryCache.get(key);
      if (!entry) {
        entry = createEntry();
        primaryCache.set(key, entry); // store it so counts accumulate per key
      }
      entry.count++;
      return entry.count <= config.maxToAllow ? event : null;
    },
  };
}

const processor1 = createProcessor(config1);
const processor2 = createProcessor(config2); // completely independent
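The same move applies to the singletons listed earlier: pass the dependency in instead of reaching for `.instance()`. A minimal Python sketch, assuming a hypothetical `TokenAuthority` with an injected clock and a hypothetical `RequestHandler` that depends on it:

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenAuthority:
    """Explicit dependencies replace a global TokenAuthority.instance()."""
    clock: Callable[[], float]  # injected time source, trivial to fake in tests
    ttl_seconds: float

    def create_token(self, subject: str) -> Dict[str, object]:
        now = self.clock()
        return {"sub": subject, "iat": now, "exp": now + self.ttl_seconds}


class RequestHandler:
    def __init__(self, tokens: TokenAuthority) -> None:
        # The dependency is visible in the constructor signature.
        self._tokens = tokens

    def login(self, user: str) -> Dict[str, object]:
        return self._tokens.create_token(user)
```

A test builds `TokenAuthority(clock=lambda: 1000.0, ttl_seconds=60.0)` and hands it to the handler. No global state is touched, and creation order is whatever the caller writes.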

Part 11: The New Angle, Agentic AI Thrives on Modular Code

Here’s something I’ve observed that doesn’t get written about enough: the quality of AI-generated code degrades sharply with the complexity of the codebase it works in.

Agentic coding tools like Claude Code, Cursor, Copilot in agent mode, and others are transformative for well-structured codebases. But point them at a codebase with deep inheritance hierarchies, scattered conditional logic, god classes, and shared mutable singletons, and the output becomes unreliable in predictable ways.

Why Bad Structure Amplifies AI Mistakes

  • Context window exhaustion. When a class inherits from a 7-level hierarchy, understanding what any method does requires reading across 3+ directories and thousands of lines. AI tools have a finite context window. A god class of 2,000+ lines, a shared common module with hundreds of exports, or a deep inheritance tree consumes that window before the model even reaches the code it’s supposed to change. The model ends up reasoning from partial context, and partial context produces confident-looking but wrong code.
  • Conditional logic compounds errors. When 320 files contain mode checks and 186 contain scattered feature flag conditionals, the model has to track implicit state through the entire call graph to reason correctly about any change. Every missed conditional is a latent bug. I’ve seen AI agents introduce a change that was correct for isLeaderMode() but silently wrong for isEdgeMode() because the conditional branching was too diffuse to track reliably.
  • Inheritance hierarchies hide side effects. When a model generates code for a leaf class in a deep hierarchy, it may not realize that super.init() triggers a chain of side effects through five parent classes, or that overriding getTimeout() will be called in 12 different contexts. The model sees the method signature. It doesn’t see the full inheritance contract. The result looks plausible but breaks at runtime.
  • Shared mutable state creates invisible dependencies. A model generating a new component might not know that a singleton it touches is also modified by three other components during the same request lifecycle. In a clean dependency-injected system, those dependencies are declared. In a singleton-heavy system, they’re invisible, and invisible dependencies produce bugs that are hard to reproduce and harder to explain to an AI that’s trying to help you fix them.

What AI Agents Do Well and Where Structure Helps

The pattern I keep seeing: AI agents work best when they can work on one focused thing at a time. A well-designed system makes that possible with:

  • Small classes with single responsibilities
  • Explicit interfaces and dependency injection
  • Focused modules with clear boundaries
  • No cross-domain inheritance
  • Composition over inheritance throughout

The cleanest formulation I’ve found: the codebases that benefit most from AI-assisted development are exactly the codebases that already practice good design.


The Decision Framework

| Mechanism | Safe When | Dangerous When |
|---|---|---|
| Copy-paste | 2–3 instances, likely to diverge | Never |
| Shared utility function | Pure logic, no state, no side effects | When it accumulates parameters to serve all callers |
| Shared interface | Multiple implementations of same contract | When the interface grows to satisfy one implementation |
| Composition | Reusing behavior across unrelated concerns | Almost never dangerous |
| Inheritance | True “is-a”, LSP holds, < 30% override | Different domains, constructor disabling, >30% override |
| Common module | Stable, narrow, pure utilities | Anything with mutable behavior, ownership ambiguity |
| CRUD generator | Simple reference data | Resources with distinct business operations |
| Shared config/token | Never | Always |

Conclusion: Duplication You Can See vs. Coupling You Can’t

The drive for reusability is real. Duplicated logic is a real cost. But the engineers who warn against premature abstraction, like Brooks, Metz, Pike, Beck, and Parnas, are pointing at something specific: coupling is invisible at creation time and expensive at change time. Duplicated code can be changed independently. The wrong abstraction propagates changes to every consumer. A shared inheritance hierarchy means a security fix in the control plane can take down the data plane. A shared token means one compromised worker compromises the fleet. A shared common module becomes the shared surface for every team’s bugs and merge conflicts.

And now there’s a new dimension to this: a well-structured, modular codebase with clear boundaries and composition over inheritance is also the codebase where AI agents work reliably. The investment in clean design pays dividends across every developer you add, whether human or AI.

The safest question to ask before sharing anything: what happens when this needs to change? If the answer is “nothing else breaks,” share it. If the answer is “everything that depends on it,” think harder about whether you’re creating an abstraction or a trap. Start with duplication. Let the right abstraction reveal itself. Then share via composition, narrow interfaces, and well-bounded modules. The cost of the wrong abstraction always exceeds the cost of a little repetition.


June 1, 2026

Killing the State Machine: Declarative AI Coding Agents with an Orchestration System

Filed under: Agentic AI — admin @ 7:16 pm

Background

I have built a number of agentic systems over the last year. I built a PII detection system with LangChain and Vertex AI that scans documents and redacts sensitive data without human review. I built an API compatibility guardian using LangGraph that catches breaking changes before they reach production. And I built a production-grade enterprise AI platform on vLLM serving multiple teams and use cases. Most recently I wrote a complete guide to production AI agents with MCP and A2A.

Alongside that work I have been using agentic coding tools heavily by letting AI write code while I own the architecture and design. I documented that approach in AI Writes Code, You Own the Design, which covers how to use skills with structured methodology files to make AI coding agents produce consistent, reviewable, architecturally sound output instead of chaos.

But there’s a deeper layer of context. Over ten years ago, before GitHub Actions and GitLab Runner existed as concepts, I built a distributed orchestration engine for automating heterogeneous tasks with declarative syntax. It used Docker, Kubernetes, shell scripts, and custom worker types to handle diverse workloads. The core insight then is the same insight that applies now: scheduling, fault tolerance, retries, timeouts, observability, and capacity management are solved problems. Your application should not implement them. That engine became Formicary, which I open-sourced. This post shows how I applied Formicary to automated agentic coding workflows and why enterprises keep making the same expensive mistake.


The Problem I Keep Seeing

When teams build AI coding agents, meaning systems that pick up GitHub issues, plan implementations, write code, run tests, and open PRs, they reach for the obvious approach: a coordinator process, a state machine, custom pollers. The initial version works. Then it accumulates. I have seen enterprises building custom solutions with 50K+ lines of TypeScript. Look inside these systems and you find the same failure modes every time:

  • No per-phase timeouts. If the AI model hangs during implementation, the process runs until a global job timeout kills it — often 90 minutes later, after consuming an expensive model session and blocking other work.
  • Silent work drop. When the worker pool fills, the system silently skips newly discovered issues instead of queuing them.
  • Context loss between phases. The planner writes a plan file. The implementer starts a fresh AI session and re-explores the entire codebase from scratch. The planning work gets thrown away.
  • Custom DAG reinvention. The state machine handles branching: tests fail -> retry, model blocked -> notify human. This is just a DAG with exit-code routing. It’s already solved, and the custom version is always underpowered.
  • Crummy restarts. Retry a failed issue and the agent reuses the same branch name. Git conflict. Failure. Start over.
  • Infrastructure lock-in. You can’t run it on a laptop because it’s tangled with Kubernetes pod lifecycle management.
  • High cost per new feature. Adding a security review phase means new state transitions, new code, and a new deployment, which together take days of engineering time.

The root mistake is treating orchestration as application logic. These teams write scheduling, capacity management, artifact passing, observability, and retry logic inside their agent code. Every one of those concerns is already solved by mature orchestration frameworks. Stop writing that code.


The Declarative Replacement

In one enterprise environment, I replaced a TypeScript agent system of 50K+ lines with a few declarative workflow definitions:

ai-gh-issue-picker.yaml   (~100 lines)  — polls GitHub, submits jobs
ai-gh-implement.yaml      (~500 lines)  — plan -> implement -> test -> verify -> PR -> monitor -> learn
ai-gh-cleanup.yaml        (~80 lines)   — stale workspace and branch cleanup

No orchestration code. No state machine. No custom pollers. No retry logic. No timeout management. Formicary handles all of it.

Here is every decision, with the reasoning.


Decision 1: Replace Custom Pollers with a Cron Job

Custom polling processes run continuously, consume resources, and require their own deployment lifecycle. I replaced the GitHub issue poller with a Formicary cron job:

job_type: ai-gh-issue-picker
cron_trigger: "0 * * * * * *"   # every minute (7-field cron)
max_concurrency: 1               # only one picker at a time

skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING") 10}} true {{end}}

The skip_if fires at the scheduler level before any worker is allocated, before any task runs. If 10 implement jobs are already pending, Formicary skips the entire picker invocation silently. Zero worker cost.

The gather-issues task fetches GitHub issues labeled ai-ready, moves each label to ai-in-progress, and writes a compact issues.json. I wrote it in Python rather than bash because Python eliminates the jq/base64/subshell-scoping traps that plagued the original version:

import json, os, subprocess

repo = f"{os.environ['GH_ORG']}/{os.environ['GH_REPO']}"

def gh(*args):
    # Run the GitHub CLI; the caller inspects returncode and stdout.
    return subprocess.run(["gh", *args], capture_output=True, text=True)

# Fetch open issues that carry the pickup label, capped at MAX_PENDING.
r = gh("issue", "list", "-R", repo,
       "--label", os.environ["PICKUP_LABEL"], "--state", "open",
       "--limit", os.environ.get("MAX_PENDING", "10"),
       "--json", "number,title,url")
issues = json.loads(r.stdout) if r.returncode == 0 else []

# Move each issue from the pickup label to the in-progress label.
for issue in issues:
    gh("issue", "edit", str(issue["number"]), "-R", repo,
       "--remove-label", os.environ["PICKUP_LABEL"],
       "--add-label", os.environ["INPROGRESS_LABEL"])

# Write compact JSON as an artifact and expose it to the next task.
issues_json = json.dumps(issues, separators=(',', ':'))
with open("issues.json", "w") as f:
    f.write(issues_json + "\n")
print(f"::set-output name=IssuesJSON::{issues_json}")

The submit-jobs task uses SubmitJobsFromJSON, a Formicary template function that submits one implement job per issue directly through the DB. A unique index on user_key (keyed as ai-gh-implement-{org}-{repo}-{number}) rejects duplicate submissions at the constraint level. No pre-flight lookups, no race conditions:

environment:
  SUBMITTED_IDS: >-
    {{if .IssuesJSON}}{{SubmitJobsFromJSON "ai-gh-implement" .IssuesJSON
        (printf "GitHubOrg=%s" .GitHubOrg) (printf "GitHubRepo=%s" .GitHubRepo)}}{{end}}
  PENDING_COUNT: '{{CountByJobTypeAndState "ai-gh-implement" "PENDING"}}'
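To see why a unique index removes the need for pre-flight lookups, here is a minimal sketch using SQLite. The table layout and the `submit_job` helper are hypothetical and do not reflect Formicary's actual schema. Only the `user_key` format comes from the text above.

```python
import sqlite3


def open_job_db() -> sqlite3.Connection:
    """Create an in-memory job table with a unique index on user_key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, job_type TEXT NOT NULL, user_key TEXT NOT NULL)"
    )
    conn.execute("CREATE UNIQUE INDEX idx_jobs_user_key ON jobs(user_key)")
    return conn


def submit_job(conn: sqlite3.Connection, org: str, repo: str, number: int) -> bool:
    """Insert one implement job; return False if it was already submitted."""
    user_key = f"ai-gh-implement-{org}-{repo}-{number}"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs (job_type, user_key) VALUES (?, ?)",
                ("ai-gh-implement", user_key),
            )
        return True
    except sqlite3.IntegrityError:
        # The database rejected the duplicate; no SELECT-then-INSERT race.
        return False
```

Two pickers racing on the same issue both attempt the insert, and the database lets exactly one through. There is no window between "check" and "write" because there is no separate check.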

Decision 2: Replace the State Machine with a DAG

A 12-state custom state machine becomes a named DAG in YAML. The full pipeline looks like this:

Exit-code routing handles every branch. No code required:

- task_type: implement
  on_exit_code:
    COMPLETED: unit-test
    "2": notify-blocked    # model signals blocked
    "3": fix-tests         # tests failing
  on_failed: notify-blocked
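For readers who want the routing rule spelled out, this Python sketch mirrors the implement task's table. It is only an illustration of the idea, not how Formicary implements it, and it assumes that exit code 0 maps to COMPLETED and that any unmapped nonzero code falls back to the on_failed target.

```python
from typing import Dict, Optional

# Routing table for the implement task, copied from the YAML above.
# "FAILED" stands in for on_failed.
ROUTES: Dict[str, Dict[str, str]] = {
    "implement": {
        "COMPLETED": "unit-test",
        "2": "notify-blocked",   # model signals blocked
        "3": "fix-tests",        # tests failing
        "FAILED": "notify-blocked",
    },
}


def next_task(task: str, exit_code: int) -> Optional[str]:
    """Return the task to run next, or None when the job ends here."""
    routes = ROUTES.get(task, {})
    if exit_code == 0:
        return routes.get("COMPLETED")
    # A specific exit-code route wins; otherwise use the failure route.
    return routes.get(str(exit_code), routes.get("FAILED"))
```

The whole "state machine" is a dictionary lookup. Adding a branch means adding one entry, not a new state and its transitions.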

The unit-test task verifies commits exist, shows the diff, then detects and runs the project’s test suite: it checks for Makefile, Cargo.toml, package.json, go.mod, or pytest and runs whichever it finds. If no commits were made, it fails immediately. If tests fail, it routes to fix-tests. The self-verify task runs a separate AI reviewer session that runs tests, checks correctness, checks security, and verifies the implementation matches the issue. A fresh context catches mistakes the implementer’s context was blind to. If self-verify cannot resolve a problem, create-pr still runs, but the PR body explicitly states what remains unresolved. Silently creating PRs with known failures is a common failure mode in imperative systems, so I designed against it.


Decision 3: Give Every Phase Its Own Timeout

The biggest operational gap in imperative agents is missing per-phase timeouts. I gave every task its own:

- task_type: plan
  timeout: 15m

- task_type: implement
  timeout: 45m

- task_type: unit-test
  timeout: 10m

- task_type: self-verify
  timeout: 15m

- task_type: cleanup
  always_run: true    # runs even if the job fails
  timeout: 1m

always_run: true on cleanup guarantees Formicary removes the workspace and branch regardless of outcome. Without it, stuck jobs leak temporary directories and dead branches indefinitely.
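If you have only seen a global job timeout, the difference is easy to show. The sketch below is a hypothetical Python stand-in for what the orchestrator does on your behalf: each phase gets its own deadline, and cleanup runs in a `finally` block regardless of outcome. The names `run_phase` and `run_job` are invented for illustration.

```python
import subprocess
from typing import Callable, Dict, List, Sequence, Tuple

Phase = Tuple[str, Sequence[str], float]  # (name, argv, timeout in seconds)


def run_phase(argv: Sequence[str], timeout_s: float) -> str:
    """Run one phase as a subprocess with its own deadline."""
    try:
        subprocess.run(list(argv), timeout=timeout_s, check=True)
        return "COMPLETED"
    except subprocess.TimeoutExpired:
        return "TIMED_OUT"  # the child is killed when the deadline passes
    except subprocess.CalledProcessError:
        return "FAILED"


def run_job(phases: List[Phase], cleanup: Callable[[], None]) -> Dict[str, str]:
    """Run phases in order; stop at the first non-success; always clean up."""
    results: Dict[str, str] = {}
    try:
        for name, argv, timeout_s in phases:
            results[name] = run_phase(argv, timeout_s)
            if results[name] != "COMPLETED":
                break
    finally:
        cleanup()  # equivalent in spirit to always_run: true
    return results
```

A hung implement phase now costs at most its own timeout rather than the global one, and the workspace is removed either way.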


Decision 4: Flow Context Forward Through Artifacts

Imperative bots lose context between phases because each phase is a separate pod with no shared state. The planner’s work gets discarded. I solved this years ago with a shared workspace and an artifact chain:

Each task declares its dependencies and Formicary downloads the upstream artifacts automatically:

- task_type: self-verify
  dependencies:
    - setup       # downloads meta.env
    - implement   # downloads impl_result.json, impl_conversation.txt, impl_diff.patch
  script:
    - |
      TASK_DIR="$PWD"          # capture executor dir before any cd
      source "$TASK_DIR/meta.env"
      cd "$WS/repo"
      # all artifacts available in $TASK_DIR/

One critical detail: save TASK_DIR="$PWD" before any cd. Artifacts must be written back to the executor’s working directory, not to the repo:

TASK_DIR="$PWD"
source "$TASK_DIR/meta.env"
cd "$WS/repo"
# ... do work ...
jq ... > "$TASK_DIR/result.json"   # write to TASK_DIR, not to repo

The implementer now reads PLAN.md that the planner wrote. Context survives across phases.


Decision 5: Use Nonces to Make Restarts Safe

One issue with the imperative implementation was that when a job retried a failed issue, it reused the same branch name and hit a Git conflict. In the workflow definition, I added a 4-byte random hex nonce to every branch:

NONCE=$(head -c 4 /dev/urandom | xxd -p)
BRANCH="ai/{{.IssueNumber}}-${SLUG}-${NONCE}"
# e.g., ai/42-fix-login-timeout-a3f19c02 (4 bytes = 8 hex characters)

retry: 1 on the implement job submits a fresh attempt with a new nonce -> new branch -> no conflicts. The ai-gh-cleanup job removes stale branches after PR merge.
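The same naming scheme in Python, for anyone porting the script. The `slugify` helper is an assumption: the shell version computes SLUG elsewhere, and its exact rules are not shown here.

```python
import re
import secrets


def slugify(title: str, max_len: int = 40) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', and trim."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def branch_name(issue_number: int, title: str) -> str:
    """Build a retry-safe branch name with a 4-byte random hex nonce."""
    nonce = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    return f"ai/{issue_number}-{slugify(title)}-{nonce}"
```

Each call yields a different suffix, so a retried job gets a fresh branch instead of colliding with the one left behind by the failed attempt.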


Decision 6: Stream Output and Extract Structured Status

I need two things simultaneously: real-time visibility of what the agent is doing, and structured status for routing decisions. claude --print streams output through tee, while the prompt instructs Claude to output a JSON status object on its final line:

claude --print --dangerously-skip-permissions --model "$MODEL" --max-turns 100 \
  "$(cat /tmp/impl_prompt.txt)" 2>&1 | tee "$TASK_DIR/impl_conversation.txt"

# Extract the last JSON object with a "status" key
STATUS_JSON=$(grep -oE '\{[^{}]*"status"[^{}]*\}' \
  "$TASK_DIR/impl_conversation.txt" | tail -1)
STATUS=$(echo "$STATUS_JSON" | jq -r '.status // "UNKNOWN"')
[ "$STATUS" = "BLOCKED" ] && exit 2
[ "$STATUS" = "TESTS_FAILING" ] && exit 3

--dangerously-skip-permissions is required. Without it, Claude only produces text describing what it would do, zero file changes, zero commits. With it, Claude actually reads files, writes code, and runs tests. This gives me four things at once: real-time streaming to the Formicary dashboard, exit-code routing from the status field, artifact data for downstream tasks, and the full AI conversation captured as a debuggable artifact.
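The grep and jq extraction can also be written in Python when the task script is not bash. This sketch uses the same regular expression as the shell snippet. One deliberate difference, stated plainly: it skips candidates that are not valid JSON and keeps looking backwards, whereas the shell version takes the last match unconditionally. The function names are hypothetical.

```python
import json
import re

# Same pattern as the shell version: a flat JSON object containing "status".
STATUS_RE = re.compile(r'\{[^{}]*"status"[^{}]*\}')

# Exit codes used for routing, matching the on_exit_code table.
EXIT_CODES = {"BLOCKED": 2, "TESTS_FAILING": 3}


def extract_status(transcript: str) -> str:
    """Return the status from the last parseable status object, else UNKNOWN."""
    for candidate in reversed(STATUS_RE.findall(transcript)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not; try the previous match
        status = obj.get("status")
        if isinstance(status, str):
            return status
    return "UNKNOWN"


def exit_code_for(status: str) -> int:
    """Map a status to the process exit code; anything unmapped exits 0."""
    return EXIT_CODES.get(status, 0)
```

The conversation transcript stays free-form for humans, while the final status line stays machine-readable for routing.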


Decision 7: Encode Methodology in Skills

I don’t ask Claude to “write some code.” I embed skill instructions that encode engineering discipline into every prompt. I wrote about this approach in depth in AI Writes Code, You Own the Design. The core idea is that freeform prompting produces inconsistent output, while skill-encoded prompting produces output that follows a contract.

claude --print --model opus --max-turns 30 \
  "Use the ygs-wbs skill approach:
   1. Explore the codebase
   2. Decompose into vertical-slice tasks
   3. Write PLANS/{issue-slug}-{number}-plan.md with acceptance criteria"

If you-got-skills is installed on the worker, Claude discovers /ygs-wbs as a slash command automatically. The prompt-embedded version works either way, no dependency on the skills package being present.

The four skills that shape this pipeline:

| Phase | Skill | What it enforces |
|---|---|---|
| plan | ygs-wbs | Vertical slices, acceptance criteria, explicit scope |
| implement | ygs-implement | Atomic commits, tests after each task, scope guardrails |
| fix-tests | ygs-investigate | Root cause analysis, not symptom masking |
| self-verify | ygs-code-review | Run tests, check correctness, fix critical issues |

Each skill acts as a contract. “Plan vertically, commit atomically, stop when blocked” produces far more consistent and reviewable output than open-ended instructions.


Decision 8: Make the Dashboard Show What’s Happening

Formicary’s job description field accepts markdown. Every submitted implement job carries clickable links to the issue, branch, and PR:

{
  "job_type": "ai-gh-implement",
  "description": "#42: Fix login timeout | [org/repo](https://github.com/org/repo)",
  "params": {
    "IssueLink": "[#42: Fix login timeout](https://github.com/org/repo/issues/42)",
    "BranchLink": "[ai/42-fix-login-a3f1](https://github.com/org/repo/tree/ai/42-fix-login-a3f1)",
    "PRLink": ""
  }
}

The PRLink starts empty and the create-pr task populates it once the PR exists. Every job in the dashboard now shows exactly what it’s working on with one-click navigation to the relevant GitHub page.


Decision 9: Capture Everything as Artifacts

Every task uploads artifacts with when: always, so they are captured on failure as well as success. This is what makes debugging possible rather than a guessing game:

| Artifact | Contents |
|---|---|
| plan_conversation.txt | Full AI conversation during planning |
| plan_result.json | Status, complexity, task count, summary |
| impl_conversation.txt | Full AI conversation during implementation |
| impl_result.json | Status, files changed, commit count |
| impl_diff.patch | Complete git diff of all changes |
| impl_commits.txt | List of commits made |
| test_output.txt | Test suite output with pass/fail details |
| verify_result.json | Test pass/fail, critical findings, any fixes |
| verify_conversation.txt | Full AI conversation during self-verify |

Every task also sets report_stdout: true, so Formicary streams output to the dashboard websocket in real time. Combined with tee, you see the full AI conversation live as it happens. The workspace also persists locally at ~/claude_workspace/{issue}-{nonce} so you can cd into it after a run and inspect exactly what happened.


Decision 10: Monitor PRs and Capture Learnings

Imperative bots typically run a PR comment poller that fires every few minutes, scanning for mentions. I replaced it with a task inside the implement job that lives as long as the PR stays open:

The monitor-pr task:

  1. Polls for new PR review comments every 2 minutes
  2. Feeds each new comment to Claude, applies the change, commits, and pushes
  3. Replies on the PR confirming the fix
  4. Tracks processed comment IDs in $WS/.processed_comments to avoid re-processing
  5. Exits when the PR merges or closes
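The steps above reduce to a small loop. The sketch below is a hypothetical Python rendering with the GitHub and Claude calls injected as plain functions (`fetch_comments`, `handle_comment`, `pr_is_open`), so it shows only the control flow and the processed-ID bookkeeping. It is not the actual task script.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Set


def load_processed(path: Path) -> Set[str]:
    """Read previously handled comment IDs, one per line."""
    return set(path.read_text().split()) if path.exists() else set()


def monitor_pr(
    fetch_comments: Callable[[], Iterable[Dict]],
    handle_comment: Callable[[Dict], None],
    pr_is_open: Callable[[], bool],
    processed_path: Path,
    poll_seconds: float = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the PR closes; handle each comment exactly once."""
    processed = load_processed(processed_path)
    handled = 0
    while pr_is_open():
        for comment in fetch_comments():
            comment_id = str(comment["id"])
            if comment_id in processed:
                continue  # already applied in an earlier poll or earlier run
            handle_comment(comment)  # apply change, commit, push, reply
            processed.add(comment_id)
            # Persist immediately so a restart does not re-process it.
            with processed_path.open("a") as f:
                f.write(comment_id + "\n")
            handled += 1
        sleep(poll_seconds)
    return handled
```

Persisting each ID as soon as it is handled is what makes the task safe to restart: a retried monitor-pr picks up where the last one stopped.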

The learn task runs after the PR closes. It reviews all PR comments, reviewer feedback, and the implementation conversation, then writes a structured learning entry to ~/claude_workspace/learn_context/ using the ygs-learn skill methodology: what went well, what to improve, patterns to remember for this codebase. Over time the agent gets better at this specific repo, not just better in general.

- task_type: monitor-pr
  method: SHELL
  timeout: 24h

- task_type: learn
  method: SHELL
  # reviews PR feedback, writes to ~/claude_workspace/learn_context/

Decision 11: Support Multiple Trackers with Minimal Changes

The pipeline is intentionally tracker-agnostic. Only three tasks touch the issue tracker API: gather-issues in the picker, plus create-pr and monitor-pr in the implement job. Everything else (plan, implement, unit-test, self-verify, learn) works identically regardless of tracker.

To support Jira and Bitbucket, I cloned the YAML files and swapped six commands:

  • gh issue list -> acli jira search --jql ...
  • gh issue edit -> acli jira issue update
  • git clone git@github.com: -> git clone git@bitbucket.org:
  • gh pr create -> acli bitbucket pr create
  • gh pr view -> acli bitbucket pr get
  • gh api .../comments -> acli bitbucket pr comment list

Result: ai-jira-issue-picker.yaml and ai-jira-implement.yaml, the same complete pipeline, different API calls. Both use the Atlassian CLI (acli) configured at ~/.config/acli/config.json.


What Formicary Gives You Without Writing a Line

When I started applying Formicary to agentic coding, I wasn’t sure it had everything needed. It had almost all of it already:

  • Cron scheduling: 7-field syntax (including seconds)
  • Per-task timeouts: the feature imperative bots most consistently lack
  • Exit-code routing (on_exit_code): conditional DAG without custom code
  • always_run: true: guaranteed cleanup regardless of failure
  • Artifact passing: between tasks via S3
  • Encrypted secrets: with automatic log redaction
  • max_concurrency: capacity management declared in YAML
  • retry + delay_between_retries: automatic backoff
  • Go template functions: variable substitution in scripts
  • SHELL executor: runs on a laptop with no Kubernetes
  • KUBERNETES executor: production-grade pod-per-task isolation
  • Markdown in job descriptions: visible, clickable in the dashboard

Two additions were made specifically for this use case.

Native Kubernetes secret injection. The naive pattern passes API keys through the orchestrator as template variables, which stores them in the job definition. The new pattern lets the kubelet inject them at pod start time, so the value never touches Formicary:

container:
  image: ghcr.io/formicary-ai/agent-worker:latest
  env_from:
    - secret_ref: claude-bedrock-settings
    - secret_ref: ai-agent-secrets

Or for a single named key:

container:
  env_value_from:
    - name: ANTHROPIC_API_KEY
      secret_name: ai-agent-secrets
      key: anthropic-api-key

Per-task service accounts work the same way for IRSA on AWS or Workload Identity on GCP:

container:
  service_account: ai-agent-irsa-sa

CountByJobTypeAndState template function. The original capacity check made an HTTP API call requiring a token, an available endpoint, and network round-trip time. The new function queries the job database directly at the scheduler level before any worker is allocated:

skip_if: >-
  {{if ge (CountByJobTypeAndState "ai-gh-implement" "PENDING" "EXECUTING") 10}} true {{end}}

If the count hits the threshold, Formicary skips the entire job invocation with zero cost. The script also does a fine-grained check using the configurable MaxPendingJobs variable. Two layers: cheap early termination at the scheduler, tunable limits inside the task.


The Numbers

| Metric | Imperative Bot | Formicary Declarative |
|---|---|---|
| Lines of orchestration code | ~50,000 LOC | ~700 lines YAML |
| State machine states | 12+ | 0 (implicit in DAG) |
| Custom pollers | Multiple | 0 |
| Per-phase timeouts | None | Yes, per-task |
| Context between phases | Lost (new pod, new session) | Preserved via artifact chain |
| Runs locally without K8s | No | Yes (SHELL executor) |
| K8s isolation in production | Pod-per-job | Pod-per-task |
| Time to add a new phase | Hours to days | Minutes (copy task block, change prompt) |
| Restart safety | Branch conflicts | Nonce-based, no conflicts |
| Real-time output | Text logs only | Dashboard streaming + tee |
| Diagnostics on failure | Text logs | Full AI conversations + diffs as artifacts |
| Capacity check cost | HTTP API call | DB query at scheduler level |
| Verification | Limited | unit-test + self-verify (separate AI session) |
| Multi-tracker | One tracker hardcoded | Clone YAML, swap 6 commands |
| Continuous learning | None | learn task after every PR close |
| Secret injection | Env vars on host | Native Kubernetes env_from / env_value_from |

Getting Started

Option A: SHELL executor (local dev, fastest path)

This is where to start. The SHELL executor runs scripts directly on the host and inherits ~/.claude/settings.json, gh auth login, and all other host credentials automatically, so no secrets configuration is needed.

# 1. Prerequisites (one-time)
npm install -g @anthropic-ai/claude-code
gh auth login

# 2. Start Formicary (queen + embedded ant worker)
docker pull plexobject/formicary
docker run plexobject/formicary

# 3. Deploy workflow definitions
git clone https://github.com/bhatti/formicary.git
cd formicary/docs/examples
./deploy-ai-workflows.sh --mode shell --repo your-org/your-repo --setup-labels

# 4. Set org config so the picker knows where to look
curl -X POST http://localhost:7777/api/orgs/default/configs \
  -H 'Content-Type: application/json' \
  -d '{"name":"GitHubOrg","value":"your-org"}'
curl -X POST http://localhost:7777/api/orgs/default/configs \
  -H 'Content-Type: application/json' \
  -d '{"name":"GitHubRepo","value":"your-repo"}'

# 5. Label an issue — the picker fires within 1 minute
gh issue edit 1 --repo your-org/your-repo --add-label "ai-ready"

# 6. Watch it run
open http://localhost:7777

Option B: Kubernetes with Bedrock via Tailscale

Pods can’t resolve Tailscale hostnames by name, but they can reach the IP. Resolve it once:

TAILSCALE_IP=$(python3 -c "import socket; print(socket.gethostbyname('ai'))")

kubectl create namespace formicary-ai

kubectl create secret generic claude-bedrock-settings \
  --namespace=formicary-ai \
  --from-literal=ANTHROPIC_BEDROCK_BASE_URL=http://${TAILSCALE_IP}/bedrock \
  --from-literal=CLAUDE_CODE_USE_BEDROCK=1 \
  --from-literal=CLAUDE_CODE_SKIP_BEDROCK_AUTH=1 \
  --from-literal=ANTHROPIC_DEFAULT_OPUS_MODEL=us.anthropic.claude-opus-4-6-v1 \
  --from-literal=ANTHROPIC_DEFAULT_SONNET_MODEL=us.anthropic.claude-sonnet-4-6 \
  --from-literal=ANTHROPIC_DEFAULT_HAIKU_MODEL=us.anthropic.claude-haiku-4-5-20251001-v1:0

kubectl create secret generic ai-agent-secrets \
  --namespace=formicary-ai \
  --from-literal=github-token=$(gh auth token)

If the Tailscale IP changes, regenerate the secret: re-run the kubectl create secret command above with --dry-run=client -o yaml appended, and pipe the output to kubectl apply -f -.

Option C: Standard Anthropic API key

kubectl create secret generic ai-agent-secrets \
  --from-literal=anthropic-api-key=sk-ant-... \
  --from-literal=github-token=$(gh auth token)

Job YAMLs reference it with env_value_from, so the key is injected by the kubelet and never passes through Formicary.


Ten Lessons

  • Timeouts are not optional. AI models hang. Give every phase its own timeout. A global job timeout is not a substitute: when the plan phase hangs, you want to retry that phase, not restart the whole job from scratch.
  • Structured JSON output unlocks routing. Ask the AI to output {"status": "DONE|BLOCKED|TESTS_FAILING", ...} on its final line. Route on that field. Extract metadata for dashboards.
  • Flow context forward. If planning and implementation run in separate sessions with no shared artifacts, the implementer re-explores the entire codebase and discards all planning work. Pass PLAN.md as an artifact. Cost and quality both improve.
  • Use nonces for idempotency. Branch names, workspace paths, and artifact names all need a per-run nonce. Never reuse a name across retry attempts.
  • Guarantee cleanup. Set always_run: true on cleanup tasks. Workspaces and branches accumulate fast. One stuck job should not leave garbage forever.
  • Let the orchestrator manage capacity. Set max_concurrency on the job and use skip_if with a scheduler-level DB query. Don’t write custom capacity management code; it will be wrong.
  • Skills are the real leverage. The quality gap between freeform prompting and methodology-encoded prompting is large. Invest in skill definitions. The skill is a contract: “plan vertically, commit atomically, stop when blocked.” Consistent contracts produce consistent, reviewable output. I covered this in depth in AI Writes Code. You Own the Design.
  • Declarative wins operationally. Adding a security review phase to the declarative version takes minutes: copy a task block, write a prompt, add an on_completed route. The same change to an imperative system takes days. The asymmetry grows with every phase you add.
  • Capture everything on failure. Upload artifacts with when: always. When something fails, you want the full AI conversation, the git diff, and the test output — not just “job failed.”
  • Build a feedback loop. Most AI coding systems run, merge, and forget. The learn task after every PR close gives the agent a memory of what works and what doesn’t in this specific codebase. Over time, that compounds.
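
Two of these lessons, routing on a structured status and per-run nonces, are easy to sketch in code. The following is a minimal Rust illustration, not Formicary's implementation: the three status values come from the lesson above, while the task names (open-pr, notify-human, fix-tests) and the branch naming scheme are hypothetical, and the status string is assumed to have already been extracted from the agent's final JSON line.

```rust
use std::str::FromStr;

/// Final status reported by the agent. The variants mirror the
/// DONE | BLOCKED | TESTS_FAILING values from the lesson above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentStatus {
    Done,
    Blocked,
    TestsFailing,
}

impl FromStr for AgentStatus {
    type Err = String;

    /// Anything outside the three known values is an explicit error,
    /// so an unexpected status can never be routed by accident.
    fn from_str(raw: &str) -> Result<Self, Self::Err> {
        match raw.trim() {
            "DONE" => Ok(AgentStatus::Done),
            "BLOCKED" => Ok(AgentStatus::Blocked),
            "TESTS_FAILING" => Ok(AgentStatus::TestsFailing),
            other => Err(format!("unknown agent status: {}", other)),
        }
    }
}

/// Picks the next task for a status. Task names are hypothetical.
fn next_task(status: AgentStatus) -> &'static str {
    match status {
        AgentStatus::Done => "open-pr",
        AgentStatus::Blocked => "notify-human",
        AgentStatus::TestsFailing => "fix-tests",
    }
}

/// Builds a branch name that includes a per-run nonce, so a retry
/// never collides with a branch left behind by an earlier attempt.
/// The "ai/issue-N-NONCE" scheme is an assumption for illustration.
fn branch_name(issue: u32, nonce: &str) -> String {
    format!("ai/issue-{}-{}", issue, nonce)
}
```

Because the match in next_task has no wildcard arm, adding a fourth status later forces every routing decision to be revisited at compile time.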

References

The job definitions described in this post are in docs/examples/ in the Formicary repository. See docs/ai-agents.md for the full setup guide.

May 26, 2026

The Complexity Trap: Why Simple, Bug-Free Systems Can Hurt Your Career

Filed under: Computing — admin @ 10:06 pm

I have worked for both large tech companies and startups. Two patterns kept showing up at every company I worked at, startup and large company alike, and both punish the engineers doing the right thing.

At startups, the pressure is entirely on shipping features. Engineers who move fast and ship constantly get rewarded. Security, observability, scalability become “future problems.” The engineers who slow down to build things properly, who push back on cutting corners, get treated as obstacles. The corners get cut anyway. When the system eventually breaks under load or gets breached, nobody connects it back to the decisions made two years earlier. The engineers who raised concerns are long gone or drowned out.

At large companies, the trap is different. Ship something clean, with a simple design, a solid implementation, and few follow-up bugs, and people move on. Nobody notices the problems that didn’t happen. Nobody gets promoted for the outages that never occurred. But ship something overengineered, watch it fall apart in production, spend months firefighting, and suddenly you’re a hero. The tech lead who pushed patches at 2am gets noticed. Management reads the complexity as evidence of a hard problem solved. The tech lead gets promoted and moves to the next team. The engineers left behind inherit the mess.

Same outcome, different path. In both cases, the engineers who built things well are invisible. The ones who created the problems or thrived on them get ahead.


Essential vs. Accidental Complexity

In “No Silver Bullet,” the essay later included in the anniversary edition of The Mythical Man-Month, Fred Brooks distinguished two kinds of complexity. Essential complexity is the irreducible difficulty built into the problem domain itself. Accidental complexity is the difficulty we add through poor abstractions, unnecessary coupling, and artificial layers. Larry Tesler’s Law of Conservation of Complexity says essential complexity can’t be eliminated, only moved. Push it out of the user interface and it lands in your middleware.

What most companies reward is the accidental kind. Many moving parts, multiple failure modes, a fleet of services with their own deployment pipelines: these look like a hard problem solved by smart engineers. A system that just works, simply and reliably, signals nothing. The people who built it must have been working on something easy. I saw this repeatedly at larger companies. Senior engineers with years of incremental, principled improvements couldn’t get promoted because their work wasn’t considered “complex enough.” The implicit rule was clear: elegance doesn’t get you promoted.


War Stories

The database migration that became a platform. At a large tech company, we needed a simple migration from one database to another, but it turned into a real-time data synchronization system. Suddenly there were shadow testing components, reconciliation pipelines, anti-entropy jobs for fixing discrepancies, and runbooks for each failure mode. The project stretched from months into years. The original problem, move data from A to B, never required any of it. But the complexity generated headcount, resources, and career advancement that a clean migration would never have produced.

The microservices migration that never finished. A monolith-to-microservices transition ran so long the team ended up maintaining both systems simultaneously. The migration date kept slipping. Nobody could tell you which services were fully cut over. The codebase became a graveyard of abandoned halfway points. Years of engineering time consumed, several promotions justified. The engineers who eventually inherited it had no idea what was intentional and what was just never cleaned up.

The Erlang rewrite. At a FinTech company, a senior executive decided to rewrite an order management system from Java to Erlang, not for a specific technical reason, but because Erlang was interesting. Brooks called this the second-system effect: when engineers rewrite something they think they now understand, they pile in everything they held back the first time. The effort was far larger than anyone expected. Management abandoned it partway through. The team was left with two halves of the same system in two different languages, domain knowledge split across both.

The Go rewrite. Years later, the same executive decided to rewrite a Java financial system in Go because Go was what the industry was talking about. Years passed and the migration stalled, with some parts in Go and most still in Java. The team gave up. Meanwhile, the actual urgent problems like data consistency, observability, and performance at scale went unaddressed because everyone’s attention was on the rewrite. Nobody owned the full picture of dependencies or understood the consistency guarantees. Sales kept selling the system as low-latency with four-nines availability, but in practice that was an illusion, sustained by poor observability.

The postscript at that second company: when AI became the new shiny thing, the pattern played out again. Engineers who built flashy demos got promoted. The people fixing real infrastructure problems had nothing visible to show.


Conceptual Integrity Breaks Down as Organizations Grow

In the original Mythical Man-Month, Brooks argued that the most important property a system can have is conceptual integrity: one coherent design philosophy, with someone who holds the whole system in mind and says no to things that don’t fit. His prescription was a chief architect with real authority over what goes in and what stays out. That works when one person can still comprehend the system. As organizations grow and systems get divided among teams, nobody has that view anymore. Each team makes locally reasonable decisions. Accidental complexity accumulates not from individual mistakes but from the disconnect between groups who can’t see each other’s work.

Cross-cutting concerns like security, authentication, and observability are where this gets dangerous fastest. I saw one system where authentication behaved differently depending on whether you were on-premises or in the cloud, and whether you were hitting the control plane or data plane. Secrets in some places, JWTs in others, config files in some environments, environment variables in others, and a wall of conditional logic tying it together. No single person understood the whole thing. That mess led to a significant security breach and customer churn. Nobody designed it. It grew, one locally reasonable decision at a time.


Two Different Failure Modes

Startups and large companies both get this wrong, but for opposite reasons.

Startups are under pressure to ship customer-facing features. Security, observability, performance, operational burden become “future problems.” Sometimes that’s the right call. A startup that dies building the perfect architecture ships nothing. But the technical debt from ignored non-functionals doesn’t disappear. It accumulates, and it usually arrives all at once right when the company is trying to scale. That’s the worst possible time to deal with it.

Large companies have the opposite problem. The incentive structure rewards visible complexity. Tech leads propose ambitious architectures, staff up around them, ship something complicated, and move to the next team before the consequences mature. The engineers who inherit the system didn’t choose the design, can’t fully explain it, and can’t safely simplify it because they don’t understand what each piece is actually doing.

In both cases, the people who make the architectural decisions aren’t around to live with them. That gap between decision and consequence is the core of the problem.


The Goldilocks Principle

The approach that actually works is simpler than it sounds: start with the least complex architecture that handles the real requirements. Add complexity only when something forces you to.

This is not simplicity for its own sake: if the domain genuinely requires distributed coordination, the design should say so. But the default should be: prove the complexity is necessary before building it. “This is how I’ve seen it done at bigger companies” and “this technology is interesting” are not justifications. Neither is designing for scale you don’t have. I’ve watched teams build for ten million users when they had ten thousand, then spend two years maintaining infrastructure that served no real requirement.

Vertical slices enforce this discipline. When you ship thin, end-to-end cuts of real functionality that a user can actually touch, you find out fast whether your design is right. The feedback loop is short. A wrong assumption costs a week, not six months. You can correct before the mistake becomes load-bearing.


AI Accelerates This Problem

With tools like Claude Code and Cursor, the implementation bottleneck is largely gone. A team using AI assistants can build a distributed system with five services in the time it used to take to build one. That’s progress if the design is right. If the incentive structure still rewards accidental complexity, AI just produces it faster.

In When Copying Kills Innovation: My Journey Through Software’s Cargo Cult Problem, I described cargo-cult behavior such as adding components because they look sophisticated. That behavior happens at higher velocity now. An AI agent given a vague prompt and no design constraints defaults to patterns common in its training data. That means microservices when a monolith would do, event buses when a direct call would do, five abstractions where two would do.

As I wrote in AI Writes Code. You Own the Design., the thinking parts, the what and the why, can’t be delegated to an agent. AI handles the how. Engineers who can identify essential complexity, strip the accidental kind, and hold a design together are more valuable now than before. But only if the organization’s reward structure reflects that.


How Do You Fix the Reward Structure?

I don’t have a clean answer. But here’s where the levers are.

  • Reward outcomes, not artifacts. Most promotion processes credit visible artifacts: the design doc for a complex system, the heroic incident response, the fleet of services owned. The outcomes that actually matter (a system that stayed up for two years, a migration that finished in six weeks, a design that five new engineers understood on day one) are harder to see and usually go uncredited. Engineering leaders have to explicitly define what good engineering looks like and measure it over time horizons long enough to see consequences.
  • Make accountability follow decisions. Connect tech leads to the consequences of their architectural choices twelve to eighteen months later. Not as punishment, since designs also fail for unforeseeable reasons. But an engineer who never sees what their decisions cost never updates their model. Right now the feedback loop doesn’t exist for most people who make these calls.
  • Credit the “no.” The engineers who prevent bad architectures from being built are the hardest to recognize. The bad system was never built, so there’s nothing to point to. If you want more of this behavior, name it explicitly and credit it explicitly. Otherwise the rational move for any ambitious engineer is to propose the complex thing and let someone else clean it up.
  • Add a simplicity lens to design reviews. Most design reviews ask: will this work? Fewer ask: is this more complex than it needs to be? Formally asking “what would we remove without losing essential functionality?” changes the conversation. The burden of proof shifts to adding a component, not removing one.

The Conversation Worth Having

Brooks wrote that conceptual integrity is the most important consideration in system design. What the book doesn’t address is that most organizations are structured to undermine it: they reward the engineers who add complexity and move them on before they face the consequences. The engineers who hold the line against unnecessary moving parts, who ship systems that work quietly for years, who say “we don’t need this” and mean it, are doing some of the hardest work in software. In most companies, they’re not the ones getting promoted.

With AI accelerating the implementation layer, the judgment required to distinguish essential from accidental complexity matters more than it ever has. If the reward structure doesn’t change to reflect that, we’ll just build the wrong things faster.


May 19, 2026

AI Writes Code. You Own the Design. Here’s How to Keep It That Way

Filed under: Computing,Methodologies — admin @ 9:48 pm

The Eternal Quest to Make Coding Simpler

I wrote my first program in BASIC on an Atari in the 1980s with line numbers, GOTOs, no debugger. Turbo Pascal changed everything: integrated editing, instant compilation, step-through debugging. Then Borland C++, then Visual Basic, then Eclipse, then IntelliJ. This pattern, where a new tool arrives, productivity jumps, and complexity catches up, has repeated itself every few years across my entire three-decade career.

In the early 1990s, 4GL tools promised to eliminate coding entirely. dBase, FoxPro, PowerBuilder — the pitch was always the same: “Business users can build their own applications.” Simple CRUD apps were easy. Real systems with business logic, error handling, and concurrent users turned out harder than writing code from scratch. UML consumed the next decade. I spent years with Rational Rose doing forward and backward engineering from class diagrams. The generated code was rigid scaffolding that fought you. Diagrams drifted from reality within weeks, because maintaining two representations of the same truth is inherently unsustainable.

The lesson I keep relearning: every attempt to separate “what to build” from “how to build it” through tooling alone produces rigid, brittle systems. The gap between specification and implementation is a thinking problem. Tools that hide it make things worse.


The AI Inflection Point

Around 2020, I started using GitHub Copilot for autocomplete. ChatGPT and Claude helped with isolated problems such as boilerplate and algorithm refreshers. Useful but incremental. Then Claude Code arrived in early 2025, and everything changed. I’ve used it for 100% of my coding for over a year, not as autocomplete but as a full development partner: architecture, implementation, testing, debugging, deployment. The productivity gains are real. The failure modes are real too. Amazon AWS teams learned this the hard way: AI-generated code looked right, passed superficial review, then caused production incidents. Their response was to tighten review policies significantly. I’ve seen the same pattern repeatedly: AI ships code that introduces subtle bugs in unfamiliar codebases, silently violates domain invariants, or creates architectural inconsistencies that compound over weeks. The problem isn’t that AI writes bad code. It writes locally correct code that doesn’t fit the bigger picture.


The Memento Problem

People compare AI coding agents to interns. That analogy breaks in one critical way: AI agents suffer from anterograde amnesia. Like the protagonist in Memento, every session starts from zero. An intern who made a mistake yesterday remembers it today. They build mental models of your codebase and internalize conventions through repetition. An AI agent? Session ends, memory gone. Tomorrow it will make the exact same architectural mistake, violate the same naming convention, choose the same wrong abstraction. It doesn’t learn from correction; it only learns from the context provided in each session.

This is why rules, conventions, and structured knowledge aren’t optional nice-to-haves for AI-assisted development. They’re the equivalent of Leonard’s tattoos and photographs: the external memory system that makes coherent action possible despite the inability to form new long-term memories. I built these skills because I got tired of repeating the same corrections. Every session I found myself saying “no, we use Result types here, not exceptions” or “no, that should be a sum type” or “no, you need an idempotency token on that create endpoint.” The skills encode these corrections permanently so I stop repeating myself.

The Outsourcing Parallel

Every offshore engagement I’ve run hit the same wall: limited overlap hours, different definitions of ‘done,’ and a gap between what I envisioned and what arrived. Formal process wasn’t optional; it was the only thing that worked. The teams that succeeded had detailed specs, explicit acceptance criteria, structured handoffs, and review gates. The teams that failed relied on “they’ll figure it out” and got back code that met the requirements only on the surface. That need for formal process also spawned CMM, RUP, and Six Sigma, frameworks so heavy that the documentation cost exceeded its value. Agile won because it recognized that lightweight, iterative feedback loops beat heavyweight upfront specification for teams with high-bandwidth communication.

AI agents resemble outsourced teams more than co-located colleagues. They have a narrow context window — like limited overlap hours across time zones. They lack shared understanding of your codebase. They produce locally correct work that misses the bigger picture. The lesson from outsourcing holds: formal process works when communication bandwidth is constrained. These skills apply that lesson with minimum ceremony — just enough structure to preserve conceptual integrity across sessions, without recreating the documentation burden that killed RUP.

Production agent systems need tiered memory: short-term (current session), medium-term (project conventions), and long-term (organizational knowledge). These skills are the middle tier, project-level knowledge that persists across sessions without requiring permanent documentation. They’re the bridge between ephemeral conversation and hard-coded policy.


Conceptual Integrity in the Age of AI

Fred Brooks wrote this in The Mythical Man-Month (1975). Martin Fowler recently reminded us it’s never been more relevant:

“I will contend that conceptual integrity is the most important consideration in system design. It is better to have a system omit certain anomalous features and improvements, but to reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas.”

When an AI agent generates code, it produces locally correct solutions: the function works, the test passes, the API responds. But without conceptual integrity, each generated piece reflects a different design philosophy. One module uses exceptions, another uses Result types. One endpoint follows REST conventions, another doesn’t. One service uses the outbox pattern for events, another dual-writes to the database and message queue. Over time, the codebase becomes exactly what Brooks feared: “many good but independent and uncoordinated ideas.”

Code serves two purposes: machine instructions and conceptual modeling. AI commoditizes the first. The second, the model that captures how your domain actually works, remains yours to own. Generate code 10x faster without protecting that model, and you get systems 2x harder to maintain. Spec-driven development frameworks like OpenSpec and Spec-Kit push toward treating prompts as first-class delivery artifacts, versioned, reviewed, maintained alongside code. That’s the gap these skills fill. They encode conceptual integrity, design philosophy, conventions, quality standards into reusable artifacts that survive across sessions.


What You Own vs. What AI Owns

“We adopted AI coding but it hasn’t increased revenue.” Of course not. AI doesn’t solve what to build, it accelerates how to build it. You still need product/market fit, customer feedback, and domain expertise. More importantly: when AI causes a security incident or production outage, you can’t fire it. You’re accountable. Here’s the ownership boundary I enforce:

You Own | AI Accelerates
What to build (product vision) | How to build it (implementation)
Why it matters (business context) | Boilerplate and mechanical translation
Quality standards and conventions | Applying those standards consistently
Architecture decisions | Exploring design alternatives quickly
Security posture | Checking against known vulnerability patterns
Production accountability | Monitoring, alerting, runbook generation
Domain knowledge | Translating that knowledge into code

The skills encode this boundary explicitly: you drive the what and why; AI executes the how within guardrails you define. Every skill in the set reinforces this split.


Why Formalized SDLC Works Better with AI

I’ve worked in both worlds: big-company SDLC with architecture reviews, security reviews, production readiness checklists and startups where you discuss an idea over coffee and ship by afternoon. AI works better with the formalized approach. The reason is the same one that sank outsourcing arrangements with vague requirements: if you can’t state precisely what you want, the other party fills gaps with assumptions. Here’s why structure helps specifically with AI:

  • Structure gives AI context. A well-written PRD tells the agent why it’s building something, what constraints matter, which edge cases to handle. Without this, AI fills gaps with assumptions from training data, which may not match your domain.
  • Checkpoints catch drift early. When AI generates 800 lines in one session, reviewing it as a monolithic diff is overwhelming. I learned this the hard way. Now I break work into smaller tasks and enforce checkpoints every 5 files where build and test must pass before proceeding. Small, verified increments compound into reliable systems.
  • Conventions reduce error surface. When you explicitly state “use Result types for errors, never exceptions” and “all IDs are ULIDs, never UUIDs,” AI follows them. Without explicit conventions, it defaults to whatever was most common in training data, which varies wildly by context.
  • Smaller increments compound. AI excels at small, well-defined tasks with clear acceptance criteria. This isn’t new wisdom: vertical slicing and thin end-to-end increments have been SDLC best practice for decades. What’s good for human developers turns out to be good for AI too.
  • Sloppy codebases amplify AI mistakes. In clean, well-structured code with clear module boundaries, AI makes fewer errors. It can hold the relevant context. In sprawling, inconsistent codebases with 2000-line files and mixed conventions, AI hallucinates patterns, mixes styles, and creates subtle inconsistencies. Well-structured code isn’t just readable for humans, it’s how AI holds context without drifting.

The Skills: A Structured SDLC for AI-Assisted Development

Here’s the full lifecycle, with each phase mapped to a skill and the key lessons that shaped it:


Phase 1: Requirements Refinement (/ygs-refine-prd)

I’ve watched AI build the wrong thing fast more times than I can count. The root cause is always the same: vague requirements. When I tell an agent “build a notification system,” it picks a design based on training data patterns. When I tell it “build a notification system that MUST deliver within 500ms for P0 alerts, SHOULD batch P2 notifications into hourly digests, and MAY support user-defined routing rules,” it builds something specific and testable. The refine-prd skill forces this precision through structured questioning. It interviews me relentlessly: one question at a time, providing its recommended answer, waiting for my feedback before continuing. It challenges vague language: “fast means what: 100ms? 1 second? Faster than the current system?” It pushes me to define concrete scenarios with Given/When/Then acceptance criteria borrowed from OpenSpec.

Key lessons encoded:

  • RFC 2119 keywords force commitment. Labeling requirements as MUST (P0), SHOULD (P1), or MAY (P2) prevents the “everything is critical” trap. I’ve seen projects fail because nobody ranked requirements, so the team optimized for P2 features while P0 requirements remained unmet.
  • Capabilities mapping reveals brownfield complexity. Categorizing changes as New/Modified/Removed surfaces the reality that most “new features” actually modify existing behavior, which is always harder than greenfield and needs different estimation.
  • Non-goals prevent scope creep. Explicitly stating what you will NOT build is as important as defining what you will. Without non-goals, AI treats every tangent as in-scope.

This is where you own the what. The AI sharpens your thinking, but the product decisions stay yours.
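
To make the MUST/SHOULD/MAY ranking concrete, here is a minimal Rust sketch of how ranked requirements could be checked mechanically. It is an illustration under stated assumptions rather than part of the refine-prd skill: the Requirement struct, its fields, and both helper functions are hypothetical names.

```rust
/// RFC 2119 keyword attached to a requirement. Declaration order
/// doubles as priority order: Must (P0) < Should (P1) < May (P2).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Keyword {
    Must,
    Should,
    May,
}

/// Hypothetical requirement record, for illustration only.
#[derive(Debug, Clone)]
struct Requirement {
    id: &'static str,
    keyword: Keyword,
    met: bool,
}

/// Returns the ids of unmet MUST requirements, the only ones that
/// should block a release in this sketch.
fn release_blockers(reqs: &[Requirement]) -> Vec<&'static str> {
    reqs.iter()
        .filter(|r| r.keyword == Keyword::Must && !r.met)
        .map(|r| r.id)
        .collect()
}

/// Orders work so every MUST comes before any SHOULD or MAY.
/// sort_by_key is stable, so ties keep their input order.
fn by_priority(mut reqs: Vec<Requirement>) -> Vec<Requirement> {
    reqs.sort_by_key(|r| r.keyword);
    reqs
}
```

With the ranking encoded as data, a team that polishes MAY features while a MUST is still unmet shows up as a non-empty release_blockers result.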


Phase 2: Technical Design (/ygs-refine-trd)

Without a technical design document, AI makes architectural decisions implicitly, and they’re often wrong. I watched an agent choose microservices for a problem that needed a single process with good module boundaries. Another time it introduced an event bus between components that were always co-located and synchronous. Both were “correct” patterns applied to the wrong contexts. The refine-trd skill challenges my technical approach through structured questioning, then produces a design document with explicit trade-off analysis and requirements traceability: every design decision maps back to a PRD requirement with rationale. For larger efforts spanning multiple components, I use a comprehensive design doc template that I previously shared on my blog. It covers the full lifecycle: from problem statement through architecture, alternatives analysis, non-functional requirements, rollout plan, and inline ADRs recording every key decision with its rationale and reversibility.

Making Invalid States Impossible

The most powerful design tool isn’t testing, it’s the type system. When I rebuilt a Rust observability pipeline around algebraic data types and explicit state machines, entire bug categories became impossible to write:

  • Sum types enumerate valid states explicitly. I can’t accidentally process a Pending message as if it were Confirmed because the compiler won’t let me.
  • Typestate pattern encodes valid transitions in the type system. A Draft document can move to Review or Deleted, but never directly to Published. Invalid sequences are compile errors, not runtime bugs.
  • Parse, don’t validate transforms unstructured input at boundaries into strongly-typed domain objects. Once parsed, code trusts the types internally without defensive null checks scattered through business logic.
  • Errors as values using Result<T, E> types cannot be silently ignored. Compare this to exceptions that propagate invisibly through 14 stack frames before someone catches them with an empty catch block.
  • Functional core, imperative shell separates pure domain logic from I/O orchestration. The domain code is trivially testable because it has no side effects. The shell is thin and mechanical.

These principles matter enormously for AI-generated code because the compiler becomes your reviewer. When AI generates code within a well-typed system, category errors that would slip through human review become impossible to express.
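
The typestate bullet above can be sketched in a few lines of Rust. This is a minimal illustration, not code from the pipeline: the `Document` type, the marker structs, and the method names are all hypothetical, and the rule that `publish` is reachable only from `Review` is an assumption extrapolated from the Draft/Review/Published example.

```rust
use std::marker::PhantomData;

// State markers (hypothetical names mirroring the Draft/Review/Published example).
pub struct Draft;
pub struct Review;
pub struct Published;
pub struct Deleted;

// A document whose lifecycle state is part of its type.
pub struct Document<State> {
    body: String,
    _state: PhantomData<State>,
}

impl<State> Document<State> {
    pub fn body(&self) -> &str {
        &self.body
    }

    // Private helper: moves the body into the next state.
    fn into_state<Next>(self) -> Document<Next> {
        Document { body: self.body, _state: PhantomData }
    }
}

impl Document<Draft> {
    pub fn new(body: &str) -> Self {
        Document { body: body.to_string(), _state: PhantomData }
    }

    // A Draft can move to Review or Deleted, and nowhere else.
    pub fn submit(self) -> Document<Review> {
        self.into_state()
    }

    pub fn delete(self) -> Document<Deleted> {
        self.into_state()
    }
}

impl Document<Review> {
    // Assumption: publishing is only reachable from Review.
    pub fn publish(self) -> Document<Published> {
        self.into_state()
    }
}
```

With these definitions, `Document::<Draft>::new("notes").submit().publish()` type-checks, while calling `publish()` directly on a `Document<Draft>` fails to compile because no such method exists for that state. Each transition consumes `self`, so a stale handle to the old state cannot be reused either.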

Deep Modules Over Shallow

AI defaults to shallow modules: lots of small classes, each delegating to the next without adding value. A Philosophy of Software Design encourages the opposite: modules with small interfaces and rich implementations. I’ve reviewed too many codebases where every class has an interface, every interface has one implementation, and understanding a feature requires bouncing through 15 files. The deletion test cuts through this: imagine deleting the module. If complexity vanishes, it was a pass-through adding nothing but indirection. If complexity reappears across N callers, it was earning its keep. I apply this ruthlessly now. One adapter means a hypothetical seam. Two adapters mean a real one. Don’t build seams speculatively.

Cognitive Load as Design Constraint

Three constraints keep AI-generated functions reviewable:

  • Methods stay under 24 lines. Working memory holds 4-7 chunks; code exceeding this becomes unmanageable regardless of how “clean” it looks.
  • No more than 7 concepts in a section. If I need a comment to explain what a block does, it should be a function with that name instead.
  • Fractal decomposition. Each level hides details while allowing drill-down. The system is comprehensible at every zoom level.

AI agents benefit from these constraints more than humans do. A function under 24 lines fits entirely in the context window. A deep module with a small interface can be understood without reading its implementation. Clean structure gives AI less opportunity to hallucinate.


Phase 3: Architecture (/ygs-refine-architecture)

For changes spanning multiple components, I use architecture refinement to capture system-level decisions that no single PR review can validate. The skill interviews me about module boundaries, seam placement, data flow, and failure modes, challenging shallow designs and pushing for depth. Three hard lessons shape every distributed system I design:

  • Transaction Boundaries Drive Architecture: I learned this lesson the expensive way: atomicity requirements dictate service boundaries, not the other way around. Teams that draw service boundaries first and then try to maintain consistency across them end up with distributed transactions, eventual consistency bugs, and data loss scenarios that take months to resolve.
  • The dual-write problem is the #1 source of data inconsistency I’ve encountered in microservice architectures. Writing to a database and publishing an event in separate operations means either can succeed while the other fails — leaving your system in an inconsistent state. The outbox pattern solves this: write the event to an outbox table in the same database transaction, then relay it asynchronously. Simple, reliable, non-negotiable for any system I design now.
  • For operations spanning multiple services, SAGA with explicit compensation replaces distributed transactions. Each step has a defined undo operation. When step 4 of 6 fails, steps 3, 2, and 1 execute their compensating actions. The key insight: design compensation logic before the happy path, because it’s always harder than you think.
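
The compensation rule from the SAGA bullet can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: `Step`, `run_saga`, and the shared `log` are hypothetical names, steps are plain function pointers, and compensations are assumed never to fail (a real saga would need to persist progress and retry compensations).

```rust
// One saga step: a forward action plus the compensation that undoes it.
// Both receive a shared log so the order of effects is observable.
pub struct Step {
    pub name: &'static str,
    pub action: fn(&mut Vec<String>) -> Result<(), String>,
    pub compensate: fn(&mut Vec<String>),
}

// Runs the steps in order. If step i fails, the steps before it are
// compensated in reverse order and the error is returned.
pub fn run_saga(steps: &[Step], log: &mut Vec<String>) -> Result<(), String> {
    for (i, step) in steps.iter().enumerate() {
        if let Err(reason) = (step.action)(log) {
            for done in steps[..i].iter().rev() {
                (done.compensate)(log);
            }
            return Err(format!("{} failed: {}", step.name, reason));
        }
    }
    Ok(())
}
```

For a hypothetical reserve, charge, ship sequence where shipping fails, the log would record reserve, charge, then the undo for charge, then the undo for reserve. Making `compensate` a required field is the point of the sketch: a step without a defined undo cannot be constructed.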

Domain-driven design adds three more constraints that AI consistently gets wrong without explicit guidance:

  • Bounded contexts draw ownership lines. Each microservice owns one context: one set of domain concepts with one consistent vocabulary. Cross-context communication happens through well-defined events, not shared databases.
  • Ubiquitous language prevents the translation bugs I’ve seen kill projects. When the code says Order but the domain expert means Reservation, every conversation introduces subtle misunderstandings that compound into wrong implementations.
  • Hexagonal architecture (ports and adapters) means dependencies point inward. Domain logic knows nothing about HTTP, databases, or message queues. This isn’t academic purity, it’s what makes the system testable without spinning up infrastructure.
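
A small sketch of the ports-and-adapters idea, assuming a hypothetical order domain. `OrderRepository`, `place_order`, and `InMemoryOrders` are illustrative names, not part of any skill or library; the point is only that the domain function depends on a trait it owns, so a test can supply an in-memory adapter instead of a database.

```rust
use std::collections::HashMap;

// Port: the only thing the domain knows about persistence.
pub trait OrderRepository {
    fn save(&mut self, order_id: u64, total_cents: u64);
    fn total(&self, order_id: u64) -> Option<u64>;
}

// Domain logic: a business rule plus calls through the port.
// Nothing here mentions HTTP, SQL, or a message queue.
pub fn place_order(
    repo: &mut dyn OrderRepository,
    order_id: u64,
    item_prices_cents: &[u64],
) -> Result<u64, String> {
    if item_prices_cents.is_empty() {
        return Err("order has no items".to_string());
    }
    let total: u64 = item_prices_cents.iter().sum();
    repo.save(order_id, total);
    Ok(total)
}

// Adapter: an in-memory implementation, enough for tests.
#[derive(Default)]
pub struct InMemoryOrders {
    rows: HashMap<u64, u64>,
}

impl OrderRepository for InMemoryOrders {
    fn save(&mut self, order_id: u64, total_cents: u64) {
        self.rows.insert(order_id, total_cents);
    }

    fn total(&self, order_id: u64) -> Option<u64> {
        self.rows.get(&order_id).copied()
    }
}
```

A production adapter backed by a real database would implement the same trait, and `place_order` would not change.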

Fault Tolerance Is Architecture, Not Code

Fault tolerance is an architecture decision, not an implementation detail. Bolt it on after the fact and you get a system that fails catastrophically under load:

  • Circuit breakers prevent cascade failures. When a downstream service is unhealthy, stop sending it requests. I’ve seen a single slow database query bring down six upstream services because nobody implemented this.
  • Retry with jitter uses exponential backoff plus randomization. Without jitter, all clients retry at the same moment after an outage resolves, creating a thundering herd that triggers another outage.
  • Bulkhead isolation gives each dependency its own thread/connection pool. A slow payment provider shouldn’t exhaust your entire connection pool and take down order processing.
  • Graceful degradation means deciding in advance what to show users when a dependency fails. Not an error page, a degraded experience.
  • No hard startup dependencies. Services start even when dependencies are unavailable. They serve degraded responses and recover automatically when dependencies come back.
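
The circuit breaker bullet maps naturally onto a sum type. The sketch below is a simplified, single-threaded illustration with hypothetical names (`CircuitBreaker`, `State`, `allow`, `record`). It assumes a consecutive-failure threshold and a fixed cooldown, and it lets every call through while half-open, where a production breaker would usually admit a single probe.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum State {
    Closed { failures: u32 }, // normal operation, counting consecutive failures
    Open { since: Instant },  // rejecting calls until the cooldown elapses
    HalfOpen,                 // cooldown elapsed, trying the dependency again
}

pub struct CircuitBreaker {
    state: State,
    threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    pub fn new(threshold: u32, cooldown: Duration) -> Self {
        CircuitBreaker { state: State::Closed { failures: 0 }, threshold, cooldown }
    }

    pub fn state(&self) -> State {
        self.state
    }

    // Returns true if a request may be sent at time `now`.
    pub fn allow(&mut self, now: Instant) -> bool {
        let current = self.state;
        match current {
            State::Closed { .. } | State::HalfOpen => true,
            State::Open { since } => {
                if now.saturating_duration_since(since) >= self.cooldown {
                    self.state = State::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    // Records the outcome of a request that was allowed through.
    pub fn record(&mut self, success: bool, now: Instant) {
        self.state = match (self.state, success) {
            (_, true) => State::Closed { failures: 0 },
            (State::Closed { failures }, false) if failures + 1 < self.threshold => {
                State::Closed { failures: failures + 1 }
            }
            // Threshold reached, or a failure while half-open: (re)open.
            (_, false) => State::Open { since: now },
        };
    }
}
```

Passing `now` in as a parameter, instead of calling `Instant::now()` inside, keeps the state machine deterministic and testable without sleeping.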

Phase 4: Estimation (/ygs-estimate)

Management wants dates. Engineers want to build. This tension has existed since the first software project went over schedule. I wrote about estimation practices years ago, and the core lessons haven’t changed: estimates are not commitments, decomposition reduces error, and teams consistently underestimate because they scope only the coding work. The estimate skill bridges the gap between “we need a date” and “it’ll be done when it’s done” with structured complexity-based estimation:

  • T-shirt sizing at the feature level. Before diving into details, I size each major capability as XS through XL based on complexity, uncertainty, and integration surface. An XL (4-8 weeks, architectural change) signals that the feature itself needs decomposition before meaningful estimation is possible. Uncertainty multipliers compound: new technology × external dependency = 2x your initial guess.
  • Story points at the task level. Using Fibonacci sequence (1, 2, 3, 5, 8, 13, 21) with planning poker when multiple people are involved. The power of Fibonacci isn’t magical, it’s that the gaps between numbers grow, forcing you to acknowledge increasing uncertainty rather than pretending you can distinguish between “7 days” and “8 days” of work.
  • Three-point estimation for commitments:
Expected = (Best + 4×MostLikely + Worst) / 6

Present ranges, not single numbers. “3-4 weeks with a tail risk of 6 weeks if the external API integration is harder than expected” gives management real information to plan around.
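
As a worked example of the three-point formula, here is a small sketch. The function names are hypothetical. The spread calculation is the conventional PERT companion, (Worst - Best) / 6, which is not part of the formula above and is included as an assumption about how one might size the range.

```rust
// Weighted three-point estimate: (Best + 4 * MostLikely + Worst) / 6.
pub fn pert_expected(best: f64, most_likely: f64, worst: f64) -> f64 {
    (best + 4.0 * most_likely + worst) / 6.0
}

// Conventional PERT spread (an assumption, not from the formula above):
// one "standard deviation" is taken as (Worst - Best) / 6.
pub fn pert_spread(best: f64, worst: f64) -> f64 {
    (worst - best) / 6.0
}
```

For the range quoted above (best 3 weeks, most likely 4, worst 6), `pert_expected(3.0, 4.0, 6.0)` gives (3 + 16 + 6) / 6, about 4.17 weeks, and `pert_spread(3.0, 6.0)` gives 0.5 weeks. The expected value lands above the most-likely figure because the tail risk pulls it upward.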

Key lesson: capacity is never 100%. I’ve seen teams plan sprints assuming full developer availability and then wonder why they deliver 60%. The reality:

Category | Typical Budget
Feature work | 50-60%
KTLO (maintenance, tech debt, bug fixes) | 20-30%
On-call / incidents | 5-15%
Vacation / holidays / sick | 10-15%
Meetings / reviews / planning | 5-10%

Some teams I’ve worked with budget 40% for KTLO. If your system is old and fragile, that’s not pessimism, that’s realism. The skill asks the user what their team’s actual allocation is, because it varies enormously.

The most common estimation failure: forgetting everything that isn’t “writing code.” Engineers estimate the implementation and forget testing (20-40% of the work), deployment changes (IaC, Kubernetes manifests, feature flags), observability (metrics, dashboards, alerts, tracing), on-call runbooks and troubleshooting guides, data migration scripts, security review fixes, and documentation. My rule of thumb: if the estimate only covers writing code, double it to account for everything needed to ship to production safely.


Phase 5: Spike (/ygs-spike) — When You Don’t Know Enough

Not every feature goes straight from design to implementation. Some involve risky unknowns like a new database, an unfamiliar integration, an algorithm you’ve never tried at scale. The spike skill exists for these moments: a time-boxed experiment to answer a specific question before committing to a full design. The spike lives on a spike/ or fafo/ branch, deliberately relaxes production standards, and produces exactly one artifact: a findings doc with a clear verdict. What spikes are for:

  • Performance validation: “Can our schema handle 10K writes/sec?” Write the hot path, add a benchmark harness, measure.
  • Integration feasibility: “Does this library work with our auth stack?” Wire two systems together, make one end-to-end call work. Done.
  • Algorithm proof: “Is this fast enough for real-time?” Implement the core loop, feed it representative data, measure latency at p99.

The spike skill enforces this discipline: define hypothesis up front, scope what’s allowed, build the minimum experiment, record findings with evidence, and recommend next steps. If the spike confirms feasibility, you proceed to full design with confidence. If it refutes your hypothesis, you’ve saved weeks of wasted implementation.


Phase 6: Work Breakdown Structure (/ygs-wbs)

AI excels at small, well-defined tasks. It struggles with large, ambiguous ones. The WBS skill hierarchically decomposes deliverables into vertical slices, thin end-to-end cuts through all layers, each independently demoable and verifiable. Like a traditional Work Breakdown Structure, it divides complex projects into manageable components at three levels: deliverables (major features), work packages (independently shippable units), and tasks (atomic implementation steps).

Key lessons from years of estimation and delivery:

  • Vertical over horizontal. Each task cuts through UI, API, and database, not “build all the models, then all the APIs, then all the UI.” Horizontal slicing delays feedback. You don’t know if the feature works until the last layer is complete. Vertical slicing gives you a working thin slice from day one.
  • Dependency ordering prevents blocked work. Data model tasks before API tasks before UI tasks. Shared utilities before their consumers. I sequence tasks so each one builds on verified, tested foundations.
  • Scope signals trigger splits. When I see “and also…” or “and verify…” in a task description, that’s two tasks disguised as one. Exception: causally dependent steps (create migration + update model + update handlers for same entity) stay together.
  • Size drives ceremony. Small tasks (1-3 files, <300 lines) get standard workflow. Large tasks (8+ files, 800+ lines) get flagged immediately for splitting. I’ve learned that tasks AI implements in one session should stay under 300 lines of change; beyond that, coherence degrades.

Phase 7: Implementation (/ygs-implement)

Without guardrails, AI will modify 30 files in one session, introduce subtle coupling between components that should be independent, and produce a diff too large to review meaningfully. I’ve had sessions where the agent touched 12 files to implement a feature that should have required 4, each extra file an “improvement” that wasn’t asked for. The implement skill enforces discipline:

Scope guardrails I enforce:

  • 3+ unplanned files -> STOP. The agent reports the deviation and asks me to confirm expanded scope. This single rule has prevented more architectural drift than any other practice.
  • Checkpoint every 5 files. Build and tests must pass before proceeding. Catches regressions early when they’re cheap to fix.
  • Deviation tracking. When implementation differs from design: “Design said X, did Y because Z.” This documentation prevents the next session from reverting the deviation or making it worse.

Three testing rules I enforce regardless of who wrote the code:

  • Stubs only at 3rd-party/OS boundaries: HTTP clients, system clocks, filesystem, randomness. Everything else uses real implementations.
  • If you can’t test without mocking internal code, the design is wrong. This is a litmus test I apply relentlessly. Mocking internals means your modules are coupled. Fix the coupling, don’t paper over it with mocks.
  • Test the public contract, not implementation details. Tests that verify internal method calls break every refactor. Tests that verify external behavior survive decades.
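
To illustrate stubbing only at an OS boundary, here is a minimal sketch around the system clock. The `Clock` trait, `FixedClock`, `Session`, and `is_expired` are hypothetical names invented for this example; only the clock is replaced in tests, while the expiry logic under test is the real implementation.

```rust
use std::time::{Duration, SystemTime};

// Boundary: the system clock is the only thing that gets stubbed.
pub trait Clock {
    fn now(&self) -> SystemTime;
}

// Production implementation backed by the OS clock.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> SystemTime {
        SystemTime::now()
    }
}

// Test stub that always reports the same instant.
pub struct FixedClock(pub SystemTime);

impl Clock for FixedClock {
    fn now(&self) -> SystemTime {
        self.0
    }
}

pub struct Session {
    pub issued_at: SystemTime,
    pub ttl: Duration,
}

// Real logic under test: expired once the session's age reaches its TTL.
pub fn is_expired(session: &Session, clock: &dyn Clock) -> bool {
    match clock.now().duration_since(session.issued_at) {
        Ok(age) => age >= session.ttl,
        // The clock reads earlier than the issue time: treat as not expired.
        Err(_) => false,
    }
}
```

A test can pin the clock at 59 and 60 seconds after issue and assert on the public result of `is_expired`, without sleeping and without mocking any internal module.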

Four tidying rules that prevent AI from refactoring itself into bugs:

  • Tidy first but only when it makes the next change cheaper. I’ve watched AI eagerly refactor things that don’t need refactoring, burning context and introducing bugs. The rule: cost(tidy) + cost(change after tidy) < cost(change without tidy). Otherwise, leave it.
  • Guard clauses over nested conditionals. Early returns flatten code and make the happy path obvious.
  • One pile first. Before splitting scattered code into elegant modules, consolidate it in one place. Understand the full picture before decomposing. AI tends to decompose prematurely, creating abstractions before understanding what varies.
  • Tidy in separate commits from behavior changes. Never mix formatting with functionality. It makes review impossible and rollback dangerous.

Phase 8: Code Review (/ygs-code-review)

AI-generated code passes syntax checks and basic tests but can contain subtle logic errors, security holes, and design violations that only emerge under careful structured review. I don’t trust casual “looks good” scanning. Instead, I use a two-pass approach with explicit criteria.

Pass 1: Critical issues (block merge):

  • Logic errors. Off-by-one bugs, null handling, race conditions (TOCTOU, check-then-act, find-or-create without locks).
  • Security holes. Injection (SQL, XSS, SSRF, path traversal), hardcoded secrets, missing auth checks.
  • Data loss. Destructive operations without confirmation, missing transactions around multi-step mutations.
  • Error swallowing. Empty catch blocks, ignored return values, Result types discarded with let _ = or .ok(), or “handled” with a reflexive .unwrap() that turns a recoverable error into a panic.
  • Partial failure. What if the operation half-succeeds? I’ve seen update endpoints that modify 3 records in sequence: if #2 fails, #1 is already committed and the system is left in an inconsistent state.
  • Enum completeness. New enum values must be traced through ALL consumers. One unhandled match arm in a downstream service can cause silent data loss.
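
Within a single Rust codebase, the enum completeness check can be delegated to the compiler by avoiding wildcard arms. A minimal sketch, using a hypothetical `OrderStatus` enum and `is_billable` consumer:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OrderStatus {
    Pending,
    Confirmed,
    Cancelled,
}

// No `_ =>` arm on purpose: adding a new variant to OrderStatus makes this
// match non-exhaustive, so the build fails until this consumer is updated.
pub fn is_billable(status: OrderStatus) -> bool {
    match status {
        OrderStatus::Pending => false,
        OrderStatus::Confirmed => true,
        OrderStatus::Cancelled => false,
    }
}
```

This only covers consumers compiled together with the enum. A downstream service that deserializes the value from a wire format still needs the manual trace described above, because the compiler never sees both sides.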

Pass 2: Design and maintainability:

  • Immutability and state. Is mutable state minimized? Are invalid states representable? Should this use an explicit state machine instead of boolean flags?
  • Type safety. Sum types for variants? Newtypes for semantically different IDs (UserId vs OrderId)? Parse-don’t-validate at boundaries?
  • Command-Query Separation. Methods either change state OR return data, never both. Violations make code unpredictable and untestable.
  • Interface design. Deep modules with small interfaces? Or shallow pass-throughs adding indirection without value?
  • Performance. N+1 queries hiding inside loops, missing database indexes for common query patterns, O(n^2) operations on collections that grow.
  • Proportionality. Is the complexity justified by data? I’ve reviewed PRs that introduced three new abstractions for a feature used by 12 people. Proportionality means the solution matches the problem’s actual scale.
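
The type-safety questions above can be made concrete with a short sketch. `UserId`, `OrderId`, `ParseError`, and `load_profile` are hypothetical names, and the rule that an ID must be a positive integer is an assumption made for the example.

```rust
use std::str::FromStr;

// Newtypes: both wrap a u64, but the compiler treats them as unrelated types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct UserId(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct OrderId(u64);

#[derive(Debug, PartialEq)]
pub enum ParseError {
    Empty,
    NotANumber,
    Zero,
}

// Parse, don't validate: raw input becomes a UserId exactly once, at the boundary.
impl FromStr for UserId {
    type Err = ParseError;

    fn from_str(raw: &str) -> Result<Self, Self::Err> {
        let trimmed = raw.trim();
        if trimmed.is_empty() {
            return Err(ParseError::Empty);
        }
        let value: u64 = trimmed.parse().map_err(|_| ParseError::NotANumber)?;
        if value == 0 {
            return Err(ParseError::Zero);
        }
        Ok(UserId(value))
    }
}

// Internal code takes the parsed type and needs no further checks.
// Passing an OrderId here would be a compile error.
pub fn load_profile(user: UserId) -> String {
    format!("profile:{}", user.0)
}
```

At the boundary, `"42".parse::<UserId>()` yields a `UserId`, while `"abc"` and `""` yield distinct error variants that the caller has to handle before any domain code runs.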

Severity classification:

  • MUST — Blocks merge (correctness, security, data loss)
  • SHOULD — Strong recommendation (design, performance, testability)
  • MAY — Suggestion (naming, style, minor optimization)

You don’t get the same understanding from reviewing as from writing; that tension is real. But structured multi-pass review with explicit criteria gets you closer than rubber-stamping ever could.


Phase 9: Security Review (/ygs-security-review)

AI doesn’t think adversarially. It generates happy-path code that works when used as intended. Attackers don’t use things as intended. I’ve seen AI-generated endpoints that validated input on the frontend but accepted anything on the backend, that logged full request bodies including passwords, that built SQL queries with string interpolation “because the ORM was too slow.” The security review skill forces red-team thinking for every changed endpoint.

Lessons from my previous post on building secure microservices:

  • Injection vectors. I check for SQL injection (raw queries with interpolation), command injection (exec/system with user input), template injection (SSTI), XSS (unescaped user content in responses), SSRF (user-controlled URLs in server requests), and path traversal (user input in file paths).
  • Authentication & authorization. Missing auth checks on new endpoints (AI doesn’t always copy the middleware pattern). Broken access control where user A can access user B’s resources by changing an ID in the URL. Privilege escalation through parameter manipulation.
  • Data exposure. Sensitive data in logs (I’ve caught AI logging full request bodies including auth tokens). Secrets in error messages returned to clients. Debug information in production responses.
  • Supply chain. Vulnerable or unpinned dependencies. Deserialization of untrusted data (pickle, YAML.load, eval). AI loves pulling in libraries without checking their security posture.

Red-team perspective: I ask these questions for every endpoint:

  • What happens if someone sends 10,000 requests per second? (Rate limiting)
  • What if they bypass the frontend entirely and craft raw API calls? (Server-side validation)
  • What’s the blast radius if this component is fully compromised? (Lateral movement, data access)
  • What happens on double-submit within 100ms? (Idempotency)
  • Is there defense in depth, or does one failed check expose everything? (Layered security)
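
The double-submit question is usually answered with an idempotency key. The sketch below is an in-memory, single-threaded illustration with hypothetical names (`PaymentService`, `charge`). It assumes the client sends the same key on a retry. A real service would need the key check and the charge to be atomic, for example through a unique constraint in the database.

```rust
use std::collections::HashMap;

#[derive(Default)]
pub struct PaymentService {
    // Idempotency key -> charge id already returned for that key.
    completed: HashMap<String, u64>,
    // Amounts actually charged, in cents.
    pub charges: Vec<u64>,
}

impl PaymentService {
    // Charges at most once per idempotency key. A repeated key replays
    // the original charge id instead of charging again.
    pub fn charge(&mut self, idempotency_key: &str, amount_cents: u64) -> u64 {
        if let Some(&charge_id) = self.completed.get(idempotency_key) {
            return charge_id;
        }
        self.charges.push(amount_cents);
        let charge_id = self.charges.len() as u64;
        self.completed.insert(idempotency_key.to_string(), charge_id);
        charge_id
    }
}
```

Two calls with the same key return the same charge id and leave exactly one entry in `charges`; a different key produces a second charge.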

The CIA triad applied to every data flow:

  • Confidentiality: Encryption at rest and in transit, access controls at every hop, zero-trust between services
  • Integrity: Cryptographic verification of artifacts, input validation at trust boundaries, tamper detection
  • Availability: Redundancy, failover, rate limiting to prevent DoS, graceful degradation under attack

For systems with significant attack surface, I produce a formal STRIDE threat model, systematically enumerating threats per subsystem, classifying assets by sensitivity, identifying trust boundaries, and tracking mitigations to completion. The structured template ensures nothing falls through the cracks: every threat gets an owner, a mitigation plan, and a security test that verifies the fix.


Phase 10: SRE Review (/ygs-sre-review)

Code that works in development fails in production. AI has no intuition for this because it’s never been paged at 3am. It doesn’t know that a missing index causes 30-second queries under load, or that an unbounded list endpoint will OOM the service when it hits 10 million records. The SRE review skill forces failure-mode analysis from my production readiness experience:

For every changed component, I analyze:

  1. What happens when it fails? Crash, hang, corrupt data, or silent degradation? Each demands a different mitigation.
  2. Blast radius. Does failure cascade? A single unhealthy pod shouldn’t take down the cluster. Circuit breakers and bulkheads contain damage.
  3. Recovery path. Auto-recovers (best), requires restart (acceptable), requires manual intervention (document it), requires data repair (unacceptable without backups).
  4. Partial failure. What if step 3 of 5 succeeds but step 4 fails? Is the system in a consistent state? Are there compensating actions?

Observability because you can’t fix what you can’t see:

  • Metrics: Latency percentiles (p50, p95, p99), error rates, throughput, saturation (CPU, memory, connections, disk).
  • Logging: Structured with correlation IDs. Proper levels. No PII. Enough context to diagnose without reproducing.
  • Tracing: Distributed tracing end-to-end. When a request touches 6 services, I need to see the full path without grepping logs across clusters.
  • Alerting: Threshold-based AND anomaly detection. Every alert links to a runbook. If an alert fires and the responder doesn’t know what to do, the alert is useless.

Deployment safety:

  • Canary releases: Deploy to 1% of traffic, monitor for 15 minutes, auto-rollback on metric breach. This catches issues that tests miss.
  • Backward-compatible schema changes: Two-phase releases (add column -> deploy code that writes both -> migrate data -> deploy code that reads new -> remove old column). Never lock a production table.
  • Feature flags: For anything risky, ship dark and enable gradually. This decouples deployment from release.
  • Immutable infrastructure: No in-place patches. Every deployment is a fresh container from a verified image.

Testing pyramid from Google SRE practices:

Layer | Proportion | What It Catches
Unit tests | 80% | Logic errors, edge cases, regressions; fast, isolated, deterministic
Integration tests | 15% | Component interactions, contract violations, real DB behavior
End-to-end tests | 5% | Critical user journeys, cross-service flows; expensive, flaky, essential
Chaos testing | Periodic | Failure recovery, cascade prevention, degradation behavior
Property-based | Where applicable | Invariant violations across random inputs, edge cases you didn’t imagine

In my post about caching, I shared caching-related production failures I’ve encountered repeatedly:

  • Thundering herd after cache expiry. All clients hit the backend simultaneously. Stagger TTLs and use cache stampede prevention.
  • Stale data during update failures. Serving old data is sometimes acceptable, sometimes catastrophic. Know which case you’re in.
  • Cache unavailability causing cascading failures. Test performance without cache during peak load. If your system can’t function without cache, cache is a hard dependency, not an optimization.
  • Security: cache keys MUST respect authorization boundaries. I’ve seen cached responses served to unauthorized users because the cache key didn’t include tenant ID.
  • Bimodal behavior: when the system behaves fundamentally differently with vs. without cache, you have two systems to understand and debug. Minimize this.
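
The authorization rule for cache keys can be enforced structurally instead of by convention. In this sketch (hypothetical `CacheKey` and `ResponseCache` types, in-memory only, no TTL or eviction), the key type has no constructor that omits the tenant, so an unscoped lookup cannot be written.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    tenant_id: String,
    resource: String,
}

impl CacheKey {
    // The only constructor requires a tenant, so an unscoped key cannot be built.
    pub fn new(tenant_id: &str, resource: &str) -> Self {
        CacheKey {
            tenant_id: tenant_id.to_string(),
            resource: resource.to_string(),
        }
    }
}

#[derive(Default)]
pub struct ResponseCache {
    entries: HashMap<CacheKey, String>,
}

impl ResponseCache {
    pub fn put(&mut self, key: CacheKey, body: &str) {
        self.entries.insert(key, body.to_string());
    }

    pub fn get(&self, key: &CacheKey) -> Option<&str> {
        self.entries.get(key).map(|body| body.as_str())
    }
}
```

A response cached for one tenant under a given resource path is a miss for any other tenant asking for the same path, because the tenant is part of the key's identity.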

Phase 11: QA and UAT (/ygs-qa, /ygs-uat)

I separate QA from UAT because they catch different failure modes. Code can be functionally correct and still unusable. An API can return the right data and still violate the user’s mental model of how the workflow should behave.

QA (/ygs-qa) tests the system objectively:

  • Functional correctness: Does core logic produce right results for valid inputs?
  • Edge cases: Boundary values, empty inputs, maximum limits, null handling, Unicode, special characters
  • Error paths: Invalid input, network failures, timeouts, partial failures — does the system degrade gracefully or crash?
  • Regressions: Do existing features still work after the change? This is where AI causes the most subtle damage: fixing one thing while breaking something adjacent.
  • Performance: Response times acceptable? No degradation under load? No memory leaks in long-running processes?

I score each category 0-10 and produce an overall health rating (0-50). This gives me a quantitative signal for ship readiness rather than a vague “looks good.”

UAT (/ygs-uat) tests from the customer’s perspective:

  • Walk through actual user stories end-to-end. Not individual API calls, complete workflows as a user would experience them.
  • Error messages must be helpful, not technical. “Connection refused to localhost:5432” is a developer error message. “We’re having trouble loading your data, please try again” is a user error message.
  • Check the golden path AND the “what if the user does something weird” paths. What if they double-click? What if they navigate back mid-flow? What if they have 10,000 items instead of 10?

Both must pass before shipping. I’ve shipped code that was technically correct but confused every user who touched it.


Phase 12: Ship and Learn (/ygs-ship, /ygs-retro)

Sync (/ygs-sync) addresses a problem I’ve seen kill design docs across every team I’ve worked with: docs drift from reality within weeks. The OpenSPDD project formalizes this as bidirectional synchronization. When code changes during review or refactoring, the design documents must update to reflect actual implementation, not just planned implementation. Stale docs are worse than no docs because they actively mislead. The sync skill compares implementation against spec, identifies drift, and proposes updates with rationale (“Design said Strategy pattern; implementation uses simple switch because only 2 variants exist”).

Ship (/ygs-ship) enforces the pre-merge ceremony I’ve seen skipped too many times:

  • All tests pass (not “most tests pass”; ALL tests pass)
  • Diff reviewed against base branch, no debug code, no .env files, no build artifacts
  • Version bumped appropriately (patch for fixes, minor for features, major for breaking changes)
  • Changelog updated so consumers know what changed
  • PR created with clear description for the record

No shortcuts. The ceremony exists because every shortcut I’ve taken in 30 years has eventually cost more than the ceremony would have.

Retro (/ygs-retro) closes the feedback loop — and this is where learning happens:

  • What went well: Practices to keep. Architectural decisions that paid off. Estimation accuracy.
  • What didn’t: Missed estimates (why specifically?). Bugs that shipped (what review would have caught them?). Scope creep (where did it come from?).
  • Patterns: Recurring issues across tasks reveal systemic problems. The same type of bug appearing three times isn’t bad luck — it’s a missing test category or a design flaw.

Five Whys with the Swiss Cheese model drives every retro:

  1. Why did the system fail? -> Direct cause
  2. Why was that possible? -> Missing guard
  3. Why wasn’t it prevented? -> Process gap
  4. Why wasn’t it detected? -> Monitoring gap
  5. Why wasn’t impact contained? -> Isolation gap

Multiple barriers had to fail simultaneously for the incident to reach customers. The fix is never “be more careful”, it’s always a structural change: a new test category, a new circuit breaker, a new alert threshold, a new deployment gate.


The Code-to-Production Pipeline

See my post on production readiness.


Beyond Vibe Coding: Specifications as the Missing Layer

Most teams use AI in what I call vibe coding mode: describe what you want in natural language, generate code, iterate. It works for small problems. It fails for complex systems. I tested this boundary directly by combining TLA+ formal specifications with Claude. The insight: AI fails not because of intelligence limits, but because we give it vague specifications. “Create a task management API” produces guesses. A TLA+ spec defining valid state transitions, invariants, and concurrent scenarios produces code that satisfies those properties precisely. You don’t need TLA+ for every feature. But the spectrum matters:

  • Vague natural language -> AI guesses, inconsistent edge case handling
  • Structured requirements (RFC 2119 + Given/When/Then) -> AI follows rules, mostly correct
  • Formal specifications (TLA+) -> AI implements verified properties, comprehensive test coverage from execution traces

Writing TLA+ properties reveals design flaws before implementation. I discovered that sequential task IDs create security vulnerabilities — a flaw that wouldn’t surface until production. The model checker found it automatically. The SDLC skills sit in the practical middle: structured enough to eliminate ambiguity, lightweight enough to use daily.

The REASONS Canvas: Structured Prompts as Design Contracts

The OpenSPDD project takes this further with a 7-dimension framework called the REASONS Canvas: Requirements, Entities, Approach, Structure, Operations, Norms, Safeguards. The distinction between a plan and a REASONS Canvas is the distinction between a suggestion and a contract. Plans describe intent; structured prompts define constraints that eliminate AI improvisation. I’ve incorporated the most valuable elements into these skills:

  • Entities as an explicit TRD questioning dimension — forcing domain model clarity before implementation
  • Norms and Safeguards — explicit negative constraints (“do NOT refactor existing structures unless requirements demand it”) that prevent AI from improvising
  • Operations sequencing — implementation order based on dependency analysis, not arbitrary file ordering
  • Bidirectional sync — the insight that design docs must stay accurate as code evolves, not just at initial creation

The key insight from SPDD’s design philosophy resonates: capability and control are separate dimensions. AI models keep getting smarter (capability improves), but that doesn’t automatically improve alignment with your specific intent (control).


Prompting Frameworks: Why Structure Beats Eloquence

Following prompting frameworks shaped how I designed every skill in this set:

  • R.E.A.S.O.N. (Role, Environment, Action, Steps, Output, Negatives): The Negatives dimension is underappreciated. Telling AI what NOT to do eliminates entire categories of unwanted behavior more reliably than telling it what to do. Every skill includes explicit constraints: “do not refactor existing code,” “do not touch files outside task scope,” “do not fix without establishing root cause.”
  • PRISM for reasoning models (Problem, Relevant Information, Success Measures): For newer reasoning models, step-by-step instructions can degrade performance. Define the problem, provide context, specify what success looks like, then let the model’s internal reasoning find the path. The refine skills work this way: instead of prescribing exact steps, they define dimensions to explore and quality criteria to meet.
  • Context hygiene: Agent quality is roughly 75% model, 25% context. Long sessions degrade as context fills and compacts. The SDLC skills address this structurally: each phase is a separate invocation, artifacts persist as files (not conversation history), and small vertical-slice tasks complete within a single focused session. Since the agent can’t remember across sessions, encode everything important into files that do.
  • Multi-Shot and Few-Shot Patterns: Providing examples of desired output format dramatically improves consistency. The skills encode this implicitly, e.g., the templates (PRD, TRD, design doc, threat model, task, ADR) serve as few-shot examples of the expected output structure. When the AI reads a template before generating, it produces output that matches the format without being told explicitly. The design doc template encodes the 9-section structure I’ve refined over years of writing design documents at scale: executive summary, background/problem statement, proposal with stakeholders, architecture with failure paths, alternatives considered, functional requirements traced to PRD, non-functional requirements (performance, security, operations, cost), rollout plan with phases, and a decision log recording ADRs inline. The threat model template follows STRIDE methodology with 13 sections: from defining security tenets and trust boundaries through systematic threat analysis grouped by subsystem, to security test plans and compliance checklists.

Model Selection: Match the Model to the Phase

Not every SDLC phase needs the same model. I’ve settled on a pattern that optimizes for both quality and cost:

Reasoning-heavy phases -> strongest model (Opus-class):

  • Requirements refinement (/ygs-refine-prd): Needs to challenge assumptions, find contradictions, explore implications
  • Technical design (/ygs-refine-trd): Needs architectural reasoning, trade-off analysis, pattern recognition across the codebase
  • Architecture refinement (/ygs-refine-architecture): System-level thinking, identifying failure modes, deep module analysis
  • Code review (/ygs-code-review): Catching subtle logic errors, race conditions, partial failure scenarios
  • Security review (/ygs-security-review): Adversarial thinking, attack path analysis, red-team perspective

Implementation phases -> fast model (Sonnet-class):

  • Implementation (/ygs-implement): Following well-defined specs, writing code within established patterns
  • Grooming (/ygs-grooming): Mechanical decomposition of well-understood requirements
  • Ship (/ygs-ship): Running tests, creating PRs, version bumping

Either works:

  • Estimation (/ygs-estimate): Benefits from reasoning for uncertainty analysis, but doesn’t require it
  • QA/UAT (/ygs-qa, /ygs-uat): Testing scenarios benefit from creativity but are often mechanical
  • Sync (/ygs-sync): Comparison is largely mechanical, but drift detection benefits from reasoning

The logic: design and review require judgment; implementation requires following instructions. A cheaper, faster model that faithfully executes a well-specified task often outperforms an expensive model given a vague one. This is why investing effort in the refinement phases (where you use the strongest model to produce precise specs) pays dividends in the implementation phase.
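The routing rule above can be written down as a plain lookup. The sketch below is illustrative only: `Phase`, `ModelTier`, and `recommended_tier` are hypothetical names of mine, not part of the skills repository (which is pure markdown).

```rust
/// SDLC phases, named after the skills listed above (the enum itself is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    RefinePrd,
    RefineTrd,
    RefineArchitecture,
    CodeReview,
    SecurityReview,
    Implement,
    Grooming,
    Ship,
    Estimate,
    Qa,
    Uat,
    Sync,
}

/// Which class of model a phase is routed to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelTier {
    Strongest, // reasoning-heavy work (Opus-class)
    Fast,      // well-specified mechanical work (Sonnet-class)
    Either,    // no strong preference
}

/// Exhaustive match: adding a new phase forces a routing decision at compile time.
pub fn recommended_tier(phase: Phase) -> ModelTier {
    match phase {
        Phase::RefinePrd
        | Phase::RefineTrd
        | Phase::RefineArchitecture
        | Phase::CodeReview
        | Phase::SecurityReview => ModelTier::Strongest,
        Phase::Implement | Phase::Grooming | Phase::Ship => ModelTier::Fast,
        Phase::Estimate | Phase::Qa | Phase::Uat | Phase::Sync => ModelTier::Either,
    }
}
```

Because the match has no wildcard arm, a new phase cannot be added without someone deciding which tier it belongs to.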

Industry Patterns for Model Routing

The practical takeaway: the quality of your specs determines how capable your implementation model needs to be. A well-specified task with clear acceptance criteria, explicit constraints, and defined negative boundaries (what NOT to do) can be implemented correctly by a fast model. A vague task requires a reasoning model to fill gaps, and it will fill them with assumptions from training data, not your domain knowledge.


Lessons from Agentic AI Design Patterns

I’ve catalogued 50 design patterns for generative and agentic AI across six categories — from content control and RAG to multi-agent orchestration. Several patterns directly inform how I structured these skills:

  • Reflection pattern: Agents that evaluate and revise their own output produce better results than single-shot generation. The SDLC skills implement this as separate review phases: generate (implement) -> evaluate (code review) -> revise (fix findings). The review skills ARE the reflection pattern, externalized into a structured workflow.
  • Prompt chaining over autonomy: Decomposing complex tasks into sequential, well-defined steps consistently outperforms giving an agent unbounded autonomy. The WBS skill does exactly this: hierarchically decomposes large features into small, sequential tasks with clear acceptance criteria. Each task is one link in the chain.
  • Tool calling with clear contracts: Agents that invoke well-defined tools with explicit input/output contracts produce more reliable results than agents reasoning in open-ended conversation. The skills serve as “tools” for the AI coding agent — each one a well-defined workflow with clear inputs (what phase we’re in, what artifacts exist) and outputs (specific deliverables with completion status).
  • Human-in-the-loop at decision points: The most reliable pattern across all my agent systems is autonomous execution for mechanical work with human checkpoints for judgment calls. The implementation skill embodies this: AI codes autonomously but STOPS at 3+ unplanned files, checkpoints every 5 files, and reports all deviations. You make the judgment calls; AI does the typing.
  • Memory tiers for context management: Production agents need structured memory: short-term (current session), medium-term (project conventions), and long-term (organizational knowledge). These skills serve as the medium and long-term memory tiers — encoding patterns and standards that survive across sessions.

The operational lesson from building all these systems: production AI requires the same engineering discipline as any distributed system. Circuit breakers for external API calls. Cost tracking with hard limits. Observability with correlation IDs. Graceful degradation when dependencies fail. These aren’t optional — they’re what separates demos from systems that run in production without 3am pages. The same discipline applied to AI coding workflows is what these skills encode.


Why This Matters Now

Martin Fowler recently asked the fundamental question: can AI evade the tar pit, or will it struggle in the accumulated complexity that slows every software project? The answer: AI doesn’t escape the tar pit. It digs faster. Autonomous AI agents mostly mean ‘I don’t know what it’s going to do.’ Most AI coding benefits from structured workflows, not autonomous agents making unbounded decisions. Jessica Kerr’s insight about double feedback loops matches how I use these skills: one loop builds features; another improves the development process. The skills aren’t static: each post-mortem adds a check to security review, and each escaped bug extends the code review criteria. The AI benefits from that evolution without needing to “learn” it.


The Paradox: Writing vs. Reviewing

When you review AI-generated code, you don’t build the same understanding as when you write it. Here’s the middle path that works for me:

  1. Own the design. Write the architecture docs yourself. Define the interfaces. Specify the state machines. Draw the data flow diagrams. This is where deep thinking happens — at the design level, not the implementation level.
  2. Delegate the implementation. Let AI fill in the mechanical details within your design constraints. The type system and test suite verify it got the details right.
  3. Review with structure. Multi-pass review with explicit criteria catches what casual reading misses. Two passes (critical then design) force different modes of attention.
  4. Learn through refinement. The structured questioning in refinement sessions forces you to think deeply about the problem space. You can’t answer “what happens when this fails halfway through?” without building real understanding.

The skills encode this approach: you think deeply during refinement, design, and review. AI accelerates the mechanical middle. The result maintains conceptual integrity because the design philosophy flows from structured artifacts that persist across sessions, not from the agent’s ephemeral training data biases. As Brooks said: conceptual integrity matters more than any individual feature. These skills are how I maintain it while leveraging AI for the implementation work that used to consume 80% of my time.


Getting Started

# Install
git clone https://github.com/bhatti/you-got-skills.git ~/.claude/skills/you-got-skills

# Start with an idea
/ygs-refine-prd

# Work through the lifecycle
/ygs-refine-trd -> /ygs-estimate -> /ygs-spike (if risky) -> /ygs-wbs -> /ygs-implement -> /ygs-code-review -> /ygs-ship

The skills are pure markdown, no compilation, no dependencies, no telemetry. Read any skill in 30 seconds. Understand the full set in 10 minutes. Extend by adding a SKILL.md file in a new directory. Each skill stands alone. Use any subset in any order. Skip what doesn’t apply. The power isn’t in following a rigid process, it’s in having structured knowledge available when you need it, so the AI works with your standards instead of against them. The repository: github.com/bhatti/you-got-skills


Conclusion

The quest to make coding simpler is as old as coding itself. BASIC to 4GLs to UML to AI agents — every generation promises the same thing: focus on what, not how. Every generation delivers the same lesson: the thinking is the hard part, and you can’t automate it away. What’s different about AI coding agents is that they genuinely accelerate the how in ways previous tools never achieved. But acceleration without direction is faster wandering. Acceleration without conceptual integrity fragments your system’s design philosophy at speed.

These skills answer the question I kept returning to: how do you maintain conceptual integrity when the agent starts from zero every session? You encode your standards, conventions, and design philosophy into structured artifacts that survive across sessions. You review everything through principles that have survived three decades of paradigm shifts. You own the what and the why. You let AI accelerate the how.


The skills discussed in this post are available at github.com/bhatti/you-got-skills. Built for Claude Code but the principles apply to any AI-assisted development workflow.

Related Blog posts:

Topic | Key Insight
Functional Pipeline | Type system beats testing for correctness. Immutable data flows eliminate aliasing bugs. State machines make illegal transitions impossible.
API Design | 50 anti-patterns I now check automatically like Idempotency, Command-Query Separation, etc.
Production Readiness and Incidents | Failures are multi-cause; fixes must be structural
Domain Driven and Hexagonal Design | Bounded context, ubiquitous language, separation of concerns.
Production AI Agents such as enterprise AI platforms with vLLM, multi-agent architectures with MCP and A2A, API compatibility checking, PII detection, and personal productivity. | The protocol is 10% of the work

May 13, 2026

From Big Ball of Mud to Functional Pipeline: Building an Observability Platform in Rust

Filed under: Computing,Technology — admin @ 2:19 pm

I. The Big Ball of Mud

In your career, you often have to deal with a legacy codebase that nobody wants to touch but everyone depends on. I had to deal with a similar real-time observability system that ingested logs, metrics, and traces and routed them to storage, alerting, and analytics systems. It started as a small Node.js project but then grew into a Big Ball of Mud over the years: a system with no discernible structure, where everything depends on everything else, and changes in one area trigger cascading failures across the codebase. The symptoms were textbook:

  • God classes: A single PipelineManager had grown to thousands of lines, handling config loading, event parsing, routing, batching, error recovery, and metrics reporting.
  • Singletons everywhere: dozens of module-level mutable instances accessed via getInstance(). Testing required elaborate startup sequences and teardown.
  • Type erasure: thousands of any in the TypeScript codebase. Refactoring was impossible because the compiler couldn’t help.
  • Silent failures: hundreds of catch {} blocks that swallowed errors. Production incidents took hours to diagnose because the system happily continued with corrupted state.
  • Deep inheritance: A 6-level class hierarchy for “processors” where each level overrode different methods in incompatible ways.

This hurt the business through slower feature velocity, longer onboarding for new engineers, and a high change failure rate (see DORA metrics). But here is the thing: not everything was broken. Buried under layers of mutation, global state, and type erasure, there were sound architectural ideas. The original designers made some good calls.

This post describes, with a POC implementation, how functional programming patterns, domain-driven design, and hexagonal architecture (see https://shahbhat.medium.com/applying-domain-driven-design-and-clean-onion-hexagonal-architecture-to-microservic-284d54b3a874) can be used to eliminate entire categories of bugs and restore the ability to move fast.


II. Patterns Worth Preserving

The legacy system had three core architectural patterns that deserved preservation but can be implemented better in Rust.

Pipes and Filters

The legacy system used the Pipes and Filters pattern to flow events through a chain of independent processing stages. Each stage does one thing (parse, filter, enrich, mask, or route) and passes the result to the next stage. The problems were mutable events shared across stages, untyped filter functions, and no backpressure between stages. The chain was there, but the links were rusty.

The new POC implementation keeps Pipes and Filters as the backbone. Each stage is immutable, strongly typed, and composable. A stage receives an owned event and returns a new event (or drops it, or splits it into many). No stage can observe or interfere with another stage’s work.

// Legacy: mutable, untyped, no backpressure
// function processStage(event: any): any { event.stage = "done"; return event; }

// New: immutable, typed, composable
pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

Decorator/Enrich: Adding Context to Events

The legacy system enriched events with metadata such as timestamps, source identifiers, routing tags, and geo-IP data. This is the Decorator pattern applied to streaming data, and it is essential. Raw events from producers are incomplete; the pipeline adds context. The problem was mutation. The legacy enrichment stages modified events in place, so downstream stages could not trust what they received. The new POC system keeps enrichment but uses immutable event copies. Each enrichment stage returns a new event with the added data. The original is untouched.

// Enrichment returns a new event — the original is unchanged
pub fn enrich_with_timestamp(event: Event) -> Event {
    event.set_field("_enriched_at", FieldValue::Int(now_millis()))
}

Source/Sink: The Endpoints

Every pipeline has endpoints: where data comes in (sources) and where it goes out (sinks). The legacy system had these abstractions, though they were concrete classes rather than interfaces. The new POC system makes sources and sinks trait-based and pluggable. You can swap a Kafka source for an HTTP source without touching the pipeline logic. You can add a new sink type without modifying existing code.

pub trait EventSource: Send + Sync {
    async fn start(&mut self) -> Result<(), SourceError>;
    fn stream(&mut self) -> Pin<Box<dyn Stream<Item = Event> + Send + '_>>;
}

pub trait EventSink: Send + Sync {
    async fn write(&self, events: Vec<Event>) -> Result<(), SinkError>;
    async fn flush(&self) -> Result<(), SinkError>;
}

These three patterns (Pipes and Filters, Decorator/Enrich, Source/Sink) are natural fits for functional style because they already think in terms of data transformation rather than stateful objects. Pipes and Filters is literally function composition: f ∘ g ∘ h. Decorator/Enrich is fmap over an event: applying a function to the value inside a context without touching the structure. Source/Sink maps to the producer/consumer model at the heart of stream combinators.
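To make the composition claim concrete, here is a minimal sketch using only the standard library. It assumes a simplified `Event` (a map of string fields) and plain functions as stages; `compose`, `parse`, `tag_source`, and `mask_user` are hypothetical names for illustration, not the POC's API.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the POC's Event: a bag of string fields.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Event {
    fields: BTreeMap<String, String>,
}

impl Event {
    pub fn new() -> Self {
        Self::default()
    }

    /// Consumes the event and returns it with one more field set.
    pub fn with_field(mut self, name: &str, value: &str) -> Self {
        self.fields.insert(name.to_string(), value.to_string());
        self
    }

    pub fn get(&self, name: &str) -> Option<&str> {
        self.fields.get(name).map(String::as_str)
    }
}

/// compose(f, g) builds a new stage that runs f first, then g.
pub fn compose<F, G>(f: F, g: G) -> impl Fn(Event) -> Event
where
    F: Fn(Event) -> Event,
    G: Fn(Event) -> Event,
{
    move |event| g(f(event))
}

// Three tiny stages; each takes ownership of the event and returns a new value.
pub fn parse(event: Event) -> Event {
    event.with_field("parsed", "true")
}

pub fn tag_source(event: Event) -> Event {
    event.with_field("source", "syslog")
}

pub fn mask_user(event: Event) -> Event {
    if event.get("user").is_some() {
        event.with_field("user", "***")
    } else {
        event
    }
}
```

A chain is then just nested composition, for example `compose(compose(parse, tag_source), mask_user)`, and the result is itself a stage that can be composed further.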


III. The Architecture: DDD + Hexagonal in Rust

I previously wrote about DDD and Hexagonal architecture in https://shahbhat.medium.com/applying-domain-driven-design-and-clean-onion-hexagonal-architecture-to-microservic-284d54b3a874. I organized the POC as a Rust workspace with four crates, each representing a layer of the hexagonal architecture. Hexagonal architecture (also called ports and adapters) means: business logic sits in the center and knows nothing about the outside world. It defines “ports” as trait interfaces that the outside world must implement. The infrastructure layer provides “adapters” that fulfill those ports. The result is that you can test your domain logic without a database, without a network, without any I/O at all.

Dependencies point inward only: Interfaces depend on Application, Application depends on Domain, Infrastructure depends on Domain. The domain never imports anything from the outer layers. Here is how the Pipes and Filters pattern looks as an event flow through the system:

Each box in the filter chain is an independent PipelineFn. Each arrow carries an immutable Event. The chain is configured at runtime via the pipeline definition, but each stage is statically typed and independently testable.

The critical insight: Rust’s crate system makes architectural boundaries a compile-time guarantee. The domain crate literally cannot import infrastructure code. There is no way to “just quickly” add a database call to a domain service. This is the difference between architecture as aspiration and architecture as enforcement. The domain crate’s dependencies tell the whole story:

[dependencies]
ulid = { version = "1", features = ["serde"] }
serde = { version = "1", features = ["derive"] }
thiserror = "2"
async-trait = "0.1"
futures-core = "0.3"

No I/O. No database drivers. No HTTP clients. No channels. Just data structures, pure functions, and trait definitions (ports) that the infrastructure layer must implement.


IV. Group 1 Foundations: Types, Errors, and Dependencies

These six patterns form the bedrock.

Antipattern 1: Singletons to Dependency Injection

Before: The legacy system used module-level singletons for everything like database connections, config, registries:

// Module-level mutable state, accessed globally
let pipelineManager: PipelineManager;

export function getInstance(): PipelineManager {
  if (!pipelineManager) {
    pipelineManager = new PipelineManager(/* hardcoded deps */);
  }
  return pipelineManager;
}

// Somewhere far away in the codebase:
getInstance().processBatch(events); // untestable, hidden dependency

Testing was a nightmare. You could not create a PipelineManager with a mock database because it internally called DatabaseSingleton.getInstance().

After: Every dependency is passed explicitly through constructors. The composition root (main.rs) is the only place that knows how to wire things together:

// Composition root: wiring happens once, at startup
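// Assumes conn is a cheaply cloneable handle (e.g. a connection pool);
// passing the same owned conn to both constructors would be a use-after-move.
let pipeline_repo = Arc::new(SqlitePipelineRepository::new(conn.clone()));
let route_repo = Arc::new(SqliteRouteRepository::new(conn));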
let event_bus = Arc::new(ChannelEventBus::new(256));

// Services receive their dependencies — they don't hunt for them
let handler = CreatePipelineHandler::new(
    pipeline_repo.clone(),
    event_bus.clone(),
);

This is the Reader monad made explicit: each handler is a function Config -> A, where the configuration (its dependencies) is threaded through construction rather than pulled from a global. No DI framework needed and the type system enforces what each component depends on.
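The payoff shows up in tests. The sketch below is a simplified, synchronous version under my own assumptions: the `PipelineRepository` trait, its `save`/`find` signatures, and `InMemoryPipelineRepository` are hypothetical stand-ins, not the POC's actual ports.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical port: what the handler needs from persistence.
pub trait PipelineRepository: Send + Sync {
    fn save(&self, id: &str, description: &str) -> Result<(), String>;
    fn find(&self, id: &str) -> Option<String>;
}

/// Test adapter: no database, just a map behind a mutex.
#[derive(Default)]
pub struct InMemoryPipelineRepository {
    rows: Mutex<HashMap<String, String>>,
}

impl PipelineRepository for InMemoryPipelineRepository {
    fn save(&self, id: &str, description: &str) -> Result<(), String> {
        let mut rows = self.rows.lock().map_err(|e| e.to_string())?;
        rows.insert(id.to_string(), description.to_string());
        Ok(())
    }

    fn find(&self, id: &str) -> Option<String> {
        let rows = self.rows.lock().ok()?;
        rows.get(id).cloned()
    }
}

/// The handler receives its dependency; it never looks one up globally.
pub struct CreatePipelineHandler {
    repo: Arc<dyn PipelineRepository>,
}

impl CreatePipelineHandler {
    pub fn new(repo: Arc<dyn PipelineRepository>) -> Self {
        Self { repo }
    }

    pub fn handle(&self, id: &str, description: &str) -> Result<(), String> {
        if description.is_empty() {
            return Err("description cannot be empty".to_string());
        }
        self.repo.save(id, description)
    }
}
```

Swapping the SQLite adapter for the in-memory one is a one-line change at the composition root; the handler code does not change.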

Antipattern 2: Module-Level Mutable State to Immutable Values

Before: Events were passed by reference and mutated in place across pipeline stages:

function processEvent(event: any): void {
  event.timestamp = Date.now();        // mutate in place
  event.fields.processed = true;       // caller's copy is changed
  event.metadata.stage = "enriched";   // invisible side effect
}

This is where the Decorator/Enrich pattern went wrong in the legacy system. The enrichment was correct in intent but destructive in implementation.

After: Events are immutable value objects. Every transformation returns a new event:

// Event is immutable — set_field returns a NEW event
pub fn set_field(&self, name: impl Into<FieldName>, value: FieldValue) -> Self {
    let mut new_event = self.clone();
    new_event.fields.insert(name.into(), value);
    new_event
}

// Pipeline functions take ownership and return new values
pub trait PipelineFn: Send + Sync {
    fn process(&self, event: Event) -> FnResult;
}

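An immutable Event makes pure transformations referentially transparent: a call such as event.set_field(name, value) can be replaced by its result value anywhere in the program with no change in behavior. (enrich_with_timestamp reads the clock through now_millis(), so it only qualifies once the timestamp is passed in as an argument.) No aliasing bugs. The type system guarantees that if you have a shared reference to an event, nobody else is changing it.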
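A small sketch of what that buys you, using a simplified `Event` of my own. Here `FieldName` is replaced by `String`, and `enrich_with_timestamp_at` is a hypothetical variant that takes the clock reading as an argument so the function stays pure.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Int(i64),
    Text(String),
}

/// Simplified immutable event: every "write" returns a new value.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Event {
    fields: BTreeMap<String, FieldValue>,
}

impl Event {
    /// Returns a NEW event with the field set; self is left untouched.
    pub fn set_field(&self, name: impl Into<String>, value: FieldValue) -> Self {
        let mut new_event = self.clone();
        new_event.fields.insert(name.into(), value);
        new_event
    }

    pub fn get(&self, name: &str) -> Option<&FieldValue> {
        self.fields.get(name)
    }
}

/// Pure enrichment: the timestamp is an input, not read from a global clock.
pub fn enrich_with_timestamp_at(event: &Event, now_millis: i64) -> Event {
    event.set_field("_enriched_at", FieldValue::Int(now_millis))
}
```

Calling `enrich_with_timestamp_at` twice with the same inputs yields equal events, and the original event never gains the `_enriched_at` field.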

Antipattern 5: God Class to Bounded Contexts

The thousands of lines in PipelineManager were split across four crates. Each crate has exactly one responsibility:

// domain/   — Event, Pipeline, Route, FnResult (pure data + logic)
// app/      — CreatePipelineHandler, IngestEventHandler (orchestration)
// infra/    — SqlitePipelineRepository, ChannelEventBus (I/O adapters)
// api/      — REST endpoints, CLI commands (user interface)

The compiler enforces the boundaries. You cannot accidentally couple the routing logic to the database layer.

Antipattern 7: Error Swallowing to Result Types

Before: Errors vanished into the void:

try {
  const pipeline = await loadPipeline(id);
  const result = pipeline.process(event);
  await sink.write(result);
} catch (e) {
  // "it's fine"
}

Hundreds of catch blocks like this in the legacy codebase. When something went wrong in production, the system kept running in a corrupted state.

After: Errors are values in the type signature. You cannot ignore them without the compiler warning you:

#[derive(Debug, thiserror::Error)]
pub enum DomainError {
    #[error("validation: {0}")]
    Validation(String),
    #[error("{0} not found: {1}")]
    NotFound(String, String),
    #[error("pipeline execution: {0}")]
    PipelineExecution(String),
    #[error("persistence: {0}")]
    Persistence(String),
}

// Every function that can fail declares it in its type
pub async fn handle(&self, cmd: CreatePipelineCommand) -> Result<Pipeline, DomainError> {
    pipeline.validate()?;  // ? propagates errors — impossible to forget
    self.pipeline_repo.save(&pipeline).await?;
    Ok(pipeline)
}

The ? operator is syntactic sugar for monadic bind over Result. The for-comprehension equivalent in Scala (for { x <- f1; y <- f2 } yield ...) and Rust’s ?-chaining are the same pattern: sequence dependent computations and short-circuit on the first failure, propagating the error with full context.
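The equivalence is easy to see side by side. In this sketch the two-variant `DomainError` and the `validate`/`save` helpers are simplified stand-ins I made up for illustration. One detail the bind analogy leaves out: ? also converts the error through From, which and_then does not do.

```rust
#[derive(Debug, PartialEq)]
pub enum DomainError {
    Validation(String),
    Persistence(String),
}

/// Hypothetical validation step: rejects empty descriptions.
fn validate(description: &str) -> Result<&str, DomainError> {
    if description.is_empty() {
        Err(DomainError::Validation("description cannot be empty".to_string()))
    } else {
        Ok(description)
    }
}

/// Hypothetical persistence step: pretends to store and returns the byte count.
fn save(description: &str) -> Result<usize, DomainError> {
    if description.len() > 20 {
        Err(DomainError::Persistence("description too long".to_string()))
    } else {
        Ok(description.len())
    }
}

/// Sequencing with ?: each step short-circuits on Err.
pub fn create_with_question_mark(description: &str) -> Result<usize, DomainError> {
    let valid = validate(description)?;
    let stored = save(valid)?;
    Ok(stored)
}

/// The same sequencing written as explicit monadic bind.
pub fn create_with_and_then(description: &str) -> Result<usize, DomainError> {
    validate(description).and_then(save)
}
```

Both functions return the same value for every input; the ? version simply reads top to bottom.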

Antipattern 11: Primitive Obsession to Newtypes

Before: IDs were raw strings. Mix them up and nothing stops you:

function linkPipeline(pipelineId: string, routeId: string) { ... }
// Oops: arguments swapped, compiles fine, fails at runtime
linkPipeline(routeId, pipelineId);

After: Each ID is a distinct type. The compiler catches mix-ups:

macro_rules! define_id {
    ($name:ident) => {
        #[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
        pub struct $name(String);
        impl $name {
            pub fn new() -> Self { Self(ulid::Ulid::new().to_string()) }
            pub fn as_str(&self) -> &str { &self.0 }
        }
    };
}

define_id!(PipelineId);
define_id!(RouteId);
define_id!(EventId);
// fn link(pipeline: &PipelineId, route: &RouteId) — can't swap these

This is the newtype pattern: PipelineId and RouteId are both a String at runtime, but they are different types at compile time, and the wrapper adds no runtime overhead. Zero cost, full safety.
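A stripped-down sketch of the same idea without the macro or the ulid dependency. The `new(raw)` constructors and the `link` function here are hypothetical simplifications for illustration.

```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PipelineId(String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RouteId(String);

impl PipelineId {
    pub fn new(raw: &str) -> Self {
        Self(raw.to_string())
    }
}

impl RouteId {
    pub fn new(raw: &str) -> Self {
        Self(raw.to_string())
    }
}

/// The parameter types document and enforce the argument order.
pub fn link(pipeline: &PipelineId, route: &RouteId) -> String {
    format!("{} -> {}", pipeline.0, route.0)
}

// link(&route_id, &pipeline_id) would fail to compile with a
// "mismatched types" error, because RouteId is not PipelineId.
```

The wrapper occupies exactly as much memory as the String it wraps, which can be checked with `std::mem::size_of`.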

Antipattern 18: any Types to Generics and Trait Bounds

Before: The pipeline function interface accepted and returned any:

type ProcessorFn = (event: any) => any;
// No contract. No guarantees. Runtime explosions.

After: Trait bounds make the contract explicit and compiler-checked:

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

pub trait PipelineFnFactory: Send + Sync {
    fn create(&self, config: &serde_json::Value) -> Result<Box<dyn PipelineFn>, String>;
}

The trait says: “Give me an Event, I’ll give you an FnResult (Pass, Split, or Drop).” No ambiguity. No any. The compiler enforces the contract at every call site.


V. Group 2 Data Modeling: Making Illegal States Unrepresentable

Antipattern 3: Mode/Env Branching to Sum Types

A sum type is an enum where each variant carries different data. Together with product types (structs), sum types make up algebraic data types (ADTs). Instead of one struct with optional fields where only some combinations are valid, you define each valid combination as its own variant.

Before: Configuration types were discriminated by strings, with every consumer doing defensive checking:

interface FunctionConfig {
  type: string;         // "eval" | "drop" | "mask" | ... maybe?
  field?: string;       // required for some types
  pattern?: string;     // required for mask and regex
  expression?: string;  // required for eval
  targetFields?: string[];  // only regex
}

// Every consumer:
if (config.type === "eval") {
  if (!config.field || !config.expression) throw new Error("invalid");
}

After: An enum makes illegal states unrepresentable. Each variant carries exactly its required data:

pub enum FunctionConfig {
    Eval { field: String, expression: String },
    Drop { filter: String },
    Mask { field: String, pattern: String, replacement: String },
    RegexExtract { field: String, pattern: String, target_fields: Vec<String> },
}

// Pattern matching is exhaustive — add a new variant and the compiler
// shows you every place that needs updating
fn resolve(config: &FunctionConfig) -> Result<Box<dyn PipelineFn>, DomainError> {
    match config {
        FunctionConfig::Eval { field, expression } => { /* guaranteed present */ }
        FunctionConfig::Drop { filter } => { /* ... */ }
        FunctionConfig::Mask { field, pattern, replacement } => { /* ... */ }
        FunctionConfig::RegexExtract { field, pattern, target_fields } => { /* ... */ }
    }
}

Similarly, the result of processing an event is a sum type:

pub enum FnResult {
    Pass(Event),       // event continues downstream
    Split(Vec<Event>), // one event becomes many
    Drop,              // event is discarded
}

This is the core ADT insight: product types (structs, where a value has field A and field B) model data that is always fully present; sum types (enums, where a value is variant A or variant B) model data that is exactly one of several alternatives. Illegal states become unrepresentable by construction. FnResult is a sum type that makes the three possible outcomes of a pipeline stage explicit. The legacy equivalent was return null | Event | Event[], but it was invisible to the type system and easy to miss in a catch {} block.

Antipattern 4: Type-String Dispatch to Registry Pattern

Before: Function types were resolved with an if/else chain that grew with every new type:

function createFunction(config: any): ProcessorFn {
  if (config.type === "eval") return new EvalFn(config);
  else if (config.type === "drop") return new DropFn(config);
  else if (config.type === "mask") return new MaskFn(config);
  // ... grows forever, easy to forget one
  else throw new Error(`unknown type: ${config.type}`);
}

After: A registry maps type names to factories. Adding new types does not touch existing code:

pub struct DefaultFunctionRegistry {
    factories: HashMap<String, Box<dyn PipelineFnFactory>>,
}

impl DefaultFunctionRegistry {
    pub fn new() -> Self {
        let mut registry = Self { factories: HashMap::new() };
        registry.factories.insert("eval".into(), Box::new(EvalFnFactory));
        registry.factories.insert("drop".into(), Box::new(DropFnFactory));
        registry.factories.insert("mask".into(), Box::new(MaskFnFactory));
        registry.factories.insert("regex_extract".into(), Box::new(RegexExtractFnFactory));
        registry
    }
}

The registry is an interpreter pattern where you separate the description of what to do (FunctionConfig as a DSL) from how to do it (PipelineFnFactory as the interpreter). This is the same structure as Free Monads: define your algebra as data (each FunctionConfig variant is an AST node), then write interpreters against it (production factories, test stubs, dry-run validators). The registry approach is the pragmatic version without monad transformer overhead, just a HashMap of factories. The key property is the same: you can swap the interpreter without touching the program description.
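Here is a minimal sketch of swapping the interpreter. It simplifies the traits from earlier under my own assumptions: stages work on plain strings, config is a `&str` instead of `serde_json::Value`, and `DryRunFactory`, `NoopFn`, `register`, and `resolve` are hypothetical names.

```rust
use std::collections::HashMap;

pub trait PipelineFn {
    fn name(&self) -> &str;
    /// None means the input was dropped.
    fn process(&self, input: &str) -> Option<String>;
}

pub trait PipelineFnFactory {
    fn create(&self, config: &str) -> Result<Box<dyn PipelineFn>, String>;
}

/// Production stage: replaces every occurrence of `needle` with "***".
struct MaskFn {
    needle: String,
}

impl PipelineFn for MaskFn {
    fn name(&self) -> &str {
        "mask"
    }
    fn process(&self, input: &str) -> Option<String> {
        Some(input.replace(self.needle.as_str(), "***"))
    }
}

pub struct MaskFnFactory;

impl PipelineFnFactory for MaskFnFactory {
    fn create(&self, config: &str) -> Result<Box<dyn PipelineFn>, String> {
        if config.is_empty() {
            return Err("mask needs a non-empty pattern".to_string());
        }
        Ok(Box::new(MaskFn { needle: config.to_string() }))
    }
}

/// Dry-run stage: passes input through unchanged and records its config in the name.
struct NoopFn {
    label: String,
}

impl PipelineFn for NoopFn {
    fn name(&self) -> &str {
        &self.label
    }
    fn process(&self, input: &str) -> Option<String> {
        Some(input.to_string())
    }
}

pub struct DryRunFactory;

impl PipelineFnFactory for DryRunFactory {
    fn create(&self, config: &str) -> Result<Box<dyn PipelineFn>, String> {
        Ok(Box::new(NoopFn { label: format!("dry-run({config})") }))
    }
}

#[derive(Default)]
pub struct FunctionRegistry {
    factories: HashMap<String, Box<dyn PipelineFnFactory>>,
}

impl FunctionRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn register(&mut self, type_name: &str, factory: Box<dyn PipelineFnFactory>) {
        self.factories.insert(type_name.to_string(), factory);
    }

    /// Unknown type names surface as an Err instead of a thrown exception.
    pub fn resolve(&self, type_name: &str, config: &str) -> Result<Box<dyn PipelineFn>, String> {
        self.factories
            .get(type_name)
            .ok_or_else(|| format!("unknown function type: {type_name}"))?
            .create(config)
    }
}
```

The same description ("mask" with pattern "secret") resolves to a real masking stage in one registry and to a pass-through stub in another, without either knowing about the other.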

Antipattern 8: Temporal Coupling to Typestate Builder

Typestate is a pattern that uses the type system to enforce valid state transitions at compile time. You encode the object’s lifecycle phase into its type, so calling methods in the wrong order is a compiler error rather than a runtime error.

Before: Pipelines could be created in invalid states — no functions, empty description — and the error only surfaced at runtime:

const pipeline = new Pipeline();
pipeline.save(); // Oops: no functions, no description. Runtime error.

After: The builder uses phantom types to make the invalid state impossible to compile:

pub struct PipelineBuilder<State> {
    id: PipelineId,
    description: String,
    functions: Vec<PipelineFunction>,
    _state: PhantomData<State>,
}

// Can only add functions in the NoFunctions state (transitions to HasFunctions)
impl PipelineBuilder<NoFunctions> {
    pub fn add_function(self, func: PipelineFunction) -> PipelineBuilder<HasFunctions> { ... }
}

// build() only exists on HasFunctions — you literally cannot call it without functions
impl PipelineBuilder<HasFunctions> {
    pub fn build(self) -> Pipeline { ... }
}

Rust’s ownership system is an affine type system: values may be used at most once (moved, not copied, unless the type is Copy). The typestate builder exploits this: add_function(self) takes ownership of the builder and returns a new one in the next state. You cannot hold onto the old PipelineBuilder<NoFunctions> after calling add_function, because any later use of the moved value is a compile error. This is stronger than a runtime lifecycle check: the invalid state cannot exist in memory, not just in logic.
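For completeness, here is a compilable sketch of the builder with the elided bodies filled in. It is a simplification under my own assumptions: functions are plain strings, the `id` field is omitted, and I added an `add_function` on the `HasFunctions` state so more than one function can be chained.

```rust
use std::marker::PhantomData;

// Zero-sized marker types for the builder's lifecycle states.
pub struct NoFunctions;
pub struct HasFunctions;

#[derive(Debug, PartialEq)]
pub struct Pipeline {
    pub description: String,
    pub functions: Vec<String>,
}

pub struct PipelineBuilder<State> {
    description: String,
    functions: Vec<String>,
    _state: PhantomData<State>,
}

impl PipelineBuilder<NoFunctions> {
    pub fn new(description: &str) -> Self {
        Self {
            description: description.to_string(),
            functions: Vec::new(),
            _state: PhantomData,
        }
    }

    /// Consumes the empty builder and moves to the HasFunctions state.
    pub fn add_function(self, func: &str) -> PipelineBuilder<HasFunctions> {
        let mut functions = self.functions;
        functions.push(func.to_string());
        PipelineBuilder {
            description: self.description,
            functions,
            _state: PhantomData,
        }
    }
}

impl PipelineBuilder<HasFunctions> {
    /// Further functions keep the builder in the same state.
    pub fn add_function(mut self, func: &str) -> Self {
        self.functions.push(func.to_string());
        self
    }

    /// build() exists only here, so an empty pipeline cannot be built.
    pub fn build(self) -> Pipeline {
        Pipeline {
            description: self.description,
            functions: self.functions,
        }
    }
}

// PipelineBuilder::<NoFunctions>::new("x").build() does not compile:
// there is no `build` method on PipelineBuilder<NoFunctions>.
```

Typical usage reads as one chain: `PipelineBuilder::<NoFunctions>::new("mask PII").add_function("mask").build()`.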

Antipattern 9: Global Mutable Registry to Persistent Data Structures

Before: The route table was a global mutable singleton. Updates caused race conditions and stale reads:

class RouteRegistry {
  private static instance: RouteRegistry;
  private rules: RouteRule[] = []; // mutated by multiple threads
  addRule(rule: RouteRule) { this.rules.push(rule); } // race!
}

After: Route tables are immutable values. “Updating” returns a new version:

impl RouteTable {
    pub fn add_rule(&self, rule: RouteRule) -> Self {
        let mut new_table = self.clone();
        new_table.rules.push(rule);
        new_table.version += 1;
        new_table
    }
}

In a real persistent data structure (Clojure’s HAMT, Haskell’s finger trees), ‘copying’ only involves copying the path from the modified node to the root with O(log n) nodes, not O(n). Rust’s clone() here is a simple structural copy, which is fine for small route tables. The principle is the same: multiple versions coexist safely because neither modifies the other.
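If the route table ever grew large enough for the copy to matter, structural sharing is available in plain Rust through Arc. The sketch below is not what the POC does; it swaps the Vec for a hypothetical Arc-linked list so that adding a rule is O(1) and older versions share their nodes with newer ones.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq)]
pub struct RouteRule {
    pub name: String,
}

/// Immutable singly linked list; tails are shared between versions.
#[derive(Debug)]
enum Node {
    Nil,
    Cons(RouteRule, Arc<Node>),
}

#[derive(Debug, Clone)]
pub struct RouteTable {
    head: Arc<Node>,
    pub version: u64,
}

impl RouteTable {
    pub fn new() -> Self {
        Self { head: Arc::new(Node::Nil), version: 0 }
    }

    /// O(1): allocates one node that points at the existing list.
    pub fn add_rule(&self, rule: RouteRule) -> Self {
        Self {
            head: Arc::new(Node::Cons(rule, Arc::clone(&self.head))),
            version: self.version + 1,
        }
    }

    /// Rule names, newest first.
    pub fn rule_names(&self) -> Vec<&str> {
        let mut names = Vec::new();
        let mut node = &self.head;
        while let Node::Cons(rule, next) = &**node {
            names.push(rule.name.as_str());
            node = next;
        }
        names
    }

    /// True if this version's tail is literally the older version's list.
    pub fn shares_tail_with(&self, older: &RouteTable) -> bool {
        match &*self.head {
            Node::Cons(_, next) => Arc::ptr_eq(next, &older.head),
            Node::Nil => false,
        }
    }
}
```

The trade-off is that a linked list gives up cache locality and random access, so for a handful of rules the simple clone() remains the better choice.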

Antipattern 12: Signal-Based Dispatch to Handler Map

Before: Event handling used a giant switch statement that grew with every new event type:

function handleSignal(signal: string, data: any) {
  switch (signal) {
    case "pipeline.created": notifyUI(data); break;
    case "pipeline.deleted": cleanupCache(data); break;
    // ... 40 more cases
  }
}

After: A handler map registers handlers by event type. New events are handled by registering a new handler, not by modifying existing code:

// Register handlers at composition time
let mut handlers: HashMap<String, Box<dyn EventHandler>> = HashMap::new();
handlers.insert("pipeline.created".into(), Box::new(NotifyUiHandler));
handlers.insert("pipeline.deleted".into(), Box::new(CleanupCacheHandler));

// Dispatch is a single lookup — no switch statement
if let Some(handler) = handlers.get(event.event_type()) {
    handler.handle(event).await?;
}
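A synchronous, self-contained sketch of the same dispatch follows. The `EventHandler` signature, the handler bodies, and the `dispatch` helper are hypothetical simplifications. Returning an Option makes the "no handler registered" case visible to the caller instead of silently ignoring it.

```rust
use std::collections::HashMap;

pub trait EventHandler {
    fn handle(&self, payload: &str) -> Result<String, String>;
}

struct NotifyUiHandler;

impl EventHandler for NotifyUiHandler {
    fn handle(&self, payload: &str) -> Result<String, String> {
        Ok(format!("notify-ui:{payload}"))
    }
}

struct CleanupCacheHandler;

impl EventHandler for CleanupCacheHandler {
    fn handle(&self, payload: &str) -> Result<String, String> {
        Ok(format!("cleanup-cache:{payload}"))
    }
}

/// Registration happens once, at composition time.
pub fn build_handlers() -> HashMap<String, Box<dyn EventHandler>> {
    let mut handlers: HashMap<String, Box<dyn EventHandler>> = HashMap::new();
    handlers.insert("pipeline.created".to_string(), Box::new(NotifyUiHandler));
    handlers.insert("pipeline.deleted".to_string(), Box::new(CleanupCacheHandler));
    handlers
}

/// None means no handler is registered for this event type.
pub fn dispatch(
    handlers: &HashMap<String, Box<dyn EventHandler>>,
    event_type: &str,
    payload: &str,
) -> Option<Result<String, String>> {
    handlers.get(event_type).map(|handler| handler.handle(payload))
}
```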

Antipattern 13: Anemic Domain Model to Rich Domain Objects

Before: Pipeline was a data bag with all logic living in external “service” classes:

class Pipeline {
  id: string;
  functions: FunctionConfig[];
  // That's it. No behavior. Just a struct with public fields.
}

class PipelineService {
  validate(p: Pipeline) { /* 200 lines */ }
  addFunction(p: Pipeline, f: FunctionConfig) { /* 50 lines */ }
}

After: The pipeline owns its behavior. Invariants are maintained internally:

impl Pipeline {
    pub fn add_function(&mut self, func: PipelineFunction) {
        self.functions.push(func);
        self.version += 1; // version always tracks mutations
    }

    pub fn validate(&self) -> Result<(), DomainError> {
        if self.description.is_empty() {
            return Err(DomainError::Validation("description cannot be empty".into()));
        }
        if self.functions.is_empty() {
            return Err(DomainError::Validation("must have at least one function".into()));
        }
        Ok(())
    }

    pub fn active_functions(&self) -> impl Iterator<Item = &PipelineFunction> {
        self.functions.iter().filter(|f| !f.disabled)
    }
}

VI. Group 3: Composition and Control Flow

Antipattern 6: forEach + Push to Iterator Combinators

Before: Processing was imperative loops accumulating into mutable vectors:

function processBatch(events: any[], functions: ProcessorFn[]): any[] {
  const results: any[] = [];
  for (const event of events) {
    let current = event;
    for (const fn of functions) {
      const result = fn(current);
      if (result === null) break;
      if (Array.isArray(result)) { results.push(...result); break; }
      current = result;
    }
    if (current) results.push(current);
  }
  return results;
}

After: The pipeline engine folds (reduces) over the function chain. This makes the Pipes and Filters pattern explicit, with each function as a filter stage and the vector as the pipe:

pub struct PipelineEngine;

impl PipelineEngine {
    pub fn process_event(event: Event, functions: &[&dyn PipelineFn]) -> Vec<FnResult> {
        let mut current_events = vec![event];
        let mut final_results = Vec::new();

        for func in functions {
            let mut next_batch = Vec::new();
            for evt in current_events {
                match func.process(evt) {
                    FnResult::Pass(e) => next_batch.push(e),
                    FnResult::Split(es) => next_batch.extend(es),
                    FnResult::Drop => final_results.push(FnResult::Drop),
                }
            }
            current_events = next_batch;
        }

        final_results.extend(current_events.into_iter().map(FnResult::Pass));
        final_results
    }
}

The pipeline engine’s outer loop is a fold (catamorphism) over the function list, with the accumulator being the current set of live events. In every iteration, each live event is passed forward, fanned out (Split), or dropped. This is the structural recursion pattern: the shape of the computation mirrors the shape of the data (a linear chain of functions).
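The same loop can be written as a literal Iterator::fold. This is a simplified sketch: Stage and run_pipeline are hypothetical names, stages are boxed closures rather than PipelineFn trait objects, and dropped events are discarded instead of being recorded in the result.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Event(pub String);

#[derive(Debug, PartialEq)]
pub enum FnResult {
    Pass(Event),
    Split(Vec<Event>),
    Drop,
}

// Hypothetical stand-in for Box<dyn PipelineFn>.
pub type Stage = Box<dyn Fn(Event) -> FnResult>;

// Fold over the stage list; the accumulator is the set of live events.
pub fn run_pipeline(event: Event, stages: &[Stage]) -> Vec<Event> {
    stages.iter().fold(vec![event], |live, stage| {
        live.into_iter()
            .flat_map(|e| match stage(e) {
                FnResult::Pass(e) => vec![e],
                FnResult::Split(es) => es,
                FnResult::Drop => Vec::new(),
            })
            .collect()
    })
}
```

The explicit for loop and the fold compute the same thing; the fold version just makes the accumulator visible in the signature.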

Antipattern 10: Callback Chains to Async Composition

Before: Nested callbacks (or deeply chained .then() promises) with error handling at each level:

loadConfig()
  .then(config => loadPipeline(config.pipelineId))
  .then(pipeline => pipeline.process(event))
  .then(result => sink.write(result))
  .catch(e => { /* which step failed? */ });

After: Rust’s async/await with ? gives linear, readable control flow:

async fn handle(&self, cmd: IngestEventCommand) -> Result<Vec<Event>, DomainError> {
    let route_table = self.route_repo.get_table().await?;
    let decisions = RoutingEngine::route_event(&cmd.event, &route_table)?;
    let mut all_output = Vec::new(); // events produced by every matched pipeline
    for decision in decisions {
        let pipeline = self.pipeline_repo.get(&decision.pipeline_id).await?;
        // ... each ? short-circuits on error with full context;
        // the elided body runs the pipeline and extends all_output
    }
    Ok(all_output)
}
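The ? operator only answers the "which step failed?" question if the error type records the step. One way to do that, sketched here synchronously with the standard library only, is to map each step's error into its own enum variant. IngestError and the three helper functions are hypothetical stand-ins for the real repositories.

```rust
#[derive(Debug, PartialEq)]
pub enum IngestError {
    LoadConfig(String),
    LoadPipeline(String),
    Process(String),
}

// Hypothetical steps; each can fail with a plain message.
fn load_config(raw: &str) -> Result<u32, String> {
    raw.trim().parse::<u32>().map_err(|e| e.to_string())
}

fn load_pipeline(id: u32) -> Result<Vec<&'static str>, String> {
    if id == 7 {
        Ok(vec!["mask", "drop"])
    } else {
        Err(format!("pipeline {} not found", id))
    }
}

fn process(stages: &[&str], event: &str) -> Result<String, String> {
    if event.is_empty() {
        return Err("empty event".to_string());
    }
    Ok(format!("{}|{}", event, stages.join(">")))
}

// Linear control flow: each ? returns early, and the variant names the step.
pub fn handle(raw_config: &str, event: &str) -> Result<String, IngestError> {
    let id = load_config(raw_config).map_err(IngestError::LoadConfig)?;
    let stages = load_pipeline(id).map_err(IngestError::LoadPipeline)?;
    let out = process(&stages, event).map_err(IngestError::Process)?;
    Ok(out)
}
```

The async version reads the same way; only .await is added before each ?.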

Antipattern 14: Eager Initialization to Lazy Evaluation

Before: All pipeline functions, parsers, and regex patterns were compiled at startup, even if never used:

// All compiled eagerly at module load time, even for pipelines never triggered
const ALL_PATTERNS = compileAllRegexPatterns(); // 500ms startup cost

After: Expensive initializations are deferred until first use with once_cell::sync::Lazy, and streams are demand-driven:

use once_cell::sync::Lazy;

static REGEX_CACHE: Lazy<HashMap<String, Regex>> = Lazy::new(|| {
    // Only compiled when first accessed
    HashMap::new()
});

// Sources produce events on demand — pull, not push
impl EventSource for FileSource {
    fn stream(&mut self) -> Pin<Box<dyn Stream<Item = Event> + Send + '_>> {
        // Lines are read only when the consumer calls .next()
        Box::pin(self.reader.lines().map(|line| parse_event(line)))
    }
}

Lazy::new is memoization with a single input (the unit type): the computation runs at most once and its result is cached forever. This is safe only because the initializer is pure: the same (empty) input always produces the same output. If the initializer had side effects, re-running it and caching its result would produce different behavior.
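The evaluate-once behavior can be observed with the standard library's OnceLock (available since Rust 1.70), used here as a stand-in for once_cell::sync::Lazy. The INIT_RUNS counter is a deliberate side effect added only to show that the initializer runs a single time; the pattern strings are placeholders.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the initializer body actually executes.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);
static PATTERNS: OnceLock<Vec<String>> = OnceLock::new();

// First call runs the closure; later calls return the cached value.
pub fn patterns() -> &'static Vec<String> {
    PATTERNS.get_or_init(|| {
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
        vec!["\\d+".to_string(), "[a-z]+".to_string()]
    })
}
```

Because of that counter, the initializer is no longer pure, which is exactly the situation the paragraph above warns about: caching and re-running would now be observably different.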

Antipattern 15: Mixed I/O + Logic to Effect Separation

Before: Business logic was interleaved with database calls, HTTP requests, and logging:

async function processEvent(event: any) {
  const config = await db.getConfig();      // I/O
  event.enriched = transform(event, config); // logic
  await kafka.publish(event);                // I/O
  metrics.increment("processed");            // I/O
  if (event.severity > 3) {
    await alertService.fire(event);          // I/O
  }
  return event;
}

After: Domain services are pure functions. I/O lives exclusively in the infrastructure layer:

// Domain service: PURE — no I/O, no side effects
impl PipelineEngine {
    pub fn process_batch(events: Vec<Event>, functions: &[&dyn PipelineFn]) -> BatchResult {
        // Pure computation: transform events through functions
    }
}

// Application layer: orchestrates I/O around pure domain logic
impl IngestEventHandler {
    pub async fn handle(&self, cmd: IngestEventCommand) -> Result<Vec<Event>, DomainError> {
        let route_table = self.route_repo.get_table().await?;   // I/O: read
        let decisions = RoutingEngine::route_event(&cmd.event, &route_table)?; // Pure
        // ... resolve functions (I/O), process (pure), return results
    }
}

This is Functional Core, Imperative Shell (FCIS) in practice. PipelineEngine::process_batch is the functional core: a pure function that is trivially testable with no mocks. IngestEventHandler::handle is the imperative shell that orchestrates I/O around the pure core, calling out to repositories and event buses. The idea mirrors Haskell’s IO monad: describe what to do (pure), defer execution to the edge (impure).
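One way to push "describe what to do" further is to have the core return the effects as data. The sketch below reworks the legacy processEvent example in that style; Effect, plan, and run are hypothetical names, and the shell writes to a Vec<String> log where a real one would call Kafka, a metrics client, and an alert service.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub severity: u8,
    pub body: String,
}

// A description of I/O to perform, not the I/O itself.
#[derive(Debug, PartialEq)]
pub enum Effect {
    Publish(Event),
    Alert(Event),
    IncrementMetric(&'static str),
}

// Functional core: decides what should happen and performs none of it.
pub fn plan(event: Event, prefix: &str) -> Vec<Effect> {
    let enriched = Event { body: format!("{}{}", prefix, event.body), ..event };
    let mut effects = vec![
        Effect::Publish(enriched.clone()),
        Effect::IncrementMetric("processed"),
    ];
    if enriched.severity > 3 {
        effects.push(Effect::Alert(enriched));
    }
    effects
}

// Imperative shell: interprets each effect in order.
pub fn run(effects: Vec<Effect>, log: &mut Vec<String>) {
    for effect in effects {
        match effect {
            Effect::Publish(e) => log.push(format!("publish {}", e.body)),
            Effect::Alert(e) => log.push(format!("alert {}", e.body)),
            Effect::IncrementMetric(name) => log.push(format!("metric {}", name)),
        }
    }
}
```

Testing plan needs no mocks at all: you assert on the returned list of effects.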

Antipattern 16: Monolithic Functions to Function Composition

The key insight from the pipeline engine: each transform is a small, independent function that composes with others. Instead of one 500-line processEvent() method that does everything, we have a chain of focused transforms:

// Each function is tiny and testable in isolation
struct MaskFn { field: String, regex: Regex, replacement: String }

impl PipelineFn for MaskFn {
    fn name(&self) -> &str { "mask" }
    fn process(&self, event: Event) -> FnResult {
        match event.get_field(&self.field) {
            Some(FieldValue::Str(value)) => {
                let masked = self.regex.replace_all(value, self.replacement.as_str());
                FnResult::Pass(event.set_field(&self.field, FieldValue::Str(masked.into())))
            }
            _ => FnResult::Pass(event),
        }
    }
}

This is the Pipes and Filters pattern at the code level. Each PipelineFn is a filter. The engine composes them into a pipeline. You can test each filter in isolation, reorder them, add new ones without touching existing filters.

Each PipelineFn implementation is a pure function transformer: it takes an Event and returns an FnResult. The engine is function composition at runtime — the pipeline definition is a list of function names that the registry resolves into a chain of Box<dyn PipelineFn>. Adding a new stage means writing one new impl PipelineFn block, not touching the engine.

Antipattern 17: No Rollback to Saga Pattern

Before: Multi-step operations had no compensation logic. If step 3 of 5 failed, steps 1-2 left orphaned state:

await db.savePipeline(pipeline);
await registry.register(pipeline);  // if this fails, DB has orphan
await bus.publish("created");       // if this fails, registry is stale

After: Command handlers treat publish failures as non-fatal (eventual consistency), and the pattern supports full compensation:

pub async fn handle(&self, cmd: CreatePipelineCommand) -> Result<Pipeline, DomainError> {
    self.pipeline_repo.save(&pipeline).await?;

    // Non-critical: event publication. If it fails, the pipeline still exists.
    // A background reconciler can re-publish later.
    if let Err(e) = self.event_publisher.publish(event).await {
        tracing::warn!("Failed to publish PipelineCreated event: {}", e);
    }

    Ok(pipeline)
}

This is a simplified saga pattern: non-critical steps (event publication) are treated as best-effort with background reconciliation, rather than requiring two-phase commit. Full saga compensation (explicit rollback actions for each step) would be appropriate if, say, publishing failure meant the pipeline should be marked inactive. The pattern scales from ‘log and retry’ to full compensating transactions depending on consistency requirements.
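For the full-compensation end of that spectrum, a minimal sketch might look like the following. Step, World, run_saga, and the four sample functions are hypothetical; the "world" is a plain Vec<String> standing in for the database and registry. When a step fails, the completed steps are undone in reverse order.

```rust
// Stand-in for external state (database rows, registry entries).
pub type World = Vec<String>;

pub struct Step {
    pub name: &'static str,
    pub action: fn(&mut World) -> Result<(), String>,
    pub compensate: fn(&mut World),
}

// Runs steps in order; on failure, compensates completed steps newest first.
pub fn run_saga(steps: &[Step], world: &mut World) -> Result<(), String> {
    let mut completed: Vec<&Step> = Vec::new();
    for step in steps {
        match (step.action)(world) {
            Ok(()) => completed.push(step),
            Err(e) => {
                for done in completed.iter().rev() {
                    (done.compensate)(world);
                }
                return Err(format!("{} failed: {}", step.name, e));
            }
        }
    }
    Ok(())
}

// Sample steps for illustration.
pub fn save(w: &mut World) -> Result<(), String> {
    w.push("db:pipeline".to_string());
    Ok(())
}

pub fn unsave(w: &mut World) {
    w.retain(|x| x != "db:pipeline");
}

pub fn register_unavailable(_w: &mut World) -> Result<(), String> {
    Err("registry unavailable".to_string())
}

pub fn noop(_w: &mut World) {}
```

Compensations can fail too. A real implementation would need to persist saga progress and retry them, which this sketch leaves out.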


VII. Group 4: Concurrency and Architecture

Antipattern 20: Monolithic Startup to Plugin Architecture

Before: Adding a new source or sink type required modifying core initialization code in multiple files:

// startup.ts — grows with every new component
import { KafkaSource } from './sources/kafka';
import { S3Sink } from './sinks/s3';
import { HttpSource } from './sources/http';
// ... 30 more imports

function init() {
  registerSource('kafka', KafkaSource);
  registerSource('http', HttpSource);
  // ... grows linearly
}

After: Cargo features allow components to be compiled in or out. The function registry pattern means new types are added without modifying existing code:

# Cargo.toml
[features]
default = ["http-source", "file-source", "stdout-sink"]
http-source = []
file-source = []
stdout-sink = []
memory-sink = []

// New source? Implement the trait and register in the feature-gated module.
// No existing code changes.
#[cfg(feature = "http-source")]
registry.register_source("http", Box::new(HttpSourceFactory));

Antipattern 21: OS Process Forking to Actor Model

Before: The legacy system scaled by forking OS processes, each with its own copy of global state:

import cluster from 'cluster';
if (cluster.isPrimary) {
  for (let i = 0; i < numCPUs; i++) cluster.fork();
} else {
  startWorker(); // entire app copied, 200MB per worker
}

After: Lightweight async actors communicate through bounded channels:

pub struct PipelineActor {
    rx: mpsc::Receiver<PipelineActorMsg>,
    output_tx: mpsc::Sender<Vec<Event>>,
    functions: Vec<Box<dyn PipelineFn>>,
    state: PipelineActorState,
}

impl PipelineActor {
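    pub async fn run(mut self) {
        // Borrow the boxed functions as trait-object references for the engine
        let fn_refs: Vec<&dyn PipelineFn> = self.functions.iter().map(|f| f.as_ref()).collect();
        while let Some(msg) = self.rx.recv().await {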
            match msg {
                PipelineActorMsg::ProcessBatch(events) => {
                    let result = PipelineEngine::process_batch(events, &fn_refs);
                    self.state.processed += result.passed.len() as u64;
                    if !result.passed.is_empty() {
                        let _ = self.output_tx.send(result.passed).await;
                    }
                }
                PipelineActorMsg::Shutdown => break,
            }
        }
    }
}

This is Erlang’s actor model translated to Tokio tasks, with channels borrowed from CSP. The key insight from both models: if there is no shared mutable state, there is nothing to race over. Tokio’s bounded mpsc channel plays the role of a buffered CSP channel: sender and receiver synchronize on the buffer, and backpressure propagates automatically when the buffer is full.
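The same shape works without an async runtime. The sketch below uses an OS thread and std::sync::mpsc::sync_channel as stand-ins for a Tokio task and tokio::sync::mpsc; CounterActor, Msg, and spawn_actor are hypothetical names. The counter is owned by exactly one thread, and callers read it by sending a message that carries a reply channel.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

pub enum Msg {
    ProcessBatch(Vec<String>),
    // Carries a reply channel so the caller can read actor-owned state.
    GetProcessed(SyncSender<u64>),
    Shutdown,
}

struct CounterActor {
    rx: Receiver<Msg>,
    processed: u64, // touched only by the actor's own thread
}

impl CounterActor {
    fn run(mut self) {
        // recv() returns Err once every sender is dropped, which ends the loop.
        while let Ok(msg) = self.rx.recv() {
            match msg {
                Msg::ProcessBatch(events) => self.processed += events.len() as u64,
                Msg::GetProcessed(reply) => {
                    let _ = reply.send(self.processed);
                }
                Msg::Shutdown => break,
            }
        }
    }
}

// Bounded mailbox: senders block once `buffer` messages are queued.
pub fn spawn_actor(buffer: usize) -> (SyncSender<Msg>, thread::JoinHandle<()>) {
    let (tx, rx) = sync_channel(buffer);
    let handle = thread::spawn(move || {
        let actor = CounterActor { rx, processed: 0 };
        actor.run()
    });
    (tx, handle)
}
```

A thread per actor is far heavier than a Tokio task, so treat this as a model of the ownership story rather than a drop-in replacement.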

Antipattern 22: Leader Bottleneck to Version Vectors

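Rather than a single leader node holding all configuration state, each entity carries its own version number. Concurrent updates to different pipelines do not conflict. (Strictly speaking, this is a per-entity version counter used for optimistic concurrency; a full version vector would keep one counter per replica.)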

pub struct Pipeline {
    pub version: u64, // incremented on every mutation
    // ...
}

impl Pipeline {
    pub fn add_function(&mut self, func: PipelineFunction) {
        self.functions.push(func);
        self.version += 1;
    }
}

// Optimistic concurrency: "update only if still at version 7"
pub async fn save(&self, pipeline: &Pipeline) -> Result<(), DomainError> {
    let rows = sqlx::query("UPDATE pipelines SET ... WHERE id = ? AND version = ?")
        .bind(pipeline.id.as_str())
        .bind(pipeline.version - 1) // expected previous version
        .execute(&self.pool).await?;
    if rows.rows_affected() == 0 {
        return Err(DomainError::ConcurrencyConflict);
    }
    Ok(())
}

The principled FP alternative to optimistic locking is Software Transactional Memory (STM): compose atomic operations on shared memory without locks, with automatic retry on conflict. Haskell’s atomically $ do { modifyTVar from (subtract n); modifyTVar to (+ n) } makes multi-step updates composable: either all of them happen or none do. Rust doesn’t have STM in the standard library, and for database-backed state, optimistic locking (version vectors + UPDATE WHERE version = N) achieves similar semantics: detect conflicts at commit time, retry at the application layer. STM is preferable when conflicts are rare and the critical section is in-memory; version vectors scale to distributed state across process boundaries.
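The detect-then-retry cycle is easy to see with an in-memory stand-in for the pipelines table. InMemoryRepo, SaveError, and the free function add_function are hypothetical; the version check in save plays the role of the WHERE version = ? clause.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Pipeline {
    pub id: String,
    pub functions: Vec<String>,
    pub version: u64,
}

#[derive(Debug, PartialEq)]
pub enum SaveError {
    ConcurrencyConflict { expected: u64, found: u64 },
}

// Hypothetical in-memory stand-in for the pipelines table.
#[derive(Default)]
pub struct InMemoryRepo {
    rows: HashMap<String, Pipeline>,
}

impl InMemoryRepo {
    pub fn insert(&mut self, p: Pipeline) {
        self.rows.insert(p.id.clone(), p);
    }

    pub fn get(&self, id: &str) -> Option<Pipeline> {
        self.rows.get(id).cloned()
    }

    // Writes only if the stored version still equals what the caller read.
    // A missing row is treated as version 0 in this sketch.
    pub fn save(&mut self, updated: Pipeline, expected_version: u64) -> Result<(), SaveError> {
        let found = self.rows.get(&updated.id).map(|p| p.version).unwrap_or(0);
        if found != expected_version {
            return Err(SaveError::ConcurrencyConflict { expected: expected_version, found });
        }
        self.rows.insert(updated.id.clone(), updated);
        Ok(())
    }
}

// Every mutation bumps the version.
pub fn add_function(mut p: Pipeline, name: &str) -> Pipeline {
    p.functions.push(name.to_string());
    p.version += 1;
    p
}
```

The losing writer re-reads the row, reapplies its change on top of the winner's, and saves again with the new expected version.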

Antipattern 23: Shared Code Bloat to Feature-Gated Modules

The Cargo features system means you only compile what you need. A deployment that only uses HTTP sources does not include the file-tailing code. Binary size stays small, and the dependency graph is explicit.

// Only compiled when the feature is enabled
#[cfg(feature = "file-source")]
pub mod file_source;

#[cfg(feature = "http-source")]
pub mod http_source;

Antipattern 24: Push Without Backpressure to Bounded Channels

Before: Producers pushed events into unbounded queues. Under load, memory grew until the process OOM’d:

const queue: Event[] = []; // grows forever
source.on('data', event => queue.push(event)); // no limit!

After: Bounded channels create natural backpressure. When the buffer is full, producers wait:

pub struct HttpEventSource {
    sender: mpsc::Sender<Event>,
    receiver: Option<mpsc::Receiver<Event>>,
}

impl HttpEventSource {
    pub fn new(buffer_size: usize) -> Self {
        let (sender, receiver) = mpsc::channel(buffer_size); // bounded!
        Self { sender, receiver: Some(receiver) }
    }
}

Bounded channels are the Rust equivalent of reactive streams backpressure: when the downstream consumer can’t keep up, the sender.send().await call suspends the producer task rather than buffering unboundedly. The pipeline becomes a dataflow graph where each stage’s throughput is constrained by its slowest downstream neighbor.
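You can watch the buffer fill up with the standard library's sync_channel, used here in place of tokio::sync::mpsc so the sketch has no dependencies. fill_until_full is a hypothetical helper: it uses the non-blocking try_send to probe the channel, whereas a plain send would block the producer, which is the synchronous analogue of send().await suspending the task.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Returns how many events were accepted before the channel refused one.
pub fn fill_until_full(buffer_size: usize, attempts: usize) -> usize {
    // The receiver is kept alive but never drained, simulating a stalled consumer.
    let (tx, _rx) = sync_channel::<u32>(buffer_size);
    let mut accepted = 0;
    for i in 0..attempts {
        match tx.try_send(i as u32) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With a buffer of 3 and a consumer that never reads, only 3 sends succeed no matter how many are attempted. Memory use is capped by construction.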

Antipattern 25: Polling to Lazy Pull Streams

Before: Workers polled for new data on a timer, wasting CPU when idle and introducing latency when busy:

setInterval(async () => {
  const batch = await queue.poll(); // wasteful when idle
  if (batch.length > 0) process(batch);
}, 100); // 100ms latency floor

After: Event sources implement the Stream trait. Consumers pull one item at a time via .next().await, which parks the task until data is available:

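use futures::StreamExt;

// Create the stream once; calling source.stream() inside the loop condition
// would build a fresh stream on every iteration
let mut events = source.stream();

// Consumer pulls events on demand: no polling, no wasted cycles
while let Some(event) = events.next().await {
    let results = PipelineEngine::process_event(event, &fn_refs);
    for result in results {
        sink.write(result).await?;
    }
}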

A Stream is corecursive: where recursion consumes a finite structure by breaking it down (a catamorphism, like Antipattern 28), corecursion produces a potentially infinite structure by building it up one step at a time (an anamorphism). FileSource::stream() is an anamorphism over the file: the seed is the file handle, each step produces one event and a new handle position, and the stream terminates when the handle is exhausted. The Stream trait is Rust’s lazy asynchronous sequence, the counterpart of Haskell’s lazy lists or Scala’s LazyList. Nothing is computed until the consumer calls .next().await. This is demand-driven (pull) evaluation: the producer runs exactly as fast as the consumer needs, with no intermediate buffering and no polling overhead.
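The seed-and-step structure is easiest to see with the synchronous Iterator, which the standard library can unfold directly. In this sketch, events and doubling are hypothetical helpers: the first unfolds a string into lines (seed = remaining input), the second unfolds an unbounded sequence that is only safe to use because nothing is computed until it is pulled.

```rust
// Anamorphism over text: the seed is the unread remainder of the input.
pub fn events(input: &str) -> impl Iterator<Item = String> + '_ {
    let mut rest = input;
    std::iter::from_fn(move || {
        if rest.is_empty() {
            return None; // seed exhausted: the sequence ends
        }
        let (line, tail) = match rest.find('\n') {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        rest = tail; // next seed
        Some(line.to_string())
    })
}

// Powers of two, produced one step at a time from the previous value.
// checked_mul ends the sequence instead of overflowing.
pub fn doubling() -> impl Iterator<Item = u64> {
    std::iter::successors(Some(1u64), |n| n.checked_mul(2))
}
```

Calling doubling().take(5) computes five values and stops; the rest of the sequence never exists.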


VIII. Group 5: Advanced Functional Patterns

Antipattern 19: Opaque Service Interfaces to Capability Traits

Before: Services exposed god-interfaces with dozens of methods, most irrelevant to any given caller:

interface PipelineService {
  create(p: Pipeline): void;
  delete(id: string): void;
  process(event: any): any;
  getMetrics(): Metrics;
  reload(): void;
  // ... 20 more methods
}

After: Each capability is a separate trait. Callers depend only on what they need:

// Fine-grained capability traits
pub trait FunctionResolver: Send + Sync {
    fn resolve(&self, config: &FunctionConfig) -> Result<Box<dyn PipelineFn>, DomainError>;
}

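// A trait with async fn cannot be used as dyn Trait on its own; the async_trait
// crate boxes the returned futures so Arc<dyn PipelineRepository> works
#[async_trait::async_trait]
pub trait PipelineRepository: Send + Sync {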
    async fn get(&self, id: &PipelineId) -> Result<Pipeline, DomainError>;
    async fn save(&self, pipeline: &Pipeline) -> Result<(), DomainError>;
}

// Callers declare exactly what they need — nothing more
struct IngestHandler {
    resolver: Arc<dyn FunctionResolver>,
    repo: Arc<dyn PipelineRepository>,
}

Fine-grained capability traits are Tagless Final in practice. Instead of a concrete PipelineService god-object, you declare your algebra as a set of type class constraints: fn ingest<R, P>(resolver: &R, repo: &P, event: Event) where R: FunctionResolver and P: PipelineRepository. The function is polymorphic over its effects: you substitute production implementations at the composition root and test stubs in unit tests. The generic form is monomorphized at compile time, so it avoids the vtable indirection that Arc<dyn FunctionResolver> fields pay for.
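Here is what the generic form might look like end to end. The trait method signatures are simplified for this sketch (synchronous, string-based, with a hypothetical functions_for method), and StubRepo and StubResolver are test doubles invented for the example.

```rust
pub trait FunctionResolver {
    fn resolve(&self, name: &str) -> Result<fn(&str) -> String, String>;
}

pub trait PipelineRepository {
    fn functions_for(&self, pipeline_id: &str) -> Result<Vec<String>, String>;
}

// Polymorphic over both capabilities; the compiler generates one copy per (R, P).
pub fn ingest<R, P>(resolver: &R, repo: &P, pipeline_id: &str, event: &str) -> Result<String, String>
where
    R: FunctionResolver,
    P: PipelineRepository,
{
    let mut current = event.to_string();
    for name in repo.functions_for(pipeline_id)? {
        let f = resolver.resolve(&name)?;
        current = f(&current);
    }
    Ok(current)
}

// Test stubs: no database, no registry, no mocking framework.
pub struct StubRepo;
pub struct StubResolver;

fn upper(s: &str) -> String {
    s.to_uppercase()
}

fn exclaim(s: &str) -> String {
    format!("{}!", s)
}

impl PipelineRepository for StubRepo {
    fn functions_for(&self, _pipeline_id: &str) -> Result<Vec<String>, String> {
        Ok(vec!["upper".to_string(), "exclaim".to_string()])
    }
}

impl FunctionResolver for StubResolver {
    fn resolve(&self, name: &str) -> Result<fn(&str) -> String, String> {
        match name {
            "upper" => Ok(upper as fn(&str) -> String),
            "exclaim" => Ok(exclaim as fn(&str) -> String),
            other => Err(format!("unknown function {}", other)),
        }
    }
}
```

The trade-off against Arc<dyn Trait> fields is compile time and binary size: each distinct pair of implementations produces its own copy of ingest.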

Antipattern 26: Deep Inheritance to Trait Composition

Before: A 6-level inheritance hierarchy where each level overrode different methods:

class BaseProcessor { ... }
class FilteringProcessor extends BaseProcessor { ... }
class EnrichingProcessor extends FilteringProcessor { ... }
class BatchingEnrichingProcessor extends EnrichingProcessor { ... }
// "Which version of transform() am I actually running?" — nobody knows

After: Behavior is defined through trait composition. No inheritance. Each implementation is independent and flat:

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

// Each implementation is flat — no hierarchy, no overriding
impl PipelineFn for EvalFn { ... }
impl PipelineFn for DropFn { ... }
impl PipelineFn for MaskFn { ... }
impl PipelineFn for RegexExtractFn { ... }

You never ask “which version of process() am I actually running?” There is exactly one implementation per type. No surprises.

Antipattern 27: Unbounded Recursion to Iterative Fold

Before: Batch processing used recursion, one stack frame per pipeline function, so a long enough chain could blow the stack:

function processAll(events: any[], fns: Function[], idx: number): any[] {
  if (idx >= fns.length) return events;
  return processAll(events.map(fns[idx]), fns, idx + 1); // stack overflow risk
}

After: The pipeline engine uses iterative fold. Stack overflow is impossible regardless of pipeline length:

// Iterative: each function is applied in a loop, not via recursion
for func in functions {
    let mut next_batch = Vec::new();
    for evt in current_events {
        match func.process(evt) {
            FnResult::Pass(e) => next_batch.push(e),
            FnResult::Split(es) => next_batch.extend(es),
            FnResult::Drop => {}
        }
    }
    current_events = next_batch;
}

Antipattern 28: Ad-Hoc Recursion to Catamorphism

A catamorphism is a recursive fold over a tree structure: you define how to handle each node type, and the recursion follows the shape of the data. The routing engine evaluates filter expressions using this pattern:

impl RoutingEngine {
    // Self:: below refers to the routing engine that owns this method
    pub fn evaluate_filter(filter: &FilterExpr, event: &Event) -> Result<bool, DomainError> {
        match filter {
            FilterExpr::Eq(field, expected) => {
                Ok(event.get_field(field) == Some(expected))
            }
            FilterExpr::And(left, right) => {
                Ok(Self::evaluate_filter(left, event)? && Self::evaluate_filter(right, event)?)
            }
            FilterExpr::Or(left, right) => {
                Ok(Self::evaluate_filter(left, event)? || Self::evaluate_filter(right, event)?)
            }
            FilterExpr::Not(inner) => Self::evaluate_filter(inner, event).map(|b| !b),
            FilterExpr::True => Ok(true),
        }
    }
}

The catamorphism’s real value is that it separates what to compute at each node from how to recurse. The traversal needs no separate bookkeeping: the match on the enum mirrors the type, with one arm per variant and a recursive call exactly where the type is recursive. Add a new FilterExpr variant and every match that does not handle it becomes a compile error.
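To separate "what to compute" from "how to recurse" completely, the traversal can be written once as a generic fold and reused with different per-node functions. In this sketch, Layer, fold, evaluate, and size are hypothetical names, and the event is reduced to a single field/value pair to keep it short. One caveat: a generic fold evaluates both children before combining them, so And and Or lose the short-circuiting of the hand-written version.

```rust
pub enum FilterExpr {
    Eq(String, String),
    And(Box<FilterExpr>, Box<FilterExpr>),
    Or(Box<FilterExpr>, Box<FilterExpr>),
    Not(Box<FilterExpr>),
    True,
}

// One layer of the tree whose children have already been reduced to R.
pub enum Layer<'a, R> {
    Eq(&'a str, &'a str),
    And(R, R),
    Or(R, R),
    Not(R),
    True,
}

impl FilterExpr {
    // The recursion is written exactly once, here.
    pub fn fold<R>(&self, alg: &dyn Fn(Layer<'_, R>) -> R) -> R {
        match self {
            FilterExpr::Eq(f, v) => alg(Layer::Eq(f, v)),
            FilterExpr::And(l, r) => alg(Layer::And(l.fold(alg), r.fold(alg))),
            FilterExpr::Or(l, r) => alg(Layer::Or(l.fold(alg), r.fold(alg))),
            FilterExpr::Not(inner) => alg(Layer::Not(inner.fold(alg))),
            FilterExpr::True => alg(Layer::True),
        }
    }
}

// Per-node logic only: does the expression hold for this field/value pair?
pub fn evaluate(expr: &FilterExpr, field: &str, value: &str) -> bool {
    expr.fold::<bool>(&|layer| match layer {
        Layer::Eq(f, v) => f == field && v == value,
        Layer::And(l, r) => l && r,
        Layer::Or(l, r) => l || r,
        Layer::Not(b) => !b,
        Layer::True => true,
    })
}

// A second computation reusing the same traversal: count the nodes.
pub fn size(expr: &FilterExpr) -> usize {
    expr.fold::<usize>(&|layer| match layer {
        Layer::Eq(..) | Layer::True => 1,
        Layer::And(l, r) | Layer::Or(l, r) => 1 + l + r,
        Layer::Not(n) => 1 + n,
    })
}
```

Neither evaluate nor size contains a recursive call; adding a third analysis (depth, referenced fields, pretty-printing) is another small closure.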

Antipattern 29: Hardcoded Parsers to Parser Combinators

Before: Filter expressions were parsed with regex and string splitting, growing more fragile with each new operator:

function parseFilter(expr: string): Filter {
  if (expr.includes(' AND ')) {
    const parts = expr.split(' AND ');
    return { type: 'and', left: parseFilter(parts[0]), right: parseFilter(parts[1]) };
  }
  // fails silently on malformed input
}

After: Parser combinators (using nom) build complex parsers from small, tested pieces:

fn parse_comparison(input: &str) -> IResult<&str, FilterExpr> {
    let (input, field) = parse_identifier(input)?;
    let (input, _) = multispace0(input)?;
    let (input, op) = alt((tag("=="), tag("!="), tag(">"), tag("<"), tag("contains")))(input)?;
    let (input, _) = multispace0(input)?;
    let (input, value) = parse_value(input)?;

    let expr = match op {
        "==" => FilterExpr::Eq(field, value),
        "!=" => FilterExpr::Neq(field, value),
        ">" => FilterExpr::Gt(field, value),
        "<" => FilterExpr::Lt(field, value),
        "contains" => FilterExpr::Contains(field, value),
        _ => unreachable!(),
    };
    Ok((input, expr))
}

fn parse_and(input: &str) -> IResult<&str, FilterExpr> {
    let (input, left) = parse_atom(input)?;
    let (input, _) = delimited(multispace0, tag_no_case("AND"), multispace0)(input)?;
    let (input, right) = parse_expr(input)?;
    Ok((input, FilterExpr::And(Box::new(left), Box::new(right))))
}

Parser combinators are applicative by nature: small parsers such as tag("==") and parse_value are defined independently and then composed by sequencing (both must succeed) and by alt (choice). Unlike monadic composition, where the next parser is chosen based on the previous result, applicative composition fixes the structure of the parser up front and only combines outputs. In Haskell terms, sequencing two independent parsers is f <*> g, while alt((tag("=="), tag("!="))) is the Alternative choice f <|> g: both parsers are defined statically, with no dependency between them.
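To show how little machinery choice and sequencing need, here is a hand-rolled sketch that uses no parsing library. PResult, tag, alt, and pair are hypothetical helpers written for this example; they mimic the shape of nom's combinators but are not nom's API.

```rust
// A parser maps input to (remaining input, value), or None on failure.
pub type PResult<'a, T> = Option<(&'a str, T)>;

// Matches a fixed prefix.
pub fn tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'static str> {
    move |input: &'a str| input.strip_prefix(expected).map(|rest| (rest, expected))
}

// Choice: try p; if it fails, try q on the same input.
pub fn alt<'a, T>(
    p: impl Fn(&'a str) -> PResult<'a, T>,
    q: impl Fn(&'a str) -> PResult<'a, T>,
) -> impl Fn(&'a str) -> PResult<'a, T> {
    move |input: &'a str| p(input).or_else(|| q(input))
}

// Sequence: run p, then q on what is left, and pair the outputs.
// q is fixed before any input is seen; it never depends on p's result.
pub fn pair<'a, A, B>(
    p: impl Fn(&'a str) -> PResult<'a, A>,
    q: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, (A, B)> {
    move |input: &'a str| {
        let (rest, a) = p(input)?;
        let (rest, b) = q(rest)?;
        Some((rest, (a, b)))
    }
}
```

A format where the first value decides how to parse the rest (a length prefix, for example) cannot be expressed with pair alone; that is where monadic chaining becomes necessary.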

Antipattern 30: Stringly-Typed Field Access to Typed Lenses

Before: Accessing nested event data was a chain of string lookups with no type safety:

const value = event.fields["user"]["email"]; // undefined? string? number? who knows
if (value) { /* hope it's a string */ }

After: Typed accessor methods (lens-style) provide safe, focused access to nested data:

// get_field returns Option<&FieldValue> — forces the caller to handle absence
let email = event.get_field("user.email");

// set_field returns a new event — the lens "focuses" on one field
// and produces a new whole from the modified part
let masked = event.set_field("user.email", FieldValue::Str("[REDACTED]".into()));

// Type-safe: you know exactly what you're getting
match event.get_field("severity") {
    Some(FieldValue::Int(level)) => route_by_severity(*level),
    Some(FieldValue::Str(s)) => route_by_severity(s.parse()?),
    None => route_to_default(),
    _ => Err(DomainError::Validation("unexpected severity type".into())),
}
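For comparison, a lens in the stricter sense is just a getter/setter pair for one part of a larger value, packaged so it can be passed around. The sketch below assumes a statically typed event with a nested user struct, which differs from the dynamic get_field("user.email") API above; Lens, user_email, and the two accessor functions are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct User {
    pub email: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub user: User,
    pub severity: i64,
}

// A lens focuses on one part A inside a whole S.
pub struct Lens<S, A> {
    pub get: fn(&S) -> A,
    pub set: fn(S, A) -> S,
}

impl<S, A> Lens<S, A> {
    // Read the part, transform it, and build a new whole.
    pub fn modify(&self, whole: S, f: impl FnOnce(A) -> A) -> S {
        let part = (self.get)(&whole);
        (self.set)(whole, f(part))
    }
}

fn get_email(e: &Event) -> String {
    e.user.email.clone()
}

fn set_email(e: Event, email: String) -> Event {
    Event { user: User { email }, ..e }
}

pub fn user_email() -> Lens<Event, String> {
    Lens { get: get_email, set: set_email }
}
```

The getter clones the field to keep the types simple; a borrowing getter is possible but needs more lifetime plumbing than this example is worth.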

Antipattern 31: Implicit Mutable State to Reducer Pattern

The actor’s message loop is a reducer: it receives a message and transitions to a new state. The state is always consistent because there is only one owner (the actor itself):

// State transitions are explicit and atomic
PipelineActorMsg::ProcessBatch(events) => {
    let result = PipelineEngine::process_batch(events, &fn_refs);
    self.state.processed += result.passed.len() as u64;
    self.state.dropped += result.dropped;
}

No concurrent access. No locks. No race conditions. The actor pattern plus Rust’s ownership model guarantees single-writer semantics.
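Written as a pure function, the reducer is (state, message) -> state, and an actor's entire history is a fold over its messages. ActorState, Msg, reduce, and replay below are hypothetical names for a minimal sketch of that idea.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct ActorState {
    pub processed: u64,
    pub dropped: u64,
}

pub enum Msg {
    BatchDone { passed: u64, dropped: u64 },
    Reset,
}

// Pure transition: old state plus message gives new state.
pub fn reduce(state: ActorState, msg: &Msg) -> ActorState {
    match msg {
        Msg::BatchDone { passed, dropped } => ActorState {
            processed: state.processed + passed,
            dropped: state.dropped + dropped,
        },
        Msg::Reset => ActorState::default(),
    }
}

// Replaying a message log reconstructs the state deterministically.
pub fn replay(msgs: &[Msg]) -> ActorState {
    msgs.iter().fold(ActorState::default(), reduce)
}
```

Because reduce has no side effects, the same message log always yields the same state, which makes production incidents replayable in a unit test.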

Antipattern 32: Monkey-Patching to Extension via Traits

Before: Extending behavior meant modifying existing classes or patching prototypes at runtime:

// Monkey-patching: modifying someone else's class at runtime
Pipeline.prototype.customProcess = function() { /* surprise! */ };

After: You implement a trait for your type. The registry accepts any Box<dyn PipelineFn> — your custom function is a first-class citizen without modifying any framework code:

// Your custom function — no framework modification needed
struct MyCustomFn { config: MyConfig }

impl PipelineFn for MyCustomFn {
    fn name(&self) -> &str { "my_custom" }
    fn process(&self, event: Event) -> FnResult { /* your logic */ }
}

// Register it alongside built-in functions
registry.register("my_custom", Box::new(MyCustomFnFactory));

Antipattern 33: Implicit Ordering to Typestate Lifecycle

The actor has a clear lifecycle: Created, Running, Stopped. The run() method consumes self, making it impossible to touch the actor directly once it has been started; all you keep is the task handle and the channel sender:

impl PipelineActor {
    pub async fn run(mut self) { // takes ownership — actor is "consumed"
        while let Some(msg) = self.rx.recv().await { ... }
        // When this returns, the actor is done. No zombie state.
    }
}

// After spawning, you only have the handle — not the actor itself
let handle = tokio::spawn(actor.run()); // actor moved into the task
// actor.do_something(); // COMPILE ERROR: actor has been moved
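The three lifecycle states can also be encoded as type parameters, so that each state exposes only the operations that are legal in it. This is a minimal sketch with hypothetical names (Actor<S>, Created, Running, Stopped); it is not how PipelineActor is defined above.

```rust
use std::marker::PhantomData;

// Zero-sized marker types, one per lifecycle state.
pub struct Created;
pub struct Running;
pub struct Stopped;

pub struct Actor<S> {
    processed: u64,
    _state: PhantomData<S>,
}

impl Actor<Created> {
    pub fn new() -> Self {
        Actor { processed: 0, _state: PhantomData }
    }

    // Consumes the Created actor; the old value can no longer be used.
    pub fn start(self) -> Actor<Running> {
        Actor { processed: self.processed, _state: PhantomData }
    }
}

impl Actor<Running> {
    pub fn process(mut self, n: u64) -> Self {
        self.processed += n;
        self
    }

    pub fn stop(self) -> Actor<Stopped> {
        Actor { processed: self.processed, _state: PhantomData }
    }
}

impl Actor<Stopped> {
    // Final counts are readable only after the actor has stopped.
    pub fn processed(&self) -> u64 {
        self.processed
    }
}
```

Calling process on a Created or Stopped actor, or start on a Running one, is a compile error because those methods do not exist for that type.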

Antipattern 34: Window via Mutation to Comonad-Style

A comonad is a structure that provides context around a focused element. Think of it as the dual of a monad: where a monad lets you wrap a value in a context and chain computations that produce more context, a comonad lets you extract a value from a context and compute over its neighborhood.

Before: Sliding windows were implemented as mutable arrays with index arithmetic:

class SlidingWindow {
  private buffer: any[] = [];
  private index = 0;
  push(item: any) { this.buffer[this.index++ % this.size] = item; }
  getContext() { /* complex index math, off-by-one bugs */ }
}

After: A comonad-style window provides extract() (get the focused value) and extend() (apply a context-aware function at every position):

pub struct SlidingWindow<T> {
    items: VecDeque<T>,
    focus_idx: usize,
    window_size: usize,
}

impl<T: Clone> SlidingWindow<T> {
    /// Get the focused element (comonad extract)
    pub fn extract(&self) -> Option<&T> {
        self.items.get(self.focus_idx)
    }

    /// Apply a function at every position, producing a new window (comonad extend)
    pub fn extend<B, F>(&self, f: F) -> SlidingWindow<B>
    where
        F: Fn(&SlidingWindow<T>) -> B,
        B: Clone,
    {
        let mut results = VecDeque::with_capacity(self.items.len());
        for i in 0..self.items.len() {
            let shifted = SlidingWindow {
                items: self.items.clone(),
                focus_idx: i,
                window_size: self.window_size,
            };
            results.push_back(f(&shifted));
        }
        SlidingWindow { items: results, focus_idx: self.focus_idx, window_size: self.window_size }
    }
}

A monad lets you chain ‘what to do next’ (flatMap); a comonad lets you ask ‘what does the context around this value say’ (extend). The classic examples are spreadsheets (each cell is a value with a grid of neighbors) and Conway’s Game of Life (extend step grid applies the evolution rule at every cell simultaneously). In the pipeline, extend lets you compute a moving average or rate-of-change at every position in one pass, without index arithmetic.
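Here is the moving-average case worked through on a simplified window. Window, neighborhood, and moving_average are hypothetical names for this sketch; it uses a Vec instead of a VecDeque and rejects empty input so that extract can return a value directly.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Window<T> {
    items: Vec<T>,
    focus: usize,
}

impl<T> Window<T> {
    // Empty windows are rejected so that extract always has a value.
    pub fn new(items: Vec<T>) -> Option<Self> {
        if items.is_empty() { None } else { Some(Window { items, focus: 0 }) }
    }

    pub fn extract(&self) -> &T {
        &self.items[self.focus]
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }

    // The focus plus up to `radius` neighbors per side, clipped at the edges.
    pub fn neighborhood(&self, radius: usize) -> &[T] {
        let start = self.focus.saturating_sub(radius);
        let end = (self.focus + radius + 1).min(self.items.len());
        &self.items[start..end]
    }
}

impl<T: Clone> Window<T> {
    // Re-focus on every position and apply the context-aware function there.
    pub fn extend<B>(&self, f: impl Fn(&Window<T>) -> B) -> Window<B> {
        let items: Vec<B> = (0..self.items.len())
            .map(|i| f(&Window { items: self.items.clone(), focus: i }))
            .collect();
        Window { items, focus: self.focus }
    }
}

// Context-aware function: average of the focus and its immediate neighbors.
pub fn moving_average(w: &Window<f64>) -> f64 {
    let n = w.neighborhood(1);
    n.iter().sum::<f64>() / n.len() as f64
}
```

Note that extend clones the item list once per position, so it costs O(n²) copies as written. Sharing the items behind an Rc would remove that cost without changing the interface.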

Antipattern 35: Static Worker Assignment to Work-Stealing

Before: Work was distributed round-robin to a fixed number of workers, causing hot spots:

const workers = Array.from({ length: 4 }, () => new Worker());
let nextWorker = 0;
function dispatch(batch) {
  workers[nextWorker++ % workers.length].send(batch); // unbalanced
}

After: For CPU-bound batch processing, rayon’s parallel iterators provide work-stealing scheduling:

use rayon::prelude::*;

// rayon automatically distributes work across cores
let results: Vec<BatchResult> = batches
    .par_iter()
    .map(|batch| PipelineEngine::process_batch(batch.clone(), &fn_refs))
    .collect();

Use rayon for CPU-bound batch processing where tasks are independent and similar in size. Use the actor-per-pipeline model (Antipattern 21) for I/O-bound work and heterogeneous task sizes: actors handle backpressure and message ordering; rayon just parallelizes.


IX. The Human Cost

The patterns described here are not primarily about performance; they are about cognitive load. When errors are values, when states are explicit in types, when illegal states are unrepresentable, and when each function does one thing, a new engineer can understand any individual piece in isolation. That is the real dividend of functional discipline: onboarding time and debugging time drop together.

Each pattern above addresses a real cost that the team paid every day. For example, new engineers on the legacy system could not ship features for months, not because observability pipelines are conceptually hard but because the system had enormous artificial complexity. There was no way to understand one piece in isolation because everything depended on everything else.

When errors are swallowed, states are implicit, and types are erased, debugging a production incident means reading every log line and reconstructing what happened. In the new system, errors propagate with context. The route table is immutable, so corruption is structurally impossible. All of these costs reinforce each other. Slow onboarding means fewer experienced engineers. Fewer experienced engineers means less refactoring capacity. Less refactoring means more debt.


X. Conclusion

This is not a story about Rust vs. TypeScript. TypeScript with strict: true, branded types, and careful architecture can achieve many of the same guarantees. A working POC at github.com/bhatti/pipeflow implements all the patterns described. The lesson is about principles:

  1. Keep what works. Pipes and Filters, Decorator/Enrich, Source/Sink worked. The problem was their implementation, not their design.
  2. Make illegal states unrepresentable. Use sum types (enums where each variant carries different data) and typestate (using the type system to enforce valid state transitions) to shift runtime errors to compile-time.
  3. Separate effects from logic. Pure domain functions are trivially testable and infinitely composable.
  4. Enforce boundaries with the build system. Architecture diagrams lie. Compiler errors do not.
  5. Prefer immutable data. Clone when you need to diverge. The clarity is worth the allocation.
  6. Make errors explicit. Result<T, E> in the type signature. No swallowing. No surprises.
  7. Compose small functions. A pipeline of 5 focused transforms beats one 500-line method.
  8. Name the patterns. Immutable values, sum types, typestate, catamorphism, comonad are not buzzwords. They are compressed names for solutions that took decades to discover. Knowing the name means knowing the laws, the composability guarantees, and the tradeoffs.

The mud did not accumulate overnight, and it will not disappear overnight. But every boundary you draw, every type you make explicit, every error you refuse to swallow makes the next change slightly easier. That is how you reverse the flywheel.

Source code: The full POC implementing all patterns described here is available as an open-source Rust project at github.com/bhatti/pipeflow.


XI. Pattern Index

| # | Antipattern -> Solution | Core FP Concept(s) | Section |
|---|---|---|---|
| 1 | Singletons -> Dependency Injection | Reader Monad, Functional Core/Imperative Shell | IV |
| 2 | Mutable State -> Immutable Values | Referential Transparency, Value Semantics | IV |
| 3 | Mode Branching -> Sum Types | ADT (Sum Types), Exhaustive Pattern Matching | V |
| 4 | String Dispatch -> Registry | Tagless Final (lite), Open/Closed, First-Class Functions | V |
| 5 | God Class -> Bounded Contexts | Module Systems, FCIS, Separation of Concerns | IV |
| 6 | forEach + Push -> Iterator Combinators | Functor (map), Fold / Catamorphism, Lazy Pipelines | VI |
| 7 | Error Swallowing -> Result Types | Monad (bind / ?), Either / Option, Monadic Chaining | IV |
| 8 | Temporal Coupling -> Typestate Builder | Phantom Types, Affine / Linear Types, Typestate | V |
| 9 | Global Registry -> Persistent Data Structures | Persistent DS, Structural Sharing, Immutable Updates | V |
| 10 | Callback Chains -> Async Composition | Monad (sequential composition), CPS (async/await desugaring) | VI |
| 11 | Primitive Obsession -> Newtypes | Newtype Pattern, Phantom Types, Zero-Cost Abstraction | IV |
| 12 | Signal Dispatch -> Handler Map | First-Class Functions, Open Dispatch, Strategy Pattern | V |
| 13 | Anemic Model -> Rich Domain Objects | ADTs, Encapsulation of Invariants, Expression-Oriented | V |
| 14 | Eager Init -> Lazy Evaluation | Thunks, Memoization (evaluate-once semantics) | VI |
| 15 | Mixed I/O + Logic -> Effect Separation | IO Monad, Algebraic Effects, Functional Core / Imperative Shell | VI |
| 16 | Monolithic Functions -> Function Composition | Function Composition, Point-Free Style, Pipes and Filters | VI |
| 17 | No Rollback -> Saga Pattern | Eventual Consistency, Compensating Transactions | VI |
| 18 | any Types -> Generics + Trait Bounds | Type Classes, Parametric Polymorphism, Ad-Hoc Polymorphism | IV |
| 19 | God Interface -> Capability Traits | Interface Segregation, Type Classes, Dependency Inversion | VIII |
| 20 | Monolithic Startup -> Plugin Architecture | Open/Closed Principle, Feature-Gated Modules | VII |
| 21 | OS Process Forking -> Actor Model | Actor Model, CSP (message-passing), Isolated Mutable State | VII |
| 22 | Leader Bottleneck -> Version Vectors | Optimistic Concurrency, STM (contrast), Immutable Versioning | VII |
| 23 | Shared Code Bloat -> Feature-Gated Modules | Conditional Compilation, Module System Boundaries | VII |
| 24 | Unbounded Push -> Bounded Channels | CSP Channels, Reactive Streams, Backpressure | VII |
| 25 | Polling -> Lazy Pull Streams | Lazy Evaluation, Corecursion, Demand-Driven Streams | VII |
| 26 | Deep Inheritance -> Trait Composition | Composition over Inheritance, Type Classes, Flat Dispatch | VIII |
| 27 | Unbounded Recursion -> Iterative Fold | Trampolining, Tail Recursion, Accumulator-Passing Style | VIII |
| 28 | Ad-Hoc Recursion -> Catamorphism | Recursion Schemes (Catamorphism), Structural Recursion | VIII |
| 29 | Hardcoded Parsers -> Parser Combinators | Parser Combinators, Applicative Functor, Monad | VIII |
| 30 | Stringly-Typed Access -> Typed Lenses | Lenses / Optics, Profunctors, Focused Immutable Update | VIII |
| 31 | Implicit Mutation -> Reducer Pattern | Fold, State Monad, Single-Writer Semantics | VIII |
| 32 | Monkey-Patching -> Extension via Traits | Type Classes, Retroactive Extension, Coherence | VIII |
| 33 | Implicit Ordering -> Typestate Lifecycle | Linear / Affine Types, Typestate, Ownership as Protocol | VIII |
| 34 | Mutable Window -> Comonad-Style | Comonad (extract / extend), Context-Aware Computation | VIII |
| 35 | Round-Robin Workers -> Work-Stealing | Parallel Collections, Work-Stealing, parMap | VIII |

April 28, 2026

Building Mini OpenClaw: Secure AI Agents with Actors, WASM, and Supervision

Filed under: Agentic AI,Computing — admin @ 7:17 pm

Introduction

Most agent frameworks start simple: one process, one conversation loop, one tool registry, one memory store, and one pile of credentials. That simplicity is useful for demos, but dangerous for enterprise systems. If a prompt injection reaches a tool with broad permissions, the whole runtime becomes part of the blast radius (see https://arxiv.org/abs/2403.02691). If one tool call hangs or crashes, it can stall the agent loop. If memory and sessions are shared by convention instead of isolated by construction, tenant boundaries depend on every developer remembering every guardrail every time. Enterprise teams need a different foundation. They need agents that isolate state, limit blast radius, enforce tenant boundaries, and recover from failures without operator intervention. They need the same properties that telecom systems have delivered for four decades: per-process isolation, supervision trees, guardian processes, and location-transparent messaging.

This post shows how I built Mini OpenClaw (MiniClaw for short) as a proof-of-concept implementation that runs entirely on PlexSpaces, an actor-based distributed runtime inspired by Erlang/OTP. OpenClaw-style systems are useful because they give developers a programmable agent runtime: tools, memory, planning, execution, and orchestration. MiniClaw keeps that spirit, but changes the failure and security model. Instead of one runtime owning everything, each responsibility becomes an actor with its own state, permissions, lifecycle, and supervision boundary. MiniClaw deploys ten actors inside a WebAssembly + Firecracker sandbox to deliver a secure, fault-tolerant agent system. Every actor owns its state exclusively. Every message travels through explicit channels, and every failure triggers a supervised restart instead of a full-system crash.

OpenClaw’s 2026.4.29 release triggered plugin dependency repair loops at startup and on cold paths because its monolithic core owns too many responsibilities. MiniClaw starts from the opposite position: every responsibility is an actor from the beginning, with its own state and its own explicit message contract.


Part 1: Agents and Actors Isomorphism

1.1 The Same Computational Model

An LLM agent has four things: state (conversation history, tool results), a processing loop (receive message, reason, act), communication (call tools, delegate to other agents), and failure modes (timeouts, hallucinations, rate limits). An actor has exactly the same structure. This is not a coincidence. Both actors and agents derive from the same computational model: isolated units of stateful computation that communicate by passing messages.

# From examples/python/apps/miniclaw/agent.py
# An agent IS an actor: same structure, same guarantees
# For readability, this POC keeps message history directly on the `AgentActor`. 
# In a production deployment, I would usually run one actor instance per session or 
# store history by `session_id` to avoid cross-session context mixing.
@actor
class AgentActor:
    """Core agent: receive user message, call LLM, execute tools, loop until end_turn."""

    system_prompt: str = state(default="You are a helpful AI assistant with access to tools.")
    messages: list  = state(default_factory=list)   # Conversation state
    max_history: int = state(default=50)            # Context window bound
    total_chats: int = state(default=0)             # Usage counter
    agent_name: str  = state(default="general-assistant")

    @init_handler
    def on_init(self, config: dict) -> None:
        args = config.get("args", {})
        self.agent_name = args.get("agent_name", self.agent_name)
        self.system_prompt = args.get("system_prompt", self.system_prompt)
        host.process_groups.join("svc:agent")        # Announces itself for discovery
        write_actor_info(self.actor_id, self.agent_name,
                         "Core agent loop with tool calling and session memory",
                         ["chat", "tool_use", "memory"])

    @handler("chat")
    def chat(self, message: str = "", session_id: str = "") -> dict:
        # Agent processing loop: receive message -> reason -> act
        ...

The mapping is direct. Every agent concept has an actor primitive:

| Agent Concept | Actor Primitive | MiniClaw Implementation |
|---|---|---|
| Conversation history | Actor-private state | messages: list (serialized, isolated) |
| Tool calling | Inter-actor messaging | ask(tool_reg_id, "execute_tool", ...) |
| Agent delegation | Location-transparent Ask | ask(agent_id, "chat", ...) via process groups |
| Crash recovery | Supervisor restart + durability facet | State checkpointed to SQLite, restored on restart |
| Rate limiting | Per-actor circuit breaker state | circuit_open, consecutive_failures in actor state |
| Memory | Scoped KV + TupleSpace | Global/agent/session scopes via MemoryActor |
| Audit trail | Fire-and-forget GenEvent | host.send(audit_id, "log_event", ...) (non-blocking) |

1.2 Four Behaviors Map to Four Agent Archetypes

PlexSpaces provides four actor behaviors. Each maps to a distinct agent archetype:

| Behavior | Agent Archetype | MiniClaw Actor | Decorator |
|---|---|---|---|
| GenServer | Tool executor, stateful helper | AgentActor, LLMRouterActor, ToolRegistryActor, MemoryActor, SessionManagerActor, TaskQueueActor, HealthMonitorActor | @actor |
| GenEvent | Audit logger, event publisher | AuditEventActor | @event_actor |
| GenStateMachine | State-machine agent, quality gate | AgentStateFSM | @fsm_actor(states=[...], initial="idle") |
| Workflow | Orchestrator, pipeline coordinator | OrchestratorActor | @workflow_actor |

Part 2: PlexSpaces Primitives

Before walking through each actor, it helps to see the five low-level primitives that every actor uses. These are the only operations available inside the WASM sandbox, which has no filesystem and no global state.

2.1 Process Groups and Object Registry for Location-Transparent Discovery

Every actor is registered in an actor-registry and can optionally join a named process group on @init_handler. Callers look up the first member with pg_first(), a one-liner that hides whether the target is local or on a remote node:

# From examples/python/apps/miniclaw/helpers.py
def pg_first(group: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (actor_id, None) for the first member of a process group, or (None, error)."""
    try:
        members = host.process_groups.members(group)
        if members:
            return members[0], None
        return None, f"no members in {group}"
    except Exception as e:
        return None, str(e)

Every actor announces itself on startup:

@init_handler
def on_init(self, config: dict) -> None:
    host.process_groups.join("svc:agent")
    write_actor_info(self.actor_id, self.agent_name,
                     "Core agent loop with tool calling and session memory",
                     self.capabilities)

The orchestrator discovers agents via pg_first("svc:agent"); it does not know the agent’s address, node, or port. The framework routes the message transparently.
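To make the lookup contract concrete, here is a minimal sketch of group-based discovery against an in-memory registry. FakeProcessGroups is a hypothetical stand-in, not the PlexSpaces host API: it takes an explicit actor_id on join and leave, and the actor IDs are made-up examples. Only the (actor_id, error) return shape is taken from the helper above.

```python
from typing import Dict, List, Optional, Tuple


class FakeProcessGroups:
    """Hypothetical in-memory stand-in for host.process_groups."""

    def __init__(self) -> None:
        self._groups: Dict[str, List[str]] = {}

    def join(self, group: str, actor_id: str) -> None:
        members = self._groups.setdefault(group, [])
        if actor_id not in members:
            members.append(actor_id)

    def leave(self, group: str, actor_id: str) -> None:
        members = self._groups.get(group, [])
        if actor_id in members:
            members.remove(actor_id)

    def members(self, group: str) -> List[str]:
        return list(self._groups.get(group, []))


def pg_first(groups: FakeProcessGroups, group: str) -> Tuple[Optional[str], Optional[str]]:
    """Same (actor_id, error) contract as the helper above."""
    members = groups.members(group)
    if members:
        return members[0], None
    return None, f"no members in {group}"


groups = FakeProcessGroups()
groups.join("svc:agent", "agent-1@node-a")
groups.join("svc:agent", "agent-2@node-b")

found = pg_first(groups, "svc:agent")
missing = pg_first(groups, "svc:audit")

# After the first member leaves (for example, after a crash), callers
# resolve the next member without changing any calling code.
groups.leave("svc:agent", "agent-1@node-a")
after_leave = pg_first(groups, "svc:agent")
```

The caller only ever names the group, so which node hosts the member is irrelevant to it.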

2.2 Fire-and-Forget Audit with host.send, Never host.ask

The audit trail uses host.send() (fire-and-forget) rather than host.ask() (request-reply). This is a deliberate design choice: audit events must never add latency to the agent’s critical path.

# From examples/python/apps/miniclaw/helpers.py
def fire_audit(event_type: str, detail: str) -> None:
    """Fire-and-forget audit event. Failures are logged, never raised."""
    audit_id, err = pg_first("svc:audit")
    if err or not audit_id:
        host.debug(f"fire_audit: {err}")
        return
    try:
        host.send(audit_id, "log_event", {
            "op": "log_event",
            "event_type": event_type,
            "detail": detail,
            "timestamp": host.now_ms(),
        })
    except Exception as e:
        host.warn(f"fire_audit: send failed: {e}")

Every actor calls fire_audit() after each meaningful operation. The audit actor receives the event asynchronously. If the audit actor is slow or temporarily down, callers are unaffected because they never wait for a response.
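The effect on the caller can be sketched with an in-memory mailbox. FakeHost and execute_tool below are hypothetical illustrations, not PlexSpaces APIs; the sketch assumes send() only enqueues and raises when the target is unavailable.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class FakeHost:
    """Hypothetical stand-in for the host: send() only enqueues."""

    def __init__(self) -> None:
        self.mailboxes: Dict[str, Deque[dict]] = {}
        self.warnings: List[str] = []

    def send(self, actor_id: str, msg_type: str, payload: dict) -> None:
        if actor_id not in self.mailboxes:
            raise RuntimeError(f"actor {actor_id} unavailable")
        self.mailboxes[actor_id].append({"type": msg_type, **payload})

    def warn(self, text: str) -> None:
        self.warnings.append(text)


def fire_audit(host: FakeHost, audit_id: Optional[str], event_type: str, detail: str) -> None:
    """Never raises and never waits for a reply."""
    if not audit_id:
        return
    try:
        host.send(audit_id, "log_event", {"event_type": event_type, "detail": detail})
    except Exception as exc:
        host.warn(f"fire_audit: send failed: {exc}")


def execute_tool(host: FakeHost, audit_id: Optional[str], name: str) -> dict:
    result = {"status": "ok", "tool": name}
    fire_audit(host, audit_id, "tool_executed", name)
    return result  # returned regardless of what happened to the audit event


host = FakeHost()
host.mailboxes["audit-1"] = deque()

ok_result = execute_tool(host, "audit-1", "calculator")
queued = len(host.mailboxes["audit-1"])

# The audit actor goes away: the caller's result is unchanged and the
# failure is only recorded as a warning.
del host.mailboxes["audit-1"]
down_result = execute_tool(host, "audit-1", "weather")
```

The trade-off is visible in the second call: the event is lost, which is acceptable only if audit completeness matters less than caller latency.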

2.3 TupleSpace: Queryable Shared Coordination State

TupleSpace (host.ts) is the coordination layer. Unlike KV (point lookup by key), TupleSpace supports pattern queries: you read all tuples matching a template in which None acts as a wildcard:

# Write a memory tuple
host.ts.write(["memory", "global", "user_name", "Alice"])

# Read all global memories — None matches any value in that position
tuples = host.ts.read_all(["memory", "global", None, None])

# Read all audit events of a specific type
events = host.ts.read_all(["audit", "tool_executed", None, None])

# Orchestrator checkpoints sub-task results for crash recovery
host.ts.write(["orch_result", task_id, i, str(result)])
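The wildcard matching itself is simple enough to sketch. FakeTupleSpace is a hypothetical in-memory stand-in for host.ts, and the matching rule (equal length, field-by-field equality, None matches anything) is an assumption about how the real implementation behaves.

```python
from typing import Any, List, Optional, Sequence


def matches(template: Sequence[Optional[Any]], candidate: Sequence[Any]) -> bool:
    """True when lengths agree and every non-None field is equal."""
    if len(template) != len(candidate):
        return False
    return all(t is None or t == c for t, c in zip(template, candidate))


class FakeTupleSpace:
    """Hypothetical in-memory stand-in for host.ts."""

    def __init__(self) -> None:
        self._tuples: List[list] = []

    def write(self, tup: list) -> None:
        self._tuples.append(list(tup))

    def read_all(self, template: list) -> List[list]:
        return [t for t in self._tuples if matches(template, t)]


ts = FakeTupleSpace()
ts.write(["memory", "global", "user_name", "Alice"])
ts.write(["memory", "global", "timezone", "UTC"])
ts.write(["memory", "session", "topic", "billing"])
ts.write(["audit", "tool_executed", "calculator", "ok"])

global_memories = ts.read_all(["memory", "global", None, None])
all_memories = ts.read_all(["memory", None, None, None])
wrong_arity = ts.read_all(["memory", "global", None])
```

Putting the most selective fields first (kind, then scope) keeps templates readable, since wildcards then cluster at the end.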

The write_actor_info helper uses TupleSpace to publish actor capabilities for external discovery without blocking callers:

# From examples/python/apps/miniclaw/helpers.py
def write_actor_info(actor_id: str, name: str, description: str, capabilities: list) -> None:
    """Write actor capability tuples to TupleSpace for discovery."""
    try:
        host.ts.write(["agent_card", actor_id, name, description])
        for cap in capabilities:
            host.ts.write(["agent_cap", cap, actor_id])
    except Exception as e:
        host.warn(f"write_actor_info: {e}")

2.4 send_after for Scheduling Timers

The health monitor uses host.send_after() to schedule a self-message after every poll interval. There is no cron job and no external scheduler; the actor manages its own polling timeline:

@init_handler
def on_init(self, config: dict) -> None:
    # Schedule first poll; each tick reschedules the next
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

@handler("poll_tick", "cast")
def poll_tick(self) -> None:
    # ... do poll work ...
    # Re-arm: each tick schedules the next — no external scheduler needed
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})
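The self-rescheduling pattern can be exercised without a runtime by using virtual time. FakeScheduler and Poller are hypothetical: the scheduler takes a callback rather than a message type, and time advances only when run_until is called.

```python
import heapq
from typing import Callable, List, Tuple


class FakeScheduler:
    """Hypothetical virtual-time scheduler standing in for host.send_after."""

    def __init__(self) -> None:
        self.now_ms = 0
        self._seq = 0  # tie-breaker so the heap never compares callables
        self._queue: List[Tuple[int, int, Callable[[], None]]] = []

    def send_after(self, delay_ms: int, callback: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_until(self, deadline_ms: int) -> None:
        while self._queue and self._queue[0][0] <= deadline_ms:
            due_ms, _, callback = heapq.heappop(self._queue)
            self.now_ms = due_ms
            callback()
        self.now_ms = deadline_ms


class Poller:
    def __init__(self, scheduler: FakeScheduler, interval_ms: int) -> None:
        self.scheduler = scheduler
        self.interval_ms = interval_ms
        self.tick_times: List[int] = []
        scheduler.send_after(interval_ms, self.poll_tick)  # first tick

    def poll_tick(self) -> None:
        self.tick_times.append(self.scheduler.now_ms)  # "do poll work"
        self.scheduler.send_after(self.interval_ms, self.poll_tick)  # re-arm


scheduler = FakeScheduler()
poller = Poller(scheduler, interval_ms=5000)
scheduler.run_until(16000)
```

One property of this design shows up in the sketch: if poll_tick raised before the re-arm line, the chain would stop, so the real actor presumably relies on supervision (or a try/finally around the work) to keep ticking.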

2.5 host.channel for Channel-Backed Durable Queues

The Channel primitive provides at-least-once message delivery with explicit ack/nack:

# Producer: send to channel
msg_id = host.channel.send("", _TASK_CHANNEL, task_type, task)

# Consumer: receive, process, then ack or nack
msg, ok, _ = host.channel.receive("", _TASK_CHANNEL, timeout_ms)
if ok:
    try:
        handle_task(msg)  # hypothetical application handler, not part of the host API
        host.channel.ack("", _TASK_CHANNEL, msg["msg_id"])   # commit
    except Exception:
        host.channel.nack("", _TASK_CHANNEL, msg["msg_id"], True)  # requeue
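At-least-once delivery comes down to holding a message "in flight" until the consumer commits or rejects it. The sketch below uses a hypothetical in-memory FakeChannel with a simplified signature (no channel name or timeout); it is meant to show the semantics, not the PlexSpaces implementation.

```python
import itertools
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class FakeChannel:
    """Hypothetical in-memory channel with at-least-once semantics."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._ready: Deque[dict] = deque()
        self._in_flight: Dict[str, dict] = {}

    def send(self, payload: dict) -> str:
        msg_id = f"m{next(self._ids)}"
        self._ready.append({"msg_id": msg_id, "payload": payload})
        return msg_id

    def receive(self) -> Tuple[Optional[dict], bool]:
        if not self._ready:
            return None, False
        msg = self._ready.popleft()
        self._in_flight[msg["msg_id"]] = msg  # held until ack or nack
        return msg, True

    def ack(self, msg_id: str) -> None:
        self._in_flight.pop(msg_id, None)  # delivery committed

    def nack(self, msg_id: str, requeue: bool) -> None:
        msg = self._in_flight.pop(msg_id, None)
        if msg is not None and requeue:
            self._ready.append(msg)  # eligible for redelivery

    def pending(self) -> int:
        return len(self._ready) + len(self._in_flight)


channel = FakeChannel()
first_id = channel.send({"task": "summarize"})

msg, ok = channel.receive()
channel.nack(msg["msg_id"], True)        # processing failed: requeue

redelivered, ok_again = channel.receive()
channel.ack(redelivered["msg_id"])       # processing succeeded: commit
drained = channel.receive()
```

Because the same message can be delivered more than once, consumers of a channel like this need idempotent handlers.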

2.6 The Let-It-Crash Philosophy

Monolithic agent frameworks force developers to write defensive error handling around every tool call, every LLM request, and every memory access. MiniClaw takes the Erlang philosophy: let actors crash, and let guardians restart them in a clean state. A guardian supervisor watches its children. When one crashes, it applies a restart strategy. The other children continue running unaffected, with no cascading failures and no global error handlers.

# From examples/python/apps/miniclaw/app-config.toml
[supervisor]
strategy = "one_for_one"          # Restart ONLY the crashed actor
max_restarts = 10                 # Allow up to 10 restarts
max_restart_window_seconds = 60   # Within a 60-second window
# If 10 crashes in 60s -> escalate to parent supervisor

PlexSpaces provides three restart strategies, each suited to different failure patterns:

| Strategy | Behavior | Agent Use Case |
|---|---|---|
| one_for_one | Restart only the crashed actor | Independent tools: calculator crash does not affect weather |
| rest_for_one | Restart crashed actor + all actors started after it | Pipeline stages: if retriever crashes, restart generator and validator too |
| one_for_all | Restart all children when any crashes | Tightly coupled team: research + analysis + writing agents share context |
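The three strategies differ only in which children are selected for restart. The sketch below expresses that selection as a pure function, plus a sliding-window check for escalation. Both function names are hypothetical, and the "more than max_restarts inside the window" comparison follows the Erlang/OTP convention; PlexSpaces may count differently.

```python
from typing import List


def actors_to_restart(strategy: str, children: List[str], crashed: str) -> List[str]:
    """Children (in start order) that a supervisor restarts after a crash."""
    index = children.index(crashed)
    if strategy == "one_for_one":
        return [crashed]
    if strategy == "rest_for_one":
        return children[index:]
    if strategy == "one_for_all":
        return list(children)
    raise ValueError(f"unknown strategy: {strategy}")


def should_escalate(restart_times_s: List[float], now_s: float,
                    max_restarts: int = 10, window_s: float = 60.0) -> bool:
    """True when restarts inside the sliding window exceed max_restarts."""
    recent = [t for t in restart_times_s if now_s - t <= window_s]
    return len(recent) > max_restarts


pipeline = ["retriever", "generator", "validator"]
one = actors_to_restart("one_for_one", pipeline, "generator")
rest = actors_to_restart("rest_for_one", pipeline, "generator")
everyone = actors_to_restart("one_for_all", pipeline, "generator")

burst = should_escalate([float(i) for i in range(11)], now_s=11.0)
old_crashes = should_escalate([0.0] * 11, now_s=100.0)
```

Start order matters for rest_for_one: with the generator crashing, the retriever that started before it is left alone.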

2.7 Monitors and Links

PlexSpaces provides two mechanisms for actors to watch each other (similar to Erlang):

  • Monitors (host.monitor()) provide one-way observation. The monitoring actor receives a __DOWN__ message when the monitored actor stops.
  • Links (host.link()) provide bidirectional fate-sharing. If either linked actor crashes abnormally, the other receives an __EXIT__ message.

# Monitor: one-way watch. ValidatorAgent watches workers.
monitor_ref = host.monitor(worker_id)

@handler("__DOWN__", "cast")
def on_down(self, monitor_ref: str = "", down_from: str = "", down_reason: str = "") -> None:
    """Monitored worker stopped. ValidatorAgent stays alive and compensates."""
    self.failed_workers.append(down_from)
    # Spawn replacement, redistribute work, alert operator

# Link: bidirectional fate-sharing. Coordinating agents share fate.
host.link(peer_id)

@handler("__EXIT__", "cast")
def on_exit(self, exit_from: str = "", exit_reason: str = "") -> None:
    """Linked peer died abnormally. Clean up shared resources."""
    self.linked_peers.remove(exit_from)

In MiniClaw, the guardian supervisor monitors all ten actors. If the LLMRouterActor crashes, the supervisor restarts it with a clean state. The AgentActor’s in-flight request receives a timeout error, while the MemoryActor, the AuditEventActor, and every other actor continue running without interruption.

The supervisor IS the guardian pattern from Erlang. Every MiniClaw actor runs under guardian supervision for crash recovery.


Part 3: WASM + Firecracker Sandbox

3.1 Defense in Depth

MiniClaw actors run inside three concentric isolation layers:

  1. Actor isolation: Each actor owns its state exclusively. No shared memory, no global variables, no cross-actor data access. Communication happens only through host.ask() and host.send().
  2. WASM + Firecracker sandbox: Each actor compiles to a WebAssembly module that runs inside a memory sandbox enforced by the WASM runtime. The WASM linear memory is isolated per actor instance. In production deployments, each WASM runtime itself runs inside a Firecracker microVM, a lightweight KVM-based virtual machine monitor that boots in ~125ms and provides hardware-level memory and I/O isolation between tenants.
  3. Tenant isolation: Every PlexSpaces operation requires a RequestContext with explicit tenant and namespace identifiers via JWT authentication. The framework rejects cross-tenant access before the request reaches the actor.

3.2 What the Two-Layer Sandbox Prevents

| Attack Vector | Monolithic Framework | WASM Sandbox | WASM + Firecracker |
|---|---|---|---|
| open("/etc/passwd") | Succeeds: full FS access | Blocked: no FS import in WIT | Blocked: separate VM filesystem |
| os.environ["API_KEY"] | Succeeds: env vars shared | Blocked: no env access in WASM | Blocked: separate VM env |
| Read another actor’s memory | Succeeds: shared process | Blocked: WASM linear memory is per-instance | Blocked: separate VM address space |
| Escape WASM sandbox via JIT bug | Possible in theory | Partially mitigated | Blocked: hypervisor hardware boundary |
| Cross-tenant KV access | Possible if scoping misconfigured | Blocked: RequestContext enforced | Blocked: separate VM per tenant |

The WIT (WebAssembly Interface Types) definition explicitly declares what the actor can access:

// From wit/plexspaces-actor/host.wit
// The actor can ONLY call these imports — nothing else
interface host {
    send: func(to: string, msg-type: string, payload: payload) -> result<_, actor-error>;
    ask: func(to: string, msg-type: string, payload: payload, timeout-ms: u64) -> result<payload, actor-error>;
    kv-get: func(key: string) -> result<payload, actor-error>;
    kv-put: func(key: string, value: payload) -> result<_, actor-error>;
    http-fetch: func(link-name: string, method: string, path: string, request: payload) -> result<payload, actor-error>;
    // No filesystem. No env vars. No raw network. No process exec.
}

3.3 Tenant Isolation by Construction

Every PlexSpaces operation propagates tenant context through the call chain. KV keys, TupleSpace tuples, the object registry, and process groups are all scoped by tenant and namespace. A session created by tenant acme cannot be retrieved by tenant globex: the framework rejects the request before it reaches the actor.

# Every API request carries tenant context — enforced at framework level
# KV keys scoped:     tenant-acme:prod:session:sess-001
# TupleSpace scoped:  tenant-acme:prod:["memory", "global", "user_name", "Alice"]
# Process groups:     tenant-acme:prod:svc:llm_router

There is no internal() bypass for application code. Tenant boundaries are enforced by construction, not by convention.
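A minimal sketch of what "scoped by construction" can look like: the storage layer derives the key prefix from the request context, so application code has no way to name another tenant's key. The RequestContext fields and the TenantScopedKV class are hypothetical; only the tenant-acme:prod:... key layout comes from the comments above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical shape: the post names the type but not its fields."""
    tenant: str
    namespace: str


class TenantScopedKV:
    """Every key is prefixed from the context; callers cannot pass a prefix."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    @staticmethod
    def _scoped(ctx: RequestContext, key: str) -> str:
        return f"tenant-{ctx.tenant}:{ctx.namespace}:{key}"

    def put(self, ctx: RequestContext, key: str, value: str) -> None:
        self._data[self._scoped(ctx, key)] = value

    def get(self, ctx: RequestContext, key: str) -> Optional[str]:
        return self._data.get(self._scoped(ctx, key))


kv = TenantScopedKV()
acme = RequestContext(tenant="acme", namespace="prod")
globex = RequestContext(tenant="globex", namespace="prod")

kv.put(acme, "session:sess-001", "alice")
own_read = kv.get(acme, "session:sess-001")
cross_read = kv.get(globex, "session:sess-001")  # same key, different tenant
```

A production version would also need to validate that tenant and namespace cannot contain the separator character; the sketch leaves that out.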


Part 4: MiniClaw Architecture

MiniClaw decomposes the agent framework into ten actors. Every actor runs as a WebAssembly module inside the PlexSpaces runtime, discovers collaborators through object-registry or process groups, and persists state through the durability facet.

| Actor | Behavior | Responsibility | Security Property |
|---|---|---|---|
| LLMRouterActor | GenServer | Route LLM calls, circuit-break on failure | Real API keys never leave the actor (phantom token proxy) |
| ToolRegistryActor | GenServer | Register tools with schemas, execute in isolation | Schema validation prevents malformed tool inputs |
| AgentActor | GenServer | Core agent loop: message -> LLM -> tool -> repeat | Bounded iteration (max 5) prevents infinite loops |
| SessionManagerActor | GenServer | Map users to sessions, enforce tenant scope | Tenant-scoped KV keys prevent cross-tenant access |
| OrchestratorActor | Workflow | Decompose tasks, delegate, checkpoint progress | Durable checkpoints survive crashes |
| MemoryActor | GenServer | Scoped memory (global/agent/session) | KV + TupleSpace dual-write with tenant scoping |
| AuditEventActor | GenEvent | Immutable log of every actor operation | Fire-and-forget; senders never block on audit |
| AgentStateFSM | GenStateMachine | Lifecycle guard: idle -> processing -> tool_executing -> responding | Validates transitions; rejects illegal states |
| TaskQueueActor | GenServer | Durable task queue backed by Channel; enqueue/dequeue/ack/nack | At-least-once delivery; no external broker |
| HealthMonitorActor | GenServer | Periodic PG membership polling via send_after; writes health snapshots | Simple polling eliminates subscription races |
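The lifecycle guard played by AgentStateFSM can be sketched as a transition table. The allowed transitions below are inferred from the lifecycle listed in the table and from the transitions the agent loop sends later in this post; the real FSM's table is not shown, so treat TRANSITIONS and AgentLifecycle as hypothetical.

```python
from typing import Dict, Set

# Allowed transitions (an assumption, see the paragraph above).
TRANSITIONS: Dict[str, Set[str]] = {
    "idle": {"processing"},
    "processing": {"tool_executing", "responding"},
    "tool_executing": {"processing", "responding"},
    "responding": {"idle"},
}


class AgentLifecycle:
    def __init__(self) -> None:
        self.state = "idle"
        self.rejected = 0

    def transition(self, to: str) -> bool:
        if to in TRANSITIONS.get(self.state, set()):
            self.state = to
            return True
        self.rejected += 1  # illegal transition: state is left unchanged
        return False


fsm = AgentLifecycle()
path = ["processing", "tool_executing", "processing", "responding", "idle"]
results = [fsm.transition(step) for step in path]

# Jumping straight from idle to tool_executing is rejected.
illegal = fsm.transition("tool_executing")
```

Keeping the table as data rather than nested conditionals makes the legal lifecycle reviewable at a glance.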

Part 5: Design Patterns Used in MiniClaw

The NanoClaw project introduced an important design philosophy: instead of reaching for external infrastructure when you hit a constraint, first ask whether the primitives you already have can solve the problem.

Pattern 1: Phantom Token / Credential Proxy

The constraint: Agents need to call an LLM provider, but callers should never see real API keys. Storing keys in the agent payload means any log line or bug report leaks credentials.

The actor solution: LLMRouterActor owns the credential store. It exposes a register_credential op that stores phantom_token -> real_api_key in its private KV namespace. Callers pass only the opaque token; the actor resolves the real key internally and discards it before building any response.

# Phantom token: real key stored in actor-private KV — never echoed to callers
@handler("register_credential")
def register_credential(self, phantom_token: str = "", api_key: str = "") -> dict:
    if not phantom_token or not api_key:
        return {"error": "phantom_token and api_key required"}
    host.kv_put(f"cred:{phantom_token}", api_key)  # Only this actor reads it
    return {"status": "ok", "phantom_token": phantom_token}  # api_key never returned

@handler("chat_completion")
def chat_completion(self, messages: list = None, tools: list = None,
                    phantom_token: str = "") -> dict:
    resolved_key = host.kv_get(f"cred:{phantom_token}") if phantom_token else ""
    # resolved_key used by real HTTP client; discarded here
    # ... call LLM, build response ...
    return {"status": "ok", "response": response}  # resolved_key never in response

Actor-private state means the real key is inaccessible from any other actor, any other tenant, and any logged payload. Even if a prompt injection tricks the agent into returning its full state, the real credential is not in the agent. It lives in the router actor, which never echoes it back.
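The indirection can be sketched in a few lines. Unlike the handler above, where the caller supplies the phantom token, this hypothetical CredentialVault mints a random token itself with the standard-library secrets module; the key value is a placeholder, not a real credential.

```python
import secrets
from typing import Dict


class CredentialVault:
    """Hypothetical stand-in for the router's actor-private KV."""

    def __init__(self) -> None:
        self._keys: Dict[str, str] = {}

    def register(self, api_key: str) -> str:
        token = "ph_" + secrets.token_urlsafe(16)  # random, unrelated to the key
        self._keys[token] = api_key
        return token

    def resolve(self, token: str) -> str:
        return self._keys.get(token, "")


vault = CredentialVault()
real_key = "sk-example-not-a-real-key"
token = vault.register(real_key)

# Everything the agent holds or logs contains only the opaque token.
agent_state = {"agent_name": "general-assistant", "phantom_token": token}
leaked = real_key in str(agent_state)
resolved = vault.resolve(token)
unknown = vault.resolve("ph_unknown")
```

Dumping agent_state reveals a token that is useless outside the router, which is the property the pattern is after.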

Pattern 2: Task Queue (TaskQueueActor)

The constraint: The orchestrator needs to enqueue work items for agents to process asynchronously, but the environment has no external message broker, only the Channel primitive.

The actor solution: TaskQueueActor is a thin wrapper around host.channel. The Channel handles durability, at-least-once delivery, and redelivery on nack transparently:

# From examples/python/apps/miniclaw/infra.py
_TASK_CHANNEL = "tasks:pending"

@actor
class TaskQueueActor:
    """Thin actor wrapper around the host Channel primitive."""

    enqueued: int = state(default=0)
    completed: int = state(default=0)
    failed: int = state(default=0)

    @handler("enqueue")
    def enqueue(self, task_type: str = "generic", payload: dict = None) -> dict:
        task = {"task_type": task_type, "payload": payload or {}, "enqueued_at": host.now_ms()}
        msg_id = host.channel.send("", _TASK_CHANNEL, task_type, task)
        self.enqueued += 1
        fire_audit("task_enqueued", f"msg_id={msg_id} type={task_type}")
        return {"status": "ok", "msg_id": msg_id}

    @handler("dequeue")
    def dequeue(self, limit: int = 1, timeout_ms: int = 0) -> dict:
        tasks = []
        for _ in range(int(limit)):
            msg, ok, _ = host.channel.receive("", _TASK_CHANNEL, int(timeout_ms))
            if not ok:
                break
            tasks.append(msg)
        return {"status": "ok", "tasks": tasks, "count": len(tasks)}

    @handler("ack")
    def ack(self, msg_id: str = "") -> dict:
        host.channel.ack("", _TASK_CHANNEL, msg_id)   # commits the delivery
        self.completed += 1
        return {"status": "ok", "msg_id": msg_id}

    @handler("nack")
    def nack(self, msg_id: str = "", requeue: bool = True) -> dict:
        host.channel.nack("", _TASK_CHANNEL, msg_id, requeue)  # requeue for redelivery
        self.failed += 1
        return {"status": "ok", "msg_id": msg_id, "requeue": requeue}

PlexSpaces supports multiple providers for queues/channels, such as Kafka, SQS, and Redis, as well as channels backed by process-group communication. The Channel primitive is built into the PlexSpaces host and is durable and ordered, with explicit ack/nack semantics. If the consumer crashes mid-processing, the unacked message is redelivered on the next dequeue.

Pattern 3: Polling Over Events (HealthMonitorActor)

The constraint: We want to know the health of all service actors, but subscribing to process group membership change events introduces races: a join and a crash can arrive out of order, leaving stale membership in the subscriber’s view.

The actor solution: HealthMonitorActor never subscribes to anything. It polls every service group on a configurable interval using send_after to schedule its own next tick:

# From examples/python/apps/miniclaw/infra.py
_SERVICE_GROUPS = [
    "svc:llm_router", "svc:tool_registry", "svc:agent",
    "svc:session_manager", "svc:memory", "svc:audit",
    "svc:agent_fsm", "svc:task_queue",
]

@actor
class HealthMonitorActor:
    """Polls process group membership on a fixed interval using send_after."""

    poll_count: int = state(default=0)
    last_poll_ms: int = state(default=0)
    group_health: dict = state(default_factory=dict)
    poll_interval_ms: int = state(default=5000)

    @init_handler
    def on_init(self, config: dict) -> None:
        args = config.get("args", {})
        if args.get("poll_interval_ms"):
            iv = int(args["poll_interval_ms"])
            self.poll_interval_ms = min(max(iv, 1000), 300_000)
        host.process_groups.join("svc:health_monitor")
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("poll_tick", "cast")
    def poll_tick(self) -> None:
        health = {}
        for grp in _SERVICE_GROUPS:
            try:
                members = host.process_groups.members(grp)
                health[grp] = len(members)
            except Exception:
                health[grp] = 0
        self.group_health = health
        self.poll_count += 1
        self.last_poll_ms = host.now_ms()

        import json
        host.ts.write(["health_snapshot", self.last_poll_ms, json.dumps(health)])
        # Re-arm: each tick schedules the next — no external scheduler needed
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("get_health")
    def get_health(self) -> dict:
        degraded = [g for g, c in self.group_health.items() if c == 0]
        return {
            "status": "ok",
            "group_health": self.group_health,
            "healthy": len(self.group_health) - len(degraded),
            "degraded": degraded,
        }

Polling is self-correcting: each tick re-reads the actual membership regardless of event order, so the view can be stale by at most one poll interval. get_health returns not just a count but a list of degraded groups, making it immediately actionable.
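The race that polling avoids is easy to reproduce. In this sketch, apply_events models a subscriber folding membership events in arrival order, and poll models a fresh snapshot; both functions and the event shape are hypothetical illustrations.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]  # ("join" | "leave", actor_id)


def apply_events(events: List[Event]) -> Set[str]:
    """Subscriber view: fold membership events in arrival order."""
    view: Set[str] = set()
    for kind, actor_id in events:
        if kind == "join":
            view.add(actor_id)
        else:
            view.discard(actor_id)
    return view


def poll(actual_members: Set[str]) -> Set[str]:
    """Poller view: a fresh snapshot of current membership."""
    return set(actual_members)


# What really happened: agent-1 joined, then crashed.
actual: Set[str] = set()
in_order = [("join", "agent-1"), ("leave", "agent-1")]
reordered = [("leave", "agent-1"), ("join", "agent-1")]

view_in_order = apply_events(in_order)
view_reordered = apply_events(reordered)  # stale: still lists agent-1
view_polled = poll(actual)
```

The subscriber's stale entry persists until another event happens to correct it, while the poller's error is bounded by the poll interval.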

The Constraint-Aware Philosophy

These four patterns share a common thread: each one reaches for the primitives already available in the PlexSpaces sandbox before introducing external dependencies.

| Need | Naive Solution | NanoClaw Solution | Primitive Used |
|---|---|---|---|
| Protect API keys | Environment variables or secrets manager | Phantom token stored in actor-private KV | host.kv_put/kv_get |
| Async task queue | RabbitMQ / SQS | Channel-backed queue with ack/nack | host.channel.send/receive/ack/nack |
| Service health monitoring | Event subscription + fan-out | Periodic send_after poll + TupleSpace snapshot | host.send_after + host.process_groups.members() |
| Capability discovery | Service registry with TTL | Process groups + TupleSpace agent cards | host.process_groups.join/members() + host.ts.write/read_all |

The WASM sandbox is not a limitation to work around. Instead, it is the guide for designing simpler, more auditable systems.


Part 6: The Agent Loop

6.1 The Loop in Code

The AgentActor drives the core agent loop. It receives a user message, calls the LLM, checks for tool requests, executes tools, feeds results back, and repeats with a hard cap of five iterations to prevent runaway loops.

# From examples/python/apps/miniclaw/agent.py
_MAX_ITER = 5
...
    @handler("chat")
    def chat(self, message: str = "", session_id: str = "") -> dict:
        if not message:
            return {"error": "message is required"}

        self.messages.append({"role": "user", "content": message})

        # Discover tools
        tool_reg_id, _ = pg_first("svc:tool_registry")
        tools = []
        if tool_reg_id:
            resp = ask(tool_reg_id, "list_tools", {})
            if resp:
                tools = resp.get("tools", [])

        # Signal FSM: processing
        fsm_id, _ = pg_first("svc:agent_fsm")
        if fsm_id:
            host.send(fsm_id, "transition", {"op": "transition", "to": "processing"})

        final_response = ""
        for i in range(_MAX_ITER):
            llm_id, err = pg_first("svc:llm_router")
            if err or not llm_id:
                final_response = f"[no LLM] Processed: {message}"
                break

            llm_resp = ask(llm_id, "chat_completion", {"messages": [{"role": "system", "content": self.system_prompt}] + self.messages, "tools": tools}, 10000)
            if not llm_resp or "error" in llm_resp:
                final_response = f"LLM unavailable: {llm_resp}"
                break

            response = llm_resp.get("response", {})
            stop_reason = response.get("stop_reason", "end_turn")
            content = response.get("content", "")

            assistant_msg = {"role": "assistant", "content": content, "stop_reason": stop_reason}
            if response.get("tool_calls"):
                assistant_msg["tool_calls"] = response["tool_calls"]
            self.messages.append(assistant_msg)

            if stop_reason == "end_turn":
                final_response = content
                break

            if stop_reason == "tool_use":
                if fsm_id:
                    host.send(fsm_id, "transition", {"op": "transition", "to": "tool_executing"})

                for tc in response.get("tool_calls", []):
                    tc_name = tc.get("name", "")
                    tc_input = tc.get("input", {})
                    tool_output = {}
                    if tool_reg_id:
                        tool_output = ask(tool_reg_id, "execute_tool", {"name": tc_name, "input": tc_input}) or {}

                    self.messages.append({
                        "role": "tool",
                        "tool_call_id": tc.get("id", ""),
                        "content": str(tool_output),
                    })
                    fire_audit("tool_called", f"tool={tc_name} session={session_id}")

                if fsm_id:
                    host.send(fsm_id, "transition", {"op": "transition", "to": "processing"})
                final_response = f"Tool results applied (iteration {i + 1})"
            else:
                final_response = content
                break

        # FSM: responding -> idle
        if fsm_id:
            host.send(fsm_id, "transition", {"op": "transition", "to": "responding"})
            host.send(fsm_id, "transition", {"op": "transition", "to": "idle"})

        # Compact history if needed
        if len(self.messages) > self.max_history:
            keep = self.max_history // 2
            self.messages = self.messages[:1] + self.messages[-keep:]

        # Persist history in KV if session provided
        if session_id:
            import json
            host.kv_put(f"session_history:{session_id}", json.dumps(self.messages))

        self.total_chats += 1
        fire_audit("agent_chat", f"session={session_id}")
        return {
            "status": "ok",
            "response": final_response,
            "session_id": session_id,
            "messages_count": len(self.messages),
        }

The _MAX_ITER = 5 cap prevents runaway loops. In a monolithic framework, this cap requires global state or thread-local storage.
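
The history compaction step at the end of the chat handler is easy to misread, so here is a minimal host-free sketch of it. The function name compact_history is hypothetical, and the guard for keep == 0 is an addition of this sketch rather than something taken from the MiniClaw source:

```python
def compact_history(messages: list, max_history: int) -> list:
    """Keep the first message plus the most recent max_history // 2 entries."""
    if len(messages) <= max_history:
        return messages
    keep = max_history // 2
    # Guard added in this sketch: messages[-0:] is the whole list in Python,
    # so a keep of 0 must be handled explicitly.
    tail = messages[-keep:] if keep > 0 else []
    return messages[:1] + tail


history = [{"role": "user", "content": f"msg {i}"} for i in range(12)]
compacted = compact_history(history, max_history=10)
print(len(compacted))                      # 6
print([m["content"] for m in compacted])   # msg 0, then msg 7 through msg 11
```

Keeping the first message preserves the opening context of the conversation, while the tail preserves recent turns. Everything in between is dropped, so facts that must survive compaction belong in scoped memory rather than in the message list.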


Part 7: Circuit Breakers and Immutable Audit Trails

7.1 LLM Router

The LLMRouterActor simulates an LLM with tool-call routing. In production, replace the simulation with a real API call via host.http_fetch() over a named service link:

# From examples/python/apps/miniclaw/llm_router.py
TOOL_CALL_TRIGGERS = ("weather", "search", "calculate", "lookup", "find")

# `LLMRouterActor` is a simulator in this POC. It demonstrates the routing 
# boundary where production code would call OpenAI, Anthropic, Bedrock, Gemini, or 
# an internal model endpoint through a named service link.
@actor
class LLMRouterActor:
    """Simulated LLM router with tool-calling capability."""

    model: str = state(default="miniclaw-simulated-v1")
    request_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        self.model = config.get("args", {}).get("model", self.model)
        host.process_groups.join("svc:llm_router")

    @handler("chat_completion")
    def chat_completion(self, messages: list = None, tools: list = None) -> dict:
        messages = messages or []
        tools = tools or []
        self.request_count += 1

        user_msg = ""
        for m in reversed(messages):
            if m.get("role") == "user":
                user_msg = str(m.get("content", "")).lower()
                break

        should_use_tool = tools and any(kw in user_msg for kw in TOOL_CALL_TRIGGERS)

        if should_use_tool:
            tool = tools[0] if tools else {}
            tool_name = tool.get("name", "search") if isinstance(tool, dict) else "search"
            response = {
                "stop_reason": "tool_use",
                "content": "",
                "tool_calls": [{"id": f"tc_{self.request_count}", "name": tool_name,
                                 "input": {"query": user_msg}}],
            }
        else:
            response = {
                "stop_reason": "end_turn",
                "content": f"[{self.model}] Processed: {user_msg}",
                "tool_calls": [],
            }
        return {"status": "ok", "response": response, "model": self.model}

To add a circuit breaker for production LLM rate limits, extend the actor state with circuit_open and consecutive_failures. The actor IS the circuit breaker, and the durability facet ensures the circuit state survives restarts:

@actor
class LLMRouterActor:
    model: str = state(default="gpt-4o")
    circuit_open: bool = state(default=False)
    consecutive_failures: int = state(default=0)
    request_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:llm_router")
        # Schedule circuit recovery timer
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})

    @handler("chat_completion")
    def chat_completion(self, messages: list = None, tools: list = None) -> dict:
        if self.circuit_open:
            return {"error": "circuit_open", "circuit_open": True}

        try:
            # Production: real API call via host.http_fetch("llm-api", ...)
            result = self._call_llm(messages, tools)
            self.consecutive_failures = 0
            self.request_count += 1
            return result
        except Exception as e:
            self.consecutive_failures += 1
            if self.consecutive_failures >= 3:
                self.circuit_open = True
            return {"error": str(e), "circuit_open": self.circuit_open}

    @handler("timer_tick", "cast")
    def timer_tick(self) -> None:
        # Gradual recovery: decrement the failure count by 1 on each 30s tick.
        # 3 failures -> about 90s before the circuit closes again, so it does not close prematurely.
        if self.circuit_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.circuit_open = False
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})
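
To see the recovery arithmetic without a running node, the counter logic can be modeled in plain Python. This is a sketch only: the class name CircuitState and its methods are hypothetical, and one call to tick() stands in for one 30-second timer_tick message.

```python
class CircuitState:
    """Host-free model of the failure counter and recovery tick shown above."""

    FAILURE_THRESHOLD = 3  # same threshold as the chat_completion handler

    def __init__(self) -> None:
        self.circuit_open = False
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FAILURE_THRESHOLD:
            self.circuit_open = True

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def tick(self) -> None:
        # Mirrors timer_tick: only an open circuit drains its failure count.
        if self.circuit_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.circuit_open = False


breaker = CircuitState()
for _ in range(3):
    breaker.record_failure()
print(breaker.circuit_open)   # True

ticks = 0
while breaker.circuit_open:
    breaker.tick()
    ticks += 1
print(ticks)                  # 3
```

Three ticks at 30 seconds each give the roughly 90-second cool-down mentioned in the comment. Because requests are rejected while the circuit is open, no new failures accumulate during that window.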

7.2 Immutable Audit Trail

The AuditEventActor captures every agent action as a fire-and-forget event. Senders never block. Events flow into TupleSpace for append-only, queryable storage:

# From examples/python/apps/miniclaw/memory.py

@event_actor
class AuditEventActor:
    """GenEvent actor: fire-and-forget audit events stored in TupleSpace."""

    event_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:audit")

    @handler("log_event", "cast")
    def log_event(self, event_type: str = "", detail: str = "", timestamp: int = 0) -> None:
        ts = timestamp or host.now_ms()
        try:
            host.ts.write(["audit", event_type, ts, detail])
        except Exception as e:
            host.warn(f"AuditEvent: ts.write failed: {e}")
        self.event_count += 1

    @handler("get_stats")
    def get_stats(self) -> dict:
        return {"status": "ok", "event_count": self.event_count}

Notice the "cast" annotation on log_event: it marks the handler as fire-and-forget. The sender (fire_audit() in helpers.py) calls host.send() rather than host.ask(), so it never blocks waiting for a reply.
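
What "queryable" could look like for these audit tuples is sketched below with an in-memory stand-in. The InMemoryTupleSpace class and the matches helper are hypothetical, and the sketch assumes that None in a pattern acts as a wildcard, which is how the read_all call in MemoryActor (Part 9) appears to use it. The real TupleSpace API may differ.

```python
def matches(pattern: list, tup: list) -> bool:
    """True when lengths agree and every non-None pattern field equals the tuple field."""
    return len(pattern) == len(tup) and all(
        p is None or p == t for p, t in zip(pattern, tup)
    )


class InMemoryTupleSpace:
    """Toy append-only store; a stand-in for host.ts, not the real thing."""

    def __init__(self) -> None:
        self._tuples = []

    def write(self, tup: list) -> None:
        self._tuples.append(list(tup))  # no update or delete operation exists

    def read_all(self, pattern: list) -> list:
        return [t for t in self._tuples if matches(pattern, t)]


ts = InMemoryTupleSpace()
ts.write(["audit", "tool_called", 1000, "tool=weather session=s1"])
ts.write(["audit", "agent_chat", 1001, "session=s1"])
ts.write(["audit", "tool_called", 1002, "tool=search session=s2"])

tool_events = ts.read_all(["audit", "tool_called", None, None])
print(len(tool_events))        # 2
print(tool_events[0][3])       # tool=weather session=s1
```

Putting event_type in a fixed position of the tuple is what makes this kind of filter cheap to express: the query is just a tuple with holes in it.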


Part 8: Tools as Actors with MCP-Style Isolation

8.1 Each Tool Gets Supervision, Metrics, and Fault Recovery

In MiniClaw, the ToolRegistryActor manages tool definitions and dispatches execution. Each tool handler runs within the actor’s sandboxed environment:

# From examples/python/apps/miniclaw/tool_registry.py

@actor
class ToolRegistryActor:
    """Registry of callable tools with simulated execution."""

    tools: dict = state(default_factory=dict)   # name -> tool spec
    exec_count: int = state(default=0)
    actor_id: str = state(default="")

    @init_handler
    def on_init(self, config: dict) -> None:
        self.actor_id = config.get("actor_id", "")
        self.tools = {t["name"]: t for t in _BUILTIN_TOOLS}
        host.process_groups.join("svc:tool_registry")
        host.info(f"ToolRegistryActor init actor_id={self.actor_id} tools={list(self.tools)}")

    @handler("list_tools")
    def list_tools(self) -> dict:
        return {"status": "ok", "tools": list(self.tools.values()), "count": len(self.tools)}

    @handler("register_tool")
    def register_tool(self, name: str = "", description: str = "", input_schema: dict = None) -> dict:
        if not name:
            return {"error": "name is required"}
        self.tools[name] = {"name": name, "description": description, "input_schema": input_schema or {}}
        host.info(f"ToolRegistry: registered tool={name}")
        return {"status": "ok", "name": name}

    @handler("execute_tool")
    def execute_tool(self, name: str = "", input: dict = None) -> dict:
        input = input or {}
        if name not in self.tools:
            return {"error": f"unknown tool: {name}"}

        self.exec_count += 1
        host.info(f"ToolRegistry: executing tool={name} exec={self.exec_count}")

        # Simulated responses per tool type
        if name == "web_search":
            return {"result": f"Search results for: {input.get('query', '')}"}
        if name == "calculator":
            expr = input.get("expression", "0")
            try:
                # Demo-only restricted evaluation.
                # Production code should replace this with an AST-based evaluator or a sandboxed tool actor.
                result = eval(expr, {"__builtins__": {}})  # noqa: S307
                return {"result": str(result)}
            except Exception:
                return {"result": f"Could not evaluate: {expr}"}
        if name == "weather":
            location = input.get("location", "unknown")
            return {"result": f"Weather in {location}: 22°C, partly cloudy"}

        return {"result": f"[simulated] {name} output for input {input}"}

    @handler("get_stats")
    def get_stats(self) -> dict:
        return {"status": "ok", "tool_count": len(self.tools), "exec_count": self.exec_count}
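
The calculator comment above recommends an AST-based evaluator. A minimal sketch of that idea follows, using only the standard library ast and operator modules. The function name safe_eval and the choice of supported operators are assumptions of this sketch, not part of the MiniClaw source. Exponentiation is deliberately left out because expressions like 9**9**9 can exhaust CPU and memory.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr: str):
    """Evaluate a plain arithmetic expression and reject everything else."""

    def _eval(node):
        # Only int and float literals are accepted (bool and str are not).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, subscripts, etc. all land here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expr, mode="eval").body)


print(safe_eval("2 + 3 * 4"))       # 14
print(safe_eval("-(10 - 4) / 3"))   # -2.0
```

Unlike eval with an empty __builtins__, this approach never executes code it does not recognize: a call such as __import__('os') parses to a Call node and raises ValueError before anything runs.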

8.2 What Standalone MCP Servers Lack

| Capability | Standalone MCP | Tool-as-Actor (MiniClaw) |
|---|---|---|
| State persistence | In-memory only; lost on restart | Durability facet checkpoints to SQLite |
| Multi-tenant access | No built-in tenant scoping | RequestContext enforces tenant isolation |
| Metrics | Must add manually per tool | Per-actor invocation counts automatic |
| Fault tolerance | Process crash loses all state | Supervisor restarts; state restored from checkpoint |
| Sandbox | Process boundary only | WASM linear memory + optional Firecracker VM |

Part 9: Agent Lifecycle State Machine

9.1 Scoped Memory with KV + TupleSpace Dual-Write

MemoryActor writes every memory entry to both KV (for durable point-lookup) and TupleSpace (for queryable pattern-scan across a scope):

# From examples/python/apps/miniclaw/memory.py

@actor
class MemoryActor:
    """Scoped memory backed by KV (persistent) and TupleSpace (queryable)."""

    memory_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:memory")

    @handler("store_memory")
    def store_memory(self, key: str = "", value: str = "",
                     scope: str = "global", agent_id: str = "", session_id: str = "") -> dict:
        if not key:
            return {"error": "key is required"}
        scoped_key = _scoped_key(scope, agent_id, session_id, key)
        host.kv_put(scoped_key, str(value))                     # KV: durable point-lookup
        host.ts.write(["memory", scope, key, str(value)])       # TupleSpace: queryable scan
        self.memory_count += 1
        fire_audit("memory_stored", f"scope={scope} key={key}")
        return {"status": "ok", "key": key, "scope": scope}

    @handler("recall_memory")
    def recall_memory(self, key: str = "", scope: str = "global",
                      agent_id: str = "", session_id: str = "") -> dict:
        scoped_key = _scoped_key(scope, agent_id, session_id, key)
        value = host.kv_get(scoped_key)
        return {"status": "ok", "key": key, "value": value, "found": bool(value)}

    @handler("list_memories")
    def list_memories(self, scope: str = "global") -> dict:
        try:
            tuples = host.ts.read_all(["memory", scope, None, None])
            memories = [{"key": t[2], "value": t[3]} for t in tuples if len(t) >= 4]
        except Exception:
            memories = []
        return {"status": "ok", "memories": memories, "scope": scope}


def _scoped_key(scope: str, agent_id: str, session_id: str, key: str) -> str:
    if scope == "agent" and agent_id:
        return f"mem:agent:{agent_id}:{key}"
    if scope == "session" and session_id:
        return f"mem:session:{session_id}:{key}"
    return f"mem:global:{key}"

The three scopes are not just naming conventions — they determine which memories survive across session boundaries:

| Scope | Persists across | Example |
|---|---|---|
| global | Everything including sessions, agent restarts | User name, user preferences |
| agent | Restarts of this specific agent | Agent-specific learned facts |
| session | Only within a single session | “We were discussing X” context |
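
To make the key layout concrete, the snippet below repeats the logic of _scoped_key as a standalone function (renamed scoped_key here so it can run outside the actor module) and prints the key produced for each scope. The agent and session identifiers are made-up sample values.

```python
def scoped_key(scope: str, agent_id: str, session_id: str, key: str) -> str:
    """Same branching as _scoped_key above, copied so the example is self-contained."""
    if scope == "agent" and agent_id:
        return f"mem:agent:{agent_id}:{key}"
    if scope == "session" and session_id:
        return f"mem:session:{session_id}:{key}"
    return f"mem:global:{key}"


print(scoped_key("agent", "planner-1", "", "style"))    # mem:agent:planner-1:style
print(scoped_key("session", "", "sess-42", "topic"))    # mem:session:sess-42:topic
print(scoped_key("global", "", "", "user_name"))        # mem:global:user_name
print(scoped_key("agent", "", "", "style"))             # mem:global:style
```

The last call is worth noticing: a request for agent scope without an agent_id falls through to the global namespace. A hardened version might prefer to return an error in that case so that agent-scoped data cannot land in global memory by accident.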

9.2 Session Management in KV with a Channel+User Index

SessionManagerActor stores session metadata in KV and maintains a secondary index that maps channel+user_id to session_id:

# From examples/python/apps/miniclaw/agent.py

@actor
class SessionManagerActor:
    """Manages agent session lifecycle backed by KV storage."""

    active_sessions: int = state(default=0)
    total_created: int = state(default=0)
    session_ids: list = state(default_factory=list)

    @handler("create_session")
    def create_session(self, channel: str = "web", user_id: str = "anonymous",
                       agent_id: str = "agent") -> dict:
        import json
        session_id = f"sess-{channel}-{user_id}-{host.now_ms()}"
        meta = {"session_id": session_id, "channel": channel, "user_id": user_id,
                "agent_id": agent_id, "created_at": host.now_ms(), "status": "active"}
        host.kv_put(f"session:{session_id}", json.dumps(meta))
        host.kv_put(f"session_map:{channel}:{user_id}", session_id)  # secondary index
        self.session_ids.append(session_id)
        self.active_sessions += 1
        fire_audit("session_created", f"session_id={session_id} channel={channel} user_id={user_id}")
        return {"status": "ok", "session_id": session_id}

    @handler("get_session")
    def get_session(self, session_id: str = "", channel: str = "", user_id: str = "") -> dict:
        import json
        if not session_id and channel and user_id:
            # Natural key lookup via secondary index
            session_id = host.kv_get(f"session_map:{channel}:{user_id}")
        if not session_id:
            return {"error": "session not found"}
        raw = host.kv_get(f"session:{session_id}")
        if not raw:
            return {"error": "session not found", "session_id": session_id}
        meta = json.loads(raw)
        meta["status"] = "ok"
        return meta

The secondary index means a chatbot can route an incoming webhook (which carries channel and user_id but not a session token) directly to the right session without a scan.
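
The lookup path can be exercised without a node by swapping the KV store for a dictionary. This is a sketch under that assumption: the kv dict stands in for host.kv_put and host.kv_get, find_session is a hypothetical helper, and timestamps are passed in explicitly instead of calling host.now_ms().

```python
import json

kv = {}  # stand-in for the host KV store


def create_session(channel: str, user_id: str, now_ms: int) -> str:
    session_id = f"sess-{channel}-{user_id}-{now_ms}"
    meta = {"session_id": session_id, "channel": channel, "user_id": user_id}
    kv[f"session:{session_id}"] = json.dumps(meta)        # primary record
    kv[f"session_map:{channel}:{user_id}"] = session_id   # secondary index
    return session_id


def find_session(channel: str, user_id: str):
    """Resolve channel + user to session metadata with two point lookups."""
    session_id = kv.get(f"session_map:{channel}:{user_id}")
    raw = kv.get(f"session:{session_id}") if session_id else None
    return json.loads(raw) if raw else None


first = create_session("telegram", "u42", 1000)
second = create_session("telegram", "u42", 2000)

print(find_session("telegram", "u42")["session_id"])   # sess-telegram-u42-2000
print(find_session("slack", "u42"))                    # None
```

Because the index key is overwritten on every create, it always points at the most recent session for a channel and user. Older sessions stay in KV and remain reachable by session_id, but not through the natural key.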

9.3 State Management

The AgentStateFSM tracks execution state through a finite state machine. It validates transitions at runtime, so an attempt to jump straight from idle to responding is rejected. This catches bugs in the agent loop before they produce corrupt state.

# From examples/python/apps/miniclaw/memory.py

# Sole authoritative definition of the FSM.
# Adding a new state requires only adding it here.
_VALID_FSM_TRANSITIONS = {
    "idle": {"processing", "tool_executing"},
    "processing": {"tool_executing", "responding", "idle"},
    "tool_executing": {"processing", "idle"},
    "responding": {"idle"},
}


@fsm_actor(states=["idle", "processing", "tool_executing", "responding"], initial="idle")
class AgentStateFSM:
    """Agent lifecycle FSM: idle -> processing -> tool_executing -> responding -> idle."""

    fsm_state: str = state(default="idle")
    transition_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:agent_fsm")

    @handler("transition")
    def transition(self, to: str = "") -> dict:
        allowed = _VALID_FSM_TRANSITIONS.get(self.fsm_state, set())
        if to not in allowed:
            host.debug(f"FSM: invalid transition {self.fsm_state} -> {to}")
            return {"status": "ignored", "from": self.fsm_state, "to": to}
        prev = self.fsm_state
        self.fsm_state = to
        self.transition_count += 1
        host.debug(f"FSM: {prev} -> {to}")
        return {"status": "ok", "from": prev, "to": to}

    @handler("get_state")
    def get_state(self) -> dict:
        return {"status": "ok", "state": self.fsm_state, "transitions": self.transition_count}

Operators can query the FSM at any moment to see what state each agent is in.
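
The transition table can be walked in isolation to confirm which moves are accepted. The sketch below copies the table from the listing above; the step function is a hypothetical, host-free reduction of the transition handler.

```python
VALID_FSM_TRANSITIONS = {
    "idle": {"processing", "tool_executing"},
    "processing": {"tool_executing", "responding", "idle"},
    "tool_executing": {"processing", "idle"},
    "responding": {"idle"},
}


def step(current: str, to: str):
    """Return (new_state, status); invalid moves leave the state unchanged."""
    if to in VALID_FSM_TRANSITIONS.get(current, set()):
        return to, "ok"
    return current, "ignored"


state = "idle"
trace = []
for target in ["responding", "processing", "tool_executing",
               "processing", "responding", "idle"]:
    state, status = step(state, target)
    trace.append((target, status, state))

print(trace[0])   # ('responding', 'ignored', 'idle')
print(state)      # idle
```

The first request (idle to responding) is ignored and the machine stays in idle. The remaining five requests follow the normal chat path, including one tool round trip, and bring the machine back to idle.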


Part 10: Multi-Agent Orchestration with Durable Checkpoints

The OrchestratorActor decomposes complex tasks and delegates each sub-task to the AgentActor. It uses the Workflow behavior, which checkpoints progress after each step:

# From examples/python/apps/miniclaw/orchestrator.py

@workflow_actor
class OrchestratorActor:
    """Durable workflow: decompose task -> delegate to agents -> aggregate results."""

    status: str = state(default="idle")
    task_id: str = state(default="")
    progress: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.info(f"OrchestratorActor init actor_id={config.get('actor_id', '')}")

    @run_handler
    def run(self, payload: dict = None) -> dict:
        payload = payload or {}
        task = payload.get("task", "explain how agents work")
        task_id = payload.get("task_id", f"orch-{host.now_ms()}")

        self.status = "running"
        self.task_id = task_id
        self.progress = 0

        agent_id, err = pg_first("svc:agent")
        if err or not agent_id:
            self.status = "failed"
            return {"error": "no agents in svc:agent", "task_id": task_id}

        # Decompose: split on " and " for multi-step tasks
        lower = task.lower()
        idx = lower.find(" and ")
        sub_tasks = [task[:idx].strip(), task[idx + 5:].strip()] if idx >= 0 else [task]

        sub_results = []
        for i, sub_task in enumerate(sub_tasks):
            self.progress = (i + 1) * 100 // len(sub_tasks)
            resp = ask(agent_id, "chat",
                       {"message": sub_task, "session_id": f"orch-{task_id}-{i}"}, 15000)
            if not resp:
                self.status = "failed"
                return {"error": "sub-task failed", "task_id": task_id}
            # Checkpoint sub-result to TupleSpace — survives orchestrator crash
            host.ts.write(["orch_result", task_id, i, str(resp.get("response", ""))])
            sub_results.append(resp)

        summaries = [r.get("response", "") for r in sub_results if r.get("response")]
        self.status = "completed"
        self.progress = 100
        fire_audit("orchestrator_completed", f"task_id={task_id} subtasks={len(sub_tasks)}")
        return {
            "status": "ok",
            "task_id": task_id,
            "result": " | ".join(summaries),
            "sub_results": sub_results,
            "sub_tasks": len(sub_tasks),
        }

    @signal_handler("cancel")
    def cancel(self) -> None:
        self.status = "cancelled"
        host.info(f"Orchestrator cancelled task_id={self.task_id}")

    @query_handler("status")
    def query_status(self) -> dict:
        return {"task_id": self.task_id, "status": self.status, "progress": self.progress}

The @run_handler, @signal_handler, and @query_handler decorators map cleanly to the Workflow behavior’s three message types:

  • run: starts the workflow execution
  • signal: sends an out-of-band control message (e.g., cancellation mid-workflow)
  • query: reads durable workflow state without blocking the running workflow
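
The decomposition and progress arithmetic inside run can be checked on their own. The two helpers below are hypothetical extractions of the inline logic from the listing above, written as a sketch so that the behavior on tricky inputs is visible.

```python
def decompose(task: str) -> list:
    """Split on the first ' and ' only, as the run handler above does."""
    idx = task.lower().find(" and ")
    if idx < 0:
        return [task]
    return [task[:idx].strip(), task[idx + 5:].strip()]


def progress_after(step_index: int, total: int) -> int:
    """Integer percentage reported once step_index (0-based) has been started."""
    return (step_index + 1) * 100 // total


print(decompose("Research WASM sandboxes and summarize the findings"))
# ['Research WASM sandboxes', 'summarize the findings']

print(decompose("collect logs and metrics and file a report"))
# ['collect logs', 'metrics and file a report']

print([progress_after(i, 3) for i in range(3)])
# [33, 66, 100]
```

The second example shows the limit of a purely textual split: "logs and metrics" is one noun phrase, yet it is cut in half. That is fine for a demo, but a production orchestrator would likely ask the LLM for a structured plan instead.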

Part 11: Multi-App Deployments

In this example all ten actors share a single WASM binary via ACTOR_REGISTRY:

# From examples/python/apps/miniclaw/miniclaw_actor.py
ACTOR_REGISTRY = {
    "llm_router":      LLMRouterActor,
    "tool_registry":   ToolRegistryActor,
    "agent":           AgentActor,
    "session_manager": SessionManagerActor,
    "orchestrator":    OrchestratorActor,
    "memory":          MemoryActor,
    "audit_event":     AuditEventActor,
    "agent_fsm":       AgentStateFSM,
    "task_queue":      TaskQueueActor,
    "health_monitor":  HealthMonitorActor,
}
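
How a shared binary might pick the right class for each supervised child is sketched below. The resolve_actor function is hypothetical, and the placeholder classes stand in for the real actors. The sketch assumes the role arrives under config["args"], mirroring how on_init reads config.get("args", {}) elsewhere in this example. The actual dispatch in the PlexSpaces SDK may work differently.

```python
class LLMRouterActor:      # placeholder for the real actor class
    pass


class AgentActor:          # placeholder for the real actor class
    pass


ACTOR_REGISTRY = {
    "llm_router": LLMRouterActor,
    "agent": AgentActor,
}


def resolve_actor(config: dict):
    """Map the role in the actor's config to a class, failing loudly on typos."""
    role = config.get("args", {}).get("role", "")
    try:
        return ACTOR_REGISTRY[role]
    except KeyError:
        known = ", ".join(sorted(ACTOR_REGISTRY))
        raise ValueError(f"unknown role {role!r}; expected one of: {known}") from None


print(resolve_actor({"args": {"role": "agent"}}).__name__)   # AgentActor
```

Failing fast on an unknown role turns a misconfigured deployment file into an immediate startup error instead of an actor that silently handles nothing.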

This is convenient for development and single-tenant deployments. For enterprise multi-tenant deployments, you can split actors into separate applications to achieve stronger isolation:

  • llm-gateway/ – LLMRouterActor only; isolates credential management
  • agent-app/ – AgentActor + SessionManagerActor; one app per tenant team
  • tools-app/ – ToolRegistryActor + MemoryActor; shared tool catalog
  • audit-app/ – AuditEventActor; compliance isolation
  • infra-app/ – TaskQueueActor + HealthMonitorActor

In the multi-app model, each application gets its own Firecracker microVM in production, providing hardware-level tenant isolation. Actors in different applications discover each other via process groups or the object registry, so the split changes only app-config.toml, not the actor implementations.

Plugins as Deployed Apps, Not Bundled Packages

OpenClaw’s post-mortem describes a painful middle state: too much moved toward plugins, while plugins were still bundled, repaired, and dependency-loaded in startup paths. This is the monolith decomposition trap: you split the code but not the process, so startup coupling survives the refactor.

PlexSpaces avoids this by treating plugins as deployed apps, not installed packages. A channel connector or a third-party memory backend is a separate app that exposes one or more actors. The agent loop discovers them the same way it discovers any other actor, for example via pg_first("svc:telegram-connector"), whether the actor runs on the same node or on a remote one. Adding a new integration means deploying a new app, not modifying package.json.

| OpenClaw pattern | PlexSpaces equivalent | What changes |
|---|---|---|
| Bundled channel plugins in core | Channel app deployed separately | Startup failure in the channel app doesn’t touch the agent loop |
| Shared node_modules dependency graph | Each app is its own WASM binary | Supply-chain compromise in one app’s deps can’t reach another app |
| Plugin repair at startup | Actor restarts via one_for_one supervisor | Only the failed actor restarts; the rest keep running |
| Hard to decompose after the fact | Actor boundaries are message contracts from day one | Moving an actor to its own app changes app-config.toml, not the actor code |

Part 12: Security Comparison: Actor Framework vs. Monolithic

| Security Property | OpenClaw / Monolithic | MiniClaw / Actor-Based |
|---|---|---|
| State isolation | Shared memory; one agent reads another’s state | Per-actor private state; accessible only through messages |
| Privilege boundary | Single process; tools share agent’s full permissions | WASM sandbox; actor can only call WIT-declared imports |
| Sandbox depth | OS process boundary only | WASM linear memory + Firecracker microVM hardware boundary |
| Tenant separation | Application-level checks; misconfiguration = data leak | Framework-enforced RequestContext; no bypass possible |
| Tool execution | In-process; tool crash = agent crash | Separate actor; tool crash triggers supervised restart |
| Secret management | os.environ shared across all tools | Actor-scoped KV; WASM has no env var access |
| Audit trail | Optional; must add per tool | Built-in @event_actor; captures all operations by default |
| Prompt injection blast radius | Full system access: files, network, memory | Confined to single actor’s WIT capabilities |
| Circuit breaker | Must implement per integration | Built into LLMRouterActor; state survives restarts |
| Crash recovery | Process restart; lose all in-flight state | Actor restart; resume from durability checkpoint |
| Quality validation | Hope the LLM got it right | Reflection loop + three-check guardrails + LLM-as-Judge |
| Failure detection | Uncaught exceptions; manual health checks | Monitor/link primitives; __DOWN__/__EXIT__ messages |
| Multi-tenant scaling | Shard by process; complex ops burden | Cellular architecture; independent failure domains |

Part 13: Running the Example

Build and Deploy

cd examples/python/apps/miniclaw
./build.sh                     # componentize-py -> WASM Component Model
./test.sh 8092                 # Deploy to running node and run full test suite

What the Test Script Validates

The test script exercises all ten actors end-to-end:

# Step 3: LLM Router — simulated chat + tool routing
ask "llm_router" '{"op":"chat_completion","messages":[{"role":"user","content":"Hello!"}],"tools":[]}'

# Step 5: Agent chat — full loop including tool use
ask "agent" '{"op":"chat","message":"Search for the weather in Paris","session_id":"test-sess-1"}'

# Step 9: Agent FSM — validate state transitions
ask "agent_fsm" '{"op":"transition","to":"processing"}'
ask "agent_fsm" '{"op":"transition","to":"responding"}'

# Step 10: Orchestrator workflow — durable multi-agent task
ask "orchestrator" '{"op":"workflow_run","task":"explain AI agents","task_id":"test-orch-1"}' 60
ask "orchestrator" '{"op":"workflow_query:status"}'

# Step 8: Task Queue — Channel-backed enqueue/dequeue/ack
ask "task_queue" '{"op":"enqueue","task_type":"send_email","payload":{"to":"bob@example.com"}}'
ask "task_queue" '{"op":"dequeue","limit":1}'
ask "task_queue" '{"op":"ack","msg_id":"..."}'

App Configuration

All ten actors are declared in app-config.toml. Each actor specifies its behavior_kind, role (used to select the right class from ACTOR_REGISTRY), and facets:

[[supervisor.children]]
name = "agent"
actor_type = "miniclaw_wasm"
role = "agent"
behavior_kind = "GenServer"
args = { role = "agent", agent_name = "general-assistant", system_prompt = "You are a helpful AI assistant with access to tools." }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "10m", activation_strategy = "eager" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 3 } }
]

[[supervisor.children]]
name = "orchestrator"
actor_type = "miniclaw_wasm"
role = "orchestrator"
behavior_kind = "Workflow"            # Enables @run_handler, @signal_handler, @query_handler
args = { role = "orchestrator" }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "10m", activation_strategy = "lazy" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 5 } }
]

[[supervisor.children]]
name = "agent_fsm"
actor_type = "miniclaw_wasm"
role = "agent_fsm"
behavior_kind = "GenFSM"              # Enables @fsm_actor state machine behavior
args = { role = "agent_fsm" }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "30m", activation_strategy = "lazy" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 1 } }
]

The Isolation Ladder

Not every deployment needs a Firecracker VM, but every production agent system should reason explicitly about which isolation layer each component requires. MiniClaw provides a progression:

| Layer | Mechanism | What it contains |
|---|---|---|
| Message isolation | Actor private state; all access via host.ask/send | Cross-agent state reads; accidental coupling through shared memory |
| Tenant isolation | RequestContext JWT enforced by the framework | Cross-tenant KV, TupleSpace, and process group access |
| App isolation | Separate deployed apps; independent startup paths | Startup coupling; plugin dependency repair contagion across integrations |
| WASM isolation | WIT import surface; per-actor linear memory | Supply-chain attacks; filesystem, env, and exec access |
| Firecracker/Docker isolation | VM boundary per tenant | WASM JIT escape; cross-tenant kernel syscall surface |

The same actor code runs at every level. The app-config.toml determines which layers are active for a given deployment. Development runs message isolation only. A single-tenant production deployment adds WASM. A multi-tenant enterprise deployment adds Firecracker/Docker.


Conclusion

MiniClaw is not a finished enterprise agent platform. It is a small proof of concept that demonstrates a different foundation for one. The important lesson is not that every agent system needs these exact ten actors. The lesson is that agent runtimes benefit when isolation, supervision, explicit messaging, durable state, scoped memory, audit, and tenant boundaries are part of the architecture from the beginning. A monolithic agent loop is easy to start with, but hard to harden later. MiniClaw takes the opposite path: split the runtime into small actors, give each actor one responsibility, constrain what it can access, supervise it when it fails, and communicate only through explicit messages. Each actor owns one responsibility: routing LLM calls, managing tools, storing session metadata, persisting memory, recording audit events, coordinating workflows, or monitoring health.

MiniClaw is implemented with PlexSpaces, which provides runtime primitives such as KV, TupleSpace, Channels, timers, workflows, GenEvent, and GenFSM. It supports fault tolerance, observability, tenant isolation, authentication, rate limiting, circuit breakers, backpressure, and sandboxed execution via WebAssembly and Firecracker. This POC demonstrates the shape of the solution:

  • AgentActor models the bounded agent loop: user message -> LLM -> tool call -> repeat -> final response.
  • LLMRouterActor defines the model boundary, using a simulator where production code would call OpenAI, Anthropic, Bedrock, Gemini, or an internal model.
  • ToolRegistryActor centralizes tool registration and dispatch.
  • SessionManagerActor stores session metadata in KV.
  • MemoryActor demonstrates global, agent, and session-scoped memory.
  • AuditEventActor records non-blocking audit events through GenEvent-style fire-and-forget messaging.
  • AgentStateFSM makes lifecycle transitions explicit.
  • TaskQueueActor shows durable background work through channels.
  • HealthMonitorActor polls service-group health using actor timers.
  • OrchestratorActor demonstrates workflow-style task decomposition and result aggregation.

A production MiniClaw would harden the implementation with the following:

  • strict tenant, user, session, and tool authorization on every message;
  • safe expression evaluation (for example, asteval) in place of eval; the WASM sandbox reduces but does not eliminate the risk;
  • one actor instance per tenant/session or explicit session-partitioned state;
  • schema validation before tool execution;
  • idempotent task queue processing;
  • hardened tool execution with separate sandboxed tool actors for high-risk tools;
  • real LLM provider integration with retries, budgets, timeouts, backoff, and circuit breakers;
  • prompt-injection detection, output validation, and optional LLM-as-judge actors;
  • stronger memory governance, including TTLs, redaction, encryption, and deletion semantics;
  • structured audit trails with retention policies and tamper-resistant storage;
  • crash-recovery tests, chaos testing, and cross-tenant isolation tests;
  • deployment hardening for secrets, networking, service links, and Firecracker isolation.

For teams building enterprise AI agents, the real question is not whether they need isolation, auditability, tenant boundaries, tool governance, and failure recovery. They do. The question is whether they bolt those properties onto a monolithic agent process later, or start with a runtime where those properties are first-class primitives.


The full source, including the Go and Python implementations, is at github.com/bhatti/PlexSpaces.
