Shahzad Bhatti Welcome to my ramblings and rants!

June 27, 2026

Measuring Availability Properly: Percentiles, Tail Latency, and the Production Traps

Filed under: Computing — admin @ 1:18 pm

Observability is a key part of any infrastructure but I’ve watched teams repeat the same mistakes around measuring availability. For example, they track uptime and watch average latency. They run a TCP health check on port 80 and call it good. Then support learns about the availability issues from customers but the health dashboard shows everything is green. This post covers how to measure availability correctly: what signals to collect, how monitoring tools compute the rolling statistics you see, why percentiles beat averages and what happens to tail latency at scale in microservices.


1. What Availability Actually Means

The textbook definition of availability is uptime, i.e., the fraction of time a service is running. From the user's point of view, though, it splits into two separate questions: did the request succeed, and did it complete fast enough?

Availability = P(request succeeds AND completes within SLA)

A service that answers every request successfully but takes 30 seconds per response is functionally unavailable. Conversely, a service that responds in 5ms but returns errors for 50% of requests is also functionally unavailable.
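
As a rough illustration of that combined definition, here is a minimal sketch that counts a request as "good" only when it did not fail server-side and finished within the latency target. The `Request` type, the `availability` helper, and the 500ms default are assumptions made up for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller
    duration_s: float  # observed latency in seconds


def availability(requests: list[Request], sla_s: float = 0.5) -> float:
    """Fraction of requests that neither failed server-side nor exceeded sla_s."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500 and r.duration_s <= sla_s)
    return good / len(requests)


sample = [
    Request(200, 0.04),  # good
    Request(200, 0.90),  # succeeded, but too slow
    Request(503, 0.01),  # fast, but a server error
    Request(404, 0.02),  # client error: not held against the service
]
print(f"Availability: {availability(sample):.2f}")
```

Only two of the four sample requests are both successful and fast, so the sketch reports 0.50 even though three of them returned quickly.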


2. User Errors vs Server Errors — Why the Distinction Matters

This is the most commonly conflated measurement in production monitoring. HTTP status codes carry clear semantic meaning that should drive entirely different alert responses:

Code Range   Meaning             Whose Fault?   Include in Availability?
2xx          Success             -              Yes (success)
3xx          Redirect            -              Usually ignored
4xx          Client/user error   The caller     No
5xx          Server error        Your service   Yes

4xx errors are client/user errors such as 400 Bad Request or 401 Unauthorized. 5xx errors mean the service itself is failing, such as 500 Internal Server Error or 503 Service Unavailable. There is one gray area: client timeouts. If your client times out after 5s while waiting for your 10s response, the client sees a 408 or a network error, which looks like a 4xx even though the root cause is server-side latency. This is why tracking latency separately from error codes is essential.

from prometheus_client import Counter, Histogram

# Track errors with full status code granularity
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code', 'status_class']
)

latency_histogram = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def record_request(method: str, endpoint: str, status: int, duration_s: float):
    status_class = f"{status // 100}xx"
    request_counter.labels(
        method=method,
        endpoint=endpoint,
        status_code=str(status),
        status_class=status_class
    ).inc()
    latency_histogram.labels(method=method, endpoint=endpoint).observe(duration_s)


# --- Prometheus queries that actually measure availability ---

# Server error rate (5xx only — excludes client errors)
SERVER_ERROR_RATE = """
sum(rate(http_requests_total{status_class="5xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
"""

# Availability (only penalize server errors)
AVAILABILITY = """
1 - (
  sum(rate(http_requests_total{status_class="5xx"}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
"""

# Client error rate (useful to watch, but not availability)
CLIENT_ERROR_RATE = """
sum(rate(http_requests_total{status_class="4xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
"""

# Latency SLA compliance — fraction of requests completing within 500ms
LATENCY_SLA_COMPLIANCE = """
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
"""

A spike in 4xx that isn’t paired with a 5xx spike is almost certainly a misbehaving client, not your service. Alert on them differently: 5xx pages your on-call, 4xx goes to a ticket queue for review.


3. SLAs, SLOs, and Error Budgets

These three terms are used interchangeably in many organizations and they shouldn’t be.

  • SLA (Service Level Agreement) is a contractual commitment to external customers. Violating it has legal or financial consequences. Example: “We guarantee 99.9% availability per calendar month. If we breach this, we issue service credits.”
  • SLO (Service Level Objective) is an internal engineering target, usually tighter than the SLA. Example: “We target 99.95% availability.” The gap between SLO and SLA is your buffer.
  • Error Budget is what you get to spend before you breach your SLO. For a 99.9% SLO over 30 days:
Total minutes in 30 days = 30 × 24 × 60 = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes

The error budget is your 43.2 minutes. Every minute of downtime spends from it. This reframes the conversation from “is the service up?” to “how fast are we burning through our budget?”

from datetime import datetime, timedelta

class ErrorBudget:
    """
    Track error budget consumption in real time.
    
    Example: 99.9% SLO over 30 days = 43.2 minutes of allowed downtime.
    """
    def __init__(self, slo_target: float, window_days: int = 30):
        self.slo_target = slo_target          # e.g., 0.999 for 99.9%
        self.window_minutes = window_days * 24 * 60
        self.allowed_downtime_minutes = self.window_minutes * (1 - slo_target)
        self.downtime_minutes_spent = 0.0
        self.start_time = datetime.now()

    def record_downtime(self, minutes: float):
        self.downtime_minutes_spent += minutes

    def budget_remaining_minutes(self) -> float:
        return max(0, self.allowed_downtime_minutes - self.downtime_minutes_spent)

    def budget_remaining_pct(self) -> float:
        return (self.budget_remaining_minutes() / self.allowed_downtime_minutes) * 100

    def burn_rate(self) -> float:
        """How fast are we burning budget vs. expected rate? 1.0 = on track, >1.0 = burning fast."""
        elapsed = (datetime.now() - self.start_time).total_seconds() / 60
        expected_spent = (elapsed / self.window_minutes) * self.allowed_downtime_minutes
        if expected_spent == 0:
            return 0.0
        return self.downtime_minutes_spent / expected_spent

    def summary(self) -> str:
        return (
            f"SLO: {self.slo_target*100:.2f}%  |  "
            f"Budget: {self.allowed_downtime_minutes:.1f} min  |  "
            f"Spent: {self.downtime_minutes_spent:.1f} min  |  "
            f"Remaining: {self.budget_remaining_pct():.1f}%  |  "
            f"Burn rate: {self.burn_rate():.2f}x"
        )

# Usage
budget = ErrorBudget(slo_target=0.999, window_days=30)
budget.record_downtime(minutes=12.5)   # incident on day 3
budget.record_downtime(minutes=8.0)    # incident on day 11
print(budget.summary())
# SLO: 99.90%  |  Budget: 43.2 min  |  Spent: 20.5 min  |  Remaining: 52.5%  |  Burn rate: ...

A burn rate above 1.0 means you’ll exceed your error budget before the window closes. A sustained burn rate of 14.4x exhausts a 30-day budget in about 50 hours (30 days / 14.4), which is worth paging on-call.
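
The relationship between burn rate and time to exhaustion is simple arithmetic, sketched below. The helper names `hours_to_exhaustion` and `budget_spent_fraction` are hypothetical, and the sketch assumes the burn rate stays constant over the whole window.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole error budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being spent
    return window_days * 24 / burn_rate


def budget_spent_fraction(burn_rate: float, alert_window_hours: float,
                          window_days: int = 30) -> float:
    """Fraction of the budget consumed during an alert window at this burn rate."""
    return burn_rate * alert_window_hours / (window_days * 24)


for rate in (1.0, 6.0, 14.4):
    print(f"burn rate {rate:>4.1f}x -> budget gone in {hours_to_exhaustion(rate):6.1f} h, "
          f"{budget_spent_fraction(rate, 1) * 100:.2f}% of budget spent per hour")
```

At 14.4x the budget lasts roughly 50 hours, and each hour at that rate consumes about 2% of the 30-day budget.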


4. The Health Check Anti-Pattern

I need to address something I’ve seen sink production deployments before we even get to metrics: health checks that only verify the process is listening on a port. A port check tells you the process hasn’t crashed. It tells you nothing about whether the process can serve traffic. I’ve seen this exact scenario: database connection pool was exhausted, port was open, load balancer marked the instance healthy, every request returned a 500. The monitoring was dark green the whole time.

A real health check must exercise the actual request path: connect to dependencies, perform a lightweight but genuine operation, return structured status. In Kubernetes this means a readiness probe hitting a /health endpoint that checks dependency connectivity. Critically, readiness and liveness are different probes:

  • Liveness: Is the process deadlocked? If not, keep it alive. If yes, kill and restart it.
  • Readiness: Can it serve traffic right now? If not, remove it from the load balancer pool, but don’t kill it.

A process that is alive but not ready (warming up a cache, waiting for a dependency) should fail readiness but pass liveness. Confusing the two causes cascading restarts during startup under load, a failure mode I’ve seen multiple times in prod. See my Zero-Downtime Services on Kubernetes and Istio post for the full treatment.
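
The split between the two probes can be sketched in a framework-neutral way. Everything here is illustrative: `liveness`, `readiness`, `db_ping`, and `cache_ping` are hypothetical names, and a real service would wire these into its HTTP framework and run genuine lightweight operations (for example, a trivial query through the real connection pool).

```python
import time
from typing import Callable

# Hypothetical dependency checks: each returns True when the dependency answers.
DependencyCheck = Callable[[], bool]


def liveness() -> dict:
    """Liveness only proves the process can still respond; no dependencies involved."""
    return {"status": "alive"}


def readiness(checks: dict[str, DependencyCheck]) -> tuple[int, dict]:
    """Run every dependency check and return (http_status, structured_body)."""
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failed dependency
        elapsed_ms = round((time.monotonic() - started) * 1000, 2)
        results[name] = {"ok": ok, "latency_ms": elapsed_ms}
    ready = all(r["ok"] for r in results.values())
    return (200 if ready else 503), {"ready": ready, "dependencies": results}


def db_ping() -> bool:
    return True  # stand-in for a real lightweight query


def cache_ping() -> bool:
    raise ConnectionError("cache unreachable")  # simulated dependency outage


status, body = readiness({"database": db_ping, "cache": cache_ping})
print(status, body["ready"])
```

With the cache check failing, readiness returns 503 so the instance leaves the load balancer pool, while liveness keeps passing so the process is not restarted.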


5. Why Average Latency Lies

Here’s a production story I’ve seen more than once. The team does an efficiency push: optimizes the hot path and ships a 30% improvement in p50 latency. Dashboards celebrate, but three weeks later the p99 is back to where it started. Why? The answer is queuing theory. Consider a server with a queue in front of it. Define utilization P as:

P = arrival rate / service rate

For the simplest model (a single server with random arrivals and service times, known as M/M/1), the average number of items in the system, counting those in the queue plus the one being served, is:

E[N] = P / (1 - P)

This is not a linear relationship. The curve goes vertical as you approach full utilization:

P (utilization)   E[N] (avg items in system)
0.50 (50%)        1
0.80 (80%)        4
0.90 (90%)        9
0.95 (95%)        19
0.99 (99%)        99

When you make the code faster (higher service-rate), P drops, and you slide left on this curve, i.e., fewer items queuing with lower tail latency. But then traffic grows or you reduce servers to “realize the savings.” P climbs back to where it was, and latency returns with it. The key lesson is that the average latency reflects the fast path but high-percentile latency (p99, p99.9) is extremely sensitive to queue depth. High percentile latency is a leading indicator that you’re approaching overload.
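
To make the slide along the curve concrete, here is a small sketch of the E[N] formula. The function name `avg_items_in_system` is made up for this example, and the 0.90 starting utilization is an assumed figure; the only modeling assumption is the single-server queue the formula already implies.

```python
def avg_items_in_system(utilization: float) -> float:
    """Average number of requests queued or in service: E[N] = P / (1 - P)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


for u in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {u:.2f} -> E[N] = {avg_items_in_system(u):5.1f}")

# A 30% cut in service time scales utilization by 0.7 at the same traffic.
before = 0.90
after = before * 0.7
print(f"after speedup: utilization {after:.2f}, E[N] = {avg_items_in_system(after):.2f}")
print(f"after traffic grows back: E[N] = {avg_items_in_system(before):.1f}")
```

The speedup takes the queue from 9 items to under 2, and the gain disappears entirely once utilization climbs back to 0.90.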

There’s a counterintuitive implication from this: p99 is a terrible way to measure whether your efficiency work succeeded. It’s so sensitive to the queuing nonlinearity that changes in utilization will swamp the signal from your actual code changes. For measuring efficiency, mean latency is actually better because it tracks the true cost of processing one request without queue effects. Use percentiles for alerting and use mean for efficiency measurement.


6. Percentiles From First Principles

Let’s go over percentiles from scratch, because monitoring tools throw around “p50”, “p99”, “p99.9” without ever explaining what they actually represent, and misunderstanding them leads to misreading dashboards. Given a set of N latency measurements, sort them from fastest to slowest. The pth percentile is the value at position p% of the way through that sorted list.

Latencies (ms), sorted: [5, 7, 8, 9, 10, 11, 12, 13, 250, 400]
                         ^           ^               ^    ^
                         p10         p50             p90  p99

Using the nearest-rank method (rank = ceil(p/100 × N)):

p50 = 10ms   (rank 5: 50% of requests were at or below this value)
p90 = 250ms  (rank 9: 90% of requests were at or below this value)
p99 = 400ms  (rank 10: with only 10 samples, p99 is simply the maximum)

What p99 tells you is: at most 1% of your requests see latency worse than this number. Equivalently, 99 out of every 100 requests complete at or below p99. The catch is that p99 is a single value and it says nothing about the shape of the distribution between p90 and p99. Latency can get dramatically worse for customers in that range without your p99 alarm firing.
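
The "sort, then pick a position" definition can be written in a few lines. This is a sketch of the nearest-rank method only; `nearest_rank_percentile` is a name chosen for this example, and monitoring tools generally use other estimators.

```python
import math


def nearest_rank_percentile(samples: list[float], p: float) -> float:
    """Smallest sample such that at least p% of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


latencies_ms = [5, 7, 8, 9, 10, 11, 12, 13, 250, 400]
for p in (10, 50, 90, 99):
    print(f"p{p} = {nearest_rank_percentile(latencies_ms, p)}ms")
```

Note that `np.percentile` interpolates linearly between neighbouring samples by default, so on small samples like this one it reports different numbers (about 10.5, 265, and 386.5 for p50, p90, and p99). On large samples the two methods converge.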

import numpy as np

def explain_percentile(latencies_ms: list[float]):
    """Show what percentiles mean in plain English."""
    arr = np.array(sorted(latencies_ms))
    n = len(arr)
    
    stats = {
        "mean":  np.mean(arr),
        "p50":   np.percentile(arr, 50),
        "p90":   np.percentile(arr, 90),
        "p95":   np.percentile(arr, 95),
        "p99":   np.percentile(arr, 99),
        "p99.9": np.percentile(arr, 99.9),
        "max":   np.max(arr),
    }
    
    print(f"{'Statistic':<10} {'Value':>10}   Plain English")
    print("-" * 65)
    print(f"{'mean':<10} {stats['mean']:>10.1f}ms  Average — hides bimodal distributions")
    print(f"{'p50':<10} {stats['p50']:>10.1f}ms  Half of requests faster than this")
    print(f"{'p90':<10} {stats['p90']:>10.1f}ms  90% of requests faster than this")
    print(f"{'p95':<10} {stats['p95']:>10.1f}ms  95% of requests faster than this")
    print(f"{'p99':<10} {stats['p99']:>10.1f}ms  99% of requests faster than this")
    print(f"{'p99.9':<10} {stats['p99.9']:>10.1f}ms  999/1000 requests faster than this")
    print(f"{'max':<10} {stats['max']:>10.1f}ms  Worst single request (very noisy)")

# Simulate a bimodal latency distribution
# 95% fast requests (cache hit), 5% slow (cache miss + DB query)
import random
random.seed(42)
latencies = [
    random.gauss(10, 2) if random.random() > 0.05 else random.gauss(300, 40)
    for _ in range(1000)
]
explain_percentile(latencies)

Output:

Statistic       Value   Plain English
-----------------------------------------------------------------
mean             24.8ms  Average — hides bimodal distributions
p50              10.4ms  Half of requests faster than this
p90              12.1ms  90% of requests faster than this
p95              17.9ms  95% of requests faster than this
p99             302.1ms  99% of requests faster than this
p99.9           375.8ms  999/1000 requests faster than this
max             392.4ms  Worst single request (very noisy)

7. Moving Averages and Rolling Percentiles

When Grafana shows you a p99 or Datadog shows you an error rate, it’s not summing up all-time data. It’s computing over a rolling time window.

Simple Moving Average vs EWMA

A Simple Moving Average (SMA) gives equal weight to every sample in the window:

from collections import deque
import statistics

class SMA:
    """Simple Moving Average — every sample in the window weighted equally."""
    def __init__(self, window: int):
        self.buf = deque(maxlen=window)
    
    def add(self, v: float) -> float:
        self.buf.append(v)
        return statistics.mean(self.buf)

An Exponentially Weighted Moving Average (EWMA) gives more weight to recent samples, fading older ones smoothly:

class EWMA:
    """
    Exponentially Weighted Moving Average.
    alpha: 0 < alpha < 1
    - High alpha (e.g. 0.3): reacts fast, noisier
    - Low alpha  (e.g. 0.05): smoother, slower to detect changes
    
    Unix load averages are a well-known EWMA. Prometheus rate() instead works over a fixed time window.
    """
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value: float | None = None

    def add(self, sample: float) -> float:
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

# Demonstrate: same spike, different alphas
spike_data = [10, 10, 10, 10, 250, 10, 10, 10, 10, 10]
slow_ewma = EWMA(alpha=0.05)
fast_ewma = EWMA(alpha=0.30)

print(f"{'Sample':>8} {'Value':>8} {'alpha=0.05':>10} {'alpha=0.30':>10}")
for i, v in enumerate(spike_data):
    print(f"{i:>8} {v:>8.0f} {slow_ewma.add(v):>10.1f} {fast_ewma.add(v):>10.1f}")
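
Output:

  Sample    Value alpha=0.05 alpha=0.30
       0       10       10.0       10.0
       1       10       10.0       10.0
       2       10       10.0       10.0
       3       10       10.0       10.0
       4      250       22.0       82.0   --> fast alpha reacts to the spike much more strongly
       5       10       21.4       60.4   --> fast alpha also decays faster; slow alpha barely moved but lingers
       6       10       20.8       45.3
       7       10       20.3       34.7
       8       10       19.8       27.3
       9       10       19.3       22.1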

Rolling Percentile

Computing exact percentiles over a moving window requires keeping raw samples and re-sorting. For production scale, the T-Digest algorithm computes approximate percentiles with bounded memory. Here’s the conceptual version first:

import numpy as np
from collections import deque

class RollingPercentile:
    """
    Rolling percentile over a fixed window of recent samples.
    
    Production note: At high throughput, use T-Digest or DDSketch instead.
    Prometheus uses pre-defined histogram buckets + linear interpolation.
    """
    def __init__(self, window: int, pctile: float):
        self.buf = deque(maxlen=window)
        self.pctile = pctile

    def add(self, v: float) -> float | None:
        self.buf.append(v)
        if len(self.buf) < 2:
            return None
        return float(np.percentile(list(self.buf), self.pctile))

# Show how window size affects sensitivity
import random
random.seed(7)

data = [random.gauss(10, 2) for _ in range(90)] + \
       [random.gauss(200, 20) for _ in range(10)]  # degradation at t=90

p99_small  = RollingPercentile(window=20,  pctile=99)
p99_medium = RollingPercentile(window=100, pctile=99)

print("How window size affects p99 detection of a latency spike:")
print(f"{'t':>4} {'value':>8} {'p99 w=20':>12} {'p99 w=100':>12}")
for t, v in enumerate(data):
    # Feed every sample so both windows are populated before the spike arrives
    small  = p99_small.add(v)
    medium = p99_medium.add(v)
    if t < 80:
        continue                    # only print the transition region
    marker = " --> spike starts" if t == 90 else ""
    print(f"{t:>4} {v:>8.1f} {small:>12.1f} {medium:>12.1f}{marker}")

Prometheus histogram vs. summary: Prometheus offers two ways to track latency. A Summary computes quantiles client-side over a rolling window but you can’t aggregate across instances. A Histogram records counts in pre-defined buckets and approximates quantiles server-side, which is slightly less accurate, but fully aggregatable. For microservices with multiple replicas, always use Histogram.
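
To see why a histogram-based quantile is only approximate, here is a simplified sketch of bucket interpolation. It follows the linear-interpolation idea that Prometheus documents for `histogram_quantile()`, but it is not the Prometheus implementation, and the bucket counts below are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """
    Estimate the q-quantile (0..1) from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
             ending with an upper bound of +Inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - prev_count
            if math.isinf(upper) or in_bucket == 0:
                return prev_bound  # cannot interpolate: report the last finite bound
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return prev_bound


# Invented cumulative counts for le = 0.1, 0.25, 0.5, 1.0, +Inf
example = [(0.1, 900), (0.25, 950), (0.5, 985), (1.0, 998), (math.inf, 1000)]
print(f"p50   ~ {histogram_quantile(0.50, example):.3f}s")
print(f"p99   ~ {histogram_quantile(0.99, example):.3f}s")
print(f"p99.9 ~ {histogram_quantile(0.999, example):.3f}s")
```

The estimate can never be more precise than the bucket it lands in, and anything in the +Inf bucket is reported as the highest finite bound, so bucket boundaries are worth placing around your SLA threshold.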


8. Trimmed Mean: More Signal, Real Tradeoffs

Here’s the core difference between a percentile and a trimmed mean:

100 latency measurements, sorted by speed:

p99  = the single worst measurement in the best 99%
       (the 99th measurement out of 100, sorted fastest-to-slowest)

tm99 = the average of all 99 measurements in the best 99%
       (discard the 1 slowest, average the remaining 99)

tm99 summarizes 99 times more data than p99. That makes it more stable (less spiky under low traffic), harder to game (a gradual degradation can hide between percentile checkpoints, but tm99 will catch it), and more representative of typical customer experience.

  • tm99 tracks the average experience of your bulk of customers
  • TM(99%:) tracks the average of your slowest 1%; ensures outlier experience doesn’t silently worsen

Together these two numbers cover 100% of your requests with just two metrics.

import numpy as np

def compute_tm_stats(samples: list[float]) -> dict:
    """
    Compute a full suite of trimmed mean statistics.

    Syntax mirrors CloudWatch / AWS Embedded Metrics Format:
      tm99        = TM(0%:99%)  = average of fastest 99%
      TM(99%:)    = TM(99%:100%) = average of slowest 1%  
      TM(1%:99%)  = drop both extremes (handles unbounded latency)
      IQM         = TM(25%:75%) = Interquartile Mean
    """
    arr = np.sort(np.array(samples))
    n = len(arr)

    def tm(lower_pct: float, upper_pct: float) -> float:
        lo = np.percentile(arr, lower_pct)
        hi = np.percentile(arr, upper_pct)
        trimmed = arr[(arr >= lo) & (arr <= hi)]
        return float(np.mean(trimmed)) if len(trimmed) else float('nan')

    return {
        "mean":       float(np.mean(arr)),
        "p50":        float(np.percentile(arr, 50)),
        "p99":        float(np.percentile(arr, 99)),
        "tm99":       tm(0, 99),      # avg of fastest 99%
        "TM(99%:)":   tm(99, 100),    # avg of slowest 1%  --> watch your outliers here
        "TM(1%:99%)": tm(1, 99),      # drop both extremes (use for unbounded latency)
        "IQM":        tm(25, 75),     # interquartile mean
    }

# Scenario: a cache-miss spike where 2% of requests are slow
rng = np.random.default_rng(42)
fast = rng.normal(10, 1.5, 980)
slow = rng.normal(350, 30, 20)
samples = np.concatenate([fast, slow]).tolist()

stats = compute_tm_stats(samples)
print(f"{'Metric':<14} {'Value':>10}   Notes")
print("-" * 65)
for k, v in stats.items():
    notes = {
        "mean":       "Pulled up by slow tail, misleading",
        "p50":        "Median, fine but ignores tail",
        "p99":        "Single value at 99th position",
        "tm99":       "Average of fastest 99% of requests --> primary SLO metric",
        "TM(99%:)":   "Average of slowest 1% --> outlier watchdog",
        "TM(1%:99%)": "Drops both extremes, good for browser metrics",
        "IQM":        "Middle 50% average, robust to both extremes",
    }.get(k, "")
    print(f"{k:<14} {v:>10.1f}ms  {notes}")

Output:

Metric              Value   Notes
-----------------------------------------------------------------
mean                16.8ms  Pulled up by slow tail, misleading
p50                  9.9ms  Median, fine but ignores tail
p99                335.2ms  Single value at 99th position
tm99                10.1ms  Average of fastest 99% of requests --> primary SLO metric
TM(99%:)           351.4ms  Average of slowest 1% --> outlier watchdog
TM(1%:99%)          10.1ms  Drops both extremes, good for browser metrics
IQM                  9.8ms  Middle 50% average, robust to both extremes

Bounded vs. unbounded latency:

  • Bounded latency (server-side, with request timeouts): use tm99 + TM(99%:). Since latency is capped by your timeout, even the worst measurements are meaningful.
  • Unbounded latency (client-side browser metrics, user-perceived time): use TM(1%:99%). A user who closes their laptop mid-request and reopens it days later may log a latency of 230,400 seconds. These shouldn’t contaminate your outlier statistics. Drop the top and bottom extremes.

I have seen real-life production services where teams worked on improving p50 (the median) while everything else got worse. You only find this out when you examine tm95, because latency was consistently worse for a growing number of customers. The key lesson is that percentiles create blind spots “between the checkpoints.” A degradation that affects the 40th–60th percentile range will move neither p25 nor p75 much. Trimmed mean, because it averages across the entire range, catches these shifts. However, trimmed mean has its own blind spot: it deliberately removes the part of the distribution that dominates user experience in fan-out architectures. The right answer is not to choose between percentiles and trimmed mean but to use both.


10. Winsorized Mean, Percentile Rank, and IQM

These statistics show up in CloudWatch and modern observability platforms, and they each solve a specific problem.

Winsorized Mean (WM)

Like trimmed mean, but instead of discarding outliers, it replaces them with the boundary value. For wm99:

  • Find the value at the 99th percentile (= p99)
  • Treat all 1% outliers as if they had exactly that p99 value
  • Average all 100% of samples

import numpy as np

def winsorized_mean(samples: list[float], lower_pct: float = 0, upper_pct: float = 99) -> float:
    arr = np.array(samples, dtype=float)
    lo = np.percentile(arr, lower_pct)
    hi = np.percentile(arr, upper_pct)
    # Clip: anything below lo becomes lo, anything above hi becomes hi
    winsorized = np.clip(arr, lo, hi)
    return float(np.mean(winsorized))

Winsorized mean gives some weight to outliers without letting extreme values skew the average. The difference between tm99 and wm99 is subtle at high percentages: wm99 will be slightly higher because it keeps the outliers, clamped to the p99 value, rather than dropping them.

Percentile Rank PR()

Percentile rank answers the inverse question from percentile. Percentile says: “What latency value marks the Nth percent?” Percentile rank says: “What percent of requests are below a given latency value?”

If you have an SLA of “respond within 500ms to 99% of users,” you’d normally monitor p99 and check it’s <= 500ms. With Percentile Rank, you instead plot PR(:500ms), i.e., the percentage of requests completing within 500ms, and drive that number toward 99% or higher. This is more directly action-oriented: you always know exactly how far below your SLA you are.

def percentile_rank(samples: list[float], threshold: float) -> float:
    """What fraction of samples are at or below threshold?"""
    arr = np.array(samples)
    return float(np.mean(arr <= threshold) * 100)

# Example: SLA is p99 < 500ms
samples_ms = [10, 12, 9, 11, 450, 10, 13, 600, 11, 10]  # small sample
pr_500 = percentile_rank(samples_ms, 500)
print(f"PR(:500ms) = {pr_500:.1f}%  (SLA requires 99%)")
# PR(:500ms) = 90.0%  (SLA requires 99%) — you're 9 percentage points short

IQM (Interquartile Mean)

IQM is simply TM(25%:75%), the average of the middle 50% of samples, discarding the top and bottom 25%. It’s extremely robust to outliers in both directions, useful when you expect noise from both ends of the distribution (e.g., some requests are trivially fast cache hits, others are pathologically slow).


11. The Inspection Paradox: Your Users Experience Worse Than Your Metrics Show

As Marc Brooker explains on his blog, this is the most underappreciated gap in distributed systems reliability. For example, say your service has outages with very different durations: some resolve in 30 seconds, but occasionally one runs for 3 hours. Your MTTR (Mean Time to Recovery) might work out to 5 minutes. But when a user hits your service during an outage, they’re more likely to land in a long outage than a short one, because long outages have more time slots for users to arrive in.

Customer-experienced mean recovery = (1/2) × (MTTR + Variance/MTTR)

The second term is what kills you. If your outage duration has high variance, e.g., fast recovery most of the time, but occasional 3-hour events then that variance term dominates. Your customers experience something dramatically worse than your MTTR.
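
Plugging numbers into the formula shows how quickly the variance term takes over. This is a direct transcription of the expression above; the function name `customer_mean_recovery` and the example figures (a 5-minute MTTR with a 20-minute standard deviation) are assumptions picked for illustration.

```python
def customer_mean_recovery(mttr_min: float, stddev_min: float) -> float:
    """Mean wait seen by a customer who arrives at a random moment during an outage."""
    if mttr_min <= 0:
        raise ValueError("MTTR must be positive")
    variance = stddev_min ** 2
    return 0.5 * (mttr_min + variance / mttr_min)


# Every outage lasts exactly 5 minutes: customers wait half of it on average.
print(customer_mean_recovery(mttr_min=5, stddev_min=0))    # 2.5

# Same 5-minute MTTR, but with a 20-minute standard deviation.
print(customer_mean_recovery(mttr_min=5, stddev_min=20))   # 42.5
```

Both services report the same MTTR, yet customers of the high-variance one wait 17 times longer on average.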

import random
import math
import statistics

def inspection_paradox_demo(
    median_recovery_min: float,
    p99_recovery_min: float,
    arrivals_per_min: float = 100,
    n_outages: int = 2000
) -> dict:
    """
    Simulate the gap between operator MTTR and customer-experienced recovery.
    
    Key insight: customers are t-weighted samplers of your outage distribution.
    A 10-minute outage gets sampled by ~10x as many clients as a 1-minute outage.
    """
    # Fit lognormal to median and p99
    mu = math.log(median_recovery_min)
    sigma = (math.log(p99_recovery_min) - mu) / 2.326

    server_durations = []
    client_wait_times = []

    for _ in range(n_outages):
        duration = random.lognormvariate(mu, sigma)
        server_durations.append(duration)

        # Clients arrive as a Poisson process during the outage
        t = 0.0
        while True:
            t += random.expovariate(arrivals_per_min)  # time of the next arrival
            if t > duration:
                break
            # This client arrived at time t and waits until the outage ends
            client_wait_times.append(duration - t)

    return {
        "operator_mttr":        statistics.mean(server_durations),
        "operator_p99":         sorted(server_durations)[int(len(server_durations) * 0.99)],
        "customer_mean_wait":   statistics.mean(client_wait_times) if client_wait_times else 0,
        "customer_p99_wait":    sorted(client_wait_times)[int(len(client_wait_times) * 0.99)] if client_wait_times else 0,
        "experience_gap_ratio": (statistics.mean(client_wait_times) / statistics.mean(server_durations)) if client_wait_times else 0,
    }

result = inspection_paradox_demo(
    median_recovery_min=1,    # median outage resolves in 1 minute
    p99_recovery_min=60,      # but 1% of outages take an hour
)

print("Scenario: 1-minute median recovery, 60-minute p99 recovery")
print()
print("What your on-call dashboard shows:")
print(f"  MTTR:              {result['operator_mttr']:.1f} minutes")
print(f"  p99 recovery:      {result['operator_p99']:.1f} minutes")
print()
print("What your customers actually experience:")
print(f"  Mean recovery:     {result['customer_mean_wait']:.1f} minutes")
print(f"  p99 recovery:      {result['customer_p99_wait']:.1f} minutes")
print(f"  Experience gap:    {result['experience_gap_ratio']:.1f}x worse than MTTR")

Output:

Scenario: 1-minute median recovery, 60-minute p99 recovery

What your on-call dashboard shows:
  MTTR:              4.9 minutes
  p99 recovery:      56.6 minutes

What your customers actually experience:
  Mean recovery:     60.0 minutes
  p99 recovery:      797.3 minutes
  Experience gap:    12.1x worse than MTTR

This is why tail recovery time matters more than averages suggest. Timeout-and-retry can hide individual request latency, but it cannot hide recovery time. Once a client gets stuck in an outage, retries don’t shorten the outage, they just add load to an already struggling service. The right takeaway: minimize variance in recovery time, not just its mean. Bounded, predictable recovery is far better for customers than fast-average-but-occasional-disaster.


12. Tail Latency Amplifies in Microservices

Modern architectures decompose user requests into many service calls. This creates two topologies, and both amplify tail latency:

Fan-out math: If each service has a 1% probability of a slow response, the probability that at least one is slow when calling N services in parallel is:

P(at least one slow) = 1 - (1 - 0.01)^N

N (services called)   % of user requests seeing a slow response
1                     1.0%
5                     4.9%
10                    9.6%
25                    22.2%
50                    39.5%
100                   63.4%

What was a rare 1% tail now affects the majority of user interactions. And here’s the pernicious part: your per-service p99 metric looks perfectly fine. The damage is invisible at the service level, only visible at the user-experience level.
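
The fan-out formula is easy to evaluate directly, and inverting it shows how tight each backend has to be for a given user-facing target. Both helper names are hypothetical, and the sketch assumes backends are slow independently of each other, which real systems often violate.

```python
def p_any_slow(n_services: int, tail_prob: float = 0.01) -> float:
    """Probability that at least one of n parallel calls lands in the slow tail."""
    return 1 - (1 - tail_prob) ** n_services


def required_tail_prob(n_services: int, target_user_prob: float) -> float:
    """Per-service slow probability needed to keep user-level slowness at the target."""
    return 1 - (1 - target_user_prob) ** (1 / n_services)


for n in (1, 5, 10, 25, 50, 100):
    print(f"N={n:>3}: {p_any_slow(n) * 100:5.1f}% of user requests see a slow call")

needed = required_tail_prob(100, 0.01)
print(f"To keep 1% at N=100, each service may be slow only {needed * 100:.3f}% of the time")
```

Holding a 1% user-facing tail with a fan-out of 100 requires each backend to be slow only about 0.01% of the time, which means watching something close to p99.99 per service.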

import numpy as np, random

def simulate_fanout(n_backends: int, tail_prob: float = 0.01, n_reqs: int = 20_000):
    """
    Simulate client experience when calling n_backends in parallel.
    Each backend: (1-tail_prob) chance of fast, tail_prob chance of slow.
    """
    results = []
    slow_count = 0
    for _ in range(n_reqs):
        latencies = []
        for _ in range(n_backends):
            if random.random() < tail_prob:
                latencies.append(random.gauss(250, 25))
                slow_count += 1
            else:
                latencies.append(random.gauss(10, 2))
        results.append(max(latencies))  # fan-out: wait for slowest
    
    arr = np.array(results)
    return {
        "p50":  np.percentile(arr, 50),
        "p99":  np.percentile(arr, 99),
        "mean": np.mean(arr),
        "pct_slow_user_requests": np.mean(arr > 50) * 100,
    }

print(f"{'N':>4} {'p50 (ms)':>10} {'p99 (ms)':>10} {'mean (ms)':>10} {'% users hit slow':>18}")
for n in [1, 5, 10, 25, 50, 100]:
    r = simulate_fanout(n)
    print(f"{n:>4} {r['p50']:>10.1f} {r['p99']:>10.1f} {r['mean']:>10.1f} {r['pct_slow_user_requests']:>18.1f}%")

The trimmed mean blind spot revisited. At N=50, nearly 40% of user requests are slow. But your per-service tm99 (averaging the best 99% of individual service calls) still looks great because it’s averaging the fast cluster. This is exactly the case where trimmed mean gives you false comfort. You need explicit end-to-end latency tracking at the user-request level, not just per-service tail tracking.


13. The Pooling Dividend: Why Redundancy Is Non-Linear

Adding servers doesn’t just increase capacity linearly; it also improves latency through pooling. This comes from the Erlang C model in queuing theory. For example, consider two designs running at the same per-server utilization:

  • Design A: 1 server at 80% utilization
  • Design B: 10 servers sharing load, each at 80% utilization

Design A has an 80% chance of any incoming request finding the server busy and joining a queue. Design B has roughly a 41% chance. Double the fleet to 20 servers at the same 80% per-server utilization, and the queueing probability drops to about 26%; at 50 servers it falls below 9%. You’re getting better latency and better tail behavior at the same per-server cost.

import math

def erlang_c(c: int, rho: float) -> float:
    """
    Erlang C formula: probability an arriving request must queue
    (rather than being served immediately) in an M/M/c system.
    
    c: number of servers
    rho: per-server utilization (0 < rho < 1)
    """
    if c < 1 or not 0 < rho < 1:
        raise ValueError("need c >= 1 and 0 < rho < 1")
    a = c * rho  # total offered load in Erlangs
    
    # Unnormalized weight of the states with fewer than c requests (no queueing)
    sum_term = sum(a**k / math.factorial(k) for k in range(c))
    # Unnormalized weight of the states with c or more requests (arrival must queue)
    last_term = (a**c / math.factorial(c)) * (1 / (1 - rho))
    
    return last_term / (sum_term + last_term)

print("Probability a request must queue before being served:")
print(f"{'Servers':>8} {'Utilization':>12} {'Queue prob':>12}   {'Queue %':>8}")
for c in [1, 2, 5, 10, 20, 50]:
    ec = erlang_c(c=c, rho=0.8)
    print(f"{c:>8} {'80%':>12} {ec:>12.4f}   {ec*100:>7.1f}%")

Probability a request must queue before being served:
 Servers  Utilization   Queue prob    Queue %
       1          80%       0.8000      80.0%
       2          80%       0.7111      71.1%
       5          80%       0.5541      55.4%
      10          80%       0.4092      40.9%
      20          80%       0.2561      25.6%
      50          80%       0.0870       8.7%

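The gains show up at modest fleet sizes. You don’t need to be at hyperscale to get pooling gains: going from 1 to 10 servers at the same utilization roughly halves the queueing probability (80% to 41%), and requests that do queue wait less because a larger pool frees up a server sooner. A fleet of 5-10 servers sharing load through a proper load balancer will have dramatically better tail latency behavior than the same compute running as independent instances.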
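
Queueing probability is only half of the pooling effect. In the same M/M/c model, the mean time a request spends waiting in the queue is W_q = C / (c · μ · (1 − ρ)), where C is the Erlang C probability and μ is the per-server service rate. The sketch below assumes a 10 ms mean service time, which is a hypothetical figure, and the helper name mean_queue_wait_ms is illustrative. It recomputes Erlang C locally so it runs on its own:

```python
import math

def erlang_c(c: int, rho: float) -> float:
    """Probability an arrival must queue in an M/M/c system."""
    a = c * rho  # total offered load in Erlangs
    sum_term = sum(a**k / math.factorial(k) for k in range(c))
    last_term = (a**c / math.factorial(c)) / (1 - rho)
    return last_term / (sum_term + last_term)

def mean_queue_wait_ms(c: int, rho: float, service_time_ms: float = 10.0) -> float:
    """Mean wait in queue (excluding service time), averaged over all requests."""
    mu = 1.0 / service_time_ms  # per-server service rate, requests per ms
    return erlang_c(c, rho) / (c * mu * (1 - rho))

for c in [1, 2, 5, 10, 20, 50]:
    print(f"{c:>3} servers at 80%: mean queue wait {mean_queue_wait_ms(c, 0.8):6.2f} ms")
```

Under these assumptions a single server averages 40 ms of queueing per request while ten pooled servers average about 2 ms, even though every server is equally busy.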


14. Retries, Circuit Breakers, and the Amplification Trap

Retries protect against transient failures such as a GC pause, a brief network glitch, or a thundering herd. In past production deployments, I’ve used up to 3 retries with exponential backoff for idempotent read operations. The protection against false positives is real and worthwhile. But retries have a catastrophic failure mode: retry amplification.

When three layers of a call chain each make up to 3 attempts, a single user request can generate 3 × 3 × 3 = 27 actual requests to a struggling downstream service. This turns a partial overload into a total collapse. I’ve watched this happen in production: a service running at 60% capacity received a burst of retries from a misbehaving upstream and immediately spiked to 200% load, failing every request, which caused more retries in a feedback loop.
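
The multiplication is easy to check. A minimal sketch, assuming every layer in the call chain independently makes the same maximum number of attempts (the function names are illustrative):

```python
def worst_case_requests(attempts_per_layer: int, layers: int) -> int:
    """Requests reaching the deepest service when every attempt at every layer fails."""
    return attempts_per_layer ** layers

def budgeted_amplification(budget_fraction: float, layers: int) -> float:
    """Upper bound on amplification when each layer caps retries at a fraction of its traffic."""
    return (1.0 + budget_fraction) ** layers

print(worst_case_requests(3, 3))                  # 27
print(round(budgeted_amplification(0.10, 3), 3))  # 1.331
```

Capping retries at 10% of traffic per layer bounds the worst case at about 1.33x instead of 27x, which is the idea behind a retry budget.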

The mitigations:

import time
import threading
from collections import deque

class RetryBudget:
    """
    Limit total retry rate as a fraction of total traffic.
    If retries exceed the budget, fail fast instead of retrying.
    
    Classic mitigation for retry amplification.
    """
    def __init__(self, budget_fraction: float = 0.10, window_seconds: int = 60):
        self.budget_fraction = budget_fraction
        self.window = window_seconds
        self.total_requests: deque = deque()
        self.retry_requests: deque = deque()
        self._lock = threading.Lock()

    def _prune(self):
        cutoff = time.monotonic() - self.window
        while self.total_requests and self.total_requests[0] < cutoff:
            self.total_requests.popleft()
        while self.retry_requests and self.retry_requests[0] < cutoff:
            self.retry_requests.popleft()

    def record_request(self):
        with self._lock:
            self.total_requests.append(time.monotonic())

    def should_retry(self) -> bool:
        """Returns True if we have retry budget remaining."""
        with self._lock:
            self._prune()
            total = len(self.total_requests)
            retries = len(self.retry_requests)
            if total == 0:
                return True
            current_rate = retries / total
            if current_rate < self.budget_fraction:
                self.retry_requests.append(time.monotonic())
                return True
            return False  # budget exhausted — fail fast, don't amplify


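class CircuitBreaker:
    """
    Stop sending requests to a failing downstream.
    Transitions: CLOSED -> OPEN -> HALF_OPEN -> CLOSED (or back to OPEN)

    Counts are cumulative since the last state change; a production breaker
    would use a rolling window. Not thread-safe; wrap calls in a lock if shared.
    """
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold: float = 0.5, cooldown_seconds: float = 30):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.state = self.CLOSED
        self.failures = 0
        self.total = 0
        self.opened_at: float | None = None

    def call_allowed(self) -> bool:
        if self.state == self.CLOSED:
            return True
        if self.state == self.OPEN:
            if time.monotonic() - self.opened_at > self.cooldown:
                self.state = self.HALF_OPEN
                return True  # cooldown elapsed: let a probe through
            return False  # fail fast
        return True  # HALF_OPEN: allow calls until a probe result is recorded

    def _open(self):
        self.state = self.OPEN
        self.opened_at = time.monotonic()
        self.failures = 0
        self.total = 0

    def record_success(self):
        if self.state == self.HALF_OPEN:
            # Probe succeeded: close the circuit and start a fresh count
            self.state = self.CLOSED
            self.failures = 0
            self.total = 0
        else:
            self.total += 1  # successes must count, or the failure ratio is always 1.0

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self._open()  # probe failed: reopen immediately
            return
        self.failures += 1
        self.total += 1
        if self.total >= 10 and self.failures / self.total >= self.failure_threshold:
            self._open()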

Hedged requests are often better than retries for latency problems. Instead of waiting for a timeout and retrying, fire a second request after a short delay (say, the p90 latency), accept whichever responds first, and cancel the other. Because only the slowest ~10% of requests are still outstanding at a p90 hedge delay, this adds roughly 10% extra load while cutting your tail exposure, which is far less aggressive than retrying after every timeout. Hedge only idempotent operations.
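
A minimal sketch of that pattern with asyncio follows. The names hedged_call, make_request, and fake_request are illustrative, and the hedge delay would come from your own measured latency rather than the constant used here. This version returns (or raises) whatever finishes first; a production version would fall back to the other request if the first to finish failed.

```python
import asyncio

async def hedged_call(make_request, hedge_delay: float):
    """Return the first result from a primary request and, if needed, one hedge.

    make_request: zero-argument callable returning a coroutine (must be idempotent).
    hedge_delay: seconds to wait before sending the hedge, e.g. the observed p90.
    """
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()  # fast path: no second request was sent

    hedge = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, hedge}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not keep consuming capacity
    return next(iter(done)).result()


# Demo with a fake backend: the first call is slow, the second is fast.
calls = []

async def fake_request():
    attempt = len(calls)
    calls.append(attempt)
    await asyncio.sleep(0.5 if attempt == 0 else 0.01)
    return f"attempt-{attempt}"

result = asyncio.run(hedged_call(fake_request, hedge_delay=0.05))
print(result, "after", len(calls), "requests")
```

In the demo the primary request is still outstanding after 50 ms, so the hedge is sent, wins, and the slow primary is cancelled.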


15. Synthetic Canaries in Production

Error rates and latency percentiles tell you what’s happening to real traffic but only after users are affected. Synthetic canaries fill the gap: background processes that continuously exercise your API end-to-end, giving you availability signal even at 3am when real traffic is low.

Key design decisions from production experience:

  • Test the full workflow, not just the health endpoint. A canary for a data API should create, read, update, and delete a record. One for an auth service should issue a token, validate it, and revoke it. Shallow canaries that only call GET /health will miss the exact failures that health check anti-patterns also miss.
  • Track first-attempt and final success separately. If your canary succeeds only on its second retry 90% of the time, the final success rate looks fine but something is quietly broken. First-attempt success rate catches this.
  • Keep canary observability separate from production. Mixing them has two failure modes: canary failures inflate your production error rate, and canary successes can mask production degradation if canaries hit warm caches or a separate code path.
  • Account for canary bias. Canaries hit warm caches and have predictable access patterns. Their p99 is almost always better than real user p99. Use canary latency to detect regressions relative to a baseline, not to claim absolute performance numbers.
  • Use retries in canaries, but with a limit. Up to 3 retries prevents false positives from transient network blips. But record the retry count per run, e.g., a canary that regularly needs 2+ retries is a signal worth investigating even if it eventually succeeds.
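
The bookkeeping for first-attempt versus final success can be sketched as follows. CanaryStats, run_canary, and flaky_workflow are hypothetical names, the workflow is any callable that raises on failure, and backoff between attempts is omitted to keep the sketch short:

```python
from dataclasses import dataclass

@dataclass
class CanaryStats:
    runs: int = 0
    first_attempt_ok: int = 0
    final_ok: int = 0
    total_retries: int = 0

    def record(self, succeeded: bool, retries_used: int) -> None:
        self.runs += 1
        self.total_retries += retries_used
        if succeeded:
            self.final_ok += 1
            if retries_used == 0:
                self.first_attempt_ok += 1

    def first_attempt_rate(self) -> float:
        return self.first_attempt_ok / self.runs if self.runs else 0.0

    def final_rate(self) -> float:
        return self.final_ok / self.runs if self.runs else 0.0


def run_canary(workflow, stats: CanaryStats, max_retries: int = 3) -> bool:
    """Run one canary workflow, retrying up to max_retries times on exception."""
    for attempt in range(max_retries + 1):
        try:
            workflow()
            stats.record(succeeded=True, retries_used=attempt)
            return True
        except Exception:
            continue
    stats.record(succeeded=False, retries_used=max_retries)
    return False


# Demo: a workflow that always fails once before succeeding.
state = {"calls": 0}

def flaky_workflow():
    state["calls"] += 1
    if state["calls"] % 2 == 1:
        raise RuntimeError("transient failure")

stats = CanaryStats()
for _ in range(10):
    run_canary(flaky_workflow, stats)
print(f"final success: {stats.final_rate():.0%}, first-attempt: {stats.first_attempt_rate():.0%}")
```

In the demo the final success rate is 100% while the first-attempt rate is 0%, which is exactly the quietly broken case that a single success metric hides.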

16. Putting It All Together: A Layered Monitoring Strategy

After decades of building and operating distributed systems, here’s the monitoring architecture I’d deploy for any production service from day one:

Metric               | Why                    | Window | Alert Threshold
5xx rate             | Server failures        | 1 min  | > 0.1%
p99 latency          | Tail experience, SLA   | 1 min  | > SLA value
Request volume       | Silent failures        | 1 min  | Drop > 50%
tm99 latency         | Bulk experience        | 5 min  | Trending up
TM(99%:) latency     | Outlier watchdog       | 5 min  | Trending up
Error budget burn    | SLO health             | 1 hr   | > 2x expected rate
p99.9 latency        | Overload early warning | 15 min | Trending up
Retry rate           | Amplification risk     | 5 min  | > 10% of traffic
Canary first-attempt | End-to-end health      | 1 min  | < 95%
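
For the error budget row, one common way to express burn is as a multiple of the error rate the SLO allows. The sketch below assumes a 99.9% SLO and reads the "expected rate" as a burn rate of 1.0, meaning the budget is spent exactly on schedule; both are assumptions for illustration:

```python
def burn_rate(errors: int, requests: int, slo: float = 0.999) -> float:
    """How fast the error budget is being spent: 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo
    return (errors / requests) / allowed_error_rate

# Hypothetical hour of traffic: 30 server errors out of 10,000 requests.
rate = burn_rate(errors=30, requests=10_000)
print(f"burn rate: {rate:.1f}x", "ALERT" if rate > 2.0 else "ok")
```

A 0.3% error rate against a 0.1% allowance burns budget three times faster than planned, so this window would trip the 2x threshold.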

Closing: The Number That Matters Most

After all of this, the insight that has most changed how I think about availability is this: your users don’t experience your MTTR. They experience a version of it weighted by how long outages last, which skews dramatically toward your worst events. A service with a 1-minute median recovery but occasional 2-hour outages will have customers experiencing something closer to hours, not minutes. The variance in your tail events matters more than the central tendency. This is why the tail cannot be trimmed away from your visibility. Build observability that shows you the tail. Use redundancy and retries but understand how they amplify under pressure. Run canaries that exercise the whole path. Track user errors and server errors separately. Keep SLO burn rate visible so you always know how much budget you’ve spent. And when your customers say the service is slow and your dashboard says everything is green, believe the customers.
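
One way to read "weighted by how long outages last": a user who hits downtime at a random moment is more likely to land in a long outage, in proportion to its length. A sketch with made-up outage durations (the numbers are hypothetical, not from any real incident history):

```python
from statistics import mean, median

def duration_weighted_mean(outage_minutes):
    """Average outage length weighted by how long each outage lasted."""
    return sum(d * d for d in outage_minutes) / sum(outage_minutes)

# Hypothetical year: 19 one-minute blips and a single two-hour outage.
outages = [1] * 19 + [120]
print(f"median recovery:        {median(outages):.1f} min")
print(f"mean recovery (MTTR):   {mean(outages):.2f} min")
print(f"duration-weighted mean: {duration_weighted_mean(outages):.1f} min")
```

With these numbers the median is 1 minute and the MTTR is under 7 minutes, but the outage a user actually lands in averages about 104 minutes.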

June 19, 2026

Making Bad State Impossible: A Practical Guide to ADTs and Algebraic Effects

Filed under: Computing,Concurrency — admin @ 9:46 pm

I. Introduction

Debugging production incidents is much harder in a system with complex state management. For example, you might see a worker node that is simultaneously “draining” and “upgrading” while flagged as “ready to restart,” or a heartbeat buffer that filled with 100,000 metrics and silently dropped the overflow. In other cases, you might see a config deployment that shows “success” in the database but never actually deployed because the error got swallowed by .catch(NOOP) somewhere. I’ve seen this in most legacy codebases I’ve worked on, e.g., in one system I found:

  • 441 instances of .catch(NOOP): errors silently swallowed
  • 506 mode checks: scattered everywhere, e.g., if (isLeader)... else if (isWorker)...
  • 64 possible boolean combinations: for worker state, of which only 5 are valid
  • Race conditions: in shared state with no synchronization
  • 816 files: coupled to global singletons

Here is the core thesis: most production incidents aren’t algorithmic bugs. They’re states that shouldn’t exist. The system entered a configuration nobody intended, no test covered, and no monitoring caught. Algebraic Data Types (ADTs) and Algebraic Effects are the tools that make those impossible states unrepresentable in code. Not “less likely” or “caught by tests,” but impossible to express.


II. What Are Algebraic Data Types?

Forget the word “algebraic” for a moment. It just means “composed of parts using AND and OR.” That’s it.

Product Types: AND

A product type is a structure where ALL fields must be present at the same time. You use these every day:

struct WorkerConnection {
    id: String,
    address: String,
    port: u16,
    last_heartbeat: Instant,
}

Every WorkerConnection has an id AND an address AND a port AND a last_heartbeat. It’s called “product” because the number of possible values is the product of each field’s possibilities.

Sum Types: OR

A sum type is a value that is ONE of several variants. This is the powerful one most codebases miss:

enum TrafficLight {
    Red,
    Yellow,
    Green,
}

A traffic light is Red OR Yellow OR Green. It is never Red AND Green at the same time. It’s called “sum” because the number of possible values is the sum of the possibilities of each variant. The critical feature is exhaustiveness checking. When you pattern-match on a sum type, the compiler forces you to handle every variant. Add a new one and the compiler shows you every place that needs updating:

fn action(light: &TrafficLight) -> &str {
    match light {
        TrafficLight::Red => "stop",
        TrafficLight::Yellow => "caution",
        TrafficLight::Green => "go",
        // Add FlashingRed and this won't compile until you handle it here
    }
}

Why This Matters: Making Illegal States Unrepresentable

Here’s the practical payoff. Look at actual legacy code managing worker nodes:

// Legacy pattern
struct WorkerNode {
    current_action: Option<ExclusiveAction>,
    reconfig_in_progress: Option<ClusterRequest>,
    upgrade_in_progress: bool,
    draining: bool,
    allow_restart: bool,
    restart_on_exit: bool,
}

Six independent fields, each of which can at least be set or unset (two are Options, four are booleans). That’s at least 2^6 = 64 possible combinations. But the system only has about 5 valid states: idle, configuring, upgrading, draining, or restarting. The other 59 combinations are bugs waiting to happen. What does upgrade_in_progress = true AND draining = true AND reconfig_in_progress = Some(request) mean? Nobody knows and no test covers it. Now the same thing as a Rust enum:

enum WorkerState {
    Idle,
    Configuring { request: ClusterRequest },
    Upgrading { version: String },
    Draining { reason: String },
    Restarting,
}

Five states, and the 59 invalid combinations literally cannot be expressed. You cannot write code that puts the worker in an invalid state because that code won’t compile. This isn’t about “good practice.” It’s about making an entire class of bugs impossible at compile time. The compiler becomes your 24/7 code reviewer, rejecting every impossible state before the code ever runs.

It Costs Real Money

  • Double settlement in banking: A payment system tracks settlement with isAuthorized, isSettled, isReversed. A race condition sets both isSettled = true and isReversed = true at the same time. Result: the same transaction is both settled and reversed so money moves twice. With a sum type (Authorized | Settled | Reversed | Disputed), that combination cannot exist.
  • Ghost billing in telecom: A session tracker uses isActive, isBilled, isTerminated. A network glitch terminates the session but the billing flag was set a millisecond before termination. Result: terminated sessions generate charges for hours. With a sum type (Active { startTime } | Terminated { endTime } | Billed { amount, endTime }), a terminated session cannot be in a billable state.

These aren’t hypothetical. They’re the kind of bugs that cost millions in reconciliation and regulatory fines. The root cause is always the same: boolean flags that allow impossible combinations.

Immutability Makes This Even Better

When state is immutable, you can’t accidentally corrupt it from another part of the code. But how do you “change” immutable data? You copy it:

fn update_progress(state: &JobState, new_progress: u8) -> JobState {
    JobState {
        progress: new_progress,
        updated_at: Instant::now(),
        ..state.clone()  // copy everything else
    }
}

// Assumes JobState derives Clone and has the four fields used here.
let state1 = JobState { phase: Phase::Running, progress: 50, worker_id: "w-1".into(), updated_at: Instant::now() };
let state2 = update_progress(&state1, 75);
// state1.progress is still 50 — no other code sees a half-updated state

In Rust, this is enforced by the ownership system: you can have either one mutable reference OR many immutable references. Data races on shared state become a compile error, not a runtime bug.


III. ADTs Applied to Real Problems

Problem 1: Mode Detection Hell

Production systems support multiple deployment modes: leader, worker, edge, standalone. The result in the legacy codebase? Mode checks everywhere:

// 500+ instances of this scattered throughout
const configHelperMode = ProcessInfo.isConfigHelperMode();
const workerProcessMode = ProcessInfo.isWorkerMode();
const apiProcessMode = !configHelperMode && !workerProcessMode;

if (configHelperMode) { return runConfigHelper(...); }
if (workerProcessMode) { return ProcessMgr.initWorkerProcess(...); }
if (ServiceInfo.isService(role)) { return Service.initServiceProcess(...); }
if (isProxyNode(distMode)) { /* ... */ }
if (isSearchSupervisor(distMode)) { /* ... */ }
if (isLeader) { /* ... */ }
else if (isManaged(distMode)) { /* ... */ }
else if (isStandalone(distMode)) { /* ... */ }

The problems: adding a new mode requires finding and updating all 506 sites, missing one means silent incorrect behavior, and it’s easy to create contradictory states (isLeader && isWorker). The fix: one decision point at startup, exhaustive matching everywhere else:

enum AppMode {
    Leader { config: LeaderConfig },
    Worker { leader_id: String },
    Edge { leader_id: String },
    Standalone,
    ConfigHelper { group_id: String },
    SearchSupervisor { cluster_id: String },
}

// ONE place where mode is determined — at startup
fn determine_mode(env: &Environment) -> AppMode { ... }

// EVERYWHERE else — exhaustive matching
fn bootstrap(mode: AppMode) -> Application {
    match mode {
        AppMode::Leader { config } => bootstrap_leader(config),
        AppMode::Worker { leader_id } => bootstrap_worker(&leader_id),
        AppMode::Edge { leader_id } => bootstrap_edge(&leader_id),
        AppMode::Standalone => bootstrap_standalone(),
        AppMode::ConfigHelper { group_id } => bootstrap_config_helper(&group_id),
        AppMode::SearchSupervisor { cluster_id } => bootstrap_search(&cluster_id),
    }
}

Add a new mode and the compiler immediately shows you every match that needs a new arm. Miss one? Compilation fails. This is what “compiler-guided refactoring” means in practice.

Problem 2: Operations That Partially Succeed

One of the most dangerous patterns I’ve seen: multi-step operations without atomic boundaries. I wrote about it in Transaction Boundaries: The Foundation of Reliable Systems. Here is an example:

// Config updated BEFORE deployment succeeds
groupConf.configVersion = hash;   // Step 1: mutate config
await this.update(groupConf);      // Step 2: persist to database
await cm.deploy();                 // Step 3: actually deploy

// If step 3 fails: database says "deployed" but nothing deployed.
// State is permanently inconsistent. Nobody notices until 2am.

Another version of the same problem:

// Package manager — loop continues after failure
for (const op of ops) {
    try {
        switch (op.type) {
            case 'install': await this.install(op.pack); break;
            case 'uninstall': await this.uninstall(op.pack); break;
        }
    } catch(e) {
        errors.push(e);  // collect error but CONTINUE the loop
    }
}
await this.save();  // save regardless — partially applied state!

The typestate pattern uses types to enforce operation ordering. Each step produces a different type, and the next step only accepts the correct input type:

// Each phase is a distinct type — not an enum, separate structs
struct Planned { operations: Vec<Operation> }
struct Validated { operations: Vec<ValidOperation>, checks: Vec<CheckResult> }
struct Applied { results: Vec<OperationResult> }
struct Committed { hash: String, timestamp: Instant }

// Functions consume one type, return the next
fn validate(tx: Planned) -> Result<Validated, Vec<ValidationError>> { ... }
fn apply(tx: Validated) -> Result<Applied, ApplyError> { ... }
fn commit(tx: Applied) -> Result<Committed, CommitError> { ... }

// You cannot call commit() on a Planned transaction.
// The types won't allow it.
// And because validate() CONSUMES Planned, you can't reuse the old value.

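If apply fails, you get an ApplyError and never an Applied, so there is nothing to pass to commit. There’s no half-committed state because the type system won’t let you call commit without a successful apply. Note that apply consumes the Validated value, so to retry after a failure you need to clone it beforehand or have ApplyError hand it back.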

Problem 3: Silently Swallowed Errors

441 instances of .catch(NOOP) in production. Each one is a failure that nobody notices until the system is in an inconsistent state:

this.reconcileLbIfStandalone(req.body).catch(NOOP);  // load balancer fails silently
unlink(bundlePath).catch(NOOP);                       // file deletion fails silently
dest.connect().catch(NOOP);                           // connection fails silently

The problem isn’t laziness. Promise/exception-based error handling makes it easy to ignore errors and hard to handle them consistently. Rust’s Result type inverts this: handling errors is the default path, and ignoring them requires explicit effort:

// Every operation returns Result — no hidden exceptions
async fn reconcile_lb(body: &Request) -> Result<LbState, ReconcileError> {
    let state = do_reconcile(body).await
        .map_err(|e| classify_error(e))?;  // ? propagates errors up — visible in the code
    Ok(state)
}

// Caller MUST handle the Result
let lb_state = reconcile_lb(&req.body).await?;
// If we reach this line, it succeeded. Guaranteed.

// Want to explicitly ignore? You have to WRITE that intention:
let _ = reconcile_lb(&req.body).await;  // "I know this can fail and I don't care"

The key insight: with Result, ignoring an error requires writing code to ignore it (an unused Result triggers the compiler’s unused_must_use warning). With exceptions, ignoring an error requires writing nothing. Defaults matter enormously. The ? operator makes propagating errors as easy as typing one character: no try/catch boilerplate, no .catch(NOOP) temptation.

Problem 4: Swapped Arguments and Primitive Obsession

The legacy codebase uses raw strings and numbers for everything like IDs, tokens, keys. Nothing stops you from passing arguments in the wrong order:

// 4,000+ uses of untyped parameters
fn send_request_to_worker(wid: &str, req: &str, body: &[u8]) { ... }
// What stops you from passing (request_id, worker_id, wrong_body)? Nothing.

Rust newtypes create distinct types with zero runtime cost:

struct WorkerId(String);
struct RequestId(String);
struct AuthToken(String);

fn send_request(worker_id: &WorkerId, request_id: &RequestId, body: &RequestBody) { ... }

// Now the compiler catches this:
send_request(&request_id, &worker_id, &body);  // COMPILE ERROR
// expected `&WorkerId`, found `&RequestId`

And smart constructors validate at the boundary, so the type carries the guarantee everywhere:

impl WorkerId {
    pub fn new(raw: &str) -> Result<Self, ValidationError> {
        if !WORKER_ID_PATTERN.is_match(raw) {
            return Err(ValidationError::InvalidFormat("worker ID"));
        }
        Ok(WorkerId(raw.to_string()))
    }
}
// Once you have a WorkerId, you KNOW it's valid. No re-validation needed anywhere.

Problem 5: Every Process Carries Everything

The legacy system scaled by spawning full OS processes because there was no type-safe way to separate workloads:

// Every worker loads the FULL binary — all 150 connectors, all modes
// Even edge nodes carry leader code they'll never use
// Default: 2GB heap per worker
this.env.NODE_OPTIONS = `--max-old-space-size=${heapSizeMB || 2048}`;

// 4 workers × 2GB = 8GB minimum. Plus API process, services...
// Competitors: Fluent Bit (10-30MB), Vector (30-50MB)

With typed resource boundaries, each workload declares exactly what it needs:

enum WorkloadProfile {
    IoBound { connections: usize, buffer_size: Bytes },
    CpuBound { parallelism: usize, memory_budget: Bytes },
    Mixed { io_weight: f32, cpu_weight: f32 },
}

enum ResourceClaim {
    Lightweight { max_memory_mb: u32, max_cpu_cores: f32 },
    Standard { max_memory_mb: u32, max_cpu_cores: f32 },
    Heavy { max_memory_mb: u32, max_cpu_cores: f32 },
}

fn resources_for(pipeline: &PipelineConfig) -> ResourceClaim {
    match analyze_workload(pipeline) {
        WorkloadProfile::IoBound { .. } =>
            ResourceClaim::Lightweight { max_memory_mb: 64, max_cpu_cores: 0.5 },
        WorkloadProfile::CpuBound { .. } =>
            ResourceClaim::Heavy { max_memory_mb: 2048, max_cpu_cores: 4.0 },
        WorkloadProfile::Mixed { .. } =>
            ResourceClaim::Standard { max_memory_mb: 512, max_cpu_cores: 2.0 },
    }
}

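Instead of “every process gets everything,” each workload gets exactly what it declares. Resource requirements are now visible and auditable, and the exhaustive match guarantees every workload profile maps to an explicit claim. Enforcing those limits at runtime is still the job of whatever launches the workload.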

Problem 6: Inheritance Hierarchies Nobody Understands

The legacy codebase had class hierarchies 7 levels deep:

BaseServiceable                // 100+ subclasses, forces EventEmitter
  --> BaseInput
    --> TcpInput
      --> FramedProtocol         // Framing, auth, metrics, load balancing — all mixed
        --> ControlListener
          --> ProxyListener      // 760 lines of proxy logic inheriting ~4,500 lines it doesn't use

Reading ProxyListener meant understanding 6 parent classes first. And there were 12 cloud storage subclasses that were entirely empty; they inherited ~5K lines and added exactly zero:

export class ProviderAOut extends CloudStorageOutput {}  // empty
export class ProviderBOut extends CloudStorageOutput {}  // empty
export class ProviderCOut extends CloudStorageOutput {}  // empty

The fix: composition with enums instead of inheritance:

enum S3Provider {
    Aws { region: String },
    Storj { gateway: String },
    Backblaze { region: String },
    Wasabi { region: String },
    Minio { endpoint: String },
}

fn create_s3_client(provider: &S3Provider) -> S3Client {
    match provider {
        S3Provider::Aws { region } => S3Client::new().region(region),
        S3Provider::Storj { gateway } => S3Client::new().endpoint(gateway),
        S3Provider::Backblaze { region } =>
            S3Client::new().endpoint(&format!("s3.{region}.backblazeb2.com")),
        S3Provider::Wasabi { region } => S3Client::new().endpoint(&format!("s3.{region}.wasabisys.com")),
        S3Provider::Minio { endpoint } => S3Client::new().endpoint(endpoint),
    }
}

No inheritance and no empty subclasses. Adding a new provider means adding a variant to the enum and the compiler shows you every match that needs a new arm. See my earlier blog The Reusability Trap: When DRY Becomes a Liability for more details on this anti-pattern.


IV. ADTs Applied to Concurrency

Race Conditions in Shared Mutable State

Here’s actual production code where multiple async operations read and write the same map:

private conns: { [key: string]: Connection } = {};

// Called by the service loop (runs periodically)
private async _service() {
    const values = Object.values(this.conns);
    for (const conn of values) {
        if (conn.isStale()) {
            delete this.conns[conn.key];  // Mutate while potentially being read elsewhere
        }
    }
}

// Called when a new node connects (can happen any time)
private addConnection(connKey: string, data: INodeEntry): boolean {
    this.conns[connKey] = conn;  // Race with _service()!
    this.assignToGroup(conn)
        .catch(LOG_ERR(logger, 'failed to assign'));
    return true;
}

And the classic read-modify-write race:

prevState = await this.getState(key);       // Process A reads state
// ... Process B also reads state here ...
// ... Process A modifies and writes ...
await this.store.set(key, newState);         // Process B writes — A's changes LOST

The fix: a single owner of state, communicating through typed messages, as in the actor model:

enum ConnectionCommand {
    Add { key: String, conn: Connection },
    Remove { key: String },
    RemoveStale,
    GetAll { reply: oneshot::Sender<Vec<Connection>> },
}

// Single owner — only this task can access `conns`
async fn connection_manager(mut inbox: mpsc::Receiver<ConnectionCommand>) {
    let mut conns: HashMap<String, Connection> = HashMap::new();

    while let Some(cmd) = inbox.recv().await {
        match cmd {
            ConnectionCommand::Add { key, conn } => { conns.insert(key, conn); }
            ConnectionCommand::Remove { key } => { conns.remove(&key); }
            ConnectionCommand::RemoveStale => { conns.retain(|_, conn| !conn.is_stale()); }
            ConnectionCommand::GetAll { reply } => {
                let _ = reply.send(conns.values().cloned().collect());
            }
        }
    }
}

No mutexes, locks, or data races. Rust’s ownership system guarantees conns is owned by exactly one task. Other tasks communicate through the channel; they cannot access the HashMap directly because they don’t own it.

Backpressure: Making Buffer Overflow Impossible to Ignore

The legacy heartbeat system silently dropped metrics when its buffer filled:

add(metric: MetricPacket, doNotDrop: boolean): void {
    if (this.hbMetrics.length > this.maxHbMetrics) {
        this.packetCounter.onDroppedMetric();  // Increment a counter nobody watches
        return;  // Data gone forever. No error. No signal to sender.
    }
    this.hbMetrics.push(metric);
}

The sender had no idea data was being lost. It kept sending happily while the system silently degraded. With bounded channels in Rust (the code here assumes a Tokio-style mpsc channel), backpressure is built in. When the buffer is full, you must decide what to do:

match tx.try_send(metric) {
    Ok(()) => { /* sent */ }
    Err(TrySendError::Full(metric)) => {
        // Channel is full — you MUST decide:
        // Option 1: wait (applies backpressure to sender)
        tx.send(metric).await?;
        // Option 2: spill to disk
        // disk_buffer.write(metric)?;
        // Option 3: drop with explicit acknowledgment
        // warn!("Metric dropped due to backpressure");
    }
    Err(TrySendError::Closed(_)) => {
        error!("Metrics channel closed unexpectedly");
        return Err(ChannelError::Closed);
    }
}

The type system forces the conversation: “What should happen when the buffer is full?” You can’t drop data by accident; dropping it requires writing explicit code.

Event Sourcing: Eliminating Lost Updates

Instead of mutable state that can be overwritten by concurrent operations, event sourcing treats state as a derived value from an append-only log:

enum JobEvent {
    Created { job_id: String, config: JobConfig, at: Instant },
    Started { worker_id: String, at: Instant },
    Progressed { percentage: u8, at: Instant },
    Completed { result: JobResult, at: Instant },
    Failed { error: ErrorInfo, retryable: bool, at: Instant },
}

// State is derived — never directly mutated
fn derive_state(events: &[JobEvent]) -> JobState {
    events.iter().fold(initial_state(events), apply_event)
}

fn apply_event(state: JobState, event: &JobEvent) -> JobState {
    match (state, event) {
        (JobState::Pending { .. }, JobEvent::Started { worker_id, .. }) =>
            JobState::Running { worker_id: worker_id.clone(), progress: 0 },
        (JobState::Running { worker_id, .. }, JobEvent::Progressed { percentage, .. }) =>
            JobState::Running { worker_id, progress: *percentage },
        (state, _) => state,  // Invalid transition — state unchanged
    }
}

No lost updates because events are appended, never overwritten. Invalid transitions are no-ops: the fold simply ignores events that don’t make sense for the current state.

Message Ordering: Protocol State Machines

The legacy system sent commands from leader to worker with no ordering guarantees:

// Leader sends: 1. configure, 2. upgrade
// Worker may RECEIVE: 1. upgrade, 2. configure (reversed!)
// Result: config applied AFTER upgrade — potential data corruption

// Current "fix": reject conflicting operations
private failOnConflictingOperation() {
    if (this.currentAction) {
        throw new ConflictingActionError();  // Command REJECTED, not queued!
    }
}
// No command queue. No ordering. No acknowledgment.
// Leader has NO WAY to know if the worker processed the command.

A typed protocol state machine rejects invalid command sequences explicitly instead of applying them:

#[derive(Clone, Debug)]
enum NodePhase { Idle, Configured, Upgrading, Draining }

fn apply_command(state: ProtocolState, cmd: &Command) -> Result<ProtocolState, ProtocolError> {
    let seq = cmd.seq();
    if seq != state.last_applied_seq + 1 {
        return Err(ProtocolError::OutOfOrder { expected: state.last_applied_seq + 1, got: seq });
    }
    // Decide the next phase first; the borrow of state.phase ends with the match.
    let next_phase = match (&state.phase, cmd) {
        (NodePhase::Idle | NodePhase::Configured, Command::Configure { .. }) => NodePhase::Configured,
        (NodePhase::Configured, Command::Upgrade { .. }) => NodePhase::Upgrading,
        (NodePhase::Idle | NodePhase::Configured, Command::Drain { .. }) => NodePhase::Draining,
        (phase, cmd) =>
            return Err(ProtocolError::InvalidTransition { from: phase.clone(), command: cmd.name() }),
    };
    // Advance the sequence number along with the phase, or the next command
    // would always be rejected as out of order.
    Ok(ProtocolState { phase: next_phase, last_applied_seq: seq, ..state })
}

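The system cannot apply an upgrade before configuration because the match on (current_phase, command) rejects it with an InvalidTransition error. Because the match must be exhaustive, every (phase, command) pair is either listed explicitly or falls through to the rejecting arm, so nothing is applied by accident.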

RAII: Locks That Can’t Leak

The legacy system used file-based locks with no timeouts or heartbeats:

// If the process crashes while holding this lock, it's stuck forever
static async acquireConfigUpdateLock(dir: string): Promise<void> {
    if (!(await acquireLock(dir, CONFIG_UPDATE_LOCK_NAME))) {
        throw new AppError('Failed to acquire config update lock.');
    }
    // No timeout. No heartbeat. Crash = lock held forever.
}

In Rust, RAII (Resource Acquisition Is Initialization) ties the release to the guard’s lifetime, so you cannot forget to unlock on any normal exit path:

struct ConfigLock {
    path: PathBuf,
    acquired_at: Instant,
    ttl: Duration,
}

impl Drop for ConfigLock {
    fn drop(&mut self) {
        // Automatically called when ConfigLock goes out of scope — even on panic!
        let _ = std::fs::remove_file(&self.path);
    }
}

async fn with_config_lock<T, F>(resource: &str, ttl: Duration, f: F) -> Result<T, LockError>
where F: FnOnce(&ConfigLock) -> Result<T, LockError>
{
    let lock = acquire_lock(resource, ttl).await?;
    f(&lock)
    // lock dropped here automatically — file released no matter what
}

let result = with_config_lock("config-update", Duration::from_secs(30), |_lock| {
    extract_bundle(&dir)?;
    save_system(&dir)?;
    Ok("deployed")
}).await?;
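// Lock released by this point: on success, on an early Err return, or during panic unwinding

The lock cannot be forgotten, because Drop::drop() runs whenever the guard goes out of scope, including during unwinding. Drop does not run if the process is killed or aborts, and that case is what the ttl field is for: another process can treat a lock older than its TTL as stale and reclaim it.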

Serialization: Schema Evolution as an ADT

The legacy heartbeat system used JSON serialization for 100,000+ metrics per heartbeat:

// JSON.parse for 100K metrics: ~500ms–1s
// With a 10s heartbeat interval, serialization alone eats 5–10% of your cycle time
// And there's no versioning — if the schema changes, old and new nodes break silently

With Rust enums, the protocol schema is defined once and versioning is a first-class concern:

enum HeartbeatMessage {
    V1 { metrics: Vec<MetricV1> },
    V2 { metrics: Vec<MetricV2>, deltas: Vec<DeltaMetric> },  // added delta support
}

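// Schema evolution is an enum: every version must be explicitly handled
fn parse_heartbeat(data: &[u8]) -> Result<HeartbeatMessage, ParseError> {
    // split_first avoids the panic that data[0] would cause on an empty buffer.
    // ParseError::Empty is an assumed variant for that case.
    let (&version, body) = data.split_first().ok_or(ParseError::Empty)?;
    match version {
        1 => parse_v1(body),
        2 => parse_v2(body),
        _ => Err(ParseError::UnknownVersion(version)),
        // Add v3? Add an arm here; the compiler then flags every match on
        // HeartbeatMessage that does not handle the new variant.
    }
}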

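With a binary format such as Protobuf or FlatBuffers, deserialization can run 10–100x faster than JSON for payloads like this. FlatBuffers additionally allows zero-copy access, while Protobuf still decodes into objects. Schema evolution is no longer an afterthought: the enum ensures every protocol version is explicitly handled.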


V. What Are Algebraic Effects?

ADTs solve the problem of representing valid states. Algebraic Effects solve a different but related problem: how to separate what code needs from how those needs are fulfilled without forcing that separation to infect every caller in the chain.

The Intuition: Exceptions That Can Resume

You already understand exceptions. When you throw, execution stops and the stack unwinds:

function getName() {
    throw new Error("need a name");  // Execution stops. Stack unwinds. Gone.
}

try {
    getName();
} catch (e) {
    // We're here, but getName() is DEAD. We can't go back.
}

Now imagine if, instead of killing getName(), the handler could answer the question and let it continue:

function getName() {
    const name = perform AskUser("What's your name?");  // Pause, don't die
    return `Hello, ${name}`;  // Continues after handler responds!
}

handle(getName(), {
    AskUser: (question, resume) => {
        const answer = prompt(question);
        resume(answer);  // Jump BACK into getName() with the answer
    }
});

That’s algebraic effects in one sentence: exceptions that can resume. The code that performs an effect doesn’t die; instead it pauses, gets an answer, and continues where it left off. You can think of it this way: regular exceptions are like quitting your job when you have a question. Effects are like asking your manager: you pause, they answer, you continue.

The Function Coloring Problem

Here’s why effects matter for real systems. Once a function is async, everything that calls it must also be async:

async function getConfig(): Promise<Config> { ... }
async function processEvent(e: Event): Promise<void> {  // must be async because getConfig is
    const config = await getConfig();
    // ...
}
async function handleRequest(req: Request): Promise<Response> {  // must be async because processEvent is
    await processEvent(req.body);
    // ...
}

One async function forces asyncness through the entire call stack. This is generally called “function coloring”: async and sync functions are different “colors” and they can’t mix freely. The same problem applies to error handling (once you use Result, every caller must handle it), to dependencies (once you need config, every caller must thread it through), and to logging (once you need a logger, every intermediate function must pass it along). Effects solve this by separating what a function needs from who provides it. Intermediate functions stay uncolored:

// With effects (conceptual syntax):
function getConfig(): Effect<ConfigService, Config> {
    return perform GetConfig;
}

function processEvent(e: Event): Effect<ConfigService, void> {
    const config = getConfig();  // NOT async! Just performs an effect.
    transform(e, config);
}

// Only the TOP-LEVEL handler knows how config is provided:
handle(processEvent(event), {
    GetConfig: (resume) => {
        const config = loadFromDisk();  // or from env, or hardcoded for tests
        resume(config);
    }
});

processEvent doesn’t know or care whether config comes from disk, network, or a test fixture. The handler at the boundary decides. Intermediate functions don’t need to thread the dependency through.

You Already Use Effects

If you use React, you’re already working with algebraic effects in disguise. React Hooks are effects:

function Counter() {
    const [count, setCount] = useState(0);  // "perform GetState" — component doesn't manage storage
    useEffect(() => { ... });               // "perform ScheduleSideEffect"
    const data = use(fetchData());          // "perform Suspend"
    return <div>{count}</div>;
}

useState doesn’t tell the component where state lives. It performs an effect (“I need state”), and the React runtime acts as the handler that provides it. The component doesn’t know if state is in memory, in a reducer, or synced to a server. React Suspense is essentially “throw, then re-render”:

// Simplified React Suspense:
function fetchData() {
    if (!cache.has(key)) {
        throw promise;  // "perform Suspend" — throws a Promise UP the tree
        // React catches it, shows fallback, waits for promise to resolve,
        // then RE-RENDERS the component — effectively "resuming" it with data
    }
    return cache.get(key);
}

This approximates the algebraic effects pattern: code performs an effect (throws a Promise), a handler catches it (the Suspense boundary), and the code is run again (re-rendered) once the result is available. It is not a true resumption, because the component function restarts from the top rather than continuing from the throw. React couldn’t add real algebraic effects to JavaScript, so they simulated them with throw/re-render.

Everything Is the Same Control Flow Mechanism

Look at these seemingly different language features:

Feature | “Perform” | “Handle” | “Resume”
--- | --- | --- | ---
Exceptions | throw error | try/catch | None (can’t resume)
Async/Await | await promise | Runtime scheduler | Resolves with value
Generators | yield value | for..of consumer | .next(value)
React Hooks | useState() | React runtime | Re-render with state
DI Container | @Inject | Container config | Constructor call
Algebraic Effects | perform effect | handle block | resume(value)

They’re all the same pattern: (1) code declares “I need something,” (2) something up the call stack provides it, (3) execution continues with the provided value. Algebraic effects are just the general version that unifies all the others. The historical arc of control flow in programming languages tells the same story:

goto --> structured control (if/while) --> exceptions --> continuations --> algebraic effects

Each step gives more structured, more composable control over program flow.

The Monad Infection Problem

If you’ve used functional languages, you know what happens once you use Result, Option, Future, or IO: every function in the chain must return that type:

fn get_config() -> Result<Config, Error> { ... }
fn parse_event(config: &Config) -> Result<Event, Error> { ... }
fn validate(event: &Event) -> Result<ValidEvent, Error> { ... }

fn process() -> Result<Output, Error> {
    let config = get_config()?;
    let event = parse_event(&config)?;
    let valid = validate(&event)?;
    Ok(transform(valid))
}

Once one function returns Result<T, E>, everything up the chain must acknowledge it. This is the same coloring problem as async, just with error types. Effects solve this: the function just performs the effect, and a single handler at the top decides what to do. Intermediate functions stay clean.

For example, Jane Street’s hardware simulation team switched from monads to OCaml 5’s algebraic effects for exactly this reason. Their testbench code had to synchronize threads stepping through clock cycles. With monads, every function needed special let%bind syntax and couldn’t use normal OCaml features. With effects:

(* Effect declaration: performing Step suspends the testbench *)
type _ Effect.t += Step : unit Effect.t
let step () = Effect.perform Step

(* Business logic is PLAIN OCaml, no special syntax *)
let run_testbench () =
    let clk = read_signal clock in
    step ();                           (* "perform Step": suspend until next clock cycle *)
    let data = read_signal data_bus in
    assert (data = expected);
    step ();                           (* Step again: handler resumes us at next cycle *)
    write_signal reset 1

(* Handler provides the simulation scheduler *)
let simulate circuit testbench =
    Effect.Deep.try_with testbench () {
        effc = (fun (type a) (eff : a Effect.t) ->
            match eff with
            | Step -> Some (fun (k : (a, _) Effect.Deep.continuation) ->
                advance_circuit circuit;       (* Tick the simulated hardware *)
                Effect.Deep.continue k ()      (* Resume testbench at next line *)
              )
            | _ -> None                        (* Not ours: leave it to an outer handler *)
        )
    }

The testbench reads like sequential code without monadic boilerplate. The step() call suspends execution, the handler advances the simulated hardware clock, and execution resumes.

Effects in Languages You Use Today

You don’t need OCaml 5 or Koka. Effects can be approximated in most mainstream languages. In TypeScript using generator functions:

function* processEvent(event: RawEvent) {
    const config = yield { effect: 'getConfig' };           // "perform GetConfig"
    const enabled = yield { effect: 'checkFlag', flag: 'v2' }; // "perform CheckFlag"
    yield { effect: 'log', msg: 'processing' };             // "perform Log"
    return transform(event, config);
}

// Handler interprets the effects
function runWithHandler(gen, handlers) {
    let result = gen.next();
    while (!result.done) {
        const effect = result.value;
        const value = handlers[effect.effect](effect);  // "resume with value"
        result = gen.next(value);
    }
    return result.value;
}

// Production vs test — trivially swapped
const prodResult = runWithHandler(processEvent(event), productionHandlers);
const testResult = runWithHandler(processEvent(event), testHandlers);

In Python using context variables:

from contextvars import ContextVar

config_effect: ContextVar[Config] = ContextVar('config')
metrics_effect: ContextVar[MetricsCollector] = ContextVar('metrics')

def process_event(event):
    config = config_effect.get()      # "perform GetConfig"
    metrics = metrics_effect.get()    # "perform GetMetrics"
    return transform(event, config)

# Handler provides implementations at the boundary
config_effect.set(production_config)
metrics_effect.set(prometheus_collector)
result = process_event(event)

VI. Algebraic Effects Applied to Real Problems

Problem 1: Dependency Injection Without a Framework

The legacy codebase had 816 files coupled to global singletons:

// Configuration.instance() called in 858 files
// ProcessInfo singleton accessed in 320 files
// GlobalMetrics singleton in 200+ files
// FeatureFlags singleton in 186 files

class WorkerConnection {
    async configure() {
        const config = Configuration.instance();     // Hidden dependency
        const metrics = GlobalMetrics.instance();    // Hidden dependency
        const flags = FeatureFlags.instance();       // Hidden dependency
        const env = process.env.DEPLOYMENT_MODE;     // Hidden dependency (488 files!)
    }
}

You can’t test this without the real singleton. You can’t run different configurations in the same process. And the dependencies are invisible because you discover them at runtime via crashes. Effects-style DI (approximated with the Reader pattern in TypeScript):

type AppDeps = {
    config: IConfigProvider;
    metrics: IMetricsCollector;
    flags: IFeatureFlags;
    clock: IClock;
};

// Business logic is a pure function of its dependencies
function configurePipeline(deps: AppDeps) {
    return (pipeline: PipelineConfig): Result<ConfiguredPipeline, ConfigError> => {
        const features = deps.flags.getEnabled(pipeline.namespace);
        const stages = pipeline.stages
            .filter(s => features.includes(s.requiredFeature))
            .map(s => buildStage(s, deps.config));
        return { ok: true, value: { stages, configuredAt: deps.clock.now() } };
    };
}

// Production wiring — one place, at startup
const production = configurePipeline({
    config: new FileConfigProvider('/etc/app/config.yaml'),
    metrics: new PrometheusCollector(),
    flags: new LaunchDarklyFlags(apiKey),
    clock: SystemClock,
});

// Tests — zero mocking frameworks needed
const test = configurePipeline({
    config: { get: (key) => testDefaults[key] },
    metrics: new NoOpCollector(),
    flags: { getEnabled: () => ['all-features'] },
    clock: { now: () => new Date('2024-01-01') },
});

In languages with native effect support (OCaml 5, Koka, Eff), this becomes even cleaner as intermediate functions don’t need to accept or pass deps at all. They just perform GetConfig and the handler provides the value.

Problem 2: Multiple Metrics Implementations

The legacy system had multiple parallel metrics implementations built by different teams, each with stringly-typed dimensions:

// different ways to record metrics, scattered across 17+ files
IMetricsStore
GlobalMetrics
IoMetricsMgr
DataInsightsMetricsMgr
LocalSearchMetricsReporter

// Plus per-class ad-hoc metrics: PeriodicStats, ConnectionMetrics, PacketReducer...

// Stringly-typed dimensions — typos produce SILENT missing metrics:
metrics.record(['id', prefixId, 'route', routeId]);  // Swap any string? Silent wrong data.

With a single metrics effect:

type MetricEffect =
    | { kind: 'counter', name: MetricName, value: number, tags: MetricTags }
    | { kind: 'gauge', name: MetricName, value: number, tags: MetricTags }
    | { kind: 'histogram', name: MetricName, value: number, tags: MetricTags };

// Branded types prevent typos
type MetricName = string & { __brand: 'MetricName' };
type MetricTags = Record<TagKey, TagValue>;  // Also branded

// Business logic performs the effect — doesn't know WHERE metrics go
function processRoute(event: Event, route: Route): ProcessedEvent {
    perform { kind: 'counter', name: MetricName('events.processed'), value: 1, tags: { route: route.id } };
    const result = transform(event, route);
    perform { kind: 'histogram', name: MetricName('events.latency_ms'), value: elapsed(), tags: { route: route.id } };
    return result;
}

// Handler decides: Prometheus? StatsD? Both? Test collector? All swappable.

Five implementations and seventeen files collapse into one typed effect that the compiler validates.
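Since perform is conceptual syntax, here is one way to approximate the single metrics effect in today’s TypeScript. This is a sketch under my own assumptions: the handler is passed in as a plain function, and instead of branded strings I use a closed union of metric names, which is a simpler technique that still turns a misspelled name into a compile error. The metric names and the uppercase “transform” are placeholders.

```typescript
// Closed set of metric names: a name outside this list does not typecheck.
type MetricName = "events.processed" | "events.latency_ms";
type MetricTags = { route: string };

type MetricEffect =
  | { kind: "counter"; name: MetricName; value: number; tags: MetricTags }
  | { kind: "histogram"; name: MetricName; value: number; tags: MetricTags };

// The "handler" is anything that can receive a MetricEffect.
type MetricHandler = (effect: MetricEffect) => void;

// Business logic describes WHAT happened and never picks a backend.
function processRoute(payload: string, routeId: string, emit: MetricHandler): string {
  const start = Date.now();
  emit({ kind: "counter", name: "events.processed", value: 1, tags: { route: routeId } });
  const result = payload.toUpperCase(); // stand-in for the real transform
  emit({
    kind: "histogram",
    name: "events.latency_ms",
    value: Date.now() - start,
    tags: { route: routeId },
  });
  return result;
}

// Test handler: collect effects in memory. A production handler would forward
// the same values to Prometheus, StatsD, or both.
const recorded: MetricEffect[] = [];
const output = processRoute("hello", "route-1", (effect) => {
  recorded.push(effect);
});
```

Swapping the backend means swapping the function passed as emit; processRoute itself never changes.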

Problem 3: Auth Tokens Anyone Can Forge

The legacy system used a single shared HS256 symmetric token for ALL workers:

// All workers share the same symmetric auth secret
// HS256 symmetric means: every worker can FORGE admin tokens!
// No per-node identity. No revocation without rotating for ALL.

const isValid = authToken === this.masterAuthToken;  // Raw secret comparison

With per-worker tokens and branded types, token scope becomes type-enforced:

type WorkerToken = string & { __brand: 'WorkerToken', workerId: WorkerId, scope: TokenScope };
type LeaderToken = string & { __brand: 'LeaderToken' };

type TokenScope =
    | { kind: 'control_plane', permissions: ControlPermission[] }
    | { kind: 'data_plane', routes: RouteId[] }
    | { kind: 'metrics_only' };

// Functions declare what token scope they require
function deployConfig(token: WorkerToken & { scope: { kind: 'control_plane' } }): Result<...> {
    // Can ONLY be called with a control-plane scoped token
    // Data-plane tokens won't typecheck here
}

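Types alone do not stop forgery. That requires per-worker keys or asymmetric signatures (for example RS256 or EdDSA), verified at runtime, so that a worker holding its own credentials cannot mint someone else’s. What the types add is the second half: once a token has been verified and branded, the compiler enforces its scope at every call site, so a data-plane token can’t reach a control-plane function by accident.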
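Here is a small sketch of how the runtime check and the compile-time scope can fit together. Everything in it is illustrative: verifyToken uses an in-memory table as a stand-in for real signature verification, and the token strings and worker ids are made up.

```typescript
type TokenScope =
  | { kind: "control_plane" }
  | { kind: "data_plane"; routes: string[] }
  | { kind: "metrics_only" };

// A token that has passed verification. The brand is a compile-time marker only.
type VerifiedToken<S extends TokenScope = TokenScope> = {
  readonly __brand: "VerifiedToken";
  readonly workerId: string;
  readonly scope: S;
};
type ControlPlaneToken = VerifiedToken<{ kind: "control_plane" }>;

// ASSUMPTION: this table stands in for checking a per-worker signature.
const issued = new Map<string, { workerId: string; scope: TokenScope }>([
  ["tok-a", { workerId: "worker-a", scope: { kind: "control_plane" } }],
  ["tok-b", { workerId: "worker-b", scope: { kind: "metrics_only" } }],
]);

// The only place a VerifiedToken is created: after the runtime check passes.
function verifyToken(raw: string): VerifiedToken | null {
  const claims = issued.get(raw);
  if (claims === undefined) return null;
  return { __brand: "VerifiedToken", workerId: claims.workerId, scope: claims.scope };
}

// Type guard that narrows a verified token to the control-plane scope.
function isControlPlane(token: VerifiedToken): token is ControlPlaneToken {
  return token.scope.kind === "control_plane";
}

// Only accepts control-plane tokens; other scopes do not typecheck here.
function deployConfig(token: ControlPlaneToken): string {
  return `deploying as ${token.workerId}`;
}

function handleDeploy(raw: string): string {
  const token = verifyToken(raw);
  if (token === null) return "rejected: unknown token";
  if (!isControlPlane(token)) return "rejected: wrong scope";
  return deployConfig(token);
}
```

The raw string never reaches deployConfig. It has to pass through verifyToken and the scope guard first, and the compiler rejects any call path that skips either step.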

Problem 4: Control Flow Disguised as Errors

The legacy codebase used exceptions for control flow:

for (const event of events) {
    try {
        processEvent(event);
    } catch (e) {
        if (e instanceof SkipEventError) continue;    // Control flow disguised as error!
        if (e instanceof AppError) logger.warn(e);
        if (e instanceof PipelineError) { ... }
        // Unknown errors fall through and are silently swallowed
    }
}

There were multiple error hierarchies (AppError, RESTError, RpcError, PipelineError) with no unified classification. With effects, control flow signals and failures are distinct and handled separately:

type ControlEffect =
    | { kind: 'skip', reason: string }
    | { kind: 'retry', after: Duration }
    | { kind: 'terminate', gracefully: boolean };

type FailureEffect =
    | { kind: 'transient', error: Error, retryable: true }
    | { kind: 'permanent', error: Error, retryable: false }
    | { kind: 'validation', field: string, message: string };

// Business logic declares intent — doesn't decide policy
function processEvent(event: RawEvent): Effect<ControlEffect | FailureEffect, ProcessedEvent> {
    if (!isRelevant(event)) {
        return perform { kind: 'skip', reason: 'irrelevant event type' };
    }
    const validated = validate(event);
    if (!validated.ok) {
        return perform { kind: 'validation', field: validated.field, message: validated.message };
    }
    return transform(validated.value);
}

// Handler decides policy — completely separate from business logic
const withPolicy = handle(processEvent(event), {
    skip: (effect, resume) => { metrics.increment('skipped'); resume(null); },
    transient: (effect, resume) => { queue.requeue(event); resume(null); },
    permanent: (effect, resume) => { deadLetter.send(event, effect.error); resume(null); },
    validation: (effect, resume) => { logger.warn('Validation failed', effect); resume(null); },
});

The business logic says “this event should be skipped” or “this operation failed transiently.” It doesn’t decide whether to retry, log, or dead-letter. That’s the handler’s job and handlers can be swapped independently.
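Without language-level effects, the closest everyday approximation I know is to return the signal as a value and let a separate policy object interpret it. The sketch below assumes that approach. The Outcome and Policy types and the toy event shape are mine, not from the legacy code, and the transient variant is declared only to mirror the failure kinds above.

```typescript
type Outcome =
  | { kind: "done"; value: string }
  | { kind: "skip"; reason: string }
  | { kind: "transient"; error: Error }
  | { kind: "validation"; field: string; message: string };

type RawEvent = { type: string; payload?: string };

// Business logic declares intent and never decides policy.
function processEvent(event: RawEvent): Outcome {
  if (event.type !== "metric") {
    return { kind: "skip", reason: "irrelevant event type" };
  }
  if (event.payload === undefined) {
    return { kind: "validation", field: "payload", message: "missing" };
  }
  return { kind: "done", value: event.payload.trim() };
}

// One callback per outcome kind: this is the swappable "handler".
type Policy = {
  done: (value: string) => void;
  skip: (reason: string) => void;
  transient: (error: Error) => void;
  validation: (field: string, message: string) => void;
};

// The switch has no default, so a new Outcome kind breaks the build until handled.
function handle(outcome: Outcome, policy: Policy): void {
  switch (outcome.kind) {
    case "done":
      return policy.done(outcome.value);
    case "skip":
      return policy.skip(outcome.reason);
    case "transient":
      return policy.transient(outcome.error);
    case "validation":
      return policy.validation(outcome.field, outcome.message);
  }
}

// A test policy that only records what it saw.
const log: string[] = [];
const testPolicy: Policy = {
  done: (value) => { log.push(`done:${value}`); },
  skip: (reason) => { log.push(`skip:${reason}`); },
  transient: (error) => { log.push(`transient:${error.message}`); },
  validation: (field) => { log.push(`validation:${field}`); },
};

const events: RawEvent[] = [
  { type: "metric", payload: " cpu=1 " },
  { type: "audit" },
  { type: "metric" },
];
for (const event of events) {
  handle(processEvent(event), testPolicy);
}
```

A production policy would requeue, dead-letter, or increment metrics in the same four slots, and processEvent would not change.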

Problem 5: No Circuit Breakers

The legacy system had no circuit breakers. When a downstream service failed, requests piled up until the process crashed:

dest.connect().catch(NOOP);  // If it fails, try again next time. Or don't. Who knows.

// Retry with infinite loop and no idempotency:
while (true) {
    try {
        await writeToFile(...);
        callback();
        break;
    } catch {
        await delay(1000);  // Retry forever. No backoff. No limit. No idempotency check.
    }
}

With effects, retry and circuit-breaking become composable middleware:

type RetryPolicy =
    | { kind: 'none' }
    | { kind: 'fixed', attempts: number, delay: Duration }
    | { kind: 'exponential', maxAttempts: number, baseDelay: Duration, maxDelay: Duration }
    | { kind: 'circuitBreaker', failureThreshold: number, resetAfter: Duration };

// Circuit breaker itself is a state machine (an ADT)
type CircuitState =
    | { kind: 'closed', failureCount: number }
    | { kind: 'open', openedAt: Date, failureCount: number }
    | { kind: 'halfOpen', failureCount: number };  // carried so a failed probe can reopen with it

// threshold and resetTimeout are assumed to come from the circuitBreaker policy
// (failureThreshold and resetAfter); elapsed() is an assumed helper.
function circuitTransition(state: CircuitState, event: CircuitEvent): CircuitState {
    switch (state.kind) {
        case 'closed':
            if (event.kind === 'failure') {
                const newCount = state.failureCount + 1;
                if (newCount >= threshold) return { kind: 'open', openedAt: new Date(), failureCount: newCount };
                return { ...state, failureCount: newCount };
            }
            return { kind: 'closed', failureCount: 0 };
        case 'open':
            if (elapsed(state.openedAt) > resetTimeout) return { kind: 'halfOpen', failureCount: state.failureCount };
            return state;
        case 'halfOpen':
            if (event.kind === 'success') return { kind: 'closed', failureCount: 0 };
            return { kind: 'open', openedAt: new Date(), failureCount: state.failureCount };
    }
}

Notice: the circuit breaker itself is modeled as an ADT with exhaustive state transitions. ADTs model the state. Effects separate the retry policy from the code that needs retrying. Together they create systems that are both correct and composable.
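To make the transitions easy to test, the sketch below passes the timestamp in with each event instead of reading the clock, which turns the breaker into a pure function. The config shape, the event shape, and the specific thresholds are assumptions for illustration. As in the code above, any event that arrives after the reset window moves an open breaker to half-open.

```typescript
type CircuitState =
  | { kind: "closed"; failureCount: number }
  | { kind: "open"; openedAt: number }
  | { kind: "halfOpen" };
type CircuitEvent = { kind: "success" | "failure"; at: number }; // at = epoch millis
type CircuitConfig = { failureThreshold: number; resetAfterMs: number };

// Pure transition function: same inputs always give the same next state.
function step(state: CircuitState, event: CircuitEvent, cfg: CircuitConfig): CircuitState {
  switch (state.kind) {
    case "closed": {
      if (event.kind === "success") return { kind: "closed", failureCount: 0 };
      const failures = state.failureCount + 1;
      return failures >= cfg.failureThreshold
        ? { kind: "open", openedAt: event.at }
        : { kind: "closed", failureCount: failures };
    }
    case "open":
      // Stay open until the reset window has passed, then allow a probe.
      return event.at - state.openedAt >= cfg.resetAfterMs ? { kind: "halfOpen" } : state;
    case "halfOpen":
      // The probe result decides: close on success, reopen on failure.
      return event.kind === "success"
        ? { kind: "closed", failureCount: 0 }
        : { kind: "open", openedAt: event.at };
  }
}

const cfg: CircuitConfig = { failureThreshold: 2, resetAfterMs: 1000 };
const timeline: CircuitEvent[] = [
  { kind: "failure", at: 0 },
  { kind: "failure", at: 10 },
  { kind: "failure", at: 500 },
  { kind: "success", at: 1500 },
  { kind: "success", at: 1600 },
];

// Replay the timeline and record the state after each event.
const trace: string[] = [];
let current: CircuitState = { kind: "closed", failureCount: 0 };
for (const event of timeline) {
  current = step(current, event, cfg);
  trace.push(current.kind);
}
```

Replaying a recorded timeline like this is also a cheap way to debug a production breaker after the fact, since no real clock is involved.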


VII. Design Thinking: Transformations Over Entities

Here’s an insight that ties everything together: design the transformations first, then the things being transformed. A system’s architecture is defined by how data flows, not by what objects exist.

The God Class Problem: Architecture You Can’t See

// A pipeline manager — 1,300+ lines, 80+ methods
class PipelineManager {
    process(event: any) {
        if (this.shouldFilter(event)) return;     // filtering concern
        this.metrics.increment('processed');       // observability concern
        const result = this.transform(event);     // transformation concern
        this.route(result);                       // routing concern
        this.metrics.recordLatency(start);        // observability again
    }
}

The architecture is invisible. Everything is tangled. You can’t test transformation without routing. You can’t add observability without modifying the pipeline. When you model the same thing as typed functions, the architecture becomes visible:

// Each stage is a typed function with a clear input/output contract
fn parse(raw: RawEvent) -> Result<ParsedEvent, ParseError> { ... }
fn validate(parsed: ParsedEvent) -> Result<ValidEvent, ValidationError> { ... }
fn enrich(valid: ValidEvent) -> Result<EnrichedEvent, EnrichError> { ... }
fn route(enriched: &EnrichedEvent) -> RoutingDecision { ... }

// Composition IS the architecture — visible, testable, reorderable
fn process_event(raw: RawEvent) -> Result<EnrichedEvent, PipelineError> {
    let parsed = parse(raw)?;
    let valid = validate(parsed)?;
    let enriched = enrich(valid)?;
    Ok(enriched)
}

// Cross-cutting concerns are separate composable wrappers
let pipeline = WithMetrics::new("pipeline", process_event);
let pipeline = WithFilter::new(filter_config, pipeline);
let pipeline = WithRouting::new(route_table, pipeline);

Each stage is independently testable. Adding observability doesn’t touch business logic. Reordering is just reordering function composition. The types document the flow: RawEvent --> ParsedEvent --> ValidEvent --> EnrichedEvent. This is what “the arrows are the architecture” means: the transformations between types are the system’s behavior.
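The same shape works in TypeScript with a hand-rolled Result type. In this sketch the Stage type, the pipe and withMetrics helpers, and the toy “level: message” event format are all my own assumptions, chosen to keep the example small.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
type Stage<A, B> = (input: A) => Result<B, string>;

// Compose two stages; the second runs only if the first succeeded.
function pipe<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return (input) => {
    const r = first(input);
    return r.ok ? second(r.value) : r;
  };
}

type ParsedEvent = { level: string; message: string };
type ValidEvent = ParsedEvent & { validated: true };

const parse: Stage<string, ParsedEvent> = (raw) => {
  const idx = raw.indexOf(":");
  if (idx < 0) return { ok: false, error: "parse: missing ':'" };
  return { ok: true, value: { level: raw.slice(0, idx), message: raw.slice(idx + 1).trim() } };
};

const validate: Stage<ParsedEvent, ValidEvent> = (p) =>
  p.message.length > 0
    ? { ok: true, value: { ...p, validated: true } }
    : { ok: false, error: "validate: empty message" };

// Cross-cutting concern as a wrapper: counts outcomes without touching the stages.
function withMetrics<A, B>(
  name: string,
  stage: Stage<A, B>,
  counts: Map<string, number>,
): Stage<A, B> {
  return (input) => {
    const r = stage(input);
    const key = `${name}.${r.ok ? "ok" : "error"}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
    return r;
  };
}

const counts = new Map<string, number>();
const pipeline = withMetrics("pipeline", pipe(parse, validate), counts);
const good = pipeline("info: disk ok");
const bad = pipeline("no separator");
```

Note that pipe(validate, parse) does not typecheck, because the output of validate is not the input of parse. The arrows constrain the order.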

Rust’s ? Is Railway-Oriented Programming Built In

Think of data processing as a railway with two tracks: success and failure. Data flows along the success track until something goes wrong; then it switches to the failure track and skips all remaining stages:

// Each ? is a branch point onto the failure track
fn process_event(raw: RawEvent) -> Result<ClassifiedEvent, PipelineError> {
    let parsed = parse(raw)?;         // fails? switch to error track
    let valid = validate(parsed)?;    // fails? switch to error track
    let enriched = enrich(valid)?;    // fails? switch to error track
    let classified = classify(enriched)?;
    Ok(classified)
}

// Each piece tested in isolation:
#[test]
fn parse_handles_malformed_json() {
    let result = parse(RawEvent::new("not json"));
    assert!(matches!(result, Err(PipelineError::MalformedInput { .. })));
}

Rust’s ? operator is this pattern built into the language syntax. There is no special library and no monadic boilerplate: the language itself is railway-oriented.

Thinking in Transformations

Not all transformations are the same. Knowing which kind you’re building helps you choose the right pattern:

  • One-to-one (parsing, validation): every input produces exactly one output. These compose directly: parse >> validate >> enrich.
  • One-to-many (fan-out, splitting): one input produces multiple outputs. Use flatMap or stream splitting. For example, one log line becomes multiple metrics events.
  • Many-to-one (aggregation): multiple inputs combine into one. Use a windowed reduce. For example, 1000 metric samples become a single P99 value.
  • Reversible (encoding, encryption): can be undone without loss. Good for serialization boundaries where you need to cross system edges.
  • Self-directed (state transitions): transforms a value into another of the same type. State machines are exactly this: State --> State. An ADT enum is the natural representation.

The legacy PipelineManager muddled all five together in one class. Separating them makes each stage’s contract explicit and independently testable.
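A few of these kinds are easy to show side by side. The sketch below is only illustrative: the comma-separated log format is invented, and the percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
// One-to-many: one log line fans out into several metric events.
function splitLine(line: string): string[] {
  return line
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}
const metricEvents = ["cpu=1, mem=2", "disk=3"].flatMap(splitLine);

// Many-to-one: a window of samples collapses into a single percentile value.
// Nearest-rank: take the sample at position ceil(p% of n), counting from 1.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}
const samples = Array.from({ length: 1000 }, (_, i) => i + 1); // 1, 2, ..., 1000
const p99 = percentile(samples, 99);

// Self-directed: State --> State, the shape of every state machine step.
type JobPhase = "idle" | "running" | "done";
const advance = (phase: JobPhase): JobPhase => (phase === "idle" ? "running" : "done");
```

Each function has a different signature shape (A to B[], A[] to B, A to A), and that shape is usually enough to tell which kind of transformation you are looking at.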

Measuring Coupling Through Connections

Here’s a concrete way to see how much a legacy architecture costs. Count the connections:

Point-to-point (legacy): N services = N × (N-1) / 2 connections
  10 services  =    45 connections
  20 services  =   190 connections
  50 services  = 1,225 connections  <-- quadratic growth

Data-oriented: N services = N connections (each talks to a shared typed data layer)
  10 services  =  10 connections
  20 services  =  20 connections
  50 services  =  50 connections   <-- linear growth

The legacy system’s 125+ endpoints each know about each other implicitly through shared singletons, events, and direct calls. Adding endpoint #126 means understanding what it might break in endpoints #1–125.

With a data-oriented approach, each component only needs to understand the shared data schema instead of every other component. The tradeoff: schema design becomes your hardest decision. Data outlives code. You can rewrite a service in a weekend, but migrating a billion records takes months. Get the ADTs right before committing.
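The arithmetic is small enough to check directly. The function names below are mine. The last line applies the same formula to the endpoint example: in the worst case, the 126th endpoint adds one potential connection to each of the existing 125.

```typescript
// Worst case for point-to-point: every pair of services may be connected.
const pointToPoint = (n: number): number => (n * (n - 1)) / 2;
// Shared typed data layer: each service has one connection, to the layer.
const sharedLayer = (n: number): number => n;

const rows = [10, 20, 50].map((n) => ({
  services: n,
  mesh: pointToPoint(n),
  shared: sharedLayer(n),
}));

// Marginal cost of adding endpoint #126 to a fully meshed system of 125.
const addedByEndpoint126 = pointToPoint(126) - pointToPoint(125);
```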

Stratified Design: Layers by Rate of Change

Within the functional core, code should be layered by how often it changes:

Layer 4 (changes weekly):    Business rules, feature flags, pricing logic
Layer 3 (changes monthly):   Domain logic, validation, workflow orchestration
Layer 2 (changes quarterly): Framework utilities, pipeline combinators, retry policies
Layer 1 (changes yearly):    Language extensions, data structures, core types

Each layer only calls downward. A change in Layer 4 (a new pricing rule) cannot break Layer 1 (your Result type), because nothing in Layer 1 depends on it. This keeps frequent changes from cascading into the stable foundation.

// Layer 1: Stable foundation (built into the language)
// Result<T, E>, Option<T>, Traits: From, Into, TryFrom

// Layer 2: Domain-specific combinators
async fn with_retry<T>(policy: &RetryPolicy, f: impl Fn() -> Fut<T>) -> Result<T, Error>;
async fn with_circuit_breaker<T>(state: &CircuitState, f: impl Fn() -> Fut<T>) -> Result<T, Error>;

// Layer 3: Business domain
fn validate_pipeline(config: &PipelineConfig) -> Result<ValidPipeline, Vec<ValidationError>>;
fn route_event(event: &ValidEvent, table: &RouteTable) -> RoutingDecision;

// Layer 4: Configuration and policies (changes frequently)
let route_table: RouteTable = load_config("routes.yaml")?;
let retry_policy = RetryPolicy::Exponential { max_attempts: 3, base_delay_ms: 100 };

Replace Imperative Loops with Pipelines

The legacy codebase had hundreds of imperative accumulation loops:

// Legacy: imperative accumulation (hundreds of instances)
const results = [];
for (const worker of workers) {
    if (worker.isActive()) {
        const metrics = await worker.getMetrics();
        if (metrics.cpuUsage > threshold) {
            results.push({ workerId: worker.id, cpu: metrics.cpuUsage });
        }
    }
}

Iterator combinators express the same thing as a pipeline in which each step is independently readable and testable:

// Declare WHAT, not HOW
let results: Vec<_> = workers.iter()
    .filter(|w| w.is_active())
    .filter_map(|w| {
        let metrics = w.get_metrics();
        (metrics.cpu_usage > threshold).then(|| OverloadedWorker {
            worker_id: w.id.clone(),
            cpu: metrics.cpu_usage,
        })
    })
    .collect();

You can add or remove a stage without restructuring any loop. Each step in the chain has a clear type. And for a 1,200-line initialization sequence, the same idea applies:

// Instead of 1,200 lines of sequential initialization with implicit ordering:
let server = ServerBuilder::new(env)
    .with_logging()?
    .with_metrics()?
    .with_storage()?
    .load_pipelines()?
    .with_health_check()?
    .bind_endpoints()?
    .build();
// Each method returns the next builder phase.
// Ordering is explicit in the chain — not hidden at line 847.
// ? propagates errors cleanly — no nested try/catch.

Reactive Patterns: Derived State That Can’t Go Stale

The legacy codebase had derived values that went stale because updates were manually tracked:

class Dashboard {
    private totalEvents = 0;      // must remember to update
    private avgLatency = 0;       // must remember to update
    private activeWorkers = 0;    // must remember to update

    onMetric(metric) {
        this.totalEvents++;
        // avgLatency updated... somewhere else. Maybe. If someone remembers.
    }
}

The reactive pattern (the same idea behind React, Redux, and spreadsheets) makes derived values automatic:

// Source cells (the inputs you can change)
const events = createCell<EventLog>([]);
const workers = createCell<Worker[]>([]);

// Derived formulas (automatically recompute when inputs change)
const totalEvents = formula(() => events.get().length);
const activeWorkers = formula(() => workers.get().filter(w => w.isActive()).length);
const avgLatency = formula(() => {
    const recent = events.get().slice(-1000);
    if (recent.length === 0) return 0;  // avoid NaN from 0 / 0 before any events arrive
    return recent.reduce((sum, e) => sum + e.latency, 0) / recent.length;
});

// Can NEVER be stale — recomputes automatically when inputs change
// "Forgot to update" bugs are impossible

ValueCell (a mutable input) and FormulaCell (a derived computation) are the two primitives behind every reactive system, from spreadsheets to React.
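The createCell and formula helpers above are not from a specific library, so here is one minimal way they could be implemented. This is a sketch under my own design assumptions: formulas are lazy, they cache their value, and a cell marks its dependents dirty when it changes. It does no cycle detection and never unsubscribes.

```typescript
type Invalidate = () => void;

// The formula currently being computed, if any. Cells read during that
// computation register it as a dependent.
let activeFormula: Invalidate | null = null;

function createCell<T>(initial: T) {
  let value = initial;
  const dependents = new Set<Invalidate>();
  return {
    get(): T {
      if (activeFormula !== null) dependents.add(activeFormula);
      return value;
    },
    set(next: T): void {
      value = next;
      dependents.forEach((invalidate) => invalidate()); // mark derived values dirty
    },
  };
}

function formula<T>(compute: () => T) {
  let cached: T | undefined;
  let dirty = true;
  const dependents = new Set<Invalidate>();
  const invalidate: Invalidate = () => {
    dirty = true;
    dependents.forEach((notify) => notify()); // formulas built on this one are dirty too
  };
  return {
    get(): T {
      if (activeFormula !== null) dependents.add(activeFormula);
      if (dirty) {
        const previous = activeFormula;
        activeFormula = invalidate;
        try {
          cached = compute(); // reads inside compute() subscribe this formula
        } finally {
          activeFormula = previous;
        }
        dirty = false;
      }
      return cached as T;
    },
  };
}

const latencies = createCell<number[]>([]);
const totalEvents = formula(() => latencies.get().length);
const avgLatency = formula(() => {
  const xs = latencies.get();
  return xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length;
});

latencies.set([10, 20, 30]);
```

After the set call, totalEvents.get() and avgLatency.get() recompute from the new array on the next read. Nobody had to remember to update them.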


VIII. The Bigger Framework: Actions, Calculations, Data

Everything covered so far fits into a simple three-way classification from Eric Normand’s book Grokking Simplicity:

Data: Inert facts. Immutable. Serializable. Safe to copy, share, store, send.

type WorkerState = { kind: 'idle' } | { kind: 'configuring', request: ClusterRequest };
type JobEvent = { kind: 'started', workerId: string, at: Date };

Calculations: Pure functions. Same input always produces the same output. No side effects. Safe to call anywhere, anytime, as many times as you want.

function deriveState(events: JobEvent[]): JobState { ... }
function validate(event: RawEvent): Result<ValidEvent, ValidationError> { ... }

Actions: Depend on when or how often they run. I/O. Time. Network. The dangerous stuff.

async function saveToDatabase(state: JobState): Promise<void> { ... }
async function sendMetrics(metrics: Metric[]): Promise<void> { ... }

The legacy system had roughly 80% Actions, 15% Mixed (calculations that accidentally touched singletons or Date.now()), and 5% pure Calculations. The target is the Functional Core, Imperative Shell pattern:

The core is pure: no I/O, no time, no randomness. It takes Data in and produces Data out. It’s trivially testable, trivially parallelizable (no shared state), and trivially composable. The shell is thin: it translates between the real world and the pure core. Every antipattern in the legacy codebase came from violating this boundary: singletons injecting Actions into Calculations, mutable state making “pure” functions depend on timing, mixed I/O making business logic untestable without the full system running.
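Here is a small sketch of that boundary. The event and state shapes are simplified versions of the ones above (I added a finished event and use numeric timestamps), and runOnce with its two injected functions is a hypothetical shell, not code from the legacy system.

```typescript
type JobEvent =
  | { kind: "started"; workerId: string; at: number }
  | { kind: "finished"; workerId: string; at: number };
type JobState = { running: string[]; completed: number };

// Functional core: a Calculation. Data in, Data out, no I/O and no clock.
function deriveState(events: JobEvent[]): JobState {
  const running = new Set<string>();
  let completed = 0;
  for (const event of events) {
    if (event.kind === "started") {
      running.add(event.workerId);
    } else if (running.delete(event.workerId)) {
      completed += 1; // only count a finish that matches a known start
    }
  }
  return { running: Array.from(running).sort(), completed };
}

// Imperative shell: the only place Actions happen. Both I/O functions are
// passed in, so production and tests can supply different ones.
async function runOnce(
  loadEvents: () => Promise<JobEvent[]>,
  save: (state: JobState) => Promise<void>,
): Promise<JobState> {
  const state = deriveState(await loadEvents());
  await save(state);
  return state;
}

const sampleState = deriveState([
  { kind: "started", workerId: "w2", at: 1 },
  { kind: "started", workerId: "w1", at: 2 },
  { kind: "finished", workerId: "w2", at: 3 },
  { kind: "finished", workerId: "w9", at: 4 }, // never started, so ignored
]);
```

The core is tested with plain arrays and no setup. The shell is thin enough that an integration test or two covers it.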

Consistent API Responses as Typed Envelopes

The legacy system had 125+ endpoints with inconsistent response formats:

GET /system/inputs  -> { items: IInput[] }
GET /system/outputs -> IOutput[]                   // No wrapper!
GET /jobs           -> PaginatedListResults<IJob>  // Different wrapper!

// Error formats inconsistent too:
throw new RESTError(JSON.stringify(data), code);   // JSON string as message!
throw new RESTError('Not found', 404);
throw new RESTError('Not found', 400);             // Wrong status code!

A typed response envelope makes inconsistency a compile error:

type ApiResponse<T> =
    | { ok: true, data: T, meta?: PaginationMeta }
    | { ok: false, error: ApiError };

type ApiError = {
    code: ErrorCode;       // Typed enum, not arbitrary string
    message: string;
    details?: FieldError[];
    traceId: TraceId;      // Branded — always present for debugging
};

// Both return the same shape. Always. Compiler enforces it.
function listInputs(req: Request): ApiResponse<Input[]> { ... }
function listOutputs(req: Request): ApiResponse<Output[]> { ... }
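
One way to keep handlers from hand-building envelopes is a pair of constructor functions plus a single mapping from error code to HTTP status. This is a sketch under assumptions: `ErrorCode`, `TraceId`, and `ApiError` are simplified stand-ins for the types above, and `success`, `failure`, and `httpStatus` are hypothetical helper names:

```typescript
// Simplified stand-ins so the sketch compiles on its own (assumptions).
type ErrorCode = 'NOT_FOUND' | 'VALIDATION_FAILED' | 'INTERNAL';
type TraceId = string & { readonly __brand: 'TraceId' };
type ApiError = { code: ErrorCode, message: string, traceId: TraceId };
type ApiResponse<T> =
    | { ok: true, data: T }
    | { ok: false, error: ApiError };

// The only two ways a handler builds a response.
function success<T>(data: T): ApiResponse<T> {
    return { ok: true, data };
}

function failure<T>(code: ErrorCode, message: string, traceId: TraceId): ApiResponse<T> {
    return { ok: false, error: { code, message, traceId } };
}

// One place maps error codes to HTTP statuses, so a "Not found"
// can no longer be sent with a 400 by accident.
function httpStatus<T>(res: ApiResponse<T>): number {
    if (res.ok === true) {
        return 200;
    }
    switch (res.error.code) {
        case 'NOT_FOUND': return 404;
        case 'VALIDATION_FAILED': return 400;
        case 'INTERNAL': return 500;
    }
}
```

Adding a new `ErrorCode` member without extending the `switch` leaves a path with no return value, which the compiler reports under strict settings.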

IX. Let the Compiler Work for You

The compiler catches bugs in seconds. Tests catch them in minutes. Staging catches them in hours. Production catches them over days of incident response, root cause analysis, and post-mortems. The math is simple. Investing time in better types eliminates entire categories of bugs that would each cost 10-100x more downstream.


X. When NOT to Use This

These patterns aren’t universally optimal.

  • Don’t use ADTs when you’re still exploring. When you don’t know yet what the valid states ARE, encoding them as sum types locks you in prematurely. Start with loose types, discover the states through testing, then lock them down.
  • Don’t use ADTs for simple CRUD with few states. A blog post with {title, body, published} doesn’t need Draft | Published | Archived. If the state space is small and obvious, a boolean is fine.
  • Don’t use full effects systems in hot paths. Effect handlers add indirection. In inner loops processing millions of events per second, direct function calls beat effect dispatch. Use effects at the boundary, direct calls in the hot path.
  • Don’t adopt effects before your team understands them. If your team has never seen algebraic effects, introducing them when new Service(deps) works fine creates confusion without proportional benefit. The approximations (Reader pattern, context variables) are a gentler on-ramp.

The adoption gradient, from easiest to hardest:

Easy (adopt today):
  Boolean pairs -> sum types            (just types, zero learning curve)
  .catch(NOOP) -> explicit handling     (mindset shift only)

Medium (team discussion needed):
  Singletons -> parameter injection     (changes constructor signatures)
  Imperative loops -> map/filter/reduce (functional style shift)

Hard (architectural decision):
  Shared state -> actors/channels       (concurrency model change)
  Mixed I/O -> functional core/shell    (structural refactor)
  Full effect systems                   (new paradigm)

Start at the top. Each level delivers value independently. You don’t need to reach the bottom to benefit.


XI. The Migration Path (Incremental, Not Big Bang)

You don’t need to rewrite your system. Here’s the step-by-step path.

  • Step 1: Boolean pairs –> sum types (minutes per instance)
// Before
let isConnected: boolean;
let isAuthenticated: boolean;

// After
type ConnectionState =
    | { kind: 'disconnected' }
    | { kind: 'connected', socket: TcpStream }
    | { kind: 'authenticated', socket: TcpStream, token: AuthToken };
  • Step 2: Find every .catch(NOOP) and make a decision: Each one is a decision point: should it retry, log, propagate, or recover? At minimum, log it. Better: make it a Result so callers know.
  • Step 3: Singletons –> constructor parameters (one file at a time): Pick one singleton-using class. Pass the dependency as a constructor parameter instead of hunting for it globally. Test it with a stub.
  • Step 4: Centralize mode checks before eliminating them: Before you can replace 506 scattered mode checks, you need mode determination in ONE place:
// Step 1: Create the union type
type AppMode = { kind: 'leader', ... } | { kind: 'worker', ... } | ...;

// Step 2: Determine mode ONCE at startup
const mode: AppMode = determineMode(process.env);

// Step 3: Pass mode to subsystems — then replace checks one at a time
  • Step 5: Shared mutable state –> channels (one boundary at a time): Identify shared mutable state accessed by multiple async operations. Introduce a channel wrapper and don’t rewrite everything at once.
  • Step 6: New features go in first (pure core, then I/O): For every new feature, write the business logic as pure functions. Push all I/O to the boundaries.
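
As an illustration of Step 3, here is a minimal sketch of moving a dependency from a global lookup to a constructor parameter. `MetricsSink`, `JobRunner`, and `StubMetrics` are hypothetical names, not classes from the legacy system:

```typescript
// Hypothetical dependency; in the legacy style this would be a global singleton.
interface MetricsSink {
    record(metric: string, value: number): void;
}

class JobRunner {
    private readonly metrics: MetricsSink;

    // The dependency arrives through the constructor instead of a global lookup.
    constructor(metrics: MetricsSink) {
        this.metrics = metrics;
    }

    run(itemCount: number): void {
        this.metrics.record('job.items', itemCount);
    }
}

// A stub for tests: records calls in memory, with no network and no global state.
class StubMetrics implements MetricsSink {
    readonly calls: Array<[string, number]> = [];

    record(metric: string, value: number): void {
        this.calls.push([metric, value]);
    }
}

const stub = new StubMetrics();
new JobRunner(stub).run(3);
// stub.calls now holds the single recorded metric, ready to assert on.
```

The constructor signature changes, which is why this step sits in the "team discussion needed" tier, but each class can be converted on its own.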

What’s Available in Your Language Today

Language | Sum Types | Exhaustiveness | Result Type | Pattern Matching
Rust | enum (first-class) | Built-in, enforced | Result<T, E> + ? | match (exhaustive)
TypeScript | Discriminated unions | never check | Custom or fp-ts | switch + narrowing
Swift | enum with associated values | Built-in | Result<T, E> | switch
Kotlin | Sealed classes | when exhaustive | Result / Either | when
Java 17+ | Sealed interfaces + records | Switch expressions | Custom or vavr | Pattern matching (21+)
Python 3.10+ | @dataclass unions | match (partial) | Custom or returns | match statement
Go | Interface + type switch | No built-in | (T, error) tuple | Type assertions
Rust stands out because it was designed around these patterns: first-class ADTs, mandatory exhaustive matching, built-in Result/Option with the ? operator, ownership-based concurrency safety, and zero-cost newtypes. But you can apply these ideas in any language as the patterns are about thinking, not syntax.
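
For the TypeScript row, the never check can be sketched as follows. `ConnState`, `assertNever`, and `describeState` are illustrative names chosen for this example:

```typescript
type ConnState =
    | { kind: 'disconnected' }
    | { kind: 'connected', host: string }
    | { kind: 'authenticated', host: string, user: string };

// Only callable when every variant has been handled: once the switch is
// exhaustive, the leftover type is `never`. If a variant is missed, the
// argument no longer has type `never` and the call fails to compile.
function assertNever(value: never): never {
    throw new Error('Unhandled variant: ' + JSON.stringify(value));
}

function describeState(state: ConnState): string {
    switch (state.kind) {
        case 'disconnected':
            return 'offline';
        case 'connected':
            return 'connected to ' + state.host;
        case 'authenticated':
            return state.user + '@' + state.host;
        default:
            return assertNever(state);
    }
}
```

Adding a fourth variant to `ConnState` turns the `default` branch into a compile error until `describeState` handles it, which is the closest TypeScript gets to Rust's enforced `match`.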


XII. The Three Laws

All of this comes down to three principles:

  • If it can’t be represented, it can’t happen. Illegal states that don’t exist in the type system are bugs that don’t exist in production.
  • If it must be handled, it will be handled. When the compiler forces you to address every variant, every error, and every edge case, nothing slips through.
  • If it’s composed from tested parts, the composition is tested. Pure functions that individually work correctly compose into pipelines that work correctly. No emergent failure modes from unexpected interactions.

Conclusion: Architecture as Enforcement

The legacy system I analyzed had documentation describing its intended architecture. It had design reviews. It had coding guidelines. None of it prevented 441 silent error swallows, 64-state boolean explosions, race conditions in shared mutable state, 5 redundant metrics implementations, or a shared auth token that let any worker forge admin credentials. Documentation describes intent. Tests verify behavior at a point in time. But types enforce invariants continuously on every line of code, in every file, for every developer, for the entire lifetime of the codebase.

ADTs make impossible states unrepresentable. Algebraic effects separate mechanism from policy. Together, they transform architecture from aspiration into enforcement. The compiler doesn’t take vacations. It doesn’t forget edge cases. In a world of distributed systems, concurrent operations, and ever-growing complexity, that’s not just good engineering practice, it’s the only approach that scales.


Related Blogs

  1. From Big Ball of Mud to Functional Pipeline
  2. The Reusability Trap: When DRY Becomes a Liability

June 16, 2026

Growing as a Software Engineer in the Age of Agentic Coding

Filed under: Computing — admin @ 10:14 am

A self-guided path for junior and mid-level engineers whose core skills are quietly eroding


I have observed a stark productivity gap between a senior engineer using an agentic coding tool and a junior engineer using the same tool. The senior engineer moves faster, catches more problems, and produces better outcomes, while the junior engineer often ships code that looks finished but quietly breaks at the seams. The reason is not the tool, it is what each engineer brings to the tool.

Senior engineers are more effective with agentic AI because they have already built the skills that make AI useful: writing precise specifications, designing systems that hold together under real load, spotting code smells in a diff, understanding trade-offs between correctness and performance, maintaining the conceptual integrity of a codebase across hundreds of changes. These skills aren’t separate from coding experience, they are products of it, built up over years of writing code, breaking things, debugging production incidents, and internalizing the consequences of design decisions.

Junior and mid-level engineers haven’t built those skills yet, which used to be fine because the path to building them was clear: you wrote code, made mistakes, got reviewed by someone who caught what you missed, and learned. Repeat for several years. The trouble is that agentic coding short-circuits exactly that path. When an agent generates the code, a junior engineer faces a key problem: they cannot reliably distinguish between code that is actually correct and code that is plausibly correct. Agentic code looks right in the narrow context where it was generated, the function compiles, the tests pass, the logic seems sound. But a system is not a collection of locally correct functions. It is a web of interacting decisions, constraints, and invariants that only hold together if someone understands the whole. Senior engineers have built that whole-system mental model through years of implementation.

Some companies have responded to this by stopping junior engineer hiring entirely, reasoning that agents can now fill entry-level roles. This is a serious mistake, and a slow-moving disaster. It optimizes for short-term output while eliminating the pipeline through which every senior and principal engineer is eventually produced. Today’s junior engineers are tomorrow’s architects. When companies stop hiring and developing them, they are consuming the seed corn and they will feel it in three to five years when there are no experienced engineers left to review what the agents produce.

The risk doesn’t stop at junior engineers. Senior engineers face a subtler version of the same problem. When you stop writing code regularly, the skills built through writing it, the intuition for design, the eye for code smells, the ability to hold a large system in your head begin to decay. Specification writing becomes abstract rather than grounded. Architecture decisions lose their connection to implementation reality. Code review gets shallower because you’re no longer maintaining the mental model of how things fit together. The most important thing being lost is not any individual skill but shared understanding, what Fred Brooks called conceptual integrity, the coherence of design philosophy across an entire system that only exists when the people building it have deeply internalized how it works.

This post is about what to do about all of it. How to deliberately build the skills that agentic coding doesn’t hand you. How to maintain the skills you’ve built, as the nature of the work changes. And how to grow from junior to senior to principal in an era when the traditional feedback loop between design and implementation has been broken.


How Engineers Used to Grow

For decades, the career path followed a recognizable arc. You joined at entry level, wrote code, broke things, got your code torn apart in review, made better mistakes, and gradually developed what researchers call tacit understanding, the ability to look at a system and feel what is wrong before you can fully articulate why.

The Dreyfus model of skill acquisition describes this progression across five stages:

Stage | Characteristics | Decision-making | Knowledge
Novice | Follows rules rigidly, no situational judgment | None | Context-free
Advanced Beginner | Recognizes patterns, treats all aspects equally | Without context | Limited
Competent | Plans consciously, sees longer-term goals | Analytical | In context
Proficient | Grasps situations holistically, uses maxims | Analytical –> Intuitive | Holistic
Expert | No rules needed, intuitive grasp of situations | Intuitive | Deep tacit

The Japanese martial arts concept of Shu Ha Ri mirrors this progression. First you follow the form faithfully (Shu: learn the rules). Then you find the exceptions and break with tradition (Ha: question the rules). Then form dissolves into natural action (Ri: transcend the rules). You cannot skip stages. The competent engineer who writes their first distributed system will make mistakes the expert would never make because they haven’t yet built the mental model that only comes from doing the work and suffering the consequences.

What made this work was consequence. Writing code gave you direct feedback. A missing lock caused a race condition. A clever abstraction became unmaintainable by the third person to touch it. A shared base class six levels deep broke four products when a parent changed. These lessons were visceral, and they stuck. The Dreyfus model would say you accumulated the situational exposure that moves you from rule-following to intuition.

Books codified what masters had learned. The Pragmatic Programmer showed how to develop craft. Code Complete provided the vocabulary for code quality. A Philosophy of Software Design showed what makes modules deep or shallow. The Mythical Man-Month showed why adding engineers to a late project makes it later: coordination cost, not coding hours, drives timelines. Research quoted there put coding at roughly 14% of total project effort. Requirements, design, testing, debugging, coordination, documentation, and operations consumed the rest. Agentic coding has compressed that 14% toward zero. The other 86% remains entirely human.


The Disruption: Design and Build Are No Longer Learned Together

In traditional software development, design and build were not two separate activities. They were one activity experienced from two angles. When you wrote the code yourself, you felt every consequence of your design decisions in real time. A bad abstraction made your own implementation painful. A missing transaction boundary caused a bug you personally had to trace to its source at midnight. You didn’t just observe these consequences. You lived them, and that is what made the lessons stick.

Agentic coding severs this connection. The engineer writes a specification, the agent produces code, and the engineer reviews the result. This workflow feels productive. It is often highly productive, for engineers who already have the judgment to specify well and review rigorously. But for engineers who are still building that judgment, it removes the primary mechanism that builds engineering judgment.

The specific failure mode for junior and mid-level engineers is the plausibility trap. Agent-generated code looks correct in the narrow local context: functions are clean, tests pass, the logic holds for the cases the spec described. What the code often lacks is correctness at the system level, the consistency guarantees that span service boundaries, the failure modes that only appear under concurrent load, the invariants that hold only if you understand the domain well enough to define them. A senior engineer reviewing that code has a whole-system mental model built through years of implementation experience. They feel when something is off even before they can articulate why. A junior engineer doesn’t have that model yet, and reviewing agent-generated code without it is like trying to spot a structural flaw in a building you’ve never seen the blueprints for.

Bertrand Meyer, in his analysis in Communications of the ACM, makes this point precisely: AI-generated code creates a dangerous psychological bias because it is significantly harder to spot a subtle logical flaw in well-structured generated code than in the messy human-written code reviewers are used to. Cleanliness produces false confidence. Agents write plausible code. Plausible is not the same as correct, and the gap between the two is exactly where junior engineers without deep mental models get stuck.

For senior engineers, the risk is different but equally real. Specification, design, architecture, code review, and debugging are not static skills you acquire once and keep forever. They are maintained through the practice of building systems, not just reviewing them. When senior engineers stop writing code regularly, their design intuitions gradually lose contact with implementation reality. Architectural decisions start floating free from the constraints that make them achievable. Specifications become abstract rather than grounded in how things actually work. Code review gets shallower because the reviewer is no longer holding a live mental model of how the system fits together under pressure. The decay is slow and invisible until it isn’t. I had previously seen this decay when principal and staff engineers stopped writing code and became pure architects, but agentic coding is making it more prevalent.

In AI Writes Code. You Own the Design. Here’s How to Keep It That Way, I described how AI agents resemble offshore teams more than co-located colleagues: they have a narrow context window, they lack shared understanding of your codebase, they produce locally correct work that misses the bigger picture, and they have no memory between sessions. Every session starts from zero. Amazon AWS teams learned this the hard way, AI-generated code that looked right, passed review, and then caused production incidents. Their response was to significantly tighten review policies. When a production incident costs customers millions or exposes a security breach, you cannot file a bug against Cursor or Claude Code. The engineer who approved the change is accountable.

What’s at stake, underneath all of this, is shared understanding. Fred Brooks called it conceptual integrity, the coherence of design philosophy that runs through an entire system. Conceptual integrity doesn’t live in a document. It lives in the heads of engineers who have thought deeply about the system, implemented parts of it themselves, debugged its failures, and built up a shared mental model of how the pieces fit. That shared model is what gets lost when design and implementation are permanently separated. It is also the most important and hardest-to-recover thing a team can lose. Code can be rewritten. Conceptual integrity, once gone, takes years to rebuild. Instead, the system accumulates exactly what Brooks warned against: many locally reasonable decisions that don’t cohere into a whole.

We cannot give up on junior and mid-level engineers developing real skills, even as agents handle more of the typing. Code reviews by senior engineers help but we mostly learn by doing, not by watching. Engineers still need to understand how code works, how it fits into the larger system, and what the trade-offs mean under real conditions. What we build and what we understand are not separate. Pulling them apart entirely is a quality risk the whole team will pay for, slowly at first and then all at once.


The Two Skill Trees

Engineering growth has always required two parallel tracks. Agentic coding affects each differently.

Hard skills are the technical capabilities: designing systems, understanding trade-offs, debugging complex failures, recognizing code smells, reasoning about correctness under concurrency, and mastering both functional and non-functional requirements. The traditional path built these incrementally through writing code, breaking things, and fixing them. Agentic coding removes that feedback loop without replacing it.

Soft skills are the interpersonal and organizational capabilities: writing clearly, building consensus, managing ambiguity, estimating honestly, communicating with non-technical stakeholders, mentoring others, and owning outcomes across an entire project lifecycle. These have always mattered for growth from mid-level to senior and from senior to principal. Agentic coding hasn’t reduced their importance, it has raised the bar, because the differentiating value of a senior engineer shifts away from code production and toward judgment, communication, and design thinking.

Both tracks require deliberate practice. The diagram below shows the full skill landscape across career levels, mapped to the hard/soft split.


The Career Levels in Detail

Before getting to specific advice, it helps to have a clear picture of what each level actually requires.

Junior: You own software components and work on well-defined problems. You produce high-quality code under guidance and learn from review feedback. You collaborate across the full development lifecycle, including code, tests, deployment, and documentation, but you rely on peers and managers for guidance on design. You are expected to deliver reliably within a clear scope, and you actively seek to learn.

Mid-level: You are an autonomous contributor, owning features, not just components. You design software solutions for difficult problems, though you still seek guidance on architectural strategy. You coach junior engineers. You make priority trade-offs between feature work and operational work. You participate meaningfully in code reviews, not just catching bugs but providing direction.

Senior: You lead multi-engineer projects and own team-level architecture. You work on complex problems with multiple conflicting constraints. You write for both technical and non-technical audiences. You solve problems that don’t yet have a defined technology strategy. You balance short-term delivery against long-term architectural health. You become a force multiplier and your presence should make everyone meaningfully better.

Staff / Principal: You lead across an organization, not just a team. You define technical strategy and roadmaps that span multiple teams. You take on intrinsically hard problems like major bottlenecks and undefined high-impact opportunities. You align teams toward cohesive technical visions. You earn influence through credibility and results, not title.

Junior engineers are largely tactical. Senior engineers span tactical and operational. Staff and principal engineers are primarily operational and strategic.


Hard Skills: What to Build Deliberately

1. Learn to Write Specifications

The most immediately practical hard skill in the agentic era is writing precise specifications. Agents produce what you specify. Vague specifications produce code that fills gaps with training-data assumptions, which may or may not match your domain. This is a learnable craft.

I wrote the you-got-skills framework to demonstrate how to build specification skills using RFC 2119 discipline (the MUST, SHOULD, and MAY requirement keywords). Write a one-page spec before prompting an agent, every time. For significant features, use the full design document structure I previously shared: problem statement, proposal with trade-offs, alternatives considered, non-functional requirements, and rollout plan. The act of writing this forces you to make decisions you were previously leaving implicit.

2. Build a Design Sense

One of the clearest failure patterns in junior and mid-level engineers using agentic coding is an underdeveloped design sense. They can describe what they want. They struggle to explain why one design is better than another, or to recognize when generated code silently violates conceptual integrity, which is identified in The Mythical Man-Month as the single most important property of a well-designed system.

Build this sense deliberately. Read A Philosophy of Software Design and practice the deletion test: if you deleted this module, where would the complexity go? Deep modules with small interfaces earn their place. Shallow pass-throughs add indirection without value. AI defaults to shallow modules, lots of small classes, each delegating to the next. Learning to recognize this pattern and push back on it is a concrete skill you can develop right now.

In Applying Domain-Driven Design and Clean/Hexagonal Architecture to MicroServices, I shared how Domain-Driven Design can be employed for an application architecture. When AI generates code for your domain, it has no idea what your domain means. Practice making invalid states unrepresentable: sum types that enumerate valid states, state machines that encode valid transitions, and parse-don’t-validate at boundaries. These design patterns matter more in the agentic era because the compiler becomes your code reviewer when humans can’t catch everything. When AI generates code within a well-typed system, category errors that would slip through casual review become compile errors.
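
A minimal parse-don’t-validate sketch, under stated assumptions: the `w-<digits>` worker id format and the names `WorkerId`, `parseWorkerId`, and `shardFor` are invented for illustration.

```typescript
// Branded type: a plain string cannot be passed where a WorkerId is expected.
type WorkerId = string & { readonly __brand: 'WorkerId' };

type ParseResult<T> =
    | { ok: true, value: T }
    | { ok: false, reason: string };

// Parse once at the boundary. Everything downstream takes WorkerId and
// never re-checks the format. The "w-<digits>" format is an assumption.
function parseWorkerId(raw: unknown): ParseResult<WorkerId> {
    if (typeof raw !== 'string') {
        return { ok: false, reason: 'worker id must be a string' };
    }
    if (!/^w-\d+$/.test(raw)) {
        return { ok: false, reason: 'worker id must look like w-123' };
    }
    return { ok: true, value: raw as WorkerId };
}

// Core code accepts only the parsed type, so an unchecked string
// from a request body is a compile error here.
function shardFor(id: WorkerId, shardCount: number): number {
    return Number(id.slice(2)) % shardCount;
}
```

If an agent generates a call such as `shardFor(req.body.id, 8)`, the compiler rejects it until the value has gone through `parseWorkerId`, which is the kind of category error that casual review tends to miss.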

Study the Stable Dependencies Principle: depend in the direction of stability. As illustrated in the reusability trap analysis, the most expensive bugs often don’t come from duplicated code, they come from code shared prematurely. Recognizing when DRY has become a liability is a senior-engineer skill that requires real practice to develop.

3. Develop a Nose for Code Smells and Code Review

Reviewing AI-generated code is not casual reading. Agents write clean, plausible code. The bugs that slip through are not obvious, they’re missing idempotency tokens, race conditions that appear only under concurrent load, enum values that propagate without being handled by all consumers.

Build a structured review practice. Apply two explicit passes. The first pass looks for correctness and security: logic errors (off-by-one, null handling, TOCTOU races), security holes (injection, missing auth checks, hardcoded secrets), data loss risks, and error swallowing. The second pass looks for design: are modules deep or shallow? Are invalid states representable in the type system? Does this code separate commands from queries? Is the complexity justified by the actual problem, or has the agent added abstractions for a feature used by twelve people?

Practice this on every pull request you review, whether AI-generated or human-written. The structured passes build the intuition that experienced engineers call a “nose for code smells”.

4. Master Non-Functional Requirements

Most junior engineers understand functional requirements. Senior engineers understand non-functional requirements: how reliably, under what conditions, and with what failure behavior. This is arguably the most important distinction on the path from mid-level to senior.

When you read a feature request, train yourself to immediately ask: what is the latency budget, and at what percentile? What is the consistency model between these two data stores? What happens if this operation half-succeeds? What’s the blast radius if this component fails completely? What happens at 10x current load? These questions are what agents cannot answer from a vague prompt. In Failures in MicroService Architecture, I shared a number of production issues that I experienced with distributed systems. You can apply the outbox pattern, circuit breaker, retry with jitter, bulkheads and other patterns to remedy common production issues.
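
As one concrete instance, retry with jitter can be sketched as exponential backoff with "full jitter" (a uniform random delay between zero and the capped backoff). The function names and the 100 ms base and 5 s cap are assumptions chosen for this example, and the caller supplies a hypothetical `sleep` function:

```typescript
// Delay for a given attempt: exponential backoff capped at maxMs, then
// full jitter. `random` is injectable so the calculation stays testable.
function backoffDelayMs(
    attempt: number,   // 0 for the first retry
    baseMs: number,
    maxMs: number,
    random: () => number = Math.random,
): number {
    const capped = Math.min(maxMs, baseMs * Math.pow(2, attempt));
    return Math.floor(random() * capped);
}

// Runs the operation up to maxAttempts times, sleeping a jittered delay
// between failures and rethrowing the last error if all attempts fail.
async function retryWithJitter<T>(
    operation: () => Promise<T>,
    maxAttempts: number,
    sleep: (ms: number) => Promise<void>,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (e) {
            lastError = e;
            if (attempt < maxAttempts - 1) {
                await sleep(backoffDelayMs(attempt, 100, 5000));
            }
        }
    }
    throw lastError;
}
```

The jitter matters because clients that fail together otherwise retry together, which hits a recovering service with synchronized load. This sketch retries every error. A production version would retry only errors known to be transient and only operations that are idempotent.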

5. Keep Your Hands in the Code

Agentic coding creates pressure to delegate implementation entirely. Resist it. You do not build judgment about systems you have never built yourself. Write the spike yourself before committing to full design. Implement the critical path at least once, even if an agent later handles the boilerplate. Trace the execution of generated code in a debugger until you understand what it actually does before approving it for production.

This matters most for debugging. When something fails in production, the mental model you’ve built through implementation is what lets you form hypotheses quickly. Engineers who have only reviewed AI-generated code without deeply understanding it will struggle to diagnose the failures that code produces. The CACM analysis shows that AI-generated code introduces logical and concurrency bugs in clean-looking code that humans find harder to spot than equivalent bugs in messy human-written code.

Furthermore, as agentic coding produces more and more code we don’t fully own mentally, understanding decay sets in. Storey calls this cognitive debt. A team can have low technical debt while sitting on a mountain of cognitive debt where no one can confidently predict the impact of a change. Over time, no single engineer holds the complete picture of how the system works. This makes production incidents progressively harder to diagnose. Keeping your hands in the code, owning critical-path implementations, and using agents to explain generated code you don’t immediately understand are how you keep that debt in check.

6. Learn Formal Methods Basics

One underappreciated direction for junior and mid-level engineers is to begin learning specification and verification techniques, not as academic exercises but as practical tools for the agentic era. I shared my experience applying TLA+ for specifications in Beyond Vibe Coding: Using TLA+ and Executable Specifications with Claude. But you can start with property-based testing: instead of writing examples, write invariants that your system must maintain regardless of input. Start with static analysis tools and learn to interpret what they find. Write explicit pre-conditions and post-conditions for complex functions, even as comments. These habits build the specification discipline that makes your agent prompts more precise and your reviews more effective.
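
To make the idea concrete, here is a hand-rolled property check that uses no testing library. The function under test (a nearest-rank `percentile`) and all names are hypothetical. A real project would more likely use a property-testing library, which also shrinks failing inputs:

```typescript
// Deterministic pseudo-random generator (a simple LCG) so failures are reproducible.
function makeRng(seed: number): () => number {
    let state = seed >>> 0;
    return () => {
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        return state / 4294967296;
    };
}

// Function under test (hypothetical): nearest-rank percentile.
// Pre-condition: values is non-empty and 0 < p <= 100.
// Post-condition: the result is one of the input values.
function percentile(values: number[], p: number): number {
    if (values.length === 0 || p <= 0 || p > 100) {
        throw new Error('percentile: pre-condition violated');
    }
    const sorted = values.slice().sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[rank - 1];
}

// Property check: the invariants must hold for many random inputs,
// not just for a few hand-picked examples.
function checkPercentileProperties(runs: number, seed: number): void {
    const rng = makeRng(seed);
    for (let i = 0; i < runs; i++) {
        const size = 1 + Math.floor(rng() * 50);
        const values: number[] = [];
        for (let j = 0; j < size; j++) {
            values.push(Math.floor(rng() * 1000));
        }
        const p50 = percentile(values, 50);
        const p99 = percentile(values, 99);
        if (values.indexOf(p99) === -1) throw new Error('p99 is not an input value');
        if (p50 > p99) throw new Error('p50 must not exceed p99');
        if (percentile(values, 100) !== Math.max(...values)) throw new Error('p100 must equal the maximum');
    }
}

checkPercentileProperties(200, 42);
```

The three invariants (membership, monotonicity, and p100 equals the maximum) are also exactly the kind of statement that belongs in a spec handed to an agent.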


Soft Skills: What Separates Mid-Level from Senior from Principal

7. Write with Precision and Clarity

Writing is the highest-leverage soft skill for any engineer who wants to grow. Design documents, post-mortems, and stakeholder communications all require the same underlying capability: translating technical thinking into prose that creates shared understanding.

Practice this deliberately. Write a design document for every significant thing you build, using the full structure described in How Not to Write a Design Document: problem statement, proposal, trade-offs, alternatives considered, non-functional requirements, rollout plan. Show it to a senior engineer. Ask what questions it fails to answer. Design documents are also how you develop design skills. A bad design doc does exactly what a bad design does: it makes the solution sound inevitable, skips trade-offs, and pushes hard questions into implementation. That feels fast until production starts collecting interest on every shortcut.

8. Bring Clarity to Ambiguity

The most important skill a senior engineer develops is the ability to look at a fuzzy problem and make it concrete. This is the single most valued contribution that humans still provide in the agentic era: not the code, but the thinking that makes the code correct.

Ambiguity reduction works in both directions. On the problem side: understand the actual customer need before finalizing a solution, push back on specs that describe a solution rather than a problem, ask what the real constraint is. On the solution side: identify which design decisions are reversible versus which are one-way doors. Practice this in every design review, every planning discussion, every incident retrospective.

9. Build Alignment and Consensus

The transition from proficient to expert in any domain requires operating at the social level: building consensus among people with competing interests, aligning a technical direction through an organization that has other priorities, and navigating disagreements constructively. The trust equation from Maister et al. shows that trust has four components: credibility, reliability, intimacy (safety), and self-orientation (does it serve the system or you?). Engineers who lose influence at the senior and principal level almost always fail on the fourth element. Proposals that come across as serving “my architecture” rather than “our actual problem” collapse trust fast.

Build alignment by listening before proposing. Spend time understanding what actually hurts the team before advocating for a technical direction. Frame proposals in terms of reduced toil, reduced uncertainty instead of architectural purity. Find a long-standing pain and solve it visibly. The Aikido principle from Jerry Weinberg applies here: center, enter, turn. First be aware of yourself and what you want to accomplish. Then enter the world of the other person. Then together turn the energy in a more effective direction.

10. Communicate Upward in Business Terms

Translating technical decisions into business impact separates senior engineers from principals. It is the ability to tell a VP, concisely, what the risk is and what it costs to address it. Learn the metrics that matter to leadership: revenue impact, customer retention, incident cost, deployment frequency, engineer productivity. Practice expressing technical proposals in those terms. This is the failure mode highlighted in How Senior Engineers Lose Trust: communicating technical complexity without translating it into business impact, and focusing on engineering outputs rather than business outcomes.

11. Estimate Honestly and Decompose Work Well

Engineers who consistently underestimate erode trust. Engineers who consistently overestimate become known as blockers. Honest estimation with explicit uncertainty ranges, clear assumptions, and candid identification of the biggest risks is a key skill.

Three practices make estimation better. First, decompose into vertical slices, not horizontal layers. A vertical slice cuts through all layers and produces something independently demoable. Horizontal slicing delays feedback because you don’t know if the feature works until the last layer is complete. Second, use three-point estimation for commitments: (Best + 4×MostLikely + Worst) / 6, and present ranges rather than single numbers. Third, remember that capacity is never 100%. Budget explicitly for KTLO (keep-the-lights-on) work such as operations, incident response, and technical debt.
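
To make the three-point formula concrete, here is a minimal sketch. The function name threePointEstimate, the example numbers, and the (Worst - Best) / 6 spread are illustrative assumptions; the spread is the conventional PERT approximation and is not part of the formula quoted above.

```typescript
// Three-point (PERT-style) estimate: the most likely case is weighted 4x.
interface ThreePointInput {
  best: number;       // optimistic duration, e.g. in days
  mostLikely: number; // realistic duration
  worst: number;      // pessimistic duration
}

interface EstimateRange {
  expected: number; // (best + 4 * mostLikely + worst) / 6
  spread: number;   // (worst - best) / 6, an assumed PERT-style approximation
  low: number;      // expected - spread
  high: number;     // expected + spread
}

function threePointEstimate({ best, mostLikely, worst }: ThreePointInput): EstimateRange {
  if (best > mostLikely || mostLikely > worst) {
    throw new Error('expected best <= mostLikely <= worst');
  }
  const expected = (best + 4 * mostLikely + worst) / 6;
  const spread = (worst - best) / 6;
  return { expected, spread, low: expected - spread, high: expected + spread };
}

// Hypothetical feature: best case 3 days, most likely 5, worst case 13.
const estimate = threePointEstimate({ best: 3, mostLikely: 5, worst: 13 });
console.log(`${estimate.low.toFixed(1)} to ${estimate.high.toFixed(1)} days`);
// prints "4.3 to 7.7 days"
```

Presenting the low and high values, rather than the expected value of 6 alone, keeps the uncertainty visible to whoever receives the commitment.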

12. Own Outcomes Beyond Your Code

The clearest signal of an engineer ready for senior responsibility is willingness to own the work nobody wants to do: the failing test that has been skipped for months, the runbook that was never written, the technical debt accumulating in the corner nobody touches, the onboarding documentation that every new hire struggles with. This is what some call being the janitor, taking responsibility for team health and code health. It builds organizational trust faster than any individual feature. Own incidents that aren’t yours. When a production problem occurs on your team, treat it as your problem regardless of who wrote the code. In Writing Post Mortems That Actually Make You Better: A Practitioner’s Guide, I explained how to use the Five Whys and the Swiss Cheese model for documenting incident post-mortems.

13. Become a Go-To Person

Focused expertise builds the kind of reputation that earns you higher-impact work. The path to being a go-to person has three branches: project ownership, technology expertise, and domain expertise. Pick one to start. Host a learning session on something you know well. Write about it internally. Help others who are stuck on it.

14. Mentor Others

Teaching is one of the fastest ways to consolidate your own understanding. When you explain a design decision to a junior engineer, you discover exactly what you do and don’t understand. When you give code review feedback that helps someone see a flaw they missed, you sharpen your own eye.

In the agentic era, junior engineers need mentorship more than ever because the traditional mechanism of learning through building and breaking code is less available. Senior engineers who help juniors understand why AI-generated code works the way it does, how to critique it structurally, and how to reason about trade-offs are providing something genuinely important. The psychological safety research from Google’s Project Aristotle applies here: teams where members feel safe raising concerns, asking questions, and challenging designs outperform teams where they don’t. You build that culture one mentoring conversation at a time.


The T-Shape and Broken Comb Model

The most useful framework for thinking about hard skill investment is the T-shape: one area of genuine depth (the vertical bar) combined with broad familiarity across adjacent areas (the horizontal bar). As engineers progress toward principal level, the shape often becomes what practitioners call a broken comb: multiple verticals of depth across different domains, connected by broad horizontal understanding. A principal engineer might go deep in distributed systems, in observability, and in the security model of their specific domain, while maintaining enough breadth to lead design conversations across the full stack.


A Concrete Self-Guided Growth Plan

Here is a practical, time-bounded path for engineers at each stage.

If you are a junior engineer (0–3 years):

  • Write a one-page spec before prompting an agent. Compare what the agent produced to what you specified.
  • Ask to implement at least one non-trivial feature entirely yourself, even if it takes longer.
  • Read The Pragmatic Programmer and Code Complete.
  • Request structured feedback on every code review you submit.
  • Use agents to explain generated code you don’t understand.

If you are a mid-level engineer (3–6 years):

  • Write a full design document for the next significant feature you build. Share it with a senior engineer and ask specifically what questions it fails to answer.
  • Own one domain on your team completely: its documentation, its monitoring, its failure modes, its onboarding.
  • Start hosting one internal learning session per quarter on something you know well. Write it up afterward.
  • Apply a structured two-pass review to every pull request you review. Track what you catch over a month.
  • Read A Philosophy of Software Design. Apply the deletion test and bounded context thinking to your current codebase.
  • Write one post-mortem per incident using the Five Whys structure.

If you are a senior engineer aiming for staff/principal:

  • Lead one project that coordinates work across multiple engineers. Own the design and run the design review. Drive the post-project retrospective.
  • Translate one technical proposal into business impact language: metrics, incident cost, customer effect.
  • Mentor junior engineers specifically in how to critically evaluate AI-generated code.
  • Identify the most painful systemic problem on your team, the thing everyone complains about and nobody fixes. Fix it, document it, and share what you learned.

Ongoing, at every level:

Keep your hands in the code. The fraction of code that engineers write themselves will keep shrinking, but understanding what the code does, how it fits the larger system, and what its failure modes are requires someone who can read it critically, reason about it deeply, and debug it under pressure.


What We Cannot Give Up

There is real pressure in many organizations to reduce engineering involvement in requirements, design, and review, and to automate the entire lifecycle. This deserves serious assessment. Agents accelerate delivery, but they do not absorb accountability. When code fails in production, the customer doesn’t care whether the bug was introduced by a human or a model. The engineer who approved it is responsible. Code review, even when partially automated, still requires human engineers who understand the system well enough to know what they’re reviewing. Junior engineers who bypass the developmental stages that build that understanding will produce reviews that miss what matters. Organizations that accept this trade-off in exchange for short-term velocity will eventually pay compounding interest.

The goal is not to resist agentic coding. The productivity gains are real and the trend is irreversible. The goal is to keep all three in check: technical debt in the code, cognitive debt in the team’s shared understanding, and intent debt in the artifacts. Agentic coding, used carelessly, accelerates all three simultaneously.


June 13, 2026

The Reusability Trap: When DRY Becomes a Liability

Filed under: Computing,Technology — admin @ 11:32 am

Reusability sounds like an obvious good practice. Write it once, use it everywhere. The “Don’t Repeat Yourself” (DRY) principle was popularized by The Pragmatic Programmer. Every senior developer preaches it. But the most expensive production bugs I’ve seen didn’t come from code that was duplicated. They came from code that was shared when it shouldn’t have been. This post is about what happens when reusability becomes an obsession. I’ll show you the patterns that cause the most damage, and what to do instead. And I’ll end with a new angle that I think is underappreciated: why agentic AI coding assistants work dramatically better on well-designed, modular codebases and how the reusability trap actively makes them worse.


The Prophets Already Warned Us

The software industry has been here before. Fred Brooks warned about over-engineering in The Mythical Man-Month (1975):

“The general tendency is to over-design the second system, using all the ideas and frills that were cautiously sidetracked on the first one.”

Brooks also observed something cutting about reuse in practice: barriers to reuse sit on the consumer side, not the producer side. Yourdon estimated that reusable components require twice the effort of a one-shot component. Brooks put the multiplier at three. Parnas put it plainly:

“Reuse is something that is far easier to say than to do. Doing it requires both good design and very good documentation. Even when we see good design, which is still infrequently, we won’t see the components reused without good documentation.”

More recently, Sandi Metz landed on the same truth from a different angle:

“Duplication is far cheaper than the wrong abstraction.”

And Rob Pike, in the Go Proverbs:

“A little copying is better than a little dependency.”

These aren’t arguments against sharing code. They’re arguments against sharing code prematurely before the right abstraction reveals itself. The cost of the wrong abstraction is front-loaded with apparent savings and back-loaded with compounding debt.


Part 1: Inheritance, The Reuse That Keeps on Costing

The Promise vs. The Reality

One of the pillars of object-oriented languages is inheritance for reuse. Two classes share behavior? Extract a base class, done. Here’s the actual cost breakdown:

Approach | Cost to Create | Cost to Change | Bug Blast Radius
Duplicated code (2 copies) | | 2× (independent) | Local
Shared base class (inheritance) | 0.8× | 5–20× (understand all subclasses) | Cascading
Composition | 1.2× | 1× (swap implementation) | Local

The savings of inheritance are front-loaded. Every future change requires understanding the entire hierarchy. In a system with 100+ subclasses, that’s not a 20% savings, it’s a 2000% tax on every modification.

Anti-Pattern: The Fragile Base Class

I worked on a system where a senior executive was obsessed with reusability. The result was inheritance chains 10 levels deep. The worst example: a control-plane listener that inherited from a data-plane input class, just to reuse TCP socket handling.

WorkerListener --> TcpDataInput --> BaseTcpIn --> BaseInput --> BaseStatusReporter --> Serviceable --> EventEmitter

The listener’s actual job was: accept TCP connections from workers, validate auth tokens, register workers, distribute config bundles, and receive heartbeats. But it inherited an event processing pipeline it never used, IP whitelisting via regex it never used, proxy protocol support it never used, and socket idle timeouts that could kill healthy long-lived worker connections.

This nested hierarchy was a continuous source of bugs: a change in a parent class propagated to every descendant and broke products that inherited behavior they never expected.

The Stability Trap

There’s a design principle that explains exactly why the fragile base class is so dangerous. The Stable Dependencies Principle (SDP), from Agile Software Development, says:

Depend in the direction of stability. A component is stable when many things depend on it and few things it depends on can change underneath it. A component is instable when few things depend on it and it changes frequently. The principle gives you a metric for this:

I = Ce / (Ca + Ce)

Where Ca is the number of components that depend on the component (afferent couplings, things that would break if you changed it), and Ce is the number of components it depends on (efferent couplings, things that could change and break it). I = 0 means maximally stable (everyone depends on it, it depends on nothing). I = 1 means maximally instable (nothing depends on it, it depends on everything).
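
As a rough illustration of the metric, the sketch below computes I from a small dependency graph. WorkerListener, TcpDataInput, and BaseTcpIn come from the hierarchy shown earlier; SyslogInput, HecInput, and the exact edges are hypothetical, and the instability helper is an assumed name.

```typescript
// Dependency graph: each key depends on the components listed in its array.
type DependencyGraph = Record<string, string[]>;

const graph: DependencyGraph = {
  WorkerListener: ['TcpDataInput'],
  SyslogInput: ['TcpDataInput'], // hypothetical data-plane consumer
  HecInput: ['TcpDataInput'],    // hypothetical data-plane consumer
  TcpDataInput: ['BaseTcpIn'],
  BaseTcpIn: [],
};

// I = Ce / (Ca + Ce). A component with no couplings at all is reported as 0.
function instability(deps: DependencyGraph, component: string): number {
  // Efferent couplings: what this component depends on.
  const ce = (deps[component] || []).length;
  // Afferent couplings: how many components depend on this one.
  const ca = Object.keys(deps).filter(
    (other) => deps[other].indexOf(component) !== -1,
  ).length;
  return ca + ce === 0 ? 0 : ce / (ca + ce);
}

for (const component of Object.keys(graph)) {
  console.log(component, instability(graph, component).toFixed(2));
}
```

In this toy graph WorkerListener scores 1.00, TcpDataInput scores 0.25 (three dependents, one dependency), and BaseTcpIn scores 0.00.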

The SDP rule: if component A depends on component B, then B’s instability score should be lower than A’s. You should depend on things that are more stable than you are, never less. Now look at what inheritance actually does to these scores.

TcpDataInput in the example above has many consumers: it’s a shared base class used across the data plane, so its Ca is high. That makes it look stable. But it’s also an actively maintained class that changes as data-plane requirements evolve: new connectors, security patches, protocol changes. Every change is a potential breaking change for every class that inherits from it.

Inheritance creates a hidden stability inversion. The base class looks stable by the metric (high Ca, so a low I score), but in practice it is volatile: it changes for reasons its consumers don’t control. Every subclass secretly depends on something less stable than it appears.

This is why the principle matters beyond just “don’t change base classes carelessly.” The architecture itself needs to route dependencies in the direction of stability. Abstract interfaces are maximally stable (I = 0 by definition — they contain no implementation to change). Concrete implementations are instable. So:

  • Stable components should depend on abstract interfaces, not concrete implementations.
  • Instable components (leaf classes, frequently changing logic) should sit at the edge, depending inward toward stable abstractions.

The LSP Smell Test

Liskov Substitution Principle says: if S is a subtype of T, you should be able to substitute S anywhere T is expected without breaking anything. You’re violating LSP and inheritance is the wrong tool when you find yourself:

  • Overriding methods just to disable inherited behavior
  • Checking instanceof in calling code
  • Adding if (this instanceof ChildClass) in the parent
  • Setting this.checkDiskUsage = new NOOPDiskUsageChecker() in the constructor

I’ve seen a RingBufferOut that extended FileSystemOutput and used approximately 200 lines of it, a 5% utilization rate. It disabled disk usage checking, eliminated staging/upload separation, disabled orphan file reconciliation, and completely overrode bucket naming and retention logic. The ring buffer carried 2,700 lines of dead weight: cloud upload logic, parquet format support, staging directory management, none of which it used. The “savings” from inheritance were illusory. The dead weight made every change a minefield.

The rule of thumb: if you override more than 30% of inherited methods, or disable features in your constructor, you want composition, not inheritance.

Anti-Pattern: The Serviceable Base That Taxes Everything

A “Serviceable” base class forced EventEmitter onto 102 subclasses:

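import { EventEmitter } from 'events';

class Serviceable extends EventEmitter {
  private static INSTANCES: Serviceable[] = []; // Global tracking
  private serviceInterval: ReturnType<typeof setInterval>;

  constructor(interval: number) {
    super(); // EVERY subclass is now an EventEmitter, whether it emits or not
    Serviceable.INSTANCES.push(this);
    this.serviceInterval = setInterval(() => this.service(), interval);
  }

  // Periodic work; subclasses override this
  protected service(): void {}

  destroy(): void { clearInterval(this.serviceInterval); }

  static destroyAll(): void {
    Serviceable.INSTANCES.forEach(s => s.destroy()); // kills everything, all at once
  }
}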

// Result: 102 classes inherit this. Many NEVER emit events:
class DiskUsageReporter extends Serviceable {}  // never emits
class BackupManager extends Serviceable {}       // never emits
class HealthMonitor extends Serviceable {}       // never emits
class MetricsBatcher extends Serviceable {}      // never emits

The reasoning was: many components need a periodic timer, and EventEmitter is useful, let’s put both in a base class for reusability. The result: 102 classes carry EventEmitter’s overhead regardless of whether they ever emit a single event. Worse, the static INSTANCES array creates hidden coupling between all 102 subclasses. A destroyAll() call kills backup managers, metric batchers, and health monitors indiscriminately, no lifecycle ordering, no dependency-aware shutdown.

Fix it with composition:

// Timer is a composable utility, not an inheritance tax
class ServiceTimer {
  private handle?: ReturnType<typeof setInterval>;

  constructor(private callback: () => Promise<void>, private intervalMs: number) {}
  start(): void { this.handle = setInterval(() => this.callback(), this.intervalMs); }
  stop(): void { clearInterval(this.handle); }
}

class MetricsBatcher {
  private timer: ServiceTimer;

  constructor(interval: number) {
    this.timer = new ServiceTimer(() => this.flush(), interval);
  }

  // Lifecycle is explicit: the owner decides when the timer runs
  start(): void { this.timer.start(); }
  stop(): void { this.timer.stop(); }

  private async flush(): Promise<void> { /* send the current batch */ }
  // No EventEmitter. No global instance tracking. No forced API surface.
}

Each class composes only what it needs. Lifecycle is explicit. Testing is trivial.

Anti-Pattern: Depth-5 Inheritance for a Simple HTTP POST

The SaaS observability output in one system needed to POST metrics to a single endpoint with an API key and gzip compression. Reasonable enough. But it inherited from a 5-level chain:

BaseOutputter (~1K LOC) --> HTTPOut (~2K LOC) --> HTTPLoadBalancedOut (~400 lines)
  --> BatchedHTTPOut (~200 lines) --> BaseSaaSOut --> VendorMetricsOut

Total inherited before any vendor-specific code: ~4K lines. What the SaaS output actually needed: POST to one endpoint, one API key header, gzip compression, retry on 429/5xx. What it actually inherited: DNS resolution, endpoint health tracking, weighted routing, full request construction across TLS and proxy, cookie management, pipeline wiring, and backpressure signaling. Developers knew it was wrong. A TODO in production code said:

// TODO: create new class that handles multiple HTTP destinations
// instead of cascading inheritance chain

But inheritance makes fixing it prohibitively expensive. Every existing subclass depends on the hierarchy. The wrong abstraction becomes load-bearing. Fix it with a middleware stack (decorator pattern):

type HttpMiddleware = (req: HttpRequest, next: NextFn) => Promise<HttpResponse>;

const retrying: HttpMiddleware = (req, next) => retryWithBackoff(next, req, { maxRetries: 3 });
const compressing: HttpMiddleware = (req, next) =>
  next({ ...req, body: gzip(req.body), headers: { ...req.headers, 'Content-Encoding': 'gzip' }});
const authenticating = (apiKey: string): HttpMiddleware =>
  (req, next) => next({ ...req, headers: { ...req.headers, 'DD-API-KEY': apiKey }});

class SaaSMetricOutput {
  private transport: HttpTransport;

  constructor(config: SaaSOutputConfig) {
    // Build transport as middleware instead of inheriting a ~4K-line chain
    this.transport = buildTransport([
      authenticating(config.apiKey),
      compressing,
      retrying,
    ]);
  }
}

The SaaS output shrinks to ~100 lines. Adding a new vendor requires composing the right middleware, not reading a 5-level hierarchy.
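
The snippet above calls buildTransport without showing it. One way it could work is sketched below, under assumptions: the HttpRequest and HttpResponse shapes are minimal stand-ins, buildTransport takes the terminal sender as a second argument (the original call passes only the middleware array), and the header name and fake sender are purely illustrative.

```typescript
// Hypothetical minimal types; the real request/response types are not shown in the post.
interface HttpRequest { url: string; headers: Record<string, string>; body: string; }
interface HttpResponse { status: number; }
type NextFn = (req: HttpRequest) => Promise<HttpResponse>;
type HttpMiddleware = (req: HttpRequest, next: NextFn) => Promise<HttpResponse>;

// Wrap the terminal sender from the inside out, so the FIRST middleware in the
// array becomes the outermost layer and sees the request first.
function buildTransport(middlewares: HttpMiddleware[], send: NextFn): NextFn {
  return middlewares.reduceRight<NextFn>(
    (next, middleware) => (req) => middleware(req, next),
    send,
  );
}

// Two toy middlewares for illustration.
const authenticating = (apiKey: string): HttpMiddleware =>
  (req, next) => next({ ...req, headers: { ...req.headers, 'X-Api-Key': apiKey } });

const retrying = (maxRetries: number): HttpMiddleware => async (req, next) => {
  let res = await next(req);
  for (let attempt = 0; attempt < maxRetries && (res.status === 429 || res.status >= 500); attempt++) {
    res = await next(req); // a real implementation would back off between attempts
  }
  return res;
};

// Fake sender standing in for a real HTTP client: fails once with 503, then succeeds.
const seen: HttpRequest[] = [];
const fakeSend: NextFn = async (req) => {
  seen.push(req);
  return { status: seen.length === 1 ? 503 : 200 };
};

const transport = buildTransport([authenticating('secret'), retrying(3)], fakeSend);
const pending = transport({ url: 'https://example.invalid/metrics', headers: {}, body: '{}' });
pending.then((res) => console.log(res.status, seen.length)); // 200 2
```

Because each layer only knows about next, a vendor-specific output can add, drop, or reorder layers without touching the others.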

Anti-Pattern: Empty Subclasses as Configuration

A system had 12 subclasses of an S3-compatible output. Seven were empty:

export class StorjS3Out extends S3Output {}        // 3 lines
export class CloudflareR2Out extends S3Output {}   // 3 lines
export class AlibabaCloudS3Out extends S3Output {} // 3 lines
// Each carries 4,500+ lines: local staging, orphan reconciliation,
// parquet writing, dead letter dirs — for cloud providers that need none of it

Each existed only for type registration in a factory map. Variant behavior is configuration, not subclasses:

const S3_PROVIDERS: Record<string, S3ProviderConfig> = {
  storj:         { pathStyle: true, region: 'global' },
  cloudflare_r2: { pathStyle: true, region: 'auto' },
  alibaba:       { pathStyle: false, endpoint: '{region}.aliyuncs.com' },
};
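
A short sketch of how such a table might be consumed follows. The S3ProviderConfig shape is inferred from the entries above, while resolveProvider, its error message, and the region value in the usage line are hypothetical.

```typescript
// Assumed config shape, inferred from the provider table.
interface S3ProviderConfig {
  pathStyle: boolean;
  region?: string;
  endpoint?: string; // may contain a {region} placeholder
}

const S3_PROVIDERS: Record<string, S3ProviderConfig> = {
  storj:         { pathStyle: true, region: 'global' },
  cloudflare_r2: { pathStyle: true, region: 'auto' },
  alibaba:       { pathStyle: false, endpoint: '{region}.aliyuncs.com' },
};

// One lookup takes the place of the empty subclasses and their factory registrations.
function resolveProvider(providerName: string, region?: string): S3ProviderConfig {
  if (!Object.prototype.hasOwnProperty.call(S3_PROVIDERS, providerName)) {
    throw new Error(`Unknown S3 provider: ${providerName}`);
  }
  const base = S3_PROVIDERS[providerName];
  const effectiveRegion = region !== undefined ? region : base.region;
  // Fill in the {region} placeholder only when both pieces are available.
  const endpoint = base.endpoint && effectiveRegion
    ? base.endpoint.replace('{region}', effectiveRegion)
    : base.endpoint;
  return { ...base, region: effectiveRegion, endpoint };
}

console.log(resolveProvider('alibaba', 'oss-cn-hangzhou').endpoint);
// prints "oss-cn-hangzhou.aliyuncs.com"
```

Adding a provider then means adding one entry to the table, with no new class to register.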

The Fix: Composition with Focused Interfaces

Each composed dependency has a focused interface. You can swap IWorkerAuth for mTLS without touching transport. You can test connection tracking with a fake server. A bug fix in data-plane TLS cannot reach WorkerListener.
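
A minimal sketch of what a composed WorkerListener might look like is shown below. Only IWorkerAuth is named in the text; IControlTransport, IConnectionTracker, and the three test doubles are hypothetical names introduced for illustration.

```typescript
// Focused interfaces: each one covers a single concern.
interface IWorkerAuth {
  validate(token: string): boolean;
}
interface IConnectionTracker {
  register(workerId: string): void;
  count(): number;
}
interface IControlTransport {
  // Calls the handler for every incoming worker handshake.
  onHandshake(handler: (workerId: string, token: string) => boolean): void;
}

class WorkerListener {
  constructor(
    private transport: IControlTransport,
    private auth: IWorkerAuth,
    private connections: IConnectionTracker,
  ) {}

  start(): void {
    this.transport.onHandshake((workerId, token) => {
      if (!this.auth.validate(token)) return false; // reject unauthenticated workers
      this.connections.register(workerId);
      return true;
    });
  }
}

// Test doubles: no sockets and no data-plane classes are needed.
class FakeTransport implements IControlTransport {
  private handler: (workerId: string, token: string) => boolean = () => false;
  onHandshake(handler: (workerId: string, token: string) => boolean): void { this.handler = handler; }
  simulate(workerId: string, token: string): boolean { return this.handler(workerId, token); }
}
class FixedTokenAuth implements IWorkerAuth { // test double only, not a production scheme
  constructor(private expected: string) {}
  validate(token: string): boolean { return token === this.expected; }
}
class InMemoryTracker implements IConnectionTracker {
  private ids: string[] = [];
  register(workerId: string): void { if (this.ids.indexOf(workerId) === -1) this.ids.push(workerId); }
  count(): number { return this.ids.length; }
}

const fakeTransport = new FakeTransport();
const tracker = new InMemoryTracker();
new WorkerListener(fakeTransport, new FixedTokenAuth('t0k3n'), tracker).start();

const accepted = fakeTransport.simulate('worker-1', 't0k3n');
const rejected = fakeTransport.simulate('worker-2', 'bad');
console.log(accepted, rejected, tracker.count()); // true false 1
```

Swapping FixedTokenAuth for an mTLS-backed implementation would change one constructor argument and nothing else in the listener.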


Part 2: Cyclomatic Complexity, The Tax on Reused Code

When a class serves five different purposes, every execution path has to be guarded. When a module supports four modes, the mode checks spread like mold into every file that imports it. In one real system: 320 files contained topology checks (isLeader, isWorker, isEdge). 186 files checked feature flags deep in domain logic. 488 files accessed process.env directly. This is the direct consequence of reusing the same codebase to serve incompatible purposes.

// This pattern, scattered across hundreds of files:
if (ProcessInfo.isLeaderMode()) {
  this.startDistributedLeader();
  if (FeatureFlags.check('search-v2')) { /* ... */ }
  if (license.tier === 'enterprise') { /* ... */ }
} else if (ProcessInfo.isWorkerMode()) {
  this.connectToLeader();
  if (ProcessInfo.isRunningInCloud()) { /* ... */ }
} else if (ProcessInfo.isEdgeMode()) {
  this.startMinimalPipeline();
  if (FeatureFlags.check('edge-metrics')) { /* ... */ }
}

Every new mode requires touching 20+ files. You cannot test one mode without loading all mode code. Cyclomatic complexity of a single bootstrap method exceeds 20. Adding a deployment mode means auditing hundreds of files for hidden conditionals.

Anti-Pattern: Feature Flags as Global Conditionals

The same problem appears with feature flags. When they’re scattered inline across 186+ files, they become indistinguishable from mode checks, entitlement checks, and license checks, all mixed together:

if (FeatureFlags.check('auth-token')) {
  const { TokenStore } = require('./auth/TokenStore');
  rpc.register(new TokenStore(conf), TokenStore.ID);
}
if (FeatureFlags.check('data-insights') && Product.isWorker(mode)) { /* ... */ }
if (FeatureFlags.check('search') && license.tier === 'enterprise') { /* ... */ }

The fix is to resolve capabilities once at startup and inject them as either real implementations or no-ops:

interface ISearchCapability {
  registerEndpoints(router: Router): void;
  executeQuery(query: Query): Promise<Results>;
}

class NoOpSearch implements ISearchCapability {
  registerEndpoints(): void { /* no-op */ }
  async executeQuery(): Promise<Results> { return Results.empty(); }
}

// Resolve ONCE at startup — never scattered inline
function resolveCapabilities(flags: FeatureFlags, license: License): AppCapabilities {
  return {
    search: flags.check('search') && license.allows('search')
      ? new SearchModule(config)
      : new NoOpSearch(),
  };
}

// Boot is clean
async function boot(caps: AppCapabilities, router: Router): Promise<void> {
  caps.search.registerEndpoints(router); // dead code path simply doesn't exist
}

The Fix: Strategy Pattern + Policy Injection

Define behavior as strategy interfaces. Create one implementation per mode. Resolve the policy once at startup, everything else receives it:

class NodePolicyFactory {
  static create(role: NodeRole, license: License): NodePolicy {
    // THIS is the ONLY place that mode-switches
    switch (role) {
      case 'leader': return {
        processing: { maxWorkers: 0, enableSearch: true },
        behavior: new LeaderBehavior(),
      };
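      case 'edge': return {
        processing: { maxWorkers: 1, maxHeapMB: 512, enableSearch: false },
        behavior: new EdgeBehavior(),
      };
      default:
        // Remaining roles (e.g. 'worker') need their own case; fail loudly until then
        throw new Error(`Unsupported node role: ${role}`);
    }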
  }
}

// All other code receives the policy — zero mode checks
class PipelineEngine {
  constructor(private policy: NodePolicy) {}
  async start(): Promise<void> {
    const workerCount = this.policy.processing.maxWorkers; // no if-else
  }
}

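The number of places that branch on mode, flag, or tier drops from O(modes × flags × tiers) conditionals scattered across the codebase to a single decision point at startup.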


Part 3: The God Class, Reuse at the Wrong Granularity

When developers try to build a “reusable” class that serves many purposes, they often produce a God Class: a single class that does everything so it can serve everyone. One system had classes like:

File | Lines | Responsibilities
ApplicationServer | ~2K LOC | Bootstrap, mode detection, process spawning, metrics, REST startup, shutdown
FileSystemOutput | ~3K LOC | Staging, upload, cleanup, metrics, parquet, reconciliation
ProcessManager | ~1.5K LOC | Process lifecycle, metrics init, license, git, config helpers, warm pool
HttpBaseInput | ~2K LOC | HTTP server, TLS, health, auth, parsing, compression, routing, proxy
RemoteConnection | ~2.5K LOC | Worker lifecycle, config push, metrics, commands, upgrades

The problem isn’t the line count. It’s that every responsibility changes for different reasons at different times. When the metrics subsystem needs a change, you’re editing the same file that controls TLS configuration. When a new output format is added, you’re touching the same class that manages staging directories.

HttpBaseInput is a good example of the architectural layer problem. It mixed transport (TCP socket management, TLS), protocol (NDJSON parsing, compression), authentication (token validation, auth state machine), application logic (field extraction, time parsing), metrics (request counts, latency histograms), and load balancing, all in one class. Every HTTP-based input (Splunk HEC, OTLP, Elastic, Datadog) inherited all ~2K lines. Changing the TLS configuration risked disrupting field extraction. Adding a health endpoint risked breaking authentication middleware. Fix it by separating layers:

// Each layer is independent — compose at construction time
class SplunkHecInput {
  constructor(
    private transport: IHttpServer,        // Layer 1: socket, TLS
    private auth: IAuthenticator,          // Layer 2: token validation
    private protocol: ISplunkHecParser,    // Layer 3: /services/collector format
    private pipeline: IEventSink,          // Layer 4: deliver events downstream
    private metrics: IInputMetrics,        // Cross-cutting: counters, latency
  ) {}
}
// Changing TLS (transport) cannot break Splunk parsing (protocol)
// Testing protocol parsing requires NO HTTP server — just pass mock events

Part 4: Missing Layers, REST Endpoints Doing Direct I/O

Here’s a less obvious form of the same problem. REST handlers that reach directly into the filesystem:

class AppsEndpoint {
  async handlePut(req: Request): Promise<Response> {
    await writeFile(targetPath, req.body);          // direct fs
    await mkdir(artifactDir, { recursive: true });
    const files = await readdir(configDir);
    // No abstraction, no transaction, no testability
  }
}

This prevents swapping storage backends, adding transaction semantics, unit testing without filesystem mocks, and centralized corruption detection. The application layer reaches straight into the filesystem with no persistence layer in between, a layer violation that makes both concerns impossible to change independently. The fix is a persistence abstraction:

interface IConfigStore {
  read(path: ConfigPath): Promise<Buffer>;
  write(path: ConfigPath, data: Buffer): Promise<void>;
  transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T>;
}

class AppsEndpoint {
  constructor(private store: IConfigStore) {}

  async handlePut(req: Request): Promise<Response> {
    await this.store.transaction(async (tx) => {
      await tx.write(targetPath, req.body);
      await tx.write(metadataPath, metadata);
      // Atomic: both succeed or both roll back
    });
  }
}
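
To show why this abstraction helps testing, here is a sketch of an in-memory store that buffers transactional writes and commits them only if the callback succeeds. It makes several assumptions: ConfigPath is a plain string, payloads are strings rather than Buffer, IConfigTransaction exposes only write, and InMemoryConfigStore with its has helper is a hypothetical test double that does not handle concurrent transactions.

```typescript
type ConfigPath = string; // assumption: the real type is not shown in the post

interface IConfigTransaction {
  write(path: ConfigPath, data: string): Promise<void>;
}
interface IConfigStore {
  read(path: ConfigPath): Promise<string>;
  write(path: ConfigPath, data: string): Promise<void>;
  transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T>;
}

class InMemoryConfigStore implements IConfigStore {
  private files = new Map<ConfigPath, string>();

  async read(path: ConfigPath): Promise<string> {
    const data = this.files.get(path);
    if (data === undefined) throw new Error(`Not found: ${path}`);
    return data;
  }

  async write(path: ConfigPath, data: string): Promise<void> {
    this.files.set(path, data);
  }

  async transaction<T>(fn: (tx: IConfigTransaction) => Promise<T>): Promise<T> {
    const staged = new Map<ConfigPath, string>();
    const tx: IConfigTransaction = {
      write: async (path, data) => { staged.set(path, data); },
    };
    const result = await fn(tx); // if fn throws, the commit below never runs
    staged.forEach((data, path) => { this.files.set(path, data); });
    return result;
  }

  has(path: ConfigPath): boolean { return this.files.has(path); }
}

async function demo(): Promise<boolean[]> {
  const store = new InMemoryConfigStore();
  await store.transaction(async (tx) => {
    await tx.write('apps/a/config.json', '{}');
    await tx.write('apps/a/meta.json', '{"v":1}');
  });

  let failed = false;
  try {
    await store.transaction(async (tx) => {
      await tx.write('apps/b/config.json', '{}');
      throw new Error('metadata validation failed');
    });
  } catch {
    failed = true;
  }
  // The committed file exists, the second transaction failed, and it left no partial write.
  return [store.has('apps/a/meta.json'), failed, store.has('apps/b/config.json')];
}

const outcome = demo();
outcome.then((flags) => console.log(flags)); // [ true, true, false ]
```

An endpoint written against IConfigStore can be unit tested with this double and later pointed at a filesystem or object-store implementation without changes.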

Part 5: CRUD as Architecture, Generic APIs That Serve Nobody

CRUD generators are another form of pathological reuse. One model, one handler, one UI pattern for everything. They deliver APIs optimized for the database schema rather than user intent.

// "Reusable" CRUD generator applied to 40+ resources
createCrudEndpoints('workers', workerSchema, workerStore);
createCrudEndpoints('pipelines', pipelineSchema, pipelineStore);

// PUT /workers/:id demands ALL 10 fields, even though:
//   "Rename a worker" only needs { description }
//   "Move to a group" only needs { group }
//   "Scale up" only needs { maxProcesses, heapSizeMB }

Callers must research which fields matter for their specific operation. Concurrent callers doing GET --> modify one field --> PUT back can overwrite each other’s changes (a lost-update race). The fix is to model what users actually do, not what the database stores.
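
A minimal sketch of an intent-based alternative is shown below. The three operations mirror the ones listed in the comments above; WorkerRecord, WorkerCommands, and the in-memory map are hypothetical stand-ins for a real model and store.

```typescript
// Hypothetical worker model; field names follow the operations listed above.
interface WorkerRecord {
  id: string;
  description: string;
  group: string;
  maxProcesses: number;
  heapSizeMB: number;
}

class WorkerCommands {
  private workers = new Map<string, WorkerRecord>();

  register(worker: WorkerRecord): void { this.workers.set(worker.id, { ...worker }); }
  get(id: string): WorkerRecord { return { ...this.load(id) }; }

  // One method per user intent, each accepting only the fields it needs.
  rename(id: string, description: string): void {
    this.load(id).description = description;
  }

  moveToGroup(id: string, group: string): void {
    this.load(id).group = group;
  }

  scale(id: string, maxProcesses: number, heapSizeMB: number): void {
    if (maxProcesses < 1 || heapSizeMB < 1) throw new Error('scale values must be positive');
    const worker = this.load(id);
    worker.maxProcesses = maxProcesses;
    worker.heapSizeMB = heapSizeMB;
  }

  private load(id: string): WorkerRecord {
    const worker = this.workers.get(id);
    if (!worker) throw new Error(`Unknown worker: ${id}`);
    return worker;
  }
}

const commands = new WorkerCommands();
commands.register({ id: 'w1', description: 'old name', group: 'default', maxProcesses: 2, heapSizeMB: 512 });

// Two callers act on the same worker without ever sending each other's fields.
commands.rename('w1', 'ingest-east');
commands.scale('w1', 4, 1024);
console.log(commands.get('w1'));
```

Because each command carries only its own fields, a rename cannot write back a stale maxProcesses value, and validation can be specific to the intent.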


Part 6: npm and the Dependency Chain Problem

Inheritance abuse at the code level has a direct analog at the package level. I used Perl’s CPAN extensively in the 1990s with the Mason web templating system. It worked beautifully until it didn’t. Then came Maven, pip, npm, RubyGems, Cargo. Each language built its own package ecosystem. Each package could depend on other packages, creating dependency trees that look like fractals. We never developed mature patterns for managing these at scale. The npm ecosystem exemplifies the chaos. In 2016, a developer unpublished left-pad, an 11-line function that padded strings with spaces. Thousands of projects broke overnight. Babel, React, and countless applications depended on it through layers of transitive dependencies. This pattern repeats. I’ve seen production applications import packages for:

  • is-odd / is-even: check if a number is odd (n % 2 === 1)
  • is-array: check array type (JavaScript has Array.isArray() built-in)
  • string-split: split text
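
For a sense of scale, here is roughly what those packages amount to when written inline. These are sketches of equivalent behavior, not the packages’ actual source, and the function names are my own. One detail worth noting: in JavaScript, n % 2 === 1 is false for negative odd numbers because -3 % 2 evaluates to -1, so the sketch compares the absolute value.

```typescript
// Plain functions standing in for the micro-packages listed above.
const isOdd = (n: number): boolean => Math.abs(n % 2) === 1; // handles negative odd numbers
const isEven = (n: number): boolean => n % 2 === 0;
const isArray = (value: unknown): boolean => Array.isArray(value);
const splitText = (text: string, separator: string): string[] => text.split(separator);

console.log(isOdd(-3), isEven(10), isArray([1, 2]), splitText('a,b,c', ','));
// prints: true true true [ 'a', 'b', 'c' ]
```

Each of these is a single line with no transitive dependencies to audit, pin, or patch.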

The MIT Sloan Management Review and ACM both document the risks of software reuse at scale. The core finding: reuse shifts risk from “building the wrong thing” to “inheriting the wrong dependency chain.” A single Go project might pull in hundreds of transitive dependencies, each a potential security vulnerability. Both risks are real. Only the first one gets measured.


Part 7: Reusing Security Tokens, The Shared Blast Radius

The most dangerous form of reuse isn’t in code. It’s in credentials.

class InstanceSettings {
  // One token — shared by every worker in the fleet of thousands
  authToken: string = crypto.randomBytes(16).toString('hex');
}

if (req.headers['x-auth-token'] !== this.authToken) {
  return res.status(401).json({ error: 'Unauthorized' });
}

A single compromised worker exposes every worker. Revoking one worker’s access requires rotating the shared secret for the entire fleet, a coordinated operation that takes the whole fleet offline simultaneously. In one system, we shared the same token between the control plane and the data plane as a reuse optimization. This caused innumerable bugs when the control plane changed its token scheme from opaque tokens to JWT. The fix is per-identity tokens with short TTLs:

class WorkerTokenIssuer {
  async issueToken(identity: WorkerIdentity): Promise<AccessToken> {
    return this.mint({
      sub: identity.clientId,           // unique per worker
      scopes: identity.scopes,           // minimal privilege
      exp: Date.now() + this.tokenTTLMs, // short-lived
      jti: ulid(),                        // unique — enables revocation
    });
  }

  async revokeWorker(clientId: string): Promise<void> {
    await this.revocationList.add(clientId);
    // Other 9,999 workers unaffected
  }
}

Systems that manage thousands of agents at scale typically issue per-agent certificates or short-lived tokens; Kubernetes, for example, gives each kubelet its own client certificate. Shared credentials aren’t a cost saving. They’re a single blast radius for your entire fleet.
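
The issuer above covers minting and revocation; the checking side could look roughly like the sketch below. It assumes the token’s signature has already been verified elsewhere (that step is deliberately omitted), that exp is in epoch milliseconds as in the issuer, and that revocation is tracked per client id. TokenVerifier, Verdict, and the sample tokens are hypothetical.

```typescript
// Hypothetical token shape mirroring the fields minted by the issuer.
interface AccessToken {
  sub: string;      // worker client id
  scopes: string[]; // granted privileges
  exp: number;      // expiry in epoch milliseconds
  jti: string;      // unique token id
}

type Verdict = 'ok' | 'expired' | 'revoked' | 'forbidden';

// Checks that run AFTER signature verification, which is not shown here.
class TokenVerifier {
  private revokedClients = new Set<string>();

  constructor(private now: () => number = () => Date.now()) {}

  revokeWorker(clientId: string): void {
    this.revokedClients.add(clientId);
  }

  verify(token: AccessToken, requiredScope: string): Verdict {
    if (token.exp <= this.now()) return 'expired';
    if (this.revokedClients.has(token.sub)) return 'revoked';
    if (token.scopes.indexOf(requiredScope) === -1) return 'forbidden';
    return 'ok';
  }
}

const verifier = new TokenVerifier(() => 1000); // fixed clock for the example
const workerA: AccessToken = { sub: 'worker-a', scopes: ['heartbeat'], exp: 2000, jti: 'a1' };
const workerB: AccessToken = { sub: 'worker-b', scopes: ['heartbeat'], exp: 2000, jti: 'b1' };

verifier.revokeWorker('worker-a');
console.log(verifier.verify(workerA, 'heartbeat')); // revoked
console.log(verifier.verify(workerB, 'heartbeat')); // ok: other workers are unaffected
```

Revoking worker-a touches one entry; no fleet-wide secret rotation is involved.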


Part 8: Shared Modules, How Common Code Slows Teams

Shared or “common” modules feel like the right call. One place for utilities, helpers, shared models. Every team uses the same battle-tested code. No duplication. In practice, these modules become the most contested real estate in the codebase.

Team A needs a small change to a shared validation function. They open a PR. But the common module is owned by a platform team that maintains a release cadence. Team A waits for the next release window. Team B is blocked on a different change to the same file. Both PRs conflict. The platform team spends a sprint mediating merge conflicts they didn’t create. I’ve seen this pattern repeat at multiple companies:

  • A common module starts as a home for genuinely shared utilities, timestamp parsing, config validation, ID generation.
  • Teams start adding features to it because “it’s already shared.” Team A adds a flag to change behavior for their use case. Team B adds a different flag. The module grows a conditional for every team’s edge case.
  • The module that was supposed to prevent duplication becomes the largest source of complexity, merge conflicts, and broken builds in the codebase.

Brooks identified the organizational dimension of this in The Mythical Man-Month: corporate-level reuse “implies changes in project accounting and measurement practices to give credit for reusability.” Teams get credit for shipping features, not for investing in shared infrastructure. The incentives push toward adding to common quickly, and away from the expensive work of designing a proper stable interface. The result is that common gets additions but rarely deletions, refinements, or principled breaking changes.

What works instead:

  • Narrow, stable libraries: utilities with pure functions (parseTimestamp, generateId), no state, no side effects. These can be shared safely because they have no behavior to conflict over.
  • Published interfaces, not shared implementations: agree on the contract, let each team implement. If two teams share an interface rather than a class, their implementations evolve independently.
  • Internal packages with semantic versioning: treat shared code like a real library. Pin versions per team. Make breaking changes intentionally and explicitly. Don’t silently couple release trains.
  • Copy for divergence: if Team A and Team B both need slightly different behavior from a shared function, copy it. Let each version evolve toward its actual use case. The right abstraction will reveal itself only after divergence, not before.
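
To make “published interfaces, not shared implementations” concrete, here is a minimal sketch. The EventValidator contract and the two team-owned validators are hypothetical names invented for illustration; the point is only that the shared artifact is a small interface and each team keeps its own rules:

```typescript
// The only shared artifact: a narrow contract. All names here are hypothetical.
interface InboundEvent {
  id: string;
  timestamp: string;
}

interface EventValidator {
  // Returns a list of problems; an empty list means the event is valid.
  validate(event: InboundEvent): string[];
}

// Team A's rule: ids must carry their service prefix.
class BillingValidator implements EventValidator {
  validate(event: InboundEvent): string[] {
    return event.id.startsWith("bill-") ? [] : ["id must start with bill-"];
  }
}

// Team B's rule: timestamps must parse. No flag on a shared function is needed.
class TelemetryValidator implements EventValidator {
  validate(event: InboundEvent): string[] {
    return Number.isNaN(Date.parse(event.timestamp)) ? ["unparseable timestamp"] : [];
  }
}

// Callers depend on the contract only, so either team can change its rules
// without opening a PR against the other team's code.
function countInvalid(events: InboundEvent[], validator: EventValidator): number {
  return events.filter((e) => validator.validate(e).length > 0).length;
}
```

Neither team waits on a platform release window here. The interface changes rarely, and when it does, that change is an explicit, versioned event rather than a side effect of someone else’s feature flag.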

Part 9: The Monolithic Binary, Inheritance Made Physical

Inheritance abuse has a physical consequence: it makes separation architecturally impossible. When WorkerListener extends TcpDataInput, you cannot compile WorkerListener without the entire data-plane input hierarchy. You cannot deploy the leader without bundling all input connector code. When HeartbeatSender extends TcpSender, you cannot deploy a worker without bundling all output connector code. The result in one system: a single binary exceeding 200MB containing all modes, all 150 connectors, and all feature code, regardless of which node role deployed it.

| System | Architecture | Agent Size |
| --- | --- | --- |
| Monolithic inheritance system | Single binary, all modes | 200–400MB |
| Datadog Agent | Go binary, plugin-based | ~50MB |
| Fluent Bit | C binary, plugin-based | 10–30MB |
| Vector | Rust binary, feature-flagged | 30–50MB |
| Telegraf | Go binary, registry pattern | ~60MB |

The inheritance chain creates a compile-time dependency graph that makes separation physically impossible. Even if you wanted a “leader-only” binary, the import chain through inheritance pulls in every connector. Competitors use composition-based plugin architectures from the start:

// Telegraf: no class inherits from another — each plugin is independent
func init() {
    inputs.Add("kafka", func() telegraf.Input { return &KafkaInput{} })
}
// Adding a plugin: add one file. No core file modified.
// Building a minimal binary: don't compile that file.

Each module declares its activation events. The kernel loads only modules matching the current role and entitlements. A bug in the Kafka connector cannot affect S3. Adding a connector requires zero changes to core.
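
A minimal sketch of that kernel-and-registry idea in TypeScript. The ModuleRegistry, ModuleSpec, and Role names are hypothetical, and the sketch covers role-based activation only, not entitlements or activation events. Unlike the Go example, it selects modules at runtime rather than excluding them at build time:

```typescript
type Role = "leader" | "worker" | "edge";

interface Connector {
  name: string;
  start(): string;
}

// What a module declares about itself. Nothing is constructed until activation.
interface ModuleSpec {
  name: string;
  roles: Role[];           // node roles that should load this module
  create: () => Connector; // factory called only when the role matches
}

class ModuleRegistry {
  private specs = new Map<string, ModuleSpec>();

  register(spec: ModuleSpec): void {
    if (this.specs.has(spec.name)) {
      throw new Error(`duplicate module: ${spec.name}`);
    }
    this.specs.set(spec.name, spec);
  }

  // Instantiates only the modules that declared the given role.
  activate(role: Role): Connector[] {
    const active: Connector[] = [];
    this.specs.forEach((spec) => {
      if (spec.roles.includes(role)) active.push(spec.create());
    });
    return active;
  }
}
```

Adding a connector means one more register call in that connector’s own file. The registry never learns anything about Kafka or S3 beyond the Connector contract, which is what keeps the core free of per-connector changes.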


Part 10: Shared Mutable State, The Singleton Tax

In one codebase, I counted 474 singletons.

Configuration.instance().loadSystem('app');
GitMgr.instance().ignore();
AuthTokenAuthority.instance().createToken(claims);
InputMgr.instance().getInput(id);
// ... 20+ more

Every singleton creates invisible coupling: any code can access any singleton without declaring the dependency. Creation and destruction order is undefined. Tests cannot provide mocks without manipulating global state. Request-scoped, group-scoped, and process-scoped data all use the same pattern. Module-level mutable state is the same problem in a different form. One system had 30+ pipeline functions with module-level variables:

let _primaryCache = new Map();
let _numEventsReceived = 0;

exports.process = (event) => {
  const key = _expression.evalOn(event);
  _primaryCache.get(key).count++;  // global mutation in hot path
  _numEventsReceived++;
};

There’s no isolation between pipeline instances sharing the same module, and race conditions emerge the moment processing is parallelized. The fix is closure-encapsulated state — state is local to the instance, not the module:

function createProcessor(config: ProcessorConfig): Processor {
  let primaryCache = new Map<string, CacheEntry>(); // local to THIS instance

  return {
    process(event) {
      const key = config.keyExpr.evalOn(event);
      let entry = primaryCache.get(key);
      if (!entry) {
        entry = createEntry();
        primaryCache.set(key, entry); // store it so the count persists across events
      }
      entry.count++;
      return entry.count <= config.maxToAllow ? event : null;
    },
  };
}

const processor1 = createProcessor(config1);
const processor2 = createProcessor(config2); // completely independent
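
The same move works for the singletons at the top of this part: pass the dependency in through the constructor instead of reaching for instance(). A minimal sketch, with hypothetical names (TokenAuthority, SessionService, FakeTokenAuthority) that are not from the original system:

```typescript
// Hypothetical contract standing in for a token authority's public surface.
interface TokenAuthority {
  createToken(claims: Record<string, string>): string;
}

class SessionService {
  private readonly tokens: TokenAuthority;

  // The dependency is declared in the constructor instead of hidden behind instance().
  constructor(tokens: TokenAuthority) {
    this.tokens = tokens;
  }

  login(user: string): string {
    return this.tokens.createToken({ sub: user });
  }
}

// A test double needs no global state: construct it and pass it in.
class FakeTokenAuthority implements TokenAuthority {
  readonly seen: Array<Record<string, string>> = [];

  createToken(claims: Record<string, string>): string {
    this.seen.push(claims);
    return `fake-token-for-${claims.sub}`;
  }
}
```

Creation order stops being a mystery because whoever builds SessionService must already hold a TokenAuthority, and a test can hand in the fake without touching any process-wide state.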

Part 11: The New Angle, Agentic AI Thrives on Modular Code

Here’s something I’ve observed that doesn’t get written about enough: the quality of AI-generated code degrades sharply with the complexity of the codebase it works in.

Agentic coding tools like Claude Code, Cursor, Copilot in agent mode, and others are transformative for well-structured codebases. But point them at a codebase with deep inheritance hierarchies, scattered conditional logic, god classes, and shared mutable singletons, and the output becomes unreliable in predictable ways.

Why Bad Structure Amplifies AI Mistakes

  • Context window exhaustion. When a class inherits from a 7-level hierarchy, understanding what any method does requires reading across 3+ directories and thousands of lines. AI tools have a finite context window. A god class of 2,000+ lines, a shared common module with hundreds of exports, or a deep inheritance tree consumes that window before the model even reaches the code it’s supposed to change. The model ends up reasoning from partial context, and partial context produces confident-looking but wrong code.
  • Conditional logic compounds errors. When 320 files contain mode checks and 186 contain scattered feature flag conditionals, the model has to track implicit state through the entire call graph to reason correctly about any change. Every missed conditional is a latent bug. I’ve seen AI agents introduce a change that was correct for isLeaderMode() but silently wrong for isEdgeMode() because the conditional branching was too diffuse to track reliably.
  • Inheritance hierarchies hide side effects. When a model generates code for a leaf class in a deep hierarchy, it may not realize that super.init() triggers a chain of side effects through five parent classes, or that overriding getTimeout() will be called in 12 different contexts. The model sees the method signature. It doesn’t see the full inheritance contract. The result looks plausible but breaks at runtime.
  • Shared mutable state creates invisible dependencies. A model generating a new component might not know that a singleton it touches is also modified by three other components during the same request lifecycle. In a clean dependency-injected system, those dependencies are declared. In a singleton-heavy system, they’re invisible, and invisible dependencies produce bugs that are hard to reproduce and harder to explain to an AI that’s trying to help you fix them.

What AI Agents Do Well and Where Structure Helps

The pattern I keep seeing: AI agents work best when they can work on one focused thing at a time. A well-designed system gives them exactly that:

  • Small classes with single responsibilities
  • Explicit interfaces and dependency injection
  • Focused modules with clear boundaries
  • No cross-domain inheritance
  • Composition over inheritance throughout

The cleanest formulation I’ve found: the codebases that benefit most from AI-assisted development are exactly the codebases that already practice good design.


The Decision Framework

| Mechanism | Safe When | Dangerous When |
| --- | --- | --- |
| Copy-paste | 2–3 instances, likely to diverge | Never |
| Shared utility function | Pure logic, no state, no side effects | When it accumulates parameters to serve all callers |
| Shared interface | Multiple implementations of same contract | When the interface grows to satisfy one implementation |
| Composition | Reusing behavior across unrelated concerns | Almost never dangerous |
| Inheritance | True “is-a”, LSP holds, < 30% override | Different domains, constructor disabling, > 30% override |
| Common module | Stable, narrow, pure utilities | Anything with mutable behavior, ownership ambiguity |
| CRUD generator | Simple reference data | Resources with distinct business operations |
| Shared config/token | Never | Always |

Conclusion: Duplication You Can See vs. Coupling You Can’t

The drive for reusability is real. Duplicated logic is a real cost. But the engineers who warn against premature abstraction, such as Brooks, Metz, Pike, Beck, and Parnas, are pointing at something specific: coupling is invisible at creation time and expensive at change time. Duplicated code can be changed independently. The wrong abstraction propagates changes to every consumer. A shared inheritance hierarchy means a security fix in the control plane can take down the data plane. A shared token means one compromised worker compromises the fleet. A shared common module becomes the shared surface for every team’s bugs and merge conflicts.

And now there’s a new dimension to this: a well-structured, modular codebase with clear boundaries and composition over inheritance is also the codebase where AI agents work reliably. The investment in clean design pays dividends across every developer you add whether human or AI.

The safest question to ask before sharing anything: what happens when this needs to change? If the answer is “nothing else breaks,” share it. If the answer is “everything that depends on it,” think harder about whether you’re creating an abstraction or a trap. Start with duplication. Let the right abstraction reveal itself. Then share via composition, narrow interfaces, and well-bounded modules. The cost of the wrong abstraction always exceeds the cost of a little repetition.


May 26, 2026

The Complexity Trap: Why Simple, Bug-Free Systems Can Hurt Your Career

Filed under: Computing — admin @ 10:06 pm

I have worked for both large tech companies and startups. Two patterns kept showing up at every company I worked at, startup and large company alike, and both punish the engineers doing the right thing.

At startups, the pressure is entirely on shipping features. Engineers who move fast and ship constantly get rewarded. Security, observability, scalability become “future problems.” The engineers who slow down to build things properly, who push back on cutting corners, get treated as obstacles. The corners get cut anyway. When the system eventually breaks under load or gets breached, nobody connects it back to the decisions made two years earlier. The engineers who raised concerns are long gone or drowned out.

At large companies, a different trap. Ship something clean with simple design, solid implementation, few follow-up bugs and people move on. Nobody notices the problems that didn’t happen. Nobody gets promoted for the outages that never occurred. But ship something overengineered, watch it fall apart in production, spend months firefighting and suddenly you’re a hero. The tech lead who pushed patches at 2am gets noticed. Management reads the complexity as evidence of a hard problem solved. The tech lead gets promoted and moves to the next team. The engineers left behind inherit the mess.

Same outcome, different path. In both cases, the engineers who built things well are invisible. The ones who created the problems or thrived on them get ahead.


Essential vs. Accidental Complexity

In “No Silver Bullet,” reprinted in the anniversary edition of The Mythical Man-Month, Fred Brooks distinguished two kinds of complexity. Essential complexity is the irreducible difficulty built into the problem domain itself. Accidental complexity is the difficulty we add through poor abstractions, unnecessary coupling, and artificial layers. Larry Tesler’s Law of Conservation of Complexity says essential complexity can’t be eliminated, only moved. Push it out of the user interface and it lands in your middleware.

What most companies reward is the accidental kind. Many moving parts, multiple failure modes, a fleet of services with their own deployment pipelines: these look like a hard problem solved by smart engineers. A system that just works, simply and reliably, signals nothing. The people who built it must have been working on something easy. I saw this repeatedly at larger companies. Senior engineers with years of incremental, principled improvements couldn’t get promoted because their work wasn’t considered “complex enough.” The implicit rule was clear: elegance doesn’t get you promoted.


War Stories

The database migration that became a platform. At a large tech company, we needed a simple migration from one database to another but it turned into a real-time data synchronization system. Suddenly there were shadow testing components, reconciliation pipelines, anti-entropy jobs for fixing discrepancies, and runbooks for each failure mode. The project stretched from months into years. The original problem, move data from A to B, never required any of it. But the complexity generated headcount, resources, and career advancement that a clean migration would never have produced.

The microservices migration that never finished. A monolith-to-microservices transition ran so long the team ended up maintaining both systems simultaneously. The migration date kept slipping. Nobody could tell you which services were fully cut over. The codebase became a graveyard of abandoned halfway points. Years of engineering time consumed, several promotions justified. The engineers who eventually inherited it had no idea what was intentional and what was just never cleaned up.

The Erlang rewrite. At a FinTech company, a senior executive decided to rewrite an order management system from Java to Erlang, not for a specific technical reason, but because Erlang was interesting. Brooks called this the second-system effect: when engineers rewrite something they think they now understand, they pile in everything they held back the first time. The effort was far larger than anyone expected. Management abandoned it partway through. The team was left with two halves of the same system in two different languages, domain knowledge split across both.

The Go rewrite. Years later, the same executive decided to rewrite a Java financial system in Go because Go was what the industry was talking about. Years passed and the migration stalled, with some parts in Go and most still in Java. The team gave up. Meanwhile the actual urgent problems like data consistency, observability, and performance at scale went unaddressed because everyone’s attention was on the rewrite. Nobody owned the full picture of dependencies or understood the consistency guarantees. All the while, sales sold the system as low-latency with four-nines availability, but those claims rested on an illusion created by poor observability.

The postscript at that second company: when AI became the new shiny thing, the pattern played out again. Engineers who built flashy demos got promoted. The people fixing real infrastructure problems had nothing visible to show.


Conceptual Integrity Breaks Down as Organizations Grow

In the original Mythical Man-Month, Brooks argued that the most important property a system can have is conceptual integrity, one coherent design philosophy, with someone who holds the whole system in mind and says no to things that don’t fit. His prescription was a chief architect with real authority over what goes in and what stays out. That works when one person can still comprehend the system. As organizations grow and systems get divided among teams, nobody has that view anymore. Each team makes locally reasonable decisions. Accidental complexity accumulates not from individual mistakes but from the disconnect between groups who can’t see each other’s work.

Cross-cutting concerns like security, authentication, and observability are where this gets dangerous fastest. I saw one system where authentication behaved differently depending on whether you were on-premises or in the cloud, and whether you were hitting the control plane or data plane. Secrets in some places, JWTs in others, config files in some environments, environment variables in others, a wall of conditional logic tying it together. No single person understood the whole thing. That mess led to a significant security breach and customer churn. Nobody designed it. It grew, one locally reasonable decision at a time.


Two Different Failure Modes

Startups and large companies both get this wrong, but for opposite reasons.

Startups are under pressure to ship customer-facing features. Security, observability, performance, operational burden become “future problems.” Sometimes that’s the right call. A startup that dies building the perfect architecture ships nothing. But the technical debt from ignored non-functionals doesn’t disappear. It accumulates, and it usually arrives all at once right when the company is trying to scale. That’s the worst possible time to deal with it.

Large companies have the opposite problem. The incentive structure rewards visible complexity. Tech leads propose ambitious architectures, staff up around them, ship something complicated, and move to the next team before the consequences mature. The engineers who inherit the system didn’t choose the design, can’t fully explain it, and can’t safely simplify it because they don’t understand what each piece is actually doing.

In both cases, the people who make the architectural decisions aren’t around to live with them. That gap between decision and consequence is the core of the problem.


The Goldilocks Principle

The approach that actually works is simpler than it sounds: start with the least complex architecture that handles the real requirements. Add complexity only when something forces you to.

Not simple for its own sake: if the domain genuinely requires distributed coordination, the design should say so. But the default should be: prove the complexity is necessary before building it. “This is how I’ve seen it done at bigger companies” and “this technology is interesting” are not justifications. Neither is designing for scale you don’t have. I’ve watched teams build for ten million users when they had ten thousand, then spend two years maintaining infrastructure that served no real requirement.

Vertical slices enforce this discipline. When you ship thin, end-to-end cuts of real functionality that a user can actually touch, you find out fast whether your design is right. The feedback loop is short. A wrong assumption costs a week, not six months. You can correct before the mistake becomes load-bearing.


AI Accelerates This Problem

With tools like Claude Code and Cursor, the implementation bottleneck is largely gone. A team using AI assistants can build a distributed system with five services in the time it used to take to build one. That’s progress if the design is right. If the incentive structure still rewards accidental complexity, AI just produces it faster.

In When Copying Kills Innovation: My Journey Through Software’s Cargo Cult Problem, I described cargo-cult behavior such as adding components because they look sophisticated. That behavior happens at higher velocity now. An AI agent given a vague prompt and no design constraints defaults to patterns common in its training data. That means microservices when a monolith would do, event buses when a direct call would do, five abstractions where two would do.

As I wrote in AI Writes Code. You Own the Design., the thinking parts, the what and the why, can’t be delegated to an agent. AI handles the how. Engineers who can identify essential complexity, strip the accidental kind, and hold a design together are more valuable now than before. But only if the organization’s reward structure reflects that.


How Do You Fix the Reward Structure?

I don’t have a clean answer. But here’s where the levers are.

  • Reward outcomes, not artifacts. Most promotion processes credit visible artifacts: the design doc for a complex system, the heroic incident response, the fleet of services owned. The outcomes that actually matter (a system that stayed up for two years, a migration that finished in six weeks, a design that five new engineers understood on day one) are harder to see and usually go uncredited. Engineering leaders have to explicitly define what good engineering looks like and measure it over time horizons long enough to see consequences.
  • Make accountability follow decisions. Connect tech leads to the consequences of their architectural choices twelve to eighteen months later. This is not punishment, since designs also fail for unforeseeable reasons. But an engineer who never sees what their decisions cost never updates their model. Right now the feedback loop doesn’t exist for most people who make these calls.
  • Credit the “no.” The engineers who prevent bad architectures from being built are the hardest to recognize. The bad system was never built, so there’s nothing to point to. If you want more of this behavior, name it explicitly and credit it explicitly. Otherwise the rational move for any ambitious engineer is to propose the complex thing and let someone else clean it up.
  • Add a simplicity lens to design reviews. Most design reviews ask: will this work? Fewer ask: is this more complex than it needs to be? Formally asking “what would we remove without losing essential functionality?” changes the conversation. The burden of proof shifts to adding a component, not removing one.

The Conversation Worth Having

Brooks wrote that conceptual integrity is the most important consideration in system design. What the book doesn’t address is that most organizations are structured to undermine it: they reward the engineers who add complexity and move them on before they face the consequences. The engineers who hold the line against unnecessary moving parts, who ship systems that work quietly for years, who say “we don’t need this” and mean it, are doing some of the hardest work in software. In most companies, they’re not the ones getting promoted.

With AI accelerating the implementation layer, the judgment required to distinguish essential from accidental complexity matters more than it ever has. If the reward structure doesn’t change to reflect that, we’ll just build the wrong things faster.



May 19, 2026

AI Writes Code. You Own the Design. Here’s How to Keep It That Way

Filed under: Computing,Methodologies — admin @ 9:48 pm

The Eternal Quest to Make Coding Simpler

I wrote my first program in BASIC on an Atari in the 1980s with line numbers, GOTOs, and no debugger. Turbo Pascal changed everything: integrated editing, instant compilation, step-through debugging. Then Borland C++, then Visual Basic, then Eclipse, then IntelliJ. This pattern, where a new tool arrives, productivity jumps, and complexity catches up, has repeated itself every few years across my entire three-decade career.

In the early 1990s, 4GL tools promised to eliminate coding entirely. dBase, FoxPro, PowerBuilder — the pitch was always the same: “Business users can build their own applications.” Simple CRUD apps were easy. Real systems with business logic, error handling, and concurrent users turned out harder than writing code from scratch. UML consumed the next decade. I spent years with Rational Rose doing forward and backward engineering from class diagrams. The generated code was rigid scaffolding that fought you. Diagrams drifted from reality within weeks, because maintaining two representations of the same truth is inherently unsustainable.

The lesson I keep relearning: every attempt to separate “what to build” from “how to build it” through tooling alone produces rigid, brittle systems. The gap between specification and implementation is a thinking problem. Tools that hide it make things worse.


The AI Inflection Point

Around 2020, I started using GitHub Copilot for autocomplete. ChatGPT and Claude helped with isolated problems such as boilerplate and algorithm refreshers. Useful but incremental. Then Claude Code arrived in early 2025, and everything changed. I’ve used it for 100% of my coding for over a year, not as autocomplete but as a full development partner: architecture, implementation, testing, debugging, deployment. The productivity gains are real. The failure modes are real too. Amazon AWS teams learned this the hard way: AI-generated code that looked right, passed superficial review, then caused production incidents. Their response was to tighten review policies significantly. I’ve seen the same pattern repeatedly: AI ships code that introduces subtle bugs in unfamiliar codebases, silently violates domain invariants, or creates architectural inconsistencies that compound over weeks. The problem isn’t that AI writes bad code. It writes locally correct code that doesn’t fit the bigger picture.


The Memento Problem

People compare AI coding agents to interns. That analogy breaks in one critical way: AI agents suffer from anterograde memory loss. Like the protagonist in Memento, every session starts from zero. An intern who made a mistake yesterday remembers it today. They build mental models of your codebase, internalize conventions through repetition. An AI agent? Session ends, memory gone. Tomorrow it will make the exact same architectural mistake, violate the same naming convention, choose the same wrong abstraction. It doesn’t learn from correction, it only learns from context provided in each session.

This is why rules, conventions, and structured knowledge aren’t optional nice-to-haves for AI-assisted development. They’re the equivalent of Leonard’s tattoos and photographs: the external memory system that makes coherent action possible despite the inability to form new long-term memories. I built these skills because I got tired of repeating the same corrections. Every session I found myself saying “no, we use Result types here, not exceptions” or “no, that should be a sum type” or “no, you need an idempotency token on that create endpoint.” The skills encode these corrections permanently so I stop repeating myself.

The Outsourcing Parallel

Every offshore engagement I’ve run hit the same wall: limited overlap hours, different definitions of ‘done,’ and a gap between what I envisioned and what arrived. Formal process wasn’t optional; it was the only thing that worked. The teams that succeeded had detailed specs, explicit acceptance criteria, structured handoffs, and review gates. The teams that failed relied on “they’ll figure it out” and got back code that met the requirements only on the surface. Taken to the extreme, that need for formality spawned CMM, RUP, and Six Sigma, frameworks so heavy that the documentation cost exceeded its value. Agile won because it recognized that lightweight, iterative feedback loops beat heavyweight upfront specification for teams with high-bandwidth communication.

AI agents resemble outsourced teams more than co-located colleagues. They have a narrow context window — like limited overlap hours across time zones. They lack shared understanding of your codebase. They produce locally correct work that misses the bigger picture. The lesson from outsourcing holds: formal process works when communication bandwidth is constrained. These skills apply that lesson with minimum ceremony — just enough structure to preserve conceptual integrity across sessions, without recreating the documentation burden that killed RUP.

Production agent systems need tiered memory: short-term (current session), medium-term (project conventions), and long-term (organizational knowledge). These skills are the middle tier, project-level knowledge that persists across sessions without requiring permanent documentation. They’re the bridge between ephemeral conversation and hard-coded policy.


Conceptual Integrity in the Age of AI

Fred Brooks wrote this in The Mythical Man-Month (1975). Martin Fowler recently reminded us it’s never been more relevant:

“I will contend that conceptual integrity is the most important consideration in system design. It is better to have a system omit certain anomalous features and improvements, but to reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas.”

When an AI agent generates code, it produces locally correct solutions: the function works, the test passes, the API responds. But without conceptual integrity, each generated piece reflects a different design philosophy. One module uses exceptions, another uses Result types. One endpoint follows REST conventions, another doesn’t. One service uses the outbox pattern for events, another dual-writes to the database and message queue. Over time, the codebase becomes exactly what Brooks feared: “many good but independent and uncoordinated ideas.”

Code serves two purposes: machine instructions and conceptual modeling. AI commoditizes the first. The second, the model that captures how your domain actually works, remains yours to own. Generate code 10x faster without protecting that model, and you get systems 2x harder to maintain. Spec-driven development frameworks like OpenSpec and Spec-Kit push toward treating prompts as first-class delivery artifacts, versioned, reviewed, maintained alongside code. That’s the gap these skills fill. They encode conceptual integrity, design philosophy, conventions, quality standards into reusable artifacts that survive across sessions.


What You Own vs. What AI Owns

“We adopted AI coding but it hasn’t increased revenue.” Of course not. AI doesn’t solve what to build, it accelerates how to build it. You still need product/market fit, customer feedback, and domain expertise. More importantly: when AI causes a security incident or production outage, you can’t fire it. You’re accountable. Here’s the ownership boundary I enforce:

| You Own | AI Accelerates |
| --- | --- |
| What to build (product vision) | How to build it (implementation) |
| Why it matters (business context) | Boilerplate and mechanical translation |
| Quality standards and conventions | Applying those standards consistently |
| Architecture decisions | Exploring design alternatives quickly |
| Security posture | Checking against known vulnerability patterns |
| Production accountability | Monitoring, alerting, runbook generation |
| Domain knowledge | Translating that knowledge into code |

The skills encode this boundary explicitly: you drive the what and why; AI executes the how within guardrails you define. Every skill in the set reinforces this split.


Why Formalized SDLC Works Better with AI

I’ve worked in both worlds: big-company SDLC with architecture reviews, security reviews, production readiness checklists and startups where you discuss an idea over coffee and ship by afternoon. AI works better with the formalized approach. The reason is the same one that sank outsourcing arrangements with vague requirements: if you can’t state precisely what you want, the other party fills gaps with assumptions. Here’s why structure helps specifically with AI:

  • Structure gives AI context. A well-written PRD tells the agent why it’s building something, what constraints matter, which edge cases to handle. Without this, AI fills gaps with assumptions from training data, which may not match your domain.
  • Checkpoints catch drift early. When AI generates 800 lines in one session, reviewing it as a monolithic diff is overwhelming. I learned this the hard way. Now I break work into smaller tasks and enforce checkpoints every 5 files where build and test must pass before proceeding. Small, verified increments compound into reliable systems.
  • Conventions reduce error surface. When you explicitly state “use Result types for errors, never exceptions” and “all IDs are ULIDs, never UUIDs” then AI follows them. Without explicit conventions, it defaults to whatever was most common in training data, which varies wildly by context.
  • Smaller increments compound. AI excels at small, well-defined tasks with clear acceptance criteria. This isn’t new wisdom: vertical slicing and thin end-to-end increments have been SDLC best practice for decades. What’s good for human developers turns out to be good for AI too.
  • Sloppy codebases amplify AI mistakes. In clean, well-structured code with clear module boundaries, AI makes fewer errors. It can hold the relevant context. In sprawling, inconsistent codebases with 2000-line files and mixed conventions, AI hallucinates patterns, mixes styles, and creates subtle inconsistencies. Well-structured code isn’t just readable for humans, it’s how AI holds context without drifting.

The Skills: A Structured SDLC for AI-Assisted Development

Here’s the full lifecycle, with each phase mapped to a skill and the key lessons that shaped it:


Phase 1: Requirements Refinement (/ygs-refine-prd)

I’ve watched AI build the wrong thing fast more times than I can count. The root cause is always the same: vague requirements. When I tell an agent “build a notification system,” it picks a design based on training data patterns. When I tell it “build a notification system that MUST deliver within 500ms for P0 alerts, SHOULD batch P2 notifications into hourly digests, and MAY support user-defined routing rules” then it builds something specific and testable. The refine-prd skill forces this precision through structured questioning. It interviews me relentlessly: one question at a time, providing its recommended answer, waiting for my feedback before continuing. It challenges vague language: “fast means what: 100ms? 1 second? Faster than the current system?” It pushes me to define concrete scenarios with Given/When/Then acceptance criteria borrowed from OpenSpec.

Key lessons encoded:

  • RFC 2119 keywords force commitment. Labeling requirements as MUST (P0), SHOULD (P1), or MAY (P2) prevents the “everything is critical” trap. I’ve seen projects fail because nobody ranked requirements, so the team optimized for P2 features while P0 requirements remained unmet.
  • Capabilities mapping reveals brownfield complexity. Categorizing changes as New/Modified/Removed surfaces the reality that most “new features” actually modify existing behavior, which is always harder than greenfield and needs different estimation.
  • Non-goals prevent scope creep. Explicitly stating what you will NOT build is as important as defining what you will. Without non-goals, AI treats every tangent as in-scope.

This is where you own the what. The AI sharpens your thinking, but the product decisions stay yours.


Phase 2: Technical Design (/ygs-refine-trd)

Without a technical design document, AI makes architectural decisions implicitly, and they’re often wrong. I watched an agent choose microservices for a problem that needed a single process with good module boundaries. Another time it introduced an event bus between components that were always co-located and synchronous. Both were “correct” patterns applied to the wrong contexts. The refine-trd skill challenges my technical approach through structured questioning, then produces a design document with explicit trade-off analysis and requirements traceability: every design decision maps back to a PRD requirement with a rationale. For larger efforts spanning multiple components, I use a comprehensive design doc template that I previously shared on my blog. It covers the full lifecycle, from problem statement through architecture, alternatives analysis, non-functional requirements, and rollout plan, with inline ADRs recording every key decision with its rationale and reversibility.

Making Invalid States Impossible

The most powerful design tool isn’t testing, it’s the type system. When I rebuilt a Rust observability pipeline around algebraic data types and explicit state machines, entire bug categories became impossible to write:

  • Sum types enumerate valid states explicitly. I can’t accidentally process a Pending message as if it were Confirmed because the compiler won’t let me.
  • Typestate pattern encodes valid transitions in the type system. A Draft document can move to Review or Deleted, but never directly to Published. Invalid sequences are compile errors, not runtime bugs.
  • Parse, don’t validate transforms unstructured input at boundaries into strongly-typed domain objects. Once parsed, code trusts the types internally without defensive null checks scattered through business logic.
  • Errors as values using Result<T, E> types cannot be silently ignored. Compare this to exceptions that propagate invisibly through 14 stack frames before someone catches them with an empty catch block.
  • Functional core, imperative shell separates pure domain logic from I/O orchestration. The domain code is trivially testable because it has no side effects. The shell is thin and mechanical.

These principles matter enormously for AI-generated code because the compiler becomes your reviewer. When AI generates code within a well-typed system, category errors that would slip through human review become impossible to express.
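
To make the sum-type and typestate ideas concrete, here is a minimal sketch in Python rather than Rust. The class names (Draft, Review, Published, Deleted) and the label helper are hypothetical, chosen to mirror the document example above. Python only catches the invalid transition at runtime or through a static type checker, so this shows the shape of the technique, not the compile-time guarantee a Rust compiler gives.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Draft:
    body: str

    # A draft can only move to Review or Deleted.
    def submit_for_review(self) -> "Review":
        return Review(self.body)

    def delete(self) -> "Deleted":
        return Deleted()


@dataclass(frozen=True)
class Review:
    body: str

    # Publishing is only reachable from Review.
    def publish(self) -> "Published":
        return Published(self.body)


@dataclass(frozen=True)
class Published:
    body: str


@dataclass(frozen=True)
class Deleted:
    pass


# Sum type: a document is exactly one of these states.
Document = Union[Draft, Review, Published, Deleted]


def label(doc: Document) -> str:
    """Handle every variant explicitly; an unknown state is an error."""
    if isinstance(doc, Draft):
        return "draft"
    if isinstance(doc, Review):
        return "in review"
    if isinstance(doc, Published):
        return "published"
    if isinstance(doc, Deleted):
        return "deleted"
    raise TypeError(f"unhandled state: {type(doc).__name__}")


published = Draft("hello").submit_for_review().publish()
```

Draft has no publish method at all, so skipping review is not something calling code can express; a type checker flags the attempt before the code runs.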

Deep Modules Over Shallow

AI defaults to shallow modules: lots of small classes, each delegating to the next without adding value. John Ousterhout’s A Philosophy of Software Design argues for the opposite: modules with small interfaces and rich implementations. I’ve reviewed too many codebases where every class has an interface, every interface has one implementation, and understanding a feature requires bouncing through 15 files. The deletion test cuts through this: imagine deleting the module. If complexity vanishes, it was a pass-through adding nothing but indirection. If complexity reappears across N callers, it was earning its keep. I apply this ruthlessly now. One adapter means a hypothetical seam. Two adapters means a real one. Don’t build seams speculatively.

Cognitive Load as Design Constraint

Three constraints keep AI-generated functions reviewable:

  • Methods stay under 24 lines. Working memory holds 4-7 chunks; code exceeding this becomes unmanageable regardless of how “clean” it looks.
  • No more than 7 concepts in a section. If I need a comment to explain what a block does, it should be a function with that name instead.
  • Fractal decomposition. Each level hides details while allowing drill-down. The system is comprehensible at every zoom level.

AI agents benefit from these constraints more than humans do. A function under 24 lines fits entirely in the context window. A deep module with a small interface can be understood without reading its implementation. Clean structure gives AI less opportunity to hallucinate.


Phase 3: Architecture (/ygs-refine-architecture)

For changes spanning multiple components, I use architecture refinement to capture system-level decisions that no single PR review can validate. The skill interviews me about module boundaries, seam placement, data flow, and failure modes, challenging shallow designs and pushing for depth. Three hard lessons shape every distributed system I design:

  • Transaction Boundaries Drive Architecture: I learned this lesson the expensive way: atomicity requirements dictate service boundaries, not the other way around. Teams that draw service boundaries first and then try to maintain consistency across them end up with distributed transactions, eventual consistency bugs, and data loss scenarios that take months to resolve.
  • The dual-write problem is the #1 source of data inconsistency I’ve encountered in microservice architectures. Writing to a database and publishing an event in separate operations means either can succeed while the other fails — leaving your system in an inconsistent state. The outbox pattern solves this: write the event to an outbox table in the same database transaction, then relay it asynchronously. Simple, reliable, non-negotiable for any system I design now.
  • For operations spanning multiple services, SAGA with explicit compensation replaces distributed transactions. Each step has a defined undo operation. When step 4 of 6 fails, steps 3, 2, and 1 execute their compensating actions. The key insight: design compensation logic before the happy path, because it’s always harder than you think.
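
The outbox pattern described above can be sketched with nothing but the standard library. This is an illustration under assumptions, not a production relay: the table names, the order.placed topic, and the publish callback are all hypothetical, and SQLite stands in for whatever database the service actually uses.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "topic TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, order_id, total_cents):
    # 'with conn' commits both inserts together, or rolls both back on error,
    # so the business row and the event can never disagree.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed",
             json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )


def relay_outbox(conn, publish):
    """Publish pending events in insertion order, then mark each as sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the relay publishes before it marks a row as sent, a crash between the two steps leads to a repeat delivery rather than a lost event, so consumers need to tolerate duplicates.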

Domain-driven design adds three more constraints that AI consistently gets wrong without explicit guidance:

  • Bounded contexts draw ownership lines. Each microservice owns one context: one set of domain concepts with one consistent vocabulary. Cross-context communication happens through well-defined events, not shared databases.
  • Ubiquitous language prevents the translation bugs I’ve seen kill projects. When the code says Order but the domain expert means Reservation, every conversation introduces subtle misunderstandings that compound into wrong implementations.
  • Hexagonal architecture (ports and adapters) means dependencies point inward. Domain logic knows nothing about HTTP, databases, or message queues. This isn’t academic purity, it’s what makes the system testable without spinning up infrastructure.

Fault Tolerance Is Architecture, Not Code

Fault tolerance is an architecture decision, not an implementation detail. Bolt it on after the fact and you get a system that fails catastrophically under load:

  • Circuit breakers prevent cascade failures. When a downstream service is unhealthy, stop sending it requests. I’ve seen a single slow database query bring down six upstream services because nobody implemented this.
  • Retry with jitter uses exponential backoff plus randomization. Without jitter, all clients retry at the same moment after an outage resolves, creating a thundering herd that triggers another outage.
  • Bulkhead isolation gives each dependency its own thread/connection pool. A slow payment provider shouldn’t exhaust your entire connection pool and take down order processing.
  • Graceful degradation means deciding in advance what to show users when a dependency fails. Not an error page, a degraded experience.
  • No hard startup dependencies. Services start even when dependencies are unavailable. They serve degraded responses and recover automatically when dependencies come back.
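
Two of these patterns are small enough to sketch directly. The following is a simplified illustration, assuming a "full jitter" backoff and a failure-count circuit breaker; the thresholds, parameter names, and the injected clock and rng hooks are hypothetical choices made so the behavior is easy to test.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Exponential backoff with full jitter.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)),
    so clients that failed together do not retry together.
    """
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller checks allow() before each downstream request, reports the outcome with record_success() or record_failure(), and sleeps for backoff_delay(attempt) between retries.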

Phase 4: Estimation (/ygs-estimate)

Management wants dates. Engineers want to build. This tension has existed since the first software project went over schedule. I wrote about estimation practices years ago, and the core lessons haven’t changed: estimates are not commitments, decomposition reduces error, and teams consistently underestimate because they scope only the coding work. The estimate skill bridges the gap between “we need a date” and “it’ll be done when it’s done” with structured complexity-based estimation:

  • T-shirt sizing at the feature level. Before diving into details, I size each major capability as XS through XL based on complexity, uncertainty, and integration surface. An XL (4-8 weeks, architectural change) signals that the feature itself needs decomposition before meaningful estimation is possible. Uncertainty multipliers compound: new technology × external dependency = 2x your initial guess.
  • Story points at the task level. I use the Fibonacci sequence (1, 2, 3, 5, 8, 13, 21), with planning poker when multiple people are involved. The power of Fibonacci isn’t magical; it’s that the gaps between numbers grow, forcing you to acknowledge increasing uncertainty rather than pretending you can distinguish between “7 days” and “8 days” of work.
  • Three-point estimation for commitments:
Expected = (Best + 4×MostLikely + Worst) / 6

Present ranges, not single numbers. “3-4 weeks with a tail risk of 6 weeks if the external API integration is harder than expected” gives management real information to plan around.
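
As a worked example of the three-point formula, here is a small sketch. The input numbers are made up for illustration, and the spread value (worst minus best, divided by six) is the usual PERT companion to the formula, added here as an assumption rather than something the skill is known to compute.

```python
def three_point(best, most_likely, worst):
    """Return (expected, spread) for a three-point estimate."""
    expected = (best + 4 * most_likely + worst) / 6
    # Common PERT approximation of the standard deviation.
    spread = (worst - best) / 6
    return expected, spread


# Hypothetical estimate in weeks: best 3, most likely 3.5, worst 6.
expected, spread = three_point(3, 3.5, 6)
print(f"expected {expected:.1f} weeks, give or take {spread:.1f}")
```

With these inputs the expected value is 23/6, or about 3.8 weeks. The worst case pulls the number above the most likely 3.5 without dominating it.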

Key lesson: capacity is never 100%. I’ve seen teams plan sprints assuming full developer availability and then wonder why they deliver 60%. The reality:

Category | Typical Budget
Feature work | 50-60%
KTLO (maintenance, tech debt, bug fixes) | 20-30%
On-call / incidents | 5-15%
Vacation / holidays / sick | 10-15%
Meetings / reviews / planning | 5-10%

Some teams I’ve worked with budget 40% for KTLO. If your system is old and fragile, that’s not pessimism, that’s realism. The skill asks the user what their team’s actual allocation is, because it varies enormously.
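
The capacity arithmetic is simple enough to sketch. The team size, sprint length, and overhead percentages below are hypothetical inputs picked from within the ranges in the table, not recommended values.

```python
def feature_capacity(developers, sprint_days, overhead_percent):
    """Person-days left for feature work after the listed overheads."""
    overhead = sum(overhead_percent.values())
    if not 0 <= overhead <= 100:
        raise ValueError("overhead must total between 0 and 100 percent")
    total_person_days = developers * sprint_days
    return total_person_days * (100 - overhead) / 100


# Hypothetical team: 5 developers, a 10-day sprint.
overhead = {"ktlo": 25, "on_call": 10, "time_off": 10, "meetings": 5}
available = feature_capacity(developers=5, sprint_days=10, overhead_percent=overhead)
print(available)  # 25.0 person-days out of a nominal 50
```

Planning against the nominal 50 person-days instead of the 25 actually available is how a team ends up delivering half of what it committed to.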

The most common estimation failure: forgetting everything that isn’t “writing code.” Engineers estimate the implementation and forget testing (20-40% of the work), deployment changes (IaC, Kubernetes manifests, feature flags), observability (metrics, dashboards, alerts, tracing), on-call runbooks and troubleshooting guides, data migration scripts, security review fixes, and documentation. My rule of thumb: if the estimate only covers writing code, double it to account for everything needed to ship to production safely.


Phase 5: Spike (/ygs-spike) — When You Don’t Know Enough

Not every feature goes straight from design to implementation. Some involve risky unknowns like a new database, an unfamiliar integration, an algorithm you’ve never tried at scale. The spike skill exists for these moments: a time-boxed experiment to answer a specific question before committing to a full design. The spike lives on a spike/ or fafo/ branch, deliberately relaxes production standards, and produces exactly one artifact: a findings doc with a clear verdict. What spikes are for:

  • Performance validation: “Can our schema handle 10K writes/sec?” Write the hot path, add a benchmark harness, measure.
  • Integration feasibility: “Does this library work with our auth stack?” Wire two systems together, make one end-to-end call work. Done.
  • Algorithm proof: “Is this fast enough for real-time?” Implement the core loop, feed it representative data, measure latency at p99.

The spike skill enforces this discipline: define hypothesis up front, scope what’s allowed, build the minimum experiment, record findings with evidence, and recommend next steps. If the spike confirms feasibility, you proceed to full design with confidence. If it refutes your hypothesis, you’ve saved weeks of wasted implementation.


Phase 6: Work Breakdown Structure (/ygs-wbs)

AI excels at small, well-defined tasks. It struggles with large, ambiguous ones. The WBS skill hierarchically decomposes deliverables into vertical slices, thin end-to-end cuts through all layers, each independently demoable and verifiable. Like a traditional Work Breakdown Structure, it divides complex projects into manageable components at three levels: deliverables (major features), work packages (independently shippable units), and tasks (atomic implementation steps).

Key lessons from years of estimation and delivery:

  • Vertical over horizontal. Each task cuts through UI, API, and database, not “build all the models, then all the APIs, then all the UI.” Horizontal slicing delays feedback. You don’t know if the feature works until the last layer is complete. Vertical slicing gives you a working thin slice from day one.
  • Dependency ordering prevents blocked work. Data model tasks before API tasks before UI tasks. Shared utilities before their consumers. I sequence tasks so each one builds on verified, tested foundations.
  • Scope signals trigger splits. When I see “and also…” or “and verify…” in a task description, that’s two tasks disguised as one. Exception: causally dependent steps (create migration + update model + update handlers for same entity) stay together.
  • Size drives ceremony. Small tasks (1-3 files, <300 lines) get standard workflow. Large tasks (8+ files, 800+ lines) get flagged immediately for splitting. I’ve learned that tasks AI implements in one session should stay under 300 lines of change; beyond that, coherence degrades.
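
Dependency ordering and size checks are both mechanical, so they are easy to sketch. The task names below are hypothetical, and treating either threshold (8+ files or 800+ lines) as enough to mark a task large, with everything in between called medium, is an assumption of this example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "orders-migration": set(),
    "shared-validation": set(),
    "orders-model": {"orders-migration"},
    "orders-api": {"orders-model", "shared-validation"},
    "orders-ui": {"orders-api"},
}

# static_order() yields prerequisites before their dependents and raises
# graphlib.CycleError if the plan contains a circular dependency.
build_order = list(TopologicalSorter(dependencies).static_order())


def task_size(files_changed, lines_changed):
    """Classify a task using the thresholds from the list above."""
    if files_changed >= 8 or lines_changed >= 800:
        return "large"   # flag for splitting
    if files_changed <= 3 and lines_changed < 300:
        return "small"   # standard workflow
    return "medium"      # assumed middle tier, not named in the post
```

The sorter guarantees only that each task appears after its prerequisites; independent tasks such as the migration and the shared validation utility may come out in either order.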

Phase 7: Implementation (/ygs-implement)

Without guardrails, AI will modify 30 files in one session, introduce subtle coupling between components that should be independent, and produce a diff too large to review meaningfully. I’ve had sessions where the agent touched 12 files to implement a feature that should have required 4, each extra file an “improvement” that wasn’t asked for. The implement skill enforces discipline:

Scope guardrails I enforce:

  • 3+ unplanned files -> STOP. The agent reports the deviation and asks me to confirm expanded scope. This single rule has prevented more architectural drift than any other practice.
  • Checkpoint every 5 files. Build and tests must pass before proceeding. Catches regressions early when they’re cheap to fix.
  • Deviation tracking. When implementation differs from design: “Design said X, did Y because Z.” This documentation prevents the next session from reverting the deviation or making it worse.
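
The first two guardrails reduce to a comparison between the planned file list and the files actually touched. This is a hypothetical sketch of that check; the ScopeReport type and check_scope function are invented for illustration and are not the skill’s actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeReport:
    unplanned: tuple       # touched files that were not in the plan
    must_stop: bool        # True once 3+ unplanned files are touched
    checkpoint_due: bool   # True every 5 touched files


def check_scope(planned, touched, stop_at=3, checkpoint_every=5):
    """Compare touched files (in order, no duplicates) against the plan."""
    planned_set = set(planned)
    unplanned = tuple(path for path in touched if path not in planned_set)
    must_stop = len(unplanned) >= stop_at
    checkpoint_due = len(touched) > 0 and len(touched) % checkpoint_every == 0
    return ScopeReport(unplanned, must_stop, checkpoint_due)


report = check_scope(
    planned=["api/orders.py", "models/order.py"],
    touched=["api/orders.py", "utils/dates.py", "utils/log.py",
             "config/app.py", "models/order.py"],
)
```

Here report.must_stop is true because three files outside the plan were touched, and report.checkpoint_due is true because the fifth file was just modified.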

Three testing rules I enforce regardless of who wrote the code:

  • Stubs only at 3rd-party/OS boundaries: HTTP clients, system clocks, filesystem, randomness. Everything else uses real implementations.
  • If you can’t test without mocking internal code, the design is wrong. This is a litmus test I apply relentlessly. Mocking internals means your modules are coupled. Fix the coupling, don’t paper over it with mocks.
  • Test the public contract, not implementation details. Tests that verify internal method calls break every refactor. Tests that verify external behavior survive decades.
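
Stubbing only at the OS boundary usually means injecting that boundary. As a minimal sketch, here the system clock is the one thing a test replaces; the Session class is a hypothetical example, and everything else in it is the real implementation.

```python
from datetime import datetime, timedelta, timezone


def system_clock():
    return datetime.now(timezone.utc)


class Session:
    """Expires after a fixed time-to-live, measured by an injected clock."""

    def __init__(self, ttl, clock=system_clock):
        self._clock = clock
        self._expires_at = clock() + ttl

    def is_expired(self):
        return self._clock() >= self._expires_at


# In a test, only the clock is replaced; Session's own logic runs for real.
fake_now = [datetime(2026, 1, 1, tzinfo=timezone.utc)]
session = Session(timedelta(minutes=30), clock=lambda: fake_now[0])
fresh = session.is_expired()
fake_now[0] += timedelta(minutes=30)
stale = session.is_expired()
```

The test exercises the public contract (is_expired) and never inspects _expires_at, so it survives a refactor of how expiry is stored.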

Four tidying rules that prevent AI from refactoring itself into bugs:

  • Tidy first but only when it makes the next change cheaper. I’ve watched AI eagerly refactor things that don’t need refactoring, burning context and introducing bugs. The rule: cost(tidy) + cost(change after tidy) < cost(change without tidy). Otherwise, leave it.
  • Guard clauses over nested conditionals. Early returns flatten code and make the happy path obvious.
  • One pile first. Before splitting scattered code into elegant modules, consolidate it in one place. Understand the full picture before decomposing. AI tends to decompose prematurely, creating abstractions before understanding what varies.
  • Tidy in separate commits from behavior changes. Never mix formatting with functionality. It makes review impossible and rollback dangerous.

Phase 8: Code Review (/ygs-code-review)

AI-generated code passes syntax checks and basic tests but can contain subtle logic errors, security holes, and design violations that only emerge under careful, structured review. I don’t trust casual “looks good” scanning; instead, I use a two-pass approach with explicit criteria.

Pass 1: Critical issues (block merge):

  • Logic errors. Off-by-one bugs, null handling, race conditions (TOCTOU, check-then-act, find-or-create without locks).
  • Security holes. Injection (SQL, XSS, SSRF, path traversal), hardcoded secrets, missing auth checks.
  • Data loss. Destructive operations without confirmation, missing transactions around multi-step mutations.
  • Error swallowing. Empty catch blocks, ignored return values, Result types discarded with _ = or bypassed with .unwrap() instead of being handled.
  • Partial failure. What if the operation half-succeeds? I’ve seen update endpoints that modify 3 records in sequence: if #2 fails, #1 is already committed and the system is in an inconsistent state.
  • Enum completeness. New enum values must be traced through ALL consumers. One unhandled match arm in a downstream service can cause silent data loss.

Pass 2: Design and maintainability:

  • Immutability and state. Is mutable state minimized? Are invalid states representable? Should this use an explicit state machine instead of boolean flags?
  • Type safety. Sum types for variants? Newtypes for semantically different IDs (UserId vs OrderId)? Parse-don’t-validate at boundaries?
  • Command-Query Separation. Methods either change state OR return data, never both. Violations make code unpredictable and untestable.
  • Interface design. Deep modules with small interfaces? Or shallow pass-throughs adding indirection without value?
  • Performance. N+1 queries hiding inside loops, missing database indexes for common query patterns, O(n^2) operations on collections that grow.
  • Proportionality. Is the complexity justified by data? I’ve reviewed PRs that introduced three new abstractions for a feature used by 12 people. Proportionality means the solution matches the problem’s actual scale.

Severity classification:

  • MUST — Blocks merge (correctness, security, data loss)
  • SHOULD — Strong recommendation (design, performance, testability)
  • MAY — Suggestion (naming, style, minor optimization)

You don’t get the same understanding from reviewing as from writing; that tension is real. But structured multi-pass review with explicit criteria gets you closer than rubber-stamping ever could.


Phase 9: Security Review (/ygs-security-review)

AI doesn’t think adversarially. It generates happy-path code that works when used as intended. Attackers don’t use things as intended. I’ve seen AI-generated endpoints that validated input on the frontend but accepted anything on the backend, that logged full request bodies including passwords, that built SQL queries with string interpolation “because the ORM was too slow.” The security review skill forces red-team thinking for every changed endpoint.

Lessons from my previous post on building secure microservices:

  • Injection vectors. I check for SQL injection (raw queries with interpolation), command injection (exec/system with user input), template injection (SSTI), XSS (unescaped user content in responses), SSRF (user-controlled URLs in server requests), and path traversal (user input in file paths).
  • Authentication & authorization. Missing auth checks on new endpoints (AI doesn’t always copy the middleware pattern). Broken access control where user A can access user B’s resources by changing an ID in the URL. Privilege escalation through parameter manipulation.
  • Data exposure. Sensitive data in logs (I’ve caught AI logging full request bodies including auth tokens). Secrets in error messages returned to clients. Debug information in production responses.
  • Supply chain. Vulnerable or unpinned dependencies. Deserialization of untrusted data (pickle, YAML.load, eval). AI loves pulling in libraries without checking their security posture.
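
The SQL injection item is the easiest of these to demonstrate. The sketch below uses an in-memory SQLite database with a hypothetical users table, and contrasts string interpolation with a parameterized query for the same lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(conn, name):
    # Vulnerable: user input becomes part of the SQL text itself.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver passes the value separately from the SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # matches every row
blocked = find_user_safe(conn, payload)    # matches nothing
```

The interpolated version turns the payload into an always-true condition and returns both users. The parameterized version treats the payload as a literal name and finds no match.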

Red-team perspective: I ask these questions for every endpoint:

  • What happens if someone sends 10,000 requests per second? (Rate limiting)
  • What if they bypass the frontend entirely and craft raw API calls? (Server-side validation)
  • What’s the blast radius if this component is fully compromised? (Lateral movement, data access)
  • What happens on double-submit within 100ms? (Idempotency)
  • Is there defense in depth, or does one failed check expose everything? (Layered security)

The CIA triad applied to every data flow:

  • Confidentiality: Encryption at rest and in transit, access controls at every hop, zero-trust between services
  • Integrity: Cryptographic verification of artifacts, input validation at trust boundaries, tamper detection
  • Availability: Redundancy, failover, rate limiting to prevent DoS, graceful degradation under attack

For systems with significant attack surface, I produce a formal STRIDE threat model, systematically enumerating threats per subsystem, classifying assets by sensitivity, identifying trust boundaries, and tracking mitigations to completion. The structured template ensures nothing falls through the cracks: every threat gets an owner, a mitigation plan, and a security test that verifies the fix.


Phase 10: SRE Review (/ygs-sre-review)

Code that works in development fails in production. AI has no intuition for this because it’s never been paged at 3am. It doesn’t know that a missing index causes 30-second queries under load, or that an unbounded list endpoint will OOM the service when it hits 10 million records. The SRE review skill forces failure-mode analysis from my production readiness experience:

For every changed component, I analyze:

  1. What happens when it fails? Crash, hang, corrupt data, or silent degradation? Each demands a different mitigation.
  2. Blast radius. Does failure cascade? A single unhealthy pod shouldn’t take down the cluster. Circuit breakers and bulkheads contain damage.
  3. Recovery path. Auto-recovers (best), requires restart (acceptable), requires manual intervention (document it), requires data repair (unacceptable without backups).
  4. Partial failure. What if step 3 of 5 succeeds but step 4 fails? Is the system in a consistent state? Are there compensating actions?
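
The partial-failure question has a mechanical answer when each step carries its own undo. Here is a minimal in-memory sketch of compensation in reverse order; the step names are hypothetical, and a real saga would persist its progress and handle compensations that themselves fail.

```python
def run_saga(steps):
    """Run (name, action, compensate) steps; undo completed ones on failure.

    Returns (ok, log), where log records what ran and what was undone.
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate in reverse order of completion.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        completed.append((name, compensate))
        log.append(f"done:{name}")
    return True, log


def noop():
    pass


def fail():
    raise RuntimeError("carrier API unavailable")


steps = [
    ("reserve-stock", noop, noop),
    ("charge-card", noop, noop),
    ("create-shipment", noop, noop),
    ("notify-carrier", fail, noop),   # step 4 of 5 fails
    ("send-receipt", noop, noop),
]
ok, log = run_saga(steps)
```

When step 4 fails, steps 3, 2, and 1 are compensated in that order and step 5 never runs.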

Observability, because you can’t fix what you can’t see:

  • Metrics: Latency percentiles (p50, p95, p99), error rates, throughput, saturation (CPU, memory, connections, disk).
  • Logging: Structured with correlation IDs. Proper levels. No PII. Enough context to diagnose without reproducing.
  • Tracing: Distributed tracing end-to-end. When a request touches 6 services, I need to see the full path without grepping logs across clusters.
  • Alerting: Threshold-based AND anomaly detection. Every alert links to a runbook. If an alert fires and the responder doesn’t know what to do, the alert is useless.
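
To show why the metrics item asks for percentiles instead of an average, here is a small sketch using the nearest-rank method on made-up latency samples. Monitoring systems typically estimate percentiles from histograms or sketches, so this is an illustration of the idea, not of how any particular tool computes them.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

The mean comes out at 59.6 ms, which describes no real request. The median and p95 are both 20 ms, and only p99 exposes the 2-second responses that some users actually waited for.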

Deployment safety:

  • Canary releases: Deploy to 1% of traffic, monitor for 15 minutes, auto-rollback on metric breach. This catches issues that tests miss.
  • Backward-compatible schema changes: Two-phase releases (add column -> deploy code that writes both -> migrate data -> deploy code that reads new -> remove old column). Never lock a production table.
  • Feature flags: For anything risky, ship dark and enable gradually. This decouples deployment from release.
  • Immutable infrastructure: No in-place patches. Every deployment is a fresh container from a verified image.

Testing pyramid from Google SRE practices:

Layer | Proportion | What It Catches
Unit tests | 80% | Logic errors, edge cases, regressions (fast, isolated, deterministic)
Integration tests | 15% | Component interactions, contract violations, real DB behavior
End-to-end tests | 5% | Critical user journeys, cross-service flows (expensive, flaky, essential)
Chaos testing | Periodic | Failure recovery, cascade prevention, degradation behavior
Property-based | Where applicable | Invariant violations across random inputs, edge cases you didn’t imagine

In my post about caching, I shared the caching-related production failures I’ve encountered repeatedly:

  • Thundering herd after cache expiry. All clients hit the backend simultaneously. Stagger TTLs and use cache stampede prevention.
  • Stale data during update failures. Serving old data is sometimes acceptable, sometimes catastrophic; know which case you’re in.
  • Cache unavailability causing cascading failures. Test performance without cache during peak load. If your system can’t function without cache, cache is a hard dependency, not an optimization.
  • Security: cache keys MUST respect authorization boundaries. I’ve seen cached responses served to unauthorized users because the cache key didn’t include tenant ID.
  • Bimodal behavior: when the system behaves fundamentally differently with vs. without cache, you have two systems to understand and debug. Minimize this.
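
Two of these failures have small structural fixes: put the tenant in the cache key, and stagger expiry times. The sketch below assumes a simple string key format and a plus-or-minus 10% TTL spread; both are hypothetical choices for illustration.

```python
import random


def cache_key(tenant_id, user_id, resource):
    """Scope every key by tenant and user so entries cannot cross boundaries."""
    return f"tenant:{tenant_id}:user:{user_id}:{resource}"


def jittered_ttl(base_seconds, spread=0.1, rng=random.random):
    """Return a TTL drawn uniformly from base * (1 - spread) to base * (1 + spread).

    Entries written at the same moment then expire at different times,
    which avoids a synchronized stampede on the backend.
    """
    return base_seconds * (1 + spread * (2 * rng() - 1))


key_a = cache_key("acme", "42", "invoices")
key_b = cache_key("globex", "42", "invoices")
ttl = jittered_ttl(300)
```

The two keys differ even though the user ID and resource are identical, so a response cached for one tenant cannot be served to another.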

Phase 11: QA and UAT (/ygs-qa, /ygs-uat)

I separate QA from UAT because they catch different failure modes. Code can be functionally correct and still unusable. An API can return the right data and still violate the user’s mental model of how the workflow should behave.

QA (/ygs-qa) tests the system objectively:

  • Functional correctness: Does core logic produce right results for valid inputs?
  • Edge cases: Boundary values, empty inputs, maximum limits, null handling, Unicode, special characters
  • Error paths: Invalid input, network failures, timeouts, partial failures — does the system degrade gracefully or crash?
  • Regressions: Do existing features still work after the change? This is where AI causes the most subtle damage: fixing one thing while breaking something adjacent.
  • Performance: Response times acceptable? No degradation under load? No memory leaks in long-running processes?

I score each category 0-10 and produce an overall health rating (0-50). This gives me a quantitative signal for ship readiness rather than a vague “looks good.”
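
The scoring arithmetic can be sketched as follows. The category keys are hypothetical names for the five QA areas listed above, and the example scores are invented.

```python
QA_CATEGORIES = (
    "functional", "edge_cases", "error_paths", "regressions", "performance",
)


def health_rating(scores):
    """Sum five 0-10 category scores into a 0-50 rating."""
    missing = [c for c in QA_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for category in QA_CATEGORIES:
        if not 0 <= scores[category] <= 10:
            raise ValueError(f"{category} must be between 0 and 10")
    return sum(scores[c] for c in QA_CATEGORIES)


rating = health_rating({
    "functional": 9, "edge_cases": 7, "error_paths": 6,
    "regressions": 8, "performance": 7,
})
```

Requiring all five categories keeps a skipped area from silently counting as a pass.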

UAT (/ygs-uat) tests from the customer’s perspective:

  • Walk through actual user stories end-to-end. Not individual API calls, complete workflows as a user would experience them.
  • Error messages must be helpful, not technical. “Connection refused to localhost:5432” is a developer error message. “We’re having trouble loading your data, please try again” is a user error message.
  • Check the golden path AND the “what if the user does something weird” paths. What if they double-click? What if they navigate back mid-flow? What if they have 10,000 items instead of 10?

Both must pass before shipping. I’ve shipped code that was technically correct but confused every user who touched it.


Phase 12: Ship and Learn (/ygs-ship, /ygs-retro)

Sync (/ygs-sync) addresses a problem I’ve seen kill design docs across every team I’ve worked with: docs drift from reality within weeks. The OpenSPDD project formalizes this as bidirectional synchronization. When code changes during review or refactoring, the design documents must update to reflect actual implementation, not just planned implementation. Stale docs are worse than no docs because they actively mislead. The sync skill compares implementation against spec, identifies drift, and proposes updates with rationale (“Design said Strategy pattern; implementation uses simple switch because only 2 variants exist”).

Ship (/ygs-ship) enforces the pre-merge ceremony I’ve seen skipped too many times:

  • All tests pass (not “most tests pass”; ALL tests pass)
  • Diff reviewed against base branch, no debug code, no .env files, no build artifacts
  • Version bumped appropriately (patch for fixes, minor for features, major for breaking changes)
  • Changelog updated so consumers know what changed
  • PR created with clear description for the record

No shortcuts. The ceremony exists because every shortcut I’ve taken in 30 years has eventually cost more than the ceremony would have.

Retro (/ygs-retro) closes the feedback loop — and this is where learning happens:

  • What went well: Practices to keep. Architectural decisions that paid off. Estimation accuracy.
  • What didn’t: Missed estimates (why specifically?). Bugs that shipped (what review would have caught them?). Scope creep (where did it come from?).
  • Patterns: Recurring issues across tasks reveal systemic problems. The same type of bug appearing three times isn’t bad luck — it’s a missing test category or a design flaw.

Five Whys with the Swiss Cheese model drives every retro:

  1. Why did the system fail? -> Direct cause
  2. Why was that possible? -> Missing guard
  3. Why wasn’t it prevented? -> Process gap
  4. Why wasn’t it detected? -> Monitoring gap
  5. Why wasn’t impact contained? -> Isolation gap

Multiple barriers had to fail simultaneously for the incident to reach customers. The fix is never “be more careful”; it’s always a structural change: a new test category, a new circuit breaker, a new alert threshold, a new deployment gate.


The Code-to-Production Pipeline

See my post on production readiness:


Beyond Vibe Coding: Specifications as the Missing Layer

Most teams use AI in what I call vibe coding mode: describe what you want in natural language, generate code, iterate. It works for small problems. It fails for complex systems. I tested this boundary directly by combining TLA+ formal specifications with Claude. The insight: AI fails not because of intelligence limits, but because we give it vague specifications. “Create a task management API” produces guesses. A TLA+ spec defining valid state transitions, invariants, and concurrent scenarios produces code that satisfies those properties precisely. You don’t need TLA+ for every feature. But the spectrum matters:

  • Vague natural language -> AI guesses, inconsistent edge case handling
  • Structured requirements (RFC 2119 + Given/When/Then) -> AI follows rules, mostly correct
  • Formal specifications (TLA+) -> AI implements verified properties, comprehensive test coverage from execution traces

Writing TLA+ properties reveals design flaws before implementation. I discovered that sequential task IDs create security vulnerabilities — a flaw that wouldn’t surface until production. The model checker found it automatically. The SDLC skills sit in the practical middle: structured enough to eliminate ambiguity, lightweight enough to use daily.

The REASONS Canvas: Structured Prompts as Design Contracts

The OpenSPDD project takes this further with a 7-dimension framework called the REASONS Canvas: Requirements, Entities, Approach, Structure, Operations, Norms, Safeguards. The distinction between a plan and a REASONS Canvas is the distinction between a suggestion and a contract. Plans describe intent; structured prompts define constraints that eliminate AI improvisation. I’ve incorporated the most valuable elements into these skills:

  • Entities as an explicit TRD questioning dimension — forcing domain model clarity before implementation
  • Norms and Safeguards — explicit negative constraints (“do NOT refactor existing structures unless requirements demand it”) that prevent AI from improvising
  • Operations sequencing — implementation order based on dependency analysis, not arbitrary file ordering
  • Bidirectional sync — the insight that design docs must stay accurate as code evolves, not just at initial creation

The key insight from SPDD’s design philosophy resonates: capability and control are separate dimensions. AI models keep getting smarter (capability improves), but that doesn’t automatically improve alignment with your specific intent (control).


Prompting Frameworks: Why Structure Beats Eloquence

The following prompting frameworks shaped how I designed every skill in this set:

  • R.E.A.S.O.N. (Role, Environment, Action, Steps, Output, Negatives): The Negatives dimension is underappreciated. Telling AI what NOT to do eliminates entire categories of unwanted behavior more reliably than telling it what to do. Every skill includes explicit constraints: “do not refactor existing code,” “do not touch files outside task scope,” “do not fix without establishing root cause.”
  • PRISM for reasoning models (Problem, Relevant Information, Success Measures): For newer reasoning models, step-by-step instructions can degrade performance. Define the problem, provide context, specify what success looks like, then let the model’s internal reasoning find the path. The refine skills work this way: instead of prescribing exact steps, they define dimensions to explore and quality criteria to meet.
  • Context hygiene: Agent quality is roughly 75% model, 25% context. Long sessions degrade as context fills and compacts. The SDLC skills address this structurally: each phase is a separate invocation, artifacts persist as files (not conversation history), and small vertical-slice tasks complete within a single focused session. Since the agent can’t remember across sessions, encode everything important into files that do.
  • Multi-Shot and Few-Shot Patterns: Providing examples of desired output format dramatically improves consistency. The skills encode this implicitly, e.g., the templates (PRD, TRD, design doc, threat model, task, ADR) serve as few-shot examples of the expected output structure. When the AI reads a template before generating, it produces output that matches the format without being told explicitly. The design doc template encodes the 9-section structure I’ve refined over years of writing design documents at scale: executive summary, background/problem statement, proposal with stakeholders, architecture with failure paths, alternatives considered, functional requirements traced to PRD, non-functional requirements (performance, security, operations, cost), rollout plan with phases, and a decision log recording ADRs inline. The threat model template follows STRIDE methodology with 13 sections: from defining security tenets and trust boundaries through systematic threat analysis grouped by subsystem, to security test plans and compliance checklists.

Model Selection: Match the Model to the Phase

Not every SDLC phase needs the same model. I’ve settled on a pattern that optimizes for both quality and cost:

Reasoning-heavy phases -> strongest model (Opus-class):

  • Requirements refinement (/ygs-refine-prd): Needs to challenge assumptions, find contradictions, explore implications
  • Technical design (/ygs-refine-trd): Needs architectural reasoning, trade-off analysis, pattern recognition across the codebase
  • Architecture refinement (/ygs-refine-architecture): System-level thinking, identifying failure modes, deep module analysis
  • Code review (/ygs-code-review): Catching subtle logic errors, race conditions, partial failure scenarios
  • Security review (/ygs-security-review): Adversarial thinking, attack path analysis, red-team perspective

Implementation phases -> fast model (Sonnet-class):

  • Implementation (/ygs-implement): Following well-defined specs, writing code within established patterns
  • Grooming (/ygs-grooming): Mechanical decomposition of well-understood requirements
  • Ship (/ygs-ship): Running tests, creating PRs, version bumping

Either works:

  • Estimation (/ygs-estimate): Benefits from reasoning for uncertainty analysis, but doesn’t require it
  • QA/UAT (/ygs-qa, /ygs-uat): Testing scenarios benefit from creativity but are often mechanical
  • Sync (/ygs-sync): Comparison is largely mechanical, but drift detection benefits from reasoning

The logic: design and review require judgment; implementation requires following instructions. A cheaper, faster model that faithfully executes a well-specified task often outperforms an expensive model given a vague one. This is why investing effort in the refinement phases (where you use the strongest model to produce precise specs) pays dividends in the implementation phase.
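The routing rule above can be written down as a small lookup. This is only a sketch of the mapping described in the three lists: the names Phase, ModelTier, and recommended_tier are hypothetical and are not part of the skills repository.

```rust
/// SDLC phases, named after the skills listed above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    RefinePrd,
    RefineTrd,
    RefineArchitecture,
    CodeReview,
    SecurityReview,
    Implement,
    Grooming,
    Ship,
    Estimate,
    Qa,
    Uat,
    Sync,
}

/// Which class of model a phase calls for (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelTier {
    Strongest, // reasoning-heavy work
    Fast,      // well-specified, mechanical work
    Either,    // benefits from reasoning but does not require it
}

/// Maps each phase to the tier suggested in the lists above.
/// The match is exhaustive, so adding a phase forces a routing decision.
pub fn recommended_tier(phase: Phase) -> ModelTier {
    match phase {
        Phase::RefinePrd
        | Phase::RefineTrd
        | Phase::RefineArchitecture
        | Phase::CodeReview
        | Phase::SecurityReview => ModelTier::Strongest,
        Phase::Implement | Phase::Grooming | Phase::Ship => ModelTier::Fast,
        Phase::Estimate | Phase::Qa | Phase::Uat | Phase::Sync => ModelTier::Either,
    }
}
```

Keeping the mapping in one exhaustive match means a new phase cannot be added without deciding which tier it belongs to.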

Industry Patterns for Model Routing

The practical takeaway: the quality of your specs determines how capable your implementation model needs to be. A well-specified task with clear acceptance criteria, explicit constraints, and defined negative boundaries (what NOT to do) can be implemented correctly by a fast model. A vague task requires a reasoning model to fill gaps, and it will fill them with assumptions from training data, not your domain knowledge.


Lessons from Agentic AI Design Patterns

I’ve catalogued 50 design patterns for generative and agentic AI across six categories — from content control and RAG to multi-agent orchestration. Several patterns directly inform how I structured these skills:

  • Reflection pattern: Agents that evaluate and revise their own output produce better results than single-shot generation. The SDLC skills implement this as separate review phases: generate (implement) -> evaluate (code review) -> revise (fix findings). The review skills ARE the reflection pattern, externalized into a structured workflow.
  • Prompt chaining over autonomy: Decomposing complex tasks into sequential, well-defined steps consistently outperforms giving an agent unbounded autonomy. The WBS skill does exactly this: hierarchically decomposes large features into small, sequential tasks with clear acceptance criteria. Each task is one link in the chain.
  • Tool calling with clear contracts: Agents that invoke well-defined tools with explicit input/output contracts produce more reliable results than agents reasoning in open-ended conversation. The skills serve as “tools” for the AI coding agent — each one a well-defined workflow with clear inputs (what phase we’re in, what artifacts exist) and outputs (specific deliverables with completion status).
  • Human-in-the-loop at decision points: The most reliable pattern across all my agent systems is autonomous execution for mechanical work with human checkpoints for judgment calls. The implementation skill embodies this: AI codes autonomously but STOPS at 3+ unplanned files, checkpoints every 5 files, and reports all deviations. You make the judgment calls; AI does the typing.
  • Memory tiers for context management: Production agents need structured memory: short-term (current session), medium-term (project conventions), and long-term (organizational knowledge). These skills serve as the medium and long-term memory tiers — encoding patterns and standards that survive across sessions.

The operational lesson from building all these systems: production AI requires the same engineering discipline as any distributed system. Circuit breakers for external API calls. Cost tracking with hard limits. Observability with correlation IDs. Graceful degradation when dependencies fail. These aren’t optional — they’re what separates demos from systems that run in production without 3am pages. The same discipline applied to AI coding workflows is what these skills encode.


Why This Matters Now

Martin Fowler recently asked the fundamental question: can AI evade the tar pit, or will it struggle in the accumulated complexity that slows every software project? The answer: AI doesn’t escape the tar pit. It digs faster. Autonomous AI agents mostly mean ‘I don’t know what it’s going to do.’ For production code, structured workflows beat autonomous agents making unbounded decisions. Jessica Kerr’s insight about double feedback loops matches how I use these skills: one loop builds features; another improves the development process. The skills aren’t static: each post-mortem adds a check to the security review, and each escaped bug extends the code review criteria. The AI benefits from that evolution without needing to “learn” it.


The Paradox: Writing vs. Reviewing

When you review AI-generated code, you don’t build the same understanding as when you write it. Here’s the middle path that works for me:

  1. Own the design. Write the architecture docs yourself. Define the interfaces. Specify the state machines. Draw the data flow diagrams. This is where deep thinking happens — at the design level, not the implementation level.
  2. Delegate the implementation. Let AI fill in the mechanical details within your design constraints. The type system and test suite verify it got the details right.
  3. Review with structure. Multi-pass review with explicit criteria catches what casual reading misses. Two passes (critical then design) force different modes of attention.
  4. Learn through refinement. The structured questioning in refinement sessions forces you to think deeply about the problem space. You can’t answer “what happens when this fails halfway through?” without building real understanding.

The skills encode this approach: you think deeply during refinement, design, and review. AI accelerates the mechanical middle. The result maintains conceptual integrity because the design philosophy flows from structured artifacts that persist across sessions, not from the agent’s ephemeral training data biases. As Brooks said: conceptual integrity matters more than any individual feature. These skills are how I maintain it while leveraging AI for the implementation work that used to consume 80% of my time.


Getting Started

# Install
git clone https://github.com/bhatti/you-got-skills.git ~/.claude/skills/you-got-skills

# Start with an idea
/ygs-refine-prd

# Work through the lifecycle
/ygs-refine-trd -> /ygs-estimate -> /ygs-spike (if risky) -> /ygs-wbs -> /ygs-implement -> /ygs-code-review -> /ygs-ship

The skills are pure markdown: no compilation, no dependencies, no telemetry. Read any skill in 30 seconds. Understand the full set in 10 minutes. Extend by adding a SKILL.md file in a new directory. Each skill stands alone. Use any subset in any order. Skip what doesn’t apply. The power isn’t in following a rigid process; it’s in having structured knowledge available when you need it, so the AI works with your standards instead of against them. The repository: github.com/bhatti/you-got-skills


Conclusion

The quest to make coding simpler is as old as coding itself. BASIC to 4GLs to UML to AI agents — every generation promises the same thing: focus on what, not how. Every generation delivers the same lesson: the thinking is the hard part, and you can’t automate it away. What’s different about AI coding agents is that they genuinely accelerate the how in ways previous tools never achieved. But acceleration without direction is faster wandering. Acceleration without conceptual integrity fragments your system’s design philosophy at speed.

These skills answer the question I kept returning to: how do you maintain conceptual integrity when the agent starts from zero every session? You encode your standards, conventions, and design philosophy into structured artifacts that survive across sessions. You review everything through principles that have survived three decades of paradigm shifts. You own the what and the why. You let AI accelerate the how.


The skills discussed in this post are available at github.com/bhatti/you-got-skills. Built for Claude Code but the principles apply to any AI-assisted development workflow.

Related Blog posts:

Topic | Key Insight
Functional Pipeline | Type system beats testing for correctness. Immutable data flows eliminate aliasing bugs. State machines make illegal transitions impossible.
API Design | 50 anti-patterns I now check automatically, covering idempotency, Command-Query Separation, etc.
Production Readiness and Incidents | Failures are multi-cause; fixes must be structural.
Domain Driven and Hexagonal Design | Bounded context, ubiquitous language, separation of concerns.
Production AI Agents (enterprise AI platforms with vLLM, multi-agent architectures with MCP and A2A, API compatibility checking, PII detection, and personal productivity) | The protocol is 10% of the work.

May 13, 2026

From Big Ball of Mud to Functional Pipeline: Building an Observability Platform in Rust

Filed under: Computing,Technology — admin @ 2:19 pm

I. The Big Ball of Mud

In your career, you often have to deal with a legacy codebase that nobody wants to touch but everyone depends on. I had to deal with a similar real-time observability system that ingested logs, metrics, and traces and routed them to storage, alerting, and analytics systems. It started as a small Node.js project but then grew into a Big Ball of Mud over the years: a system with no discernible structure, where everything depends on everything else, and changes in one area trigger cascading failures across the codebase. The symptoms were textbook:

  • God classes: A single PipelineManager had grown to thousands of lines, handling config loading, event parsing, routing, batching, error recovery, and metrics reporting.
  • Singletons everywhere: dozens of module-level mutable instances accessed via getInstance(). Testing required elaborate startup sequences and teardown.
  • Type erasure: thousands of any in the TypeScript codebase. Refactoring was impossible because the compiler couldn’t help.
  • Silent failures: hundreds of catch {} blocks that swallowed errors. Production incidents took hours to diagnose because the system happily continued with corrupted state.
  • Deep inheritance: A 6-level class hierarchy for “processors” where each level overrode different methods in incompatible ways.

This hurt the business through slower feature velocity, slower onboarding for new engineers, and a high change failure rate (see DORA metrics). But here is the thing: not everything was broken. Buried under layers of mutation, global state, and type erasure, there were sound architectural ideas. The original designers made some good calls.

This post describes how functional programming patterns, domain-driven design, and hexagonal architecture (see https://shahbhat.medium.com/applying-domain-driven-design-and-clean-onion-hexagonal-architecture-to-microservic-284d54b3a874), demonstrated with a POC implementation, can be used to eliminate entire categories of bugs and restore the ability to move fast.


II. Patterns Worth Preserving

The legacy system had three core architectural patterns that deserved preservation but can be implemented better in Rust.

Pipes and Filters

The legacy system used the Pipes and Filters pattern to flow events through a chain of independent processing stages. Each stage does one thing (parse, filter, enrich, mask, or route) and passes the result to the next stage. The problems were mutable events shared across stages, untyped filter functions, and no backpressure between stages. The chain was there, but the links were rusty.

The new POC implementation keeps Pipes and Filters as the backbone. Each stage is immutable, strongly typed, and composable. A stage receives an owned event and returns a new event (or drops it, or splits it into many). No stage can observe or interfere with another stage’s work.

// Legacy: mutable, untyped, no backpressure
// function processStage(event: any): any { event.stage = "done"; return event; }

// New: immutable, typed, composable
pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}
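To make the chaining concrete, here is a minimal sketch of a runner that folds one event through a list of stages. It assumes a simplified Event (a map of string fields) and the three-outcome FnResult used later in this post; run_pipeline, TagStage, and DropDebug are hypothetical names for illustration and are not taken from the POC.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the POC's Event type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub fields: BTreeMap<String, String>,
}

pub enum FnResult {
    Pass(Event),       // event continues downstream
    Split(Vec<Event>), // one event becomes many
    Drop,              // event is discarded
}

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

/// Runs one input event through every stage in order.
/// Each stage owns the event it receives, so no stage can observe another's work.
pub fn run_pipeline(stages: &[Box<dyn PipelineFn>], input: Event) -> Vec<Event> {
    let mut in_flight = vec![input];
    for stage in stages {
        let mut next = Vec::new();
        for event in in_flight {
            match stage.process(event) {
                FnResult::Pass(e) => next.push(e),
                FnResult::Split(es) => next.extend(es),
                FnResult::Drop => {}
            }
        }
        in_flight = next;
    }
    in_flight
}

/// Hypothetical stage: tags every event it sees.
pub struct TagStage;
impl PipelineFn for TagStage {
    fn name(&self) -> &str {
        "tag"
    }
    fn process(&self, mut event: Event) -> FnResult {
        // The stage owns `event`, so this change is invisible to earlier stages.
        event.fields.insert("stage".into(), "tagged".into());
        FnResult::Pass(event)
    }
}

/// Hypothetical stage: drops events whose level is "debug".
pub struct DropDebug;
impl PipelineFn for DropDebug {
    fn name(&self) -> &str {
        "drop_debug"
    }
    fn process(&self, event: Event) -> FnResult {
        if event.fields.get("level").map(String::as_str) == Some("debug") {
            FnResult::Drop
        } else {
            FnResult::Pass(event)
        }
    }
}
```

Because the runner only depends on the trait, stages can be reordered or replaced without touching the loop.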

Decorator/Enrich: Adding Context to Events

The legacy system enriched events with metadata such as timestamps, source identifiers, routing tags, and geo-IP data. This is the Decorator pattern applied to streaming data, and it is essential. Raw events from producers are incomplete; the pipeline adds context. The problem was mutation. The legacy enrichment stages modified events in place, so downstream stages could not trust what they received. The new POC system keeps enrichment but uses immutable event copies. Each enrichment stage returns a new event with the added data. The original is untouched.

// Enrichment returns a new event — the original is unchanged
pub fn enrich_with_timestamp(event: Event) -> Event {
    event.set_field("_enriched_at", FieldValue::Int(now_millis()))
}

Source/Sink: The Endpoints

Every pipeline has endpoints: where data comes in (sources) and where it goes out (sinks). The legacy system had these abstractions, though they were concrete classes rather than interfaces. The new POC system makes sources and sinks trait-based and pluggable. You can swap a Kafka source for an HTTP source without touching the pipeline logic. You can add a new sink type without modifying existing code.

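// async fn in a trait that is used as `dyn Trait` needs the async-trait macro
// (the domain crate lists async-trait among its dependencies).
use async_trait::async_trait;

#[async_trait]
pub trait EventSource: Send + Sync {
    async fn start(&mut self) -> Result<(), SourceError>;
    fn stream(&mut self) -> Pin<Box<dyn Stream<Item = Event> + Send + '_>>;
}

#[async_trait]
pub trait EventSink: Send + Sync {
    async fn write(&self, events: Vec<Event>) -> Result<(), SinkError>;
    async fn flush(&self) -> Result<(), SinkError>;
}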

These three patterns (Pipes and Filters, Decorator/Enrich, Source/Sink) are natural fits for functional style because they already think in terms of data transformation rather than stateful objects. Pipes and Filters is literally function composition: f ∘ g ∘ h. Decorator/Enrich is fmap over an event: it applies a function to the value inside a context without touching the structure. Source/Sink maps to the producer/consumer model at the heart of stream combinators.


III. The Architecture: DDD + Hexagonal in Rust

I previously wrote about DDD and Hexagonal architecture in https://shahbhat.medium.com/applying-domain-driven-design-and-clean-onion-hexagonal-architecture-to-microservic-284d54b3a874. I organized the POC as a Rust workspace with four crates, each representing a layer of the hexagonal architecture. Hexagonal architecture (also called ports and adapters) means: business logic sits in the center and knows nothing about the outside world. It defines “ports” as trait interfaces that the outside world must implement. The infrastructure layer provides “adapters” that fulfill those ports. The result is that you can test your domain logic without a database, without a network, without any I/O at all.

Dependencies point inward only: Interfaces depend on Application, Application depends on Domain, Infrastructure depends on Domain. The domain never imports anything from the outer layers. Here is how the Pipes and Filters pattern looks as an event flow through the system:

Each box in the filter chain is an independent PipelineFn. Each arrow carries an immutable Event. The chain is configured at runtime via the pipeline definition, but each stage is statically typed and independently testable.

The critical insight: Rust’s crate system makes architectural boundaries a compile-time guarantee. The domain crate literally cannot import infrastructure code. There is no way to “just quickly” add a database call to a domain service. This is the difference between architecture as aspiration and architecture as enforcement. The domain crate’s dependencies tell the whole story:

[dependencies]
ulid = { version = "1", features = ["serde"] }
serde = { version = "1", features = ["derive"] }
thiserror = "2"
async-trait = "0.1"
futures-core = "0.3"

No I/O. No database drivers. No HTTP clients. No channels. Just data structures, pure functions, and trait definitions (ports) that the infrastructure layer must implement.


IV. Group 1 Foundations: Types, Errors, and Dependencies

These six patterns form the bedrock.

Antipattern 1: Singletons to Dependency Injection

Before: The legacy system used module-level singletons for everything: database connections, config, registries.

// Module-level mutable state, accessed globally
let pipelineManager: PipelineManager;

export function getInstance(): PipelineManager {
  if (!pipelineManager) {
    pipelineManager = new PipelineManager(/* hardcoded deps */);
  }
  return pipelineManager;
}

// Somewhere far away in the codebase:
getInstance().processBatch(events); // untestable, hidden dependency

Testing was a nightmare. You could not create a PipelineManager with a mock database because it internally called DatabaseSingleton.getInstance().

After: Every dependency is passed explicitly through constructors. The composition root (main.rs) is the only place that knows how to wire things together:

// Composition root: wiring happens once, at startup
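// Assumes `conn` is a cheaply cloneable handle (e.g. a pool); passing it by value twice would not compile
let pipeline_repo = Arc::new(SqlitePipelineRepository::new(conn.clone()));
let route_repo = Arc::new(SqliteRouteRepository::new(conn));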
let event_bus = Arc::new(ChannelEventBus::new(256));

// Services receive their dependencies — they don't hunt for them
let handler = CreatePipelineHandler::new(
    pipeline_repo.clone(),
    event_bus.clone(),
);

This is the Reader monad made explicit: each handler is a function Config -> A, where the configuration (its dependencies) is threaded through construction rather than pulled from a global. No DI framework is needed, and the type system enforces what each component depends on.
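The payoff of constructor injection shows up in tests. The sketch below is a simplified, synchronous illustration: the port traits, InMemoryRepo, and RecordingBus are hypothetical stand-ins, not the POC's actual types, and the real handler is async.

```rust
use std::sync::{Arc, Mutex};

/// Port for persistence (simplified, synchronous assumption).
pub trait PipelineRepository: Send + Sync {
    fn save(&self, name: &str) -> Result<(), String>;
}

/// Port for publishing domain events (simplified assumption).
pub trait EventBus: Send + Sync {
    fn publish(&self, topic: &str);
}

/// The handler only knows the ports it was given at construction.
pub struct CreatePipelineHandler {
    repo: Arc<dyn PipelineRepository>,
    bus: Arc<dyn EventBus>,
}

impl CreatePipelineHandler {
    pub fn new(repo: Arc<dyn PipelineRepository>, bus: Arc<dyn EventBus>) -> Self {
        Self { repo, bus }
    }

    pub fn handle(&self, name: &str) -> Result<(), String> {
        self.repo.save(name)?;
        self.bus.publish("pipeline.created");
        Ok(())
    }
}

/// Test double: keeps saved names in memory instead of a database.
#[derive(Default)]
pub struct InMemoryRepo {
    pub saved: Mutex<Vec<String>>,
}

impl PipelineRepository for InMemoryRepo {
    fn save(&self, name: &str) -> Result<(), String> {
        self.saved.lock().unwrap().push(name.to_string());
        Ok(())
    }
}

/// Test double: records published topics for later assertions.
#[derive(Default)]
pub struct RecordingBus {
    pub topics: Mutex<Vec<String>>,
}

impl EventBus for RecordingBus {
    fn publish(&self, topic: &str) {
        self.topics.lock().unwrap().push(topic.to_string());
    }
}
```

A test builds the handler with CreatePipelineHandler::new(repo.clone(), bus.clone()) and then inspects repo.saved and bus.topics; no startup sequence or global teardown is involved.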

Antipattern 2: Module-Level Mutable State to Immutable Values

Before: Events were passed by reference and mutated in place across pipeline stages:

function processEvent(event: any): void {
  event.timestamp = Date.now();        // mutate in place
  event.fields.processed = true;       // caller's copy is changed
  event.metadata.stage = "enriched";   // invisible side effect
}

This is where the Decorator/Enrich pattern went wrong in the legacy system. The enrichment was correct in intent but destructive in implementation.

After: Events are immutable value objects. Every transformation returns a new event:

// Event is immutable — set_field returns a NEW event
pub fn set_field(&self, name: impl Into<FieldName>, value: FieldValue) -> Self {
    let mut new_event = self.clone();
    new_event.fields.insert(name.into(), value);
    new_event
}

// Pipeline functions take ownership and return new values
pub trait PipelineFn: Send + Sync {
    fn process(&self, event: Event) -> FnResult;
}

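An immutable Event makes pure transformations referentially transparent: a call such as event.set_field(name, value) can be replaced by its result anywhere in the program with no change in behavior. (enrich_with_timestamp is impure only in that it reads the clock through now_millis(); its handling of the event has no side effects.) No aliasing bugs. Rust guarantees that while you hold a shared reference to an event, nobody else can mutate it.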
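A small sketch of that property, using a simplified Event with a BTreeMap of fields (an assumption, not the POC's definition). The helper enrich_with_timestamp_at is hypothetical: it takes the clock reading as an argument so the whole function stays pure and easy to test.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Int(i64),
    Text(String),
}

/// Simplified immutable event (assumption): fields are private and never mutated in place.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Event {
    fields: BTreeMap<String, FieldValue>,
}

impl Event {
    /// Returns a NEW event with the field set; `self` is left untouched.
    pub fn set_field(&self, name: impl Into<String>, value: FieldValue) -> Self {
        let mut new_event = self.clone();
        new_event.fields.insert(name.into(), value);
        new_event
    }

    pub fn get(&self, name: &str) -> Option<&FieldValue> {
        self.fields.get(name)
    }
}

/// Hypothetical pure variant of the enrichment: the timestamp is an input,
/// so the same arguments always produce the same event.
pub fn enrich_with_timestamp_at(event: &Event, now_millis: i64) -> Event {
    event.set_field("_enriched_at", FieldValue::Int(now_millis))
}
```

Calling enrich_with_timestamp_at twice with the same inputs yields equal events, and the original event never gains the _enriched_at field.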

Antipattern 5: God Class to Bounded Contexts

The thousands of lines in PipelineManager were split across four crates. Each crate has exactly one responsibility:

// domain/   — Event, Pipeline, Route, FnResult (pure data + logic)
// app/      — CreatePipelineHandler, IngestEventHandler (orchestration)
// infra/    — SqlitePipelineRepository, ChannelEventBus (I/O adapters)
// api/      — REST endpoints, CLI commands (user interface)

The compiler enforces the boundaries. You cannot accidentally couple the routing logic to the database layer.

Antipattern 7: Error Swallowing to Result Types

Before: Errors vanished into the void:

try {
  const pipeline = await loadPipeline(id);
  const result = pipeline.process(event);
  await sink.write(result);
} catch (e) {
  // "it's fine"
}

There were hundreds of catch blocks like this in the legacy codebase. When something went wrong in production, the system kept running in a corrupted state.

After: Errors are values in the type signature. You cannot ignore them without the compiler warning you:

#[derive(Debug, thiserror::Error)]
pub enum DomainError {
    #[error("validation: {0}")]
    Validation(String),
    #[error("{0} not found: {1}")]
    NotFound(String, String),
    #[error("pipeline execution: {0}")]
    PipelineExecution(String),
    #[error("persistence: {0}")]
    Persistence(String),
}

// Every function that can fail declares it in its type
pub async fn handle(&self, cmd: CreatePipelineCommand) -> Result<Pipeline, DomainError> {
    // ... construction of `pipeline` from `cmd` is elided in this excerpt ...
    pipeline.validate()?;  // ? propagates errors — impossible to forget
    self.pipeline_repo.save(&pipeline).await?;
    Ok(pipeline)
}

The ? operator is syntactic sugar for monadic bind over Result. Scala’s for-comprehension (for { x <- f1; y <- f2 } yield ...) and Rust’s ?-chaining are the same pattern: sequence dependent computations and short-circuit on the first failure, propagating the typed error to the caller.
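The equivalence is easy to see side by side. In this sketch, validate and save are hypothetical helpers with made-up rules, and DomainError is trimmed to two variants without thiserror; the point is only that the ?-chain and the and_then chain behave the same. (One difference worth knowing: ? also converts the error through From, which and_then does not.)

```rust
#[derive(Debug, PartialEq)]
pub enum DomainError {
    Validation(String),
    Persistence(String),
}

/// Hypothetical validation step: rejects empty names.
pub fn validate(name: &str) -> Result<&str, DomainError> {
    if name.is_empty() {
        Err(DomainError::Validation("name is empty".into()))
    } else {
        Ok(name)
    }
}

/// Hypothetical persistence step: pretends names longer than 8 bytes cannot be stored.
pub fn save(name: &str) -> Result<usize, DomainError> {
    if name.len() > 8 {
        Err(DomainError::Persistence("name too long".into()))
    } else {
        Ok(name.len())
    }
}

/// Sequencing with `?`: each step runs only if the previous one succeeded.
pub fn create_with_question_mark(name: &str) -> Result<usize, DomainError> {
    let valid = validate(name)?;
    let stored = save(valid)?;
    Ok(stored)
}

/// The same sequencing written as an explicit monadic bind.
pub fn create_with_and_then(name: &str) -> Result<usize, DomainError> {
    validate(name).and_then(save)
}
```

Both functions return the first error they hit and never call save when validation fails.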

Antipattern 11: Primitive Obsession to Newtypes

Before: IDs were raw strings. Mix them up and nothing stops you:

function linkPipeline(pipelineId: string, routeId: string) { ... }
// Oops: arguments swapped, compiles fine, fails at runtime
linkPipeline(routeId, pipelineId);

After: Each ID is a distinct type. The compiler catches mix-ups:

macro_rules! define_id {
    ($name:ident) => {
        #[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
        pub struct $name(String);
        impl $name {
            pub fn new() -> Self { Self(ulid::Ulid::new().to_string()) }
            pub fn as_str(&self) -> &str { &self.0 }
        }
    };
}

define_id!(PipelineId);
define_id!(RouteId);
define_id!(EventId);
// fn link(pipeline: &PipelineId, route: &RouteId) — can't swap these

This is the newtype pattern: PipelineId and RouteId are both a String at runtime, but they are distinct types at compile time. The wrapper adds no runtime data, so the safety costs nothing.
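A dependency-free sketch of the same idea, written out without the macro, ULIDs, or serde. The from_string constructor and the link function are hypothetical; #[repr(transparent)] is added here to make the "same layout as String" claim a guarantee rather than an observation.

```rust
/// Newtype over String: same layout as String, but a distinct type.
#[repr(transparent)]
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PipelineId(String);

#[repr(transparent)]
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RouteId(String);

impl PipelineId {
    /// Hypothetical constructor; the POC generates ULIDs instead.
    pub fn from_string(raw: impl Into<String>) -> Self {
        Self(raw.into())
    }
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

impl RouteId {
    /// Hypothetical constructor; the POC generates ULIDs instead.
    pub fn from_string(raw: impl Into<String>) -> Self {
        Self(raw.into())
    }
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// The parameter types document and enforce the argument order.
pub fn link(pipeline: &PipelineId, route: &RouteId) -> String {
    format!("{} -> {}", pipeline.as_str(), route.as_str())
}

// link(&route_id, &pipeline_id) would be rejected with "mismatched types",
// which is the swapped-argument bug from the TypeScript version caught at compile time.
```

The wrappers occupy exactly as many bytes as the String they hold, so the distinction exists only for the compiler.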

Antipattern 18: any Types to Generics and Trait Bounds

Before: The pipeline function interface accepted and returned any:

type ProcessorFn = (event: any) => any;
// No contract. No guarantees. Runtime explosions.

After: Trait bounds make the contract explicit and compiler-checked:

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

pub trait PipelineFnFactory: Send + Sync {
    fn create(&self, config: &serde_json::Value) -> Result<Box<dyn PipelineFn>, String>;
}

The trait says: “Give me an Event, I’ll give you an FnResult (Pass, Split, or Drop).” No ambiguity. No any. The compiler enforces the contract at every call site.


V. Group 2 Data Modeling: Making Illegal States Unrepresentable

Antipattern 3: Mode/Env Branching to Sum Types

A sum type (one kind of algebraic data type, or ADT) is a type whose values are exactly one of several variants; in Rust it is an enum where each variant can carry different data. Instead of one struct with optional fields where only some combinations are valid, you define each valid combination as its own variant.

Before: Configuration types were discriminated by strings, with every consumer doing defensive checking:

interface FunctionConfig {
  type: string;         // "eval" | "drop" | "mask" | ... maybe?
  field?: string;       // required for some types
  pattern?: string;     // required for mask and regex
  expression?: string;  // required for eval
  targetFields?: string[];  // only regex
}

// Every consumer:
if (config.type === "eval") {
  if (!config.field || !config.expression) throw new Error("invalid");
}

After: An enum makes illegal states unrepresentable. Each variant carries exactly its required data:

pub enum FunctionConfig {
    Eval { field: String, expression: String },
    Drop { filter: String },
    Mask { field: String, pattern: String, replacement: String },
    RegexExtract { field: String, pattern: String, target_fields: Vec<String> },
}

// Pattern matching is exhaustive — add a new variant and the compiler
// shows you every place that needs updating
fn resolve(config: &FunctionConfig) -> Result<Box<dyn PipelineFn>, DomainError> {
    match config {
        FunctionConfig::Eval { field, expression } => { /* guaranteed present */ }
        FunctionConfig::Drop { filter } => { /* ... */ }
        FunctionConfig::Mask { field, pattern, replacement } => { /* ... */ }
        FunctionConfig::RegexExtract { field, pattern, target_fields } => { /* ... */ }
    }
}

Similarly, the result of processing an event is a sum type:

pub enum FnResult {
    Pass(Event),       // event continues downstream
    Split(Vec<Event>), // one event becomes many
    Drop,              // event is discarded
}

This is the core ADT insight: product types (structs, where a value has field A and field B) model data that is always fully present; sum types (enums, where a value is variant A or variant B) model data that is exactly one of several alternatives. Illegal states become unrepresentable by construction. FnResult is a sum type that makes the three possible outcomes of a pipeline stage explicit. The legacy equivalent was a return value of null | Event | Event[], which was invisible to the type system and easy to lose in a catch {} block.

Antipattern 4: Type-String Dispatch to Registry Pattern

Before: Function types were resolved with an if/else chain that grew with every new type:

function createFunction(config: any): ProcessorFn {
  if (config.type === "eval") return new EvalFn(config);
  else if (config.type === "drop") return new DropFn(config);
  else if (config.type === "mask") return new MaskFn(config);
  // ... grows forever, easy to forget one
  else throw new Error(`unknown type: ${config.type}`);
}

After: A registry maps type names to factories. Adding a new type means registering one more factory; the lookup and dispatch code never changes:

pub struct DefaultFunctionRegistry {
    factories: HashMap<String, Box<dyn PipelineFnFactory>>,
}

impl DefaultFunctionRegistry {
    pub fn new() -> Self {
        let mut registry = Self { factories: HashMap::new() };
        registry.factories.insert("eval".into(), Box::new(EvalFnFactory));
        registry.factories.insert("drop".into(), Box::new(DropFnFactory));
        registry.factories.insert("mask".into(), Box::new(MaskFnFactory));
        registry.factories.insert("regex_extract".into(), Box::new(RegexExtractFnFactory));
        registry
    }
}

The registry is an interpreter pattern where you separate the description of what to do (FunctionConfig as a DSL) from how to do it (PipelineFnFactory as the interpreter). This is the same structure as Free Monads: define your algebra as data (each FunctionConfig variant is an AST node), then write interpreters against it (production factories, test stubs, dry-run validators). The registry approach is the pragmatic version without monad transformer overhead, just a HashMap of factories. The key property is the same: you can swap the interpreter without touching the program description.
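To show the "swap the interpreter" property, here is a minimal sketch of a registry with an explicit register method and a lookup that fails cleanly for unknown types. It is not the POC's code: the config is a plain &str instead of serde_json::Value to keep the example dependency-free, and FunctionRegistry, register, DropFn, and DropFnFactory are hypothetical names.

```rust
use std::collections::HashMap;

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
}

pub trait PipelineFnFactory: Send + Sync {
    /// `config` is a plain string here (assumption); the POC passes JSON.
    fn create(&self, config: &str) -> Result<Box<dyn PipelineFn>, String>;
}

#[derive(Default)]
pub struct FunctionRegistry {
    factories: HashMap<String, Box<dyn PipelineFnFactory>>,
}

impl FunctionRegistry {
    /// New function types are added by registration, not by editing a dispatch chain.
    pub fn register(&mut self, type_name: &str, factory: Box<dyn PipelineFnFactory>) {
        self.factories.insert(type_name.to_string(), factory);
    }

    /// Looks up the factory for `type_name` and lets it build the function.
    pub fn create(&self, type_name: &str, config: &str) -> Result<Box<dyn PipelineFn>, String> {
        self.factories
            .get(type_name)
            .ok_or_else(|| format!("unknown function type: {type_name}"))?
            .create(config)
    }
}

/// Hypothetical function type used for the demonstration.
pub struct DropFn {
    pub filter: String,
}

impl PipelineFn for DropFn {
    fn name(&self) -> &str {
        "drop"
    }
}

pub struct DropFnFactory;

impl PipelineFnFactory for DropFnFactory {
    fn create(&self, config: &str) -> Result<Box<dyn PipelineFn>, String> {
        if config.is_empty() {
            return Err("drop requires a filter".to_string());
        }
        Ok(Box::new(DropFn { filter: config.to_string() }))
    }
}
```

A test can register stub factories under the same names as production ones, which is the practical meaning of swapping the interpreter without touching the program description.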

Antipattern 8: Temporal Coupling to Typestate Builder

Typestate is a pattern that uses the type system to enforce valid state transitions at compile time. You encode the object’s lifecycle phase into its type, so calling methods in the wrong order is a compiler error rather than a runtime error.

Before: Pipelines could be created in invalid states — no functions, empty description — and the error only surfaced at runtime:

const pipeline = new Pipeline();
pipeline.save(); // Oops: no functions, no description. Runtime error.

After: The builder uses phantom types to make the invalid state impossible to compile:

pub struct PipelineBuilder<State> {
    id: PipelineId,
    description: String,
    functions: Vec<PipelineFunction>,
    _state: PhantomData<State>,
}

// Can only add functions in the NoFunctions state (transitions to HasFunctions)
impl PipelineBuilder<NoFunctions> {
    pub fn add_function(self, func: PipelineFunction) -> PipelineBuilder<HasFunctions> { ... }
}

// build() only exists on HasFunctions — you literally cannot call it without functions
impl PipelineBuilder<HasFunctions> {
    pub fn build(self) -> Pipeline { ... }
}

Rust’s ownership system is an affine type system: values may be used at most once (moved, not copied, unless the type is Copy). The typestate builder exploits this: add_function(self) takes ownership of the builder and returns a new one in the next state. You cannot hold onto the old PipelineBuilder<NoFunctions> after calling add_function; any later use of the moved value is a compile error. This is stronger than a runtime lifecycle check: the invalid state cannot exist in memory, not just in logic.
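A compilable version of the builder might look like the sketch below. Pipeline is reduced to a description plus function names, and the second add_function (on the HasFunctions state, so more than one function can be added) is an assumption that the excerpt above does not show.

```rust
use std::marker::PhantomData;

// Marker types for the lifecycle phase; they carry no data.
pub struct NoFunctions;
pub struct HasFunctions;

// Hypothetical simplified Pipeline: functions are plain names here.
#[derive(Debug, PartialEq)]
pub struct Pipeline {
    pub description: String,
    pub functions: Vec<String>,
}

pub struct PipelineBuilder<State> {
    description: String,
    functions: Vec<String>,
    _state: PhantomData<State>,
}

impl PipelineBuilder<NoFunctions> {
    pub fn new(description: &str) -> Self {
        PipelineBuilder {
            description: description.to_string(),
            functions: Vec::new(),
            _state: PhantomData,
        }
    }

    // Consumes the empty builder; the moved value cannot be used again.
    pub fn add_function(self, name: &str) -> PipelineBuilder<HasFunctions> {
        PipelineBuilder {
            description: self.description,
            functions: vec![name.to_string()],
            _state: PhantomData,
        }
    }
}

impl PipelineBuilder<HasFunctions> {
    // Further additions stay in the HasFunctions state.
    pub fn add_function(mut self, name: &str) -> Self {
        self.functions.push(name.to_string());
        self
    }

    // build() exists only in this state.
    pub fn build(self) -> Pipeline {
        Pipeline { description: self.description, functions: self.functions }
    }
}
```

Calling PipelineBuilder::<NoFunctions>::new("x").build() does not compile, because no build method is defined for that state.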

Antipattern 9: Global Mutable Registry to Persistent Data Structures

Before: The route table was a global mutable singleton. Updates caused race conditions and stale reads:

class RouteRegistry {
  private static instance: RouteRegistry;
  private rules: RouteRule[] = []; // mutated by multiple threads
  addRule(rule: RouteRule) { this.rules.push(rule); } // race!
}

After: Route tables are immutable values. “Updating” returns a new version:

impl RouteTable {
    pub fn add_rule(&self, rule: RouteRule) -> Self {
        let mut new_table = self.clone();
        new_table.rules.push(rule);
        new_table.version += 1;
        new_table
    }
}

In a real persistent data structure (Clojure’s HAMT, Haskell’s finger trees), an update copies only the path from the modified node to the root, which is O(log n) nodes rather than O(n). Rust’s clone() here is a full O(n) copy, which is fine for small route tables. The principle is the same: multiple versions coexist safely because neither modifies the other.
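For a feel of what structural sharing looks like in Rust, here is a small sketch that swaps the Vec for an immutable linked list whose tails are shared through Arc. This is not the post's RouteTable; RouteRule is reduced to a string wrapper and tail_is exists only to make the sharing observable.

```rust
use std::sync::Arc;

// Hypothetical simplified rule type.
#[derive(Debug, Clone, PartialEq)]
pub struct RouteRule(pub String);

// A node in an immutable singly linked list; tails are shared via Arc.
struct Node {
    rule: RouteRule,
    next: Option<Arc<Node>>,
}

#[derive(Clone)]
pub struct RouteTable {
    head: Option<Arc<Node>>,
    pub version: u64,
}

impl RouteTable {
    pub fn new() -> Self {
        RouteTable { head: None, version: 0 }
    }

    // O(1): allocates one node and shares the rest of the list with `self`.
    pub fn add_rule(&self, rule: RouteRule) -> Self {
        RouteTable {
            head: Some(Arc::new(Node { rule, next: self.head.clone() })),
            version: self.version + 1,
        }
    }

    // Rules in most-recently-added-first order.
    pub fn rules(&self) -> Vec<RouteRule> {
        let mut out = Vec::new();
        let mut cursor = self.head.as_ref();
        while let Some(node) = cursor {
            out.push(node.rule.clone());
            cursor = node.next.as_ref();
        }
        out
    }

    // True when this table's tail is the very same allocation as `older`'s head.
    pub fn tail_is(&self, older: &RouteTable) -> bool {
        match (self.head.as_ref().and_then(|n| n.next.as_ref()), older.head.as_ref()) {
            (Some(a), Some(b)) => Arc::ptr_eq(a, b),
            _ => false,
        }
    }
}
```

Both versions stay readable after the update, and the newer one reuses the older one's nodes instead of copying them.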

Antipattern 12: Signal-Based Dispatch to Handler Map

Before: Event handling used a giant switch statement that grew with every new event type:

function handleSignal(signal: string, data: any) {
  switch (signal) {
    case "pipeline.created": notifyUI(data); break;
    case "pipeline.deleted": cleanupCache(data); break;
    // ... 40 more cases
  }
}

After: A handler map registers handlers by event type. New events are handled by registering a new handler, not by modifying existing code:

// Register handlers at composition time
let mut handlers: HashMap<String, Box<dyn EventHandler>> = HashMap::new();
handlers.insert("pipeline.created".into(), Box::new(NotifyUiHandler));
handlers.insert("pipeline.deleted".into(), Box::new(CleanupCacheHandler));

// Dispatch is a single lookup — no switch statement
if let Some(handler) = handlers.get(event.event_type()) {
    handler.handle(event).await?;
}

Antipattern 13: Anemic Domain Model to Rich Domain Objects

Before: Pipeline was a data bag with all logic living in external “service” classes:

class Pipeline {
  id: string;
  functions: FunctionConfig[];
  // That's it. No behavior. Just a struct with public fields.
}

class PipelineService {
  validate(p: Pipeline) { /* 200 lines */ }
  addFunction(p: Pipeline, f: FunctionConfig) { /* 50 lines */ }
}

After: The pipeline owns its behavior. Invariants are maintained internally:

impl Pipeline {
    pub fn add_function(&mut self, func: PipelineFunction) {
        self.functions.push(func);
        self.version += 1; // version always tracks mutations
    }

    pub fn validate(&self) -> Result<(), DomainError> {
        if self.description.is_empty() {
            return Err(DomainError::Validation("description cannot be empty".into()));
        }
        if self.functions.is_empty() {
            return Err(DomainError::Validation("must have at least one function".into()));
        }
        Ok(())
    }

    pub fn active_functions(&self) -> impl Iterator<Item = &PipelineFunction> {
        self.functions.iter().filter(|f| !f.disabled)
    }
}

VI. Group 3: Composition and Control Flow

Antipattern 6: forEach + Push to Iterator Combinators

Before: Processing was imperative loops accumulating into mutable vectors:

function processBatch(events: any[], functions: ProcessorFn[]): any[] {
  const results: any[] = [];
  for (const event of events) {
    let current = event;
    for (const fn of functions) {
      const result = fn(current);
      if (result === null) break;
      if (Array.isArray(result)) { results.push(...result); break; }
      current = result;
    }
    if (current) results.push(current);
  }
  return results;
}

After: The pipeline engine folds (reduces) over the function chain. This makes the Pipes and Filters pattern explicit, where each function is a filter stage and the vector of events is the pipe:

pub struct PipelineEngine;

impl PipelineEngine {
    pub fn process_event(event: Event, functions: &[&dyn PipelineFn]) -> Vec<FnResult> {
        let mut current_events = vec![event];
        let mut final_results = Vec::new();

        for func in functions {
            let mut next_batch = Vec::new();
            for evt in current_events {
                match func.process(evt) {
                    FnResult::Pass(e) => next_batch.push(e),
                    FnResult::Split(es) => next_batch.extend(es),
                    FnResult::Drop => final_results.push(FnResult::Drop),
                }
            }
            current_events = next_batch;
        }

        final_results.extend(current_events.into_iter().map(FnResult::Pass));
        final_results
    }
}

The pipeline engine’s outer loop is a fold (catamorphism) over the function list, with the accumulator being the current set of live events. Every iteration either passes events forward, fans them out (Split), or drops them. This is the structural recursion pattern: the shape of the computation mirrors the shape of the data (a linear chain of functions).
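The same loop can be written with Iterator::fold so the accumulator is visible in the code. This is a sketch with simplified stand-ins: events are plain strings, stages are function pointers, and the Stage alias and the three example stages are invented for illustration.

```rust
// Hypothetical minimal types: an event is a String, a stage is a plain function.
#[derive(Debug, Clone, PartialEq)]
pub enum FnResult {
    Pass(String),
    Split(Vec<String>),
    Drop,
}

pub type Stage = fn(String) -> FnResult;

// The accumulator is the set of live events; each stage maps it to the next set.
pub fn run_pipeline(event: String, stages: &[Stage]) -> Vec<String> {
    stages.iter().fold(vec![event], |live, stage| {
        live.into_iter()
            .flat_map(|evt| match stage(evt) {
                FnResult::Pass(e) => vec![e],
                FnResult::Split(es) => es,
                FnResult::Drop => Vec::new(),
            })
            .collect()
    })
}

pub fn upper(e: String) -> FnResult {
    FnResult::Pass(e.to_uppercase())
}

pub fn split_on_comma(e: String) -> FnResult {
    FnResult::Split(e.split(',').map(str::to_string).collect())
}

pub fn drop_empty(e: String) -> FnResult {
    if e.is_empty() { FnResult::Drop } else { FnResult::Pass(e) }
}
```

Unlike process_event above, this variant returns only the surviving events and does not record drops; that is a simplification, not a recommendation.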

Antipattern 10: Callback Chains to Async Composition

Before: Nested callbacks (or deeply chained .then() promises) with error handling at each level:

loadConfig()
  .then(config => loadPipeline(config.pipelineId))
  .then(pipeline => pipeline.process(event))
  .then(result => sink.write(result))
  .catch(e => { /* which step failed? */ });

After: Rust’s async/await with ? gives linear, readable control flow:

async fn handle(&self, cmd: IngestEventCommand) -> Result<Vec<Event>, DomainError> {
    let route_table = self.route_repo.get_table().await?;
    let decisions = RoutingEngine::route_event(&cmd.event, &route_table)?;
    for decision in decisions {
        let pipeline = self.pipeline_repo.get(&decision.pipeline_id).await?;
        // ... each ? short-circuits on error with full context
    }
    Ok(all_output)
}

Antipattern 14: Eager Initialization to Lazy Evaluation

Before: All pipeline functions, parsers, and regex patterns were compiled at startup, even if never used:

// All compiled eagerly at module load time, even for pipelines never triggered
const ALL_PATTERNS = compileAllRegexPatterns(); // 500ms startup cost

After: Expensive initializations are deferred until first use with once_cell::Lazy, and streams are demand-driven:

use once_cell::sync::Lazy;

static REGEX_CACHE: Lazy<HashMap<String, Regex>> = Lazy::new(|| {
    // Only compiled when first accessed
    HashMap::new()
});

// Sources produce events on demand — pull, not push
impl EventSource for FileSource {
    fn stream(&mut self) -> Pin<Box<dyn Stream<Item = Event> + Send + '_>> {
        // Lines are read only when the consumer calls .next()
        Box::pin(self.reader.lines().map(|line| parse_event(line)))
    }
}

Lazy::new is memoization with a single input (the unit type): the computation runs at most once and its result is cached forever. This is safe only because the initializer is pure: the same (empty) input always produces the same output. If the initializer had side effects, re-running it and caching it would produce different behavior.
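The run-at-most-once behavior is easy to observe. The sketch below uses the standard library’s std::sync::OnceLock instead of once_cell::Lazy so it needs no dependencies. The pattern strings and the INIT_RUNS counter are made up, and the counter is a deliberate side effect added only to make the caching visible.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the initializer runs; present only to make the behavior observable.
static INIT_RUNS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for an expensive table of patterns.
static PATTERNS: OnceLock<Vec<String>> = OnceLock::new();

pub fn patterns() -> &'static Vec<String> {
    // The closure runs on the first call only; later calls return the cached value.
    PATTERNS.get_or_init(|| {
        INIT_RUNS.fetch_add(1, Ordering::SeqCst);
        vec![
            "\\d{3}-\\d{2}-\\d{4}".to_string(),
            "[a-z]+@[a-z]+\\.com".to_string(),
        ]
    })
}

pub fn init_runs() -> usize {
    INIT_RUNS.load(Ordering::SeqCst)
}
```

Nothing is built until patterns() is first called, and repeated calls leave the counter at one.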

Antipattern 15: Mixed I/O + Logic to Effect Separation

Before: Business logic was interleaved with database calls, HTTP requests, and logging:

async function processEvent(event: any) {
  const config = await db.getConfig();      // I/O
  event.enriched = transform(event, config); // logic
  await kafka.publish(event);                // I/O
  metrics.increment("processed");            // I/O
  if (event.severity > 3) {
    await alertService.fire(event);          // I/O
  }
  return event;
}

After: Domain services are pure functions. I/O lives exclusively in the infrastructure layer:

// Domain service: PURE — no I/O, no side effects
impl PipelineEngine {
    pub fn process_batch(events: Vec<Event>, functions: &[&dyn PipelineFn]) -> BatchResult {
        // Pure computation: transform events through functions
    }
}

// Application layer: orchestrates I/O around pure domain logic
impl IngestEventHandler {
    pub async fn handle(&self, cmd: IngestEventCommand) -> Result<Vec<Event>, DomainError> {
        let route_table = self.route_repo.get_table().await?;   // I/O: read
        let decisions = RoutingEngine::route_event(&cmd.event, &route_table)?; // Pure
        // ... resolve functions (I/O), process (pure), return results
    }
}

This is Functional Core, Imperative Shell (FCIS) in practice. PipelineEngine::process_batch is the functional core: a pure function that is trivially testable with no mocks needed. IngestEventHandler::handle is the imperative shell that orchestrates I/O around the pure core, calling out to repositories and event buses. The pattern is the same as Haskell’s IO monad: describe what to do (pure), defer execution to the edge (impure).
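One way to see "describe what to do, defer execution" is to have the core return effects as data. The sketch below is an illustration under assumptions, not the post's engine: Event, Effect, decide, and run_shell are hypothetical names, and the "outside world" is just a Vec<String>.

```rust
// Hypothetical simplified event type.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub severity: u8,
    pub message: String,
}

// Effects are plain data that the shell executes later.
#[derive(Debug, PartialEq)]
pub enum Effect {
    Publish(Event),
    Alert(String),
}

// Functional core: pure, no I/O. Same input, same output.
pub fn decide(event: Event) -> Vec<Effect> {
    let mut effects = Vec::new();
    if event.severity > 3 {
        effects.push(Effect::Alert(format!("high severity: {}", event.message)));
    }
    effects.push(Effect::Publish(event));
    effects
}

// Imperative shell: the only place that touches the outside world
// (modeled here as a log of strings).
pub fn run_shell(event: Event, sink: &mut Vec<String>) {
    for effect in decide(event) {
        match effect {
            Effect::Publish(e) => sink.push(format!("published {}", e.message)),
            Effect::Alert(text) => sink.push(format!("alert {text}")),
        }
    }
}
```

The core can be tested by comparing values; only the shell needs a fake sink.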

Antipattern 16: Monolithic Functions to Function Composition

The key insight from the pipeline engine: each transform is a small, independent function that composes with others. Instead of one 500-line processEvent() method that does everything, we have a chain of focused transforms:

// Each function is tiny and testable in isolation
struct MaskFn { field: String, regex: Regex, replacement: String }

impl PipelineFn for MaskFn {
    fn name(&self) -> &str { "mask" }
    fn process(&self, event: Event) -> FnResult {
        match event.get_field(&self.field) {
            Some(FieldValue::Str(value)) => {
                let masked = self.regex.replace_all(value, self.replacement.as_str());
                FnResult::Pass(event.set_field(&self.field, FieldValue::Str(masked.into())))
            }
            _ => FnResult::Pass(event),
        }
    }
}

This is the Pipes and Filters pattern at the code level. Each PipelineFn is a filter. The engine composes them into a pipeline. You can test each filter in isolation, reorder them, add new ones without touching existing filters.

Each PipelineFn implementation is a pure function transformer: it takes an Event and returns an FnResult. The engine is function composition at runtime — the pipeline definition is a list of function names that the registry resolves into a chain of Box<dyn PipelineFn>. Adding a new stage means writing one new impl PipelineFn block, not touching the engine.

Antipattern 17: No Rollback to Saga Pattern

Before: Multi-step operations had no compensation logic. If step 3 of 5 failed, steps 1-2 left orphaned state:

await db.savePipeline(pipeline);
await registry.register(pipeline);  // if this fails, DB has orphan
await bus.publish("created");       // if this fails, registry is stale

After: Command handlers treat publish failures as non-fatal (eventual consistency), and the pattern supports full compensation:

pub async fn handle(&self, cmd: CreatePipelineCommand) -> Result<Pipeline, DomainError> {
    self.pipeline_repo.save(&pipeline).await?;

    // Non-critical: event publication. If it fails, the pipeline still exists.
    // A background reconciler can re-publish later.
    if let Err(e) = self.event_publisher.publish(event).await {
        tracing::warn!("Failed to publish PipelineCreated event: {}", e);
    }

    Ok(pipeline)
}

This is a simplified saga pattern: it treats non-critical steps (event publication) as best-effort with background reconciliation, rather than requiring two-phase commit. Full saga compensation (explicit rollback actions for each step) would be appropriate if, say, publishing failure meant the pipeline should be marked inactive. The pattern scales from ‘log and retry’ to full compensating transactions depending on consistency requirements.
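For the fully compensating end of that spectrum, a sketch could pair each step with an undo action and unwind in reverse on failure. Everything here is hypothetical (the Step struct, run_saga, and the string-based state standing in for the database and registry); it shows the control flow only.

```rust
// One saga step: an action plus the compensation that undoes it.
pub struct Step {
    pub name: &'static str,
    pub action: fn(&mut Vec<String>) -> Result<(), String>,
    pub compensate: fn(&mut Vec<String>),
}

// Runs steps in order; on failure, compensates the completed steps in reverse order.
pub fn run_saga(steps: &[Step], state: &mut Vec<String>) -> Result<(), String> {
    let mut completed: Vec<&Step> = Vec::new();
    for step in steps {
        match (step.action)(state) {
            Ok(()) => completed.push(step),
            Err(e) => {
                for done in completed.iter().rev() {
                    (done.compensate)(state);
                }
                return Err(format!("{} failed: {e}", step.name));
            }
        }
    }
    Ok(())
}

// Hypothetical steps mirroring the save/register sequence above.
pub fn save(state: &mut Vec<String>) -> Result<(), String> {
    state.push("db:pipeline".into());
    Ok(())
}

pub fn undo_save(state: &mut Vec<String>) {
    state.retain(|s| s != "db:pipeline");
}

pub fn register_fails(_state: &mut Vec<String>) -> Result<(), String> {
    Err("registry unavailable".into())
}

pub fn undo_register(state: &mut Vec<String>) {
    state.retain(|s| s != "registry:pipeline");
}
```

When the second step fails, the first step's compensation removes the orphaned record, so the state ends up as it started.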


VII. Group 4: Concurrency and Architecture

Antipattern 20: Monolithic Startup to Plugin Architecture

Before: Adding a new source or sink type required modifying core initialization code in multiple files:

// startup.ts — grows with every new component
import { KafkaSource } from './sources/kafka';
import { S3Sink } from './sinks/s3';
import { HttpSource } from './sources/http';
// ... 30 more imports

function init() {
  registerSource('kafka', KafkaSource);
  registerSource('http', HttpSource);
  // ... grows linearly
}

After: Cargo features allow components to be compiled in or out. The function registry pattern means new types are added without modifying existing code:

# Cargo.toml
[features]
default = ["http-source", "file-source", "stdout-sink"]
http-source = []
file-source = []
stdout-sink = []
memory-sink = []

// Rust: new source? Implement the trait and register in the feature-gated module.
// No existing code changes.
#[cfg(feature = "http-source")]
registry.register_source("http", Box::new(HttpSourceFactory));

Antipattern 21: OS Process Forking to Actor Model

Before: The legacy system scaled by forking OS processes, each with its own copy of global state:

import cluster from 'cluster';
if (cluster.isPrimary) {
  for (let i = 0; i < numCPUs; i++) cluster.fork();
} else {
  startWorker(); // entire app copied, 200MB per worker
}

After: Lightweight async actors communicate through bounded channels:

pub struct PipelineActor {
    rx: mpsc::Receiver<PipelineActorMsg>,
    output_tx: mpsc::Sender<Vec<Event>>,
    functions: Vec<Box<dyn PipelineFn>>,
    state: PipelineActorState,
}

impl PipelineActor {
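    pub async fn run(mut self) {
        // Borrow the boxed functions as trait-object references for the engine
        let fn_refs: Vec<&dyn PipelineFn> = self.functions.iter().map(|f| f.as_ref()).collect();
        while let Some(msg) = self.rx.recv().await {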
            match msg {
                PipelineActorMsg::ProcessBatch(events) => {
                    let result = PipelineEngine::process_batch(events, &fn_refs);
                    self.state.processed += result.passed.len() as u64;
                    if !result.passed.is_empty() {
                        let _ = self.output_tx.send(result.passed).await;
                    }
                }
                PipelineActorMsg::Shutdown => break,
            }
        }
    }
}

This is Erlang’s actor model translated to Tokio tasks. The key insight, shared with CSP, is that if there is no shared mutable state, there is nothing to race over. Tokio’s bounded mpsc channel plays the role of the CSP channel: sender and receiver synchronize on the buffer, and backpressure propagates automatically when the buffer is full.
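The single-owner idea does not depend on Tokio. As a dependency-free sketch, the same shape works with an OS thread and std::sync::mpsc::sync_channel; CounterActor, Msg, and spawn_actor are hypothetical names, and a thread per actor is far heavier than a Tokio task, so treat this as an illustration of ownership rather than a drop-in replacement.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread;

pub enum Msg {
    ProcessBatch(Vec<String>),
    Shutdown,
}

// The actor owns its state; nothing else can reach `processed`.
pub struct CounterActor {
    rx: Receiver<Msg>,
    processed: u64,
}

impl CounterActor {
    // Consumes the actor and returns its final state when the loop ends.
    pub fn run(mut self) -> u64 {
        while let Ok(msg) = self.rx.recv() {
            match msg {
                Msg::ProcessBatch(events) => self.processed += events.len() as u64,
                Msg::Shutdown => break,
            }
        }
        self.processed
    }
}

// Spawns the actor on its own thread; the sender is the only way to talk to it.
pub fn spawn_actor(buffer: usize) -> (SyncSender<Msg>, thread::JoinHandle<u64>) {
    let (tx, rx) = sync_channel(buffer);
    let handle = thread::spawn(move || CounterActor { rx, processed: 0 }.run());
    (tx, handle)
}
```

Callers never see the counter directly; they send messages and, at shutdown, receive the final value through the join handle.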

Antipattern 22: Leader Bottleneck to Version Vectors

Rather than a single leader node holding all configuration state, each entity carries its own version number. Concurrent updates to different pipelines do not conflict.

pub struct Pipeline {
    pub version: u64, // incremented on every mutation
    // ...
}

impl Pipeline {
    pub fn add_function(&mut self, func: PipelineFunction) {
        self.functions.push(func);
        self.version += 1;
    }
}

// Optimistic concurrency: "update only if still at version 7"
pub async fn save(&self, pipeline: &Pipeline) -> Result<(), DomainError> {
    let rows = sqlx::query("UPDATE pipelines SET ... WHERE id = ? AND version = ?")
        .bind(pipeline.id.as_str())
        .bind(pipeline.version - 1) // expected previous version
        .execute(&self.pool).await?;
    if rows.rows_affected() == 0 {
        return Err(DomainError::ConcurrencyConflict);
    }
    Ok(())
}

The principled FP alternative to optimistic locking is Software Transactional Memory (STM): compose atomic operations on shared memory without locks, with automatic retry on conflict. Haskell’s atomically $ do { modifyTVar from (subtract n); modifyTVar to (+ n) } makes multi-step updates composable: either all of them happen or none do. Rust doesn’t have STM in the standard library, and for database-backed state, optimistic locking (version vectors + UPDATE WHERE version = N) achieves the same semantics: detect conflicts at commit time, retry at the application layer. STM is preferable when conflicts are rare and the critical section is in-memory; version vectors scale to distributed state across process boundaries.
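The "update only if still at version N" check can be exercised without a database. The sketch below replaces the pipelines table with a HashMap; Store, SaveError, and the method names are hypothetical, and save takes the version the caller read rather than computing version - 1.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SaveError {
    ConcurrencyConflict,
}

// Hypothetical in-memory stand-in for the pipelines table: id -> (version, description).
#[derive(Default)]
pub struct Store {
    rows: HashMap<String, (u64, String)>,
}

impl Store {
    pub fn insert(&mut self, id: &str, description: &str) {
        self.rows.insert(id.to_string(), (1, description.to_string()));
    }

    // Mirrors "UPDATE ... WHERE id = ? AND version = ?": the write succeeds only
    // if the stored version still equals the version the caller read.
    pub fn save(&mut self, id: &str, expected_version: u64, description: &str) -> Result<u64, SaveError> {
        match self.rows.get_mut(id) {
            Some(row) if row.0 == expected_version => {
                row.0 += 1;
                row.1 = description.to_string();
                Ok(row.0)
            }
            _ => Err(SaveError::ConcurrencyConflict),
        }
    }

    pub fn version(&self, id: &str) -> Option<u64> {
        self.rows.get(id).map(|row| row.0)
    }
}
```

Two writers that both read version 1 cannot both win: the second save sees a mismatch and gets ConcurrencyConflict, at which point the application re-reads and retries.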

Antipattern 23: Shared Code Bloat to Feature-Gated Modules

The Cargo features system means you only compile what you need. A deployment that only uses HTTP sources does not include the file-tailing code. Binary size stays small, and the dependency graph is explicit.

// Only compiled when the feature is enabled
#[cfg(feature = "file-source")]
pub mod file_source;

#[cfg(feature = "http-source")]
pub mod http_source;

Antipattern 24: Push Without Backpressure to Bounded Channels

Before: Producers pushed events into unbounded queues. Under load, memory grew until the process OOM’d:

const queue: Event[] = []; // grows forever
source.on('data', event => queue.push(event)); // no limit!

After: Bounded channels create natural backpressure. When the buffer is full, producers wait:

pub struct HttpEventSource {
    sender: mpsc::Sender<Event>,
    receiver: Option<mpsc::Receiver<Event>>,
}

impl HttpEventSource {
    pub fn new(buffer_size: usize) -> Self {
        let (sender, receiver) = mpsc::channel(buffer_size); // bounded!
        Self { sender, receiver: Some(receiver) }
    }
}

Bounded channels are the Rust equivalent of reactive streams backpressure: when the downstream consumer can’t keep up, the sender.send().await call suspends the producer task rather than buffering unboundedly. The pipeline becomes a dataflow graph where each stage’s throughput is constrained by its slowest downstream neighbor.
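The bound itself is easy to demonstrate with the standard library’s sync_channel, used here in place of Tokio’s mpsc so the sketch has no dependencies. try_send reports a full buffer immediately, where a blocking send (or Tokio’s send().await) would wait instead; accepted_before_full is a hypothetical helper.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Fills a bounded channel that nobody is draining and reports how many
// sends were accepted before the buffer pushed back.
pub fn accepted_before_full(buffer: usize, attempts: usize) -> usize {
    // Keep the receiver alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<usize>(buffer);
    let mut accepted = 0;
    for i in 0..attempts {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            // The buffer is full: this is the point where a producer would wait.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With a buffer of 3 and a stalled consumer, exactly three events are accepted; memory use is capped by the buffer size rather than by the producer's speed.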

Antipattern 25: Polling to Lazy Pull Streams

Before: Workers polled for new data on a timer, wasting CPU when idle and introducing latency when busy:

setInterval(async () => {
  const batch = await queue.poll(); // wasteful when idle
  if (batch.length > 0) process(batch);
}, 100); // 100ms latency floor

After: Event sources implement the Stream trait. Consumers pull one item at a time via .next().await, which parks the task until data is available:

use futures::StreamExt;

// Consumer pulls events on demand — no polling, no wasted cycles
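// Build the stream once; calling source.stream() in the loop condition would create a new stream per iteration
let mut stream = source.stream();
while let Some(event) = stream.next().await {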
    let results = PipelineEngine::process_event(event, &fn_refs);
    for result in results {
        sink.write(result).await?;
    }
}

A Stream is corecursive. Where recursion consumes a finite structure by breaking it down (a catamorphism, as in Antipattern 28), corecursion produces a potentially infinite structure by building it up one step at a time (an anamorphism). FileSource::stream() is an anamorphism over the file: the seed is the file handle, each step produces one event and a new handle position, and the stream terminates when the handle is exhausted. The Stream trait is Rust’s asynchronous lazy sequence, the functional equivalent of Haskell’s lazy lists or Scala’s LazyList. Nothing is computed until the consumer calls .next().await. This is demand-driven (pull) evaluation: the producer runs exactly as fast as the consumer needs, with no intermediate buffering and no polling overhead.
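The unfold shape is easiest to see with the synchronous Iterator, which is the blocking counterpart of Stream. In this sketch the seed is the remaining input text rather than a file handle; events and backoff_ms are hypothetical examples.

```rust
// Unfold over text: the seed is the remaining input, and each step yields
// one line plus the next seed.
pub fn events(input: &str) -> impl Iterator<Item = String> + '_ {
    let mut rest = input;
    std::iter::from_fn(move || {
        if rest.is_empty() {
            return None; // seed exhausted: the sequence terminates
        }
        let (line, tail) = match rest.split_once('\n') {
            Some((line, tail)) => (line, tail),
            None => (rest, ""),
        };
        rest = tail;
        Some(line.to_string())
    })
}

// A much longer unfold: doubling delays, produced only as far as the consumer pulls.
// checked_mul ends the sequence instead of overflowing.
pub fn backoff_ms() -> impl Iterator<Item = u64> {
    std::iter::successors(Some(100u64), |ms| ms.checked_mul(2))
}
```

Neither function does any work when called; lines are split and delays computed only as the consumer asks for the next item.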


VIII. Group 5: Advanced Functional Patterns

Antipattern 19: Opaque Service Interfaces to Capability Traits

Before: Services exposed god-interfaces with dozens of methods, most irrelevant to any given caller:

interface PipelineService {
  create(p: Pipeline): void;
  delete(id: string): void;
  process(event: any): any;
  getMetrics(): Metrics;
  reload(): void;
  // ... 20 more methods
}

After: Each capability is a separate trait. Callers depend only on what they need:

// Fine-grained capability traits
pub trait FunctionResolver: Send + Sync {
    fn resolve(&self, config: &FunctionConfig) -> Result<Box<dyn PipelineFn>, DomainError>;
}

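// async fn in a trait used as dyn needs boxing; #[async_trait] (async-trait crate) is one common way
#[async_trait::async_trait]
pub trait PipelineRepository: Send + Sync {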
    async fn get(&self, id: &PipelineId) -> Result<Pipeline, DomainError>;
    async fn save(&self, pipeline: &Pipeline) -> Result<(), DomainError>;
}

// Callers declare exactly what they need — nothing more
struct IngestHandler {
    resolver: Arc<dyn FunctionResolver>,
    repo: Arc<dyn PipelineRepository>,
}

Fine-grained capability traits are Tagless Final in practice. Instead of a concrete PipelineService god-object, you declare your algebra as a set of type class constraints: fn ingest<R, P>(resolver: &R, repo: &P, event: Event) where R: FunctionResolver and P: PipelineRepository. The function is polymorphic over its effects: you substitute production implementations at the composition root and test stubs in unit tests. In this generic form the calls are monomorphized, so there is no runtime overhead compared to the dynamic dispatch of the Arc<dyn ...> fields shown above.
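A generic ingest along those lines might look like the sketch below. The trait methods are simplified and synchronous (functions are resolved by name to plain function pointers, and the repository returns function names), so the signatures are assumptions rather than the post's real traits.

```rust
// Hypothetical, simplified capability traits.
pub trait FunctionResolver {
    fn resolve(&self, name: &str) -> Option<fn(String) -> String>;
}

pub trait PipelineRepository {
    fn function_names(&self, pipeline_id: &str) -> Vec<String>;
}

// Generic over both capabilities: monomorphized, so calls are statically dispatched.
pub fn ingest<R, P>(resolver: &R, repo: &P, pipeline_id: &str, event: String) -> String
where
    R: FunctionResolver,
    P: PipelineRepository,
{
    repo.function_names(pipeline_id)
        .iter()
        .filter_map(|name| resolver.resolve(name)) // unknown names are skipped in this sketch
        .fold(event, |evt, f| f(evt))
}

fn trim_fn(s: String) -> String {
    s.trim().to_string()
}

fn upper_fn(s: String) -> String {
    s.to_uppercase()
}

// Test stubs: no database, no registry.
pub struct StubRepo;
impl PipelineRepository for StubRepo {
    fn function_names(&self, _pipeline_id: &str) -> Vec<String> {
        vec!["trim".to_string(), "upper".to_string(), "missing".to_string()]
    }
}

pub struct StubResolver;
impl FunctionResolver for StubResolver {
    fn resolve(&self, name: &str) -> Option<fn(String) -> String> {
        match name {
            "trim" => Some(trim_fn as fn(String) -> String),
            "upper" => Some(upper_fn as fn(String) -> String),
            _ => None,
        }
    }
}
```

The stubs are ordinary structs, so a unit test passes them exactly as production code would pass the real implementations.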

Antipattern 26: Deep Inheritance to Trait Composition

Before: A 6-level inheritance hierarchy where each level overrode different methods:

class BaseProcessor { ... }
class FilteringProcessor extends BaseProcessor { ... }
class EnrichingProcessor extends FilteringProcessor { ... }
class BatchingEnrichingProcessor extends EnrichingProcessor { ... }
// "Which version of transform() am I actually running?" — nobody knows

After: Behavior is defined through trait composition. No inheritance. Each implementation is independent and flat:

pub trait PipelineFn: Send + Sync {
    fn name(&self) -> &str;
    fn process(&self, event: Event) -> FnResult;
}

// Each implementation is flat — no hierarchy, no overriding
impl PipelineFn for EvalFn { ... }
impl PipelineFn for DropFn { ... }
impl PipelineFn for MaskFn { ... }
impl PipelineFn for RegexExtractFn { ... }

You never ask “which version of process() am I actually running?” There is exactly one implementation per type. No surprises.

Antipattern 27: Unbounded Recursion to Iterative Fold

Before: Batch processing used recursion whose depth grew with the number of functions, so it could blow the stack on long pipelines:

function processAll(events: any[], fns: Function[], idx: number): any[] {
  if (idx >= fns.length) return events;
  return processAll(events.map(fns[idx]), fns, idx + 1); // stack overflow risk
}

After: The pipeline engine uses iterative fold. Stack overflow is impossible regardless of pipeline length:

// Iterative: each function is applied in a loop, not via recursion
for func in functions {
    let mut next_batch = Vec::new();
    for evt in current_events {
        match func.process(evt) {
            FnResult::Pass(e) => next_batch.push(e),
            FnResult::Split(es) => next_batch.extend(es),
            FnResult::Drop => {}
        }
    }
    current_events = next_batch;
}

Antipattern 28: Ad-Hoc Recursion to Catamorphism

A catamorphism is a recursive fold over a tree structure: you define how to handle each node type, and the recursion follows the shape of the data. The routing engine evaluates filter expressions using this pattern:

// Assumed to live in an impl block (shown here on RoutingEngine) so that Self:: resolves
impl RoutingEngine {
    pub fn evaluate_filter(filter: &FilterExpr, event: &Event) -> Result<bool, DomainError> {
        match filter {
            FilterExpr::Eq(field, expected) => {
                Ok(event.get_field(field) == Some(expected))
            }
            FilterExpr::And(left, right) => {
                Ok(Self::evaluate_filter(left, event)? && Self::evaluate_filter(right, event)?)
            }
            FilterExpr::Or(left, right) => {
                Ok(Self::evaluate_filter(left, event)? || Self::evaluate_filter(right, event)?)
            }
            FilterExpr::Not(inner) => Self::evaluate_filter(inner, event).map(|b| !b),
            FilterExpr::True => Ok(true),
        }
    }
}

The catamorphism’s real value is that it separates what to compute at each node from how to recurse. The recursive calls simply follow the shape of the enum, so the match is the traversal and there is no separate visitor or cursor to maintain. Add a new FilterExpr variant and every unhandled match becomes a compile error.
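The separation can be pushed one step further by writing the traversal once and passing in only the per-node logic. The sketch below does that for a reduced FilterExpr with string fields; Layer, cata, size, and matches are hypothetical names, not part of the routing engine shown above.

```rust
// Hypothetical reduced FilterExpr: fields compare against string values.
pub enum FilterExpr {
    Eq(String, String),
    And(Box<FilterExpr>, Box<FilterExpr>),
    Or(Box<FilterExpr>, Box<FilterExpr>),
    Not(Box<FilterExpr>),
    True,
}

// One layer of the tree whose children are already reduced to `A`.
pub enum Layer<'a, A> {
    Eq(&'a str, &'a str),
    And(A, A),
    Or(A, A),
    Not(A),
    True,
}

// The recursion lives here once; `alg` only says what to do at each node.
pub fn cata<A>(expr: &FilterExpr, alg: &dyn Fn(Layer<'_, A>) -> A) -> A {
    match expr {
        FilterExpr::Eq(f, v) => alg(Layer::Eq(f, v)),
        FilterExpr::And(l, r) => alg(Layer::And(cata(l, alg), cata(r, alg))),
        FilterExpr::Or(l, r) => alg(Layer::Or(cata(l, alg), cata(r, alg))),
        FilterExpr::Not(inner) => alg(Layer::Not(cata(inner, alg))),
        FilterExpr::True => alg(Layer::True),
    }
}

// Algebra 1: count the nodes.
pub fn size(expr: &FilterExpr) -> usize {
    cata::<usize>(expr, &|layer| match layer {
        Layer::Eq(..) | Layer::True => 1,
        Layer::And(a, b) | Layer::Or(a, b) => 1 + a + b,
        Layer::Not(a) => 1 + a,
    })
}

// Algebra 2: evaluate against a single (field, value) pair.
// Both children are evaluated before combining, so there is no short-circuiting.
pub fn matches(expr: &FilterExpr, field: &str, value: &str) -> bool {
    cata::<bool>(expr, &|layer| match layer {
        Layer::Eq(f, v) => f == field && v == value,
        Layer::And(a, b) => a && b,
        Layer::Or(a, b) => a || b,
        Layer::Not(a) => !a,
        Layer::True => true,
    })
}
```

Adding a third interpretation, such as pretty-printing, means writing one more closure; cata itself stays untouched.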

Antipattern 29: Hardcoded Parsers to Parser Combinators

Before: Filter expressions were parsed with regex and string splitting, growing more fragile with each new operator:

function parseFilter(expr: string): Filter {
  if (expr.includes(' AND ')) {
    const parts = expr.split(' AND ');
    return { type: 'and', left: parseFilter(parts[0]), right: parseFilter(parts[1]) };
  }
  // fails silently on malformed input
}

After: Parser combinators (using nom) build complex parsers from small, tested pieces:

fn parse_comparison(input: &str) -> IResult<&str, FilterExpr> {
    let (input, field) = parse_identifier(input)?;
    let (input, _) = multispace0(input)?;
    let (input, op) = alt((tag("=="), tag("!="), tag(">"), tag("<"), tag("contains")))(input)?;
    let (input, _) = multispace0(input)?;
    let (input, value) = parse_value(input)?;

    let expr = match op {
        "==" => FilterExpr::Eq(field, value),
        "!=" => FilterExpr::Neq(field, value),
        ">" => FilterExpr::Gt(field, value),
        "<" => FilterExpr::Lt(field, value),
        "contains" => FilterExpr::Contains(field, value),
        _ => unreachable!(),
    };
    Ok((input, expr))
}

fn parse_and(input: &str) -> IResult<&str, FilterExpr> {
    let (input, left) = parse_atom(input)?;
    let (input, _) = delimited(multispace0, tag_no_case("AND"), multispace0)(input)?;
    let (input, right) = parse_expr(input)?;
    Ok((input, FilterExpr::And(Box::new(left), Box::new(right))))
}

Parser combinators compose in two ways. Sequencing (parse the field, then the operator, then the value, where every step must succeed) is the Applicative pattern: unlike monadic composition, where the choice of the next parser depends on the previous result, the parsers in the sequence are fixed up front and only their outputs are combined. Choice is the Alternative pattern: alt((tag("=="), tag("!="))) corresponds to f <|> g, trying each statically defined parser in turn, with no dependency between them.
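To show what the combinators do underneath, here is a tiny hand-rolled version using only the standard library. It is not nom’s API: tag, alt, and pair are local functions that merely borrow nom’s names, PResult is a hypothetical alias, and whitespace handling is left out.

```rust
// A parser takes input and returns the remaining input plus a value, or None.
pub type PResult<'a, T> = Option<(&'a str, T)>;

// Matches a literal prefix.
pub fn tag<'a>(lit: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'static str> {
    move |input: &'a str| input.strip_prefix(lit).map(|rest| (rest, lit))
}

// Choice: try `p`, fall back to `q` on the same input.
pub fn alt<'a, T, P, Q>(p: P, q: Q) -> impl Fn(&'a str) -> PResult<'a, T>
where
    P: Fn(&'a str) -> PResult<'a, T>,
    Q: Fn(&'a str) -> PResult<'a, T>,
{
    move |input: &'a str| p(input).or_else(|| q(input))
}

// Sequence: run `p`, then `q` on what is left, and pair the results.
pub fn pair<'a, A, B, P, Q>(p: P, q: Q) -> impl Fn(&'a str) -> PResult<'a, (A, B)>
where
    P: Fn(&'a str) -> PResult<'a, A>,
    Q: Fn(&'a str) -> PResult<'a, B>,
{
    move |input: &'a str| {
        let (rest, a) = p(input)?;
        let (rest, b) = q(rest)?;
        Some((rest, (a, b)))
    }
}

// One or more ASCII alphanumerics or underscores.
pub fn identifier(input: &str) -> PResult<'_, &str> {
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    if end == 0 { None } else { Some((&input[end..], &input[..end])) }
}

// field followed by "==" or "!="; the value is left unparsed in the remainder.
pub fn comparison(input: &str) -> PResult<'_, (&str, &'static str)> {
    pair(identifier, alt(tag("=="), tag("!=")))(input)
}
```

The shape of comparison is fixed before any input is seen: pair and alt only decide how results and failures are combined, which is the static structure the paragraph above describes.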

Antipattern 30: Stringly-Typed Field Access to Typed Lenses

Before: Accessing nested event data was a chain of string lookups with no type safety:

const value = event.fields["user"]["email"]; // undefined? string? number? who knows
if (value) { /* hope it's a string */ }

After: Typed accessor methods (lens-style) provide safe, focused access to nested data:

// get_field returns Option<&FieldValue> — forces the caller to handle absence
let email = event.get_field("user.email");

// set_field returns a new event — the lens "focuses" on one field
// and produces a new whole from the modified part
let masked = event.set_field("user.email", FieldValue::Str("[REDACTED]".into()));

// Type-safe: you know exactly what you're getting
match event.get_field("severity") {
    Some(FieldValue::Int(level)) => route_by_severity(*level),
    Some(FieldValue::Str(s)) => route_by_severity(s.parse()?),
    None => route_to_default(),
    _ => Err(DomainError::Validation("unexpected severity type".into())),
}

Antipattern 31: Implicit Mutable State to Reducer Pattern

The actor’s message loop is a reducer: it receives a message and transitions to a new state. The state is always consistent because there is only one owner (the actor itself):

// State transitions are explicit and atomic
PipelineActorMsg::ProcessBatch(events) => {
    let result = PipelineEngine::process_batch(events, &fn_refs);
    self.state.processed += result.passed.len() as u64;
    self.state.dropped += result.dropped;
}

No concurrent access. No locks. No race conditions. The actor pattern plus Rust’s ownership model guarantees single-writer semantics.

Antipattern 32: Monkey-Patching to Extension via Traits

Before: Extending behavior meant modifying existing classes or patching prototypes at runtime:

// Monkey-patching: modifying someone else's class at runtime
Pipeline.prototype.customProcess = function() { /* surprise! */ };

After: You implement a trait for your type. The registry accepts any Box<dyn PipelineFn> — your custom function is a first-class citizen without modifying any framework code:

// Your custom function — no framework modification needed
struct MyCustomFn { config: MyConfig }

impl PipelineFn for MyCustomFn {
    fn name(&self) -> &str { "my_custom" }
    fn process(&self, event: Event) -> FnResult { /* your logic */ }
}

// Register it alongside built-in functions
registry.register("my_custom", Box::new(MyCustomFnFactory));

Antipattern 33: Implicit Ordering to Typestate Lifecycle

The actor has a clear lifecycle: Created, Running, Stopped. The run() method consumes self, making it impossible to use the actor after it has been started; the caller keeps only the task handle:

impl PipelineActor {
    pub async fn run(mut self) { // takes ownership — actor is "consumed"
        while let Some(msg) = self.rx.recv().await { ... }
        // When this returns, the actor is done. No zombie state.
    }
}

// After spawning, you only have the handle — not the actor itself
let handle = tokio::spawn(actor.run()); // actor moved into the task
// actor.do_something(); // COMPILE ERROR: actor has been moved

Antipattern 34: Window via Mutation to Comonad-Style

A comonad is a structure that provides context around a focused element. Think of it as the dual of a monad: where a monad lets you put a value into a context and chain further computations, a comonad lets you take a value out of a context and compute from its neighborhood.

Before: Sliding windows were implemented as mutable arrays with index arithmetic:

class SlidingWindow {
  private buffer: any[] = [];
  private index = 0;
  push(item: any) { this.buffer[this.index++ % this.size] = item; }
  getContext() { /* complex index math, off-by-one bugs */ }
}

After: A comonad-style window provides extract() (get the focused value) and extend() (apply a context-aware function at every position):

pub struct SlidingWindow<T> {
    items: VecDeque<T>,
    focus_idx: usize,
    window_size: usize,
}

impl<T: Clone> SlidingWindow<T> {
    /// Get the focused element (comonad extract)
    pub fn extract(&self) -> Option<&T> {
        self.items.get(self.focus_idx)
    }

    /// Apply a function at every position, producing a new window (comonad extend)
    pub fn extend<B, F>(&self, f: F) -> SlidingWindow<B>
    where
        F: Fn(&SlidingWindow<T>) -> B,
        B: Clone,
    {
        let mut results = VecDeque::with_capacity(self.items.len());
        for i in 0..self.items.len() {
            let shifted = SlidingWindow {
                items: self.items.clone(),
                focus_idx: i,
                window_size: self.window_size,
            };
            results.push_back(f(&shifted));
        }
        SlidingWindow { items: results, focus_idx: self.focus_idx, window_size: self.window_size }
    }
}

A monad lets you chain ‘what to do next’ (flatMap); a comonad lets you ask ‘what does the context around this value say’ (extend). The classic examples are spreadsheets (each cell is a value with a grid of neighbors) and Conway’s Game of Life (extend step grid applies the evolution rule at every cell simultaneously). In the pipeline, extend lets you compute a moving average or rate-of-change at every position in one pass, without index arithmetic.
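As a concrete use of extend, the sketch below computes a trailing moving average. To avoid cloning the buffer at every position it uses a borrowed slice with a focus index instead of the VecDeque-based SlidingWindow above; Focused and trailing_mean are hypothetical names.

```rust
// A focused view into a slice: the comonad's "value plus neighborhood".
#[derive(Clone, Copy)]
pub struct Focused<'a> {
    items: &'a [f64],
    focus: usize,
}

impl<'a> Focused<'a> {
    // Returns None for an empty slice, since there would be nothing to focus on.
    pub fn new(items: &'a [f64]) -> Option<Self> {
        if items.is_empty() { None } else { Some(Focused { items, focus: 0 }) }
    }

    // extract: the value under focus.
    pub fn extract(&self) -> f64 {
        self.items[self.focus]
    }

    // extend: apply a context-aware function at every position.
    pub fn extend(&self, f: impl Fn(Focused<'a>) -> f64) -> Vec<f64> {
        (0..self.items.len())
            .map(|i| f(Focused { items: self.items, focus: i }))
            .collect()
    }

    // Mean of the focused value and up to `back` values before it.
    pub fn trailing_mean(&self, back: usize) -> f64 {
        let start = self.focus.saturating_sub(back);
        let window = &self.items[start..=self.focus];
        window.iter().sum::<f64>() / window.len() as f64
    }
}
```

The function passed to extend sees the whole neighborhood but never touches an index, and extending with extract returns the original data unchanged, which is one of the comonad laws.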

Antipattern 35: Static Worker Assignment to Work-Stealing

Before: Work was distributed round-robin to a fixed number of workers, causing hot spots:

const workers = Array.from({ length: 4 }, () => new Worker());
let nextWorker = 0;
function dispatch(batch) {
  workers[nextWorker++ % workers.length].send(batch); // unbalanced
}

After: For CPU-bound batch processing, rayon's parallel iterators provide work-stealing scheduling:

use rayon::prelude::*;

// rayon automatically distributes work across cores
let results: Vec<BatchResult> = batches
    .par_iter()
    .map(|batch| PipelineEngine::process_batch(batch.clone(), &fn_refs))
    .collect();

Use rayon for CPU-bound batch processing where tasks are independent and similar in size. Use the actor-per-pipeline model (Antipattern 21) for I/O-bound work and heterogeneous task sizes: actors handle backpressure and message ordering; rayon just parallelizes.
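
The hot-spot problem with round-robin dispatch can be seen in a small simulation. The sketch below is an assumption-laden model, not rayon's algorithm: it compares static round-robin assignment with a greedy policy where each batch goes to whichever worker becomes idle first, which approximates the effect of work-stealing. The batch durations are made up.

```python
import heapq
from typing import List


def round_robin_makespan(durations: List[int], workers: int) -> int:
    # Static assignment: task i always goes to worker i % workers.
    load = [0] * workers
    for i, d in enumerate(durations):
        load[i % workers] += d
    return max(load)


def dynamic_makespan(durations: List[int], workers: int) -> int:
    # Dynamic assignment: each task goes to whichever worker frees up first.
    free_at = [0] * workers
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


# Hypothetical batch costs: two expensive batches among cheap ones.
batches = [8, 1, 1, 1, 8, 1, 1, 1]
static_finish = round_robin_makespan(batches, workers=4)
dynamic_finish = dynamic_makespan(batches, workers=4)
```

With these numbers the round-robin schedule sends both expensive batches to the same worker, so it finishes at time 16, while the dynamic schedule finishes at time 9.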


IX. The Human Cost

The patterns described here are not primarily about performance; they are about cognitive load. When errors are values, when states are explicit in types, when illegal states are unrepresentable, and when each function does one thing, a new engineer can understand any individual piece in isolation. That is the real dividend of functional discipline: onboarding time and debugging time drop together.

Each pattern from above addresses a real cost that the team paid every day. For example, new engineers on the legacy system could not ship features for months. Not because observability pipelines are conceptually hard. It was because the system had enormous artificial complexity. There was no way to understand one piece in isolation because everything depended on everything else.

When errors are swallowed, states are implicit, and types are erased, debugging a production incident means reading every log line and reconstructing what happened. In the new system, errors propagate with context. The route table is immutable, so corruption is structurally impossible. All of these costs reinforce each other. Slow onboarding means fewer experienced engineers. Fewer experienced engineers means less refactoring capacity. Less refactoring means more debt.


X. Conclusion

This is not a story about Rust vs. TypeScript. TypeScript with strict: true, branded types, and careful architecture can achieve many of the same guarantees. A working POC that implements all the patterns described is at github.com/bhatti/pipeflow. The lesson is about principles:

  1. Keep what works. Pipes and Filters, Decorator/Enrich, Source/Sink worked. The problem was their implementation, not their design.
  2. Make illegal states unrepresentable. Use sum types (enums where each variant carries different data) and typestate (using the type system to enforce valid state transitions) to shift runtime errors to compile-time.
  3. Separate effects from logic. Pure domain functions are trivially testable and infinitely composable.
  4. Enforce boundaries with the build system. Architecture diagrams lie. Compiler errors do not.
  5. Prefer immutable data. Clone when you need to diverge. The clarity is worth the allocation.
  6. Make errors explicit. Result<T, E> in the type signature. No swallowing. No surprises.
  7. Compose small functions. A pipeline of 5 focused transforms beats one 500-line method.
  8. Name the patterns. Immutable values, sum types, typestate, catamorphism, comonad are not buzzwords. They are compressed names for solutions that took decades to discover. Knowing the name means knowing the laws, the composability guarantees, and the tradeoffs.

The mud did not accumulate overnight, and it will not disappear overnight. But every boundary you draw, every type you make explicit, every error you refuse to swallow makes the next change slightly easier. That is how you reverse the flywheel.

Source code: The full POC implementing all patterns described here is available as an open-source Rust project at github.com/bhatti/pipeflow.


XI. Pattern Index

# | Antipattern -> Solution | Core FP Concept(s) | Section
--- | --- | --- | ---
1 | Singletons -> Dependency Injection | Reader Monad, Functional Core/Imperative Shell | IV
2 | Mutable State -> Immutable Values | Referential Transparency, Value Semantics | IV
3 | Mode Branching -> Sum Types | ADT (Sum Types), Exhaustive Pattern Matching | V
4 | String Dispatch -> Registry | Tagless Final (lite), Open/Closed, First-Class Functions | V
5 | God Class -> Bounded Contexts | Module Systems, FCIS, Separation of Concerns | IV
6 | forEach + Push -> Iterator Combinators | Functor (map), Fold / Catamorphism, Lazy Pipelines | VI
7 | Error Swallowing -> Result Types | Monad (bind / ?), Either / Option, Monadic Chaining | IV
8 | Temporal Coupling -> Typestate Builder | Phantom Types, Affine / Linear Types, Typestate | V
9 | Global Registry -> Persistent Data Structures | Persistent DS, Structural Sharing, Immutable Updates | V
10 | Callback Chains -> Async Composition | Monad (sequential composition), CPS (async/await desugaring) | VI
11 | Primitive Obsession -> Newtypes | Newtype Pattern, Phantom Types, Zero-Cost Abstraction | IV
12 | Signal Dispatch -> Handler Map | First-Class Functions, Open Dispatch, Strategy Pattern | V
13 | Anemic Model -> Rich Domain Objects | ADTs, Encapsulation of Invariants, Expression-Oriented | V
14 | Eager Init -> Lazy Evaluation | Thunks, Memoization (evaluate-once semantics) | VI
15 | Mixed I/O + Logic -> Effect Separation | IO Monad, Algebraic Effects, Functional Core / Imperative Shell | VI
16 | Monolithic Functions -> Function Composition | Function Composition, Point-Free Style, Pipes and Filters | VI
17 | No Rollback -> Saga Pattern | Eventual Consistency, Compensating Transactions | VI
18 | any Types -> Generics + Trait Bounds | Type Classes, Parametric Polymorphism, Ad-Hoc Polymorphism | IV
19 | God Interface -> Capability Traits | Interface Segregation, Type Classes, Dependency Inversion | VIII
20 | Monolithic Startup -> Plugin Architecture | Open/Closed Principle, Feature-Gated Modules | VII
21 | OS Process Forking -> Actor Model | Actor Model, CSP (message-passing), Isolated Mutable State | VII
22 | Leader Bottleneck -> Version Vectors | Optimistic Concurrency, STM (contrast), Immutable Versioning | VII
23 | Shared Code Bloat -> Feature-Gated Modules | Conditional Compilation, Module System Boundaries | VII
24 | Unbounded Push -> Bounded Channels | CSP Channels, Reactive Streams, Backpressure | VII
25 | Polling -> Lazy Pull Streams | Lazy Evaluation, Corecursion, Demand-Driven Streams | VII
26 | Deep Inheritance -> Trait Composition | Composition over Inheritance, Type Classes, Flat Dispatch | VIII
27 | Unbounded Recursion -> Iterative Fold | Trampolining, Tail Recursion, Accumulator-Passing Style | VIII
28 | Ad-Hoc Recursion -> Catamorphism | Recursion Schemes (Catamorphism), Structural Recursion | VIII
29 | Hardcoded Parsers -> Parser Combinators | Parser Combinators, Applicative Functor, Monad | VIII
30 | Stringly-Typed Access -> Typed Lenses | Lenses / Optics, Profunctors, Focused Immutable Update | VIII
31 | Implicit Mutation -> Reducer Pattern | Fold, State Monad, Single-Writer Semantics | VIII
32 | Monkey-Patching -> Extension via Traits | Type Classes, Retroactive Extension, Coherence | VIII
33 | Implicit Ordering -> Typestate Lifecycle | Linear / Affine Types, Typestate, Ownership as Protocol | VIII
34 | Mutable Window -> Comonad-Style | Comonad (extract / extend), Context-Aware Computation | VIII
35 | Round-Robin Workers -> Work-Stealing | Parallel Collections, Work-Stealing, parMap | VIII

April 28, 2026

Building Mini OpenClaw: Secure AI Agents with Actors, WASM, and Supervision

Filed under: Agentic AI,Computing — admin @ 7:17 pm

Introduction

Most agent frameworks start simple: one process, one conversation loop, one tool registry, one memory store, and one pile of credentials. That simplicity is useful for demos, but dangerous for enterprise systems. If a prompt injection reaches a tool with broad permissions, the whole runtime becomes part of the blast radius (see https://arxiv.org/abs/2403.02691). If one tool call hangs or crashes, it can stall the agent loop. If memory and sessions are shared by convention instead of isolated by construction, tenant boundaries depend on every developer remembering every guardrail every time. Enterprise teams need a different foundation. They need agents that isolate state, limit blast radius, enforce tenant boundaries, and recover from failures without operator intervention. They need the same properties that telecom systems have delivered for four decades: per-process isolation, supervision trees, guardian processes, and location-transparent messaging.

This post shows how I built Mini OpenClaw (MiniClaw), a proof-of-concept implementation that runs entirely on PlexSpaces, an actor-based distributed runtime inspired by Erlang/OTP. OpenClaw-style systems are useful because they give developers a programmable agent runtime: tools, memory, planning, execution, and orchestration. MiniClaw keeps that spirit, but changes the failure and security model. Instead of one runtime owning everything, each responsibility becomes an actor with its own state, permissions, lifecycle, and supervision boundary. MiniClaw deploys ten actors inside a WebAssembly + Firecracker sandbox to deliver a secure, fault-tolerant agent system. Every actor owns its state exclusively. Every message travels through explicit channels, and every failure triggers a supervised restart instead of a full-system crash.

OpenClaw’s 2026.4.29 release triggered plugin dependency repair loops at startup and on cold paths because its monolithic core owns too many responsibilities. MiniClaw starts from the opposite position: every responsibility is an actor from the beginning, with its own state and its own explicit message contract.


Part 1: Agents and Actors Isomorphism

1.1 The Same Computational Model

An LLM agent has four things: state (conversation history, tool results), a processing loop (receive message, reason, act), communication (call tools, delegate to other agents), and failure modes (timeouts, hallucinations, rate limits). An actor has exactly the same structure. This is not a coincidence. Both actors and agents derive from the same computational model: isolated units of stateful computation that communicate by passing messages.

# From examples/python/apps/miniclaw/agent.py
# An agent IS an actor: same structure, same guarantees
# For readability, this POC keeps message history directly on the `AgentActor`. 
# In a production deployment, I would usually run one actor instance per session or 
# store history by `session_id` to avoid cross-session context mixing.
@actor
class AgentActor:
    """Core agent: receive user message, call LLM, execute tools, loop until end_turn."""

    system_prompt: str = state(default="You are a helpful AI assistant with access to tools.")
    messages: list  = state(default_factory=list)   # Conversation state
    max_history: int = state(default=50)            # Context window bound
    total_chats: int = state(default=0)             # Usage counter
    agent_name: str  = state(default="general-assistant")

    @init_handler
    def on_init(self, config: dict) -> None:
        args = config.get("args", {})
        self.agent_name = args.get("agent_name", self.agent_name)
        self.system_prompt = args.get("system_prompt", self.system_prompt)
        host.process_groups.join("svc:agent")        # Announces itself for discovery
        write_actor_info(self.actor_id, self.agent_name,
                         "Core agent loop with tool calling and session memory",
                         ["chat", "tool_use", "memory"])

    @handler("chat")
    def chat(self, message: str = "", session_id: str = "") -> dict:
        # Agent processing loop: receive message -> reason -> act
        ...

The mapping is direct. Every agent concept has an actor primitive:

Agent Concept | Actor Primitive | MiniClaw Implementation
--- | --- | ---
Conversation history | Actor-private state | messages: list (serialized, isolated)
Tool calling | Inter-actor messaging | ask(tool_reg_id, "execute_tool", ...)
Agent delegation | Location-transparent Ask | ask(agent_id, "chat", ...) via process groups
Crash recovery | Supervisor restart + durability facet | State checkpointed to SQLite, restored on restart
Rate limiting | Per-actor circuit breaker state | circuit_open, consecutive_failures in actor state
Memory | Scoped KV + TupleSpace | Global/agent/session scopes via MemoryActor
Audit trail | Fire-and-forget GenEvent | host.send(audit_id, "log_event", ...) (non-blocking)

1.2 Four Behaviors Map to Four Agent Archetypes

PlexSpaces provides four actor behaviors. Each maps to a distinct agent archetype:

Behavior | Agent Archetype | MiniClaw Actor | Decorator
--- | --- | --- | ---
GenServer | Tool executor, stateful helper | AgentActor, LLMRouterActor, ToolRegistryActor, MemoryActor, SessionManagerActor, TaskQueueActor, HealthMonitorActor | @actor
GenEvent | Audit logger, event publisher | AuditEventActor | @event_actor
GenStateMachine | State-machine agent, quality gate | AgentStateFSM | @fsm_actor(states=[...], initial="idle")
Workflow | Orchestrator, pipeline coordinator | OrchestratorActor | @workflow_actor

Part 2: PlexSpaces Primitives

Before walking through each actor, it helps to see the five low-level primitives that every actor uses. These are the only operations available inside the WASM sandbox without filesystem or global state.

2.1 Process Groups and Object Registry for Location-Transparent Discovery

Every actor is registered in an actor-registry and can optionally join a named process group in its @init_handler. Callers look up the first member with pg_first(), a small helper that hides whether the target is local or on a remote node:

# From examples/python/apps/miniclaw/helpers.py
def pg_first(group: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (actor_id, None) for the first member of a process group, or (None, error)."""
    try:
        members = host.process_groups.members(group)
        if members:
            return members[0], None
        return None, f"no members in {group}"
    except Exception as e:
        return None, str(e)

Every actor announces itself on startup:

@init_handler
def on_init(self, config: dict) -> None:
    host.process_groups.join("svc:agent")
    write_actor_info(self.actor_id, self.agent_name,
                     "Core agent loop with tool calling and session memory",
                     self.capabilities)

The orchestrator discovers agents via pg_first("svc:agent"); it does not know the agent’s address, node, or port. The framework routes the message transparently.

2.2 Fire-and-Forget Audit with host.send, Never host.ask

The audit trail uses host.send() (fire-and-forget) rather than host.ask() (request-reply). This is a deliberate design choice: audit events must never add latency to the agent’s critical path.

# From examples/python/apps/miniclaw/helpers.py
def fire_audit(event_type: str, detail: str) -> None:
    """Fire-and-forget audit event. Failures are logged, never raised."""
    audit_id, err = pg_first("svc:audit")
    if err or not audit_id:
        host.debug(f"fire_audit: {err}")
        return
    try:
        host.send(audit_id, "log_event", {
            "op": "log_event",
            "event_type": event_type,
            "detail": detail,
            "timestamp": host.now_ms(),
        })
    except Exception as e:
        host.warn(f"fire_audit: send failed: {e}")

Every actor calls fire_audit() after each meaningful operation. The audit actor receives the event asynchronously. If the audit actor is slow or temporarily down, callers are unaffected because they never wait for a response.

2.3 TupleSpace: Queryable Shared Coordination State

TupleSpace (host.ts) is the coordination layer. Unlike KV (point lookup by key), TupleSpace supports pattern queries: read all tuples matching a template, with None as a wildcard in any position:

# Write a memory tuple
host.ts.write(["memory", "global", "user_name", "Alice"])

# Read all global memories — None matches any value in that position
tuples = host.ts.read_all(["memory", "global", None, None])

# Read all audit events of a specific type
events = host.ts.read_all(["audit", "tool_executed", None, None])

# Orchestrator checkpoints sub-task results for crash recovery
host.ts.write(["orch_result", task_id, i, str(result)])

The write_actor_info helper uses TupleSpace to publish actor capabilities for external discovery without blocking callers:

# From examples/python/apps/miniclaw/helpers.py
def write_actor_info(actor_id: str, name: str, description: str, capabilities: list) -> None:
    """Write actor capability tuples to TupleSpace for discovery."""
    try:
        host.ts.write(["agent_card", actor_id, name, description])
        for cap in capabilities:
            host.ts.write(["agent_cap", cap, actor_id])
    except Exception as e:
        host.warn(f"write_actor_info: {e}")
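
The template matching used by read_all in this section can be sketched with a toy in-memory store. This is a hypothetical stand-in for host.ts, not the PlexSpaces implementation; in particular, requiring the tuple and the template to have the same length is an assumption about the matching rule.

```python
from typing import Any, List, Optional, Sequence


class InMemoryTupleSpace:
    """Toy stand-in for host.ts; hypothetical, not the PlexSpaces implementation."""

    def __init__(self) -> None:
        self._tuples: List[tuple] = []

    def write(self, fields: Sequence[Any]) -> None:
        self._tuples.append(tuple(fields))

    def read_all(self, template: Sequence[Optional[Any]]) -> List[tuple]:
        # A tuple matches when it has the same arity and every non-None
        # template field equals the field at the same position.
        return [
            t for t in self._tuples
            if len(t) == len(template)
            and all(want is None or want == got for want, got in zip(template, t))
        ]


ts = InMemoryTupleSpace()
ts.write(["memory", "global", "user_name", "Alice"])
ts.write(["memory", "session", "topic", "billing"])
ts.write(["agent_cap", "chat", "actor-1"])

global_memories = ts.read_all(["memory", "global", None, None])
chat_capable = ts.read_all(["agent_cap", "chat", None])
```

The second query is the discovery pattern from write_actor_info: fixing the capability and leaving the actor ID as a wildcard returns every actor that advertised "chat".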

2.4 send_after for Scheduling Timers

The health monitor uses host.send_after() to schedule a self-message after every poll interval. There is no cron job and no external scheduler; the actor manages its own polling timeline:

@init_handler
def on_init(self, config: dict) -> None:
    # Schedule first poll; each tick reschedules the next
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

@handler("poll_tick", "cast")
def poll_tick(self) -> None:
    # ... do poll work ...
    # Re-arm: each tick schedules the next — no external scheduler needed
    host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})
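
The self-rescheduling tick can be exercised deterministically with a virtual clock. The sketch below is a hypothetical stand-in for host.send_after (it takes a callback instead of a message type), intended only to show how one armed timer turns into a periodic series.

```python
import heapq
from typing import Callable, List, Tuple


class VirtualScheduler:
    """Toy timer queue with a virtual clock; hypothetical stand-in for host.send_after."""

    def __init__(self) -> None:
        self.now_ms = 0
        self._seq = 0  # tie-breaker so equal deadlines fire in insertion order
        self._timers: List[Tuple[int, int, Callable[[], None]]] = []

    def send_after(self, delay_ms: int, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (self.now_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def run_until(self, deadline_ms: int) -> None:
        # Fire every timer due at or before the deadline, in time order.
        while self._timers and self._timers[0][0] <= deadline_ms:
            due, _, callback = heapq.heappop(self._timers)
            self.now_ms = due
            callback()
        self.now_ms = deadline_ms


class Poller:
    def __init__(self, scheduler: VirtualScheduler, interval_ms: int) -> None:
        self.scheduler = scheduler
        self.interval_ms = interval_ms
        self.tick_times: List[int] = []
        scheduler.send_after(interval_ms, self.poll_tick)  # arm the first tick

    def poll_tick(self) -> None:
        self.tick_times.append(self.scheduler.now_ms)  # poll work would go here
        self.scheduler.send_after(self.interval_ms, self.poll_tick)  # re-arm


scheduler = VirtualScheduler()
poller = Poller(scheduler, interval_ms=5000)
scheduler.run_until(16_000)
```

One thing the sketch makes visible: if the poll work raised before the re-arm line, the chain of ticks would stop. In an actor runtime that is usually acceptable because a supervised restart runs the init handler again, which arms a fresh timer.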

2.5 host.channel for Channel-Backed Durable Queues

The Channel primitive provides at-least-once message delivery with explicit ack/nack:

# Producer: send to channel
msg_id = host.channel.send("", _TASK_CHANNEL, task_type, task)

# Consumer: receive, process, then ack on success or nack on failure
msg, ok, _ = host.channel.receive("", _TASK_CHANNEL, timeout_ms)
if ok:
    try:
        handle_task(msg)  # hypothetical application handler, not a PlexSpaces API
        host.channel.ack("", _TASK_CHANNEL, msg["msg_id"])         # commit
    except Exception:
        host.channel.nack("", _TASK_CHANNEL, msg["msg_id"], True)  # requeue

2.6 The Let-It-Crash Philosophy

Monolithic agent frameworks force developers to write defensive error handling around every tool call, every LLM request, and every memory access. MiniClaw takes the Erlang philosophy: let actors crash, and let guardians restart them in a clean state. A guardian supervisor watches its children. When one crashes, it applies a restart strategy. The other children continue running unaffected, with no cascading failures and no global error handlers.

# From examples/python/apps/miniclaw/app-config.toml
[supervisor]
strategy = "one_for_one"          # Restart ONLY the crashed actor
max_restarts = 10                 # Allow up to 10 restarts
max_restart_window_seconds = 60   # Within a 60-second window
# If 10 crashes in 60s -> escalate to parent supervisor
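
The restart-intensity rule in the config above amounts to a sliding-window counter. The sketch below is a hypothetical helper, not PlexSpaces code, and it follows the config comment literally: the crash that brings the count inside the window up to max_restarts is the one that escalates. The exact boundary in the real supervisor may differ.

```python
from collections import deque
from typing import Deque


class RestartIntensity:
    """Sliding-window restart counter (hypothetical helper, not PlexSpaces code)."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._crash_times: Deque[float] = deque()

    def record_crash(self, now: float) -> str:
        # Drop crashes that fell out of the window, then count this one.
        while self._crash_times and now - self._crash_times[0] >= self.window_seconds:
            self._crash_times.popleft()
        self._crash_times.append(now)
        if len(self._crash_times) >= self.max_restarts:
            return "escalate"
        return "restart"


# Small numbers for illustration: escalate on the 3rd crash within 60 seconds.
policy = RestartIntensity(max_restarts=3, window_seconds=60)
decisions = [policy.record_crash(t) for t in (0, 30, 70, 80)]
```

The crash at t=70 is still a plain restart because the crash at t=0 has aged out of the window; the crash at t=80 is the third within 60 seconds and escalates.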

PlexSpaces provides three restart strategies, each suited to different failure patterns:

Strategy | Behavior | Agent Use Case
--- | --- | ---
one_for_one | Restart only the crashed actor | Independent tools: calculator crash does not affect weather
rest_for_one | Restart crashed actor + all actors started after it | Pipeline stages: if retriever crashes, restart generator and validator too
one_for_all | Restart all children when any crashes | Tightly coupled team: research + analysis + writing agents share context
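
The three strategies differ only in which children get restarted. A minimal sketch, assuming the supervisor keeps its children in start order (the function name and the pipeline example are illustrative):

```python
from typing import List


def actors_to_restart(strategy: str, children: List[str], crashed: str) -> List[str]:
    """Return the children a supervisor restarts; `children` is in start order."""
    idx = children.index(crashed)
    if strategy == "one_for_one":
        return [crashed]
    if strategy == "rest_for_one":
        return children[idx:]  # the crashed child and everything started after it
    if strategy == "one_for_all":
        return list(children)
    raise ValueError(f"unknown strategy: {strategy}")


pipeline = ["retriever", "generator", "validator"]
after_generator_crash = actors_to_restart("rest_for_one", pipeline, "generator")
```

Under rest_for_one, a generator crash restarts the generator and the validator but leaves the retriever alone, because the retriever was started earlier and does not depend on the later stages.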

2.7 Monitors and Links

PlexSpaces provides two mechanisms for actors to watch each other (similar to Erlang):

  • Monitors (host.monitor()) provide one-way observation. The monitoring actor receives a __DOWN__ message when the monitored actor stops.
  • Links (host.link()) provide bidirectional fate-sharing. If either linked actor crashes abnormally, the other receives an __EXIT__ message.

# Monitor: one-way watch. ValidatorAgent watches workers.
monitor_ref = host.monitor(worker_id)

@handler("__DOWN__", "cast")
def on_down(self, monitor_ref: str = "", down_from: str = "", down_reason: str = "") -> None:
    """Monitored worker stopped. ValidatorAgent stays alive and compensates."""
    self.failed_workers.append(down_from)
    # Spawn replacement, redistribute work, alert operator

# Link: bidirectional fate-sharing. Coordinating agents share fate.
host.link(peer_id)

@handler("__EXIT__", "cast")
def on_exit(self, exit_from: str = "", exit_reason: str = "") -> None:
    """Linked peer died abnormally. Clean up shared resources."""
    self.linked_peers.remove(exit_from)

In MiniClaw, the guardian supervisor monitors all ten actors. If the LLMRouterActor crashes, the supervisor restarts it with a clean state. The AgentActor’s in-flight request receives a timeout error, while the MemoryActor, the AuditEventActor, and every other actor continue running without interruption.

The supervisor IS the guardian pattern from Erlang. Every MiniClaw actor runs under guardian supervision for crash recovery.


Part 3: WASM + Firecracker Sandbox

3.1 Defense in Depth

MiniClaw actors run inside three concentric isolation layers:

  1. Actor isolation: Each actor owns its state exclusively. No shared memory, no global variables, no cross-actor data access. Communication happens only through host.ask() and host.send().
  2. WASM + Firecracker sandbox: Each actor compiles to a WebAssembly module whose linear memory is bounds-checked by the WASM runtime and isolated per actor instance. In production deployments, each WASM runtime itself runs inside a Firecracker microVM. Firecracker is a lightweight KVM-based virtual machine monitor whose microVMs boot in roughly 125ms, and it adds hardware-level memory and I/O isolation between tenants.
  3. Tenant isolation: Every PlexSpaces operation requires a RequestContext with explicit tenant and namespace identifiers via JWT authentication. The framework rejects cross-tenant access before the request reaches the actor.

3.2 What the Two-Layer Sandbox Prevents

Attack Vector | Monolithic Framework | WASM Sandbox | WASM + Firecracker
--- | --- | --- | ---
open("/etc/passwd") | Succeeds: full FS access | Blocked: no FS import in WIT | Blocked: separate VM filesystem
os.environ["API_KEY"] | Succeeds: env vars shared | Blocked: no env access in WASM | Blocked: separate VM env
Read another actor’s memory | Succeeds: shared process | Blocked: WASM linear memory is per-instance | Blocked: separate VM address space
Escape WASM sandbox via JIT bug | Possible in theory | Partially mitigated | Blocked: hypervisor hardware boundary
Cross-tenant KV access | Possible if scoping misconfigured | Blocked: RequestContext enforced | Blocked: separate VM per tenant

The WIT (WebAssembly Interface Types) definition explicitly declares what the actor can access:

// From wit/plexspaces-actor/host.wit
// The actor can ONLY call these imports — nothing else
interface host {
    send: func(to: string, msg-type: string, payload: payload) -> result<_, actor-error>;
    ask: func(to: string, msg-type: string, payload: payload, timeout-ms: u64) -> result<payload, actor-error>;
    kv-get: func(key: string) -> result<payload, actor-error>;
    kv-put: func(key: string, value: payload) -> result<_, actor-error>;
    http-fetch: func(link-name: string, method: string, path: string, request: payload) -> result<payload, actor-error>;
    // No filesystem. No env vars. No raw network. No process exec.
}

3.3 Tenant Isolation by Construction

Every PlexSpaces operation propagates tenant context through the call chain. KV keys, TupleSpace tuples, object-registry entries, and process groups are all scoped by tenant and namespace. A session created by tenant acme cannot be retrieved by tenant globex: the framework rejects the request before it reaches the actor.

# Every API request carries tenant context — enforced at framework level
# KV keys scoped:     tenant-acme:prod:session:sess-001
# TupleSpace scoped:  tenant-acme:prod:["memory", "global", "user_name", "Alice"]
# Process groups:     tenant-acme:prod:svc:llm_router

There is no internal() bypass for application code. Tenant boundaries are enforced by construction, not by convention.
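
A minimal sketch of scoping by construction, assuming the key layout shown above. The RequestContext fields and the ScopedKV class are illustrative, not the PlexSpaces API. The point is that the caller never supplies the prefix, so application code has no way to name another tenant's key.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RequestContext:
    tenant: str
    namespace: str


class ScopedKV:
    """Toy KV store that prefixes every key with tenant and namespace (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    @staticmethod
    def _scoped(ctx: RequestContext, key: str) -> str:
        # Mirrors the key layout shown above: tenant-acme:prod:session:sess-001
        return f"tenant-{ctx.tenant}:{ctx.namespace}:{key}"

    def put(self, ctx: RequestContext, key: str, value: str) -> None:
        self._data[self._scoped(ctx, key)] = value

    def get(self, ctx: RequestContext, key: str) -> Optional[str]:
        return self._data.get(self._scoped(ctx, key))


kv = ScopedKV()
acme = RequestContext("acme", "prod")
globex = RequestContext("globex", "prod")
kv.put(acme, "session:sess-001", "alice")

own_read = kv.get(acme, "session:sess-001")
cross_tenant_read = kv.get(globex, "session:sess-001")
```

In this sketch the cross-tenant read simply finds nothing, whereas the framework described above rejects the request outright. A real implementation would also need to validate that tenant and namespace identifiers cannot contain the separator character.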


Part 4: MiniClaw Architecture

MiniClaw decomposes the agent framework into ten actors. Every actor runs as a WebAssembly module inside the PlexSpaces runtime, discovers collaborators through object-registry or process groups, and persists state through the durability facet.

Actor | Behavior | Responsibility | Security Property
--- | --- | --- | ---
LLMRouterActor | GenServer | Route LLM calls, circuit-break on failure | Real API keys never leave the actor (phantom token proxy)
ToolRegistryActor | GenServer | Register tools with schemas, execute in isolation | Schema validation prevents malformed tool inputs
AgentActor | GenServer | Core agent loop: message -> LLM -> tool -> repeat | Bounded iteration (max 5) prevents infinite loops
SessionManagerActor | GenServer | Map users to sessions, enforce tenant scope | Tenant-scoped KV keys prevent cross-tenant access
OrchestratorActor | Workflow | Decompose tasks, delegate, checkpoint progress | Durable checkpoints survive crashes
MemoryActor | GenServer | Scoped memory (global/agent/session) | KV + TupleSpace dual-write with tenant scoping
AuditEventActor | GenEvent | Immutable log of every actor operation | Fire-and-forget; senders never block on audit
AgentStateFSM | GenStateMachine | Lifecycle guard: idle -> processing -> tool_executing -> responding | Validates transitions; rejects illegal states
TaskQueueActor | GenServer | Durable task queue backed by Channel; enqueue/dequeue/ack/nack | At-least-once delivery; no external broker
HealthMonitorActor | GenServer | Periodic PG membership polling via send_after; writes health snapshots | Simple polling eliminates subscription races
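
The lifecycle guard listed for AgentStateFSM can be sketched as a transition table. The forward path idle -> processing -> tool_executing -> responding comes from the table above; the remaining edges (processing -> responding, tool_executing -> processing, responding -> idle) are assumptions added so the loop can close, and the class name is illustrative.

```python
from typing import Dict, Set

# Allowed transitions. Only the forward path is taken from the post;
# the other edges are assumptions for this sketch.
TRANSITIONS: Dict[str, Set[str]] = {
    "idle": {"processing"},
    "processing": {"tool_executing", "responding"},
    "tool_executing": {"processing", "responding"},
    "responding": {"idle"},
}


class LifecycleFSM:
    def __init__(self, initial: str = "idle") -> None:
        self.state = initial

    def transition(self, to: str) -> dict:
        if to not in TRANSITIONS.get(self.state, set()):
            # Reject and leave the state unchanged.
            return {"error": f"illegal transition {self.state} -> {to}"}
        self.state = to
        return {"status": "ok", "state": to}


fsm = LifecycleFSM()
rejected = fsm.transition("responding")   # cannot respond straight from idle
accepted = [fsm.transition(s) for s in ("processing", "tool_executing", "responding")]
```

Keeping the table as data rather than as nested conditionals makes the set of legal states easy to audit, which is the property the FSM actor is there to provide.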

Part 5: Design Patterns Used in MiniClaw

The NanoClaw project introduced an important design philosophy: instead of reaching for external infrastructure when you hit a constraint, first ask whether the primitives you already have can solve the problem.

Pattern 1: Phantom Token / Credential Proxy

The constraint: Agents need to call an LLM provider, but callers should never see real API keys. Storing keys in the agent payload means any log line or bug report leaks credentials.

The actor solution: LLMRouterActor owns the credential store. It exposes a register_credential op that stores phantom_token -> real_api_key in its private KV namespace. Callers pass only the opaque token; the actor resolves the real key internally and discards it before building any response.

# Phantom token: real key stored in actor-private KV — never echoed to callers
@handler("register_credential")
def register_credential(self, phantom_token: str = "", api_key: str = "") -> dict:
    if not phantom_token or not api_key:
        return {"error": "phantom_token and api_key required"}
    host.kv_put(f"cred:{phantom_token}", api_key)  # Only this actor reads it
    return {"status": "ok", "phantom_token": phantom_token}  # api_key never returned

@handler("chat_completion")
def chat_completion(self, messages: list = None, tools: list = None,
                    phantom_token: str = "") -> dict:
    resolved_key = host.kv_get(f"cred:{phantom_token}") if phantom_token else ""
    # resolved_key used by real HTTP client; discarded here
    # ... call LLM, build response ...
    return {"status": "ok", "response": response}  # resolved_key never in response

Actor-private state means the real key is inaccessible from any other actor, any other tenant, and any logged payload. Even if a prompt injection tricks the agent into returning its full state, the real credential is not in the agent; it is in the router actor, which never echoes it back.

Pattern 2: Task Queue (TaskQueueActor)

The constraint: The orchestrator needs to enqueue work items for agents to process asynchronously, but the environment has no external message broker, only the built-in Channel primitive.

The actor solution: TaskQueueActor is a thin wrapper around host.channel. The Channel handles durability, at-least-once delivery, and redelivery on nack transparently:

# From examples/python/apps/miniclaw/infra.py
_TASK_CHANNEL = "tasks:pending"

@actor
class TaskQueueActor:
    """Thin actor wrapper around the host Channel primitive."""

    enqueued: int = state(default=0)
    completed: int = state(default=0)
    failed: int = state(default=0)

    @handler("enqueue")
    def enqueue(self, task_type: str = "generic", payload: dict = None) -> dict:
        task = {"task_type": task_type, "payload": payload or {}, "enqueued_at": host.now_ms()}
        msg_id = host.channel.send("", _TASK_CHANNEL, task_type, task)
        self.enqueued += 1
        fire_audit("task_enqueued", f"msg_id={msg_id} type={task_type}")
        return {"status": "ok", "msg_id": msg_id}

    @handler("dequeue")
    def dequeue(self, limit: int = 1, timeout_ms: int = 0) -> dict:
        tasks = []
        for _ in range(int(limit)):
            msg, ok, _ = host.channel.receive("", _TASK_CHANNEL, int(timeout_ms))
            if not ok:
                break
            tasks.append(msg)
        return {"status": "ok", "tasks": tasks, "count": len(tasks)}

    @handler("ack")
    def ack(self, msg_id: str = "") -> dict:
        host.channel.ack("", _TASK_CHANNEL, msg_id)   # commits the delivery
        self.completed += 1
        return {"status": "ok", "msg_id": msg_id}

    @handler("nack")
    def nack(self, msg_id: str = "", requeue: bool = True) -> dict:
        host.channel.nack("", _TASK_CHANNEL, msg_id, requeue)  # requeue for redelivery
        self.failed += 1
        return {"status": "ok", "msg_id": msg_id, "requeue": requeue}

PlexSpaces supports multiple providers for queues and channels, such as Kafka, SQS, Redis, or an implementation backed by process-group communication. The Channel primitive is built into the PlexSpaces host: it is durable and ordered, with explicit ack/nack semantics. If the consumer crashes mid-processing, the unacked message is redelivered on the next dequeue.
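
The redelivery behavior can be illustrated with a toy in-memory queue that tracks in-flight messages. This is a hypothetical stand-in for host.channel with a simplified signature; the recover method simulates what a durable channel would do after a consumer crash.

```python
import itertools
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class InMemoryChannel:
    """Toy at-least-once queue; a hypothetical stand-in for host.channel."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._ready: Deque[Tuple[str, Any]] = deque()
        self._in_flight: Dict[str, Any] = {}

    def send(self, payload: Any) -> str:
        msg_id = f"msg-{next(self._ids)}"
        self._ready.append((msg_id, payload))
        return msg_id

    def receive(self) -> Optional[Tuple[str, Any]]:
        if not self._ready:
            return None
        msg_id, payload = self._ready.popleft()
        self._in_flight[msg_id] = payload  # delivered but not yet committed
        return msg_id, payload

    def ack(self, msg_id: str) -> None:
        self._in_flight.pop(msg_id, None)  # commit: never redelivered

    def nack(self, msg_id: str, requeue: bool = True) -> None:
        if msg_id in self._in_flight:
            payload = self._in_flight.pop(msg_id)
            if requeue:
                self._ready.append((msg_id, payload))

    def recover(self) -> None:
        # Simulates a consumer crash: every unacked message becomes ready again.
        for msg_id, payload in self._in_flight.items():
            self._ready.append((msg_id, payload))
        self._in_flight.clear()


channel = InMemoryChannel()
sent_id = channel.send({"task_type": "summarize"})
first_delivery, _ = channel.receive()
channel.recover()                      # consumer died before acking
second_delivery, _ = channel.receive()
channel.ack(second_delivery)
```

Because the same message can be delivered more than once, task handlers built on an at-least-once channel generally need to be idempotent.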

Pattern 3: Polling Over Events (HealthMonitorActor)

The constraint: We want to know the health of all service actors, but subscribing to process group membership change events introduces races: a join and a crash can arrive out of order, leaving stale membership in the subscriber’s view.

The actor solution: HealthMonitorActor never subscribes to anything. It polls every service group on a configurable interval using send_after to schedule its own next tick:

# From examples/python/apps/miniclaw/infra.py
_SERVICE_GROUPS = [
    "svc:llm_router", "svc:tool_registry", "svc:agent",
    "svc:session_manager", "svc:memory", "svc:audit",
    "svc:agent_fsm", "svc:task_queue",
]

@actor
class HealthMonitorActor:
    """Polls process group membership on a fixed interval using send_after."""

    poll_count: int = state(default=0)
    last_poll_ms: int = state(default=0)
    group_health: dict = state(default_factory=dict)
    poll_interval_ms: int = state(default=5000)

    @init_handler
    def on_init(self, config: dict) -> None:
        args = config.get("args", {})
        if args.get("poll_interval_ms"):
            iv = int(args["poll_interval_ms"])
            self.poll_interval_ms = min(max(iv, 1000), 300_000)
        host.process_groups.join("svc:health_monitor")
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("poll_tick", "cast")
    def poll_tick(self) -> None:
        health = {}
        for grp in _SERVICE_GROUPS:
            try:
                members = host.process_groups.members(grp)
                health[grp] = len(members)
            except Exception:
                health[grp] = 0
        self.group_health = health
        self.poll_count += 1
        self.last_poll_ms = host.now_ms()

        import json
        host.ts.write(["health_snapshot", self.last_poll_ms, json.dumps(health)])
        # Re-arm: each tick schedules the next — no external scheduler needed
        host.send_after(self.poll_interval_ms, "poll_tick", {"op": "poll_tick"})

    @handler("get_health")
    def get_health(self) -> dict:
        degraded = [g for g, c in self.group_health.items() if c == 0]
        return {
            "status": "ok",
            "group_health": self.group_health,
            "healthy": len(self.group_health) - len(degraded),
            "degraded": degraded,
        }

Polling converges to the true membership on every tick regardless of event order, so the view is stale by at most one poll interval. get_health returns not just a count but a list of degraded groups, making it immediately actionable.

The Constraint-Aware Philosophy

These four patterns share a common thread: each one reaches for the primitives already available in the PlexSpaces sandbox before introducing external dependencies.

| Need | Naive Solution | NanoClaw Solution | Primitive Used |
|---|---|---|---|
| Protect API keys | Environment variables or secrets manager | Phantom token stored in actor-private KV | host.kv_put/kv_get |
| Async task queue | RabbitMQ / SQS | Channel-backed queue with ack/nack | host.channel.send/receive/ack/nack |
| Service health monitoring | Event subscription + fan-out | Periodic send_after poll + TupleSpace snapshot | host.send_after + host.process_groups.members() |
| Capability discovery | Service registry with TTL | Process groups + TupleSpace agent cards | host.process_groups.join/members() + host.ts.write/read_all |

The WASM sandbox is not a limitation to work around; it is a guide for designing simpler, more auditable systems.


Part 6: The Agent Loop

6.1 The Loop in Code

The AgentActor drives the core agent loop. It receives a user message, calls the LLM, checks for tool requests, executes tools, feeds results back, and repeats with a hard cap of five iterations to prevent runaway loops.

# From examples/python/apps/miniclaw/agent.py
_MAX_ITER = 5
...
    @handler("chat")
    def chat(self, message: str = "", session_id: str = "") -> dict:
        if not message:
            return {"error": "message is required"}

        self.messages.append({"role": "user", "content": message})

        # Discover tools
        tool_reg_id, _ = pg_first("svc:tool_registry")
        tools = []
        if tool_reg_id:
            resp = ask(tool_reg_id, "list_tools", {})
            if resp:
                tools = resp.get("tools", [])

        # Signal FSM: processing
        fsm_id, _ = pg_first("svc:agent_fsm")
        if fsm_id:
            host.send(fsm_id, "transition", {"op": "transition", "to": "processing"})

        final_response = ""
        for i in range(_MAX_ITER):
            llm_id, err = pg_first("svc:llm_router")
            if err or not llm_id:
                final_response = f"[no LLM] Processed: {message}"
                break

            llm_resp = ask(llm_id, "chat_completion", {"messages": [{"role": "system", "content": self.system_prompt}] + self.messages, "tools": tools}, 10000)
            if not llm_resp or "error" in llm_resp:
                final_response = f"LLM unavailable: {llm_resp}"
                break

            response = llm_resp.get("response", {})
            stop_reason = response.get("stop_reason", "end_turn")
            content = response.get("content", "")

            assistant_msg = {"role": "assistant", "content": content, "stop_reason": stop_reason}
            if response.get("tool_calls"):
                assistant_msg["tool_calls"] = response["tool_calls"]
            self.messages.append(assistant_msg)

            if stop_reason == "end_turn":
                final_response = content
                break

            if stop_reason == "tool_use":
                if fsm_id:
                    host.send(fsm_id, "transition", {"op": "transition", "to": "tool_executing"})

                for tc in response.get("tool_calls", []):
                    tc_name = tc.get("name", "")
                    tc_input = tc.get("input", {})
                    tool_output = {}
                    if tool_reg_id:
                        tool_output = ask(tool_reg_id, "execute_tool", {"name": tc_name, "input": tc_input}) or {}

                    self.messages.append({
                        "role": "tool",
                        "tool_call_id": tc.get("id", ""),
                        "content": str(tool_output),
                    })
                    fire_audit("tool_called", f"tool={tc_name} session={session_id}")

                if fsm_id:
                    host.send(fsm_id, "transition", {"op": "transition", "to": "processing"})
                final_response = f"Tool results applied (iteration {i + 1})"
            else:
                final_response = content
                break

        # FSM: responding -> idle
        if fsm_id:
            host.send(fsm_id, "transition", {"op": "transition", "to": "responding"})
            host.send(fsm_id, "transition", {"op": "transition", "to": "idle"})

        # Compact history if needed
        if len(self.messages) > self.max_history:
            keep = self.max_history // 2
            self.messages = self.messages[:1] + self.messages[-keep:]

        # Persist history in KV if session provided
        if session_id:
            import json
            host.kv_put(f"session_history:{session_id}", json.dumps(self.messages))

        self.total_chats += 1
        fire_audit("agent_chat", f"session={session_id}")
        return {
            "status": "ok",
            "response": final_response,
            "session_id": session_id,
            "messages_count": len(self.messages),
        }

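The _MAX_ITER = 5 cap prevents runaway loops. Because the whole loop runs inside one actor handler, the cap is a plain local loop bound. A monolithic framework that spreads the loop across callbacks or threads typically has to track the iteration count in global or thread-local state.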
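To see the cap at work without a running node, here is a minimal plain-Python sketch of the same loop shape. The always_wants_tool function is a hypothetical stand-in for an LLM that never stops requesting tools, and run_loop is an illustrative name; neither appears in the MiniClaw source.

```python
MAX_ITER = 5  # mirrors _MAX_ITER in the listing above


def always_wants_tool(messages: list) -> dict:
    """Hypothetical stand-in for the LLM: it never stops asking for tools."""
    return {"stop_reason": "tool_use", "content": "", "tool_calls": [{"name": "search"}]}


def run_loop(llm, message: str) -> dict:
    messages = [{"role": "user", "content": message}]
    final_response = ""
    iterations = 0
    for i in range(MAX_ITER):
        iterations = i + 1
        reply = llm(messages)
        if reply["stop_reason"] != "tool_use":
            final_response = reply["content"]
            break
        # Pretend to run each requested tool and feed the result back.
        for call in reply["tool_calls"]:
            messages.append({"role": "tool", "content": f"result of {call['name']}"})
        final_response = f"Tool results applied (iteration {i + 1})"
    return {"response": final_response, "iterations": iterations}


result = run_loop(always_wants_tool, "search forever")
print(result)  # {'response': 'Tool results applied (iteration 5)', 'iterations': 5}
```

Even with a misbehaving model, the handler returns after five LLM calls, and the caller gets a placeholder response rather than a hung request.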


Part 7: Circuit Breakers and Immutable Audit Trails

7.1 LLM Router

The LLMRouterActor simulates an LLM with tool-call routing. In production, replace the simulation with a real API call via host.http_fetch() over a named service link:

# From examples/python/apps/miniclaw/llm_router.py
TOOL_CALL_TRIGGERS = ("weather", "search", "calculate", "lookup", "find")

# `LLMRouterActor` is a simulator in this POC. It demonstrates the routing 
# boundary where production code would call OpenAI, Anthropic, Bedrock, Gemini, or 
# an internal model endpoint through a named service link.
@actor
class LLMRouterActor:
    """Simulated LLM router with tool-calling capability."""

    model: str = state(default="miniclaw-simulated-v1")
    request_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        self.model = config.get("args", {}).get("model", self.model)
        host.process_groups.join("svc:llm_router")

    @handler("chat_completion")
    def chat_completion(self, messages: list = None, tools: list = None) -> dict:
        messages = messages or []
        tools = tools or []
        self.request_count += 1

        user_msg = ""
        for m in reversed(messages):
            if m.get("role") == "user":
                user_msg = str(m.get("content", "")).lower()
                break

        should_use_tool = tools and any(kw in user_msg for kw in TOOL_CALL_TRIGGERS)

        if should_use_tool:
            tool = tools[0] if tools else {}
            tool_name = tool.get("name", "search") if isinstance(tool, dict) else "search"
            response = {
                "stop_reason": "tool_use",
                "content": "",
                "tool_calls": [{"id": f"tc_{self.request_count}", "name": tool_name,
                                 "input": {"query": user_msg}}],
            }
        else:
            response = {
                "stop_reason": "end_turn",
                "content": f"[{self.model}] Processed: {user_msg}",
                "tool_calls": [],
            }
        return {"status": "ok", "response": response, "model": self.model}

To add a circuit breaker for production LLM rate limits, extend the actor state with circuit_open and consecutive_failures. The actor IS the circuit breaker, and the durability facet ensures the circuit state survives restarts:

@actor
class LLMRouterActor:
    model: str = state(default="gpt-4o")
    circuit_open: bool = state(default=False)
    consecutive_failures: int = state(default=0)
    request_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:llm_router")
        # Schedule circuit recovery timer
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})

    @handler("chat_completion")
    def chat_completion(self, messages: list = None, tools: list = None) -> dict:
        if self.circuit_open:
            return {"error": "circuit_open", "circuit_open": True}

        try:
            # Production: real API call via host.http_fetch("llm-api", ...)
            result = self._call_llm(messages, tools)
            self.consecutive_failures = 0
            self.request_count += 1
            return result
        except Exception as e:
            self.consecutive_failures += 1
            if self.consecutive_failures >= 3:
                self.circuit_open = True
            return {"error": str(e), "circuit_open": self.circuit_open}

    @handler("timer_tick", "cast")
    def timer_tick(self) -> None:
        # Gradual recovery: decrement failure count by 1 each tick (30s).
        # 3 failures -> the circuit closes again after at most ~90s (the timer is
        # free-running, so the first tick may arrive sooner). Prevents closing too early.
        if self.circuit_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.circuit_open = False
        host.send_after(30_000, "timer_tick", {"op": "timer_tick"})
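The recovery arithmetic is easier to check in isolation. The following is a minimal plain-Python model of the same counters, with no PlexSpaces host involved; the CircuitBreaker class and its method names are assumptions made for this sketch.

```python
class CircuitBreaker:
    """Plain-Python model of the counters used by the router above."""

    FAILURE_THRESHOLD = 3

    def __init__(self) -> None:
        self.circuit_open = False
        self.consecutive_failures = 0

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FAILURE_THRESHOLD:
            self.circuit_open = True

    def timer_tick(self) -> None:
        # Same recovery rule as the timer_tick handler: one step per tick.
        if self.circuit_open and self.consecutive_failures > 0:
            self.consecutive_failures -= 1
            if self.consecutive_failures == 0:
                self.circuit_open = False


cb = CircuitBreaker()
for _ in range(3):
    cb.record_failure()

states = []
for _ in range(3):
    cb.timer_tick()
    states.append(cb.circuit_open)
print(states)  # [True, True, False]
```

Note that this design goes straight from open to closed. There is no half-open state that lets a single probe request through first, which is a common refinement if the upstream provider recovers slowly.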

7.2 Immutable Audit Trail

The AuditEventActor captures every agent action as a fire-and-forget event. Senders never block. Events flow into TupleSpace for append-only, queryable storage:

# From examples/python/apps/miniclaw/memory.py

@event_actor
class AuditEventActor:
    """GenEvent actor: fire-and-forget audit events stored in TupleSpace."""

    event_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:audit")

    @handler("log_event", "cast")
    def log_event(self, event_type: str = "", detail: str = "", timestamp: int = 0) -> None:
        ts = timestamp or host.now_ms()
        try:
            host.ts.write(["audit", event_type, ts, detail])
        except Exception as e:
            host.warn(f"AuditEvent: ts.write failed: {e}")
        self.event_count += 1

    @handler("get_stats")
    def get_stats(self) -> dict:
        return {"status": "ok", "event_count": self.event_count}

Notice the "cast" annotation on log_event: it marks the handler as fire-and-forget. The sender (fire_audit() in helpers.py) calls host.send() rather than host.ask(), so it never blocks waiting for a reply.


Part 8: Tools as Actors with MCP-Style Isolation

8.1 Each Tool Gets Supervision, Metrics, and Fault Recovery

In MiniClaw, the ToolRegistryActor manages tool definitions and dispatches execution. Each tool handler runs within the actor’s sandboxed environment:

# From examples/python/apps/miniclaw/tool_registry.py

@actor
class ToolRegistryActor:
    """Registry of callable tools with simulated execution."""

    tools: dict = state(default_factory=dict)   # name -> tool spec
    exec_count: int = state(default=0)
    actor_id: str = state(default="")

    @init_handler
    def on_init(self, config: dict) -> None:
        self.actor_id = config.get("actor_id", "")
        self.tools = {t["name"]: t for t in _BUILTIN_TOOLS}
        host.process_groups.join("svc:tool_registry")
        host.info(f"ToolRegistryActor init actor_id={self.actor_id} tools={list(self.tools)}")

    @handler("list_tools")
    def list_tools(self) -> dict:
        return {"status": "ok", "tools": list(self.tools.values()), "count": len(self.tools)}

    @handler("register_tool")
    def register_tool(self, name: str = "", description: str = "", input_schema: dict = None) -> dict:
        if not name:
            return {"error": "name is required"}
        self.tools[name] = {"name": name, "description": description, "input_schema": input_schema or {}}
        host.info(f"ToolRegistry: registered tool={name}")
        return {"status": "ok", "name": name}

    @handler("execute_tool")
    def execute_tool(self, name: str = "", input: dict = None) -> dict:
        input = input or {}
        if name not in self.tools:
            return {"error": f"unknown tool: {name}"}

        self.exec_count += 1
        host.info(f"ToolRegistry: executing tool={name} exec={self.exec_count}")

        # Simulated responses per tool type
        if name == "web_search":
            return {"result": f"Search results for: {input.get('query', '')}"}
        if name == "calculator":
            expr = input.get("expression", "0")
            try:
                # Demo-only restricted evaluation.
                # Production code should replace this with an AST-based evaluator or a sandboxed tool actor.
                result = eval(expr, {"__builtins__": {}})  # noqa: S307
                return {"result": str(result)}
            except Exception:
                return {"result": f"Could not evaluate: {expr}"}
        if name == "weather":
            location = input.get("location", "unknown")
            return {"result": f"Weather in {location}: 22°C, partly cloudy"}

        return {"result": f"[simulated] {name} output for input {input}"}

    @handler("get_stats")
    def get_stats(self) -> dict:
        return {"status": "ok", "tool_count": len(self.tools), "exec_count": self.exec_count}
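The calculator branch above notes that production code should use an AST-based evaluator instead of eval. Here is a minimal sketch of one using only the standard library. The safe_eval name, the operator whitelist, and the 200-character cap are assumptions chosen for illustration; exponentiation is deliberately left out so a short expression cannot trigger a huge computation.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr: str):
    """Evaluate a basic arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    if len(expr) > 200:  # arbitrary cap to bound parse cost
        raise ValueError("expression too long")
    return _eval(ast.parse(expr, mode="eval"))


print(safe_eval("2 + 3 * 4"))  # 14
```

Names, attribute access, and function calls all fall through to the ValueError branch, so an input such as __import__('os') is rejected at the AST level rather than relying on an emptied __builtins__ dictionary.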

8.2 What Standalone MCP Servers Lack

| Capability | Standalone MCP | Tool-as-Actor (MiniClaw) |
|---|---|---|
| State persistence | In-memory only; lost on restart | Durability facet checkpoints to SQLite |
| Multi-tenant access | No built-in tenant scoping | RequestContext enforces tenant isolation |
| Metrics | Must add manually per tool | Per-actor invocation counts automatic |
| Fault tolerance | Process crash loses all state | Supervisor restarts; state restored from checkpoint |
| Sandbox | Process boundary only | WASM linear memory + optional Firecracker VM |

Part 9: Agent Lifecycle State Machine

9.1 Scoped Memory with KV + TupleSpace Dual-Write

MemoryActor writes every memory entry to both KV (for durable point-lookup) and TupleSpace (for queryable pattern-scan across a scope):

# From examples/python/apps/miniclaw/memory.py

@actor
class MemoryActor:
    """Scoped memory backed by KV (persistent) and TupleSpace (queryable)."""

    memory_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:memory")

    @handler("store_memory")
    def store_memory(self, key: str = "", value: str = "",
                     scope: str = "global", agent_id: str = "", session_id: str = "") -> dict:
        if not key:
            return {"error": "key is required"}
        scoped_key = _scoped_key(scope, agent_id, session_id, key)
        host.kv_put(scoped_key, str(value))                     # KV: durable point-lookup
        host.ts.write(["memory", scope, key, str(value)])       # TupleSpace: queryable scan
        self.memory_count += 1
        fire_audit("memory_stored", f"scope={scope} key={key}")
        return {"status": "ok", "key": key, "scope": scope}

    @handler("recall_memory")
    def recall_memory(self, key: str = "", scope: str = "global",
                      agent_id: str = "", session_id: str = "") -> dict:
        scoped_key = _scoped_key(scope, agent_id, session_id, key)
        value = host.kv_get(scoped_key)
        return {"status": "ok", "key": key, "value": value, "found": bool(value)}

    @handler("list_memories")
    def list_memories(self, scope: str = "global") -> dict:
        try:
            tuples = host.ts.read_all(["memory", scope, None, None])
            memories = [{"key": t[2], "value": t[3]} for t in tuples if len(t) >= 4]
        except Exception:
            memories = []
        return {"status": "ok", "memories": memories, "scope": scope}


def _scoped_key(scope: str, agent_id: str, session_id: str, key: str) -> str:
    if scope == "agent" and agent_id:
        return f"mem:agent:{agent_id}:{key}"
    if scope == "session" and session_id:
        return f"mem:session:{session_id}:{key}"
    return f"mem:global:{key}"

The three scopes are not just naming conventions — they determine which memories survive across session boundaries:

| Scope | Persists across | Example |
|---|---|---|
| global | Everything including sessions, agent restarts | User name, user preferences |
| agent | Restarts of this specific agent | Agent-specific learned facts |
| session | Only within a single session | “We were discussing X” context |

9.2 Session Management in KV with a Channel+User Index

SessionManagerActor stores session metadata in KV and maintains a secondary index that maps channel+user_id to session_id:

# From examples/python/apps/miniclaw/agent.py

@actor
class SessionManagerActor:
    """Manages agent session lifecycle backed by KV storage."""

    active_sessions: int = state(default=0)
    total_created: int = state(default=0)
    session_ids: list = state(default_factory=list)

    @handler("create_session")
    def create_session(self, channel: str = "web", user_id: str = "anonymous",
                       agent_id: str = "agent") -> dict:
        import json
        session_id = f"sess-{channel}-{user_id}-{host.now_ms()}"
        meta = {"session_id": session_id, "channel": channel, "user_id": user_id,
                "agent_id": agent_id, "created_at": host.now_ms(), "status": "active"}
        host.kv_put(f"session:{session_id}", json.dumps(meta))
        host.kv_put(f"session_map:{channel}:{user_id}", session_id)  # secondary index
        self.session_ids.append(session_id)
        self.active_sessions += 1
        fire_audit("session_created", f"session_id={session_id} channel={channel} user_id={user_id}")
        return {"status": "ok", "session_id": session_id}

    @handler("get_session")
    def get_session(self, session_id: str = "", channel: str = "", user_id: str = "") -> dict:
        import json
        if not session_id and channel and user_id:
            # Natural key lookup via secondary index
            session_id = host.kv_get(f"session_map:{channel}:{user_id}")
        if not session_id:
            return {"error": "session not found"}
        raw = host.kv_get(f"session:{session_id}")
        if not raw:
            return {"error": "session not found", "session_id": session_id}
        meta = json.loads(raw)
        meta["status"] = "ok"
        return meta

The secondary index means a chatbot can route an incoming webhook (which carries channel and user_id but not a session token) directly to the right session without a scan.
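A minimal in-memory sketch of the two-step lookup follows. The kv dictionary and the counter used in place of host.now_ms() are assumptions so the example runs deterministically outside the PlexSpaces runtime; find_session is an illustrative name.

```python
import itertools
import json

kv = {}                         # in-memory stand-in for host.kv_put/kv_get
_clock = itertools.count(1000)  # deterministic stand-in for host.now_ms()


def create_session(channel: str, user_id: str) -> str:
    session_id = f"sess-{channel}-{user_id}-{next(_clock)}"
    meta = {"session_id": session_id, "channel": channel, "user_id": user_id}
    kv[f"session:{session_id}"] = json.dumps(meta)
    kv[f"session_map:{channel}:{user_id}"] = session_id  # secondary index
    return session_id


def find_session(channel: str, user_id: str):
    # Two point reads: index entry first, then the session record.
    session_id = kv.get(f"session_map:{channel}:{user_id}")
    raw = kv.get(f"session:{session_id}") if session_id else None
    return json.loads(raw) if raw else None


first = create_session("telegram", "u42")
second = create_session("telegram", "u42")
found = find_session("telegram", "u42")
print(found["session_id"])  # sess-telegram-u42-1001
```

Because the index holds a single session_id per channel and user, creating a second session overwrites the pointer. The older session record still exists under its own key but is no longer reachable through the natural-key lookup.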

9.3 State Management

The AgentStateFSM tracks execution state through a finite state machine. It validates transitions at runtime, so an invalid transition such as idle -> responding is rejected. This catches bugs in the agent loop before they produce corrupt state.

# From examples/python/apps/miniclaw/memory.py

# Sole authoritative definition of the FSM.
# Adding a new state requires only adding it here.
_VALID_FSM_TRANSITIONS = {
    "idle": {"processing", "tool_executing"},
    "processing": {"tool_executing", "responding", "idle"},
    "tool_executing": {"processing", "idle"},
    "responding": {"idle"},
}


@fsm_actor(states=["idle", "processing", "tool_executing", "responding"], initial="idle")
class AgentStateFSM:
    """Agent lifecycle FSM: idle -> processing -> tool_executing -> responding -> idle."""

    fsm_state: str = state(default="idle")
    transition_count: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.process_groups.join("svc:agent_fsm")

    @handler("transition")
    def transition(self, to: str = "") -> dict:
        allowed = _VALID_FSM_TRANSITIONS.get(self.fsm_state, set())
        if to not in allowed:
            host.debug(f"FSM: invalid transition {self.fsm_state} -> {to}")
            return {"status": "ignored", "from": self.fsm_state, "to": to}
        prev = self.fsm_state
        self.fsm_state = to
        self.transition_count += 1
        host.debug(f"FSM: {prev} -> {to}")
        return {"status": "ok", "from": prev, "to": to}

    @handler("get_state")
    def get_state(self) -> dict:
        return {"status": "ok", "state": self.fsm_state, "transitions": self.transition_count}

Operators can query the FSM through get_state to see what the agent is doing at any moment.
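The transition table can be checked on its own by walking a sequence of requested states through it. This is a minimal sketch: VALID repeats the table from the listing above, and walk is a hypothetical helper that mirrors the "ok" and "ignored" outcomes of the transition handler.

```python
VALID = {
    "idle": {"processing", "tool_executing"},
    "processing": {"tool_executing", "responding", "idle"},
    "tool_executing": {"processing", "idle"},
    "responding": {"idle"},
}


def walk(start: str, targets: list):
    """Apply each requested transition; invalid ones leave the state unchanged."""
    state = start
    log = []
    for to in targets:
        if to in VALID.get(state, set()):
            log.append((state, to, "ok"))
            state = to
        else:
            log.append((state, to, "ignored"))
    return state, log


final, log = walk(
    "idle",
    ["responding", "processing", "tool_executing", "processing", "responding", "idle"],
)
for entry in log:
    print(entry)
```

The first request, idle -> responding, is logged as ignored and the remaining five succeed, ending back in idle. Since the agent loop sends transitions with host.send and does not wait for a reply, a rejected transition shows up only in the FSM's debug log and transition count, not in the chat response.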


Part 10: Multi-Agent Orchestration with Durable Checkpoints

The OrchestratorActor decomposes complex tasks and delegates each sub-task to the AgentActor. It uses the Workflow behavior, which checkpoints progress after each step:

# From examples/python/apps/miniclaw/orchestrator.py

@workflow_actor
class OrchestratorActor:
    """Durable workflow: decompose task -> delegate to agents -> aggregate results."""

    status: str = state(default="idle")
    task_id: str = state(default="")
    progress: int = state(default=0)

    @init_handler
    def on_init(self, config: dict) -> None:
        host.info(f"OrchestratorActor init actor_id={config.get('actor_id', '')}")

    @run_handler
    def run(self, payload: dict = None) -> dict:
        payload = payload or {}
        task = payload.get("task", "explain how agents work")
        task_id = payload.get("task_id", f"orch-{host.now_ms()}")

        self.status = "running"
        self.task_id = task_id
        self.progress = 0

        agent_id, err = pg_first("svc:agent")
        if err or not agent_id:
            self.status = "failed"
            return {"error": "no agents in svc:agent", "task_id": task_id}

        # Decompose: split on " and " for multi-step tasks
        lower = task.lower()
        idx = lower.find(" and ")
        sub_tasks = [task[:idx].strip(), task[idx + 5:].strip()] if idx >= 0 else [task]

        sub_results = []
        for i, sub_task in enumerate(sub_tasks):
            self.progress = (i + 1) * 100 // len(sub_tasks)
            resp = ask(agent_id, "chat",
                       {"message": sub_task, "session_id": f"orch-{task_id}-{i}"}, 15000)
            if not resp:
                self.status = "failed"
                return {"error": "sub-task failed", "task_id": task_id}
            # Checkpoint sub-result to TupleSpace — survives orchestrator crash
            host.ts.write(["orch_result", task_id, i, str(resp.get("response", ""))])
            sub_results.append(resp)

        summaries = [r.get("response", "") for r in sub_results if r.get("response")]
        self.status = "completed"
        self.progress = 100
        fire_audit("orchestrator_completed", f"task_id={task_id} subtasks={len(sub_tasks)}")
        return {
            "status": "ok",
            "task_id": task_id,
            "result": " | ".join(summaries),
            "sub_results": sub_results,
            "sub_tasks": len(sub_tasks),
        }

    @signal_handler("cancel")
    def cancel(self) -> None:
        self.status = "cancelled"
        host.info(f"Orchestrator cancelled task_id={self.task_id}")

    @query_handler("status")
    def query_status(self) -> dict:
        return {"task_id": self.task_id, "status": self.status, "progress": self.progress}

The @run_handler, @signal_handler, and @query_handler decorators map cleanly to the Workflow behavior’s three message types:

  • run: starts the workflow execution
  • signal: sends an out-of-band control message (e.g., cancellation mid-workflow)
  • query: reads durable workflow state without blocking the running workflow
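The decomposition step inside run is simple enough to test in isolation. The sketch below extracts the same logic into a standalone function; the name decompose is an assumption, and the behavior follows the listing above.

```python
def decompose(task: str) -> list:
    """Split on the first " and " (case-insensitive), as the orchestrator does."""
    idx = task.lower().find(" and ")
    if idx < 0:
        return [task]
    # len(" and ") == 5, so the second part starts at idx + 5.
    return [task[:idx].strip(), task[idx + 5:].strip()]


print(decompose("Search the weather and calculate 2+2"))
# ['Search the weather', 'calculate 2+2']
print(decompose("explain AI agents"))
# ['explain AI agents']
print(decompose("a and b and c"))
# ['a', 'b and c']
```

Only the first " and " is honored, so the workflow produces at most two sub-tasks. A production orchestrator would more likely ask the LLM for a plan, but the checkpoint-per-step structure stays the same.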

Part 11: Multi-App Deployments

In this example, all ten actors share a single WASM binary via ACTOR_REGISTRY:

# From examples/python/apps/miniclaw/miniclaw_actor.py
ACTOR_REGISTRY = {
    "llm_router":      LLMRouterActor,
    "tool_registry":   ToolRegistryActor,
    "agent":           AgentActor,
    "session_manager": SessionManagerActor,
    "orchestrator":    OrchestratorActor,
    "memory":          MemoryActor,
    "audit_event":     AuditEventActor,
    "agent_fsm":       AgentStateFSM,
    "task_queue":      TaskQueueActor,
    "health_monitor":  HealthMonitorActor,
}

This is convenient for development and single-tenant deployments. For enterprise multi-tenant deployments, you can split actors into separate applications to achieve stronger isolation:

  • llm-gateway/ – LLMRouterActor only (isolated credential management)
  • agent-app/ – AgentActor + SessionManagerActor (one app per tenant team)
  • tools-app/ – ToolRegistryActor + MemoryActor (shared tool catalog)
  • audit-app/ – AuditEventActor (compliance isolation)
  • infra-app/ – TaskQueueActor + HealthMonitorActor

In the multi-app model, each application gets its own Firecracker microVM in production, providing hardware-level tenant isolation. Actors across applications discover each other via process groups or the object registry, so the split changes only app-config.toml, not the actor implementations.

Plugins as Deployed Apps, Not Bundled Packages

OpenClaw’s post-mortem describes a painful middle state: too much moved toward plugins, while plugins were still bundled, repaired, and dependency-loaded in startup paths. This is the monolith decomposition trap: you split the code but not the process, so startup coupling survives the refactor.

PlexSpaces avoids this by treating plugins as deployed apps, not installed packages. A channel connector or a third-party memory backend is a separate app that exposes one or more actors. The agent loop discovers them the same way it discovers any other actor, for example via pg_first("svc:telegram-connector"), whether the actor runs locally or on a remote node. Adding a new integration means deploying a new app, not modifying package.json.

| OpenClaw pattern | PlexSpaces equivalent | What changes |
|---|---|---|
| Bundled channel plugins in core | Channel app deployed separately | Startup failure in the channel app doesn’t touch the agent loop |
| Shared node_modules dependency graph | Each app is its own WASM binary | Supply-chain compromise in one app’s deps can’t reach another app |
| Plugin repair at startup | Actor restarts via one_for_one supervisor | Only the failed actor restarts; the rest keep running |
| Hard to decompose after the fact | Actor boundaries are message contracts from day one | Moving an actor to its own app changes app-config.toml, not the actor code |

Part 12: Security Comparison: Actor Framework vs. Monolithic

| Security Property | OpenClaw / Monolithic | MiniClaw / Actor-Based |
|---|---|---|
| State isolation | Shared memory; one agent reads another’s state | Per-actor private state; accessible only through messages |
| Privilege boundary | Single process; tools share agent’s full permissions | WASM sandbox; actor can only call WIT-declared imports |
| Sandbox depth | OS process boundary only | WASM linear memory + Firecracker microVM hardware boundary |
| Tenant separation | Application-level checks; misconfiguration = data leak | Framework-enforced RequestContext; no bypass possible |
| Tool execution | In-process; tool crash = agent crash | Separate actor; tool crash triggers supervised restart |
| Secret management | os.environ shared across all tools | Actor-scoped KV; WASM has no env var access |
| Audit trail | Optional; must add per tool | Built-in @event_actor; captures all operations by default |
| Prompt injection blast radius | Full system access: files, network, memory | Confined to single actor’s WIT capabilities |
| Circuit breaker | Must implement per integration | Built into LLMRouterActor; state survives restarts |
| Crash recovery | Process restart; lose all in-flight state | Actor restart; resume from durability checkpoint |
| Quality validation | Hope the LLM got it right | Reflection loop + three-check guardrails + LLM-as-Judge |
| Failure detection | Uncaught exceptions; manual health checks | Monitor/link primitives; __DOWN__/__EXIT__ messages |
| Multi-tenant scaling | Shard by process; complex ops burden | Cellular architecture; independent failure domains |

Part 13: Running the Example

Build and Deploy

cd examples/python/apps/miniclaw
./build.sh                     # componentize-py -> WASM Component Model
./test.sh 8092                 # Deploy to running node and run full test suite

What the Test Script Validates

The test script exercises all ten actors end-to-end:

# Step 3: LLM Router — simulated chat + tool routing
ask "llm_router" '{"op":"chat_completion","messages":[{"role":"user","content":"Hello!"}],"tools":[]}'

# Step 5: Agent chat — full loop including tool use
ask "agent" '{"op":"chat","message":"Search for the weather in Paris","session_id":"test-sess-1"}'

# Step 9: Agent FSM — validate state transitions
ask "agent_fsm" '{"op":"transition","to":"processing"}'
ask "agent_fsm" '{"op":"transition","to":"responding"}'

# Step 10: Orchestrator workflow — durable multi-agent task
ask "orchestrator" '{"op":"workflow_run","task":"explain AI agents","task_id":"test-orch-1"}' 60
ask "orchestrator" '{"op":"workflow_query:status"}'

# Step 8: Task Queue — Channel-backed enqueue/dequeue/ack
ask "task_queue" '{"op":"enqueue","task_type":"send_email","payload":{"to":"bob@example.com"}}'
ask "task_queue" '{"op":"dequeue","limit":1}'
ask "task_queue" '{"op":"ack","msg_id":"..."}'

App Configuration

All ten actors are declared in app-config.toml. Each actor specifies its behavior_kind, role (used to select the right class from ACTOR_REGISTRY), and facets:

[[supervisor.children]]
name = "agent"
actor_type = "miniclaw_wasm"
role = "agent"
behavior_kind = "GenServer"
args = { role = "agent", agent_name = "general-assistant", system_prompt = "You are a helpful AI assistant with access to tools." }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "10m", activation_strategy = "eager" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 3 } }
]

[[supervisor.children]]
name = "orchestrator"
actor_type = "miniclaw_wasm"
role = "orchestrator"
behavior_kind = "Workflow"            # Enables @run_handler, @signal_handler, @query_handler
args = { role = "orchestrator" }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "10m", activation_strategy = "lazy" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 5 } }
]

[[supervisor.children]]
name = "agent_fsm"
actor_type = "miniclaw_wasm"
role = "agent_fsm"
behavior_kind = "GenFSM"              # Enables @fsm_actor state machine behavior
args = { role = "agent_fsm" }
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "30m", activation_strategy = "lazy" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 1 } }
]
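To make the role lookup concrete, here is a minimal sketch of how a role string from the config could select a class from a registry. The placeholder classes and the resolve_actor_class function are assumptions for illustration; the actual dispatch inside the PlexSpaces runtime is not shown in this post and may differ.

```python
class AgentActor:
    """Placeholder standing in for the real actor class."""


class OrchestratorActor:
    """Placeholder standing in for the real actor class."""


# Trimmed-down registry; the real one maps all ten roles.
ACTOR_REGISTRY = {
    "agent": AgentActor,
    "orchestrator": OrchestratorActor,
}


def resolve_actor_class(config: dict):
    """Pick the actor class for the role named in config["args"]["role"]."""
    role = config.get("args", {}).get("role", "")
    try:
        return ACTOR_REGISTRY[role]
    except KeyError:
        known = ", ".join(sorted(ACTOR_REGISTRY))
        raise ValueError(f"unknown role {role!r}; expected one of: {known}") from None


cls = resolve_actor_class({"args": {"role": "orchestrator"}})
print(cls.__name__)  # OrchestratorActor
```

Failing loudly on an unknown role is a deliberate choice in this sketch: a typo in app-config.toml then surfaces at startup instead of producing an actor that silently ignores every message.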

The Isolation Ladder

Not every deployment needs a Firecracker VM, but every production agent system should reason explicitly about which isolation layer each component requires. MiniClaw provides a progression:

| Layer | Mechanism | What it contains |
|---|---|---|
| Message isolation | Actor private state; all access via host.ask/send | Cross-agent state reads; accidental coupling through shared memory |
| Tenant isolation | RequestContext JWT enforced by the framework | Cross-tenant KV, TupleSpace, and process group access |
| App isolation | Separate deployed apps; independent startup paths | Startup coupling; plugin dependency repair contagion across integrations |
| WASM isolation | WIT import surface; per-actor linear memory | Supply-chain attacks; filesystem, env, and exec access |
| Firecracker/Docker isolation | VM boundary per tenant | WASM JIT escape; cross-tenant kernel syscall surface |

The same actor code runs at every level. The app-config.toml determines which layers are active for a given deployment. Development runs message isolation only. A single-tenant production deployment adds WASM. A multi-tenant enterprise deployment adds Firecracker/Docker.


Conclusion

MiniClaw is not a finished enterprise agent platform. It is a small proof of concept that demonstrates a different foundation for one. The important lesson is not that every agent system needs these exact ten actors. The lesson is that agent runtimes benefit when isolation, supervision, explicit messaging, durable state, scoped memory, audit, and tenant boundaries are part of the architecture from the beginning. A monolithic agent loop is easy to start with but hard to harden later. MiniClaw takes the opposite path: split the runtime into small actors, constrain what each one can access, supervise it when it fails, and communicate only through explicit messages. Each actor owns one responsibility: routing LLM calls, managing tools, storing session metadata, persisting memory, recording audit events, coordinating workflows, or monitoring health.

MiniClaw is implemented with PlexSpaces, which provides runtime primitives such as KV, TupleSpace, Channels, timers, workflows, GenEvent, and GenFSM. It adds fault tolerance, observability, tenant isolation, authentication, rate limiting, circuit breakers, backpressure, and sandboxed execution via WebAssembly and Firecracker. This POC demonstrates the shape of the solution:

  • AgentActor models the bounded agent loop: user message -> LLM -> tool call -> repeat -> final response.
  • LLMRouterActor defines the model boundary, using a simulator where production code would call OpenAI, Anthropic, Bedrock, Gemini, or an internal model.
  • ToolRegistryActor centralizes tool registration and dispatch.
  • SessionManagerActor stores session metadata in KV.
  • MemoryActor demonstrates global, agent, and session-scoped memory.
  • AuditEventActor records non-blocking audit events through GenEvent-style fire-and-forget messaging.
  • AgentStateFSM makes lifecycle transitions explicit.
  • TaskQueueActor shows durable background work through channels.
  • HealthMonitorActor polls service-group health using actor timers.
  • OrchestratorActor demonstrates workflow-style task decomposition and result aggregation.

A production MiniClaw would harden the implementation with the following:

  • strict tenant, user, session, and tool authorization on every message;
  • a safe evaluator such as asteval; the WASM sandbox reduces but does not eliminate the risk;
  • one actor instance per tenant/session or explicit session-partitioned state;
  • schema validation before tool execution;
  • idempotent task queue processing;
  • hardened tool execution with separate sandboxed tool actors for high-risk tools;
  • real LLM provider integration with retries, budgets, timeouts, backoff, and circuit breakers;
  • prompt-injection detection, output validation, and optional LLM-as-judge actors;
  • stronger memory governance, including TTLs, redaction, encryption, and deletion semantics;
  • structured audit trails with retention policies and tamper-resistant storage;
  • crash-recovery tests, chaos testing, and cross-tenant isolation tests;
  • deployment hardening for secrets, networking, service links, and Firecracker isolation.

For teams building enterprise AI agents, the real question is not whether they need isolation, auditability, tenant boundaries, tool governance, and failure recovery. They do. The question is whether they bolt those properties onto a monolithic agent process later, or start with a runtime where those properties are first-class primitives.


The full source, including the Go and Python implementations, is at github.com/bhatti/PlexSpaces.


April 26, 2026

20+ Production Patterns for Distributed AI Agents Using Actors and TupleSpaces

Filed under: Computing,Concurrency,Erlang,GO — admin @ 12:37 pm

Introduction

I have been building agentic AI applications for a couple of years and have shared some of the learnings (see previous blogs at the end). In most cases, I used Python with the LangChain and LangGraph frameworks because they integrate with local and cloud-based LLM providers. However, the real challenge isn’t building one AI agent. It’s running 10,000 of them reliably, across teams, across nodes, without one team’s runaway model budget crashing another’s pipeline. This post is about that other problem: the infrastructure problem, which is fundamentally a distributed systems problem.

Most AI frameworks don’t even acknowledge that coordinating agents at large scale is a distributed systems problem (see the FLP theorem and the Byzantine Generals bound). You cannot engineer your way out of these constraints with better prompts or better models. You need explicit coordination protocols, failure detection, and external validation, which are at the heart of distributed systems. This is where the actor model comes in. Actors have been a core abstraction for distributed computing since the 1970s and map naturally onto agents. I first learned about actors and the Linda memory model back in college during my post-doc research in distributed systems and used them to build frameworks for solving computational problems in HPC at scale. Actors provide the coordination substrate that makes distributed agent systems provably safer:

  • Isolated state: there is no shared memory to corrupt, so a misinterpreting agent cannot corrupt another agent’s state.
  • Message passing: makes coordination explicit and auditable without shared memory or locks.
  • Supervision trees: give you crash detection and recovery. When an agent fails (Byzantine or otherwise), the supervisor restarts it, links can propagate failures, and monitors can trigger compensating actions.
  • Durable state: with the durability facet, consensus progress survives node crashes.
  • TupleSpace coordination: gives you Linda-model consensus patterns without deadlock: write-once slots, pattern-matched reads, and blocking takes, which are the building blocks of coordination protocols.

Every major AI framework today picks one problem and solves it well. For example, LangChain gives you chains, AutoGen gives you multi-agent conversations, and Ray gives you distributed compute. But when you need all of these together (stateful agents, distributed execution, durable pipelines, multi-tenant isolation, MCP tool calling, AllReduce gradient synchronization, and the coordination substrate that makes distributed agents safe), you have to stitch together five systems. I wrote the PlexSpaces actor system to solve scalable computational problems. It treats each agent as an actor: isolated state, message-driven communication, location-transparent routing, and built-in fault tolerance. The framework supports polyglot development, so applications can be written in Python, Go, Rust, or TypeScript. This post shows how to implement AI workload patterns concretely. For the theory behind why the actor model fits AI workloads so naturally, see my earlier post on PlexSpaces foundations. For the polyglot WASM runtime that makes four-language deployment possible, see the WebAssembly deep-dive. This post is about AI agent patterns specifically.


Part 1: Why Actors Are the Right Foundation for Distributed Agents

1.1 The Actor-Agent Isomorphism

An LLM agent has four things: state (conversation history, tool results), a processing loop (receive message -> reason -> act), communication (call tools, delegate to other agents), and failure modes (timeouts, hallucinations, rate limits). An actor has exactly the same structure. This isn’t a coincidence. Both actors and agents are inspired by the same computational model: isolated units of stateful computation that communicate by passing messages. Here is a Python research agent in 18 lines:

# examples/python/apps/a2a_multi_agent — ResearchAgent pattern
@actor(facets=["virtual_actor", "durability"])
class ResearchAgent:
    """Each actor IS an agent: isolated state + message-driven + fault-tolerant."""
    history: list = state(default_factory=list)
    queries_handled: int = state(default=0)
    agent_id: str = state(default="")

    @init_handler
    def on_init(self, config: dict) -> None:
        self.agent_id = config.get("actor_id", "")
        # Register in service registry — write-once so supervisor instance wins
        _ts_register_service("research", self.agent_id)

    @handler("research")
    def research(self, query: str = "", from_actor: str = "") -> dict:
        self.queries_handled += 1
        self.history.append({"query": query, "ts": host.now_ms()})
        return {"result": f"Research result for: {query}", "agent_id": self.agent_id}

The @actor decorator registers this as a GenServer actor. The durability facet checkpoints state automatically, so if the node crashes mid-query, the agent resumes from the last checkpoint. The virtual_actor facet activates the agent on demand and deactivates it when idle, so you pay nothing at rest.

Notice _ts_register_service("research", self.agent_id): this is the TupleSpace write-once service registry pattern. The first instance to call this writes the slot. Any subsequent instance finds the slot already taken and skips registration. This is how you implement safe service discovery without process groups that generate noisy warnings or risk routing to the wrong instance.

Agentic coding naturally favors small, composable actors: a researcher, an analyzer, and a writer, each focused on one capability and composed via message passing. The Go a2a_multi_agent example makes this concrete: four actors (registry, researcher, analyzer, writer) each do one thing and delegate everything else.

1.2 The Distributed Consensus Problem in Multi-Agent Systems

When you run multiple LLM agents in parallel, whether to speed up a complex coding task, to parallelize a RAG pipeline, or to run specialist agents for different subtasks, you are building a distributed system. And distributed systems have properties that no amount of LLM capability improvement will change. Consider a prompt: “Build a REST API for user management with authentication.” This prompt is underspecified. It leaves at least these design choices open:

  • JWT vs session-based auth
  • REST vs GraphQL
  • PostgreSQL vs MongoDB
  • Monolith vs microservices

If you run four parallel agents on this prompt and each picks a different interpretation, you don’t get a coherent system; you get four incompatible subsystems. At ten agents this is a debugging problem. At ten thousand agents running across twenty nodes, this is a production incident at 3 AM. The agents must coordinate their design choices. That coordination is a consensus problem.

  • FLP Theorem: If agents communicate asynchronously (messages may be delayed arbitrarily) and any agent can crash (network failure, context limit, rate limiting), then no deterministic protocol can guarantee both safety (all agents agree on correct output) and liveness (the system eventually produces output).
  • Byzantine bound: Treat a misinterpreting agent as a Byzantine node: it sends plausible-looking messages with incorrect content. Correct consensus requires fewer than 1/3 of agents to be Byzantine. With ten agents that means at most three; if four of your ten agents hallucinate an incompatible API shape, you may not be able to reach correct consensus at all.
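
The arithmetic behind the Byzantine bound is easy to get wrong at the boundary, so here is a minimal sketch of it. It assumes the classic formulation n >= 3f + 1 for n agents and f Byzantine agents; the function names are mine, not part of PlexSpaces.

```python
def max_byzantine_tolerated(n: int) -> int:
    """Largest f such that n >= 3f + 1 (classic Byzantine agreement bound)."""
    if n < 1:
        raise ValueError("need at least one agent")
    return (n - 1) // 3


def within_bound(n: int, faulty: int) -> bool:
    """True when the number of faulty agents stays strictly below n / 3."""
    return faulty <= max_byzantine_tolerated(n)


for n in (4, 7, 10, 100):
    print(n, "agents tolerate", max_byzantine_tolerated(n), "Byzantine agents")
```

For ten agents the bound is three faulty agents; a fourth pushes the faulty share past one third.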

What follows from this:

  1. External validation (tests, type checking, static analysis) converts silent misinterpretations into detectable failures, i.e., Byzantine nodes become crash-detectable nodes, which is a strictly easier problem to solve.
  2. Explicit coordination protocols (not “talk to each other until you agree”) give you provable properties.
  3. Liveness requires failure detection. An agent that has crashed must be detected and either recovered or bypassed.
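
To make the first point concrete, here is a small sketch of external validation for agent-generated Python code. It only uses the standard ast module; the function name and the idea of checking for a required set of function names are illustrative assumptions, not the ValidatorActor implementation.

```python
import ast


def validation_errors(source: str, required_functions: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # A syntax error is a loud, detectable failure rather than a silent one.
        return [f"syntax error: {exc.msg}"]
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    missing = sorted(required_functions - defined)
    return [f"missing function: {name}" for name in missing]


agent_output = "def create_user(name):\n    return {'name': name}\n"
print(validation_errors(agent_output, {"create_user", "login"}))
```

An agent whose output fails the check can be treated like a crashed agent: reject the result and retry or route around it.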

PlexSpaces provides all three, baked into the actor model:

Distributed Systems Need | PlexSpaces Mechanism
Failure detection | host.monitor(actorID): get notified when an actor dies
Crash recovery | Supervisor tree: automatic restart with configurable strategy
Coordination protocol | TupleSpace write-once slots with explicit, auditable coordination
External validation | ValidatorActor pattern with external check before accepting output
Byzantine isolation | Per-actor isolated state so that a misinterpreting actor cannot corrupt others
Liveness under crashes | durability facet so that progress survives node restarts

1.3 Failure Detection and Liveness: host.monitor()

Agents need “liveness-checking tools for better fault detection.” In PlexSpaces, these are host.monitor() and host.link(), following Erlang’s location-transparent supervision philosophy.

  • Monitor: any actor watches any other. When the monitored actor stops, the monitoring actor receives __DOWN__ in its mailbox and stays alive. The monitor_ref returned by host.monitor() lets you cancel the watch with host.demonitor().
  • Link: bidirectional fate-sharing. __EXIT__ is delivered only on abnormal exits (error, kill). Normal shutdown does not cascade. Use host.unlink() before graceful shutdown to avoid spurious propagation.

The example below is from examples/python/apps/ai_monitor_link_supervision:

# examples/python/apps/ai_monitor_link_supervision/ai_monitor_link_actor.py

@gen_server_actor
class ValidatorAgent:
    """Monitors workers; detects Byzantine faults; alerts at the 1/3 Byzantine bound (stored in FLP_THRESHOLD)."""
    monitor_refs: dict = state(default_factory=dict)   # worker_id -> monitor_ref
    down_events: list = state(default_factory=list)
    byzantine_count: int = state(default=0)
    total_validations: int = state(default=0)
    FLP_THRESHOLD = 1.0 / 3.0

    @handler("__DOWN__", "cast")
    def on_down(self, monitor_ref: str = "", down_from: str = "", down_reason: str = "") -> None:
        """Monitored worker stopped — one-way notification. ValidatorAgent stays alive.
        
        DOWN fires on ANY exit: normal, error, shutdown, kill. The monitoring actor
        decides what to do — this is Akka Death Watch semantics, not Erlang trap_exit.
        """
        self.down_events.append({"down_from": down_from, "down_reason": down_reason})
        # Remove stale watch entry so we don't leak monitor refs
        for wid, ref in list(self.monitor_refs.items()):
            if ref == monitor_ref:
                del self.monitor_refs[wid]
                break

    @handler("monitor_worker")
    def on_monitor_worker(self, worker_id: str = "") -> dict:
        """One-way watch. Returns monitor_ref for future demonitor() call."""
        monitor_ref = host.monitor(worker_id)
        self.monitor_refs[worker_id] = monitor_ref
        return {"status": "ok", "monitor_ref": monitor_ref}

    @handler("demonitor_worker")
    def on_demonitor_worker(self, worker_id: str = "") -> dict:
        """Cancel watch — used when gracefully replacing a worker."""
        ref = self.monitor_refs.pop(worker_id, None)
        if ref:
            host.demonitor(ref)   # idempotent: safe to call multiple times
        return {"status": "ok", "worker_id": worker_id}

    @handler("validate")
    def on_validate(self, result: str = "", worker_id: str = "") -> dict:
        """Apply the 1/3 Byzantine threshold: >= 1/3 of results flagged -> alert.
        
        The 1/3 figure comes from the Byzantine bound. FLP is the related result:
        no deterministic async protocol can guarantee both safety and liveness
        with even one crash. Monitors give us the failure signal; this threshold
        decides when to escalate.
        """
        self.total_validations += 1
        is_byzantine = any(p in result.lower() for p in ["42 is the answer", "null", "checkpoint corrupted"])
        if is_byzantine:
            self.byzantine_count += 1
        flp_ratio = self.byzantine_count / self.total_validations if self.total_validations else 0.0
        return {"valid": not is_byzantine, "flp_threshold_exceeded": flp_ratio >= self.FLP_THRESHOLD}


@gen_server_actor
class InferenceWorker:
    """LLM inference worker. Uses host.link() for bidirectional fate-sharing with peer workers."""
    linked_peers: list = state(default_factory=list)

    @handler("__EXIT__", "cast")
    def on_exit(self, exit_from: str = "", exit_reason: str = "") -> None:
        """Linked peer died abnormally — clean up and continue.
        
        __EXIT__ fires ONLY on abnormal exits (error, kill). Normal shutdown does
        NOT propagate — use host.unlink() before graceful shutdown to prevent cascade.
        """
        if exit_from in self.linked_peers:
            self.linked_peers.remove(exit_from)

    @handler("link_with")
    def on_link_with(self, peer_id: str = "") -> dict:
        host.link(peer_id)          # bidirectional: if either dies abnormally, other gets __EXIT__
        self.linked_peers.append(peer_id)
        return {"status": "ok", "peer_id": peer_id}

    @handler("unlink_from")
    def on_unlink_from(self, peer_id: str = "") -> dict:
        host.unlink(peer_id)        # decouple before graceful shutdown — no cascade
        self.linked_peers = [p for p in self.linked_peers if p != peer_id]
        return {"status": "ok", "peer_id": peer_id}

This is liveness management at the actor level. The ValidatorAgent stays alive even when a worker crashes, because __DOWN__ is informational, not fatal. The InferenceWorker handles __EXIT__ only from abnormal peer failures; normal shutdowns don’t cascade because the supervisor calls unlink_from first.

The down_from / down_reason header names match the create_down_message wire format used by every PlexSpaces node. The same pattern works identically in Go, TypeScript, and Rust WASM (see examples/*/apps/ai_monitor_link_supervision for all four languages).
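
The difference between the two signals is easiest to see in a toy model. The sketch below is a single-process simulation under the semantics described above (__DOWN__ on any exit, __EXIT__ only on abnormal exit); ToyRuntime and ToyActor are hypothetical names and do not reflect how the PlexSpaces host is implemented.

```python
from dataclasses import dataclass, field

ABNORMAL_REASONS = {"error", "kill"}


@dataclass
class ToyActor:
    name: str
    mailbox: list = field(default_factory=list)


class ToyRuntime:
    def __init__(self) -> None:
        self.actors: dict[str, ToyActor] = {}
        self.monitors: dict[str, set[str]] = {}  # target -> watchers
        self.links: dict[str, set[str]] = {}     # actor -> linked peers

    def spawn(self, name: str) -> ToyActor:
        self.actors[name] = ToyActor(name)
        return self.actors[name]

    def monitor(self, watcher: str, target: str) -> None:
        self.monitors.setdefault(target, set()).add(watcher)  # one-way

    def link(self, a: str, b: str) -> None:
        self.links.setdefault(a, set()).add(b)  # bidirectional
        self.links.setdefault(b, set()).add(a)

    def stop(self, name: str, reason: str = "normal") -> None:
        self.actors.pop(name)
        # Monitors are notified on every exit, whatever the reason.
        for watcher in self.monitors.pop(name, set()):
            if watcher in self.actors:
                self.actors[watcher].mailbox.append(("__DOWN__", name, reason))
        # Linked peers are notified only on abnormal exits.
        for peer in self.links.pop(name, set()):
            self.links.get(peer, set()).discard(name)
            if reason in ABNORMAL_REASONS and peer in self.actors:
                self.actors[peer].mailbox.append(("__EXIT__", name, reason))
```

Stopping a monitored and linked worker with reason "normal" puts a __DOWN__ in the watcher's mailbox and nothing in the peer's; stopping it with "error" delivers both.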

1.4 Four Behaviors, Four Agent Archetypes

PlexSpaces provides four behavior types, each mapping naturally to a class of AI agent:

Behavior | Decorator | Agent Archetype | Example
GenServer | @actor | Tool executor, stateful helper | Search agent, RAG retriever
GenEvent | @event_actor | Audit logger, event publisher | Usage tracker, metrics collector
GenFSM | @fsm_actor | State-machine agent | Circuit breaker, quality gate, budget guard
Workflow | @workflow_actor | Orchestrator agent | Multi-step pipeline, RAG workflow, agentic loop

The TypeScript llm_workflow_orchestrator uses all four. The QualityFSMActor implements a quality gate with five states:

// From llm_workflow_orchestrator_actor.ts
class QualityFSMActor extends PlexSpacesActor<QualityFSMState> {
  getDefaultState(): QualityFSMState {
    return { actorId: "", fsmState: "pending", attempts: 0, lastScore: 0 };
  }

  onEvaluate(payload: Record<string, unknown>): Record<string, unknown> {
    const score = Number(payload.score ?? 0);
    this.state.attempts++;
    this.state.lastScore = score;
    if (score >= 8) {
      this.state.fsmState = "approved";
    } else if (score >= 6) {
      this.state.fsmState = this.state.attempts >= 3 ? "escalated" : "evaluating";
    } else {
      this.state.fsmState = this.state.attempts >= 3 ? "rejected" : "evaluating";
    }
    return { state: this.state.fsmState, score, attempts: this.state.attempts };
  }
}
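
The same transition rule can be restated in Python as a pure function, which makes the gate easy to test without the runtime. The function name and the max_attempts parameter are my own; the thresholds 8, 6, and 3 come from the TypeScript above.

```python
def next_quality_state(
    score: float, attempts_so_far: int, max_attempts: int = 3
) -> tuple[str, int]:
    """Return (new_state, attempts) after one evaluation."""
    attempts = attempts_so_far + 1
    if score >= 8:
        return "approved", attempts
    exhausted = attempts >= max_attempts
    if score >= 6:
        # Borderline quality: keep trying, then hand off to a human or a stronger model.
        return ("escalated" if exhausted else "evaluating"), attempts
    # Low quality: keep trying, then give up.
    return ("rejected" if exhausted else "evaluating"), attempts


state, attempts = "pending", 0
for score in (5, 7, 7):
    state, attempts = next_quality_state(score, attempts)
print(state, attempts)
```

Keeping the rule free of side effects means the actor only has to store the returned state and attempt count.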

The PipelineAuditActor uses GenEvent semantics, fire-and-forget, no reply needed:

// Fire-and-forget handler: cast (no return value)
onPipeline_step_completed(payload: Record<string, unknown>): void {
  this.state.eventsReceived++;
  this.state.lastEvent = payload;
  host.applicationMetricsAdd(this.state.actorId || "llm-orchestrator", {
    message_count: 1,
    counter_metrics: { pipeline_events: 1 },
  });
}

These two actors require zero changes to the orchestrator logic. They attach via config.

1.5 Facets: Cross-Cutting Agent Capabilities

Facets are the key architectural insight. They are pluggable capabilities that attach to actors at deployment time without code changes in the actor handler logic.

Facet | Agent Benefit | Distributed Systems Guarantee
virtual_actor | Activates on demand, deactivates when idle | Prevents unbounded resource consumption
durability | Survives node restarts, state checkpointed automatically | Progress preservation across crashes (liveness)
timer | Schedules follow-up actions, heartbeats, budget reviews | Timeout detection for hung agents
metrics | Every interaction auto-instrumented in Prometheus | Observability for failure detection
caching | Memoize expensive LLM calls, skip redundant computation | Reduces cost of Byzantine retries

The updated app-config.toml for llm_workflow_orchestrator shows facets composing via config:

[[supervisor.children]]
id = "quality_fsm"
type = "quality_fsm"
behavior_kind = "GenFSM"
facets = [
  { type = "virtual_actor", priority = 100, config = { idle_timeout = "30m", activation_strategy = "lazy" } },
  { type = "durability", priority = 90, config = { checkpoint_interval = 1 } }
]

The quality FSM now checkpoints after every state transition (checkpoint_interval = 1) and deactivates after 30 minutes of inactivity. Zero lines changed in QualityFSMActor. That is the point: the business logic and the operational logic stay separate.

1.6 TupleSpace: Safe Coordination Without Race Conditions

The FLP theorem says you cannot guarantee both safety and liveness in an asynchronous system where even one process may crash. But you can get very close by using the right coordination primitive. TupleSpace implements the Linda coordination model: write tuples, read them by pattern match, and take them (destructive read). That is three operations, with no explicit locks in actor code. Write-once slots give you safe service registration across concurrent actor instances:

// Go SDK — TupleSpace write-once service registration
// (from resource_aware_inference_actor.go and a2a_multi_agent_actor.go)
func tsRegisterService(serviceType, actorID string) {
    // Read first — if entry exists, skip (write-once semantics)
    if _, ok := host.TS().Read([]any{"svc", serviceType, nil}); !ok {
        host.TS().Write([]any{"svc", serviceType, actorID})
    }
}

func tsDiscoverService(serviceType string) (string, error) {
    tup, ok := host.TS().Read([]any{"svc", serviceType, nil})
    if !ok || len(tup) < 3 {
        return "", fmt.Errorf("service %q not registered", serviceType)
    }
    return tup[2].(string), nil
}

// TypeScript SDK: same pattern
function tsRegisterService(serviceType: string, actorId: string): void {
  const existing = host.ts.read(["svc", serviceType, null]);
  if (!existing) {
    host.ts.write(["svc", serviceType, actorId]);
  }
}

function tsDiscoverService(serviceType: string): string | null {
  const tup = host.ts.read(["svc", serviceType, null]);
  return (tup && tup.length >= 3) ? String(tup[2]) : null;
}

# Python SDK: same pattern
def _ts_register_service(service_type: str, actor_id: str) -> None:
    existing = host.ts_read(["svc", service_type, None])
    if not existing:
        host.ts_write(["svc", service_type, actor_id])

def _ts_discover_service(service_type: str) -> str | None:
    tup = host.ts_read(["svc", service_type, None])
    return tup[2] if tup and len(tup) >= 3 else None

The framework uses WASM re-instantiation to speed up actor startup (compile once, instantiate from cached binary). During the re-instantiation window, a new HTTP request can activate a second instance of the same actor type via virtual_actor. If both instances join a process group, pgFirst() returns non-deterministically. We saw this cause budget_exceeded errors in resource_aware_inference when the routing workflow asked the budget manager for remaining balance and got the empty virtual_actor instance that had never been initialized with budget data. TupleSpace write-once registration solves this:

  1. The supervisor-spawned instance calls tsRegisterService("budget_manager", myID) on Init and writes the slot.
  2. The virtual_actor instance calls tsRegisterService("budget_manager", myID2) on Init, finds the slot taken, and skips registration.
  3. The routing workflow calls tsDiscoverService("budget_manager") and always gets the supervisor-spawned instance.
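
The three steps above can be replayed against a toy, in-memory tuple space. This is a single-threaded sketch with hypothetical names (ToyTupleSpace, register_service, discover_service); it models only the read/write/take semantics with None as a wildcard. In this toy the read-then-write pair cannot interleave with another caller; whether the real host serializes the two calls across instances is an assumption you should verify.

```python
class ToyTupleSpace:
    """In-memory stand-in: None in a pattern matches any field."""

    def __init__(self) -> None:
        self._tuples: list[list] = []

    @staticmethod
    def _matches(pattern: list, tup: list) -> bool:
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def write(self, tup: list) -> None:
        self._tuples.append(list(tup))

    def read(self, pattern: list):
        """Non-destructive: return the first matching tuple or None."""
        return next((t for t in self._tuples if self._matches(pattern, t)), None)

    def take(self, pattern: list):
        """Destructive read: remove and return the first matching tuple."""
        tup = self.read(pattern)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


def register_service(ts: ToyTupleSpace, service_type: str, actor_id: str) -> bool:
    """Write-once: only the first caller for a service type wins."""
    if ts.read(["svc", service_type, None]) is None:
        ts.write(["svc", service_type, actor_id])
        return True
    return False


def discover_service(ts: ToyTupleSpace, service_type: str):
    tup = ts.read(["svc", service_type, None])
    return tup[2] if tup else None


ts = ToyTupleSpace()
register_service(ts, "budget_manager", "supervisor-instance")  # step 1: writes
register_service(ts, "budget_manager", "virtual-instance")     # step 2: skips
print(discover_service(ts, "budget_manager"))                  # step 3
```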

For shared state (like budget totals that all instances should see), store the data in TupleSpace too:

// BudgetManagerActor — state in TupleSpace, not per-actor KV
// Both the supervisor-spawned and any virtual_actor instance read the same data
func (b *BudgetManagerActor) tsReadBudgetFloat(prefix, tenantID string) float64 {
    tup, ok := host.TS().Read([]any{prefix, tenantID, nil})
    if !ok || len(tup) < 3 { return 0 }
    var v float64
    fmt.Sscanf(fmt.Sprint(tup[2]), "%f", &v)
    return v
}

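func (b *BudgetManagerActor) tsWriteBudgetFloat(prefix, tenantID string, value float64) {
    // Take and Write are two separate calls: a reader that runs between them
    // finds no tuple, and tsReadBudgetFloat then reports 0.
    host.TS().Take([]any{prefix, tenantID, nil}) // remove old value
    host.TS().Write([]any{prefix, tenantID, fmt.Sprintf("%f", value)}) // write new
}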

This is the coordination protocol the FLP analysis demands: explicit, auditable, shared state managed through a primitive that has no locks and no deadlock risk.


Part 2: Platform Capabilities

2.1 WAR-File like Deployment: Multiple AI Apps Per Node

PlexSpaces nodes are application servers for WASM actors, much like JBoss is for WAR files, but for AI agents. Each team deploys an independent application (a .wasm binary + a config file) to the same node. Applications share the runtime but have isolated namespaces, actor registries, and tenant contexts.

# Deploy RAG pipeline from Search team
plexspaces deploy --app rag-pipeline --wasm rag.wasm --config rag-config.toml

# Deploy inference server from ML team — same node, independent lifecycle
plexspaces deploy --app inference-server --wasm inference.wasm --config inference-config.toml

# Deploy agent orchestrator from Platform team — same node
plexspaces deploy --app agent-orchestrator --wasm orchestrator.wasm --config orchestrator-config.toml

Each application has its own supervisor tree, its own actor namespace, and its own failure isolation. The ML team’s inference workers crashing doesn’t touch the Search team’s RAG pipeline.

2.2 Node Communication with Location-Transparent Messaging

Actors on different nodes message each other with the same API as local actors. When OrchestratorAgent calls host.Ask(researchAgentID, "research", ...), the framework routes transparently: to the local mailbox if the target is on the same node, or over gRPC if it’s on a different node. The calling actor never knows the difference.

// From a2a_multi_agent_actor.go — OrchestratorAgent
// This call works whether researchAgent is local or 3 nodes away.
researchResp, err := host.Ask(researchAgentID, "research", map[string]any{
    "topic": task, "depth": 1,
}, 10000)
// No service discovery config. No DNS lookup. No circuit breaker setup.
// The framework handles routing, retries, and failover.

SWIM gossip propagates node membership in real time. When a new node joins, actors on existing nodes can immediately message actors on the new node. This makes multi-node agent deployments trivial. The a2a_multi_agent example deploys four specialist agents, each potentially on different nodes, and the orchestrator coordinates them with the same host.Ask() calls used for local agents.

2.3 Multi-Tenancy with AuthN/AuthZ

Every host.Ask() call carries a RequestContext with tenant_id and namespace. You cannot bypass it. The Python MCPGatewayWorkflow additionally enforces the tenant boundary at the application layer:

# From mcp_tool_server_actor.py — MCPGatewayWorkflow.start()
# JWT carries tenant_id — enforced at every Ask() boundary
tenant = request.get("tenant", "")
if tenant:
    self_ns = actor_application_id(self.actor_id)
    if self_ns and tenant != self_ns:
        return {
            "jsonrpc": "2.0", "id": request_id,
            "error": {"code": -32600,
                      "message": f"Tenant mismatch: '{tenant}' — access denied"},
        }
# Pass tenant context downstream — research agent sees the same tenant_id
result = host.ask("tool_registry", "tools_call", {
    "tool_name": tool_name, "input": params.get("arguments", {}),
    "tenant": tenant,  # propagated through the call chain
}, timeout_ms=15000)

The application_metrics_add() call in every actor automatically tags metrics by actor ID, which includes the application namespace. Prometheus metrics are naturally scoped to tenant. JWT validation, namespace isolation, and metric scoping all happen at the framework level.
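
The application-layer check can be isolated into a pure function for testing. This is a sketch, not the framework's code: check_tenant and its strict flag are hypothetical. Note that in the excerpt above a request with no tenant field skips the check; the strict flag shows one way to reject such requests instead. Error code -32600 is JSON-RPC 2.0 "Invalid Request", as used in the excerpt.

```python
def _error(request: dict, message: str) -> dict:
    return {
        "jsonrpc": "2.0",
        "id": request.get("id", 0),
        "error": {"code": -32600, "message": message},
    }


def check_tenant(request: dict, actor_namespace: str, strict: bool = False):
    """Return a JSON-RPC error dict when the tenant check fails, else None."""
    tenant = request.get("tenant", "")
    if not tenant:
        # Mirrors the excerpt by default: no tenant field means no check.
        return _error(request, "Missing tenant: access denied") if strict else None
    if actor_namespace and tenant != actor_namespace:
        return _error(request, f"Tenant mismatch: '{tenant}': access denied")
    return None


print(check_tenant({"id": 7, "tenant": "team-b"}, "team-a"))
```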

2.4 The Primitive Stack — Everything You Need, Nothing You Don’t

Every pattern in this post builds on one or more of these primitives. All are available in every language. All are accessible via the same host.* API from any actor regardless of language or location.

Primitive | What It Does | AI Agent Use Case | HPC/ML Analog
Shard Group | Partition data across N actors; scatter-gather with aggregation | Parallel RAG retrieval, distributed inference | Ray map_batches(), Spark partitions
Worker Pool | Stateless actor pool with load balancing | Burst inference capacity, tool execution | Ray remote functions, Lambda concurrency
Process Group | Dynamic membership; broadcast to all members | Config updates to all inference workers | MPI communicator, Gloo process group
TupleSpace | Pattern-matched shared memory; Linda-model coordination | Service registry, task result sharing, consensus | MPI ghost cell exchange, barrier sync
Channels | Queue-based stage coupling; 6 backends (Kafka, Redis, SQS, PG, …) | Async pipeline stages, event streaming | Kafka, SQS, RabbitMQ
Workflow Actor | Multi-step durable orchestration; pause/resume/cancel | RAG pipeline, agent orchestration | Airflow DAG, Temporal workflow
Distributed Lock | Lease-based mutual exclusion across actors | Model weight update, index rebuild | ZooKeeper, Redis Redlock
Blob Storage | Large binary payloads (embeddings, model weights) | Embedding cache, model artifact store | S3, HDFS
Broadcast | Send data to all actors in a process group | Push config updates to all workers | MPI_Bcast
Collective Reduce | Sum/min/max across all actors; return to coordinator | Aggregate inference metrics | MPI_Allreduce
Scatter/Gather | Fan-out to N workers, fan-in aggregated results | Parallel document search, batch inference | MPI_Scatter + MPI_Gather
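
As an illustration of the Scatter/Gather row, here is a minimal sketch of the shape of the pattern using only Python's standard library. It uses a thread pool as a stand-in for worker actors; scatter_gather and keyword_search are hypothetical helpers, not the PlexSpaces API.

```python
from concurrent.futures import ThreadPoolExecutor


def keyword_search(shard: list[str], query: str) -> list[tuple[int, str]]:
    """Score each document in one shard by how often the query occurs."""
    q = query.lower()
    return [(doc.lower().count(q), doc) for doc in shard if q in doc.lower()]


def scatter_gather(shards, query, search_shard, top_k: int = 3):
    """Fan out one search per shard, then merge and rank the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query), shards))
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)[:top_k]


shards = [["actor model", "tuple space"], ["actor actor", "gossip"]]
print(scatter_gather(shards, "actor", keyword_search))
```

The aggregation step (merge, sort, truncate) is the part that usually differs per use case; the fan-out is mechanical.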

2.5 Custom Services and Components and Full Polyglot Stack

PlexSpaces is not just a runtime for the primitives above. It ships the entire stack needed to build production AI services:

SDKs in all four languages:

# Python: @actor decorator, host.ask(), host.ts_write(), host.monitor()
@actor(facets=["virtual_actor", "durability"])
class MyAgent: ...

// Go: struct embedding, host.Ask(), host.TS().Write(), host.Monitor()
type MyAgent struct { plexspaces.ActorBase }
func (a *MyAgent) HandleMessage(from, msgType, payload string) string { ... }

// TypeScript: class extends PlexSpacesActor, host.ask(), host.ts.write()
class MyAgent extends PlexSpacesActor<MyState> { ... }

// Rust: #[gen_server_actor], host::ask(), host::ts_write(), host::monitor()
#[gen_server_actor]
struct MyAgent { state: MyState }

Service links for outbound HTTP: connect to any external API (OpenAI, Anthropic, your own inference endpoint) via config, not code:

# app-config.toml — service link to LLM provider
[[service_links]]
name = "llm_provider"
base_url = "https://api.openai.com"
timeout_secs = 30
retry_policy = { max_attempts = 3, backoff = "exponential" }

# Python actor using service link: no URL in code, no hardcoded credentials
response = host.http_fetch("llm_provider", "POST", "/v1/chat/completions",
    json.dumps({"model": "gpt-4o", "messages": messages}))

Custom supervisor strategies — configure how your agent tree recovers from failures:

[supervisor]
id = "rag-supervisor"
strategy = "one_for_one"        # restart only the crashed actor
max_restarts = 10
max_restart_window_secs = 60    # if 10 crashes in 60s, escalate to parent
children = [...]

Alternatives are rest_for_one (restart the crashed actor plus all actors started after it) and one_for_all (restart the entire team when any member crashes). The right choice depends on how much state your agents share.
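
The max_restarts / max_restart_window_secs pair describes a sliding-window restart budget. The sketch below shows one way to model it; RestartIntensity is a hypothetical class, and the exact boundary (escalating on the 10th crash rather than the 11th) is my reading of the config comment, not a documented guarantee.

```python
from collections import deque


class RestartIntensity:
    """Sliding-window crash counter: restart until the budget is used up."""

    def __init__(self, max_restarts: int = 10, window_secs: float = 60.0) -> None:
        self.max_restarts = max_restarts
        self.window_secs = window_secs
        self._crashes: deque[float] = deque()

    def record_crash(self, now: float) -> str:
        """Return "restart" or "escalate" for a crash observed at time `now`."""
        self._crashes.append(now)
        # Forget crashes that have fallen out of the window.
        while self._crashes and now - self._crashes[0] > self.window_secs:
            self._crashes.popleft()
        if len(self._crashes) >= self.max_restarts:
            return "escalate"
        return "restart"


policy = RestartIntensity(max_restarts=3, window_secs=10.0)
print([policy.record_crash(t) for t in (0.0, 1.0, 2.0)])
```

Passing the timestamp in explicitly keeps the policy deterministic and testable; a real supervisor would feed it a monotonic clock.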

Observability out of the box: every actor reports to Prometheus automatically:

// application_metrics_add() from any actor, any language
host.ApplicationMetricsAdd("rag-pipeline", map[string]any{
    "message_count": 1,
    "counter_metrics": map[string]any{
        "queries_processed": 1,
        "validation_failures": validationFailed,
    },
    "latency_totals_ms": map[string]any{
        "retrieve_ms": retrieveLatency,
        "generate_ms": generateLatency,
    },
})
// Automatically available at /metrics as:
// plexspaces_app_queries_processed{app="rag-pipeline",node="node-1"} 142
// plexspaces_app_retrieve_ms_total{app="rag-pipeline",node="node-1"} 8432

The battery list (all included, zero external deps beyond the binary):

Battery | What It Includes
Runtime | WASM AOT compilation, ~50 microsecond cold start, polyglot actor host
Storage | Per-actor SQLite journal, KV store, blob store, TupleSpace
Messaging | Local mailbox, remote gRPC, ordered delivery, at-least-once
Scheduling | Timers, send_after, cron-style periodic messages
Coordination | TupleSpace, distributed locks, process groups, channels
Scaling | Shard groups, elastic pools, MPI collectives
Security | JWT auth, tenant isolation, namespace scoping, RBAC
Observability | Prometheus metrics, per-actor counters, application metrics API
Deployment | APP/WAR-file hot deploy/undeploy, multi-app per node, SWIM gossip
Networking | Location-transparent routing, gRPC transport, service links

Part 3: Infrastructure Patterns

Pattern 1: Durable Workflows with Signals and Queries

Workflow actors give you the durability that LLM pipelines need but almost never have. Use durability when your pipeline has multiple expensive steps and you cannot afford to restart from scratch on a crash. Each step is checkpointed. Crash at step 3, resume from step 3. No full restart. The Python MCPGatewayWorkflow shows the pattern:

# From mcp_tool_server_actor.py — MCPGatewayWorkflow
@workflow_actor(facets=["virtual_actor", "durability"])
class MCPGatewayWorkflow:
    session_id: str = state(default="")
    requests_processed: int = state(default=0)
    last_error: str = state(default="")

    @run_handler
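    def start(self, request: dict | None = None) -> dict:
        request = request or {}  # tolerate a missing payload instead of failing on None.get()
        if not self.session_id:
            self.session_id = f"session-{host.now_ms()}"
        method = request.get("method", "")
        result = None  # stays None when the method is not one this gateway routes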
        # Route to tool registry — state checkpointed before and after
        if method == "tools/list":
            result = host.ask("tool_registry", "tools_list", {}, timeout_ms=10000)
        elif method == "tools/call":
            tool_name = request.get("params", {}).get("name", "")
            result = host.ask("tool_registry", "tools_call",
                              {"tool_name": tool_name, "input": request.get("params", {}).get("arguments", {})},
                              timeout_ms=15000)
        self.requests_processed += 1
        return {"jsonrpc": "2.0", "id": request.get("id", 0), "result": result}

    @signal_handler("reset")
    def reset(self, reason: str = "manual") -> None:
        self.requests_processed = 0
        self.session_id = f"session-{host.now_ms()}"

Temporal requires a separate server and a separate SDK. Airflow restarts the whole DAG. PlexSpaces checkpoints per step inside the actor runtime, using the same SQLite journal that backs all actor state.
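To make "crash at step 3, resume from step 3" concrete, here is a minimal Python sketch of per-step checkpointing. It is not the PlexSpaces journal: run_workflow, the dict-based journal, and the fail_at switch are all assumptions made for illustration, with a plain dict standing in for the SQLite journal.

```python
def run_workflow(steps, journal, fail_at=None):
    """Run steps in order, skipping steps the journal already recorded as done."""
    state = journal.get("state", {})
    start = journal.get("next_step", 0)
    for i in range(start, len(steps)):
        if fail_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state = steps[i](dict(state))
        # Checkpoint after every completed step.
        journal["state"] = state
        journal["next_step"] = i + 1
    return state


calls = []

def make_step(name):
    def step(state):
        calls.append(name)      # record that the expensive work actually ran
        state[name] = "done"
        return state
    return step


steps = [make_step(n) for n in ("parse", "embed", "generate")]
journal = {}
try:
    run_workflow(steps, journal, fail_at=2)   # crash before the third step
except RuntimeError:
    pass
final_state = run_workflow(steps, journal)     # resume from the checkpoint
print(calls)
```

The calls list shows that parse and embed ran once each even though the workflow was started twice.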

Pattern 2: SEDA (Staged Event-Driven Architecture)

SEDA decouples pipeline stages so a slow embedder doesn’t stall the parser, and a GPU failure at step 3 doesn’t rerun step 1. Every stage is an independent actor (or shard group of actors). Stages communicate by message passing. Each stage has its own queue, its own scaling policy, and its own failure boundary.

Use this pattern when your pipeline stages have meaningfully different latency profiles or resource requirements. For example, a slow GPU-bound generation step should not stall a fast CPU-bound parsing step, and a failure in one stage should not force the others to restart. The agentic_rag_pipeline example in Go shows the four core stages (index, retrieve, generate, validate) as separate actors orchestrated by a workflow:

// From agentic_rag_pipeline_actor.go — RAGWorkflow: four actors, one workflow
// Each actor is an independent stage with its own queue and failure domain.
retrieverID := wf.siblingActorID("retriever")    // Stage 2: keyword search
generatorID := wf.siblingActorID("generator")    // Stage 3: LLM generation
validatorID := wf.siblingActorID("validator")    // Stage 4: guardrail checks

// Stage 2 -> Stage 3: message passing (no shared memory, no locks)
retrieveResp, err := host.Ask(retrieverID, "retrieve", map[string]any{
    "query": query, "mode": effectiveMode, "max_results": 5,
}, 15000)
chunks := extractStringSlice(retrieveResp, "results")

generateResp, err := host.Ask(generatorID, "generate", map[string]any{
    "query": query, "context": chunks,
}, 15000)

// Fire-and-forget audit event to GenEvent actor — Stage 4 doesn't wait for it
_ = host.Send(eventActorID, "pipeline_step_completed", map[string]any{
    "step": "generate", "status": "completed",
})

The host.Send() call to the PipelineEventActor is fire-and-forget. The workflow continues immediately without blocking, so the audit stage cannot push backpressure into the generation stage. That’s SEDA in one line. At larger scale (from data_lake_rag), each stage becomes a shard group for horizontal parallelism: the retrieval stage fans out across N shards of the index, collects top-K per shard, and merges globally.

Scale the retrieval stage without touching the generation stage. Route GPU-heavy generation to GPU nodes via labels. The workflow actor checkpoints between stages so a crash at generation doesn’t re-run indexing. This is the operational superiority of SEDA: independent scaling, independent failure recovery, independent observability.
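The "own queue, own failure boundary" idea can be sketched without any actor runtime. The Python below is an illustration using standard-library threads and bounded queues, not PlexSpaces code; start_stage and the two toy handlers are hypothetical.

```python
import queue
import threading


def start_stage(name, handler, inbox, outbox):
    """Run one stage on its own thread, fed by its own inbox."""
    def loop():
        while True:
            item = inbox.get()
            if item is None:            # sentinel: pass shutdown downstream, then stop
                outbox.put(None)
                return
            outbox.put(handler(item))
    thread = threading.Thread(target=loop, name=name, daemon=True)
    thread.start()
    return thread


# Bounded inboxes give each stage its own backpressure limit.
parse_inbox = queue.Queue(maxsize=8)
embed_inbox = queue.Queue(maxsize=8)
results_q = queue.Queue()

threads = [
    start_stage("parse", str.strip, parse_inbox, embed_inbox),
    start_stage("embed", lambda text: (text, len(text)), embed_inbox, results_q),
]

for doc in ["  alpha ", " beta", "gamma  "]:
    parse_inbox.put(doc)
parse_inbox.put(None)

results = []
while (item := results_q.get(timeout=5)) is not None:
    results.append(item)
print(results)
```

A slow embed handler here would fill only embed_inbox; the parse stage keeps working until that queue is full, which is the decoupling the pattern is after.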

Pattern 3: Cellular Architecture

Use this pattern when namespace isolation is not enough and you need hard failure-domain separation between tenants or regions. It also fits geographic compliance requirements where data cannot leave a region. Each cell is an independent PlexSpaces cluster of nodes sharing the same cluster-name, with its own supervisor tree, its own KV store, and its own actor registry. WASM APP/WAR-file deployment means each cell runs multiple AI services independently. SWIM gossip handles peer discovery between cells. Partition cells by tenant or by geography. Cells fail independently: an ACME tenant cell crashing doesn’t touch the Beta tenant cell. Add a new AI service to the ACME cell/cluster by dropping a .wasm file, and the Beta cell/cluster never sees it and never needs to restart.

This is multi-tenancy at the infrastructure level: not just separate namespaces but separate fault domains, with transparent cross-cell message routing.

Pattern 4: Resource-Based Affinity

Use resource-based affinity when you have heterogeneous compute (GPU vs CPU nodes) and need to route requests to the right tier based on prompt complexity, remaining budget, or hardware capability. The Go resource_aware_inference example below shows cost-aware model routing in 30 lines. The routing workflow coordinates three actors via TupleSpace discovery:

// From resource_aware_inference_actor.go — RoutingWorkflow.Run()
func (rw *RoutingWorkflow) Run(payloadJSON string) string {
    p := parsePayload(payloadJSON)
    prompt := stringVal(p, "prompt", "")
    tenantID := stringVal(p, "tenant_id", "default")
    preferGPU, _ := p["prefer_gpu"].(bool)

    // Discover services via TupleSpace registry (write-once, race-safe)
    budgetManagerID, err := tsDiscoverService("budget_manager")
    modelRegistryID, err := tsDiscoverService("model_registry")

    // Step 1: Check tenant budget
    complexity := promptComplexity(prompt)
    estimatedCost := 200.0 * tierCostPer1K("medium") / 1000.0
    budgetResp, err := host.Ask(budgetManagerID, "check_budget", map[string]any{
        "tenant_id": tenantID, "estimated_cost": estimatedCost,
    }, 10000)
    // ... if not allowed: return budget_exceeded

    // Step 2: Select model by complexity + budget + GPU preference
    modelResp, _ := host.Ask(modelRegistryID, "select_model", map[string]any{
        "complexity": complexity, "budget_remaining": remainingUSD, "prefer_gpu": preferGPU,
    }, 10000)
    selectedTier := stringVal(modelMap, "tier", "small")

    // Step 3: Route to tier-specific inference worker (also TS-discovered)
    workerRole := "inference_worker_" + selectedTier
    workerID, _ := tsDiscoverService(workerRole)
    inferResp, _ := host.Ask(workerID, "infer", map[string]any{
        "prompt": prompt, "max_tokens": 100, "tenant_id": tenantID,
    }, 30000)

    // Step 4: Deduct actual cost from shared TupleSpace budget
    host.Ask(budgetManagerID, "deduct", map[string]any{
        "tenant_id": tenantID, "cost": actualCost,
    }, 10000)
}

Three model tiers. One workflow actor. Per-tenant budget enforcement.


Part 4: RAG and Knowledge Patterns

Pattern 5: Indexing at Scale with Sharded RAG Index

Use indexing at scale when your document corpus is too large for a single actor to index or query within acceptable latency, or when you need to parallelize retrieval across many partitions and aggregate top-K results. For example, the parameter server Leader.train() in Python shows scatter-gather at its most direct: fan out compute_gradient to N workers, collect responses, aggregate:

# From parameter_server_actor.py — Leader.train()
group = host.create_shard_group({
    "group_id": group_id,
    "actor_type": "worker",
    "shard_count": self.num_workers,
    "partition_strategy": "hash",
    "placement": {"strategy": "from_registry"},
    "initial_state": {},
})

for _ in range(iterations):
    response = host.scatter_gather({
        "group_id": group_id,
        "query": {
            "op": "compute_gradient",
            "weights": {"w1": self.w1, "w2": self.w2},
            "input_dim": self.input_dim, "hidden_dim": self.hidden_dim,
        },
        "aggregation": "concat",
        "min_responses": self.num_workers,
        "timeout_ms": 30000,
    })
    # ... aggregate gradients, update weights

The same pattern applies to RAG indexing: N shard actors each hold a partition of the document corpus. Query time: scatter the search across all shards, gather top-K results, merge.
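The query-time merge can be sketched in plain Python. This is an illustration, not the PlexSpaces shard API: shard_top_k and scatter_gather_top_k are hypothetical names, the shards are in-memory dicts, and scoring is a naive term count.

```python
import heapq


def shard_top_k(shard_docs, query_terms, k):
    """Score one shard's documents and return its local top-K as (score, doc_id)."""
    scored = []
    for doc_id, text in shard_docs.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def scatter_gather_top_k(shards, query, k):
    terms = query.lower().split()
    # Scatter: each shard answers independently (sequential here, parallel in a real system).
    partials = [shard_top_k(shard, terms, k) for shard in shards]
    # Gather: merge the per-shard winners into one global top-K.
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))


shards = [
    {"d1": "actors restart actors", "d2": "gossip protocol"},
    {"d3": "actors and gossip"},
]
top = scatter_gather_top_k(shards, "actors", k=2)
print(top)
```

Each shard returns at most K hits, so the coordinator merges N times K candidates regardless of corpus size.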

Pattern 6: Agentic RAG — Orchestrated Retrieve-Generate-Validate

Use agentic RAG when a single retrieval-generation pass is not reliable enough for your use case, and you can afford 2–3 retry cycles in exchange for higher answer quality. The Go agentic_rag_pipeline demonstrates a full agentic RAG loop with retry in a workflow actor. This directly addresses the external validation recommendation from the FLP analysis: the ValidatorActor converts silent LLM misinterpretations (hallucinations, off-topic answers) into detectable failures that the workflow can handle.

// From agentic_rag_pipeline_actor.go — RAGWorkflow.Run()
for attempt := 0; attempt <= maxRetries; attempt++ {
    effectiveMode := mode
    if attempt > 0 { effectiveMode = "deep" }  // escalate to deep search on retry

    // Step 1: Retrieve
    wf.CurrentStep = "retrieve"
    retrieveResp, err := host.Ask(retrieverID, "retrieve", map[string]any{
        "query": query, "mode": effectiveMode, "max_results": 5,
    }, 15000)
    chunks := extractStringSlice(retrieveResp, "results")

    // Step 2: Generate
    wf.CurrentStep = "generate"
    generateResp, err := host.Ask(generatorID, "generate", map[string]any{
        "query": query, "context": chunks, "max_retries": 1,
    }, 15000)
    answer := extractString(generateResp, "answer")

    // Step 3: Validate — external check converts silent errors to detectable failures
    wf.CurrentStep = "validate"
    validateResp, err := host.Ask(validatorID, "validate", map[string]any{
        "answer": answer, "query": query, "sources": sources,
    }, 10000)
    if extractBool(validateResp, "valid") || attempt >= maxRetries {
        wf.Status = "completed"
        return marshal(map[string]any{"status": "completed", "answer": answer,
            "score": extractFloat(validateResp, "score"), "retry_count": attempt})
    }
    // Validation failed — retry with deep search mode
}

The retry escalation is key: first attempt uses single mode (fast, keyword match). Failed attempts switch to deep mode — multi-hop retrieval that tries individual query words. The workflow actor checkpoints between steps, so a generator crash mid-validation doesn’t force re-retrieval.

Pattern 7: Trustworthy Generation with Guardrails

Use the guardrails pattern when you are deploying agents in a context where incorrect or unsafe output has real consequences: customer-facing answers, financial decisions, regulated content. The ValidatorActor in the Go RAG pipeline runs three checks on every generated answer. These checks implement the “external validation converts Byzantine failures to detectable failures” principle:

// From agentic_rag_pipeline_actor.go — ValidatorActor.validate()
// Check 1: Length — answer must be longer than 10 chars
lengthOK := len(answer) > 10

// Check 2: Source grounding — answer must share words with at least one source
// This detects hallucination: an answer with no shared words with sources is likely fabricated
groundedOK := false
if len(sources) > 0 {
    answerWords := wordSet(strings.ToLower(answer))
    for _, src := range sources {
        srcWords := wordSet(strings.ToLower(src))
        for w := range answerWords {
            if len(w) > 3 && srcWords[w] { groundedOK = true; break }
        }
    }
}
if len(sources) == 0 { groundedOK = true }  // no sources: check not applicable

// Check 3: Safety — answer must not contain prompt injection attempts
forbidden := []string{"ignore", "bypass", "jailbreak", "forget"}
safeOK := true
for _, f := range forbidden {
    if strings.Contains(strings.ToLower(answer), f) { safeOK = false; break }
}

confidence := float64(passedCount) / 3.0

Three independent, composable checks. To add a toxicity check, a PII check, or a hallucination detector, write each as a new check function inside the same validator actor. Or promote the validator to a pipeline of validator actors, each responsible for one check category.
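One way to keep the checks composable is to treat the validator as a list of functions with a shared signature. The Python sketch below mirrors the three Go checks under that assumption; validate and the check function names are hypothetical, and the word matching is deliberately naive (whitespace split, no punctuation handling).

```python
def length_check(answer, sources):
    return len(answer) > 10


def grounding_check(answer, sources):
    if not sources:
        return True  # no sources: check not applicable
    answer_words = {w for w in answer.lower().split() if len(w) > 3}
    return any(answer_words & set(src.lower().split()) for src in sources)


def safety_check(answer, sources):
    forbidden = ("ignore", "bypass", "jailbreak", "forget")
    return not any(word in answer.lower() for word in forbidden)


def validate(answer, sources, checks):
    """Run every check; confidence is the fraction that passed."""
    passed = [check.__name__ for check in checks if check(answer, sources)]
    return {
        "valid": len(passed) == len(checks),
        "confidence": len(passed) / len(checks),
        "passed": passed,
    }


CHECKS = [length_check, grounding_check, safety_check]
sources = ["The supervisor restarts crashed actors."]
good = validate("Actors restart under a supervisor tree.", sources, CHECKS)
bad = validate("Please bypass the supervisor entirely.", sources, CHECKS)
print(good["confidence"], bad["passed"])
```

Adding a PII or toxicity check then means appending one more function to CHECKS; the confidence denominator adjusts on its own.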

Pattern 8: Deep Search (Multi-Hop Retrieval)

Use this pattern when single-pass keyword retrieval consistently returns fewer results than expected for complex or multi-concept queries. The trade-off is a higher cost for every query that escalates. For example, the RetrieverActor escalates from keyword matching to word-level multi-hop retrieval when the first pass yields fewer than 2 results:

// From agentic_rag_pipeline_actor.go — RetrieverActor.retrieve()
if mode == "deep" && len(results) < 2 {
    words := strings.Fields(queryLower)
    for _, word := range words {
        if len(results) >= maxResults { break } // result budget filled: stop scanning further words
        if len(word) < 3 { continue }
        extra := ret.matchChunks(keys, word, maxResults-len(results))
        for _, e := range extra {
            results = append(results, e)
            if len(results) >= maxResults { break }
        }
    }
}

Simple and effective. The RetrieverActor tracks TotalChunksScanned so you can observe the cost of deep search versus single-pass retrieval in Prometheus.


Part 5: LLM Orchestration

Pattern 9: Prompt Chaining

Use this pattern when a single prompt cannot reliably produce your target output and you can decompose the task into sequential transforms where each step’s output is well-defined enough to be the next step’s input. If steps are independent rather than sequential, use parallel scatter-gather instead. For example, ChainActor in the TypeScript orchestrator executes multi-step sequential transforms. Each step receives the output of the previous step:

// From llm_workflow_orchestrator_actor.ts — ChainActor.onExecute_chain()
onExecute_chain(payload: Record<string, unknown>): Record<string, unknown> {
    const steps = Array.isArray(payload.steps)
      ? (payload.steps as string[])
      : ["summarize", "extract_keywords", "format_output"];
    let currentContent = String(payload.content ?? "");
    const stepResults: Record<string, unknown>[] = [];

    for (const step of steps) {
        const stepStart = host.nowMs();
        let transformed = currentContent;
        if (step === "summarize") {
            transformed = currentContent.length > 200
              ? currentContent.slice(0, 200) + "... [summarized]" : currentContent;
        } else if (step === "extract_keywords") {
            const words = currentContent.replace(/[^a-zA-Z\s]/g, "").split(/\s+/)
              .filter((w) => w.length > 5);
            transformed = [...new Set(words)].slice(0, 5).join(", ");
        } else if (step === "format_output") {
            transformed = JSON.stringify({ step_count: stepResults.length + 1,
              content: currentContent, processed: true });
        }
        stepResults.push({ step, latency_ms: host.nowMs() - stepStart });
        currentContent = transformed;
    }
    return { steps_completed: steps.length, final_output: currentContent };
}

Each step is pluggable. Add a translate step, a classify step, a fact_check step — the chain executor handles it without structural changes.
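The pluggability comes from keeping steps in a lookup table instead of an if/else ladder. A minimal Python sketch of that design, separate from the TypeScript ChainActor (STEPS, run_chain, and the classify step are hypothetical):

```python
STEPS = {
    "summarize": lambda text: text if len(text) <= 200 else text[:200] + "... [summarized]",
    "uppercase": str.upper,
}


def run_chain(content, step_names, registry=STEPS):
    """Feed each step's output into the next step, in order."""
    completed = []
    for name in step_names:
        content = registry[name](content)
        completed.append(name)
    return {"steps_completed": len(completed), "final_output": content}


# Plugging in a new step is one registry entry; run_chain itself is unchanged.
STEPS["classify"] = lambda text: ("[question] " if text.endswith("?") else "[statement] ") + text

result = run_chain("What is SEDA?", ["summarize", "classify"])
print(result)
```

An unknown step name raises KeyError here, which surfaces a typo in the step list rather than silently skipping the step.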

Pattern 10: Routing

Routing is one of the most important agentic patterns (see the full taxonomy here). Use this pattern when you have specialist agents (or models) that each handle a category of input better than a single general agent, and you need a stateful, observable dispatch layer rather than ad hoc if/else logic scattered across your orchestration code. A routing actor classifies the input, selects the appropriate specialist, and dispatches, all in one stateful actor that tracks routing decisions in Prometheus. The RouterActor in the TypeScript orchestrator shows this. Note that onInit uses TupleSpace registration, not process groups, so sibling discovery is deterministic:

// From llm_workflow_orchestrator_actor.ts — RouterActor
protected override onInit(config: Record<string, unknown>): void {
    this.state.actorId = String(config.actor_id ?? "");
    // TupleSpace write-once registration — supervisor instance wins
    tsRegisterService("router", this.state.actorId);
}

onRoute(payload: Record<string, unknown>): Record<string, unknown> {
    const content = String(payload.content ?? "");
    const lower = content.toLowerCase();
    let route: string;
    if (lower.includes("summarize") || content.length < 100) {
        route = "summarize";
    } else if (lower.includes("extract") || lower.includes("entities")) {
        route = "extract";
    } else if (lower.includes("analyze") || lower.includes("compare")) {
        route = "analyze";
    } else {
        route = "generate";
    }
    this.state.routingDecisions += 1;
    this.state.routes[route] = (this.state.routes[route] ?? 0) + 1;
    return { route, task_type: route, content, routing_id: host.nowMs() };
}

The OrchestratorWorkflow resolves sibling targets at onInit via TupleSpace discovery, then uses them throughout the workflow run without re-discovery:

// From llm_workflow_orchestrator_actor.ts — OrchestratorWorkflow.onInit()
protected override onInit(config: Record<string, unknown>): void {
    // Resolve once at init — TupleSpace discovery is consistent
    this.state.routerTarget = siblingActorTarget("router");
    this.state.chainTarget = siblingActorTarget("chain");
    this.state.judgeTarget = siblingActorTarget("judge");
}

In production, replace keyword matching with a lightweight classifier model. The router actor holds the classifier in its state (loaded once in getDefaultState()), just like the inference worker holds the LLM. The dispatch logic stays unchanged — swap the classification algorithm without touching the routing architecture.

Pattern 11: Reflection and LLM-as-Judge

Use this pattern when output quality is highly variable and you can define a numeric score threshold that separates acceptable from unacceptable responses. For example, the OrchestratorWorkflow implements the reflection loop. It chains generation (via ChainActor) with scoring (via JudgeActor) and refines until the score threshold is met or max iterations is reached:

// From llm_workflow_orchestrator_actor.ts — OrchestratorWorkflow.run()
for (let iter = 0; iter <= maxIterations; iter++) {
    const judgeRes = host.ask(this.state.judgeTarget, "evaluate",
        { content: currentContent, original_query: content }, 10000) as Record<string, unknown>;
    const score = Number(judgeRes.score ?? 0);
    finalScore = score;
    finalResult = currentContent;

    if (score >= scoreThreshold || iter >= maxIterations) { break; }

    // Refine: re-chain with iteration note
    this.state.iterationCount += 1;
    currentContent = `Refined attempt ${this.state.iterationCount}: ${content}`;
    const refinedChain = host.ask(this.state.chainTarget, "execute_chain",
        { content: currentContent }, 15000) as Record<string, unknown>;
    currentContent = String(refinedChain.final_output ?? currentContent);
}
// Store result in TupleSpace for cross-actor access — other actors can pattern-match
host.ts.write(["orchestrator", "result", this.state.taskId, this.state.finalScore, host.nowMs()]);

The TupleSpace write at the end is important: other actors (the PipelineAuditActor, a downstream consumer) can read the final result by pattern-matching on ["orchestrator", "result", taskId, ...] without polling or shared memory. This is the Linda coordination model applied to agent result sharing.
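For readers new to Linda-style coordination, here is a toy in-memory tuple space in Python. It only illustrates pattern matching with wildcards and does not reflect how PlexSpaces implements TupleSpace; TinyTupleSpace and its method names are hypothetical, and None is used as the wildcard.

```python
class TinyTupleSpace:
    def __init__(self):
        self._tuples = []

    def write(self, tup):
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern matches any value in that position.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def read_all(self, pattern):
        """Non-destructive read of every matching tuple."""
        return [t for t in self._tuples if self._matches(pattern, t)]

    def take(self, pattern):
        """Destructive read: remove and return the first match, or None."""
        for i, t in enumerate(self._tuples):
            if self._matches(pattern, t):
                return self._tuples.pop(i)
        return None


space = TinyTupleSpace()
space.write(["orchestrator", "result", "task-1", 0.9, 1000])
space.write(["orchestrator", "result", "task-2", 0.4, 1001])

task1 = space.read_all(("orchestrator", "result", "task-1", None, None))
everything = space.read_all(("orchestrator", "result", None, None, None))
print(task1, len(everything))
```

The reader never needs the writer's actor ID, only the shape of the tuple it is looking for.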

Pattern 12: Exception Handling with Circuit Breaker FSM

Use this pattern when your agents call downstream services (LLM providers, external APIs) that are occasionally unavailable, and an indefinite block on a failed call would cascade into pipeline-wide stalls. The circuit breaker converts an unresponsive dependency into a fast, predictable failure. For example, the GeneratorActor in Go implements a circuit breaker with three states. This directly addresses the FLP liveness problem: when a downstream LLM is unavailable (crashed, rate-limited), the circuit breaker converts an indefinite block into a fast fail, preserving system liveness.

// From agentic_rag_pipeline_actor.go — GeneratorActor.generate()
if gen.CircuitOpen {
    return marshal(map[string]any{
        "answer": "Service temporarily unavailable. Please try again later.",
        "model": "circuit-breaker-fallback", "circuit_open": true,
    })
}

for attempt := 0; attempt <= maxRetries; attempt++ {
    answer, err := gen.tryGenerate(query, contextChunks)
    if err == "" {
        gen.ConsecutiveFailures = 0
        return marshal(map[string]any{"answer": answer, "circuit_open": false})
    }
    gen.ConsecutiveFailures++
    if gen.ConsecutiveFailures >= 3 {
        gen.CircuitOpen = true
        return marshal(map[string]any{"error": "circuit opened", "circuit_open": true})
    }
}

Three consecutive failures open the circuit. The fallback message is immediate. The reset_circuit handler closes it again after recovery. No external circuit breaker library is needed. The actor is the circuit breaker, and it persists its open/closed state via the durability facet, so a node restart restores the circuit’s last recorded state instead of silently resetting it.
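The Go excerpt shows only the open and closed paths. As a sketch of how a three-state breaker is commonly structured, the Python below adds a half-open probe after a cooldown. The cooldown, the half_open state name, and the injectable clock are assumptions for illustration; the source instead closes the circuit through an explicit reset_circuit handler.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("fast fail: circuit is open")
            self.state = "half_open"   # cooldown elapsed: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0              # any success closes the circuit
        self.state = "closed"
        return result


now = [0.0]                            # fake clock so the demo needs no sleeping
breaker = CircuitBreaker(clock=lambda: now[0])

def failing():
    raise RuntimeError("provider down")

for _ in range(3):
    try:
        breaker.call(failing)
    except RuntimeError:
        pass
state_after_failures = breaker.state
print(state_after_failures)
```

While the circuit is open, call() raises without invoking the downstream function at all, which is what turns an indefinite block into a fast fail.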

Pattern 13: Evol-Instruct with Prompt Mutation for Dataset Augmentation

Use this pattern when you are fine-tuning a model and your prompt dataset is too small or not diverse enough. Run this pattern to generate mutation candidates, score them with a judge, and keep the top performers. For example, ChainActor.onEvolve_instruction() mutates prompts to generate diverse training data:

// From llm_workflow_orchestrator_actor.ts — ChainActor.onEvolve_instruction()
onEvolve_instruction(payload: Record<string, unknown>): Record<string, unknown> {
    const instruction = String(payload.instruction ?? "");
    const mutations = Number(payload.mutations ?? 2);
    let evolved = instruction;
    let count = 0;
    if (mutations >= 1) { evolved = "Please explain in detail: " + evolved; count += 1; }
    if (mutations >= 2) { evolved = evolved + " Provide examples."; count += 1; }
    if (mutations >= 3) {
        const synonyms: Record<string, string> = { good: "excellent", use: "utilize", show: "demonstrate" };
        for (const [word, syn] of Object.entries(synonyms)) {
            evolved = evolved.replace(new RegExp(`\\b${word}\\b`, "gi"), syn);
        }
        count += 1;
    }
    return { original: instruction, evolved, mutations_applied: count };
}

Chain this with a judge: generate 10 mutations, score each, keep the top 3. Ship them as training examples. The ChainActor state tracks how many evolutions it has produced, so you can throttle and monitor via Prometheus.
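A minimal Python sketch of that generate, score, and select loop follows. The prefix and suffix lists, mutation_candidates, and toy_judge are all hypothetical; in particular the judge is a placeholder that counts distinct words, where a real pipeline would call a scoring model.

```python
import itertools

PREFIXES = ["", "Please explain in detail: ", "Step by step, ", "For a beginner, ", "Briefly, "]
SUFFIXES = ["", " Provide examples."]


def mutation_candidates(instruction):
    """5 prefixes x 2 suffixes = 10 candidate prompts."""
    return [p + instruction + s for p, s in itertools.product(PREFIXES, SUFFIXES)]


def toy_judge(prompt):
    # Placeholder score: number of distinct words. Not a quality measure.
    return len(set(prompt.lower().split()))


def top_mutations(instruction, judge, keep=3):
    ranked = sorted(mutation_candidates(instruction), key=judge, reverse=True)
    return ranked[:keep]


seed = "Describe how actors recover from crashes."
best = top_mutations(seed, toy_judge)
for prompt in best:
    print(toy_judge(prompt), prompt)
```

Because the judge is passed in as a function, swapping the placeholder for a real JudgeActor call would not change the selection logic.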


Part 6: Scaling Patterns

This is why PlexSpaces was built: how do you scale AI inference across 16 nodes without writing a distributed systems PhD thesis? Ray solves it with remote functions. Horovod solves the AllReduce piece. Spark solves the batch piece. But they’re three separate frameworks with three separate observability stacks and three separate deployment models. PlexSpaces gives you four parallelization mechanisms in the same framework, accessible from the same actor, using the same host.* API:

| Mechanism | API | Use Case | Ray Equivalent |
| --- | --- | --- | --- |
| Shard Group | host.scatter_gather() | Stateful parallel workers, RAG shards, parameter server | ray.map_batches() + Ray Actors |
| Elastic Pool | host.pool_checkout() / host.pool_checkin() | Stateless workers, burst capacity | ray.remote() concurrency |
| MPI Collectives | host.broadcast/reduce/allreduce/barrier_shard_group() | Distributed training, gradient sync, consensus | Horovod (external) |
| Process Groups | host.PG().Join/Broadcast/Members() | Dynamic membership, pub-sub coordination | ray.util.collective (partial) |

The Python parallel_ai_inference demonstrates all four in one example. Run it with 2, 4, 8, or 16 shards and the BenchmarkActor measures throughput and latency at each level.

Pattern 14: Shard Groups for Stateful Parallelism

Use this pattern when your workload partitions naturally by key (documents by ID, users by hash) and each worker needs warm state across requests, for example a model loaded in memory that should not be reloaded per request. If work is stateless and uniform, use elastic pools instead. The Python parallel_ai_inference benchmark below measures shard group throughput across 2, 4, 8, and 16 shards:

# From parallel_ai_inference_actor.py — BenchmarkActor.run_shard_benchmark()
for num_shards in shard_counts:
    group_id = f"bench-shard-{num_shards}-{host.now_ms()}"  # reused by scatter_gather below
    group = host.create_shard_group({
        "group_id": group_id,
        "actor_type": "inference_worker",
        "shard_count": num_shards,
        "partition_strategy": "hash",
        "placement": {"strategy": "from_registry"},
    })
    bench_start = host.now_ms()
    for i in range(requests_per_shard):
        response = host.scatter_gather({
            "group_id": group_id,
            "query": {"op": "infer", "request_id": f"bench-{num_shards}-{i}", "input": "sample-data"},
            "aggregation": "concat",
            "min_responses": num_shards,
            "timeout_ms": 30000,
        })
        for shard in _extract_shard_responses(response):
            payload = _unwrap_payload(shard.get("payload", {}))
            if payload.get("status") == "ok":
                latencies.append(int(payload.get("latency_ms", 0)))
    # ... compute throughput, p50, p99

Scaling (on my Apple M3 Pro):

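| Shards | TotalReq | KB/req | Wall ms | p50 | p95 | p99 | Compute ms | Coord ms | Comp% | Gran | Eff% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 320 | 256.0 | 163 | 10 | 11 | 11 | 44 | 70 | 38.6 | 0.63 | 100.0 |
| 4 | 640 | 256.0 | 179 | 11 | 12 | 12 | 87 | 83 | 51.2 | 1.05 | 91.1 |
| 8 | 1280 | 256.0 | 190 | 11 | 12 | 12 | 176 | 87 | 66.9 | 2.02 | 85.8 |
| 16 | 2560 | 256.0 | 255 | 11 | 12 | 13 | 367 | 127 | 74.3 | 2.89 | 63.9 |
| 32 | 5120 | 256.0 | 466 | 11 | 14 | 16 | 764 | 264 | 74.3 | 2.89 | 35.0 |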

Run parallel_ai_inference on your own hardware to get real numbers; the BenchmarkActor outputs these metrics automatically. The key difference from Ray map_batches(): shard actors are stateful. The InferenceWorkerActor loads its model once in on_init and keeps it warm across requests. Ray’s stateless task model reloads the model on every batch.
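The derived columns in the scaling table can be reproduced from the raw ones. The formulas in this Python sketch are inferred from the published numbers, not taken from the BenchmarkActor source: Comp% as compute / (compute + coord), Gran as compute / coord, and Eff% as the 2-shard wall time divided by the current wall time.

```python
# (shards, wall_ms, compute_ms, coord_ms) copied from the scaling table
rows = [
    (2, 163, 44, 70),
    (4, 179, 87, 83),
    (8, 190, 176, 87),
    (16, 255, 367, 127),
    (32, 466, 764, 264),
]

baseline_wall = rows[0][1]  # assumption: efficiency is relative to the 2-shard run

derived = []
for shards, wall, compute, coord in rows:
    comp_pct = round(100 * compute / (compute + coord), 1)   # share of time spent computing
    granularity = round(compute / coord, 2)                  # compute-to-coordination ratio
    efficiency = round(100 * baseline_wall / wall, 1)        # wall-time efficiency vs baseline
    derived.append((shards, comp_pct, granularity, efficiency))
    print(shards, comp_pct, granularity, efficiency)
```

Read this way, the table shows wall time staying nearly flat up to 8 shards while total requests quadruple, and the efficiency drop at 16 and 32 shards tracking the growth in coordination time.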

Pattern 15: Elastic Pools

Use this pattern when your workload is stateless and bursty with no affinity requirement. Pools give you burst capacity without pre-partitioning; the virtual_actor facet shuts idle workers down automatically so you pay nothing at rest. The run_pool_benchmark handler in Python demonstrates dynamic checkout/checkin, a worker pool where requests lease actors, use them, and return them:

# From parallel_ai_inference_actor.py — BenchmarkActor.run_pool_benchmark()
for i in range(total_requests):
    checkout_start = host.now_ms()
    checkout = host.pool_checkout(pool_name, timeout_ms=5000)
    wait_ms = host.now_ms() - checkout_start

    if not checkout:
        failed += 1
        continue

    actor_id = checkout.get("actor_id")
    checkout_id = checkout.get("checkout_id")
    exec_start = host.now_ms()
    call_ok = False  # health of this lease only, not of the whole benchmark run
    try:
        host.ask(actor_id, {"op": "infer", "request_id": f"pool-{i}", "input": "pool-sample"},
                 timeout_ms=10000)
        exec_ms = host.now_ms() - exec_start
        exec_times.append(exec_ms)
        successful += 1
        call_ok = True
    finally:
        # Report the worker as unhealthy only if its own call failed
        host.pool_checkin(pool_name, actor_id, checkout_id, healthy=call_ok)

The pool tracks avg_wait_ms, avg_exec_ms, and pool_utilization. When utilization exceeds a threshold, the supervisor spawns additional pool workers. When it drops, idle workers deactivate via the virtual_actor facet and you pay zero at rest. Shard groups vs elastic pools: use shard groups when work partitions naturally (documents by ID, users by hash). Use pools when work is uniform and you want burst capacity without pre-partitioning.
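The lease semantics are easy to see in a toy, single-threaded pool. This Python sketch is not the PlexSpaces pool: ElasticPool and its methods are hypothetical, there is no blocking or timeout, and an unhealthy worker is simply dropped rather than replaced.

```python
import itertools


class ElasticPool:
    def __init__(self, worker_ids):
        self._idle = list(worker_ids)
        self._leases = {}                  # checkout_id -> actor_id
        self._next_id = itertools.count(1)

    def checkout(self):
        """Lease an idle worker, or return None if the pool is exhausted."""
        if not self._idle:
            return None
        actor_id = self._idle.pop(0)
        checkout_id = next(self._next_id)
        self._leases[checkout_id] = actor_id
        return {"actor_id": actor_id, "checkout_id": checkout_id}

    def checkin(self, checkout_id, healthy=True):
        actor_id = self._leases.pop(checkout_id)
        if healthy:
            self._idle.append(actor_id)    # unhealthy workers leave the pool

    def utilization(self):
        total = len(self._idle) + len(self._leases)
        return len(self._leases) / total if total else 0.0


pool = ElasticPool(["worker-a", "worker-b"])
first = pool.checkout()
second = pool.checkout()
third = pool.checkout()          # pool exhausted
full = pool.utilization()
pool.checkin(first["checkout_id"], healthy=True)
half = pool.utilization()
print(third, full, half)
```

A utilization reading like this is the signal a supervisor could use to decide when to add or retire workers.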

Pattern 16: MPI Collectives

Use MPI collectives when you are running distributed training or gradient synchronization across multiple workers and need AllReduce, Barrier, or Broadcast semantics without pulling in a separate framework like Horovod. They also fit any distributed computation where all workers must agree on a shared value before proceeding to the next step. This is the capability that separates PlexSpaces from every other actor framework: native MPI-grade collective operations. Five collective operations, built in, available in Python, Go, Rust, and TypeScript.

# From parallel_ai_inference_actor.py — BenchmarkActor.run_collective_benchmark()
# 1. BroadcastShardGroup — config reset to all workers (MPI_Bcast equivalent)
t0 = host.now_ms()
broadcast_result = host.broadcast_shard_group({
    "group_id": group_id, "message": {"op": "reset"},
    "min_acks": num_shards, "timeout_ms": 10000,
})
timings["broadcast_ms"] = host.now_ms() - t0

# 2. BarrierShardGroup — wait for all workers to be ready (MPI_Barrier)
t0 = host.now_ms()
barrier_result = host.barrier_shard_group({"group_id": group_id, "timeout_ms": 10000})
timings["barrier_ms"] = host.now_ms() - t0

# 3. ReduceShardGroup — aggregate inference stats (MPI_Reduce with sum)
t0 = host.now_ms()
reduce_result = host.reduce_shard_group({
    "group_id": group_id, "map_function": {"op": "get_metrics"},
    "reduction": "sum", "timeout_ms": 10000,
})
timings["reduce_ms"] = host.now_ms() - t0

# 4. AllReduceShardGroup — consensus metrics across all workers (MPI_Allreduce)
t0 = host.now_ms()
allreduce_result = host.all_reduce_shard_group({
    "group_id": group_id, "map_function": {"op": "get_metrics"},
    "reduction": "sum", "timeout_ms": 10000,
})
timings["allreduce_ms"] = host.now_ms() - t0

What each operation does in AI/ML context:

| Operation | API | ML Use Case | MPI Equivalent |
| --- | --- | --- | --- |
| BroadcastShardGroup | host.broadcast_shard_group() | Push updated model weights to all workers | MPI_Bcast |
| BarrierShardGroup | host.barrier_shard_group() | Synchronize all workers before next training step | MPI_Barrier |
| ReduceShardGroup | host.reduce_shard_group() | Aggregate gradients from all workers -> coordinator | MPI_Reduce |
| AllReduceShardGroup | host.all_reduce_shard_group() | Every worker gets the aggregated gradient (Ring AllReduce) | MPI_Allreduce |
| ScatterGather | host.scatter_gather() | Fan-out inference requests, fan-in results | MPI_Scatter + MPI_Gather |

Ray needs Horovod for AllReduce, and Horovod is Python-only, requires NCCL, and runs as a separate job. PlexSpaces bakes all five collectives into the actor runtime, in all four languages, accessible from the same host.* API you use for everything else.
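The difference between Reduce and AllReduce is only about who ends up holding the result. A single-process Python sketch of the semantics (not of any ring algorithm or PlexSpaces API; the function names are hypothetical and each inner list stands for one worker's gradient vector):

```python
def reduce_sum(per_worker):
    """Reduce: only the coordinator receives the element-wise sum."""
    return [sum(column) for column in zip(*per_worker)]


def all_reduce_sum(per_worker):
    """AllReduce: every worker receives its own copy of the sum."""
    total = reduce_sum(per_worker)
    return [list(total) for _ in per_worker]


def broadcast(value, num_workers):
    """Broadcast: the same value is delivered to every worker."""
    return [value for _ in range(num_workers)]


gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three workers, two parameters
at_coordinator = reduce_sum(gradients)
at_every_worker = all_reduce_sum(gradients)
new_weights = broadcast({"w1": 0.5}, num_workers=3)
print(at_coordinator, at_every_worker)
```

With Reduce, the coordinator would still have to broadcast the updated weights afterwards; AllReduce folds that second hop into one operation.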

Pattern 17: Resource-Aware Cost Optimization

Use this pattern when you serve multiple tenants with different budgets and need to enforce financial limits at the infrastructure level. For example, BudgetManagerActor in Go tracks per-tenant USD spending across all inference calls. The state lives in TupleSpace and is shared across all actor instances, race-safe via take-then-write:

// From resource_aware_inference_actor.go — BudgetManagerActor.getReport()
// State is in TupleSpace, not per-actor KV — all instances see the same data
func (b *BudgetManagerActor) getReport() string {
    // ReadAll matches pattern ["budget", tenantID, value] across all tenants
    tuples := host.TS().ReadAll([]any{"budget", nil, nil})
    report := make([]any, 0, len(tuples))
    for _, tup := range tuples {
        if len(tup) < 3 { continue }
        tenantID, _ := tup[1].(string)
        budgetUSD := b.tsReadBudgetFloat("budget", tenantID)
        usedCost := b.tsReadBudgetFloat("usage_cost", tenantID)
        report = append(report, map[string]any{
            "tenant_id": tenantID, "budget_usd": budgetUSD,
            "used_usd": usedCost, "remaining_usd": budgetUSD - usedCost,
        })
    }
    return marshal(map[string]any{"status": "ok", "report": report})
}
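
The take-then-write idea mentioned above can be sketched with a toy tuple space. This is an illustration only: TinyTupleSpace and record_usage are hypothetical names, the store is in-memory, and costs are integer cents to keep the arithmetic exact.

```python
import threading

class TinyTupleSpace:
    """Hypothetical in-memory stand-in for a tuple space (not the PlexSpaces API)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def write(self, tup):
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    def take(self, pattern):
        """Atomically remove and return the first match (None is a wildcard)."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if len(tup) == len(pattern) and all(
                        p is None or p == v for p, v in zip(pattern, tup)
                    ):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()  # block until a writer puts the tuple back

def record_usage(space, tenant_id, cost_cents):
    # take removes the tuple, so concurrent updaters wait here instead of racing
    _, _, used = space.take(("usage_cost", tenant_id, None))
    space.write(("usage_cost", tenant_id, used + cost_cents))

space = TinyTupleSpace()
space.write(("usage_cost", "tenant-a", 0))
threads = [
    threading.Thread(target=record_usage, args=(space, "tenant-a", 1))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = space.take(("usage_cost", "tenant-a", None))
```

Because the tuple is absent while one updater holds it, no second updater can read a stale value, and no lost update occurs.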

The model registry selects a tier based on complexity AND remaining budget: it picks the large model for complex prompts when the budget allows and falls back to the small model when the budget is tight. The resource-affinity side lives in app-config.toml:

# From resource_aware_inference/app-config.toml
[[supervisor.children]]
id = "inference_worker_large"
type = "inference_worker_large"
behavior_kind = "GenServer"
facets = [
  { type = "virtual_actor", priority = 100,
    config = { idle_timeout = "15m", activation_strategy = "lazy",
               labels = { tier = "large", gpu_capable = "true", memory_tier = "high" } } },
  { type = "metrics", priority = 50 }
]
args = { tier = "large", base_latency_ms = "400" }

Set gpu_capable = "true" on GPU nodes. ModelRegistryActor.select_model() checks the prefer_gpu flag from the request and routes accordingly: large-tier workers with gpu_capable = "true" receive GPU-heavy requests, while CPU workers handle small and medium requests. The BudgetFSM enforces the financial ceiling. No matter how capable the GPU, if the tenant budget is exhausted, requests get budget_exceeded before any GPU cycles are wasted.
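
A rough sketch of budget-aware tier selection follows. The thresholds, the per-call costs, and the return shape are assumptions made up for illustration; the actual ModelRegistryActor logic may differ.

```python
TIER_ORDER = ["large", "medium", "small"]
# Assumed per-call cost in USD for each tier (illustrative values only)
TIER_COST_USD = {"large": 0.05, "medium": 0.01, "small": 0.002}

def select_model(complexity, remaining_usd, prefer_gpu=False):
    """Pick a tier from prompt complexity (0..1) and remaining tenant budget."""
    # Enforce the budget ceiling before any work is scheduled
    if remaining_usd < TIER_COST_USD["small"]:
        return {"status": "budget_exceeded"}
    if complexity >= 0.7:
        wanted = "large"
    elif complexity >= 0.3:
        wanted = "medium"
    else:
        wanted = "small"
    # Fall back to cheaper tiers until one fits the remaining budget
    for tier in TIER_ORDER[TIER_ORDER.index(wanted):]:
        if TIER_COST_USD[tier] <= remaining_usd:
            return {"status": "ok", "tier": tier,
                    "gpu": bool(prefer_gpu and tier == "large")}

rich = select_model(0.9, remaining_usd=1.0, prefer_gpu=True)
tight = select_model(0.9, remaining_usd=0.02)
broke = select_model(0.9, remaining_usd=0.001)
```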


Part 7: Agent Patterns

Pattern 18: Tool Calling and MCP Integration

Use this pattern when your agents need to call external tools (search APIs, databases) and you want those tools to be stateful, fault-tolerant, and observable as first-class actors rather than raw HTTP calls that fail silently and leave no audit trail. For example, the Python mcp_tool_server implements full MCP (Model Context Protocol) tool calling via actors. Each MCP tool is an actor. The registry is an actor. The gateway is a workflow actor.

# From mcp_tool_server_actor.py — ToolRegistryActor.tools_call()
@handler("tools_call")
def tools_call(self, tool_name: str = "", input: dict = None) -> dict:
    if tool_name not in self.tools:
        return {"error": "tool_not_found", "available_tools": list(self.tools.keys())}

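    # Treat a missing payload as empty so the membership check below cannot fail on None
    input = input or {}

    # Validate required fields from JSON schema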
    schema = self.tools[tool_name]
    required_fields = schema.get("inputSchema", {}).get("required", [])
    missing = [f for f in required_fields if f not in input]
    if missing:
        return {"error": "missing_required_fields", "missing": missing}

    # Route to specialist tool actor — location transparent
    target_actor = {"calculator": "calculator_tool", "search": "search_tool",
                    "weather": "weather_tool"}.get(tool_name, tool_name)
    self.invocation_counts[tool_name] = self.invocation_counts.get(tool_name, 0) + 1
    try:
        return host.ask(target_actor, "execute", input, timeout_ms=10000)
    except Exception as exc:
        self.error_counts[tool_name] = self.error_counts.get(tool_name, 0) + 1
        return {"error": "tool_execution_failed", "tool": tool_name, "message": str(exc)}

What standalone MCP servers lack: built-in state (registry survives restarts), multi-tenant access control (tenant namespace validation), Prometheus metrics (invocation counts, error rates, latency), and fault tolerance (supervisor tree restarts crashed tool actors). Actors provide all four for free.

Pattern 19: Multi-Agent Collaboration and A2A

Use this pattern when a single agent’s context window or capability set is insufficient for the full task, and you need specialist agents to collaborate with explicit coordination. Use TupleSpace result sharing rather than shared memory; it makes the coordination auditable and race-free. For example, the Go a2a_multi_agent shows a complete multi-agent system with dynamic agent discovery and TupleSpace coordination. Critically, it uses the same TupleSpace patterns that address the coordination problem identified in the FLP analysis, where agents write results to addressable slots and never share memory directly:

// From a2a_multi_agent_actor.go — OrchestratorAgent.Run()
// Step 1: Discover research agents by capability
discoverResp, err := host.Ask(registryID, "discover", map[string]any{
    "capabilities": []string{"research"},
}, 10000)
researchAgentID := o.pickFirstAgent(discoverResp, selfID, "research_agent")

// Step 2: Delegate research
researchResp, err := host.Ask(researchAgentID, "research", map[string]any{
    "topic": task, "depth": 1,
}, 10000)

// Store in TupleSpace — other agents can read without polling or shared state
researchJSON, _ := json.Marshal(researchResp)
_ = host.TS().Write([]any{"task", taskID, "step", "research", string(researchJSON)})

// ... delegate to analysis and writing agents, each storing to TupleSpace

// Step 7: Aggregate all results from TupleSpace — pattern match retrieves all steps
allResults := host.TS().ReadAll([]any{"task", taskID, "step", nil, nil})

Location transparency is the critical insight for multi-agent systems. When OrchestratorAgent calls host.Ask(researchAgentID, "research", ...), it does not care whether the research agent is on the same node, a different node in the same cluster, or a different cluster entirely. The framework routes transparently.
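
The aggregation step in the excerpt relies on pattern matching with nil as a wildcard. A minimal sketch of that matching rule, using a plain list as a hypothetical stand-in for the tuple store, looks like this:

```python
def matches(pattern, tup):
    """A pattern matches when lengths agree and every non-None field is equal."""
    return len(pattern) == len(tup) and all(
        p is None or p == v for p, v in zip(pattern, tup)
    )

def read_all(tuples, pattern):
    """Non-destructive read: return every stored tuple that matches."""
    return [t for t in tuples if matches(pattern, t)]

stored = [
    ("task", "t-1", "step", "research", '{"sources": 3}'),
    ("task", "t-1", "step", "analysis", '{"score": 0.8}'),
    ("task", "t-2", "step", "research", '{"sources": 1}'),
]
results = read_all(stored, ("task", "t-1", "step", None, None))
steps = [t[3] for t in results]
```

The orchestrator never needs to know which agent wrote a step or where that agent ran; the task ID in the pattern is enough to collect everything.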

Pattern 20: Batch Inference Pipeline

Use this pattern when you need to process a large, bounded dataset through an inference pipeline as efficiently as possible, such as nightly jobs, model evaluation runs, or bulk document processing. The Broadcast -> Barrier -> Scatter-Gather -> Reduce sequence maps directly to the initialization and execution steps of a distributed training or batch scoring job. For example, the parallel_ai_inference OrchestratorWorkflow runs multi-mode parallel inference:

# From parallel_ai_inference_actor.py — OrchestratorWorkflow._run_collective_mode()
# Broadcast -> Barrier -> Scatter-Gather -> Reduce
host.broadcast_shard_group({
    "group_id": group_id, "message": {"op": "reset"}, "min_acks": num_shards
})
host.barrier_shard_group({"group_id": group_id, "timeout_ms": 10000})

response = host.scatter_gather({
    "group_id": group_id,
    "query": {"op": "infer", "request_id": "collective-infer-0", "input": "collective-input"},
    "aggregation": "concat", "min_responses": num_shards,
})

host.reduce_shard_group({
    "group_id": group_id, "map_function": {"op": "get_metrics"}, "reduction": "sum"
})

Four operations in sequence: reset all workers (broadcast), synchronize (barrier), run inference (scatter-gather), collect metrics (reduce). This is exactly the initialization sequence for a distributed training step and it runs in one actor, in Python, in the same framework as the REST endpoint that triggered the inference.

Pattern 21: Async Agent Sessions

Use this pattern when your agents need to outlive the HTTP connection that triggered them, such as background tasks, scheduled routines, multi-device handoff, or multi-user collaboration on a single agent session. A synchronous HTTP/SSE transport couples the agent’s work lifetime to the connection lifetime, which breaks down in each of the following scenarios:

Scenario | HTTP/SSE Failure Mode | PlexSpaces Solution
Agent outlives the caller | Results stored in DB; client must poll | durability facet + Workflow Actor: state survives node restart, client reconnects and reads result from TupleSpace
Agent pushes unprompted | Must email or Slack out-of-band | Channels primitive (Kafka/Redis/SQS backends): agent publishes to channel, subscriber receives regardless of original connection state
Caller changes device | Requires custom session backend | virtual_actor + TupleSpace session state: agent is location-transparent, new device connects to same logical session
Multiple humans in one session | Not supported natively | Process Groups + Broadcast: all session participants join a group; agent broadcasts to all members

PlexSpaces addresses both problems without external dependencies:

  • Durable state: actor-local KV + durability facet checkpointing + TupleSpace for shared session data
  • Durable transport: Channels primitive with six durable backends (Kafka, Redis, SQS, PostgreSQL, and others) — the agent writes to a channel, the subscriber reads from it regardless of whether the two were ever simultaneously connected

# Agent side: write result to durable channel when work completes
# No assumption that any client is currently connected
@workflow_actor(facets=["virtual_actor", "durability"])
class BackgroundResearchAgent:
    session_id: str = state(default="")
    
    @run_handler
    def start(self, request: dict = None) -> dict:
        # Do expensive, long-running work
        result = self._run_research(request.get("topic", ""))
        
        # Publish to named channel — durable, no connection required
        host.channel_publish(f"session:{self.session_id}:results", {
            "status": "complete",
            "result": result,
            "ts": host.now_ms()
        })
        
        # Also write to TupleSpace — any device reconnecting can pull directly
        host.ts_write(["session", self.session_id, "result", host.now_ms()])
        return {"status": "accepted", "session_id": self.session_id}

# Client side: subscribe to channel; survives disconnect/reconnect
# Works identically whether the client is a browser, mobile app, or another agent
subscriber = host.channel_subscribe(f"session:{session_id}:results")
# Blocks until a message arrives — no polling loop, no session URL
result = subscriber.next(timeout_ms=300_000)

The critical difference from the Anthropic and Cloudflare hosted approaches: this runs on your infrastructure, in your cluster, with your data. There is no proprietary session backend you are locked into. The Channels backend is a configuration choice, so you can swap Kafka for Redis or SQS without touching agent code.
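
The property that matters here is that publisher and subscriber never have to be connected at the same time. A minimal sketch of that idea, using a hypothetical in-memory DurableChannel class (a real backend would persist the log and the offsets), might look like:

```python
class DurableChannel:
    """Hypothetical append-only channel with per-subscriber read offsets."""

    def __init__(self):
        self._log = []       # messages stay available after publish returns
        self._offsets = {}   # subscriber name -> index of next unread message

    def publish(self, message):
        self._log.append(message)

    def next(self, subscriber):
        """Return the next unread message for this subscriber, or None."""
        offset = self._offsets.get(subscriber, 0)
        if offset >= len(self._log):
            return None
        self._offsets[subscriber] = offset + 1
        return self._log[offset]

channel = DurableChannel()
# The agent finishes while no client is connected
channel.publish({"status": "complete", "result": "summary"})
first = channel.next("phone")    # a client that connects later still gets it
second = channel.next("phone")   # nothing new for the same client
laptop = channel.next("laptop")  # a second device reads from the start
```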


Part 8: The Distributed Systems Case for the Actor Model

Why Formal Coordination Protocol Matters

The FLP theorem and Byzantine bounds are mathematical facts, not engineering challenges to be optimized away. In distributed systems, we don’t try to make all nodes infallible; we design protocols that tolerate failures, such as Zab (ZooKeeper), Raft, and PBFT. The actor model applies the same principle to AI agents:

  1. Accept that agents crash: host.monitor() + supervisor restart strategies
  2. Accept that agents misinterpret: external validation via ValidatorActor + structured retry
  3. Accept that messages can be delayed: async host.Ask() with timeout + circuit breaker
  4. Accept shared state is dangerous: TupleSpace coordination instead of direct state sharing
  5. Accept that consensus is expensive: explicit checkpointing so you don’t re-run completed work

None of these require smarter models. They require the right coordination infrastructure.
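
As one example of such infrastructure, the "timeout + circuit breaker" item in the list above can be sketched in a few lines. This is a generic circuit breaker, not PlexSpaces code; the class, its parameters, and flaky_agent are hypothetical.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result

def flaky_agent(_msg):
    raise TimeoutError("agent did not reply in time")

breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_agent, "ping")
    except TimeoutError:
        outcomes.append("timeout")
    except RuntimeError:
        outcomes.append("fast-fail")
```

After two timeouts the third call never reaches the agent, which is the point: a delayed or dead peer stops consuming the caller's time budget.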

What Makes the Actor Model the Right Foundation

The actor model, as implemented in PlexSpaces, gives you exactly the properties that distributed systems theory says you need for safe multi-agent coordination:

Distributed Systems Property | Actor Model Mechanism | PlexSpaces API
Failure atomicity without partial state corruption | Per-actor isolated state | Actor KV + TupleSpace
Failure detection: know when a peer crashes | Link + Monitor | host.monitor(), host.link()
Crash recovery: restart from last good state | Journaled checkpointing | durability facet
Consensus without shared memory | Message passing only | host.Ask(), host.Send()
Coordination without deadlock | Linda model TupleSpace | host.ts.write/read/take()
Liveness under partial failure | Supervisor tree | one_for_one, rest_for_one strategies
Byzantine isolation | No cross-actor direct state access | Actor boundaries enforced by WASM sandbox
External validation | Standalone validator actors | ValidatorActor + retry loop pattern

Framework Comparison

Criterion | PlexSpaces | Ray | Spark | Horovod | Lambda + SQS
Cold start | ~50 microseconds (WASM AOT) | ~100ms (Python) | ~10s (JVM) | N/A | 100ms–10s
Worker state | Actor-local, durable | External store | Shuffle | Stateless | Stateless
Ring AllReduce | Native | Needs Horovod | No | Yes | No
Workflow durability | Per-stage checkpoint | No | No | No | Step Functions
MPI collectives | 5 ops built-in | No | No | Partial | No
Multi-tenancy | Built-in, JWT | No | No | No | IAM per function
MCP tool calling | Actor-native | No | No | No | No
A2A multi-agent | TupleSpace + registry | No | No | No | No
Durable async transport | Channels (6 backends) | No | No | No | SQS only
Failure detection | monitor() + supervisor | Limited | No | No | DLQ
Polyglot | Python, Go, Rust, TypeScript | Python primarily | JVM + PySpark | Python/C++ | Any FaaS
APP-file deploy | Yes, multi-app per node | No | No | No | Per-function
Ecosystem maturity | Early-stage; smaller community and fewer third-party integrations | Large ML ecosystem, extensive documentation | Massive data engineering ecosystem | Narrow but well-understood | AWS-native, excellent managed ops
Learning curve | High: new coordination model, four-language SDK, WASM packaging | Medium: Python-first, familiar to ML teams | Medium for PySpark, high for Scala | Low if you know PyTorch | Low: functions are simple, AWS handles ops
Best fit | Stateful polyglot agent systems with strict coordination, isolation, and durability requirements | Large-scale stateless Python ML workloads; teams already on Ray | Batch ETL and analytics at petabyte scale | Distributed deep learning gradient sync | Lightweight serverless event processing; AWS-native shops
Avoid when | Your team is Python-only and already invested in Ray or other similar frameworks | You need stateful actors with durability, strict multi-tenancy, or non-Python languages | You need low-latency online serving or stateful agents | You need anything beyond gradient synchronization | You need stateful workflows, complex coordination, or multi-tenant isolation

Conclusion

Every pattern in this post is ultimately the same argument applied to a different surface area: accept the mathematical constraints of distributed systems rather than pretending they dissolve when the nodes are language models instead of databases. The FLP theorem does not care that your consensus participants are generating text. Byzantine fault tolerance does not care that the incorrect messages are hallucinated API shapes instead of corrupted packets. The constraints are identical: isolated state, explicit coordination, crash detection, and external validation.

The actor model has provided exactly those properties since the 1970s. What’s new is the workload, not the substrate. The 20+ patterns in this post cover the full spectrum from single-agent durability to 10,000-agent distributed coordination. They all reduce to four primitives applied consistently:

  • FLP safety: isolated actor state, message-only communication, no shared memory corruption
  • FLP liveness: supervision trees, host.monitor() crash detection, durability facet checkpointing
  • Byzantine isolation: external ValidatorActor, WASM sandbox per actor, structured retry
  • Coordination without deadlock: TupleSpace write-once registration, Linda-model result sharing, Channels for durable async transport

The gap between “one agent that works in a demo” and “ten thousand agents that work at 3 AM on a Tuesday when two nodes are down and one tenant’s budget is exhausted” is not a gap that better prompts or bigger models close. It’s a distributed systems engineering problem, and it has distributed systems solutions. That’s what PlexSpaces is built around and it’s why the actor model, fifty years after its introduction, is still the right foundation.


GitHub: github.com/bhatti/PlexSpaces

Previous posts in this series:

April 14, 2026

API Anti-Patterns: 50+ Mistakes That Will Break Your Production Systems

Filed under: Computing,Microservices — admin @ 2:25 pm

Over the past few years I have written extensively about what makes distributed APIs fail. In How Abstraction Is Killing Software I showed how each layer crossing a network boundary multiplies latency and failure probability. In Transaction Boundaries: The Foundation of Reliable Systems and How Duplicate Detection Became the Dangerous Impostor of True Idempotency, I showed how subtle contract violations produce data corruption. Building Robust Error Handling with gRPC and REST, Zero-Downtime Services with Lifecycle Management, and Robust Retry Strategies for Building Resilient Distributed Systems explained error handling and operational health. My production checklist and fault tolerance deep-dive made those lessons actionable before a deployment. I also built an open-source API mock and contract testing framework, available at github.com/bhatti/api-mock-service, that addresses how few teams verify their API contracts before clients discover the gaps in production. And in Agentic AI for Automated PII Detection I showed how AI-driven scanning can find the sensitive data leaking through APIs that manual review misses. Here, I am showing 50 anti-patterns across seven categories, each with a real-world example. Two laws sit at the foundation of everything that follows.

Hyrum’s Law: With a sufficient number of users of an API, it does not matter what you promised in the contract, i.e., all observable behaviors of your system will be depended upon by somebody.

Postel’s Law (the Robustness Principle): Be conservative in what you send, be liberal in what you accept.


The Anatomy of an API Failure

The diagram below maps where anti-patterns activate in a production request lifecycle. Red nodes are failure hotspots.


Section 1: API Design Philosophy Anti-Patterns

Design philosophy determines everything downstream.


1.1 Bottom-Up API Design: Annotation-Driven and Implementation-First

I have seen this pattern countless times: the team builds the service, then adds Swagger/OpenAPI annotations to the Java or TypeScript classes to generate the API spec automatically. The spec is an artifact of the implementation, and field names are whatever the ORM column is called. Endpoints are organized around the service layer, not the consumer’s mental model. The spec is generated post-hoc, often incomplete, and rarely reviewed before clients onboard.

In the end, you get an API that perfectly describes your internal implementation and is poorly shaped for external callers. Names leak internal terminology. Refactoring the implementation silently changes the API contract. The APIs are also strongly coupled to the UI that the same team is building and clients who onboard during development find a moving target.

Better approach: Spec-First Design: Write the OpenAPI or Protobuf spec before writing any implementation code. Use the spec as the contract that drives both the server implementation and the client SDK. Review the spec with consumers before implementation begins. Use code generation to produce server stubs from the spec.

# spec-first: openapi.yaml is the source of truth, written before implementation
openapi: "3.1.0"
info:
  title: Order Service
  version: "1.0.0"
paths:
  /v1/orders:
    post:
      operationId: createOrder
      summary: Create a new order
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/CreateOrderRequest'
      responses:
        '201':
          description: Order created
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Order'
        '400':
          $ref: '#/components/responses/ValidationError'
        '409':
          $ref: '#/components/responses/ConflictError'

For gRPC: write the .proto file first. The proto is the spec. Code-generate both server stubs and client libraries from it. Also, Google’s API Improvement Proposals (AIP) define a spec-first methodology for gRPC APIs that also maps to HTTP via the google.api.http annotation. A single proto definition can serve both gRPC clients and REST/JSON clients through a transcoding layer (Envoy, gRPC-Gateway), giving you the performance of binary protobuf and the accessibility of JSON from one spec:

service OrderService {
  rpc CreateOrder(CreateOrderRequest) returns (Order) {
    option (google.api.http) = {
      post: "/v1/orders"
      body: "*"
    };
  }
  rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse) {
    option (google.api.http) = {
      get: "/v1/orders"
    };
  }
}

1.2 Bloated API Surface: Non-Composable, UI-Coupled APIs

Another common pattern I have seen at many companies is a service with hundreds or thousands of endpoints because every new feature needed some new data or behavior. Another symptom of poorly designed APIs is the bloated response that returns all fields and all related resources, deeply nested, because the first consumer needed everything and nobody added projection. This often occurs because the API is built by the same team building the UI. When the UI changes, new endpoints are added rather than the existing ones being generalized.

As a result, integration without documentation becomes impossible. New clients must read everything to understand what to call. Duplicate endpoints proliferate, e.g., three different endpoints do approximately the same thing because each was built for a different screen without awareness of the others.

Composability principle: A well-designed API surface should be small enough that a competent developer can understand its structure in 30 minutes. Operations should be small and focused so that clients can combine them.

// Anti-pattern: purpose-built for one UI screen
rpc GetCheckoutPageData(GetCheckoutPageDataRequest) returns (CheckoutPageData);
// CheckoutPageData contains customer, cart, inventory, shipping, payment — all tightly coupled to one view

// Better: composable operations that any client can combine
rpc GetCustomer(GetCustomerRequest) returns (Customer);
rpc GetCart(GetCartRequest) returns (Cart);
rpc ListShippingOptions(ListShippingOptionsRequest) returns (ListShippingOptionsResponse);
// BFF layer aggregates these for the UI — keeps the core API clean

On API surface size: prefer a small number of well-understood, stable operations over a large surface of purpose-built ones. Use field masks or projections so callers opt-in to the fields they need.
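
To illustrate the opt-in idea, here is a small sketch of applying a field mask to a response. The apply_field_mask helper and the dotted-path convention are assumptions for illustration; they loosely mirror how field masks name nested fields but are not taken from any specific library.

```python
def apply_field_mask(resource, paths):
    """Return a new dict containing only the requested dotted paths."""
    projected = {}
    for path in paths:
        keys = path.split(".")
        value = resource
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # unknown path: skip it rather than fail the request
            value = value[key]
        else:
            # Path fully resolved: rebuild the nesting in the projection
            dst = projected
            for key in keys[:-1]:
                dst = dst.setdefault(key, {})
            dst[keys[-1]] = value
    return projected

order = {
    "id": "o-1",
    "status": "shipped",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "audit": {"created_by": "svc"},
}
slim = apply_field_mask(order, ["id", "customer.name", "missing.field"])
```

One endpoint can then serve both the screen that needs two fields and the batch job that needs everything, without a purpose-built variant for each.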


1.3 Improper Namespace and Resource URI Design

Most companies provide REST-based APIs, but the endpoints are often organized around verbs instead of resources: /getOrder, /createOrder, /deleteOrder, /updateOrderStatus. There is no consistent hierarchy. Related resources are scattered across URL spaces: /orders and /order-history and /customer-purchases all refer to variants of the same concept with no clear relationship. Different teams own overlapping namespaces. A service called UserService has endpoints for users, preferences, addresses, payment methods, and audit logs with no sub-resource structure.

The fundamental concept in REST is that URLs identify resources with nouns and HTTP verbs express actions on those resources. A resource hierarchy expresses relationships. This is not an aesthetic preference; it is the architectural model that makes REST APIs predictable without documentation.

# Anti-pattern: verb-based, flat, unorganized
GET    /getUser?id=123
POST   /createOrder
POST   /updateOrderStatus
GET    /getUserOrders?userId=123
DELETE /cancelOrder?orderId=456
GET    /getOrderHistory?customerId=123

# Correct: resource-oriented hierarchy
GET    /v1/users/{userId}                        # get user
POST   /v1/orders                                # create order
PATCH  /v1/orders/{orderId}                      # partial update (including status)
GET    /v1/users/{userId}/orders                 # orders for a user
DELETE /v1/orders/{orderId}                      # cancel order
GET    /v1/users/{userId}/orders?status=completed # filtered history

Namespace discipline: Keep related resources under the same base path. OrderService owns /v1/orders/**. UserService owns /v1/users/**. Related sub-resources live under their parent: /v1/orders/{orderId}/items, /v1/orders/{orderId}/events. Do not scatter related concepts across different roots based on internal team ownership.

Avoiding duplicate APIs: Before creating a new endpoint, ask whether an existing one can be parameterized to serve the new use case.


1.4 The Execute Anti-Pattern: Bag of Params for Different Actions

In contrast to the large API surface, this anti-pattern reuses the same endpoint for different actions depending on which parameters are present. The operation is effectively execute(action, params...) with a bag of optional fields, where different combinations of fields trigger different code paths.

// Anti-pattern: one RPC that does many things depending on type
message ProcessOrderRequest {
  string order_id = 1;
  string action = 2;           // "cancel", "ship", "refund", "update", "hold"
  string cancel_reason = 3;    // only used when action = "cancel"
  string tracking_number = 4;  // only used when action = "ship"
  double refund_amount = 5;    // only used when action = "refund"
  Address new_address = 6;     // only used when action = "update"
  string hold_until = 7;       // only used when action = "hold"
}

The appeal is understandable: it feels like one operation (“do something with this order”), it minimizes the number of endpoints, and it is easy to add a new action without changing the RPC signature.

The cost is that callers cannot understand what the operation does without documentation explaining every action variant. Validation becomes a conditional maze: field cancel_reason is required when action = "cancel" but ignored otherwise. Generated SDK method signatures have no useful type information. The number of test cases multiplies with every new action and optional field combination.

Better approach: Separate operations for separate actions. Use oneof in protobuf for requests that have genuinely mutually exclusive parameter sets:

// Better: explicit operations, each with a clear contract
rpc CancelOrder(CancelOrderRequest) returns (Order);
rpc ShipOrder(ShipOrderRequest) returns (Order);
rpc RefundOrder(RefundOrderRequest) returns (Refund);

message CancelOrderRequest {
  string order_id = 1;
  string reason = 2;   // always relevant, always validated
}

// If you truly need a polymorphic command, use oneof to make it explicit:
message UpdateOrderRequest {
  string order_id = 1;
  oneof update {
    ShippingAddressUpdate shipping_address = 2;
    StatusUpdate status = 3;
    ContactUpdate contact = 4;
  }
  // oneof makes it structurally impossible to send two update types at once
  // Generated SDKs expose typed accessors — no stringly-typed action field
}

gRPC’s required/optional semantics: proto3 makes all fields optional by default. Use proto3’s optional keyword explicitly when a field’s absence carries meaning. You can use Protocol Buffer Validation to add more validation and enforce it in your boundary validation layer.


1.5 NIH Syndrome: Custom RPC Protocols Instead of Standards

Elsewhere, I have seen teams build their own binary protocol over raw TCP because “gRPC has too much overhead.” The protocol has custom framing, error codes, and multiplexing, runs on a non-standard port, and needs special firewall rules. More often than not the cause is NIH (Not Invented Here) syndrome: a belief that the standard tools are not good enough, combined with an underestimation of the operational cost of maintaining a custom protocol.

In the end, custom protocols do not work through corporate proxies, CDNs, API gateways, or load balancers that only speak HTTP. Many enterprise environments permit only HTTP/HTTPS outbound and a custom port means the integration simply cannot be used. Tools like Wireshark, curl, Postman, and every observability platform will not understand your protocol. Debugging becomes dramatically harder because the entire ecosystem of HTTP tooling is unavailable.

What standard protocols actually give you:

Protocol | Best For | Transport | Streaming
REST/HTTP | Public APIs, broad compatibility | HTTP/1.1, HTTP/2 | No (use SSE)
gRPC | High-performance internal services, strong typing | HTTP/2 | Yes (4 modes)
WebSocket | Bidirectional real-time communication | HTTP upgrade | Yes (full-duplex)
GraphQL | Flexible queries, client-driven shape | HTTP/1.1, HTTP/2 | Subscriptions
Server-Sent Events | Server-push notification | HTTP/1.1 | Server-to-client

1.6 Badly Designed Streaming APIs

This is similar to the previous pattern: a team that needs real-time data pushes builds a polling endpoint (GET /events?since=<timestamp>) and expects clients to poll every second. Or it uses raw sockets that send large JSON blobs because “it’s streaming.” Or it uses gRPC streaming but sends the entire dataset in one message instead of streaming rows incrementally. Or it builds a custom long-polling mechanism with complex session state when SSE would have been simpler.

  • gRPC streaming modes:
service DataService {
  // Unary: single request, single response — most operations
  rpc GetOrder(GetOrderRequest) returns (Order);

  // Server streaming: one request triggers a stream of responses
  // Use for: sending large datasets, live feeds, log tailing
  rpc TailOrderEvents(TailOrderEventsRequest) returns (stream OrderEvent);

  // Client streaming: stream of requests, one response
  // Use for: bulk ingest, file upload in chunks
  rpc BulkCreateOrders(stream CreateOrderRequest) returns (BatchCreateOrdersResponse);

  // Bidirectional streaming: both sides stream independently
  // Use for: real-time chat, collaborative editing, game state sync
  rpc SyncOrderState(stream OrderStateUpdate) returns (stream OrderStateUpdate);
}

  • WebSocket is the correct choice for full-duplex browser communication where you need persistent connections with low latency in both directions. It upgrades from HTTP, passes through standard proxies, and is supported universally.
  • Server-Sent Events (SSE) is the correct choice for server-push-only scenarios (notifications, live dashboards) where the client only needs to receive, not send. SSE is HTTP.
  • Never build: custom TCP streaming, custom HTTP long-polling with complex session management, or custom binary framing when gRPC already provides exactly that.
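
To show how little machinery SSE needs, here is a minimal sketch of encoding one message in the text/event-stream wire format. The format_sse helper and the order_update event name are hypothetical; the id, event, and data field names and the blank-line terminator come from the SSE format itself.

```python
import json

def format_sse(data, event=None, event_id=None):
    """Encode one Server-Sent Events message as text/event-stream."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")      # lets clients resume after reconnect
    if event is not None:
        lines.append(f"event: {event}")      # named event type
    for chunk in json.dumps(data).splitlines():
        lines.append(f"data: {chunk}")       # one data line per payload line
    return "\n".join(lines) + "\n\n"         # a blank line ends the message

frame = format_sse(
    {"order_id": "o-1", "status": "shipped"},
    event="order_update",
    event_id=7,
)
```

A server writes frames like this to a long-lived HTTP response, so it passes through the same proxies and tooling as any other HTTP traffic.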

1.7 Ignoring Encoding: JSON Everywhere Regardless of Cost

This anti-pattern surfaces when a high-throughput internal service between two microservices you control uses JSON over HTTP/1.1 because “it’s simple.” Internal services process millions of messages per second, serializing and deserializing large JSON payloads. The payload includes deeply nested structures with long field names repeated in every message. No compression. No binary encoding.

The performance reality: JSON is human-readable text with significant overhead:

  • Field names are repeated in every object (bandwidth and parse cost)
  • No schema enforcement at the encoding layer
  • No native binary type (base64 for bytes adds ~33% overhead)
  • UTF-8 string parsing is CPU-intensive at high throughput

Protobuf binary encoding is typically 3–10× smaller than equivalent JSON and 5–10× faster to serialize/deserialize at high volume. For internal service-to-service communication at scale, this is not a micro-optimization, it is a significant infrastructure cost difference.
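
Two of the overheads listed above are easy to check with the standard library. In this sketch, struct.pack is only a stand-in for a schema-driven binary encoding, not Protobuf itself, and the sensor record is made up for illustration.

```python
import base64
import json
import struct

# base64 emits 4 output characters for every 3 input bytes
raw = bytes(range(256)) * 3                 # 768 bytes of binary payload
encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw)          # about 1.33, i.e. ~33% larger

# JSON repeats field names in every message; a fixed binary layout does not
reading = {"sensor_identifier": 42, "temperature_celsius": 21.5, "is_calibrated": True}
as_json = json.dumps(reading).encode("utf-8")
as_binary = struct.pack("<Id?", 42, 21.5, True)   # uint32, float64, bool

ratio = len(as_json) / len(as_binary)       # how many times larger the JSON is
```

The exact ratio depends heavily on the payload shape, so treat any single number as indicative and measure with your own messages.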

Better approach: Choose encoding based on the use case:

ScenarioRecommended Encoding
Public REST API, browser clientsJSON (required for broad compatibility)
Internal service-to-service (high throughput)Protobuf binary over gRPC
Internal service-to-service (moderate)JSON over HTTP/2 with compression is acceptable
Mixed: public + internal clientsgRPC with HTTP/JSON transcoding via AIP
Event streaming (Kafka, Kinesis)Avro or Protobuf with schema registry

gRPC over HTTP/2 gives you multiplexed streams, binary encoding, strongly typed contracts, and bi-directional streaming in one package. For internal services at scale, there is rarely a justification for JSON over HTTP/1.1.

1.8 No Clear Internal/External API Boundary

In many cases, organizations may use gRPC internally and REST externally but in practice, the internal gRPC APIs were never held to any standard. For example, field names are inconsistent, operations are not paginated or there is no versioning.

  • Internal APIs become an inconsistent mess with duplicate functionality. Because internal APIs have no governance, each team designs theirs in isolation. Team A has GetUserProfile. Team B has FetchUser. Team C has LookupUserById. The internal API surface grows without bound.
  • Internal APIs leak into the external surface. The public REST API was designed conservatively, returning only what external callers need. But an internal team needs the same resource with additional fields. Rather than adding a projection or a scoped access tier, the quickest path is to promote the internal API endpoint. Over time, the line between “public” and “internal” API blurs. External clients discover undocumented internal fields (Hyrum’s Law again) and start depending on them.

Better approach — treat internal and external APIs as two tiers of the same governance model:

External API (public)         Internal API (private)
----------------------        -------------------------
Same naming conventions       Same naming conventions
Same error shape              Same error shape
Same pagination model         Same pagination model
Same versioning policy        Same versioning policy — yes, even internally
Minimal response fields       Additional fields gated by internal scope/role
OpenAPI spec enforced         Proto spec enforced with protoc-gen-validate
Published SLA                 Published SLA (even if internal)
Contract tests in CI          Contract tests in CI

The key discipline is that internal APIs must follow the same standards as public APIs in terms of naming, versioning, error shapes, pagination. The only difference is the data they expose and the authentication model.

Handling the “extra fields” problem: use scoped projections rather than separate endpoints:

message GetOrderRequest {
  string order_id = 1;

  // Callers with INTERNAL_READ scope receive all fields.
  // External callers receive only the public projection.
  // The same RPC serves both — authorization determines the projection.
  FieldMaskScope scope = 2;
}

enum FieldMaskScope {
  FIELD_MASK_SCOPE_PUBLIC = 0;    // external callers: customer-visible fields
  FIELD_MASK_SCOPE_INTERNAL = 1;  // internal callers: + audit, cost, state flags
  FIELD_MASK_SCOPE_ADMIN = 2;     // ops callers: + all internal diagnostics
}

message Order {
  // Public fields — always returned
  string order_id = 1;
  OrderStatus status = 2;
  google.protobuf.Timestamp created_at = 3;

  // Internal fields: returned only to FIELD_MASK_SCOPE_INTERNAL (or higher) callers.
  // Stripped at the API gateway for external requests.
  string internal_routing_key = 100;
  CostAllocation cost_allocation = 101;

  // Admin fields: returned only to FIELD_MASK_SCOPE_ADMIN callers.
  repeated AuditEvent audit_trail = 200;
}

This approach keeps one canonical API, one proto spec, one set of tests. The authorization layer determines which fields a caller receives. The API gateway strips internal fields from external responses. The same spec, with scope annotations, documents both tiers.
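
A minimal sketch of the projection step, assuming the caller's scope has already been derived from its credentials by the authorization layer (never taken on trust from the request). The Scope enum, FIELD_MIN_SCOPE table, and project_order function are hypothetical names for illustration; the field names come from the Order message above.

```python
from enum import IntEnum


class Scope(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    ADMIN = 2


# Minimum scope required to see each field of Order.
FIELD_MIN_SCOPE = {
    "order_id": Scope.PUBLIC,
    "status": Scope.PUBLIC,
    "created_at": Scope.PUBLIC,
    "internal_routing_key": Scope.INTERNAL,
    "cost_allocation": Scope.INTERNAL,
    "audit_trail": Scope.ADMIN,
}


def project_order(order: dict, caller_scope: Scope) -> dict:
    """Return only the fields the caller's scope is allowed to see.

    Fields that are not listed in FIELD_MIN_SCOPE are dropped (deny by default),
    so a newly added internal field cannot leak before it is classified.
    """
    return {
        name: value
        for name, value in order.items()
        if name in FIELD_MIN_SCOPE and FIELD_MIN_SCOPE[name] <= caller_scope
    }
```

The deny-by-default lookup is the important design choice here: forgetting to classify a field hides it from everyone rather than exposing it to external callers.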

On internal API governance: internal APIs need the same review gates as public APIs, even if the review is lighter. Some organizations enforce this via a service registry where every internal API must be registered, and the registry enforces naming and schema standards automatically.

1.9 Mixing Control-Plane and Data-Plane APIs

This anti-pattern occurs when a single API service handles both resource management (create a cluster, update a configuration, rotate a secret) and the high-frequency operational traffic that those resources serve (process a transaction, ingest a telemetry event). The same service, the same load balancer, the same deployment unit. A configuration change that causes a brief control-plane outage also takes down the data plane. A traffic spike on the data plane starves the management operations that operators need most during an incident.

Defining the planes: these terms come from networking and are now standard in cloud platform design.

Plane | Purpose | Typical TPS | Latency requirement | Caller
Control plane | Manage and configure resources | Low (10s–100s/s) | Relaxed (100ms–seconds) | Operators, automation, UI
Data plane | Serve the workload those resources define | High (1,000s–millions/s) | Strict (single-digit ms) | End-users, services, devices

Real-world examples of the split done correctly:

  • Kubernetes: kube-apiserver is the control plane that creates Deployments, updates ConfigMaps, and scales ReplicaSets. The actual pod-to-pod traffic it orchestrates is the data plane. A kube-apiserver brownout does not stop running pods from serving traffic.
  • AWS API Gateway: The management API (create/update/delete routes, authorizers, stages) is the control plane. The actual HTTP proxy that forwards requests to Lambda or ECS is the data plane.

The scaling difference between management traffic and operational traffic is invisible until it isn’t. When both planes share one service, there are two failure modes, both serious.

  • First, data-plane load starves control-plane availability. A traffic spike on the data plane consumes all available threads, connections, and CPU. Operators cannot reach the management API to make the configuration change that would fix the problem.
  • Second, control-plane deployments risk data-plane availability. A risky configuration change deployed to the unified service takes down both planes together. A misconfigured authentication change gates all traffic, including the operational traffic that cannot tolerate any interruption.

Better approach:

Separate the planes at the service level, not just at the routing level. A reverse proxy that routes /mgmt/* to one backend and /v1/* to another on the same process does not achieve the isolation you need.

// Control-plane API — management operations, low TPS, relaxed latency
service OrderConfigService {
  // Create/update routing rules — takes effect asynchronously
  rpc UpsertRoutingRule(UpsertRoutingRuleRequest) returns (RoutingRule);
  rpc DeleteRoutingRule(DeleteRoutingRuleRequest) returns (google.protobuf.Empty);
  rpc ListRoutingRules(ListRoutingRulesRequest) returns (ListRoutingRulesResponse);

  // Capacity and rate limit configuration
  rpc SetRateLimit(SetRateLimitRequest) returns (RateLimit);

  // Returns async job — config changes propagate eventually to data plane
  rpc TriggerConfigSync(TriggerConfigSyncRequest) returns (ConfigSyncJob);
}

// Data-plane API — operational traffic, high TPS, strict latency
service OrderService {
  // Reads routing rules from LOCAL CACHE — never calls control plane in-band
  rpc CreateOrder(CreateOrderRequest) returns (Order);
  rpc GetOrder(GetOrderRequest) returns (Order);
  rpc ListOrders(ListOrdersRequest) returns (ListOrdersResponse);
}

  • Config propagation: the data plane must not call the control plane synchronously on the hot path. Configuration is pushed from the control plane to the data plane via an event stream or periodically polled and cached locally. The data plane starts with the last known good configuration and operates independently if the control plane is temporarily unavailable.
  • Deployment and SLA differences: control-plane deployments can be careful, canary-gated, and slow because the cost of a management API degradation is low (operators retry). Data-plane deployments should be fast and automated with aggressive auto-rollback because the cost of data-plane degradation is direct user impact.
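
The config-propagation rule can be sketched as a small local cache, assuming a polling model. LocalConfigCache and its fetch callback are hypothetical names; in a real service the refresh would run in a background thread or task and fetch would call the control plane.

```python
import time
from typing import Callable, Optional


class LocalConfigCache:
    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float, initial: dict):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._config = dict(initial)        # last known good configuration
        self._fetched_at = float("-inf")    # forces a refresh attempt on first use

    def refresh_if_stale(self, now: Optional[float] = None) -> bool:
        """Called from a background loop, never from the request path.

        Returns True only when a new configuration was fetched and installed.
        """
        now = time.monotonic() if now is None else now
        if now - self._fetched_at < self._ttl:
            return False
        try:
            self._config = dict(self._fetch())
            self._fetched_at = now
            return True
        except Exception:
            # Control plane unreachable: keep serving the last known good config.
            return False

    def get(self, key: str, default=None):
        """Hot path: a local dictionary lookup with no network call."""
        return self._config.get(key, default)
```

Because get() never touches the network, a control-plane outage degrades freshness of configuration, not availability of the data plane.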

Section 2: Contract & Consistency Anti-Patterns


2.1 Inconsistent Naming Across APIs

This anti-pattern is fairly common as APIs evolve, e.g., EC2 uses CreateTags, ELB uses AddTags, RDS uses AddTagsToResource, and Auto Scaling uses CreateOrUpdateTags: four different verb shapes for the same semantic across four services.

Better approach: Establish a canonical vocabulary before first public release. For lifecycle operations: Create, Get, List, Update, Delete. Use id (server-assigned) vs name (client-specified) consistently. Use google.protobuf.Timestamp for all time values, never strings, never epoch integers.

message Order {
  string order_id = 1;                          // server-assigned ID
  string customer_name = 2;                     // client-specified name
  google.protobuf.Timestamp created_at = 3;     // typed timestamp, never string
  google.protobuf.Timestamp updated_at = 4;
  OrderStatus status = 5;                       // enum, not string, not int
}

enum OrderStatus {
  ORDER_STATUS_UNSPECIFIED = 0;  // always include; proto3 default
  ORDER_STATUS_PENDING = 1;
  ORDER_STATUS_CONFIRMED = 2;
  ORDER_STATUS_CANCELLED = 3;
}

2.2 Wrong HTTP Verb for the Operation

Despite adopting REST, I have seen companies misuse verbs, e.g., PATCH /orders/{id} that replaces the entire resource, or GET /reports/generate that inserts a database record.

Note on GraphQL and gRPC: Both protocols legitimately tunnel all operations through HTTP POST. This is an intentional protocol design choice and not an anti-pattern, but it must be documented explicitly, and REST-layer middleware (caches, proxies, WAFs) must be configured to account for it.

Verb | Semantics | Idempotent | Safe
GET | Retrieve | Yes | Yes
PUT | Full replace | Yes | No
PATCH | Partial update | Conditionally | No
POST | Create / non-idempotent | No | No
DELETE | Remove | Yes | No

2.3 Breaking API Changes Without Versioning

A breaking change without versioning can easily break clients, e.g., a field renamed from customerId to customer_id, an error code that was 400 becomes 422, a previously optional field becomes required.

Safe (no version bump): adding optional request fields, adding response fields, adding new operations, making required fields optional. Never safe without a version bump: removing/renaming fields, changing field types, changing error codes for existing conditions, splitting an exception type, changing default behavior when optional inputs are absent.
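
A few of these rules are mechanical enough to check in CI. The sketch below compares two versions of a request schema, each represented as a hypothetical {field_name: (type, required)} map; breaking_changes is an illustrative helper, not part of any real tool, and it covers only the field-level rules (not error codes or default behavior).

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two request-field maps of the form {name: (type, required)}.

    Returns a list of human-readable findings; an empty list means no
    field-level breaking change was detected.
    """
    problems = []
    for name, (old_type, old_required) in old.items():
        if name not in new:
            problems.append(f"removed or renamed field: {name}")
            continue
        new_type, new_required = new[name]
        if new_type != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new_type}")
        if new_required and not old_required:
            problems.append(f"optional field became required: {name}")
    for name, (_, required) in new.items():
        if name not in old and required:
            problems.append(f"new required field: {name}")
    return problems
```

Note that a rename shows up twice, once as a removal and once as a new field, which is exactly how an old client experiences it.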


2.4 Hyrum’s Law: Changing Semantic Behavior Without Versioning

With this anti-pattern, you fix a bug where ListOrders returned insertion order instead of alphabetical. You update an error message wording. You tighten validation. All of these feel internal. None are.

Better approach: Document everything observable. Use structured error fields (resource IDs, machine-readable codes) so clients never parse message strings. Treat any observable change including ordering, error message wording, validation leniency as potentially breaking.


2.5 Postel’s Law Misapplied: Silently Accepting Bad Input

This anti-pattern occurs when an API accepts quantity: -5 and treats it as 0; when an endpoint silently drops unknown fields, then later adds a field with the same name but different semantics; or when an API accepts both camelCase and snake_case, and a new field orderType then collides with the legacy alias order_type.

Better approach: Be strict at the boundary. Reject invalid input with a structured ValidationException. Accept unknown fields only if explicitly designed for forward compatibility. Never silently coerce.
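
As a rough illustration of strictness at the boundary, the sketch below validates a hypothetical order-item payload with two allowed fields. The schema, the function names, and the ValidationError class are assumptions made for the example; the point is that nothing is coerced and every problem is reported with its field path.

```python
class ValidationError(Exception):
    def __init__(self, violations):
        super().__init__("request validation failed")
        self.violations = violations  # list of (field, description) tuples


ALLOWED_FIELDS = {"sku", "quantity"}  # hypothetical order-item schema


def validate_order_item(payload: dict) -> list:
    """Return a list of (field, description) violations; empty means valid."""
    violations = [(name, "unknown field") for name in sorted(set(payload) - ALLOWED_FIELDS)]
    sku = payload.get("sku")
    if not isinstance(sku, str) or not sku:
        violations.append(("sku", "must be a non-empty string"))
    quantity = payload.get("quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        violations.append(("quantity", "must be an integer"))
    elif quantity <= 0:
        violations.append(("quantity", f"must be greater than 0, got {quantity}"))
    return violations


def parse_order_item(payload: dict) -> dict:
    """Reject invalid input outright; never clamp, default, or drop fields."""
    violations = validate_order_item(payload)
    if violations:
        raise ValidationError(violations)
    return {"sku": payload["sku"], "quantity": payload["quantity"]}
```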


2.6 Bimodal Behavior

In this scenario, under normal load, ListOrders returns a complete consistent list with 200. Under high load, it silently returns a partial list still with 200.

Better approach: Your degraded paths must return consistent response shapes and correct status codes. A timeout is a 503 with Retry-After. A partial result is not a 200.


2.7 Leaky Abstractions

Examples of leaky abstractions include error messages that contain internal ORM table names, and pagination tokens that are readable base64 JSON containing your database cursor.

Better approach: Map your domain model to your API, not your implementation. Pagination tokens must be opaque, encrypted, and versioned. Internal identifiers and infrastructure topology must never be inferred from responses.


2.8 Missing or Inconsistent Input Validation

This occurs when some fields are validated strictly, others silently truncated. The same field accepts null, "", and "0" on different endpoints.

Better approach: Validate at the boundary, consistently, for every operation.

message ValidationException {
  string message = 1;          // human-readable — never parse this in code
  string request_id = 2;
  repeated FieldViolation field_violations = 3;
}
message FieldViolation {
  string field = 1;            // "order.items[2].quantity"
  string description = 2;      // "must be greater than 0, got -5"
}

Section 3: Implementation Efficiency Anti-Patterns


3.1 N+1 Queries and ORM Abuse

In this case, you might have a ListOrders endpoint that fetches the list in one query, then issues a separate query per order for customer details, then another per order for line items. With 100 orders: 201 database round trips for what should be 1.

Network cost: Each cloud database round trip costs 1–5ms, so the 201 round trips above add roughly 0.2–1 second. The problem compounds with page size and nesting depth: a request that fans out to 4,700 round trips spends 4.7–23.5 seconds on pure network overhead before a byte of business logic executes. As covered in How Abstraction Is Killing Software, every layer crossing a network boundary multiplies the failure surface and latency budget.

Better approach: Return summary structures with commonly needed fields. Audit query plans with production-scale data before launch. Use eager loading for related data.
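
The difference is easy to reproduce with an in-memory SQLite database. This is only a sketch: the two-table schema is made up, and the query counter is maintained by hand rather than by instrumenting the driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id));
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"customer-{i}") for i in range(1, 11)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, (i % 10) + 1) for i in range(1, 101)])


def list_orders_n_plus_one(db):
    """One query for the list, then one more per order for the customer name."""
    queries = 1
    rows = db.execute("SELECT id, customer_id FROM orders ORDER BY id").fetchall()
    result = []
    for order_id, customer_id in rows:
        name = db.execute("SELECT name FROM customers WHERE id = ?",
                          (customer_id,)).fetchone()[0]
        queries += 1
        result.append((order_id, name))
    return result, queries


def list_orders_joined(db):
    """The same result from a single JOIN."""
    rows = db.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()
    return rows, 1
```

With 100 orders the first function issues 101 queries and the second issues one, for identical output. Against a networked database, each of those extra queries pays the round-trip cost described above.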


3.2 Missing Pagination

In this case, you might have a ListOrders endpoint that returns all results in a single response. It works at launch with small datasets. At scale, some accounts have millions of records: responses become hundreds of megabytes, timeouts multiply, and clients start crashing on deserialization.

Retrofitting pagination is a breaking change. If your endpoint always returned everything and you start returning a page with a next_page_token, clients that assumed completeness silently miss data. For example, EC2’s original DescribeInstances had no pagination. As customer instance counts grew into the thousands, responses became megabyte-scale XML documents that timed out and crashed clients. Retrofitting required making pagination opt-in, so legacy callers continued hitting the unbounded path for years after the fix shipped.

Guidance: every list operation must be paginated before first release:

  1. All List* operations that return a collection MUST be paginated, no exceptions. The only exemption is a naturally size-limited result like a top-N leaderboard.
  2. Only one list per operation may be paginated. If you need to paginate two independent collections, expose two operations.
  3. Paginated results SHOULD NOT return the same item more than once across pages (disjoint pages). If the sort order is not an immutable strict total ordering, provide a temporally static view or snapshot the result set at the time of the first request and page through the snapshot.
  4. Items deleted during pagination SHOULD NOT appear on later pages.
  5. Newly created items MAY appear on not-yet-seen pages, but MUST appear in sorted order if they do.

The canonical request/response shape (REST and gRPC should follow the same field naming like page_size in, next_page_token out):

message ListOrdersRequest {
  // Optional upper bound — service may return fewer. Default is service-defined.
  // Client MUST NOT assume a short page means there are no more results.
  int32 page_size = 1 [(validate.rules).int32 = {gte: 0, lte: 1000}];

  // Opaque token from previous response. Absent on first call.
  string page_token = 2;

  // Filter parameters — MUST be identical on every page of the same query.
  // Service MUST reject a request where filters change mid-pagination.
  OrderFilter filter = 3;
}

message ListOrdersResponse {
  repeated OrderSummary orders = 1;

  // Absent when there are no more pages. Clients MUST stop when this is absent.
  // Declared optional so that absence is distinguishable from an empty string;
  // the service never returns an empty token.
  optional string next_page_token = 2;

  // Optional approximate total — document clearly that this is an estimate.
  // Do NOT guarantee an exact count; that requires a full scan on every call.
  int32 approximate_total = 3;
}

page_size is an upper bound, not a target: the service MUST return a next_page_token and stop early when its own threshold is exceeded. Attempting to fill a page to meet page_size for a highly selective filter on a large dataset creates an unbounded operation.

Changing page_size between pages is allowed: it does not change the result set, only how it is partitioned. Changing filter parameters is not allowed and must be rejected.
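
The "upper bound, not a target" rule can be sketched with an in-memory list and a per-call scan limit. list_page, list_all, and scan_limit are hypothetical names, and the token is a plain offset only to keep the example short (a real token should be opaque).

```python
from typing import Callable, List, Optional, Tuple


def list_page(
    items: List[int],
    matches: Callable[[int], bool],
    page_size: int,
    page_token: Optional[str],
    scan_limit: int = 50,
) -> Tuple[List[int], Optional[str]]:
    """Return up to page_size matching items, examining at most scan_limit rows."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    start = int(page_token) if page_token else 0
    end = min(start + scan_limit, len(items))
    page: List[int] = []
    position = start
    while position < end and len(page) < page_size:
        if matches(items[position]):
            page.append(items[position])
        position += 1
    # A token is issued whenever unscanned rows remain, even if the page is short.
    next_token = str(position) if position < len(items) else None
    return page, next_token


def list_all(items, matches, page_size):
    """Client loop: stop only when the token is absent."""
    collected, token, calls = [], None, 0
    while True:
        page, token = list_page(items, matches, page_size, token)
        collected.extend(page)
        calls += 1
        if not token:
            return collected, calls
```

With a highly selective filter, most pages come back with zero or one item and still carry a token, so each call does a bounded amount of work no matter how sparse the matches are.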


3.3 Pagination Token Anti-Patterns

Every one of the following mistakes has been made in production by major APIs. Each creates a permanent contract liability.

  • Readable token (leaks implementation): When you restructure your database, the token format is a public contract you cannot change. Clients construct tokens manually to jump to arbitrary offsets, bypassing your access controls. Making backwards-compatible changes to a plain-text token format is nearly impossible.
// Decoded token — client immediately knows your DB cursor format
{ "offset": 500, "shard": "us-east-1a", "table": "orders_v2" }
  • Token derived by client (S3 ListObjects mistake): S3’s original ListObjects required callers to derive the next token themselves: check IsTruncated, use NextMarker if present, otherwise use the Key of the last Contents entry. Every S3 client library had to implement this multi-step derivation. When S3 needed to change the pagination algorithm, all that client logic became incorrect. ListObjectsV2 was the clean-break solution: an explicit opaque ContinuationToken issued by the server.
  • Token that never expires: A non-expiring token makes schema migrations impossible. If your pagination token format encodes version 1 of your database schema and you ship version 2, you must maintain a decoder for every token ever issued indefinitely. A 24-hour expiry gives you a bounded window after which all outstanding tokens are on the current format.
  • Token usable across users: A token generated for user A contains enough context to enumerate user B’s resources if the user check is missing. This is a data isolation vulnerability, not just a correctness bug.
  • Token that influences AuthZ: The service must not evaluate permissions differently based on whether a pagination token is present or what it contains. Authorization must be re-evaluated on every page request using the caller’s current credentials, not credentials cached inside the token.
// What the service stores inside the encrypted token — never visible to callers
message PaginationTokenPayload {
  string account_id = 1;      // bound to caller's account
  int32 version = 2;           // token format version for forward compatibility
  string cursor = 3;           // internal cursor — DB row ID, sort key, etc.
  google.protobuf.Timestamp issued_at = 4;   // for expiry enforcement
  bytes filter_hash = 5;       // hash of filter params — reject if changed
}
// This struct is AES-GCM encrypted before being base64-encoded and returned as next_page_token.
// The client sees only an opaque string. The server decrypts and validates on every use.
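
The validation checks on the payload can be sketched with the standard library alone. Note the substitution: this example signs the token with HMAC-SHA256 instead of encrypting it with AES-GCM (which needs a third-party library), so it demonstrates tamper detection, account binding, expiry, and filter pinning, but the payload is still readable by the client. SECRET, issue_token, and verify_token are hypothetical names.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-key"            # hypothetical; load from a secret store in practice
TOKEN_VERSION = 1
MAX_AGE_SECONDS = 24 * 3600


def _filter_hash(filters: dict) -> str:
    return hashlib.sha256(json.dumps(filters, sort_keys=True).encode()).hexdigest()


def issue_token(account_id: str, cursor: str, filters: dict, now: float) -> str:
    payload = json.dumps({
        "account_id": account_id, "version": TOKEN_VERSION, "cursor": cursor,
        "issued_at": now, "filter_hash": _filter_hash(filters),
    }, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).digest()  # 32 bytes
    return base64.urlsafe_b64encode(tag + payload).decode()


def verify_token(token: str, account_id: str, filters: dict, now: float):
    """Return (cursor, None) if the token is valid, else (None, reason)."""
    try:
        raw = base64.urlsafe_b64decode(token.encode())
    except ValueError:
        return None, "malformed token"
    tag, payload = raw[:32], raw[32:]
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None, "bad signature"
    data = json.loads(payload)
    if data["version"] != TOKEN_VERSION:
        return None, "unsupported token version"
    if data["account_id"] != account_id:
        return None, "token belongs to a different account"
    if now - data["issued_at"] > MAX_AGE_SECONDS:
        return None, "token expired"
    if data["filter_hash"] != _filter_hash(filters):
        return None, "filters changed mid-pagination"
    return data["cursor"], None
```

The account_id passed to verify_token must come from the caller's current credentials on every request, which keeps authorization independent of anything stored in the token.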

Client usage pattern: SDK helpers should abstract this loop, but every client must implement it correctly when calling raw:

page_token = None
while True:
    response = client.list_orders(
        filter={"status": "PENDING"},
        page_size=100,
        page_token=page_token   # None on first call
    )
    process(response.orders)
    page_token = response.next_page_token
    if not page_token:
        break   # no token = no more pages; do NOT check len(orders) < page_size

# NOTE: len(orders) < page_size does NOT mean last page.
# The service may return fewer results for internal reasons (execution time limit,
# scan limit, etc.) and still issue a next_page_token. Always check the token.

The single most common client-side pagination bug is treating a short page as a signal that pagination is complete.


3.4 Filtering Anti-Patterns

Filtering is where inconsistency compounds fastest as every team makes slightly different choices about semantics, validation, and edge cases, and callers cannot predict the behavior without reading the documentation for every endpoint individually.

The standard AND/OR semantic: all filtering implementations should follow EC2’s model: multiple values for a single attribute are OR’d; multiple attributes are AND’d. The order of attributes must not affect the result (commutative).

# EC2 canonical example
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
            Name=image-id,Values=ami-12345 \
            Name=tag-value,Values=prod,test

# Equivalent SQL semantics:
# (instance-state-name = 'running')
# AND (image-id = 'ami-12345')
# AND (tag-value = 'prod' OR tag-value = 'test')

Swapping the order of the three filter arguments must return an identical result set. Clients must never need to order their filters to get correct behaviour.
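
The semantics fit in a few lines. In this sketch apply_filters and the sample instance records are illustrative; each filter is an (attribute, values) pair, values within a filter are OR'd, and the filters themselves are AND'd.

```python
from typing import Dict, Iterable, List, Tuple

Filter = Tuple[str, List[str]]  # (attribute name, acceptable values)


def apply_filters(resources: Iterable[Dict[str, str]], filters: List[Filter]) -> List[Dict[str, str]]:
    """Values within one filter are OR'd; separate filters are AND'd."""
    return [
        r for r in resources
        if all(r.get(name) in values for name, values in filters)
    ]


instances = [
    {"id": "i-1", "instance-state-name": "running", "image-id": "ami-12345", "tag-value": "prod"},
    {"id": "i-2", "instance-state-name": "running", "image-id": "ami-12345", "tag-value": "dev"},
    {"id": "i-3", "instance-state-name": "stopped", "image-id": "ami-12345", "tag-value": "test"},
    {"id": "i-4", "instance-state-name": "running", "image-id": "ami-12345", "tag-value": "test"},
]
filters = [
    ("instance-state-name", ["running"]),
    ("image-id", ["ami-12345"]),
    ("tag-value", ["prod", "test"]),
]
print([r["id"] for r in apply_filters(instances, filters)])
```

Because all() over a set of independent predicates does not depend on their order, commutativity falls out of the structure rather than needing a special case.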

Include/exclude filter variants for date, time, and status fields:

# Hypothetical negation syntax for illustration (the real EC2 API has no operator key):
# return instances in us-east-1a that are NOT terminated
aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=terminated,operator=exclude \
            Name=availability-zone,Values=us-east-1a,operator=include

Timestamp fields MAY support not-before / not-after semantics. When supported, document the semantics exactly and validate that the provided value is a well-formed timestamp.

Filter structure in protobuf: use an enum for attribute names so the set of supported filters is machine-readable, and a validated pattern for values so wildcards and injection vectors are controlled:

message ListOrdersRequest {
  repeated Filter filters = 1 [(validate.rules).repeated.max_items = 10];
  int32 page_size = 2;
  string page_token = 3;
}

message Filter {
  FilterAttribute name = 1;    // enum — only supported attributes accepted
  repeated string values = 2   // OR'd together; max bounded
    [(validate.rules).repeated = {min_items: 1, max_items: 20}];
  FilterOperator operator = 3; // default INCLUDE; EXCLUDE for negation
}

enum FilterAttribute {
  FILTER_ATTRIBUTE_UNSPECIFIED = 0;
  FILTER_ATTRIBUTE_STATUS = 1;       // maps to Order.status
  FILTER_ATTRIBUTE_REGION = 2;       // maps to Order.region
  FILTER_ATTRIBUTE_CREATED_AFTER = 3;  // timestamp lower bound
  FILTER_ATTRIBUTE_CREATED_BEFORE = 4; // timestamp upper bound
  // Every value here must correspond to a field returned in OrderSummary.
  // Never add a filter attribute for an internal field not in the response.
}

enum FilterOperator {
  FILTER_OPERATOR_INCLUDE = 0;  // default — only matching resources returned
  FILTER_OPERATOR_EXCLUDE = 1;  // matching resources excluded from results
}

Filtering vs. specifying a list of IDs: these are different operations and must not be conflated. A filter is a predicate applied to the result set and it does not guarantee fetching a specific resource. Fetching a known set of resource IDs is a batch read (BatchGetOrders) and belongs in the batch operations standard, not in the filter parameter.

Flat parameters vs. structured filter list: two common shapes exist. Flat parameters (?status=PENDING&region=us-east) are simpler for simple cases and easier to cache with HTTP GET semantics. A structured filters list (as above) is more extensible and handles negation, wildcards, and complex predicates cleanly. Do not mix shapes across endpoints.


3.5 Chatty APIs and Network Latency Multiplication

Rendering a single page requires six sequential API calls. Each is 20ms. Sequential total: 120ms of pure network time before rendering begins. For example, Netflix’s move to microservices initially produced exactly this. Their solution: the BFF (Backend for Frontend) pattern, which is a purpose-built aggregation layer that parallelizes the six calls and returns one tailored response to the client.

Better approach: Design batch and composite read operations for primary use cases. Where callers need related resources together, provide projections. Parallelize what can be parallelized in your aggregation layer.
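
The latency arithmetic can be demonstrated with asyncio, using sleeps as stand-ins for the six downstream calls. The backend names and the call_backend helper are made up for the example; only the 20ms-per-call figure comes from the scenario above.

```python
import asyncio
import time

BACKENDS = ["profile", "orders", "recommendations", "cart", "inventory", "pricing"]


async def call_backend(name: str, latency: float = 0.02) -> str:
    """Stand-in for one downstream API call; sleeps instead of doing I/O."""
    await asyncio.sleep(latency)
    return f"{name}-data"


async def render_sequential() -> list:
    # Each call waits for the previous one: total is roughly 6 x 20ms.
    return [await call_backend(name) for name in BACKENDS]


async def render_parallel() -> list:
    # gather() runs the calls concurrently and preserves argument order.
    return list(await asyncio.gather(*(call_backend(name) for name in BACKENDS)))


def timed(coro_fn):
    """Run an async function to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

The parallel version takes about as long as the slowest single call, which is what an aggregation layer buys you when the calls are independent of each other.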


3.6 Synchronous APIs for Long-Running Operations

This is another pattern resulting from a poor understanding of API behavior, e.g., POST /reports/generate blocks for 45 seconds, or it returns 202 Accepted (or 200 OK) with no body, no job ID, no link to check status, no way to cancel, and no way to know when it is safe to retry.

A related scenario is an API that was designed around a specific UI assumption, e.g., “the UI will only ever submit 100 IDs”, but is exposed as a general API. When an automation script submits 10,000 IDs, the synchronous operation times out at the load balancer, the client retries, and two copies of the same job are now running. The API has no idempotency token, no job ID to check for an in-progress operation, and no way to cancel the duplicate. The missing async API primitives:

  1. No requestId in the 202 response: the caller has no handle to reference the job in subsequent calls, in logs, or in support tickets
  2. No status endpoint: the caller cannot poll for completion; the only signal is silence until a webhook fires
  3. No cancel operation: a misconfigured job consuming resources cannot be stopped without operator intervention
  4. No idempotency on submission: submitting the same job twice creates two jobs; there is no way to detect an in-progress duplicate
  5. No bounded input validation: the operation accepts an unbounded number of IDs because the UI never sends more than 100, but the API contract enforces no limit; automation sends 100,000 and the job runs for hours

Better approach: a complete async job lifecycle:

// Illustrative service wrapper; the service name is a placeholder.
service ExportService {
  // Submission: returns immediately with a Job handle
  rpc StartExport(StartExportRequest) returns (Job) {
    option (google.api.http) = { post: "/v1/exports", body: "*" };
    // Response: HTTP 202 Accepted
  }

  // Status + result polling
  rpc GetJob(GetJobRequest) returns (Job) {
    option (google.api.http) = { get: "/v1/jobs/{job_id}" };
  }

  // Cancellation: idempotent; safe to call multiple times
  rpc CancelJob(CancelJobRequest) returns (Job) {
    option (google.api.http) = { post: "/v1/jobs/{job_id}:cancel", body: "*" };
  }
}

message StartExportRequest {
  string client_token = 1;  // idempotency — same token returns existing job, not a new one
  repeated string record_ids = 2 [(validate.rules).repeated = {
    min_items: 1,
    max_items: 1000  // enforced at boundary — not a UI assumption baked into code
  }];
  ExportFormat format = 3;
}

message Job {
  string job_id = 1;              // stable handle for all subsequent calls
  string request_id = 2;         // trace ID for this submission specifically
  JobStatus status = 3;
  google.protobuf.Timestamp submitted_at = 4;
  google.protobuf.Timestamp completed_at = 5;  // absent until terminal state
  string result_url = 6;          // present only when status = SUCCEEDED
  JobError error = 7;             // present only when status = FAILED
  string self_link = 8;           // href to GET this job — no client URL construction needed
  string cancel_link = 9;         // href to cancel — clients should use these, not construct URLs
  int32 estimated_seconds = 10;   // hint for polling interval; not a guarantee
}

enum JobStatus {
  JOB_STATUS_UNSPECIFIED = 0;
  JOB_STATUS_QUEUED = 1;
  JOB_STATUS_RUNNING = 2;
  JOB_STATUS_SUCCEEDED = 3;
  JOB_STATUS_FAILED = 4;
  JOB_STATUS_CANCELLED = 5;
  JOB_STATUS_CANCELLING = 6;  // in-progress cancel — may still complete
}

The 202 Accepted response body must include:

  • job_id: the durable handle
  • self_link: the URL to poll (clients must not construct this)
  • cancel_link: the URL to cancel
  • estimated_seconds: polling hint
  • request_id: for logging and support correlation

HTTP/1.1 202 Accepted
Location: /v1/jobs/job-a3f9c2
Content-Type: application/json

{
  "job_id": "job-a3f9c2",
  "status": "QUEUED",
  "request_id": "req-7d2e1a",
  "self_link": "/v1/jobs/job-a3f9c2",
  "cancel_link": "/v1/jobs/job-a3f9c2:cancel",
  "estimated_seconds": 30
}

Returning a Location header with 202 is a widely used convention. Include it so that HTTP clients and polling helpers can find the status resource without custom URL construction.

Idempotency on submission prevents duplicate jobs: if a client submits with client_token: "export-2024-q1" and receives a timeout, the retry with the same token returns the existing Job.

Bounded input enforced at the boundary: the max_items: 1000 constraint in StartExportRequest is enforced by protoc-gen-validate at the gRPC boundary instead of application code. If the constraint needs to change, it changes in the proto spec and the enforcement changes with it.
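
Put together, the submission side of the lifecycle looks roughly like this in-memory sketch. JobStore is a hypothetical stand-in for whatever durable store backs the job service, the status strings abbreviate the JobStatus enum, and nothing here is thread-safe.

```python
import itertools
from dataclasses import dataclass, field

MAX_RECORD_IDS = 1000
TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}


@dataclass
class Job:
    job_id: str
    status: str = "QUEUED"
    record_ids: list = field(default_factory=list)


class JobStore:
    def __init__(self):
        self._jobs = {}        # job_id -> Job
        self._by_token = {}    # client_token -> job_id
        self._ids = itertools.count(1)

    def start_export(self, client_token: str, record_ids: list) -> Job:
        # Bounded input is checked before any work is queued.
        if not 1 <= len(record_ids) <= MAX_RECORD_IDS:
            raise ValueError(f"record_ids must contain 1..{MAX_RECORD_IDS} items")
        # A retry with the same token returns the existing job, not a new one.
        if client_token in self._by_token:
            return self._jobs[self._by_token[client_token]]
        job = Job(job_id=f"job-{next(self._ids)}", record_ids=list(record_ids))
        self._jobs[job.job_id] = job
        self._by_token[client_token] = job.job_id
        return job

    def get_job(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def cancel_job(self, job_id: str) -> Job:
        # Idempotent: cancelling a job in a terminal state is a no-op.
        job = self._jobs[job_id]
        if job.status not in TERMINAL:
            job.status = "CANCELLED"
        return job
```

A client that times out on submission can simply resubmit with the same client_token and then poll get_job with the returned job_id.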


3.7 Batch Operations with Mixed Success/Error Lists

This occurs when a batch endpoint returns a single flat list where successes and failures are distinguished only by the presence of an error field. Callers must iterate every entry to determine the outcome. For example, Firehose’s PutRecordBatch uses this anti-pattern with a single mixed list. The correct model (adopted in newer AWS APIs) separates success and failure lists:

message BatchCreateOrdersResponse {
  repeated Order created_orders = 1;
  repeated OrderError failed_orders = 2;
  // HTTP 200 even if all items failed — per-item failure is in failed_orders
  // HTTP 400 only if the batch itself is malformed
}
message OrderError {
  string client_request_id = 1;  // correlates to request entry
  string error_code = 2;
  string message = 3;
}
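
A handler that produces this shape might look like the sketch below. The entry format ({"client_request_id", "quantity"}), the INVALID_QUANTITY code, and the generated order IDs are assumptions for illustration; the response keys mirror BatchCreateOrdersResponse above.

```python
def batch_create_orders(entries: list) -> dict:
    """Process each entry independently and report outcomes in separate lists."""
    created, failed = [], []
    for entry in entries:
        request_id = entry.get("client_request_id", "")
        quantity = entry.get("quantity")
        valid = isinstance(quantity, int) and not isinstance(quantity, bool) and quantity > 0
        if not valid:
            failed.append({
                "client_request_id": request_id,  # lets the caller match the failure to its entry
                "error_code": "INVALID_QUANTITY",
                "message": f"quantity must be a positive integer, got {quantity!r}",
            })
            continue
        created.append({"order_id": f"order-for-{request_id}", "quantity": quantity})
    return {"created_orders": created, "failed_orders": failed}
```

A caller can check whether failed_orders is empty to decide if any follow-up is needed, and retry only the entries whose client_request_id appears there.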

Section 4: Idempotency & Transaction Anti-Patterns


4.1 Duplicate Detection Masquerading as True Idempotency

I wrote about this previously at How Duplicate Detection Became the Dangerous Impostor of True Idempotency. The issue arises when your create endpoint checks for an existing resource with the same name, returns it if found, and calls this “idempotency.”

The correct idempotency token flow:

Stripe’s idempotency key is the canonical implementation. Every POST accepts an Idempotency-Key header. Stripe stores the key and the exact response. Same key within 24 hours replays the original response without re-executing. Same key with a different body returns 422.

Failure mode of duplicate detection: A response is lost in transit. The client retries. Meanwhile, another actor deleted the resource and a third created a new one with the same name. Your “idempotent” endpoint returns the new resource which the original client neither created nor controls.
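
By contrast, a token-based flow keys on the request itself, not on the resource name. The sketch below is an in-memory approximation of the behavior described above (store key plus response, replay within 24 hours, reject a reused key with a different body using 422 as in the description). IdempotencyStore is a hypothetical name, and a real implementation would also need atomic inserts to handle concurrent retries.

```python
import hashlib
import json


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 3600):
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (body fingerprint, response, stored_at)

    @staticmethod
    def _fingerprint(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def execute(self, key: str, body: dict, handler, now: float):
        """Return (status, response); handler(body) runs at most once per live key."""
        fingerprint = self._fingerprint(body)
        entry = self._entries.get(key)
        if entry is not None and now - entry[2] < self._ttl:
            if entry[0] != fingerprint:
                return 422, {"error": "idempotency key reused with a different body"}
            return 200, entry[1]  # replay the stored response without re-executing
        response = handler(body)
        self._entries[key] = (fingerprint, response, now)
        return 200, response
```

The retrying client gets back the response to its own original request, regardless of what other actors have done to resources with the same name in the meantime.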


4.2 Missing Idempotency Tokens on Create Operations

This scenario occurs when POST /orders creates an order and returns its ID but accepts no clientToken. The client gets a timeout. Retry = potential duplicate. No retry = potential data loss. For example, early payments APIs had this problem. A double-charge scenario: customer clicks Pay, network times out, app retries, customer charged twice. Stripe, Adyen, and Braintree all mandate idempotency keys for payment operations.

message CreateOrderRequest {
  // SDK auto-generates when absent; callers may provide their own.
  // At most 64 printable ASCII characters; use a UUID or similar for uniqueness.
  optional string client_token = 1;
  string customer_id = 2;
  repeated OrderItem items = 3;
}

4.3 Transaction Boundary Violations

I wrote about this anti-pattern previously at Transaction Boundaries: The Foundation of Reliable Systems. This occurs when a single API call updates two separate resources with no atomicity guarantee. The first update succeeds; the service crashes before the second. Caller retries; first update applies twice.

Better approach: Document atomicity guarantees explicitly. For cross-service consistency, use the Saga pattern with compensating transactions.


4.4 Full Update via PATCH (Implicit Field Deletion)

This occurs when PATCH /orders/{id} replaces the entire resource. Fields not included are deleted. A mobile client updating the shipping address silently deletes the contact email. For example, GitHub’s current v3 API is explicit: PATCH applies partial updates, PUT applies full replacement — documented unambiguously for every endpoint.

message UpdateOrderRequest {
  string order_id = 1;
  Order order = 2;
  // Only fields in update_mask are modified.
  // paths = ["shipping_address"] -> only shipping_address is touched
  google.protobuf.FieldMask update_mask = 3;
}
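
On the server, applying the mask is a small amount of code. This sketch handles top-level paths only and uses plain dicts in place of the Order message; apply_update_mask is a hypothetical helper, not the Protobuf library's own FieldMask utility.

```python
import copy


def apply_update_mask(current: dict, patch: dict, paths: list) -> dict:
    """Return a copy of current with only the masked top-level fields replaced."""
    if not paths:
        raise ValueError("update_mask must list at least one path")
    updated = copy.deepcopy(current)
    for path in paths:
        if path not in current:
            raise ValueError(f"unknown field in update_mask: {path}")
        # A masked field that is missing from the patch is cleared explicitly.
        updated[path] = copy.deepcopy(patch.get(path))
    return updated
```

Fields outside the mask are never touched, even if the client happens to send values for them, so a partial update from a mobile client cannot delete the contact email by omission.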

4.5 Missing Optimistic Concurrency Control

This occurs when two clients GET the same order, both modify it, both PUT back. The last write silently overwrites the first. For example, Kubernetes uses server-side apply with field ownership tracking and returns 409 Conflict with the specific fields in conflict. The ETag / If-Match pattern is the REST equivalent.

GET /orders/123 -> { ..., "version": "v7" }
PATCH /orders/123 + If-Match: v7
# If order is now v8: HTTP 409 Conflict { "current_version": "v8" }
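
The version check behind that exchange might look like the following sketch. VersionedStore and VersionConflict are hypothetical names, versions are plain integers, and the 409 mirrors the example above (some APIs return 412 Precondition Failed for a failed If-Match instead).

```python
class VersionConflict(Exception):
    def __init__(self, current_version):
        super().__init__(f"409 Conflict: current version is v{current_version}")
        self.current_version = current_version


class VersionedStore:
    """In-memory stand-in (assumption); a real store needs an atomic compare-and-set."""

    def __init__(self):
        self._rows = {}  # order_id -> (version, data)

    def get(self, order_id):
        return self._rows[order_id]

    def put(self, order_id, data, if_match=None):
        version, _ = self._rows.get(order_id, (0, None))
        # Reject the write if the caller's version is stale.
        if if_match is not None and if_match != version:
            raise VersionConflict(version)
        self._rows[order_id] = (version + 1, data)
        return version + 1
```

The second of two clients holding the same version gets a conflict that carries the current version, so it can re-read and merge instead of silently overwriting.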

4.6 Ignoring Concurrent Operation Safety

In this scenario, an API allows parallel create-and-delete on the same resource without concurrency safety, or a long-running create can be invoked a second time while the first is still in flight.

Better approach: Document concurrency semantics per operation. For long-running creates: check for an in-progress operation before starting a new one. Use idempotency tokens to prevent parallel retries from compounding.


Section 5: Error Handling Anti-Patterns


5.1 Opaque, Non-Actionable Errors

This anti-pattern occurs with poorly defined errors like: {"error": "Something went wrong"}. An HTML error page from a load balancer served as an API response. The same ValidationException returned for “field missing,” “field too long,” and “field contains invalid characters.”

Better approach: I wrote about better error handling previously at Building Robust Error Handling with gRPC and REST APIs. Seven standard exception types cover nearly all scenarios:

Exception | HTTP | Retryable
--- | --- | ---
ValidationException | 400 | No
ServiceQuotaExceededException | 402 | No (contact support)
AccessDeniedException | 403 | No
ResourceNotFoundException | 404 | No
ConflictException | 409 | No (needs resolution)
ThrottlingException | 429 | Yes (honor Retry-After)
InternalServerException | 500 | Yes (with backoff)

Include request_id in every error response for support correlation. Include retry_after_seconds in 429 and 500 responses.


5.2 Error Messages That Clients Must Parse

This occurs where an API error looks like "ValidationException: The field 'order.items[2].quantity' must be greater than 0." A client parses the string to extract the field path. Major cloud providers have been forced to freeze exact error message phrasing for years because clients parse them. Changing a comma placement breaks production integrations.

Better approach: As described in Building Robust Error Handling with gRPC and REST APIs, error message text is for humans reading logs. Any information a program acts on must be in structured fields, never embedded in the message string.
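
A structured error body could look like the sketch below. The field names (code, retryable, details, reason) are illustrative assumptions rather than a format defined in the linked post. The point is that the field path travels in a structured field, so the message text can be reworded freely.

```python
import json
import uuid


def validation_error(field_path, reason, message):
    """Build an error body: programs read code/details, humans read message."""
    return {
        "code": "ValidationException",
        "http_status": 400,
        "retryable": False,
        "request_id": uuid.uuid4().hex,  # for support correlation
        "message": message,              # free text; may change at any time
        "details": [{"field": field_path, "reason": reason}],
    }


body = validation_error(
    "order.items[2].quantity",
    "MUST_BE_POSITIVE",
    "The field 'order.items[2].quantity' must be greater than 0.",
)
print(json.dumps(body, indent=2))
```

A client reads body["details"][0]["field"] instead of parsing the sentence.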


5.3 Leaking Internal Information in Errors

This occurs when error messages contain database hostnames, stack traces, SQL fragments, or internal ARNs, e.g., a 500 response that says NullPointerException at com.internal.service.OrderProcessor:237.

Security principle: Return only information applicable to that request and requester. An unauthorized caller asking for a resource that does not exist receives 403 AccessDeniedException, not 404 ResourceNotFoundException, because revealing non-existence is as informative as confirming existence.

Better approach: Catch and re-throw all dependency exceptions as service-defined error types. Include only a requestId for support lookup.


5.4 Exception Type Splitting and Proliferation

Splitting ConflictException into ResourceAlreadyExistsException, ConcurrentModificationException, and OptimisticLockException after release. Clients catching ConflictException silently miss the new subtypes.

The rule: Splitting an existing exception type is a breaking change. Adding fields to an existing exception type is always safe. Add new exception types only for genuinely new scenarios triggered by new optional parameters.


Section 6: Resilience & Operations Anti-Patterns


6.1 Missing Retry Safety in the SDK

This occurs when an SDK retries any 5xx response, including for non-idempotent POST requests, or retries without jitter, causing synchronized retry storms.

Correct retry policy:

  • Retry only: idempotent operations (GET, PUT, DELETE) OR POST with clientToken
  • Retry on: 429 (honor Retry-After), 500 (if retryable: true), 503
  • Never retry: 400, 401, 403, 404, 409
  • Backoff: base 100ms, 2x multiplier, ±25% jitter, max 10s, max 3 attempts
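
The policy above can be sketched as two small functions. The function and parameter names are hypothetical. The retryable argument stands for the retryable flag in a 500 response body, and attempt is the 1-based number of the attempt that just failed.

```python
import random

RETRYABLE_STATUS = {429, 500, 503}
IDEMPOTENT_METHODS = {"GET", "PUT", "DELETE"}


def should_retry(method, status, has_client_token=False, retryable=True,
                 attempt=1, max_attempts=3):
    if attempt >= max_attempts or status not in RETRYABLE_STATUS:
        return False
    if status == 500 and not retryable:
        return False
    # POST is only safe to retry when an idempotency token protects it.
    return method in IDEMPOTENT_METHODS or (method == "POST" and has_client_token)


def backoff_seconds(attempt, retry_after=None, base=0.1, multiplier=2.0,
                    jitter=0.25, cap=10.0, rng=random):
    if retry_after is not None:
        return retry_after  # the server's Retry-After wins
    delay = min(cap, base * multiplier ** (attempt - 1))
    # Spread clients out by +/-25% so they do not retry in lockstep.
    return delay * (1 + rng.uniform(-jitter, jitter))
```

With these defaults the third attempt waits roughly 0.3 to 0.5 seconds, and a POST without a token is never retried.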

6.2 Retry Storms and Missing Bulkheads

This occurs when all clients receive 429 simultaneously, all back off for exactly 2^n * 100ms, and all retry at the same moment, so the retry wave is as large as the original spike. I previously wrote Robust Retry Strategies for Building Resilient Distributed Systems, which covers effective retry strategies. For example, Netflix built Hystrix specifically to isolate downstream dependency thread pools so that slow responses in one pool cannot bleed into others. Circuit breakers open when error rates exceed thresholds, failing fast rather than queueing.
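
To show the fail-fast behavior, here is a deliberately small circuit breaker sketch. It is not Hystrix's implementation: it counts consecutive failures instead of an error rate, it is not thread-safe, and the class and parameter names are hypothetical.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial call after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: the wait has elapsed, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of queueing behind a slow dependency.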


6.3 Hard Startup Dependencies

This occurs when a service cannot start unless all dependencies are reachable. During a dependency outage, no new instances can start so the deployment stalls and you cannot deploy fixes when you most need to.

Better approach: I wrote about this previously at Zero-Downtime Services with Lifecycle Management on Kubernetes and Istio, which shows safe startup and shutdown. Start despite all dependencies unavailable. Initialize connectivity lazily. Distinguish not yet ready (503 + Retry-After) from unhealthy (500). Degrade gracefully rather than refuse to start.


6.4 Missing Graceful Shutdown

This is another common anti-pattern, e.g., a pod receives SIGTERM and exits immediately, dropping in-flight requests. I have seen this cause data loss because locally saved data failed to synchronize with the remote server before the pod was shut down.

Correct sequence: Stop accepting new connections -> complete in-flight requests (bounded timeout) -> flush async work -> exit. As covered in Zero-Downtime Services with Lifecycle Management, getting any stage wrong produces dropped requests during every deployment.


6.5 No Pre-Authentication Throttling

This occurs when throttling is applied only after authentication. An attacker sends millions of requests that exhaust the authentication infrastructure before any per-account quota applies.

Better approach: Lightweight rate limiting before authentication (source IP / API key prefix) as first-line defense. Per-account throttling after auth. Both layers required. Configuration updatable without deployment.
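
The two layers can be sketched with a token bucket per source IP and per account. This is a single-process illustration with hypothetical class names. A real deployment would share counters across instances and load the limits from configuration.

```python
import time


class TokenBucket:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TwoLayerLimiter:
    """Cheap per-IP check before auth, per-account check after auth."""

    def __init__(self, ip_limit, account_limit, clock=time.monotonic):
        # Each limit is a (rate_per_second, burst) tuple.
        self.ip_limit, self.account_limit, self.clock = ip_limit, account_limit, clock
        self.ip_buckets, self.account_buckets = {}, {}

    def _bucket(self, table, key, limit):
        if key not in table:
            table[key] = TokenBucket(*limit, clock=self.clock)
        return table[key]

    def allow_pre_auth(self, source_ip):
        return self._bucket(self.ip_buckets, source_ip, self.ip_limit).allow()

    def allow_post_auth(self, account_id):
        return self._bucket(self.account_buckets, account_id, self.account_limit).allow()
```

A request is rejected with 429 as soon as allow_pre_auth returns False, before any authentication work is done.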


6.6 Shallow Health Checks

I have seen companies touting 99.99% availability where their /health returns 200 as long as the HTTP server is running, regardless of whether the database connection pool is exhausted or the cache is unreachable.

Endpoint | Purpose | Checked by
--- | --- | ---
/health/live | Process alive | Kubernetes liveness probe
/health/ready | Can handle requests | Readiness probe, load balancer
/health/deep | Full end-to-end validation | Deployment pipeline gate
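
The difference between the first two endpoints can be sketched as follows. The function names and the shape of the checks mapping are assumptions. The dependency checks themselves (database ping, cache ping) are left to the caller.

```python
def liveness():
    # The process is up and can run code; no dependency calls here,
    # so a dependency outage does not trigger restarts.
    return 200, {"status": "alive"}


def readiness(checks):
    """`checks` maps a dependency name to a zero-argument callable that returns True when healthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as unhealthy.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

With an exhausted connection pool, readiness returns 503 and the load balancer stops routing traffic, while liveness still returns 200.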

6.7 Insufficient Metrics, SLAs, and Alerting

I wrote From Code to Production: A Checklist for Reliable, Scalable, and Secure Deployments, which shows how metrics and alerting must be configured for an API deployment. If you track only request count and a binary error rate, without latency percentiles or a defined SLA, diagnosing failures will be hard. For example, alerts fire only at a 100% error rate, so the entire service is down before anyone is notified.

Better approach: Instrument every operation with request rate, error rate (4xx vs 5xx), latency at P50/P95/P99/P999, and downstream dependency health. Set alert thresholds below your SLA, e.g. if P99 SLA is 500ms, alert at 400ms.
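
As a small worked example of alerting below the SLA, the sketch below computes nearest-rank percentiles over raw samples and flags any percentile above 80% of its SLA. The function names and the 0.8 margin are assumptions. Production systems usually compute percentiles from histograms, not from raw sample lists.

```python
import math

PERCENTILES = {"p50": 50, "p95": 95, "p99": 99, "p999": 99.9}


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def early_warnings(samples_ms, sla_ms, margin=0.8):
    """Return percentiles whose observed value exceeds margin * SLA (400ms for a 500ms SLA)."""
    warnings = {}
    for name, limit in sla_ms.items():
        observed = percentile(samples_ms, PERCENTILES[name])
        if observed > limit * margin:
            warnings[name] = observed
    return warnings


# 98 fast requests and two slow ones: the mean is about 34ms, but P99 is 450ms.
samples = [10] * 98 + [450, 2000]
alerts = early_warnings(samples, {"p50": 100, "p99": 500})
```

The average looks healthy here, yet the P99 warning fires before the 500ms SLA is actually breached.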


6.8 No “Big Red Button” and Missing Emergency Rollback

This occurs when there is no fast path to revert a bad deployment. Configuration changes require a full deployment to roll back. No tested runbook.

Better approach: Feature flags togglable without deployment (tested weekly). Sub-5-minute rollback pipeline. Pre-tested load shedding with documented decision thresholds. Runbooks practiced in drills, not just read.


6.9 Backup Communication Channels Not Tested

Incident response plans rely on Slack to coordinate a Slack outage. Runbooks stored in Confluence, down when cloud IAM is broken. For example, Google’s 2017 OAuth outage logged 350M users out of devices and services. Teams expected to coordinate via Google Hangouts, which was also down. Incident coordination was hampered by the incident. Recovery took 12 hours.


6.10 Phased Deployment Anti-Patterns and Missing Automation

This occurs when you deploy globally in a single wave. Rollback criteria are “wait and see.” Canary populations are too small. Rollback requires human decision-making at 3 AM. I wrote Mitigate Production Risks with Phased Deployment, which shows how phased deployment can reduce the risk of production releases. Automated phased deployment:

  1. Deploy 1-5% canary
  2. Run automated integration tests against canary
  3. Monitor SLA metrics for bake period (10 minutes)
  4. Auto-rollback if any threshold breaches without human intervention
  5. Promote to next fault boundary only on clean bake
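
The steps above can be sketched as a loop over deployment waves. All five callables are hypothetical hooks into your own pipeline. In particular, metrics_ok is assumed to block for the bake period and return False on any threshold breach.

```python
def run_phased_deployment(waves, deploy, run_tests, metrics_ok, rollback):
    """Promote wave by wave; roll back every deployed wave on the first failure."""
    deployed = []
    for wave in waves:
        deploy(wave)
        deployed.append(wave)
        # Gate promotion on integration tests and on SLA metrics over the bake period.
        if not (run_tests(wave) and metrics_ok(wave)):
            for done in reversed(deployed):
                rollback(done)  # automatic, no human decision required
            return False
    return True
```

The first wave would be the 1-5% canary, and each later wave a larger fault boundary.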

Section 7: Security, Data Privacy & Lifecycle Anti-Patterns


7.1 Missing Boundary Validation: Specs That Don’t Enforce

In this case, an OpenAPI spec exists but is not enforced at runtime and is documentation only. A proto definition marks fields as optional but the service processes requests where required fields are absent and produces undefined behavior. Input validation is implemented inconsistently in business logic rather than at the API boundary.

Better approach: Enforce the spec at the boundary. For OpenAPI/REST: Use middleware that validates every request against the OpenAPI schema before it reaches business logic. Libraries like express-openapi-validator (Node.js), connexion (Python), or API Gateway request validation do this. Every field type, pattern, range, and required constraint in the spec is automatically enforced.

# openapi.yaml — enforced at runtime, not just documentation
components:
  schemas:
    CreateOrderRequest:
      type: object
      required: [customer_id, items]
      properties:
        client_token:
          type: string
          minLength: 16
          maxLength: 128
        customer_id:
          type: string
          pattern: '^cust-[a-z0-9]{8,}$'
        items:
          type: array
          minItems: 1
          maxItems: 100
          items:
            $ref: '#/components/schemas/OrderItem'

For gRPC/Protobuf: Use protoc-gen-validate (PGV), a protobuf plugin that generates validation code from annotations in your .proto files:

import "validate/validate.proto";

message CreateOrderRequest {
  // clientToken: optional but if present must be 16-128 printable ASCII chars
  optional string client_token = 1 [(validate.rules).string = {
    min_len: 16, max_len: 128
  }];

  // customer_id: required, must match pattern
  string customer_id = 2 [(validate.rules).string = {
    pattern: "^cust-[a-z0-9]{8,}$",
    min_len: 1
  }];

  // items: required, 1-100 items
  repeated OrderItem items = 3 [(validate.rules).repeated = {
    min_items: 1, max_items: 100
  }];
}

message OrderItem {
  string product_id = 1 [(validate.rules).string.min_len = 1];

  // quantity: must be positive
  int32 quantity = 2 [(validate.rules).int32.gt = 0];

  // price: must be non-negative
  double unit_price = 3 [(validate.rules).double.gte = 0.0];
}

This enforces validation at the boundary, before your business logic runs, using the same .proto file that is your source of truth. No duplicate validation code. No inconsistency between the spec and the enforcement.


7.2 PII Data Exposure in APIs

This anti-pattern occurs when GET responses return PII such as full credit card numbers, SSNs, or passport numbers. Email addresses and phone numbers are included in audit logs and error messages. User location data is exposed in list endpoints without access controls. Responses are cached at the CDN layer with no consideration of the PII they contain.

Better approach: Apply data minimization at the API layer and return only the fields a caller needs and is authorized to receive. I wrote Agentic AI for Automated PII Detection: Building Privacy Guardians with LangChain and Vertex AI to show how schema annotations can mark sensitive fields and how AI agents can be used to detect violations:

import "google/api/field_behavior.proto";

message Customer {
  string customer_id = 1;
  string display_name = 2;

  // Sensitive: only returned to callers with PII_READ permission
  // Masked in logs: shown as "****@example.com"
  string email_address = 3 [
    (google.api.field_behavior) = OPTIONAL,
    // Custom option — your PII classification
    (pii.sensitivity) = HIGH
  ];

  // Never returned in list operations; only in GetCustomer with explicit consent
  string phone_number = 4 [(pii.sensitivity) = HIGH];

  // Tokenized before storage; never returned as plaintext
  string payment_method_token = 5;
}

Operational controls:

  • Never log full request/response bodies; use structured logging with explicit field allowlists
  • Apply response field filtering at the API gateway based on caller permissions
  • Scan API responses in CI/CD pipelines for PII patterns before deployment
  • Ensure pagination tokens do not contain PII
  • Cache keys must never contain PII; cached responses must never contain PII for a different caller
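
The first control, logging through an explicit allowlist with masking, might look like this sketch. The allowlist contents and helper names are assumptions. The masked form follows the "****@example.com" convention from the schema comment above.

```python
import re

# Only these fields may ever reach the logs (assumed allowlist).
ALLOWED_LOG_FIELDS = {"request_id", "customer_id", "status", "email_address"}
EMAIL = re.compile(r"^[^@\s]+@([^@\s]+)$")


def mask_email(value):
    match = EMAIL.match(str(value))
    # Keep the domain for debugging; hide the local part entirely.
    return f"****@{match.group(1)}" if match else "****"


def to_log_record(payload):
    """Keep only allowlisted fields and mask the ones that are allowed but sensitive."""
    record = {k: v for k, v in payload.items() if k in ALLOWED_LOG_FIELDS}
    if "email_address" in record:
        record["email_address"] = mask_email(record["email_address"])
    return record
```

Fields that are not on the allowlist, such as phone_number, are dropped, so a newly added sensitive field stays out of the logs by default.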

7.3 Missing Contract Testing

In this case, a service team ships an API. Client teams write integration tests against their own mock servers. The mock servers are written from the documentation, not from the actual service behavior. When the service changes, the mocks stay static. Clients discover the breaking change in production.

Consumer-driven contract testing reverses this: clients publish their expectations (the “contract” of what they call and what they expect back), and the service validates those contracts in its CI/CD pipeline. If the service changes in a way that breaks a client contract, the service’s build fails before the change is deployed.

I built an open-source framework specifically for this, api-mock-service, and described it in Contract Testing for REST APIs. The framework supports:

  • Recording real API traffic and generating mock contracts from it (no manual mock writing)
  • Replaying recorded responses in test environments
  • Validating that recorded behavior matches the current service
  • Contract assertions that run in CI/CD pipelines to catch regressions before deployment
  • Support for REST, gRPC, and asynchronous APIs

# Contract generated from real traffic, not hand-written
contract:
  name: create_order_success
  method: POST
  path: /v1/orders
  request:
    headers:
      Content-Type: application/json
    body:
      customer_id: "{{non_empty_string}}"
      items:
        - product_id: "{{non_empty_string}}"
          quantity: "{{positive_integer}}"
  response:
    status: 201
    body:
      order_id: "{{non_empty_string}}"
      status: PENDING
      created_at: "{{iso_timestamp}}"
  # This contract runs against the service in CI — if CreateOrder
  # changes its response shape, this test fails before deployment

Spec enforcement + contract testing = full boundary defense:

  • The OpenAPI or proto spec enforces what the service accepts
  • Contract tests verify what the service returns
  • Together they eliminate the “it works in mocks but breaks in production” class of failures

7.4 No API Versioning Strategy

This occurs when there is no version identifier, when there is a single v1 with no plan for v2, or when major version bumps are so frequent that clients cannot keep up. For example, Twitter’s v1.0 deprecation gave clients weeks, not months, and broke thousands of integrations.

Better approach: Version from day one in the URL path (/v1/, /v2/). Run old versions in parallel until usage is zero. Communicate sunset timelines with 12+ months’ notice.


7.5 Poor or Missing Documentation

Documentation covers only the happy path. No failure modes, retry semantics, or idempotency semantics documented. Field descriptions say “the order ID” rather than valid values and behavior when absent.

Documentation is a contract: every field, every failure mode, every error code must be documented. Consumer-driven contract tests are a forcing function.


7.6 Insufficient Rate Limiting and Quota Management

In this scenario, no per-account rate limits exist. Rate limits fixed in code, not configurable without deployment. One client’s traffic starves all others. Throttling responses use 500 instead of 429 Too Many Requests with Retry-After.

GitHub’s rate limiting is a reference implementation. X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers in every response allow clients to implement proactive backoff. 429 with Retry-After when the limit is hit.


7.7 Caching Without Security Consideration

Examples of this anti-pattern include a CDN that caches responses keyed only on URL, serving account A’s private data to account B, and a cache that stores authorization decisions without accounting for permission revocation.

Better approach: I described best practices of caches in When Caching is not a Silver Bullet. Cache keys must include all authorization context. Authorization decisions must have TTLs reflecting how quickly permission changes take effect. Cache poisoning must be in your threat model.
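
One way to fold authorization context into a cache key is sketched below. The cache_key helper and its parameters are hypothetical. Hashing the parts is an added assumption that keeps raw account identifiers out of the key itself.

```python
import hashlib


def cache_key(method, path, account_id, scopes):
    """Derive a cache key from the URL plus the caller's authorization context."""
    # Sort scopes so the same permissions always produce the same key.
    parts = [method.upper(), path, account_id, ",".join(sorted(scopes))]
    # Join with a separator that cannot appear in the parts, then hash.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Two accounts requesting the same URL now get different keys, so account B can never be served account A's cached response.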


7.8 No API Lifecycle Management and Missing Deprecation Path

This occurs when there is no process for retiring old API versions. Deprecated endpoints have no documented migration path. Or endpoints removed with insufficient notice. For example, Twilio’s classic API deprecation was managed over 18 months with migration guides, compatibility layers, and direct client outreach.

Better approach: Collect per-endpoint, per-client usage metrics before announcing deprecation. Block new clients. Provide migration docs and tooling. 12+ months’ lead time. Monitor until zero usage confirmed.


Quick Reference: Pre-Launch Checklist

API Design Philosophy

  • [ ] Spec written first (OpenAPI or proto) before any implementation code
  • [ ] OpenAPI/proto schema enforced at runtime boundary (PGV, openapi-validator)
  • [ ] API surface is small and composable; no UI-specific endpoints in the core API
  • [ ] Resources organized in a consistent URI hierarchy under namespaces
  • [ ] No bag-of-params / execute pattern; separate operations for separate actions
  • [ ] Standard protocol chosen (REST, gRPC, WebSocket, SSE), no custom RPC
  • [ ] Encoding chosen based on use case (protobuf binary for internal high-throughput)
  • [ ] Streaming APIs use gRPC streaming or WebSocket, not polling or custom framing

Contract & Consistency

  • [ ] Consistent naming vocabulary (nouns, verbs, field names, timestamps)
  • [ ] Correct HTTP verbs with documented semantics
  • [ ] No breaking changes without version bump
  • [ ] Hyrum’s Law review: what observable behaviors exist not in the contract?
  • [ ] Strict input validation on every field, every operation

Pagination & Filtering

  • [ ] Pagination on all list operations before first client, not after
  • [ ] Opaque, versioned, expiring, account-scoped pagination tokens
  • [ ] Filter semantics documented (AND across attributes, OR within values)

Idempotency & Transactions

  • [ ] clientToken on all create operations
  • [ ] Token mismatch returns 409 with conflicting resource ID
  • [ ] Transaction boundaries documented
  • [ ] PATCH implements partial update (field mask)
  • [ ] ETag / version token for optimistic concurrency

Error Handling

  • [ ] Structured error format with machine-readable codes
  • [ ] No internal implementation detail in error messages
  • [ ] Correct HTTP status codes; seven standard exception types
  • [ ] 404 vs 403: resource existence hidden from unauthorized callers

Security & Privacy

  • [ ] PII tagged in schema; data minimization applied per-endpoint
  • [ ] No PII in logs, error messages, or pagination tokens
  • [ ] PII scanning in CI/CD pipeline before deployment
  • [ ] Cache keys include authorization context

Resilience & Operations

  • [ ] Retry logic limited to idempotent or token-protected operations
  • [ ] Exponential backoff with jitter; Retry-After honored
  • [ ] Service starts despite all dependencies unavailable
  • [ ] Graceful shutdown tested (SIGTERM -> drain -> exit)
  • [ ] Pre-auth throttling + per-account quota + 429 with Retry-After
  • [ ] Three-layer health checks: live / ready / deep
  • [ ] Latency SLAs defined; alerts below SLA threshold
  • [ ] Phased deployment with automatic metric-gated rollback
  • [ ] Big Red Button identified, documented, and drill-tested
  • [ ] Backup incident communication channel tested independently

Contract Testing & Lifecycle

  • [ ] Contract tests generated from real traffic, run in CI/CD
  • [ ] API version in URL path (v1, v2) from day one
  • [ ] Documentation covers failure modes, idempotency, retry semantics
  • [ ] Usage metrics collected per endpoint for lifecycle decisions
  • [ ] Deprecation policy documented; sunset timelines published

Closing Thoughts

The anti-patterns above are based on my decades of experience building and operating high-traffic APIs. They share a common thread: they were invisible at design time, or the team assumed fixing them later would be cheaper. An idempotency contract is cheapest to design correctly before the first client. A spec-first approach catches URI design problems before any client builds against the wrong shape. A contract test catches breaking changes before deployment. The checklist above addresses these as a system because they compound. An unbounded response is worse with no pagination. A missing idempotency token is catastrophic with an aggressive retry policy. A leaky PII field is worse without boundary validation. Two practices matter more than any individual anti-pattern on this list:

  • Spec-first design: write the contract before writing the implementation. Review it with consumers before coding starts. Use it as the source of truth for both server stubs and client SDKs.
  • Contract testing: verify the contract continuously against the live service. Use recorded real traffic, not hand-written mocks. Run it in every CI/CD pipeline.

Further reading from this series:
