Shahzad Bhatti Welcome to my ramblings and rants!

June 27, 2026

Measuring Availability Properly: Percentiles, Tail Latency, and the Production Traps

Filed under: Computing — admin @ 1:18 pm

Observability is a key part of any infrastructure, but I’ve watched teams repeat the same mistakes when measuring availability. They track uptime and watch average latency. They run a TCP health check on port 80 and call it good. Then support learns about availability issues from customers while the health dashboard shows everything green. This post covers how to measure availability correctly: what signals to collect, how monitoring tools compute the rolling statistics you see, why percentiles beat averages, and what happens to tail latency at scale in microservices.


1. What Availability Actually Means

The textbook definition of availability is uptime, i.e., the fraction of time a service is running. From a user’s point of view, that definition splits into two separate questions: did the request succeed, and did it complete in time?

Availability = P(request succeeds AND completes within SLA)

A service that answers every request successfully but takes 30 seconds per response is functionally unavailable. Conversely, a service that responds in 5ms but returns errors to 50% of requests is also functionally unavailable.
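
To make the definition concrete, here is a minimal sketch that counts a request as "good" only if it both avoided a server error and finished within a latency threshold. The `Request` type, the 500ms default, and the choice to treat "no traffic" as fully available are my assumptions for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_s: float   # wall-clock time to complete, in seconds

def availability(requests: list[Request], latency_sla_s: float = 0.5) -> float:
    """Fraction of requests that avoided a server error AND met the latency SLA."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(
        1 for r in requests
        if r.status < 500 and r.duration_s <= latency_sla_s
    )
    return good / len(requests)

sample = [
    Request(200, 0.020),
    Request(200, 30.0),    # succeeded, but far too slow to count
    Request(503, 0.005),   # fast, but a server error
    Request(404, 0.010),   # client error: not held against the service
]
print(f"availability = {availability(sample):.2f}")  # 2 of 4 requests are good
```

Counting at the request level like this is what makes the two failure modes above show up in the same number.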


2. User Errors vs Server Errors — Why the Distinction Matters

This is the most commonly conflated measurement in production monitoring. HTTP status codes carry clear semantic meaning that should drive entirely different alert responses:

Code Range | Meaning           | Whose Fault? | Include in Availability?
2xx        | Success           | -            | Yes (success)
3xx        | Redirect          | -            | Usually ignored
4xx        | Client/user error | The caller   | No
5xx        | Server error      | Your service | Yes

4xx errors are client/user errors such as 400 Bad Request or 401 Unauthorized. 5xx errors mean the service itself is failing, such as 500 Internal Server Error or 503 Service Unavailable. There is one gray area: client timeouts. If your client times out after 5s while waiting for your 10s response, the client sees a 408 or a network error, which looks like a client-side failure even though the root cause is server-side latency. This is why tracking latency separately from error codes is essential.

from prometheus_client import Counter, Histogram

# Track errors with full status code granularity
request_counter = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code', 'status_class']
)

latency_histogram = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def record_request(method: str, endpoint: str, status: int, duration_s: float):
    status_class = f"{status // 100}xx"
    request_counter.labels(
        method=method,
        endpoint=endpoint,
        status_code=str(status),
        status_class=status_class
    ).inc()
    latency_histogram.labels(method=method, endpoint=endpoint).observe(duration_s)


# --- Prometheus queries that actually measure availability ---

# Server error rate (5xx only — excludes client errors)
SERVER_ERROR_RATE = """
sum(rate(http_requests_total{status_class="5xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
"""

# Availability (only penalize server errors)
AVAILABILITY = """
1 - (
  sum(rate(http_requests_total{status_class="5xx"}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
"""

# Client error rate (useful to watch, but not availability)
CLIENT_ERROR_RATE = """
sum(rate(http_requests_total{status_class="4xx"}[5m]))
/
sum(rate(http_requests_total[5m]))
"""

# Latency SLA compliance — fraction of requests completing within 500ms
LATENCY_SLA_COMPLIANCE = """
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
"""

A spike in 4xx that isn’t paired with a 5xx spike is almost certainly a misbehaving client, not your service. Alert on them differently: 5xx pages your on-call, 4xx goes to a ticket queue for review.


3. SLAs, SLOs, and Error Budgets

These three terms are used interchangeably in many organizations and they shouldn’t be.

  • SLA (Service Level Agreement) is a contractual commitment to external customers. Violating it has legal or financial consequences. Example: “We guarantee 99.9% availability per calendar month. If we breach this, we issue service credits.”
  • SLO (Service Level Objective) is an internal engineering target, usually tighter than the SLA. Example: “We target 99.95% availability.” The gap between SLO and SLA is your buffer.
  • Error Budget is what you get to spend before you breach your SLO. For a 99.9% SLO over 30 days:
Total minutes in 30 days = 30 × 24 × 60 = 43,200 minutes
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes

The error budget is your 43.2 minutes. Every minute of downtime spends from it. This reframes the conversation from “is the service up?” to “how fast are we burning through our budget?”

from datetime import datetime, timedelta

class ErrorBudget:
    """
    Track error budget consumption in real time.
    
    Example: 99.9% SLO over 30 days = 43.2 minutes of allowed downtime.
    """
    def __init__(self, slo_target: float, window_days: int = 30):
        self.slo_target = slo_target          # e.g., 0.999 for 99.9%
        self.window_minutes = window_days * 24 * 60
        self.allowed_downtime_minutes = self.window_minutes * (1 - slo_target)
        self.downtime_minutes_spent = 0.0
        self.start_time = datetime.now()

    def record_downtime(self, minutes: float):
        self.downtime_minutes_spent += minutes

    def budget_remaining_minutes(self) -> float:
        return max(0, self.allowed_downtime_minutes - self.downtime_minutes_spent)

    def budget_remaining_pct(self) -> float:
        return (self.budget_remaining_minutes() / self.allowed_downtime_minutes) * 100

    def burn_rate(self) -> float:
        """How fast are we burning budget vs. expected rate? 1.0 = on track, >1.0 = burning fast."""
        elapsed = (datetime.now() - self.start_time).total_seconds() / 60
        expected_spent = (elapsed / self.window_minutes) * self.allowed_downtime_minutes
        if expected_spent == 0:
            return 0.0
        return self.downtime_minutes_spent / expected_spent

    def summary(self) -> str:
        return (
            f"SLO: {self.slo_target*100:.2f}%  |  "
            f"Budget: {self.allowed_downtime_minutes:.1f} min  |  "
            f"Spent: {self.downtime_minutes_spent:.1f} min  |  "
            f"Remaining: {self.budget_remaining_pct():.1f}%  |  "
            f"Burn rate: {self.burn_rate():.2f}x"
        )

# Usage
budget = ErrorBudget(slo_target=0.999, window_days=30)
budget.record_downtime(minutes=12.5)   # incident on day 3
budget.record_downtime(minutes=8.0)    # incident on day 11
print(budget.summary())
# SLO: 99.90%  |  Budget: 43.2 min  |  Spent: 20.5 min  |  Remaining: 52.5%  |  Burn rate: ...

A burn rate above 1.0 means you’ll exceed your error budget before the window closes. A burn rate of 14.4x exhausts a 30-day budget in about 50 hours (720 hours / 14.4), roughly two days, which warrants a PagerDuty alert.
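
The relationship between burn rate and time-to-exhaustion is simple division, and a small sketch makes it easy to sanity-check alert thresholds. This assumes a full budget at the start and a burn rate that stays constant; `hours_to_exhaustion` is a hypothetical helper, not part of the `ErrorBudget` class above.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until a full error budget is gone if the burn rate stays constant."""
    if burn_rate <= 0:
        return float("inf")  # not burning at all: the budget never runs out
    return window_days * 24 / burn_rate

for rate in (1.0, 2.0, 6.0, 10.0):
    hours = hours_to_exhaustion(rate)
    print(f"burn rate {rate:>4.1f}x -> budget gone in {hours:6.1f} hours")
```

A burn rate of exactly 1.0 spends the budget over the full window (720 hours for 30 days); anything higher shortens that proportionally.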


4. The Health Check Anti-Pattern

I need to address something I’ve seen sink production deployments before we even get to metrics: health checks that only verify the process is listening on a port. A port check tells you the process hasn’t crashed. It tells you nothing about whether the process can serve traffic. I’ve seen this exact scenario: database connection pool was exhausted, port was open, load balancer marked the instance healthy, every request returned a 500. The monitoring was dark green the whole time.

A real health check must exercise the actual request path: connect to dependencies, perform a lightweight but genuine operation, return structured status. In Kubernetes this means a readiness probe hitting a /health endpoint that checks dependency connectivity. Critically, readiness and liveness are different probes:

  • Liveness: Is the process deadlocked? If not, keep it alive. If yes, kill and restart it.
  • Readiness: Can it serve traffic right now? If not, remove it from the load balancer pool, but don’t kill it.

A process that is alive but not ready (warming up a cache, waiting for a dependency) should fail readiness but pass liveness. Confusing the two causes cascading restarts during startup under load, a failure mode I’ve seen multiple times in prod. See my Zero-Downtime Services on Kubernetes and Istio post for the full treatment.
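
Here is a minimal, framework-free sketch of the two probes as plain functions returning an HTTP status and a structured body. The dependency checks are hypothetical callables that I made up for illustration; in a real service they would run a cheap query against the database, cache, and so on, and the functions would be wired to whatever HTTP framework you use.

```python
from typing import Callable

# Hypothetical dependency check: returns True when the dependency is usable.
DependencyCheck = Callable[[], bool]

def liveness() -> tuple[int, dict]:
    """Cheap check: if this handler can run at all, the process is not deadlocked."""
    return 200, {"status": "alive"}

def readiness(checks: dict[str, DependencyCheck]) -> tuple[int, dict]:
    """Exercise each dependency; any failure means 'take me out of rotation'."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as a failed dependency
    ready = all(results.values())
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), body

# Example: the process is up, but the database connection pool is exhausted.
checks = {"database": lambda: False, "cache": lambda: True}
print(liveness())
print(readiness(checks))
```

Note that liveness deliberately ignores dependencies: a database outage should pull the instance out of rotation, not trigger a restart loop.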


5. Why Average Latency Lies

Here’s a production story I’ve seen more than once. The team does an efficiency push: it optimizes the hot path and ships a 30% improvement in p50 latency. Dashboards celebrate, but three weeks later the p99 is back where it started. Why? The answer is queuing theory. Consider a single server with a queue in front of it (the classic M/M/1 model). Define utilization P as:

P = arrival rate / service rate

The average number of items in the system (in the queue plus being served) is:

E[N] = P / (1 - P)

This is not a linear relationship. It’s an asymptote that goes vertical as you approach full utilization:

P (utilization) | E[N] (avg items in system)
0.50 (50%)      | 1
0.80 (80%)      | 4
0.90 (90%)      | 9
0.95 (95%)      | 19
0.99 (99%)      | 99
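
The table can be reproduced in a few lines, which is handy for plugging in your own utilization numbers. This is a sketch of the single-server formula only; `avg_items_in_system` is a name I picked for illustration.

```python
def avg_items_in_system(utilization: float) -> float:
    """E[N] = P / (1 - P) for a single-server queue; requires 0 <= P < 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)

for p in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"P = {p:.2f}  ->  E[N] = {avg_items_in_system(p):5.1f}")
```

Going from 90% to 99% utilization adds only nine points of load but multiplies the average backlog by eleven.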

When you make the code faster (higher service-rate), P drops, and you slide left on this curve, i.e., fewer items queuing with lower tail latency. But then traffic grows or you reduce servers to “realize the savings.” P climbs back to where it was, and latency returns with it. The key lesson is that the average latency reflects the fast path but high-percentile latency (p99, p99.9) is extremely sensitive to queue depth. High percentile latency is a leading indicator that you’re approaching overload.

There’s a counterintuitive implication from this: p99 is a terrible way to measure whether your efficiency work succeeded. It’s so sensitive to the queuing nonlinearity that changes in utilization will swamp the signal from your actual code changes. For measuring efficiency, mean latency is actually better because it tracks the true cost of processing one request without queue effects. Use percentiles for alerting and use mean for efficiency measurement.


6. Percentiles From First Principles

Let’s go over percentiles from scratch, because monitoring tools throw around “p50”, “p99”, and “p99.9” without ever explaining what they represent, and misunderstanding them leads to misreading dashboards. Given a set of latency measurements, sort them from fastest to slowest. The Nth percentile is the value N% of the way through that sorted list, i.e., the value that N% of the measurements are at or below.

Latencies (ms):  [5, 7, 8, 9, 10, 11, 12, 13, 250, 400]
Sorted:          [5, 7, 8, 9, 10, 11, 12, 13, 250, 400]
                  ^           ^               ^
                  p10         p50             p90

p50 = 10ms  (50% of requests were at or below this latency)
p90 = 250ms (90% of requests were at or below this latency)
p99 = 400ms (99% of requests were at or below this latency)

What p99 tells you is: at most 1% of your requests see latency worse than this number. Equivalently, 99 out of every 100 requests complete at or below p99. The catch is that p99 is a single value: it says nothing about the shape of the distribution between p90 and p99. Latency can get dramatically worse for customers in that range without your p99 alarm firing.
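
The "sort and pick a position" definition can be written directly as a nearest-rank percentile. This is a sketch of one common convention; libraries such as NumPy default to linear interpolation between neighbors, so they can return slightly different values on small samples. `percentile_nearest_rank` is a hypothetical helper name.

```python
import math

def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Smallest value such that at least pct% of samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based position in the sorted list
    return ordered[rank - 1]

latencies_ms = [5, 7, 8, 9, 10, 11, 12, 13, 250, 400]
for pct in (50, 90, 99):
    print(f"p{pct} = {percentile_nearest_rank(latencies_ms, pct)} ms")
```

With only ten samples, every percentile above p90 lands on the single slowest request, which is why high percentiles are noisy at low traffic.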

import numpy as np

def explain_percentile(latencies_ms: list[float]):
    """Show what percentiles mean in plain English."""
    arr = np.array(sorted(latencies_ms))
    n = len(arr)
    
    stats = {
        "mean":  np.mean(arr),
        "p50":   np.percentile(arr, 50),
        "p90":   np.percentile(arr, 90),
        "p95":   np.percentile(arr, 95),
        "p99":   np.percentile(arr, 99),
        "p99.9": np.percentile(arr, 99.9),
        "max":   np.max(arr),
    }
    
    print(f"{'Statistic':<10} {'Value':>10}   Plain English")
    print("-" * 65)
    print(f"{'mean':<10} {stats['mean']:>10.1f}ms  Average — hides bimodal distributions")
    print(f"{'p50':<10} {stats['p50']:>10.1f}ms  Half of requests faster than this")
    print(f"{'p90':<10} {stats['p90']:>10.1f}ms  90% of requests faster than this")
    print(f"{'p95':<10} {stats['p95']:>10.1f}ms  95% of requests faster than this")
    print(f"{'p99':<10} {stats['p99']:>10.1f}ms  99% of requests faster than this")
    print(f"{'p99.9':<10} {stats['p99.9']:>10.1f}ms  999/1000 requests faster than this")
    print(f"{'max':<10} {stats['max']:>10.1f}ms  Worst single request (very noisy)")

# Simulate a bimodal latency distribution
# 95% fast requests (cache hit), 5% slow (cache miss + DB query)
import random
random.seed(42)
latencies = [
    random.gauss(10, 2) if random.random() > 0.05 else random.gauss(300, 40)
    for _ in range(1000)
]
explain_percentile(latencies)

Output:

Statistic       Value   Plain English
-----------------------------------------------------------------
mean             24.8ms  Average — hides bimodal distributions
p50              10.4ms  Half of requests faster than this
p90              12.1ms  90% of requests faster than this
p95              17.9ms  95% of requests faster than this
p99             302.1ms  99% of requests faster than this
p99.9           375.8ms  999/1000 requests faster than this
max             392.4ms  Worst single request (very noisy)

7. Moving Averages and Rolling Percentiles

When Grafana shows you a p99 or Datadog shows you an error rate, it’s not summing up all-time data. It’s computing over a rolling time window.

Simple Moving Average vs EWMA

A Simple Moving Average (SMA) gives equal weight to every sample in the window:

from collections import deque
import statistics

class SMA:
    """Simple Moving Average — every sample in the window weighted equally."""
    def __init__(self, window: int):
        self.buf = deque(maxlen=window)
    
    def add(self, v: float) -> float:
        self.buf.append(v)
        return statistics.mean(self.buf)

An Exponentially Weighted Moving Average (EWMA) gives more weight to recent samples, fading older ones smoothly:

class EWMA:
    """
    Exponentially Weighted Moving Average.
    alpha: 0 < alpha < 1
    - High alpha (e.g. 0.3): reacts fast, noisier
    - Low alpha  (e.g. 0.05): smoother, slower to detect changes
    
    The Unix load average is a classic EWMA. Prometheus rate() works over
    a fixed time window instead.
    """
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value: float | None = None

    def add(self, sample: float) -> float:
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

# Demonstrate: same spike, different alphas
spike_data = [10, 10, 10, 10, 250, 10, 10, 10, 10, 10]
slow_ewma = EWMA(alpha=0.05)
fast_ewma = EWMA(alpha=0.30)

print(f"{'Sample':>8} {'Value':>8} {'alpha=0.05':>10} {'alpha=0.30':>10}")
for i, v in enumerate(spike_data):
    print(f"{i:>8} {v:>8.0f} {slow_ewma.add(v):>10.1f} {fast_ewma.add(v):>10.1f}")

Output (selected rows):

  Sample    Value alpha=0.05 alpha=0.30
       0       10       10.0       10.0
       1       10       10.0       10.0
       4      250       22.0       82.0   --> fast alpha sees the spike much louder
       5       10       21.4       60.4   --> fast alpha drops 21.6 in one step, slow alpha only 0.6
       9       10       19.3       22.1   --> both are still above the 10 baseline

Rolling Percentile

Computing exact percentiles over a moving window requires keeping raw samples and re-sorting. For production scale, the T-Digest algorithm computes approximate percentiles with bounded memory. Here’s the conceptual version first:

import numpy as np
from collections import deque

class RollingPercentile:
    """
    Rolling percentile over a fixed window of recent samples.
    
    Production note: At high throughput, use T-Digest or DDSketch instead.
    Prometheus uses pre-defined histogram buckets + linear interpolation.
    """
    def __init__(self, window: int, pctile: float):
        self.buf = deque(maxlen=window)
        self.pctile = pctile

    def add(self, v: float) -> float | None:
        self.buf.append(v)
        if len(self.buf) < 2:
            return None
        return float(np.percentile(list(self.buf), self.pctile))

# Show how window size affects sensitivity
import random
random.seed(7)

data = [random.gauss(10, 2) for _ in range(90)] + \
       [random.gauss(200, 20) for _ in range(10)]  # degradation at t=90

p99_small  = RollingPercentile(window=20,  pctile=99)
p99_medium = RollingPercentile(window=100, pctile=99)

print("How window size affects p99 detection of a latency spike:")
print(f"{'t':>4} {'value':>8} {'p99 w=20':>12} {'p99 w=100':>12}")
for t, v in enumerate(data):        # feed every sample so both windows fill up
    small  = p99_small.add(v)
    medium = p99_medium.add(v)
    if t < 80 or small is None or medium is None:
        continue                    # only print the transition region
    marker = " --> spike starts" if t == 90 else ""
    print(f"{t:>4} {v:>8.1f} {small:>12.1f} {medium:>12.1f}{marker}")

Prometheus histogram vs. summary: Prometheus offers two ways to track latency. A Summary computes quantiles client-side over a rolling window but you can’t aggregate across instances. A Histogram records counts in pre-defined buckets and approximates quantiles server-side, which is slightly less accurate, but fully aggregatable. For microservices with multiple replicas, always use Histogram.
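
To see why histogram quantiles are approximate, here is a simplified sketch of the bucket-plus-linear-interpolation idea. It is my own illustration, not Prometheus source code: it assumes cumulative `(upper_bound, count)` buckets ending in a +Inf bucket, a lower bound of 0 for the first bucket, and observations spread evenly within each bucket.

```python
import math

def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """
    Estimate the q-quantile (0..1) from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
             ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound      # past the last finite bucket: clamp to it
            if count == prev_count:
                return bound           # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# 1000 requests: 900 finished within 100ms, 990 within 500ms, 2 took over 1s.
buckets = [(0.1, 900), (0.25, 950), (0.5, 990), (1.0, 998), (math.inf, 1000)]
print(f"p50   ~ {estimate_quantile(0.50, buckets):.3f}s")
print(f"p99   ~ {estimate_quantile(0.99, buckets):.3f}s")
print(f"p99.9 ~ {estimate_quantile(0.999, buckets):.3f}s")
```

The estimate can never be more precise than the bucket it falls in, so it pays to place bucket boundaries close to your SLA thresholds.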


8. Trimmed Mean: More Signal, Real Tradeoffs

Here’s the core difference between a percentile and a trimmed mean:

100 latency measurements, sorted by speed:

p99  = the single worst measurement in the best 99%
       (the 99th measurement out of 100, sorted fastest-to-slowest)

tm99 = the average of all 99 measurements in the best 99%
       (discard the 1 slowest, average the remaining 99)

tm99 summarizes 99 times more data than p99. That makes it more stable (less spiky under low traffic), harder to game (a gradual degradation can hide between percentile checkpoints, but tm99 will catch it), and more representative of typical customer experience.

  • tm99 tracks the average experience of your bulk of customers
  • TM(99%:) tracks the average of your slowest 1%; ensures outlier experience doesn’t silently worsen

Together these two numbers cover 100% of your requests with just two metrics.

import numpy as np

def compute_tm_stats(samples: list[float]) -> dict:
    """
    Compute a full suite of trimmed mean statistics.

    Syntax mirrors CloudWatch / AWS Embedded Metrics Format:
      tm99        = TM(0%:99%)  = average of fastest 99%
      TM(99%:)    = TM(99%:100%) = average of slowest 1%  
      TM(1%:99%)  = drop both extremes (handles unbounded latency)
      IQM         = TM(25%:75%) = Interquartile Mean
    """
    arr = np.sort(np.array(samples))
    n = len(arr)

    def tm(lower_pct: float, upper_pct: float) -> float:
        lo = np.percentile(arr, lower_pct)
        hi = np.percentile(arr, upper_pct)
        trimmed = arr[(arr >= lo) & (arr <= hi)]
        return float(np.mean(trimmed)) if len(trimmed) else float('nan')

    return {
        "mean":       float(np.mean(arr)),
        "p50":        float(np.percentile(arr, 50)),
        "p99":        float(np.percentile(arr, 99)),
        "tm99":       tm(0, 99),      # avg of fastest 99%
        "TM(99%:)":   tm(99, 100),    # avg of slowest 1%  --> watch your outliers here
        "TM(1%:99%)": tm(1, 99),      # drop both extremes (use for unbounded latency)
        "IQM":        tm(25, 75),     # interquartile mean
    }

# Scenario: a cache-miss spike where 2% of requests are slow
rng = np.random.default_rng(42)
fast = rng.normal(10, 1.5, 980)
slow = rng.normal(350, 30, 20)
samples = np.concatenate([fast, slow]).tolist()

stats = compute_tm_stats(samples)
print(f"{'Metric':<14} {'Value':>10}   Notes")
print("-" * 65)
for k, v in stats.items():
    notes = {
        "mean":       "Pulled up by slow tail: misleading",
        "p50":        "Median: fine but ignores tail",
        "p99":        "Single value at 99th position",
        "tm99":       "Average of fastest 99% --> primary SLO metric",
        "TM(99%:)":   "Average of slowest 1% --> outlier watchdog",
        "TM(1%:99%)": "Drops both extremes: good for browser metrics",
        "IQM":        "Middle 50% average: robust to both extremes",
    }.get(k, "")
    print(f"{k:<14} {v:>10.1f}ms  {notes}")

Output:

Metric              Value   Notes
-----------------------------------------------------------------
mean                16.8ms  Pulled up by slow tail: misleading
p50                  9.9ms  Median: fine but ignores tail
p99                335.2ms  Single value at 99th position
tm99                10.1ms  Average of fastest 99% --> primary SLO metric
TM(99%:)           351.4ms  Average of slowest 1% --> outlier watchdog
TM(1%:99%)          10.1ms  Drops both extremes: good for browser metrics
IQM                  9.8ms  Middle 50% average: robust to both extremes

Bounded vs. unbounded latency:

  • Bounded latency (server-side, with request timeouts): use tm99 + TM(99%:). Since latency is capped by your timeout, even the worst measurements are meaningful.
  • Unbounded latency (client-side browser metrics, user-perceived time): use TM(1%:99%). A user who closes their laptop mid-request and reopens it days later may log a latency of 230,400 seconds. These shouldn’t contaminate your outlier statistics. Drop the top and bottom extremes.

The key lesson is that percentiles create blind spots “between the checkpoints.” A degradation that affects the 40th–60th percentile range will move neither p25 nor p75 much. Trimmed mean, because it averages across the entire range, catches these shifts. However, trimmed mean has its own blind spot. It deliberately removes the part of the distribution that dominates user experience in fan-out architectures. The right answer is not to choose between percentiles and trimmed mean but use both.
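
The "between the checkpoints" blind spot is easy to demonstrate with synthetic data. In this sketch (made-up numbers, nearest-rank percentiles, and a simple tm99 that averages the fastest 99% of samples) only the middle of the distribution gets slower, and the p25 and p75 checkpoints do not move at all.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(pct * len(ordered) / 100) - 1]

def trimmed_mean(samples: list[float], upper_pct: float = 99) -> float:
    """Mean of the fastest upper_pct% of samples (tm99 by default)."""
    ordered = sorted(samples)
    keep = max(1, math.floor(upper_pct * len(ordered) / 100))
    return sum(ordered[:keep]) / keep

baseline = list(range(1, 101))   # 1..100 ms, one sample each
# Degrade only the middle of the distribution: samples 41-60 get 10 ms slower.
degraded = [x + 10 if 41 <= x <= 60 else x for x in baseline]

for name, data in (("baseline", baseline), ("degraded", degraded)):
    print(f"{name:<9} p25={percentile(data, 25)}  p75={percentile(data, 75)}  "
          f"tm99={trimmed_mean(data):.2f}")
```

A dashboard that only plots p25 and p75 would show two flat lines here, while tm99 rises by about 2 ms.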


10. Winsorized Mean, Percentile Rank, and IQM

These statistics show up in CloudWatch and modern observability platforms, and they each solve a specific problem.

Winsorized Mean (WM)

Like trimmed mean, but instead of discarding outliers, it replaces them with the boundary value. For wm99:

  • Find the value at the 99th percentile (= p99)
  • Treat all 1% outliers as if they had exactly that p99 value
  • Average all 100% of samples

import numpy as np

def winsorized_mean(samples: list[float], lower_pct: float = 0, upper_pct: float = 99) -> float:
    arr = np.array(samples, dtype=float)
    lo = np.percentile(arr, lower_pct)
    hi = np.percentile(arr, upper_pct)
    # Clip: anything below lo becomes lo, anything above hi becomes hi
    winsorized = np.clip(arr, lo, hi)
    return float(np.mean(winsorized))

Winsorized mean gives some weight to outliers without letting extreme values skew the average. The difference between tm99 and wm99 is subtle at high percentages: wm99 will be slightly higher because it keeps the outliers (clamped to the p99 value) rather than dropping them.
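
A tiny worked comparison shows how close the two are in practice. This sketch uses plain Python with a nearest-rank p99 boundary and made-up samples; `tm99_and_wm99` is a hypothetical helper, separate from the NumPy function above.

```python
def tm99_and_wm99(samples: list[float]) -> tuple[float, float]:
    """Return (trimmed mean, winsorized mean) using a nearest-rank p99 boundary."""
    ordered = sorted(samples)
    n = len(ordered)
    keep = max(1, (99 * n) // 100)     # how many samples fall in the fastest 99%
    boundary = ordered[keep - 1]       # the p99 value
    kept_sum = sum(ordered[:keep])
    trimmed = kept_sum / keep                               # outliers dropped
    winsorized = (kept_sum + boundary * (n - keep)) / n     # outliers clamped to p99
    return trimmed, winsorized

samples = [10.0] * 98 + [20.0, 5000.0]   # 100 samples with one extreme outlier
tm99, wm99 = tm99_and_wm99(samples)
mean = sum(samples) / len(samples)
print(f"mean={mean:.1f}  tm99={tm99:.2f}  wm99={wm99:.2f}")
```

The plain mean is dragged to 60ms by a single request, while both robust statistics stay near 10ms and differ from each other by about 0.1ms.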

Percentile Rank PR()

Percentile rank answers the inverse question from percentile. Percentile says: “What latency value marks the Nth percent?” Percentile rank says: “What percent of requests are below a given latency value?”

If you have an SLA of “respond within 500ms to 99% of users,” you’d normally monitor p99 and check that it’s <= 500ms. With Percentile Rank, you instead plot PR(:500ms), i.e., the percentage of requests completing within 500ms, and drive that number toward 99% or higher. This is more directly action-oriented: you always know exactly how far you are from your SLA.

def percentile_rank(samples: list[float], threshold: float) -> float:
    """What fraction of samples are at or below threshold?"""
    arr = np.array(samples)
    return float(np.mean(arr <= threshold) * 100)

# Example: SLA is p99 < 500ms
samples_ms = [10, 12, 9, 11, 450, 10, 13, 600, 11, 10]  # small sample
pr_500 = percentile_rank(samples_ms, 500)
print(f"PR(:500ms) = {pr_500:.1f}%  (SLA requires 99%)")
# PR(:500ms) = 90.0%  (SLA requires 99%) — you're 9 percentage points short

IQM (Interquartile Mean)

IQM is simply TM(25%:75%), the average of the middle 50% of samples, discarding the top and bottom 25%. It’s extremely robust to outliers in both directions, useful when you expect noise from both ends of the distribution (e.g., some requests are trivially fast cache hits, others are pathologically slow).


11. The Inspection Paradox: Your Users Experience Worse Than Your Metrics Show

As Marc Brooker has explained on his blog, this is the most underappreciated gap in distributed systems reliability. Say your service has outages with very different durations: some resolve in 30 seconds, but occasionally one runs for 3 hours. Your MTTR (Mean Time to Recovery) might work out to 5 minutes. But when a user hits your service during an outage, they’re more likely to land in a long outage than a short one, because long outages have more time slots for users to arrive in.

Customer-experienced mean recovery = (1/2) × (MTTR + Variance/MTTR)

The second term is what kills you. If your outage duration has high variance (fast recovery most of the time, but occasional 3-hour events), that variance term dominates. Your customers experience something dramatically worse than your MTTR.
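
Plugging numbers into the formula shows how quickly the variance term takes over. This sketch uses a made-up outage history and the population variance; `customer_mean_wait` is a hypothetical helper name.

```python
import statistics

def customer_mean_wait(outage_minutes: list[float]) -> float:
    """(1/2) * (MTTR + Variance / MTTR), using the population variance."""
    mttr = statistics.fmean(outage_minutes)
    variance = statistics.pvariance(outage_minutes)
    return 0.5 * (mttr + variance / mttr)

# Hypothetical history: 99 outages of 1 minute each, plus one 3-hour outage.
outages = [1.0] * 99 + [180.0]
print(f"Operator MTTR:         {statistics.fmean(outages):.2f} min")
print(f"Customer-experienced:  {customer_mean_wait(outages):.2f} min")
```

One long outage out of a hundred is enough to make the customer-experienced number roughly twenty times the MTTR.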

import random
import math
import statistics

def inspection_paradox_demo(
    median_recovery_min: float,
    p99_recovery_min: float,
    arrivals_per_min: float = 100,
    n_outages: int = 2000
) -> dict:
    """
    Simulate the gap between operator MTTR and customer-experienced recovery.
    
    Key insight: customers are time-weighted samplers of your outage distribution.
    A 10-minute outage gets sampled by ~10x as many clients as a 1-minute outage.
    """
    # Fit lognormal to median and p99
    mu = math.log(median_recovery_min)
    sigma = (math.log(p99_recovery_min) - mu) / 2.326

    server_durations = []
    client_wait_times = []

    for _ in range(n_outages):
        duration = random.lognormvariate(mu, sigma)
        server_durations.append(duration)

        # Clients arrive as a Poisson process during the outage
        t = random.expovariate(arrivals_per_min)   # time of the first arrival
        while t < duration:
            # This client arrived at time t and waits until the outage ends
            client_wait_times.append(duration - t)
            t += random.expovariate(arrivals_per_min)

    return {
        "operator_mttr":        statistics.mean(server_durations),
        "operator_p99":         sorted(server_durations)[int(len(server_durations) * 0.99)],
        "customer_mean_wait":   statistics.mean(client_wait_times) if client_wait_times else 0,
        "customer_p99_wait":    sorted(client_wait_times)[int(len(client_wait_times) * 0.99)] if client_wait_times else 0,
        "experience_gap_ratio": (statistics.mean(client_wait_times) / statistics.mean(server_durations)) if client_wait_times else 0,
    }

result = inspection_paradox_demo(
    median_recovery_min=1,    # median outage resolves in 1 minute
    p99_recovery_min=60,      # but 1% of outages take an hour
)

print("Scenario: 1-minute median recovery, 60-minute p99 recovery")
print()
print("What your on-call dashboard shows:")
print(f"  MTTR:              {result['operator_mttr']:.1f} minutes")
print(f"  p99 recovery:      {result['operator_p99']:.1f} minutes")
print()
print("What your customers actually experience:")
print(f"  Mean recovery:     {result['customer_mean_wait']:.1f} minutes")
print(f"  p99 recovery:      {result['customer_p99_wait']:.1f} minutes")
print(f"  Experience gap:    {result['experience_gap_ratio']:.1f}x worse than MTTR")

Output (one sample run; the simulation is unseeded, so the numbers vary):

Scenario: 1-minute median recovery, 60-minute p99 recovery

What your on-call dashboard shows:
  MTTR:              4.9 minutes
  p99 recovery:      56.6 minutes

What your customers actually experience:
  Mean recovery:     60.0 minutes
  p99 recovery:      797.3 minutes
  Experience gap:    12.1x worse than MTTR

This is why tail recovery time matters more than averages suggest. Timeout-and-retry can hide individual request latency, but it cannot hide recovery time. Once a client gets stuck in an outage, retries don’t shorten the outage, they just add load to an already struggling service. The right takeaway: minimize variance in recovery time, not just its mean. Bounded, predictable recovery is far better for customers than fast-average-but-occasional-disaster.


12. Tail Latency Amplifies in Microservices

Modern architectures decompose user requests into many service calls. This creates two topologies, and both amplify tail latency:

Fan-out math: If each service has a 1% probability of a slow response, the probability that at least one is slow when calling N services in parallel is:

P(at least one slow) = 1 - (1 - 0.01)^N

N (services called) | % of user requests seeing a slow response
1                   | 1.0%
5                   | 4.9%
10                  | 9.6%
25                  | 22.2%
50                  | 39.5%
100                 | 63.4%

What was a rare 1% tail now affects the majority of user interactions. And here’s the pernicious part: your per-service p99 metric looks perfectly fine. The damage is invisible at the service level, only visible at the user-experience level.
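
The closed-form version is worth having next to any simulation, because it can also be inverted: given a fan-out width, how rare does a slow response have to be per service to keep the user-level tail at 1%? This sketch assumes backends are slow independently of each other; both function names are mine.

```python
def p_any_slow(n_services: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of n independent parallel calls is slow."""
    return 1 - (1 - p_slow) ** n_services

def required_per_service_tail(n_services: int, target_user_tail: float = 0.01) -> float:
    """Per-service slow probability that keeps the user-level tail at the target."""
    return 1 - (1 - target_user_tail) ** (1 / n_services)

for n in (1, 5, 10, 25, 50, 100):
    print(f"N={n:>3}  P(any slow)={p_any_slow(n) * 100:5.1f}%  "
          f"per-service tail needed for 1% at user level="
          f"{required_per_service_tail(n) * 100:.3f}%")
```

At N=100, a 1% user-level tail requires each backend to be slow on roughly 1 request in 10,000, which is a p99.99 target rather than a p99 one.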

import numpy as np, random

def simulate_fanout(n_backends: int, tail_prob: float = 0.01, n_reqs: int = 20_000):
    """
    Simulate client experience when calling n_backends in parallel.
    Each backend: (1-tail_prob) chance of fast, tail_prob chance of slow.
    """
    results = []
    slow_count = 0
    for _ in range(n_reqs):
        latencies = []
        for _ in range(n_backends):
            if random.random() < tail_prob:
                latencies.append(random.gauss(250, 25))
                slow_count += 1
            else:
                latencies.append(random.gauss(10, 2))
        results.append(max(latencies))  # fan-out: wait for slowest
    
    arr = np.array(results)
    return {
        "p50":  np.percentile(arr, 50),
        "p99":  np.percentile(arr, 99),
        "mean": np.mean(arr),
        "pct_slow_user_requests": np.mean(arr > 50) * 100,
    }

print(f"{'N':>4} {'p50 (ms)':>10} {'p99 (ms)':>10} {'mean (ms)':>10} {'% users hit slow':>18}")
for n in [1, 5, 10, 25, 50, 100]:
    r = simulate_fanout(n)
    print(f"{n:>4} {r['p50']:>10.1f} {r['p99']:>10.1f} {r['mean']:>10.1f} {r['pct_slow_user_requests']:>18.1f}%")

The trimmed mean blind spot revisited. At N=50, nearly 40% of user requests are slow. But your per-service tm99 (averaging the best 99% of individual service calls) still looks great because it’s averaging the fast cluster. This is exactly the case where trimmed mean gives you false comfort. You need explicit end-to-end latency tracking at the user-request level, not just per-service tail tracking.


13. The Pooling Dividend: Why Redundancy Is Non-Linear

Adding servers doesn’t just increase capacity linearly; it also improves latency through pooling. This comes from the Erlang C model in queuing theory. For example, compare two designs running at the same per-server utilization:

  • Design A: 1 server at 80% utilization
  • Design B: 10 servers sharing load, each at 80% utilization

Design A has an 80% chance that any incoming request finds the server busy and has to queue. Design B has roughly a 41% chance. Double the fleet to 20 servers at the same 80% per-server utilization, and the queueing probability drops to about 26%; at 50 servers it falls below 9%. You’re getting better latency and better tail behavior at the same per-server cost.

import math

def erlang_c(c: int, rho: float) -> float:
    """
    Erlang C formula: probability an arriving request must queue
    (rather than being served immediately) in an M/M/c system.
    
    c: number of servers
    rho: per-server utilization (0 < rho < 1)
    """
    if c < 1 or not 0 < rho < 1:
        raise ValueError("need c >= 1 and 0 < rho < 1")
    a = c * rho  # total offered load
    
    # Denominator: states with a free server, plus the "all servers busy" term
    sum_term = sum(a**k / math.factorial(k) for k in range(c))
    last_term = (a**c / math.factorial(c)) * (1 / (1 - rho))
    
    return last_term / (sum_term + last_term)

print("Probability a request must queue before being served:")
print(f"{'Servers':>8} {'Utilization':>12} {'Queue prob':>12}   {'Queue %':>8}")
for c in [1, 2, 5, 10, 20, 50]:
    ec = erlang_c(c=c, rho=0.8)
    print(f"{c:>8} {'80%':>12} {ec:>12.4f}   {ec*100:>7.1f}%")

Probability a request must queue before being served:
 Servers  Utilization   Queue prob    Queue %
       1          80%       0.8000      80.0%
       2          80%       0.7111      71.1%
       5          80%       0.5541      55.4%
      10          80%       0.4092      40.9%
      20          80%       0.2561      25.6%
      50          80%       0.0870       8.7%

Most of the benefit materializes at modest fleet sizes. You don’t need to be at hyperscale to get pooling gains. A fleet of 5-10 servers sharing load through a proper load balancer will have dramatically better tail latency behavior than the same compute running as independent instances.
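
Queueing probability is only part of the picture; how long a request waits also shrinks with fleet size. The sketch below uses the standard M/M/c result that mean time in queue is C / (c·μ·(1 − ρ)), where C is the Erlang C probability and 1/μ is the mean service time. The function names and the 10 ms service time are assumptions for illustration, and the Erlang C value is computed through the Erlang B recurrence to avoid large factorials.

```python
def erlang_c_stable(c: int, rho: float) -> float:
    """Erlang C via the Erlang B recurrence (no factorials, no overflow)."""
    a = c * rho  # offered load in Erlangs
    b = 1.0      # Erlang B with zero servers
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    return b / (1.0 - rho * (1.0 - b))


def mean_queue_wait_ms(c: int, rho: float, service_ms: float) -> float:
    """Mean time a request spends waiting in queue in an M/M/c system."""
    return erlang_c_stable(c, rho) * service_ms / (c * (1.0 - rho))


SERVICE_MS = 10.0  # assumed mean service time
print(f"{'Servers':>8} {'Queue prob':>12} {'Mean wait (ms)':>16}")
for c in [1, 2, 5, 10, 20, 50]:
    pq = erlang_c_stable(c, 0.8)
    wait = mean_queue_wait_ms(c, 0.8, SERVICE_MS)
    print(f"{c:>8} {pq:>12.4f} {wait:>16.2f}")
```

Under these assumptions the mean wait falls much faster than the queueing probability, because a queued request in a larger pool is picked up by whichever of the c servers frees up first.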


14. Retries, Circuit Breakers, and the Amplification Trap

Retries protect against transient failures such as a GC pause, a brief network glitch, or a thundering herd. In past production deployments, I have used up to 3 retries with exponential backoff for idempotent read operations. The protection against false positives is real and worthwhile. But retries have a catastrophic failure mode: retry amplification.

If three layers of a call chain each make up to 3 attempts, a single user request can generate 3 × 3 × 3 = 27 actual requests to a struggling downstream service. This turns a partial overload into a total collapse. I’ve watched this happen in production: a service running at 60% capacity receives a burst of retries from a misbehaving upstream, immediately spikes to 200% load, and starts failing every request, which triggers more retries and closes the feedback loop.
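
The arithmetic generalizes to any chain depth. A small sketch, assuming every layer makes the same number of attempts and every attempt fails (the worst case); `worst_case_requests` is an illustrative name:

```python
def worst_case_requests(attempts_per_layer: int, layers: int) -> int:
    """Requests reaching the deepest service when every attempt at every layer fails."""
    return attempts_per_layer ** layers


for layers in (1, 2, 3, 4):
    total = worst_case_requests(3, layers)
    print(f"{layers} layer(s) x 3 attempts each -> {total} downstream requests")
```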

The mitigations are a retry budget and a circuit breaker:

import time
import threading
from collections import deque

class RetryBudget:
    """
    Limit total retry rate as a fraction of total traffic.
    If retries exceed the budget, fail fast instead of retrying.
    
    Classic mitigation for retry amplification.
    """
    def __init__(self, budget_fraction: float = 0.10, window_seconds: int = 60):
        self.budget_fraction = budget_fraction
        self.window = window_seconds
        self.total_requests: deque = deque()
        self.retry_requests: deque = deque()
        self._lock = threading.Lock()

    def _prune(self):
        cutoff = time.monotonic() - self.window
        while self.total_requests and self.total_requests[0] < cutoff:
            self.total_requests.popleft()
        while self.retry_requests and self.retry_requests[0] < cutoff:
            self.retry_requests.popleft()

    def record_request(self):
        with self._lock:
            self._prune()  # keep the window bounded even when nothing is retried
            self.total_requests.append(time.monotonic())

    def should_retry(self) -> bool:
        """Returns True if we have retry budget remaining."""
        with self._lock:
            self._prune()
            total = len(self.total_requests)
            retries = len(self.retry_requests)
            if total == 0:
                return True
            current_rate = retries / total
            if current_rate < self.budget_fraction:
                self.retry_requests.append(time.monotonic())
                return True
            return False  # budget exhausted — fail fast, don't amplify


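class CircuitBreaker:
    """
    Stop sending requests to a failing downstream.
    Transitions: CLOSED -> OPEN -> HALF_OPEN -> CLOSED
    """
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold: float = 0.5, cooldown_seconds: float = 30,
                 min_calls: int = 10, window_calls: int = 100):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.min_calls = min_calls        # don't trip on a tiny sample
        self.window_calls = window_calls  # tumbling window size for the failure ratio
        self.state = self.CLOSED
        self.failures = 0
        self.total = 0
        self.opened_at: float | None = None
        self.probe_in_flight = False
        self._lock = threading.Lock()

    def _reset(self, state: str):
        self.state = state
        self.failures = self.total = 0
        self.probe_in_flight = False
        self.opened_at = time.monotonic() if state == self.OPEN else None

    def _count(self, failed: bool):
        if self.state != self.CLOSED:
            return  # late result from a call made before the breaker opened
        self.total += 1
        self.failures += failed
        if self.total >= self.min_calls and self.failures / self.total >= self.failure_threshold:
            self._reset(self.OPEN)
        elif self.total >= self.window_calls:
            self.failures = self.total = 0  # start a fresh window

    def call_allowed(self) -> bool:
        with self._lock:
            if self.state == self.CLOSED:
                return True
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at < self.cooldown:
                    return False  # fail fast
                self.state = self.HALF_OPEN
            # HALF_OPEN: let exactly one probe through at a time
            if self.probe_in_flight:
                return False
            self.probe_in_flight = True
            return True

    def record_success(self):
        with self._lock:
            if self.state == self.HALF_OPEN:
                self._reset(self.CLOSED)  # probe succeeded
            else:
                self._count(failed=False)  # successes count toward the ratio too

    def record_failure(self):
        with self._lock:
            if self.state == self.HALF_OPEN:
                self._reset(self.OPEN)  # probe failed: restart the cooldown
            else:
                self._count(failed=True)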
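
The retry policy mentioned at the start of this section (up to 3 retries with exponential backoff) can be sketched as below. This is an illustration rather than the exact production code: `TransientError`, `call_with_backoff`, the 100 ms base delay, the 2 s cap, and the use of full jitter are all assumptions. In a real client, a retry budget check would gate each retry before the sleep.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_backoff(op, max_retries=3, base_s=0.1, cap_s=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call op(); on TransientError retry with capped exponential backoff.

    Only use this for idempotent operations. `sleep` and `rng` are
    injectable so the policy can be tested without real delays.
    """
    for attempt in range(max_retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter spreads out synchronized retries
```

Jitter matters here because clients that fail together otherwise retry together, which recreates the burst that caused the failure.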

Hedged requests are often better than retries for latency problems. Instead of waiting for a timeout and retrying, fire a second request after a short delay (say, the p90 latency), accept whichever responds first, and cancel the other. This cuts your tail exposure without amplifying load as aggressively: only the roughly 10% of requests slower than p90 ever trigger a second call, and typically one of the two responds quickly.
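
A minimal asyncio sketch of the idea, assuming the operation is idempotent and that `make_request` is a caller-supplied coroutine function (a hypothetical name). It returns whichever attempt finishes first, even if that attempt raised; a production version would probably fall back to the other attempt on error.

```python
import asyncio


async def hedged_call(make_request, hedge_delay_s: float):
    """Send one request; if it is still pending after hedge_delay_s, send a second."""
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # fast path: no extra load generated

    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it doesn't hold resources
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()
```

Setting `hedge_delay_s` to the observed p90 means the backup fires for about one request in ten, which is where the bounded load increase comes from.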


15. Synthetic Canaries in Production

Error rates and latency percentiles tell you what’s happening to real traffic, but only after users are affected. Synthetic canaries fill the gap: background processes that continuously exercise your API end-to-end, giving you an availability signal even at 3am when real traffic is low.

Key design decisions from production experience:

  • Test the full workflow, not just the health endpoint. A canary for a data API should create, read, update, and delete a record. One for an auth service should issue a token, validate it, and revoke it. Shallow canaries that only call GET /health will miss the exact failures that health check anti-patterns also miss.
  • Track first-attempt and final success separately. If 90% of your canary runs only succeed on the second retry, the final success rate looks fine but something is quietly broken. First-attempt success rate catches this.
  • Keep canary observability separate from production. Mixing them has two failure modes: canary failures inflate your production error rate, and canary successes can mask production degradation if canaries hit warm caches or a separate code path.
  • Account for canary bias. Canaries hit warm caches and have predictable access patterns. Their p99 is almost always better than real user p99. Use canary latency to detect regressions relative to a baseline, not to claim absolute performance numbers.
  • Use retries in canaries, but with a limit. Up to 3 retries prevents false positives from transient network blips. But record the retry count per run: a canary that regularly needs 2+ retries is a signal worth investigating even if it eventually succeeds.
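
The retry and first-attempt points above can be sketched as follows. `CanaryRun`, `run_canary`, and `summarize` are illustrative names, and `workflow` stands in for whatever end-to-end sequence the canary exercises; any exception is treated as a failed attempt.

```python
from dataclasses import dataclass


@dataclass
class CanaryRun:
    succeeded: bool
    attempts: int  # 1 means the first attempt succeeded


def run_canary(workflow, max_retries: int = 3) -> CanaryRun:
    """Run the workflow, retrying up to max_retries times, and record the attempt count."""
    for attempt in range(1, max_retries + 2):
        try:
            workflow()
            return CanaryRun(succeeded=True, attempts=attempt)
        except Exception:
            continue  # a real canary would also log the failure reason
    return CanaryRun(succeeded=False, attempts=max_retries + 1)


def summarize(runs):
    """Report first-attempt and final success as separate rates."""
    n = len(runs)
    return {
        "first_attempt_success_rate": sum(r.succeeded and r.attempts == 1 for r in runs) / n,
        "final_success_rate": sum(r.succeeded for r in runs) / n,
        "mean_attempts": sum(r.attempts for r in runs) / n,
    }
```

Emitting these to a namespace separate from production metrics keeps canary traffic from distorting the real error rate.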

16. Putting It All Together: A Layered Monitoring Strategy

After decades of building and operating distributed systems, here’s the monitoring architecture I’d deploy for any production service from day one:

| Metric | Why | Window | Alert Threshold |
| --- | --- | --- | --- |
| 5xx rate | Server failures | 1 min | > 0.1% |
| p99 latency | Tail experience, SLA | 1 min | > SLA value |
| Request volume | Silent failures | 1 min | Drop > 50% |
| tm99 latency | Bulk experience | 5 min | Trending up |
| TM(99%:) latency | Outlier watchdog | 5 min | Trending up |
| Error budget burn | SLO health | 1 hr | > 2x expected rate |
| p99.9 latency | Overload early warning | 15 min | Trending |
| Retry rate | Amplification risk | 5 min | > 10% of traffic |
| Canary first-attempt | End-to-end health | 1 min | < 95% |
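
As one illustration of how a row in this table becomes an alert condition, here is a sketch of the error budget burn check. It assumes burn rate is defined as the observed error rate divided by the error rate the SLO allows; the 99.9% SLO and the request counts are made-up inputs.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    allowed_error_rate = 1.0 - slo_target
    return (errors / requests) / allowed_error_rate


# Assumed 1-hour window: 100,000 requests, 250 server errors, 99.9% SLO.
rate = burn_rate(errors=250, requests=100_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, alert: {rate > 2.0}")
```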

Closing: The Number That Matters Most

After all of this, the insight that has most changed how I think about availability is this: your users don’t experience your MTTR. They experience a version of it weighted by how long outages last, which skews dramatically toward your worst events. A service with a 1-minute median recovery but occasional 2-hour outages will have customers experiencing something closer to hours, not minutes. For instance, assuming steady traffic, if 99 outages last one minute each and one lasts two hours, the duration-weighted average outage is about 66 minutes. The variance in your tail events matters more than the central tendency. This is why the tail cannot be trimmed away from your visibility.

Build observability that shows you the tail. Use redundancy and retries, but understand how they amplify under pressure. Run canaries that exercise the whole path. Track user errors and server errors separately. Keep SLO burn rate visible so you always know how much budget you’ve spent. And when your customers say the service is slow and your dashboard says everything is green, believe the customers.
