
November 22, 2025

Testing Distributed Systems Failures with Interactive Simulators

Filed under: Computing — admin @ 10:31 pm

Introduction

Building distributed systems means confronting failure modes that are nearly impossible to reproduce in development or testing environments. How do you test for metastable failures that only emerge under specific load patterns? How do you validate that your quorum-based system actually maintains consistency during network partitions? How do you catch cross-system interaction bugs when both systems work perfectly in isolation? Integration testing, performance testing, and chaos engineering all help, but they have limitations. For the past few years, I’ve been using simulation to validate boundary conditions that are hard to test in real environments. Interactive simulators let you tweak parameters, trigger failure scenarios, and see the consequences immediately through metrics and visualizations.

In this post, I will share four simulators I’ve built to explore the failure modes and consistency challenges that are hardest to test in real systems:

  1. Metastable Failure Simulator: Demonstrates how retry storms create self-sustaining collapse
  2. CAP/PACELC Consistency Simulator: Shows the real tradeoffs between consistency, availability, and latency
  3. CRDT Simulator: Explores conflict-free convergence without coordination
  4. Cross-System Interaction (CSI) Failure Simulator: Reveals how correct systems fail through their interactions

Each simulator is built on research findings and real-world incidents. The goal isn’t just to understand these failure modes intellectually, but to develop intuition through experimentation. All simulators are available at https://github.com/bhatti/simulators.


Part 1: Metastable Failures

The Problem: When Systems Attack Themselves

Metastable failures are particularly insidious because the initial trigger can be small and transient, yet the system remains degraded long after the trigger is gone. Research into metastable failures has shown that traditional fault tolerance mechanisms don’t protect against metastability because the failure is self-sustaining through positive feedback loops in retry logic and coordination overhead. The mechanics are deceptively simple (a minimal sketch follows the list):

  1. A transient issue (network blip, brief CPU spike) causes some requests to slow down
  2. Slow requests start timing out
  3. Clients retry timed-out requests, adding more load
  4. Additional load increases coordination overhead (locks, queues, resource contention)
  5. Higher overhead increases latency further
  6. More timeouts trigger more retries
  7. The system is now in a stable degraded state, even though the original trigger is gone
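
Here is a minimal back-of-the-envelope sketch of that loop (not the SimPy simulator itself); the capacity, timeout, and client-population numbers are illustrative assumptions:

# Toy model of the retry feedback loop (illustrative; the real simulator uses SimPy).
# A short spike pushes latency past the client timeout; every timeout is retried,
# and the retry load alone keeps the system pinned at capacity afterwards.
def retry_feedback(steps=25, baseline=80, spike=60, capacity=100,
                   timeout=1.0, client_population=400):
    retries = 0.0
    for t in range(steps):
        offered = baseline + (spike if 5 <= t < 10 else 0) + retries
        utilization = min(offered, capacity * 0.999) / capacity
        latency = 0.1 / (1 - utilization)             # grows sharply near capacity
        timed_out = offered if latency > timeout else 0.0
        retries = min(timed_out, client_population)   # retry load bounded by the client fleet
        print(f"t={t:2d} offered={offered:6.1f} latency={latency:7.2f}s retries={retries:6.1f}")

retry_feedback()

Even after the spike ends, offered load never falls back to the baseline: the retries alone keep the server saturated, which is exactly the metastable state the simulator lets you explore.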

For example, AWS Kinesis experienced a 7+ hour outage in 2020 where a transient metadata mismatch triggered retry storms across the fleet. Even after the original issue was fixed, the retry behavior kept the system degraded. The recovery required externally rate-limiting client retries.

How the Simulator Works

The metastable failure simulator models this feedback loop using discrete event simulation (SimPy). Here’s what it simulates:

Server Model:

  • Base latency: Time to process a request with no contention
  • Concurrency slope: Additional latency per concurrent request (coordination cost)
  • Capacity: Maximum concurrent requests before queueing
# Latency grows linearly with active requests
def current_latency(self):
    return self.base_latency + (self.active_requests * self.concurrency_slope)

Client Model:

  • Timeout threshold: When to give up on a request
  • Max retries: How many times to retry
  • Backoff strategy: Exponential backoff with jitter (configurable)
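
For reference, the backoff strategy the simulator enables is essentially “full jitter” exponential backoff; here is a minimal sketch (the function and parameter names are mine, not the simulator’s API):

import random

def backoff_delay(attempt, base=0.1, cap=10.0):
    """Full-jitter exponential backoff: sleep a random amount between 0 and
    min(cap, base * 2**attempt) so retries from many clients spread out
    instead of arriving in synchronized waves."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Delays for five successive retry attempts
for attempt in range(5):
    print(f"attempt {attempt}: sleep {backoff_delay(attempt):.3f}s "
          f"(ceiling {min(10.0, 0.1 * 2 ** attempt):.2f}s)")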

Load Patterns:

  • Constant: Steady baseline load
  • Spike: Sudden increase for a duration, then back to baseline
  • Ramp: Gradual increase and decrease

Key Parameters to Experiment With:

Parameter         | What It Tests                                 | Typical Values
server_capacity   | How many concurrent requests before queueing  | 20-100
base_latency      | Processing time without contention            | 0.1-1.0s
concurrency_slope | Coordination overhead per request             | 0.001-0.05s
timeout           | When clients give up                          | 1-10s
max_retries       | Retry attempts before failure                 | 0-5
backoff_enabled   | Whether to add jitter and delays              | True/False

What You Can Learn:

  1. Trigger a metastable failure: Set spike load high, timeout low, disable backoff → watch P99 latency stay high after spike ends
  2. See recovery with backoff: Same scenario but enable exponential backoff → system recovers when spike ends
  3. Understand the tipping point: Gradually increase concurrency slope → observe when retry amplification begins
  4. Test admission control: Set low server capacity → see benefit of failing fast vs queueing

The simulator tracks success rate, retry count, timeout count, and latency percentiles over time, letting you see exactly when the system tips into metastability and whether it recovers. With this simulator you can validate various prevention strategies such as:

  • Exponential backoff with jitter spreads retries over time
  • Adaptive retry budgets limit total fleet-wide retries
  • Circuit breakers detect patterns and stop retry storms
  • Load shedding rejects requests before queues explode
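
One of these, the adaptive retry budget, can be sketched as a token bucket in which successful requests earn fractional retry tokens; this is a hedged illustration of the idea, not the simulator’s implementation:

class RetryBudget:
    """Adaptive retry budget: each success earns `ratio` retry tokens, and a
    retry is only attempted if a whole token is available, so fleet-wide
    retries are bounded to roughly ratio * (success rate)."""

    def __init__(self, ratio=0.1, max_tokens=100.0):
        self.ratio = ratio
        self.max_tokens = max_tokens
        self.tokens = max_tokens

    def record_success(self):
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def can_retry(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # drop the retry rather than amplify load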

Part 2: CAP and PACELC

The CAP theorem correctly states that during network partitions, you must choose between consistency and availability. However, as Daniel Abadi and others have pointed out, this only addresses partition scenarios. Most systems spend 99.99% of their time in normal operation, where the real tradeoff is between latency and consistency. This is where PACELC comes in:

  • If Partition happens: choose Availability or Consistency
  • Else (normal operation): choose Latency or Consistency

PACELC provides a more complete framework for understanding real-world distributed databases:

PA/EL Systems (DynamoDB, Cassandra, Riak):

  • Partition → Choose Availability (serve stale data)
  • Normal → Choose Latency (1-2ms reads from any replica)
  • Use when: Shopping carts, session stores, high write throughput needed

PC/EC Systems (Google Spanner, VoltDB, HBase):

  • Partition → Choose Consistency (reject operations)
  • Normal → Choose Consistency (5-100ms for quorum coordination)
  • Use when: Financial transactions, inventory, anything that can’t be wrong

PA/EC Systems (MongoDB):

  • Partition → Choose Availability (with caveats – unreplicated writes go to rollback)
  • Normal → Choose Consistency (strong reads/writes in baseline)
  • Use when: Mixed workloads with mostly consistent needs

PC/EL Systems (PNUTS):

  • Partition → Choose Consistency
  • Normal → Choose Latency (async replication)
  • Use when: Read-heavy with timeline consistency acceptable

Quorum Consensus: Strong Consistency with Coordination

When R + W > N (read quorum + write quorum > total replicas), the read and write sets must overlap in at least one node. This overlap ensures that any read sees at least one node with the latest write, providing linearizability.

Example with N=5, R=3, W=3:

  • Write to replicas {1, 2, 3}
  • Read from replicas {2, 3, 4}
  • Overlap at {2, 3} guarantees we see the latest value
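
A quick way to convince yourself of the overlap rule is to enumerate quorums; the snippet below (illustrative, not part of the simulator) checks that every read quorum intersects every write quorum exactly when R + W > N:

from itertools import combinations

def quorums_overlap(n, r, w):
    """Return True if every read quorum of size r intersects every write
    quorum of size w over n replicas."""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

print(quorums_overlap(5, 3, 3))  # True:  R + W = 6 > N = 5
print(quorums_overlap(5, 2, 3))  # False: R + W = 5 = N, quorums can miss each other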

Critical Nuances:

R + W > N alone is NOT sufficient for linearizability in practice. You need additional mechanisms: readers must perform read repair synchronously before returning results, and writers must read the latest state from a quorum before writing. “Last write wins” based on wall-clock time breaks linearizability due to clock skew. Sloppy quorums like those used in Dynamo are NOT linearizable because the nodes in the quorum can change during failures. Even R = W = N doesn’t guarantee consistency if cluster membership changes. Google Spanner uses atomic clocks and GPS to achieve strong consistency globally, with TrueTime API providing less than 1ms clock uncertainty at the 99th percentile as of 2023.

How the Simulator Works

The CAP/PACELC simulator lets you explore these tradeoffs by configuring different consistency models and observing their behavior during normal operation and network partitions.

System Model:

  • N replica nodes, each with local storage
  • Configurable schema for data (to test compatibility)
  • Network latency between nodes (WAN vs LAN)
  • Optional partition mode (splits cluster)

Consistency Levels:

  1. Strong (R+W>N): Quorum reads and writes, linearizable
  2. Linearizable (R=W=N): All nodes must respond, highest consistency
  3. Weak (R=1, W=1): Single node, eventual consistency
  4. Eventual: Async replication, high availability
def get_quorum_size(self, operation_type):
    if self.consistency_level == ConsistencyLevel.STRONG:
        return (self.n_nodes // 2) + 1  # Majority
    elif self.consistency_level == ConsistencyLevel.LINEARIZABLE:
        return self.n_nodes  # All nodes
    elif self.consistency_level == ConsistencyLevel.WEAK:
        return 1  # Any node
    else:
        return 1  # EVENTUAL: any node, replication happens asynchronously

Key Parameters:

Parameter         | What It Tests         | Impact
n_nodes           | Replica count         | More nodes = more fault tolerance but higher coordination cost
consistency_level | Strong/Eventual/etc.  | Directly controls latency vs consistency tradeoff
base_latency      | Node processing time  | Baseline performance
network_latency   | Inter-node delay      | WAN (50-150ms) vs LAN (1-10ms) dramatically affects quorum cost
partition_active  | Network partition     | Tests CAP behavior (A vs C during partition)
write_ratio       | Read/write mix        | Write-heavy shows coordination bottleneck

What You Can Learn:

  1. Latency cost of consistency:
    • Run with Strong (R=3,W=3) at network_latency=5ms → ~15ms operations
    • Same at network_latency=100ms → ~300ms operations
    • Switch to Weak (R=1,W=1) → single-digit milliseconds regardless
  2. CAP during partitions:
    • Enable partition with Strong consistency → operations fail (choosing C over A)
    • Enable partition with Eventual → stale reads but available (choosing A over C)
  3. Quorum size tradeoffs:
    • Linearizable (R=W=N) → single node failure breaks everything
    • Strong (R=W=3 of N=5) → can tolerate 2 node failures
    • Measure failure rate vs consistency guarantees
  4. Geographic distribution:
    • Network latency 10ms (same datacenter) → quorum cost moderate
    • Network latency 150ms (cross-continent) → quorum cost severe
    • Observe when you should use eventual consistency for geo-distribution

The simulator tracks write/read latencies, inconsistent reads, failed operations, and success rates, giving you quantitative data on the tradeoffs.

Key Insights from Simulation

The simulator reveals that most architectural decisions are driven by normal operation latency, not partition handling. If you’re building a global system with 150ms cross-region latency, strong consistency means every operation takes 150ms+ for quorum coordination. That’s often unacceptable for user-facing features. This is why hybrid approaches are becoming standard: use strong consistency for critical invariants (financial transactions, inventory), eventual consistency for everything else (user profiles, preferences).


Part 3: CRDTs

CRDTs (Conflict-Free Replicated Data Types) provide strong eventual consistency (SEC) through mathematical guarantees, not probabilistic convergence. They work without coordination, consensus, or concurrency control. CRDTs rely on operations being commutative (order doesn’t matter), merge functions being associative and idempotent (forming a semilattice), and updates being monotonic according to a partial order.

Example: G-Counter (Grow-Only Counter)

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count
    
    def increment(self, amount=1):
        # Each replica tracks its own increments
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount
    
    def value(self):
        # Total is sum of all replicas
        return sum(self.counts.values())
    
    def merge(self, other):
        # Take max of each replica's count
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

Why this works:

  • Each replica only increments its own counter (no conflicts)
  • Merge takes max (idempotent: max(a,a) = a)
  • Order doesn’t matter: max(max(a,b),c) = max(a,max(b,c))
  • Eventually all replicas see all increments → convergence
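
A tiny usage example of the GCounter class above shows two replicas incrementing independently and converging once they merge, in either order:

a, b = GCounter("A"), GCounter("B")
a.increment(3)   # replica A records 3 increments under its own id
b.increment(5)   # replica B records 5 increments under its own id

a.merge(b)       # merge in either direction...
b.merge(a)
a.merge(b)       # ...and merging again changes nothing (idempotent)

assert a.value() == b.value() == 8  # both replicas converge to the same total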

CRDT Types

There are two main approaches: state-based CRDTs (CvRDTs) send the full local state and require the merge function to be commutative, associative, and idempotent, while operation-based CRDTs (CmRDTs) transmit only update operations and require reliable delivery in causal order. A third variant, delta-state CRDTs, combines the advantages of both by transmitting compact deltas.

Four CRDTs in the Simulator:

  1. G-Counter: Increment only, perfect for metrics
  2. PN-Counter: Increment and decrement (two G-Counters)
  3. OR-Set: Add/remove elements, concurrent add wins
  4. LWW-Map: Last-write-wins with timestamps
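
The PN-Counter in the list above is literally two G-Counters; here is a minimal sketch, reusing the GCounter class shown earlier:

class PNCounter:
    """PN-Counter: a pair of G-Counters, one counting increments (P) and one
    counting decrements (N). Value is P - N; merge merges both halves."""

    def __init__(self, replica_id):
        self.p = GCounter(replica_id)
        self.n = GCounter(replica_id)

    def increment(self, amount=1):
        self.p.increment(amount)

    def decrement(self, amount=1):
        self.n.increment(amount)

    def value(self):
        return self.p.value() - self.n.value()

    def merge(self, other):
        self.p.merge(other.p)
        self.n.merge(other.n)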

Production systems using CRDTs include Redis Enterprise (CRDBs), Riak, Azure Cosmos DB for distributed data types, and Automerge/Yjs for Google Docs-style collaborative editing. SoundCloud has used CRDTs in its audio distribution platform.

Important Limitations

CRDTs only provide eventual consistency, NOT strong consistency or linearizability. Different replicas can see concurrent operations in different orders temporarily. Not all operations are naturally commutative, and CRDTs cannot solve problems requiring atomic coordination like preventing double-booking without additional mechanisms.

The “Shopping Cart Problem”: You can use an OR-Set for shopping cart items, but concurrent adds and removes of the same item can merge into a surprising result (for example, an item a user deleted reappearing because a concurrent add wins). The CRDT guarantees convergence to a consistent state, but that state might not match user expectations.

Byzantine fault tolerance is also a concern as traditional CRDTs assume all devices are trustworthy. Malicious devices can create permanent inconsistencies.

How the Simulator Works

The CRDT simulator demonstrates convergence through gossip-based replication. You can watch replicas diverge and converge as they exchange state.

Simulation Model:

  • Multiple replica nodes, each with independent CRDT state
  • Operations applied to random replicas (simulating distributed clients)
  • Periodic “merges” (gossip protocol) with probability merge_probability
  • Network delay between merges
  • Tracks convergence: do all replicas have identical state?

CRDT Implementations: Each CRDT type has its own semantics:

# G-Counter: Each replica has its own count, merge takes max
def merge(self, other):
    for replica_id, count in other.counts.items():
        self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

# OR-Set: Elements have unique tags, add always beats remove
def add(self, element, unique_tag):
    self.elements[element].add(unique_tag)

def remove(self, element, observed_tags):
    self.elements[element] -= observed_tags  # Only remove what was observed

# LWW-Map: Latest timestamp wins
def set(self, key, value, timestamp):
    current = self.entries.get(key)
    if current is None or timestamp > current[1]:
        self.entries[key] = (value, timestamp, self.replica_id)

Key Parameters:

Parameter         | What It Tests                    | Values
crdt_type         | Different convergence semantics  | G-Counter, PN-Counter, OR-Set, LWW-Map
n_replicas        | Number of nodes                  | 2-8
n_operations      | Total updates                    | 10-100
merge_probability | Gossip frequency                 | 0.0-1.0
network_delay     | Time for state exchange          | 0.0-2.0s

What You Can Learn:

  1. Convergence speed:
    • Set merge_probability=0.1 → slow convergence, replicas stay diverged
    • Set merge_probability=0.8 → fast convergence
    • Understand gossip frequency vs consistency window tradeoff
  2. OR-Set semantics:
    • Watch concurrent add/remove → add wins
    • See how unique tags prevent unintended deletions
    • Compare with naive set implementation
  3. LWW-Map data loss:
    • Two replicas set same key concurrently with different values
    • One value “wins” based on timestamp (or replica ID tie-break)
    • Data loss is possible – not suitable for all use cases
  4. Network partition tolerance:
    • Low merge probability simulates partition
    • Replicas diverge but operations still succeed (AP in CAP)
    • After “partition heals” (merges resume), all converge
    • No coordination needed, no operations failed

The simulator visually shows replica states over time and convergence status, making abstract CRDT theory concrete.

Key Insights from Simulation

CRDTs trade immediate consistency for availability and partition tolerance. The theoretical guarantees are proven: if all replicas receive all updates (eventual delivery), they will converge to the same state (strong convergence).

But the simulator reveals the practical challenges:

  • Merge semantics don’t always match user intent (LWW can lose data)
  • Tombstones can grow indefinitely (OR-Set needs garbage collection)
  • Causal ordering adds complexity (need vector clocks for some CRDTs)
  • Not suitable for operations requiring coordination (uniqueness constraints, atomic updates)

When to use CRDTs:

  • High-write distributed counters (page views, analytics)
  • Collaborative editing (where eventual consistency is acceptable)
  • Offline-first applications (sync when online)
  • Shopping carts (with careful semantic design)

When NOT to use CRDTs:

  • Bank account balances (need atomic transactions)
  • Inventory (can’t prevent overselling without coordination)
  • Unique constraints (usernames, reservation systems)
  • Access control (need immediate consistency)

Part 4: Cross-System Interaction (CSI) Failures

Research from EuroSys 2023 found that 20% of catastrophic cloud incidents and 37% of failures in major open-source distributed systems are CSI failures – where both systems work correctly in isolation but fail when connected. This is the NASA Mars Climate Orbiter problem: one team used metric units, another used imperial. Both systems worked perfectly. The spacecraft burned up in Mars’s atmosphere because of their interaction.

Why CSI Failures Are Different

Not dependency failures: The downstream system is available, it just can’t process what upstream sends.

Not library bugs: Libraries are single-address-space and well-tested. CSI failures cross system boundaries where testing is expensive.

Not component failures: Each system passes its own test suite. The bug only emerges through interaction.

CSI failures manifest across three planes: Data plane (51% – schema/metadata mismatches), Management plane (32% – configuration incoherence), and Control plane (17% – API semantic violations).

For example, a study of the Apache Spark-Hive integration found 15 distinct discrepancies in simple write-read testing. Hive stored timestamps as long (milliseconds since epoch) while Spark expected a Timestamp type; both worked in isolation but failed when integrated. In another case, Kafka and Flink disagreed on encoding: Kafka set compression.type=lz4, but Flink couldn’t decompress it because of an old LZ4 library. The configuration was silently ignored in Flink, leading to data corruption for two weeks before detection.

Why Testing Doesn’t Catch CSI Failures

Analysis of Spark found only 6% of integration tests actually test cross-system interaction. Most “integration tests” test multiple components of the same system. Cross-system testing is expensive and often skipped. The problem compounds with modern architectures:

  • Microservices: More system boundaries to test
  • Multi-cloud: Different clouds with different semantics
  • Serverless: Fine-grained composition increases interaction surface area

How the Simulator Works

The CSI failure simulator models two systems exchanging data, with configurable discrepancies in schemas, encodings, and configurations.

System Model:

  • Two systems (upstream → downstream)
  • Each has its own schema definition (field types, encoding, nullable fields)
  • Each has its own configuration (timeouts, retry counts, etc.)
  • Data flows from System A to System B with potential conversion failures

Failure Scenarios:

  1. Metadata Mismatch (Hive/Spark):
    • System A: timestamp: long
    • System B: timestamp: Timestamp
    • Failure: Type coercion fails ~30% of the time
  2. Schema Conflict (Producer/Consumer):
    • System A: encoding: latin-1
    • System B: encoding: utf-8
    • Failure: Silent data corruption
  3. Configuration Incoherence (ServiceA/ServiceB):
    • System A: max_retries=3, timeout=30s
    • System B expects: max_retries=5, timeout=60s
    • Failure: ~40% of requests fail due to premature timeout
  4. API Semantic Violation (Upstream/Downstream):
    • Upstream assumes: synchronous, thread-safe
    • Downstream is: asynchronous, not thread-safe
    • Failure: Race conditions, out-of-order processing
  5. Type Confusion (SystemA/SystemB):
    • System A: amount: float
    • System B: amount: decimal
    • Failure: Precision loss in financial calculations

Implementation Details:

import random

class DataSchema:
    def __init__(self, schema_id, fields, encoding, nullable_fields):
        self.schema_id = schema_id
        self.fields = fields  # field_name -> type
        self.encoding = encoding
        self.nullable_fields = nullable_fields

    def is_compatible(self, other):
        # Check field types and encoding
        return (self.fields == other.fields and
                self.encoding == other.encoding)

class DataRecord:
    def __init__(self, schema, data):
        self.schema = schema
        self.data = data  # field_name -> value

    def serialize(self, target_schema):
        # Attempt type coercion field by field
        for field, value in self.data.items():
            expected_type = target_schema.fields[field]
            actual_type = self.schema.fields[field]

            if expected_type != actual_type:
                # 30% failure on type mismatch (simulating real world)
                if random.random() < 0.3:
                    return None  # Serialization failure

        # Check encoding compatibility
        if self.schema.encoding != target_schema.encoding:
            if random.random() < 0.2:  # 20% silent corruption
                return None

        return dict(self.data)  # Transfer succeeded
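
A hedged usage sketch of these classes, with illustrative schema and field names, reproduces the Hive/Spark-style timestamp mismatch: most records cross the boundary, but a fraction fail intermittently.

# Illustrative schemas: the producer stores timestamps as long, the consumer expects Timestamp
hive_schema  = DataSchema("hive",  {"ts": "long"},      "utf-8", set())
spark_schema = DataSchema("spark", {"ts": "Timestamp"}, "utf-8", set())

record = DataRecord(hive_schema, {"ts": 1700000000000})
failures = sum(record.serialize(spark_schema) is None for _ in range(1000))
print(f"{failures / 10:.1f}% of transfers failed")  # roughly 30%, and intermittent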

Key Parameters:

Parameter        | What It Tests
failure_scenario | Type of CSI failure (metadata, schema, config, API, type)
duration         | Simulation length
request_rate     | Load (requests per second)

The simulator doesn’t have many tunable parameters because CSI failures are about specific incompatibilities, not gradual degradation. Each scenario models a real-world pattern.

What You Can Learn:

  1. Failure rates: CSI failures often manifest in 20-40% of requests (not 100%)
    • Some requests happen to have compatible data
    • Makes debugging harder (intermittent failures)
  2. Failure location:
    • Research shows 69% of CSI fixes go in the upstream system, often in connector modules that are less than 5% of the codebase
    • Simulator shows which system fails (usually downstream)
  3. Silent vs loud failures:
    • Type mismatches often crash (loud, easy to detect)
    • Encoding mismatches corrupt silently (hard to detect)
    • Config mismatches cause intermittent timeouts
  4. Prevention effectiveness:
    • Schema registry eliminates metadata mismatches
    • Configuration validation catches config incoherence
    • Contract testing prevents API semantic violations

Key Insights from Simulation

The simulator demonstrates that cross-system integration testing is essential but often skipped. Unit tests of each system won’t catch these failures.

Prevention strategies validated by simulation:

  1. Write-Read Testing: Write with System A, read with System B, verify integrity
  2. Schema Registry: Single source of truth for data schemas, enforced across systems
  3. Configuration Coherence Checking: Validate that shared configs match
  4. Contract Testing: Explicit, machine-checkable API contracts
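
As an example of the first strategy, a minimal write-read test might look like the sketch below; system_a.write and system_b.read are placeholders for whatever clients you use, not a real API:

def test_write_read_roundtrip(system_a, system_b):
    # Write with System A, read with System B, verify the record survives the
    # boundary crossing intact (values, types, precision, encoding).
    original = {"id": 42, "ts": 1700000000000, "amount": "19.99"}

    system_a.write("orders", original)         # producer side
    result = system_b.read("orders", key=42)   # consumer side

    assert result is not None, "record lost crossing the system boundary"
    for field, value in original.items():
        assert result[field] == value, f"{field} changed: {value!r} -> {result[field]!r}"
        assert type(result[field]) is type(value), f"type of {field} changed"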

Hybrid Consistency Models

Modern systems increasingly use mixed consistency: RedBlue Consistency (2012) marks operations as needing strong consistency (red) or eventual consistency (blue). Replicache (2024) has the server assign the final total order while clients do optimistic local updates with rebase. For example, consider a calendar application:

# Strong consistency for room reservations (prevent double-booking)
def book_conference_room(room_id, time_slot):
    with transaction(consistency='STRONG'):
        if room.is_available(time_slot):
            room.book(time_slot)
            return True
        return False

# CRDTs for collaborative editing (participant lists, notes)
def update_meeting_notes(meeting_id, notes):
    # LWW-Map CRDT, eventual consistency
    meeting.notes.merge(notes)

# Eventual consistency for preferences
def update_user_calendar_color(user_id, color):
    # Who cares if this propagates slowly?
    user_prefs[user_id] = color

Recent theoretical work on the CALM theorem proves that coordination-free consistency is achievable for certain problem classes. Research in 2025 provided mathematical definitions of when coordination is and isn’t required, separating coordination from computation.

What the Simulators Teach Us

Running all four simulators reveals the consistency spectrum:

No “best” consistency model exists:

  • Quorums are best when you need linearizability and can tolerate latency
  • CRDTs are best when you need high availability and can tolerate eventual consistency
  • Neither approach “bypasses” CAP – they make different tradeoffs
  • Real systems use hybrid models with different consistency for different operations

Practical Lessons

1. Design for Recovery, Not Just Prevention

The metastable failure simulator shows you can’t prevent all failures. Your retry logic, backoff strategy, and circuit breakers are more important than your happy path code. Validated strategies include:

  • Exponential backoff with jitter (spread retries over time)
  • Adaptive retry budgets (limit total fleet-wide retries)
  • Circuit breakers (detect patterns, stop storms)
  • Load shedding (fail fast rather than queue to death)
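
Of these, the circuit breaker is the one teams most often get subtly wrong; here is a minimal sketch of the pattern (thresholds and names are illustrative):

import time

class CircuitBreaker:
    """Open the circuit after consecutive failures, fail fast while open,
    then allow a single probe request after a cooldown (half-open)."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result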

2. Understand the Consistency Spectrum

The CAP/PACELC simulator demonstrates that consistency is not binary. You need to understand:

  • What consistency level do you actually need? (Most operations don’t need linearizability)
  • What’s the latency cost? (Quorum reads in cross-region deployment can be 100x slower)
  • What happens during partitions? (Can you sacrifice availability or must you serve stale data?)

Decision framework:

  • Use strong consistency for: money, inventory, locks, compliance
  • Use eventual consistency for: feeds, catalogs, analytics, caches
  • Use hybrid models for: most real-world applications

3. Test Cross-System Interactions

The research behind the CSI failure simulator found that 86% of fixes go into connector modules that are less than 5% of your codebase. This is where bugs hide. Essential tests include:

  • Write-read tests (write with System A, read with System B)
  • Round-trip tests (serialize/deserialize across boundaries)
  • Version compatibility matrix (test combinations)
  • Schema validation (machine-checkable contracts)

4. Leverage CRDTs Where Appropriate

The CRDT simulator shows that conflict-free convergence is possible for specific problem types. But you need to:

  • Understand the semantic limitations (LWW can lose data)
  • Design merge behavior carefully (does it match user intent?)
  • Handle garbage collection (tombstones, vector clocks)
  • Accept eventual consistency (not suitable for all use cases)

5. Monitor for Sustaining Effects

Metastability, retry storms, and goodput collapse are self-sustaining failure modes. They persist after the trigger is gone. Critical metrics include:

  • P99 latency vs timeout threshold (approaching timeout = danger)
  • Retry rate vs success rate (high retries = storm risk)
  • Queue depth (unbounded growth = admission control needed)
  • Goodput vs throughput (doing useful work vs spinning)
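
A hedged sketch of how these signals might be checked against whatever metrics you already collect (the thresholds are illustrative, not recommendations):

def metastability_warnings(p99_latency, timeout, retries, successes,
                           queue_depth, queue_limit, goodput, throughput):
    # Return human-readable warnings for the self-sustaining failure signals.
    warnings = []
    if p99_latency > 0.8 * timeout:
        warnings.append("P99 latency is within 20% of the client timeout")
    if successes and retries / successes > 0.5:
        warnings.append("retry rate exceeds half the success rate")
    if queue_depth > queue_limit:
        warnings.append("queue depth above limit; consider admission control")
    if throughput and goodput / throughput < 0.7:
        warnings.append("less than 70% of throughput is useful work (goodput collapse)")
    return warnings

print(metastability_warnings(p99_latency=4.5, timeout=5.0, retries=300,
                             successes=400, queue_depth=1200, queue_limit=1000,
                             goodput=600, throughput=1000))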

Using the Simulators

All four simulators are available at: https://github.com/bhatti/simulators

Installation

git clone https://github.com/bhatti/simulators
cd simulators
pip install -r requirements.txt

Requirements:

  • Python 3.7+
  • streamlit (web UI)
  • simpy (discrete event simulation)
  • plotly (interactive visualizations)
  • numpy, pandas (data analysis)

Running Individual Simulators

# Metastable failure simulator
streamlit run metastable_simulator.py

# CAP/PACELC consistency simulator
streamlit run cap_consistency_simulator.py

# CRDT simulator
streamlit run crdt_simulator.py

# CSI failure simulator
streamlit run csi_failure_simulator.py

Running All Simulators

python run_all_simulators.py

Conclusion

Building distributed systems means confronting failure modes that are expensive or impossible to reproduce in real environments:

  • Metastable failures require specific load patterns and timing
  • Consistency tradeoffs need multi-region deployments to observe
  • CRDT convergence requires orchestrating concurrent operations across replicas
  • CSI failures need exact schema/config mismatches that don’t exist in test environments

Simulators bridge the gap between theoretical understanding and practical intuition:

  1. Cheaper than production testing: No cloud costs, no multi-region setup, instant feedback
  2. Safer than production experiments: Crash the simulator, not your service
  3. More complete than unit tests: See emergent behaviors, not just component correctness
  4. Faster iteration: Tweak parameters, re-run in seconds, build intuition through experimentation

What You Can’t Learn Without Simulation

  • When does retry amplification tip into metastability? (Depends on coordination slope, timeout, backoff)
  • How much does quorum coordination actually cost? (Depends on network latency, replica count, workload)
  • Do your CRDT semantics match user expectations? (Depends on merge behavior, conflict resolution)
  • Will your schema changes break integration? (Depends on type coercion, encoding, version skew)

The goal isn’t to prevent all failures; that’s impossible. The goal is to understand, anticipate, and recover from the failures that will inevitably occur.


References

Key research papers and resources used in this post:

  1. AWS Metastability Research (HotOS 2025) – Sustaining effects and goodput collapse
  2. Marc Brooker on DSQL – Practical distributed SQL considerations
  3. James Hamilton on Reliable Systems – Large-scale system design
  4. CSI Failures Study (EuroSys 2023) – Cross-system interaction failures
  5. PACELC Framework – Beyond CAP theorem
  6. Marc Brooker on CAP – CAP theorem revisited
  7. Anna CRDT Database – Autoscaling with CRDTs
  8. Linearizability Paper – Herlihy & Wing’s foundational work
  9. Designing Data-Intensive Applications by Martin Kleppmann
  10. Distributed Systems Reading Group – MIT CSAIL
  11. Jepsen.io – Kyle Kingsbury’s consistency testing
  12. Aphyr’s blog – Distributed systems deep dives

November 7, 2025

Three Decades of Remote Calls: My Journey from COBOL Mainframes to AI Agents

Filed under: Computing, Web Services — admin @ 9:50 pm

Introduction

I started writing network code in the early 1990s on IBM mainframes, armed with nothing but Assembly and COBOL. Today, I build distributed AI agents using gRPC, RAG pipelines, and serverless functions. Between these worlds lie decades of technological evolution and an uncomfortable realization: we keep relearning the same lessons. Over the years, I’ve seen simple ideas triumph over complex ones. The technology keeps changing, but the problems stay the same. Network latency hasn’t gotten faster relative to CPU speed. Distributed systems are still hard. Complexity still kills projects. And every new generation has to learn that abstractions leak. I’ll show you the technologies I’ve used, the mistakes I’ve made, and most importantly, what the past teaches us about building better systems in the future.

The Mainframe Era

CICS and 3270 Terminals

I started my career on IBM mainframes running CICS, which was used to build online applications accessed through 3270 “green screen” terminals. CICS used the LU6.2 (Logical Unit 6.2) protocol, part of IBM’s Systems Network Architecture (SNA), for peer-to-peer communication. Here’s what a typical CICS application looked like in COBOL:

IDENTIFICATION DIVISION.
PROGRAM-ID. CUSTOMER-INQUIRY.

DATA DIVISION.
WORKING-STORAGE SECTION.
01  CUSTOMER-REC.
    05  CUST-ID        PIC 9(8).
    05  CUST-NAME      PIC X(30).
    05  CUST-BALANCE   PIC 9(7)V99.

LINKAGE SECTION.
01  DFHCOMMAREA.
    05  COMM-CUST-ID   PIC 9(8).

PROCEDURE DIVISION.
    EXEC CICS
        RECEIVE MAP('CUSTMAP')
        MAPSET('CUSTSET')
        INTO(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS
        READ FILE('CUSTFILE')
        INTO(CUSTOMER-REC)
        RIDFLD(COMM-CUST-ID)
    END-EXEC.
    
    EXEC CICS
        SEND MAP('RESULTMAP')
        MAPSET('CUSTSET')
        FROM(CUSTOMER-REC)
    END-EXEC.
    
    EXEC CICS RETURN END-EXEC.

The CICS environment handled all the complexity—transaction management, terminal I/O, file access, and inter-system communication. For the user interface, I used Basic Mapping Support (BMS), which was notoriously finicky. You had to define screen layouts in a rigid format specifying exactly where each field appeared on the 24×80 character grid:

CUSTMAP  DFHMSD TYPE=&SYSPARM,                                    X
               MODE=INOUT,                                        X
               LANG=COBOL,                                        X
               CTRL=FREEKB
         DFHMDI SIZE=(24,80)
CUSTID   DFHMDF POS=(05,20),                                      X
               LENGTH=08,                                         X
               ATTRB=(UNPROT,NUM),                                X
               INITIAL='________'
CUSTNAME DFHMDF POS=(07,20),                                      X
               LENGTH=30,                                         X
               ATTRB=PROT

This was so painful that I wrote my own tool to convert simple text-based UI templates into BMS format. Looking back, this was my first foray into creating developer tools. The key lesson I learned from the mainframe era was that developer experience matters: cumbersome tools slow down development and introduce errors.

Moving to UNIX

Berkeley Sockets

After a couple of years on mainframes, which were already in decline, I transitioned to C and UNIX systems, which I had studied in college. I learned Berkeley Sockets, which were far more powerful and gave you complete control over the network. Here’s a simple TCP server in C using Berkeley Sockets:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PORT 8080
#define BUFFER_SIZE 1024

int main() {
    int server_fd, client_fd;
    struct sockaddr_in server_addr, client_addr;
    socklen_t client_len = sizeof(client_addr);
    char buffer[BUFFER_SIZE];
    
    // Create socket
    server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd < 0) {
        perror("socket failed");
        exit(EXIT_FAILURE);
    }
    
    // Set socket options to reuse address
    int opt = 1;
    if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, 
                   &opt, sizeof(opt)) < 0) {
        perror("setsockopt failed");
        exit(EXIT_FAILURE);
    }
    
    // Bind to address
    memset(&server_addr, 0, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(PORT);
    
    if (bind(server_fd, (struct sockaddr *)&server_addr, 
             sizeof(server_addr)) < 0) {
        perror("bind failed");
        exit(EXIT_FAILURE);
    }
    
    // Listen for connections
    if (listen(server_fd, 10) < 0) {
        perror("listen failed");
        exit(EXIT_FAILURE);
    }
    
    printf("Server listening on port %d\n", PORT);
    
    while (1) {
        // Accept connection
        client_fd = accept(server_fd, 
                          (struct sockaddr *)&client_addr, 
                          &client_len);
        if (client_fd < 0) {
            perror("accept failed");
            continue;
        }
        
        // Read request
        ssize_t bytes_read = recv(client_fd, buffer, 
                                  BUFFER_SIZE - 1, 0);
        if (bytes_read > 0) {
            buffer[bytes_read] = '\0';
            printf("Received: %s\n", buffer);
            
            // Send response
            const char *response = "Message received\n";
            send(client_fd, response, strlen(response), 0);
        }
        
        close(client_fd);
    }
    
    close(server_fd);
    return 0;
}

As you can see, you had to handle a lot of housekeeping: socket creation, binding, listening, accepting, reading, writing, and meticulous error handling at every step. Memory management was entirely manual—forget to close() a file descriptor and you’d leak resources; make a mistake with recv() buffer sizes and you’d overflow memory. I also experimented with Fast Sockets from UC Berkeley, which used kernel bypass techniques for lower latency and offered better performance.

The key lesson I learned was that low-level control comes at a steep cost: the cognitive load of managing these details makes it nearly impossible to focus on business logic.

Sun RPC and XDR

While working for a physics lab whose large computing facility consisted of Sun workstations running Solaris on SPARC processors, I discovered Sun RPC (Remote Procedure Call) with XDR (External Data Representation). XDR solved a critical problem: how do you exchange data between machines with different architectures? A SPARC processor uses big-endian byte ordering, while x86 uses little-endian. XDR provided a canonical, architecture-neutral format for representing data. Here’s an XDR definition file (types.x):

/* Define a structure for customer data */
struct customer {
    int customer_id;
    string name<30>;
    float balance;
};

/* Define the RPC program */
program CUSTOMER_PROG {
    version CUSTOMER_VERS {
        int ADD_CUSTOMER(customer) = 1;
        customer GET_CUSTOMER(int) = 2;
    } = 1;
} = 0x20000001;

You’d run rpcgen on this file:

$ rpcgen types.x

This generated the client stub, server stub, and XDR serialization code automatically. Here’s what the server implementation looked like:

#include "types.h"

int *add_customer_1_svc(customer *cust, struct svc_req *rqstp) {
    static int result;
    
    // Add customer to database
    printf("Adding customer: %s (ID: %d)\n", 
           cust->name, cust->customer_id);
    
    result = 1;  // Success
    return &result;
}

customer *get_customer_1_svc(int *cust_id, struct svc_req *rqstp) {
    static customer result;
    
    // Fetch from database
    result.customer_id = *cust_id;
    result.name = strdup("John Doe");
    result.balance = 1000.50;
    
    return &result;
}

And the client:

#include "types.h"

int main(int argc, char *argv[]) {
    CLIENT *clnt;
    customer cust;
    int *result;
    
    clnt = clnt_create("localhost", CUSTOMER_PROG, 
                       CUSTOMER_VERS, "tcp");
    if (clnt == NULL) {
        clnt_pcreateerror("localhost");
        exit(1);
    }
    
    // Call remote procedure
    cust.customer_id = 123;
    cust.name = "Alice Smith";
    cust.balance = 5000.00;
    
    result = add_customer_1(&cust, clnt);
    if (result == NULL) {
        clnt_perror(clnt, "call failed");
    }
    
    clnt_destroy(clnt);
    return 0;
}

This was my first introduction to Interface Definition Languages (IDLs), and I found that defining the contract once and generating code automatically reduces errors. This pattern would reappear in CORBA, Protocol Buffers, and gRPC.

Parallel Computing

During my graduate and post-graduate studies in the mid-1990s, while working full time, I did research in parallel and distributed computing. I worked with MPI (Message Passing Interface) and IBM’s MPL on SP1/SP2 systems. MPI provided collective operations like broadcast, scatter, gather, and reduce (a precursor to Hadoop-style map/reduce). Here’s a simple MPI example that computes the sum of an array in parallel:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE 1000

int main(int argc, char** argv) {
    int rank, size;
    int data[ARRAY_SIZE];
    int local_sum = 0, global_sum = 0;
    int chunk_size, start, end;
    
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    
    // Initialize data on root
    if (rank == 0) {
        for (int i = 0; i < ARRAY_SIZE; i++) {
            data[i] = i + 1;
        }
    }
    
    // Broadcast data to all processes
    MPI_Bcast(data, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
    
    // Each process computes sum of its chunk
    chunk_size = ARRAY_SIZE / size;
    start = rank * chunk_size;
    end = (rank == size - 1) ? ARRAY_SIZE : start + chunk_size;
    
    for (int i = start; i < end; i++) {
        local_sum += data[i];
    }
    
    // Reduce all local sums to global sum
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, 
               MPI_SUM, 0, MPI_COMM_WORLD);
    
    if (rank == 0) {
        printf("Global sum: %d\n", global_sum);
    }
    
    MPI_Finalize();
    return 0;
}

For my post-graduate project, I built JavaNOW (Java on Networks of Workstations), which was inspired by Linda’s tuple spaces and MPI’s collective operations, but implemented in pure Java for portability. The key innovation was our Actor-inspired model. Instead of heavyweight processes communicating through message passing, I used lightweight Java threads with an Entity Space (distributed associative memory) where “actors” could put and get entities asynchronously. Here’s a simple example:

public class SumTask extends ActiveEntity {
    public Object execute(Object arg, JavaNOWAPI api) {
        Integer myId = (Integer) arg;
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Compute partial sum
        int partialSum = 0;
        for (int i = myId * 100; i < (myId + 1) * 100; i++) {
            partialSum += i;
        }
        
        // Store result in EntitySpace
        return new Integer(partialSum);
    }
}

// Main application
public class ParallelSum extends JavaNOWApplication {
    public void master() {
        EntitySpace workspace = new EntitySpace("RESULTS");
        
        // Spawn parallel tasks
        for (int i = 0; i < 10; i++) {
            ActiveEntity task = new SumTask(new Integer(i));
            getJavaNOWAPI().eval(workspace, task, new Integer(i));
        }
        
        // Collect results
        int totalSum = 0;
        for (int i = 0; i < 10; i++) {
            Entity result = getJavaNOWAPI().get(
                workspace, new Entity(new Integer(i)));
            totalSum += ((Integer)result.getEntityValue()).intValue();
        }
        
        System.out.println("Total sum: " + totalSum);
    }
    
    public void slave(int id) {
        // Slave nodes wait for work
    }
}

Since then, the Actor model has gained wide adoption: today’s serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) and modern frameworks like Akka, Orleans, and Dapr all embrace Actor-inspired patterns.

Novell IPX

I also briefly worked with Novell’s IPX (Internetwork Packet Exchange) protocol, which had painful APIs. Here’s a taste of IPX socket programming (simplified):

#include <nwcalls.h>
#include <nwipxspx.h>

int main() {
    IPXAddress server_addr;
    IPXPacket packet;
    WORD socket_number = 0x4000;
    
    // Open IPX socket
    IPXOpenSocket(socket_number, 0);
    
    // Setup address
    memset(&server_addr, 0, sizeof(IPXAddress));
    memcpy(server_addr.network, target_network, 4);
    memcpy(server_addr.node, target_node, 6);
    server_addr.socket = htons(socket_number);
    
    // Send packet
    packet.packetType = 4;  // IPX packet type
    memcpy(packet.data, "Hello", 5);
    IPXSendPacket(socket_number, &server_addr, &packet);
    
    IPXCloseSocket(socket_number);
    return 0;
}

Early Web Development with CGI

When the web emerged in the early 1990s, I built applications using CGI (Common Gateway Interface) with Perl and C. I deployed these on Apache HTTP Server, an early production-quality open source web server that quickly became the dominant web server of the 1990s. Apache used process-driven concurrency: it forked a new process for each request or maintained a pool of pre-forked processes. CGI was conceptually simple: the web server launched a new UNIX process for every request, passing input via stdin and receiving output via stdout. Here’s a simple Perl CGI script:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;

print $cgi->header('text/html');
print "<html><body>\n";
print "<h1>Hello from CGI!</h1>\n";

my $name = $cgi->param('name') || 'Guest';
print "<p>Welcome, $name!</p>\n";

# Simulate database query
my $user_count = 42;
print "<p>Total users: $user_count</p>\n";

print "</body></html>\n";

And in C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    char *query_string = getenv("QUERY_STRING");
    
    printf("Content-Type: text/html\n\n");
    printf("<html><body>\n");
    printf("<h1>CGI in C</h1>\n");
    
    if (query_string) {
        printf("<p>Query string: %s</p>\n", query_string);
    }
    
    printf("</body></html>\n");
    return 0;
}

Later, I migrated to more performant servers: Tomcat for Java servlets, Jetty as an embedded server, and Netty for building custom high-performance network applications. These servers used asynchronous I/O and lightweight threads (or even non-blocking event loops in Netty’s case).

The key lesson I learned was that scalability matters. The CGI model’s inability to maintain persistent connections or share state made it unsuitable for modern web applications. The shift from process-per-request to thread pools and then to async I/O represented fundamental improvements in how we handle concurrency.

Java Adoption

When Java was released in 1995, I adopted it wholeheartedly. It freed developers from manual memory management and the malloc()/free() debugging that came with it. Network programming became far more approachable:

import java.io.*;
import java.net.*;

public class SimpleServer {
    public static void main(String[] args) throws IOException {
        int port = 8080;
        
        try (ServerSocket serverSocket = new ServerSocket(port)) {
            System.out.println("Server listening on port " + port);
            
            while (true) {
                try (Socket clientSocket = serverSocket.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(clientSocket.getInputStream()));
                     PrintWriter out = new PrintWriter(
                         clientSocket.getOutputStream(), true)) {
                    
                    String request = in.readLine();
                    System.out.println("Received: " + request);
                    
                    out.println("Message received");
                }
            }
        }
    }
}

Java Threads

I had previously used pthreads in C, which were hard to use; Java’s threading model was far simpler:

public class ConcurrentServer {
    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket(8080);
        
        while (true) {
            Socket clientSocket = serverSocket.accept();
            
            // Spawn thread to handle client
            new Thread(new ClientHandler(clientSocket)).start();
        }
    }
    
    static class ClientHandler implements Runnable {
        private Socket socket;
        
        public ClientHandler(Socket socket) {
            this.socket = socket;
        }
        
        public void run() {
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()));
                 PrintWriter out = new PrintWriter(
                     socket.getOutputStream(), true)) {
                
                String request = in.readLine();
                // Process request
                out.println("Response");
                
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { socket.close(); } catch (IOException e) {}
            }
        }
    }
}

Java’s synchronized keyword simplified thread-safe programming:

public class ThreadSafeCounter {
    private int count = 0;
    
    public synchronized void increment() {
        count++;
    }
    
    public synchronized int getCount() {
        return count;
    }
}

This was so much easier than managing mutexes, condition variables, and semaphores in C!

Java RMI: Remote Objects Made Practical

When Java added RMI (1997), distributed objects became practical. You could invoke methods on objects running on remote machines almost as if they were local. Define a remote interface:

import java.rmi.Remote;
import java.rmi.RemoteException;

public interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
    int multiply(int a, int b) throws RemoteException;
}

Implement it:

import java.rmi.server.UnicastRemoteObject;
import java.rmi.RemoteException;

public class CalculatorImpl extends UnicastRemoteObject 
                            implements Calculator {
    
    public CalculatorImpl() throws RemoteException {
        super();
    }
    
    public int add(int a, int b) throws RemoteException {
        return a + b;
    }
    
    public int multiply(int a, int b) throws RemoteException {
        return a * b;
    }
}

Server:

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;

public class Server {
    public static void main(String[] args) {
        try {
            LocateRegistry.createRegistry(1099);
            Calculator calc = new CalculatorImpl();
            Naming.rebind("Calculator", calc);
            System.out.println("Server ready");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Client:

import java.rmi.Naming;

public class Client {
    public static void main(String[] args) {
        try {
            Calculator calc = (Calculator) Naming.lookup(
                "rmi://localhost/Calculator");
            
            int result = calc.add(5, 3);
            System.out.println("5 + 3 = " + result);
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I found RMI constraining: every remote interface had to extend Remote, every method had to declare RemoteException, and you were stuck with Java-to-Java communication. The key lesson I learned was that abstractions that feel natural to developers get adopted.

JINI: RMI with Service Discovery

At a travel booking company in the mid 2000s, I used JINI, which Sun Microsystems pitched as “RMI on steroids.” JINI extended RMI with automatic service discovery, leasing, and distributed events. The core idea: services could join a network, advertise themselves, and be discovered by clients without hardcoded locations. Here’s a JINI service interface and registration:

import net.jini.core.lookup.ServiceRegistrar;
import net.jini.discovery.LookupDiscovery;
import net.jini.lease.LeaseRenewalManager;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Service interface
public interface BookingService extends Remote {
    String searchFlights(String origin, String destination) 
        throws RemoteException;
    boolean bookFlight(String flightId, String passenger) 
        throws RemoteException;
}

// Service provider
public class BookingServiceProvider implements DiscoveryListener {
    
    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                BookingService service = new BookingServiceImpl();
                Entry[] attributes = new Entry[] {
                    new Name("FlightBookingService")
                };
                
                ServiceItem item = new ServiceItem(null, service, attributes);
                ServiceRegistration reg = registrar.register(
                    item, Lease.FOREVER);
                
                // Auto-renew lease
                leaseManager.renewUntil(reg.getLease(), Lease.FOREVER, null);
                
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Client discovery and usage:

public class BookingClient implements DiscoveryListener {
    
    public void discovered(DiscoveryEvent event) {
        ServiceRegistrar[] registrars = event.getRegistrars();
        
        for (ServiceRegistrar registrar : registrars) {
            try {
                ServiceTemplate template = new ServiceTemplate(
                    null, new Class[] { BookingService.class }, null);
                
                ServiceItem item = registrar.lookup(template);
                
                if (item != null) {
                    BookingService booking = (BookingService) item.service;
                    String flights = booking.searchFlights("SFO", "NYC");
                    booking.bookFlight("FL123", "John Smith");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

JINI provided automatic discovery, leasing, and location transparency, but it was too complex and supported only the Java ecosystem. The ideas were sound and reappeared later in service registries like Consul and Eureka and in Kubernetes service discovery. I learned that service discovery is essential for dynamic systems, but the implementation must be simple.

CORBA

I used CORBA (Common Object Request Broker Architecture) for many years in the 1990s while building intelligent traffic systems. CORBA promised language-independent, platform-independent distributed objects: you could write a service in C++, invoke it from Java, and have clients in Python, all sharing the same IDL. Here’s a simple CORBA IDL definition:

module TrafficMonitor {
    struct SensorData {
        long sensor_id;
        float speed;
        long timestamp;
    };
    
    typedef sequence<SensorData> SensorDataList;
    
    interface TrafficService {
        void reportData(in SensorData data);
        SensorDataList getRecentData(in long minutes);
        float getAverageSpeed();
    };
};

Run the IDL compiler:

$ idl traffic.idl

This generated client stubs and server skeletons for your target language. I built a message-oriented middleware (MOM) system with CORBA that collected traffic data from road sensors and provided real-time traffic information.

C++ server implementation:

#include "TrafficService_impl.h"
#include <iostream>
#include <vector>

class TrafficServiceImpl : public POA_TrafficMonitor::TrafficService {
private:
    std::vector<TrafficMonitor::SensorData> data_store;
    
public:
    void reportData(const TrafficMonitor::SensorData& data) {
        data_store.push_back(data);
        std::cout << "Received data from sensor " 
                  << data.sensor_id << std::endl;
    }
    
    TrafficMonitor::SensorDataList* getRecentData(CORBA::Long minutes) {
        TrafficMonitor::SensorDataList* result = 
            new TrafficMonitor::SensorDataList();
        
        // Filter data from last N minutes
        time_t cutoff = time(NULL) - (minutes * 60);
        for (const auto& entry : data_store) {
            if (entry.timestamp >= cutoff) {
                result->length(result->length() + 1);
                (*result)[result->length() - 1] = entry;
            }
        }
        return result;
    }
    
    CORBA::Float getAverageSpeed() {
        if (data_store.empty()) return 0.0;
        
        float sum = 0.0;
        for (const auto& entry : data_store) {
            sum += entry.speed;
        }
        return sum / data_store.size();
    }
};

Java client:

import org.omg.CORBA.*;
import TrafficMonitor.*;

public class TrafficClient {
    public static void main(String[] args) {
        try {
            // Initialize ORB
            ORB orb = ORB.init(args, null);
            
            // Get reference to service
            org.omg.CORBA.Object obj = 
                orb.string_to_object("corbaname::localhost:1050#TrafficService");
            TrafficService service = TrafficServiceHelper.narrow(obj);
            
            // Report sensor data
            SensorData data = new SensorData();
            data.sensor_id = 101;
            data.speed = 65.5f;
            data.timestamp = (int)(System.currentTimeMillis() / 1000);
            
            service.reportData(data);
            
            // Get average speed
            float avgSpeed = service.getAverageSpeed();
            System.out.println("Average speed: " + avgSpeed + " mph");
            
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

However, the CORBA specification was massive, and different ORB (Object Request Broker) implementations like Orbix, ORBacus, and TAO couldn’t reliably interoperate despite claiming CORBA compliance. The binary protocol, IIOP, had subtle incompatibilities. CORBA did introduce valuable concepts:

  • Interceptors for cross-cutting concerns (authentication, logging, monitoring)
  • IDL-first design that forced clear interface definitions
  • Language-neutral protocols that actually worked (sometimes)

I learned that standards designed by committee are often over-engineered. CORBA and SOAP tried to solve every problem for everyone and ended up being optimal for no one.

SOAP and WSDL

I used SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) on a number of projects in the early 2000s, when they emerged as the standard for web services. The pitch: XML-based, platform-neutral, and “simple.” Here’s a WSDL definition:

<?xml version="1.0"?>
<definitions name="CustomerService"
   targetNamespace="http://example.com/customer"
   xmlns="http://schemas.xmlsoap.org/wsdl/"
   xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
   xmlns:tns="http://example.com/customer"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   
   <types>
      <xsd:schema targetNamespace="http://example.com/customer">
         <xsd:complexType name="Customer">
            <xsd:sequence>
               <xsd:element name="id" type="xsd:int"/>
               <xsd:element name="name" type="xsd:string"/>
               <xsd:element name="balance" type="xsd:double"/>
            </xsd:sequence>
         </xsd:complexType>
      </xsd:schema>
   </types>
   
   <message name="GetCustomerRequest">
      <part name="customerId" type="xsd:int"/>
   </message>
   
   <message name="GetCustomerResponse">
      <part name="customer" type="tns:Customer"/>
   </message>
   
   <portType name="CustomerPortType">
      <operation name="getCustomer">
         <input message="tns:GetCustomerRequest"/>
         <output message="tns:GetCustomerResponse"/>
      </operation>
   </portType>
   
   <binding name="CustomerBinding" type="tns:CustomerPortType">
      <soap:binding transport="http://schemas.xmlsoap.org/soap/http"/>
      <operation name="getCustomer">
         <soap:operation soapAction="getCustomer"/>
         <input>
            <soap:body use="literal"/>
         </input>
         <output>
            <soap:body use="literal"/>
         </output>
      </operation>
   </binding>
   
   <service name="CustomerService">
      <port name="CustomerPort" binding="tns:CustomerBinding">
         <soap:address location="http://example.com/customer"/>
      </port>
   </service>
</definitions>

A SOAP request looked like this:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Header>
    <cust:Authentication>
      <cust:username>john</cust:username>
      <cust:password>secret</cust:password>
    </cust:Authentication>
  </soap:Header>
  <soap:Body>
    <cust:getCustomer>
      <cust:customerId>12345</cust:customerId>
    </cust:getCustomer>
  </soap:Body>
</soap:Envelope>

The response:

<?xml version="1.0"?>
<soap:Envelope 
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Body>
    <cust:getCustomerResponse>
      <cust:customer>
        <cust:id>12345</cust:id>
        <cust:name>John Smith</cust:name>
        <cust:balance>5000.00</cust:balance>
      </cust:customer>
    </cust:getCustomerResponse>
  </soap:Body>
</soap:Envelope>

Look at all that XML overhead! A simple request became hundreds of bytes of markup. Because SOAP was designed by committee (IBM, Oracle, Microsoft), it tried to solve every possible enterprise problem: transactions, security, reliability, routing, orchestration. I learned that simplicity beats features; SOAP collapsed under its own weight.
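
For a rough sense of the difference, here’s a quick Python comparison between the SOAP body shown above (header omitted) and an equivalent JSON payload; the byte counts are illustrative only:

import json

# The body of the SOAP getCustomer request shown above (header omitted).
soap_request = """<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:cust="http://example.com/customer">
  <soap:Body>
    <cust:getCustomer>
      <cust:customerId>12345</cust:customerId>
    </cust:getCustomer>
  </soap:Body>
</soap:Envelope>"""

# The same information as a plain JSON payload.
json_request = json.dumps({"getCustomer": {"customerId": 12345}})

print(len(soap_request.encode("utf-8")), "bytes of SOAP")  # a few hundred
print(len(json_request.encode("utf-8")), "bytes of JSON")  # a few dozen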

Java Servlets and Filters

Java 1.1 added support for servlets, which provided a much better model than CGI. Instead of spawning a process per request, servlets were Java classes instantiated once and reused across requests:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;

public class CustomerServlet extends HttpServlet {
    
    protected void doGet(HttpServletRequest request, 
                        HttpServletResponse response)
            throws ServletException, IOException {
        
        String customerId = request.getParameter("id");
        
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        
        // Fetch customer data
        Customer customer = getCustomerFromDatabase(customerId);
        
        if (customer != null) {
            out.println(String.format(
                "{\"id\": \"%s\", \"name\": \"%s\", \"balance\": %.2f}",
                customer.getId(), customer.getName(), customer.getBalance()
            ));
        } else {
            response.setStatus(HttpServletResponse.SC_NOT_FOUND);
            out.println("{\"error\": \"Customer not found\"}");
        }
    }
    
    protected void doPost(HttpServletRequest request, 
                         HttpServletResponse response)
            throws ServletException, IOException {
        
        BufferedReader reader = request.getReader();
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }
        
        // Parse JSON and create customer
        Customer customer = parseJsonToCustomer(json.toString());
        saveCustomerToDatabase(customer);
        
        response.setStatus(HttpServletResponse.SC_CREATED);
        response.setContentType("application/json");
        PrintWriter out = response.getWriter();
        out.println(json.toString());
    }
}

Servlet Filters

The Filter API in Java Servlets was quite powerful, supporting a chain-of-responsibility pattern for handling cross-cutting concerns:

import javax.servlet.*;
import javax.servlet.http.*;
import java.io.IOException;

public class AuthenticationFilter implements Filter {
    
    public void doFilter(ServletRequest request, 
                        ServletResponse response,
                        FilterChain chain) 
            throws IOException, ServletException {
        
        HttpServletRequest httpRequest = (HttpServletRequest) request;
        HttpServletResponse httpResponse = (HttpServletResponse) response;
        
        // Check for authentication token
        String token = httpRequest.getHeader("Authorization");
        
        if (token == null || !isValidToken(token)) {
            httpResponse.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            httpResponse.getWriter().println("{\"error\": \"Unauthorized\"}");
            return;
        }
        
        // Pass to next filter or servlet
        chain.doFilter(request, response);
    }
    
    private boolean isValidToken(String token) {
        // Validate token
        return token.startsWith("Bearer ") && 
               validateJWT(token.substring(7));
    }
}

Configuration in web.xml:

<filter>
    <filter-name>AuthenticationFilter</filter-name>
    <filter-class>com.example.AuthenticationFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>AuthenticationFilter</filter-name>
    <url-pattern>/api/*</url-pattern>
</filter-mapping>

You could chain filters for compression, logging, transformation, and rate limiting, keeping these concerns cleanly separated from business logic. I had previously used CORBA interceptors to inject cross-cutting logic, and the filter pattern solved a similar problem. This pattern would reappear in service meshes and API gateways.

Enterprise Java Beans

I used Enterprise Java Beans (EJB) in the late 1990s and early 2000s, which attempted to make distributed objects transparent. The key idea: write regular Java objects and let the application server handle all the distribution, persistence, transactions, and security. Here’s what an EJB 2.x entity bean looked like:

// Remote interface
public interface Customer extends EJBObject {
    String getName() throws RemoteException;
    void setName(String name) throws RemoteException;
    double getBalance() throws RemoteException;
    void setBalance(double balance) throws RemoteException;
}

// Home interface
public interface CustomerHome extends EJBHome {
    Customer create(Integer id, String name) throws CreateException, RemoteException;
    Customer findByPrimaryKey(Integer id) throws FinderException, RemoteException;
}

// Bean implementation
public class CustomerBean implements EntityBean {
    private Integer id;
    private String name;
    private double balance;
    
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public double getBalance() { return balance; }
    public void setBalance(double balance) { this.balance = balance; }
    
    // Container callbacks
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void ejbLoad() {}
    public void ejbStore() {}
    public void setEntityContext(EntityContext ctx) {}
    public void unsetEntityContext() {}
    
    public Integer ejbCreate(Integer id, String name) {
        this.id = id;
        this.name = name;
        this.balance = 0.0;
        return null;
    }
    
    public void ejbPostCreate(Integer id, String name) {}
}

The N+1 Selects Problem and Network Fallacy

The fatal flaw: EJB pretended network calls were free. I watched teams write code like this:

CustomerHome home = // ... lookup
Customer customer = home.findByPrimaryKey(customerId);

// Each getter is a remote call!
String name = customer.getName();        // Network call
double balance = customer.getBalance();  // Network call

Worse, I saw code that made remote calls in loops:

Collection customers = home.findAll();
double totalBalance = 0.0;
for (Object obj : customers) {
    Customer customer = (Customer) obj;  // remote stub
    // Remote call for EVERY iteration!
    totalBalance += customer.getBalance();
}

This ran headlong into the Fallacies of Distributed Computing: the network is not reliable, and latency is not zero. What looked like simple property access actually made network calls to a remote server. I had previously built distributed and parallel applications, so I understood network latency. But it blindsided most developers because EJB deliberately hid it.

I learned that you can’t hide distribution. Network calls are fundamentally different from local calls. Latency, failure modes, and semantics are different. Transparency is a lie.

REST Standard

Before REST became mainstream, I experimented with “Plain Old XML” (POX) over HTTP by just sending XML documents via HTTP POST without all the SOAP ceremony:

import requests
import xml.etree.ElementTree as ET

# Create XML request
root = ET.Element('getCustomer')
ET.SubElement(root, 'customerId').text = '12345'
xml_data = ET.tostring(root, encoding='utf-8')

# Send HTTP POST
response = requests.post(
    'http://api.example.com/customer',
    data=xml_data,
    headers={'Content-Type': 'application/xml'}
)

# Parse response
response_tree = ET.fromstring(response.content)
name = response_tree.find('name').text

This was simpler than SOAP, but still ad hoc. Then REST (Representational State Transfer), based on Roy Fielding’s 2000 dissertation, offered a principled approach:

  • Use HTTP methods semantically (GET, POST, PUT, DELETE)
  • Resources have URLs
  • Stateless communication
  • Hypermedia as the engine of application state (HATEOAS)

Here’s a RESTful API in Python with Flask:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory data store
customers = {
    '12345': {'id': '12345', 'name': 'John Smith', 'balance': 5000.00}
}

@app.route('/customers/<customer_id>', methods=['GET'])
def get_customer(customer_id):
    customer = customers.get(customer_id)
    if customer:
        return jsonify(customer), 200
    return jsonify({'error': 'Customer not found'}), 404

@app.route('/customers', methods=['POST'])
def create_customer():
    data = request.get_json()
    customer_id = data['id']
    customers[customer_id] = data
    return jsonify(data), 201

@app.route('/customers/<customer_id>', methods=['PUT'])
def update_customer(customer_id):
    if customer_id not in customers:
        return jsonify({'error': 'Customer not found'}), 404
    
    data = request.get_json()
    customers[customer_id].update(data)
    return jsonify(customers[customer_id]), 200

@app.route('/customers/<customer_id>', methods=['DELETE'])
def delete_customer(customer_id):
    if customer_id in customers:
        del customers[customer_id]
        return '', 204
    return jsonify({'error': 'Customer not found'}), 404

if __name__ == '__main__':
    app.run(debug=True)

Client code became trivial:

import requests

# GET customer
response = requests.get('http://localhost:5000/customers/12345')
if response.status_code == 200:
    customer = response.json()
    print(f"Customer: {customer['name']}")

# Create new customer
new_customer = {
    'id': '67890',
    'name': 'Alice Johnson',
    'balance': 3000.00
}
response = requests.post(
    'http://localhost:5000/customers',
    json=new_customer
)

# Update customer
update_data = {'balance': 3500.00}
response = requests.put(
    'http://localhost:5000/customers/67890',
    json=update_data
)

# Delete customer
response = requests.delete('http://localhost:5000/customers/67890')

Hypermedia and HATEOAS

True REST embraced hypermedia—responses included links to related resources:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "_links": {
    "self": {"href": "/customers/12345"},
    "orders": {"href": "/customers/12345/orders"},
    "transactions": {"href": "/customers/12345/transactions"}
  }
}

In practice, most APIs called “REST” weren’t truly RESTful; they didn’t implement HATEOAS or use HTTP status codes correctly. But even “REST-ish” APIs were far simpler than SOAP. The key lesson I learned was that REST succeeded because it built on HTTP, something every platform already supported. No new protocols, no complex tooling. Just URLs, HTTP verbs, and JSON.

JSON Replaces XML

With the adoption of REST, I saw XML web services (JAX-WS) decline, and I used JAX-RS for REST services with JSON payloads. XML required verbose markup:

<?xml version="1.0"?>
<customer>
    <id>12345</id>
    <name>John Smith</name>
    <balance>5000.00</balance>
    <orders>
        <order>
            <id>001</id>
            <date>2024-01-15</date>
            <total>99.99</total>
        </order>
        <order>
            <id>002</id>
            <date>2024-02-20</date>
            <total>149.50</total>
        </order>
    </orders>
</customer>

The same data in JSON:

{
  "id": "12345",
  "name": "John Smith",
  "balance": 5000.00,
  "orders": [
    {
      "id": "001",
      "date": "2024-01-15",
      "total": 99.99
    },
    {
      "id": "002",
      "date": "2024-02-20",
      "total": 149.50
    }
  ]
}

JSON does have limitations. It doesn’t natively support references or circular structures, making recursive relationships awkward:

{
  "id": "A",
  "children": [
    {
      "id": "B",
      "parent_id": "A"
    }
  ]
}

You have to encode references manually, unlike some XML schemas that support IDREF.
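
Here’s a minimal Python sketch of what that manual bookkeeping looks like for the document above; the helper names are just for illustration:

import json

doc = json.loads('{"id": "A", "children": [{"id": "B", "parent_id": "A"}]}')

# Index every node by id, then turn parent_id strings back into object
# references: the bookkeeping that IDREF handles for you in some XML schemas.
nodes = {doc["id"]: doc}
for child in doc.get("children", []):
    nodes[child["id"]] = child

for node in list(nodes.values()):
    if "parent_id" in node:
        node["parent"] = nodes[node["parent_id"]]  # deliberately creates a cycle

# The resolved graph can no longer be dumped directly:
# json.dumps(doc) now raises ValueError("Circular reference detected")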

Erlang/OTP

I learned about the actor model in college and built a framework based on actors and the Linda memory model. In the mid-2000s, I encountered Erlang, which uses actors for building distributed systems. Erlang was designed in the 1980s at Ericsson for building telecom switches and is based on the following design principles:

  • “Let it crash” philosophy
  • No shared memory between processes
  • Lightweight processes (not OS threads—Erlang processes)
  • Supervision trees for fault recovery
  • Hot code swapping for zero-downtime updates

Here’s what an Erlang actor (process) looks like:

-module(customer_server).
-export([start/0, init/0, get_customer/1, update_balance/2]).

% Start the server (link and return {ok, Pid} so it can run under a supervisor)
start() ->
    Pid = spawn_link(customer_server, init, []),
    register(customer_server, Pid),
    {ok, Pid}.

% Initialize with empty state
init() ->
    State = #{},  % Empty map
    loop(State).

% Main loop - handle messages
loop(State) ->
    receive
        {get_customer, CustomerId, From} ->
            Customer = maps:get(CustomerId, State, not_found),
            From ! {customer, Customer},
            loop(State);
        
        {update_balance, CustomerId, NewBalance, From} ->
            Customer = maps:get(CustomerId, State),
            UpdatedCustomer = Customer#{balance => NewBalance},
            NewState = maps:put(CustomerId, UpdatedCustomer, State),
            From ! {ok, updated},
            loop(NewState);
        
        {add_customer, CustomerId, Customer, From} ->
            NewState = maps:put(CustomerId, Customer, State),
            From ! {ok, added},
            loop(NewState);
        
        stop ->
            ok;
        
        _ ->
            loop(State)
    end.

% Client functions
get_customer(CustomerId) ->
    customer_server ! {get_customer, CustomerId, self()},
    receive
        {customer, Customer} -> Customer
    after 5000 ->
        timeout
    end.

update_balance(CustomerId, NewBalance) ->
    customer_server ! {update_balance, CustomerId, NewBalance, self()},
    receive
        {ok, updated} -> ok
    after 5000 ->
        timeout
    end.

Erlang made concurrency simple through message passing between actors.

The Supervision Tree

A key innovation of Erlang was supervision trees. You organized processes in a hierarchy, and supervisors would restart crashed children:

-module(customer_supervisor).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    % Supervisor strategy
    SupFlags = #{
        strategy => one_for_one,  % Restart only failed child
        intensity => 5,            % Max 5 restarts
        period => 60               % Per 60 seconds
    },
    
    % Child specifications
    ChildSpecs = [
        #{
            id => customer_server,
            start => {customer_server, start, []},
            restart => permanent,   % Always restart
            shutdown => 5000,
            type => worker,
            modules => [customer_server]
        },
        #{
            id => order_server,
            start => {order_server, start, []},
            restart => permanent,
            shutdown => 5000,
            type => worker,
            modules => [order_server]
        }
    ],
    
    {ok, {SupFlags, ChildSpecs}}.

If a process crashed, the supervisor automatically restarted it and the system self-healed. A key lesson I learned from the actor model and Erlang was that shared mutable state is the enemy. Message passing with isolated state is simpler, more reliable, and easier to reason about. Today, AWS Lambda, Azure Durable Functions, and frameworks like Akka all embrace the actor model.

Distributed Erlang

Erlang made distributed computing almost trivial. Processes on different nodes communicated identically to local processes:

% On node1@host1
RemotePid = spawn('node2@host2', module, function, [args]),
RemotePid ! {message, data}.

% On node2@host2 - receives the message
receive
    {message, Data} -> 
        io:format("Received: ~p~n", [Data])
end.

The VM handled all the complexity of node discovery, connection management, and message routing. Today’s serverless functions look a lot like actors, and Kubernetes pods like supervised processes.

Asynchronous Messaging

As systems grew more complex, asynchronous messaging became essential. I worked extensively with Oracle Tuxedo, IBM MQSeries, WebLogic JMS, and WebSphere MQ, and later with ActiveMQ, MQTT/AMQP, ZeroMQ, and RabbitMQ, primarily for inter-service communication and asynchronous processing. Here’s a JMS producer in Java:

import javax.jms.*;
import javax.naming.*;

public class OrderProducer {
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(queue);
        
        // Create message
        TextMessage message = session.createTextMessage();
        message.setText("{ \"orderId\": \"12345\", " +
                       "\"customerId\": \"67890\", " +
                       "\"amount\": 99.99 }");
        
        // Send message
        producer.send(message);
        System.out.println("Order sent: " + message.getText());
        
        connection.close();
    }
}

JMS consumer:

import javax.jms.*;
import javax.naming.*;

public class OrderConsumer implements MessageListener {
    
    public static void main(String[] args) throws Exception {
        Context ctx = new InitialContext();
        ConnectionFactory factory = 
            (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("OrderQueue");
        
        Connection connection = factory.createConnection();
        Session session = connection.createSession(
            false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);
        
        consumer.setMessageListener(new OrderConsumer());
        connection.start();
        
        System.out.println("Waiting for messages...");
        Thread.sleep(Long.MAX_VALUE);  // Keep running
    }
    
    public void onMessage(Message message) {
        try {
            TextMessage textMessage = (TextMessage) message;
            System.out.println("Received order: " + 
                             textMessage.getText());
            
            // Process order
            processOrder(textMessage.getText());
            
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
    
    private void processOrder(String orderJson) {
        // Business logic here
    }
}

Asynchronous messaging is essential for building resilient, scalable systems. It decouples producers from consumers, provides natural backpressure, and enables event-driven architectures.

Spring Framework and Aspect-Oriented Programming

In the early 2000s, I used aspect-oriented programming (AOP) to inject cross-cutting concerns like logging, security, and monitoring. Here is a typical example:

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class LoggingAspect {
    
    private static final Logger logger = 
        LoggerFactory.getLogger(LoggingAspect.class);
    
    @Before("execution(* com.example.service.*.*(..))")
    public void logBefore(JoinPoint joinPoint) {
        logger.info("Executing: " + 
                   joinPoint.getSignature().getName());
    }
    
    @AfterReturning(
        pointcut = "execution(* com.example.service.*.*(..))",
        returning = "result")
    public void logAfterReturning(JoinPoint joinPoint, Object result) {
        logger.info("Method " + 
                   joinPoint.getSignature().getName() + 
                   " returned: " + result);
    }
    
    @Around("@annotation(com.example.Monitored)")
    public Object measureTime(ProceedingJoinPoint joinPoint) 
            throws Throwable {
        long start = System.currentTimeMillis();
        Object result = joinPoint.proceed();
        long time = System.currentTimeMillis() - start;
        logger.info(joinPoint.getSignature().getName() + 
                   " took " + time + " ms");
        return result;
    }
}

I later adopted the Spring Framework, which revolutionized Java development with dependency injection and AOP:

// Spring configuration
@Configuration
public class AppConfig {
    
    @Bean
    public CustomerService customerService() {
        return new CustomerServiceImpl(customerRepository());
    }
    
    @Bean
    public CustomerRepository customerRepository() {
        return new DatabaseCustomerRepository(dataSource());
    }
    
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        ds.setUrl("jdbc:mysql://localhost/mydb");
        return ds;
    }
}

// Service class
@Service
public class CustomerServiceImpl implements CustomerService {
    private final CustomerRepository repository;
    
    @Autowired
    public CustomerServiceImpl(CustomerRepository repository) {
        this.repository = repository;
    }
    
    @Transactional
    public void updateBalance(String customerId, double newBalance) {
        Customer customer = repository.findById(customerId);
        customer.setBalance(newBalance);
        repository.save(customer);
    }
}

Spring Remoting

Spring added its own remoting protocols. HTTP Invoker serialized Java objects over HTTP:

// Server configuration
@Configuration
public class ServerConfig {
    
    @Bean
    public HttpInvokerServiceExporter customerService() {
        HttpInvokerServiceExporter exporter = 
            new HttpInvokerServiceExporter();
        exporter.setService(customerServiceImpl());
        exporter.setServiceInterface(CustomerService.class);
        return exporter;
    }
}

// Client configuration
@Configuration
public class ClientConfig {
    
    @Bean
    public HttpInvokerProxyFactoryBean customerService() {
        HttpInvokerProxyFactoryBean proxy = 
            new HttpInvokerProxyFactoryBean();
        proxy.setServiceUrl("http://localhost:8080/customer");
        proxy.setServiceInterface(CustomerService.class);
        return proxy;
    }
}

I learned that AOP addressed cross-cutting concerns elegantly for monoliths. But in microservices, these concerns moved to the infrastructure layer like service meshes, API gateways, and sidecars.

Proprietary Protocols

When working for large companies like Amazon, I encountered Amazon Coral, which is a proprietary RPC framework influenced by CORBA. Coral used an IDL to define service interfaces and supported multiple languages:

// Coral IDL
namespace com.amazon.example

structure CustomerData {
    1: required integer customerId
    2: required string name
    3: optional double balance
}

service CustomerService {
    CustomerData getCustomer(1: integer customerId)
    void updateCustomer(1: CustomerData customer)
    list<CustomerData> listCustomers()
}

The IDL compiler generated client and server code for Java, C++, and other languages. Coral handled serialization, versioning, and service discovery. When I later worked for AWS, I used Smithy, the successor to Coral, which Amazon open-sourced. Here is a similar example of a Smithy contract:

namespace com.example

service CustomerService {
    version: "2024-01-01"
    operations: [
        GetCustomer
        UpdateCustomer
        ListCustomers
    ]
}

@readonly
operation GetCustomer {
    input: GetCustomerInput
    output: GetCustomerOutput
    errors: [CustomerNotFound]
}

structure GetCustomerInput {
    @required
    customerId: String
}

structure GetCustomerOutput {
    @required
    customer: Customer
}

structure Customer {
    @required
    customerId: String
    
    @required
    name: String
    
    balance: Double
}

@error("client")
structure CustomerNotFound {
    @required
    message: String
}

I learned that IDL-first design remains valuable; Smithy builds on lessons from CORBA, Protocol Buffers, and Thrift.

Long Polling, WebSockets, and Real-Time

In the late 2000s, I built real-time applications for streaming financial charts and technical data. I used long polling, where the client made a request that the server held open until data was available:

// Client-side long polling
function pollServer() {
    fetch('/api/events')
        .then(response => response.json())
        .then(data => {
            console.log('Received event:', data);
            updateUI(data);
            
            // Immediately poll again
            pollServer();
        })
        .catch(error => {
            console.error('Polling error:', error);
            // Retry after delay
            setTimeout(pollServer, 5000);
        });
}

pollServer();

Server-side (Node.js):

const express = require('express');
const app = express();

let pendingRequests = [];

app.get('/api/events', (req, res) => {
    // Hold request open
    pendingRequests.push(res);
    
    // Timeout after 30 seconds
    setTimeout(() => {
        const index = pendingRequests.indexOf(res);
        if (index !== -1) {
            pendingRequests.splice(index, 1);
            res.json({ type: 'heartbeat' });
        }
    }, 30000);
});

// When an event occurs
function broadcastEvent(event) {
    pendingRequests.forEach(res => {
        res.json(event);
    });
    pendingRequests = [];
}

WebSockets

I also used WebSockets for real-time applications that needed true bidirectional communication. However, earlier browsers didn’t fully support them, so I fell back to long polling when WebSockets were unavailable:

// Server (Node.js with ws library)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
    console.log('Client connected');
    
    // Send initial data
    ws.send(JSON.stringify({
        type: 'INIT',
        data: getInitialData()
    }));
    
    // Handle messages
    ws.on('message', (message) => {
        const msg = JSON.parse(message);
        
        if (msg.type === 'SUBSCRIBE') {
            subscribeToSymbol(ws, msg.symbol);
        }
    });
    
    ws.on('close', () => {
        console.log('Client disconnected');
        unsubscribeAll(ws);
    });
});

// Stream live data
function streamPriceUpdate(symbol, price) {
    wss.clients.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) {
            if (isSubscribed(client, symbol)) {
                client.send(JSON.stringify({
                    type: 'PRICE_UPDATE',
                    symbol: symbol,
                    price: price,
                    timestamp: Date.now()
                }));
            }
        }
    });
}

Client:

let ws;

function connectWebSocket() {
    ws = new WebSocket('ws://localhost:8080');

    ws.onopen = () => {
        console.log('Connected to server');

        // Subscribe to symbols
        ws.send(JSON.stringify({
            type: 'SUBSCRIBE',
            symbol: 'AAPL'
        }));
    };

    ws.onmessage = (event) => {
        const message = JSON.parse(event.data);

        switch (message.type) {
            case 'INIT':
                initializeChart(message.data);
                break;
            case 'PRICE_UPDATE':
                updateChart(message.symbol, message.price);
                break;
        }
    };

    ws.onerror = (error) => {
        console.error('WebSocket error:', error);
    };

    ws.onclose = () => {
        console.log('Disconnected, attempting reconnect...');
        setTimeout(connectWebSocket, 1000);
    };
}

connectWebSocket();

I learned that different problems need different protocols. REST works for request-response. WebSockets excel for real-time bidirectional communication.

Vert.x and Hazelcast for High-Performance Streaming

For a production streaming chart system handling high-volume market data, I used Vert.x with Hazelcast. Vert.x is a reactive toolkit built on Netty that excels at handling thousands of concurrent connections with minimal resources. Hazelcast provided distributed caching and coordination across multiple Vert.x instances. Market data flowed into Hazelcast distributed topics; Vert.x instances subscribed to these topics and pushed updates to connected WebSocket clients. If WebSocket wasn’t supported, we fell back to long polling automatically.

import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServer;
import io.vertx.core.http.HttpServerRequest;
import io.vertx.core.http.ServerWebSocket;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.core.ITopic;
import com.hazelcast.core.Message;
import com.hazelcast.core.MessageListener;
import java.util.concurrent.ConcurrentHashMap;
import java.util.Set;

public class MarketDataServer {
    private final Vertx vertx;
    private final HazelcastInstance hazelcast;
    private final ConcurrentHashMap<String, Set<ServerWebSocket>> subscriptions;
    
    public MarketDataServer() {
        this.vertx = Vertx.vertx();
        this.hazelcast = Hazelcast.newHazelcastInstance();
        this.subscriptions = new ConcurrentHashMap<>();
        
        // Subscribe to market data topic
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.addMessageListener(new MessageListener<MarketData>() {
            public void onMessage(Message<MarketData> message) {
                broadcastToSubscribers(message.getMessageObject());
            }
        });
    }
    
    public void start() {
        HttpServer server = vertx.createHttpServer();
        
        server.webSocketHandler(ws -> {
            String path = ws.path();
            
            if (path.startsWith("/stream/")) {
                String symbol = path.substring(8);
                handleWebSocketConnection(ws, symbol);
            } else {
                ws.reject();
            }
        });
        
        // Long polling fallback
        server.requestHandler(req -> {
            if (req.path().startsWith("/poll/")) {
                String symbol = req.path().substring(6);
                handleLongPolling(req, symbol);
            }
        });
        
        server.listen(8080, result -> {
            if (result.succeeded()) {
                System.out.println("Market data server started on port 8080");
            }
        });
    }
    
    private void handleWebSocketConnection(ServerWebSocket ws, String symbol) {
        subscriptions.computeIfAbsent(symbol, k -> ConcurrentHashMap.newKeySet())
                     .add(ws);
        
        ws.closeHandler(v -> {
            Set<ServerWebSocket> sockets = subscriptions.get(symbol);
            if (sockets != null) {
                sockets.remove(ws);
            }
        });
        
        // Send initial snapshot from Hazelcast cache
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        MarketData data = cache.get(symbol);
        if (data != null) {
            ws.writeTextMessage(data.toJson());
        }
    }
    
    private void handleLongPolling(HttpServerRequest req, String symbol) {
        String lastEventId = req.getParam("lastEventId");
        
        // Hold request until data available or timeout
        long timerId = vertx.setTimer(30000, id -> {
            req.response()
               .putHeader("Content-Type", "application/json")
               .end("{\"type\":\"heartbeat\"}");
        });
        
        // Register one-time listener
        subscriptions.computeIfAbsent(symbol + ":poll", 
            k -> ConcurrentHashMap.newKeySet())
            .add(new PollHandler(req, timerId));
    }
    
    private void broadcastToSubscribers(MarketData data) {
        String symbol = data.getSymbol();
        
        // WebSocket subscribers
        Set<ServerWebSocket> sockets = subscriptions.get(symbol);
        if (sockets != null) {
            String json = data.toJson();
            sockets.forEach(ws -> {
                if (!ws.isClosed()) {
                    ws.writeTextMessage(json);
                }
            });
        }
        
        // Update Hazelcast cache for new subscribers
        IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
        cache.put(symbol, data);
    }
    
    public static void main(String[] args) {
        new MarketDataServer().start();
    }
}

Publishing market data to Hazelcast from data feed:

public class MarketDataPublisher {
    private final HazelcastInstance hazelcast;
    
    public void publishUpdate(String symbol, double price, long volume) {
        MarketData data = new MarketData(symbol, price, volume, 
                                         System.currentTimeMillis());
        
        // Publish to topic - all Vert.x instances receive it
        ITopic<MarketData> topic = hazelcast.getTopic("market-data");
        topic.publish(data);
    }
}

This architecture provided:

  • Vert.x Event Loop: Non-blocking I/O handled 10,000+ concurrent WebSocket connections per instance
  • Hazelcast Distribution: Market data shared across multiple Vert.x instances without a central message broker
  • Horizontal Scaling: Adding Vert.x instances automatically joined the Hazelcast cluster
  • Low Latency: Sub-millisecond message propagation within the cluster
  • Automatic Fallback: Clients detected WebSocket support; older browsers used long polling

Facebook Thrift and Google Protocol Buffers

I experimented with Facebook Thrift and Google Protocol Buffers, which provided IDL-based serialization and RPC across multiple languages and transports. Here is an example Protocol Buffers definition:

syntax = "proto3";

package customer;

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateBalance(UpdateBalanceRequest) returns (UpdateBalanceResponse);
    rpc ListCustomers(ListCustomersRequest) returns (CustomerList);
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateBalanceRequest {
    int32 customer_id = 1;
    double new_balance = 2;
}

message UpdateBalanceResponse {
    bool success = 1;
}

message ListCustomersRequest {}

message CustomerList {
    repeated Customer customers = 1;
}

Python server with gRPC (which uses Protocol Buffers):

import grpc
from concurrent import futures
import customer_pb2
import customer_pb2_grpc

class CustomerServicer(customer_pb2_grpc.CustomerServiceServicer):
    
    def GetCustomer(self, request, context):
        return customer_pb2.Customer(
            customer_id=request.customer_id,
            name="John Doe",
            balance=5000.00
        )
    
    def UpdateBalance(self, request, context):
        print(f"Updating balance for {request.customer_id} " +
              f"to {request.new_balance}")
        return customer_pb2.UpdateBalanceResponse(success=True)
    
    def ListCustomers(self, request, context):
        customers = [
            customer_pb2.Customer(customer_id=1, name="Alice", balance=1000),
            customer_pb2.Customer(customer_id=2, name="Bob", balance=2000),
        ]
        return customer_pb2.CustomerList(customers=customers)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    customer_pb2_grpc.add_CustomerServiceServicer_to_server(
        CustomerServicer(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    print("Server started on port 50051")
    server.wait_for_termination()

if __name__ == '__main__':
    serve()

I learned that binary protocols offer significant efficiency gains. JSON is human-readable and convenient for debugging, but in high-performance scenarios, binary protocols like Protocol Buffers reduce payload size and serialization overhead.
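
As a rough illustration (assuming customer_pb2 was generated from the proto above with protoc), here’s a quick size comparison; the exact numbers are approximate:

import json
import customer_pb2  # generated by: protoc --python_out=. customer.proto

customer = customer_pb2.Customer(customer_id=12345, name="John Doe", balance=5000.00)
proto_bytes = customer.SerializeToString()

json_bytes = json.dumps({
    "customer_id": 12345,
    "name": "John Doe",
    "balance": 5000.00
}).encode("utf-8")

print(len(proto_bytes), "bytes as protobuf")  # roughly 20 bytes
print(len(json_bytes), "bytes as JSON")       # roughly 60 bytes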

Serverless and Lambda: Functions as a Service

Around 2015, AWS Lambda introduced serverless computing where you wrote functions, and AWS handled all the infrastructure:

// Lambda function (Node.js)
exports.handler = async (event) => {
    const customerId = event.queryStringParameters.customerId;
    
    // Query DynamoDB
    const AWS = require('aws-sdk');
    const dynamodb = new AWS.DynamoDB.DocumentClient();
    
    const result = await dynamodb.get({
        TableName: 'Customers',
        Key: { customerId: customerId }
    }).promise();
    
    if (result.Item) {
        return {
            statusCode: 200,
            body: JSON.stringify(result.Item)
        };
    } else {
        return {
            statusCode: 404,
            body: JSON.stringify({ error: 'Customer not found' })
        };
    }
};

Serverless was powerful: no servers to manage, automatic scaling, pay-per-invocation pricing. It felt like the actor model from my earlier research: small, stateless, event-driven functions.

However, I also encountered several problems with serverless:

  • Cold starts: First invocation could be slow (though it has improved with recent updates)
  • Timeouts: Functions had maximum execution time (15 minutes for Lambda)
  • State management: Functions were stateless; you needed external state stores
  • Orchestration: Coordinating multiple functions was complex

The ping-pong anti-pattern emerged where Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. This created hard-to-debug systems with unpredictable costs. AWS Step Functions and Azure Durable Functions addressed orchestration:

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckInventory",
      "Next": "ChargeCustomer"
    },
    "ChargeCustomer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCustomer",
      "Catch": [{
        "ErrorEquals": ["PaymentError"],
        "Next": "PaymentFailed"
      }],
      "Next": "ShipOrder"
    },
    "ShipOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ShipOrder",
      "End": true
    },
    "PaymentFailed": {
      "Type": "Fail",
      "Cause": "Payment processing failed"
    }
  }
}

gRPC: Modern RPC

In the early 2020s, I started using gRPC extensively. It combined the best ideas from decades of RPC evolution:

  • Protocol Buffers for IDL
  • HTTP/2 for transport (multiplexing, header compression, flow control)
  • Strong typing with code generation
  • Streaming support (unary, server streaming, client streaming, bidirectional)

Here’s a gRPC service definition:

syntax = "proto3";

package customer;

service CustomerService {
    rpc GetCustomer(GetCustomerRequest) returns (Customer);
    rpc UpdateCustomer(Customer) returns (UpdateResponse);
    rpc StreamOrders(StreamOrdersRequest) returns (stream Order);
    rpc BidirectionalChat(stream ChatMessage) returns (stream ChatMessage);
}

message Customer {
    int32 customer_id = 1;
    string name = 2;
    double balance = 3;
}

message GetCustomerRequest {
    int32 customer_id = 1;
}

message UpdateResponse {
    bool success = 1;
    string message = 2;
}

message StreamOrdersRequest {
    int32 customer_id = 1;
}

message Order {
    int32 order_id = 1;
    double amount = 2;
    string status = 3;
}

message ChatMessage {
    string user = 1;
    string message = 2;
    int64 timestamp = 3;
}

Go server implementation:

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
    
    "google.golang.org/grpc"
    pb "example.com/customer"
)

type server struct {
    pb.UnimplementedCustomerServiceServer
}

func (s *server) GetCustomer(ctx context.Context, req *pb.GetCustomerRequest) (*pb.Customer, error) {
    return &pb.Customer{
        CustomerId: req.CustomerId,
        Name:       "John Doe",
        Balance:    5000.00,
    }, nil
}

func (s *server) UpdateCustomer(ctx context.Context, customer *pb.Customer) (*pb.UpdateResponse, error) {
    log.Printf("Updating customer %d", customer.CustomerId)
    
    return &pb.UpdateResponse{
        Success: true,
        Message: "Customer updated successfully",
    }, nil
}

func (s *server) StreamOrders(req *pb.StreamOrdersRequest, stream pb.CustomerService_StreamOrdersServer) error {
    orders := []*pb.Order{
        {OrderId: 1, Amount: 99.99, Status: "shipped"},
        {OrderId: 2, Amount: 149.50, Status: "processing"},
        {OrderId: 3, Amount: 75.25, Status: "delivered"},
    }
    
    for _, order := range orders {
        if err := stream.Send(order); err != nil {
            return err
        }
        time.Sleep(time.Second)  // Simulate delay
    }
    
    return nil
}

func (s *server) BidirectionalChat(stream pb.CustomerService_BidirectionalChatServer) error {
    for {
        msg, err := stream.Recv()
        if err != nil {
            return err
        }
        
        log.Printf("Received: %s from %s", msg.Message, msg.User)
        
        // Echo back with server prefix
        response := &pb.ChatMessage{
            User:      "Server",
            Message:   fmt.Sprintf("Echo: %s", msg.Message),
            Timestamp: time.Now().Unix(),
        }
        
        if err := stream.Send(response); err != nil {
            return err
        }
    }
}

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    
    s := grpc.NewServer()
    pb.RegisterCustomerServiceServer(s, &server{})
    
    log.Println("Server listening on :50051")
    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}

Go client:

package main

import (
    "context"
    "io"
    "log"
    "time"
    
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    pb "example.com/customer"
)

func main() {
    conn, err := grpc.Dial("localhost:50051", 
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("Failed to connect: %v", err)
    }
    defer conn.Close()
    
    client := pb.NewCustomerServiceClient(conn)
    ctx := context.Background()
    
    // Unary call
    customer, err := client.GetCustomer(ctx, &pb.GetCustomerRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("GetCustomer failed: %v", err)
    }
    log.Printf("Customer: %v", customer)
    
    // Server streaming
    stream, err := client.StreamOrders(ctx, &pb.StreamOrdersRequest{
        CustomerId: 12345,
    })
    if err != nil {
        log.Fatalf("StreamOrders failed: %v", err)
    }
    
    for {
        order, err := stream.Recv()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatalf("Receive error: %v", err)
        }
        log.Printf("Order: %v", order)
    }
}

The Load Balancing Challenge

gRPC had one major gotcha in Kubernetes: connection persistence breaks load balancing. I documented this exhaustively in my blog post The Complete Guide to gRPC Load Balancing in Kubernetes and Istio. HTTP/2 multiplexes multiple requests over a single TCP connection, and once that connection is established to one pod, all requests go there. Kubernetes Service load balancing happens at L4 (TCP), so it doesn’t see individual gRPC calls; it only sees one connection. I used Istio’s Envoy sidecar, which operates at L7 and routes each gRPC call independently:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-service
spec:
  host: grpc-service
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 100
        maxRequestsPerConnection: 10  # Force connection rotation
    loadBalancer:
      simple: LEAST_REQUEST  # Better than ROUND_ROBIN
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s

I learned that modern protocols solve old problems but introduce new ones. gRPC is excellent, but you must understand how it interacts with infrastructure. Production systems require deep integration between application protocol and deployment environment.

Modern Messaging and Streaming

I have been using Apache Kafka for many years; it transformed how we think about data. It’s not just a message queue, it’s a distributed commit log:

from kafka import KafkaProducer, KafkaConsumer
import json
import time

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

order = {
    'order_id': '12345',
    'customer_id': '67890',
    'amount': 99.99,
    'timestamp': time.time()
}

producer.send('orders', value=order)
producer.flush()

# Consumer
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    group_id='order-processors'
)

for message in consumer:
    order = message.value
    print(f"Processing order: {order['order_id']}")
    # Process order

Kafka provided:

  • Durability: Messages are persisted to disk
  • Replayability: Consumers can reprocess historical events
  • Partitioning: Horizontal scalability through partitions
  • Consumer groups: Multiple consumers can process in parallel

Key Lesson: Event-driven architectures enable loose coupling and temporal decoupling. Systems can be rebuilt from the event log. This is Event Sourcing—a powerful pattern that Kafka makes practical at scale.
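
As a rough sketch of that replayability, using the same kafka-python setup as above, you can rebuild order state by reading the topic from the beginning; the consumer settings here are one reasonable choice, not the only one:

import json
from kafka import KafkaConsumer

replay = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',      # start from the oldest retained event
    enable_auto_commit=False,          # pure replay; don't commit any offsets
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    consumer_timeout_ms=5000           # stop iterating after 5s with no new events
)

orders_by_id = {}
for message in replay:
    event = message.value
    orders_by_id[event['order_id']] = event   # last event wins per order

print(f"Rebuilt state for {len(orders_by_id)} orders from the log")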

Agentic RPC: MCP and Agent-to-Agent Protocol

Over the last year, I have been building agentic AI applications using the Model Context Protocol (MCP) and, more recently, the Agent-to-Agent (A2A) protocol. Both use JSON-RPC 2.0 underneath. After decades of RPC evolution, from Sun RPC to CORBA to gRPC, we’ve come full circle to JSON-RPC for AI agents.

Service Discovery

A2A immediately reminded me of Sun’s Network Information Service (NIS), originally called Yellow Pages, which I used in the early 1990s. NIS provided a centralized directory service for Unix systems to look up user accounts, host names, and configuration data across a network. I saw this pattern repeated throughout the decades:

  • CORBA Naming Service (1990s): Objects registered themselves with a hierarchical naming service, and clients discovered them by name
  • JINI (late 1990s): Services advertised themselves via multicast, and clients discovered them through lookup registrars (as I described earlier in the JINI section)
  • UDDI (2000s): Universal Description, Discovery, and Integration for web services—a registry where SOAP services could be published and discovered
  • Consul, Eureka, etcd (2010s): Modern service discovery for microservices
  • Kubernetes DNS/Service Discovery (2010s-present): Built-in service registry and DNS-based discovery

Model Context Protocol (MCP)

MCP lets AI agents discover and invoke tools provided by servers. I recently built a daily minutes assistant that aggregates information from multiple sources into a morning briefing. Here’s the MCP server that exposes tools to the AI agent:

from mcp.server import Server
import mcp.types as types
from typing import Any
import asyncio
import json

class DailyMinutesServer:
    def __init__(self):
        self.server = Server("daily-minutes")
        self.setup_handlers()
        
    def setup_handlers(self):
        @self.server.list_tools()
        async def handle_list_tools() -> list[types.Tool]:
            return [
                types.Tool(
                    name="get_emails",
                    description="Fetch recent emails from inbox",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "hours": {
                                "type": "number",
                                "description": "Hours to look back"
                            },
                            "limit": {
                                "type": "number", 
                                "description": "Max emails to fetch"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_hackernews",
                    description="Fetch top Hacker News stories",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "limit": {
                                "type": "number",
                                "description": "Number of stories"
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_rss_feeds",
                    description="Fetch latest RSS feed items",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "feed_urls": {
                                "type": "array",
                                "items": {"type": "string"}
                            }
                        }
                    }
                ),
                types.Tool(
                    name="get_weather",
                    description="Get current weather forecast",
                    inputSchema={
                        "type": "object",
                        "properties": {
                            "location": {"type": "string"}
                        }
                    }
                )
            ]
        
        @self.server.call_tool()
        async def handle_call_tool(
            name: str, 
            arguments: dict[str, Any]
        ) -> list[types.TextContent]:
            if name == "get_emails":
                result = await email_connector.fetch_recent(
                    hours=arguments.get("hours", 24),
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_hackernews":
                result = await hn_connector.fetch_top_stories(
                    limit=arguments.get("limit", 10)
                )
            elif name == "get_rss_feeds":
                result = await rss_connector.fetch_feeds(
                    feed_urls=arguments["feed_urls"]
                )
            elif name == "get_weather":
                result = await weather_connector.get_forecast(
                    location=arguments["location"]
                )
            else:
                raise ValueError(f"Unknown tool: {name}")
            
            return [types.TextContent(
                type="text",
                text=json.dumps(result, indent=2)
            )]

Each connector is a simple async module. Here’s the Hacker News connector:

import aiohttp
from typing import List, Dict

class HackerNewsConnector:
    BASE_URL = "https://hacker-news.firebaseio.com/v0"
    
    async def fetch_top_stories(self, limit: int = 10) -> List[Dict]:
        async with aiohttp.ClientSession() as session:
            # Get top story IDs
            async with session.get(f"{self.BASE_URL}/topstories.json") as resp:
                story_ids = await resp.json()
            
            # Fetch details for top N stories
            stories = []
            for story_id in story_ids[:limit]:
                async with session.get(
                    f"{self.BASE_URL}/item/{story_id}.json"
                ) as resp:
                    story = await resp.json()
                    stories.append({
                        "title": story.get("title"),
                        "url": story.get("url"),
                        "score": story.get("score"),
                        "by": story.get("by"),
                        "time": story.get("time")
                    })
            
            return stories

RSS and weather connectors follow the same pattern—simple, focused modules that the MCP server orchestrates.
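
As a rough sketch of that pattern, here’s what a weather connector might look like; the endpoint URL and response fields are placeholders, not the real API from my project:

import aiohttp
from typing import Dict

class WeatherConnector:
    # Hypothetical endpoint; substitute whatever weather API you actually use.
    BASE_URL = "https://weather.example.com/v1/forecast"

    async def get_forecast(self, location: str) -> Dict:
        async with aiohttp.ClientSession() as session:
            async with session.get(self.BASE_URL, params={"q": location}) as resp:
                data = await resp.json()

        # Normalize to the small shape the briefing agent expects
        return {
            "location": location,
            "summary": data.get("summary"),
            "temperature": data.get("temperature"),
        }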

JSON-RPC Under the Hood

Under the hood, MCP is just JSON-RPC 2.0 over stdio or HTTP. Here’s what a tool call looks like on the wire:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_emails",
    "arguments": {
      "hours": 12,
      "limit": 5
    }
  }
}

Response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      {
        "type": "text",
        "text": "[{\"from\": \"john@example.com\", \"subject\": \"Q4 Review\", ...}]"
      }
    ]
  }
}

After using Sun RPC, CORBA, SOAP, and gRPC, I appreciate MCP’s simplicity. It solves a specific problem: letting AI agents discover and invoke tools.

The Agent Workflow

My daily minutes agent follows this workflow:

  1. Agent calls get_emails to fetch recent messages
  2. Agent calls get_hackernews for tech news
  3. Agent calls get_rss_feeds for blog updates
  4. Agent calls get_weather for local forecast
  5. Agent synthesizes everything into a concise morning briefing

The AI decides which tools to call, in what order, based on the user’s preferences. I don’t hardcode the workflow.
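
To make the shape of that loop concrete, here is a minimal sketch of the orchestration, assuming a hypothetical MCPClient wrapper with an async call_tool(name, arguments) method. The real agent lets the LLM pick the tools and their order; this hardcodes them only to show the data flow.

from typing import Any, Protocol

class MCPClient(Protocol):
    # Hypothetical client interface; the actual MCP SDK client differs
    async def call_tool(self, name: str, arguments: dict[str, Any]) -> Any: ...

async def build_morning_briefing(
    client: MCPClient, feed_urls: list[str], location: str
) -> dict[str, Any]:
    emails = await client.call_tool("get_emails", {"hours": 24, "limit": 10})
    news = await client.call_tool("get_hackernews", {"limit": 10})
    feeds = await client.call_tool("get_rss_feeds", {"feed_urls": feed_urls})
    weather = await client.call_tool("get_weather", {"location": location})
    # The real agent hands these to the LLM to synthesize prose; here we just bundle them
    return {"emails": emails, "news": news, "feeds": feeds, "weather": weather}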

Agent-to-Agent Protocol (A2A)

While MCP focuses on tool calling, A2A addresses agent-to-agent discovery and communication. It’s the modern equivalent of NIS/Yellow Pages for agents. Agents register their capabilities in a directory, and other agents discover and invoke them. A2A also uses JSON-RPC 2.0, but adds a discovery layer. Here’s how an agent registers itself:

from a2a import Agent, Capability

class ResearchAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="research-agent-01",
            name="Research Agent",
            description="Performs web research and summarization"
        )
        
        # Register capabilities
        self.register_capability(
            Capability(
                name="web_search",
                description="Search the web for information",
                input_schema={
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "max_results": {"type": "integer", "default": 10}
                    },
                    "required": ["query"]
                },
                output_schema={
                    "type": "object",
                    "properties": {
                        "results": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "title": {"type": "string"},
                                    "url": {"type": "string"},
                                    "snippet": {"type": "string"}
                                }
                            }
                        }
                    }
                }
            )
        )
    
    async def handle_request(self, capability: str, params: dict):
        if capability == "web_search":
            return await self.perform_web_search(
                query=params["query"],
                max_results=params.get("max_results", 10)
            )
    
    async def perform_web_search(self, query: str, max_results: int):
        # Actual search implementation
        results = await search_engine.search(query, limit=max_results)
        return {"results": results}

Another agent discovers and invokes the research agent:

class CoordinatorAgent(Agent):
    def __init__(self):
        super().__init__(
            agent_id="coordinator-01",
            name="Coordinator Agent"
        )
        self.directory = AgentDirectory()
    
    async def research_topic(self, topic: str):
        # Discover agents with web_search capability
        agents = await self.directory.find_agents_with_capability("web_search")
        
        if not agents:
            raise Exception("No research agents available")
        
        # Select an agent (load balancing, availability, etc.)
        research_agent = agents[0]
        
        # Invoke the capability via JSON-RPC
        result = await research_agent.invoke(
            capability="web_search",
            params={
                "query": topic,
                "max_results": 20
            }
        )
        
        return result

The JSON-RPC exchange looks like this:

Discovery request:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "directory.find_agents",
  "params": {
    "capability": "web_search",
    "filters": {
      "availability": "online"
    }
  }
}

Discovery response:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "agents": [
      {
        "agent_id": "research-agent-01",
        "name": "Research Agent",
        "endpoint": "http://agent-service:8080/rpc",
        "capabilities": ["web_search"],
        "metadata": {
          "load": 0.3,
          "response_time_ms": 150
        }
      }
    ]
  }
}

The Security Problem

Though I appreciate the simplicity of MCP and A2A, here’s what worries me: both protocols largely ignore decades of hard-won lessons about security. The Salesloft breach showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services. MCP and A2A give us standard protocols for tool calling and agent coordination, which is valuable. But they create a false sense of security while ignoring fundamentals we solved decades ago:

  • Authentication: How do we verify an agent’s identity?
  • Authorization: What capabilities should this agent have access to?
  • Credential rotation: How do we handle token expiration and renewal?
  • Observability: How do we trace agent interactions for debugging and auditing?
  • Principle of least privilege: How do we ensure agents only access what they need?
  • Rate limiting: How do we prevent a misbehaving agent from overwhelming services?

The community needs to address this before A2A and MCP see widespread enterprise adoption.
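
As a starting point, even a simple allow-list enforced at the MCP server boundary goes a long way toward least privilege. The sketch below is my own illustration, not part of either protocol; AgentToken and authorize_tool_call are hypothetical names.

# Hedged sketch: per-agent tool allow-list checked before dispatch.
# AgentToken and authorize_tool_call are hypothetical; MCP itself defines neither.
from dataclasses import dataclass, field

@dataclass
class AgentToken:
    agent_id: str
    scopes: set[str] = field(default_factory=set)  # e.g. {"get_weather", "get_rss_feeds"}

def authorize_tool_call(token: AgentToken, tool_name: str) -> None:
    """Raise if the calling agent is not allowed to invoke this tool."""
    if tool_name not in token.scopes:
        raise PermissionError(
            f"agent {token.agent_id} is not authorized to call {tool_name}"
        )

# Inside handle_call_tool, before dispatching on the tool name:
#     authorize_tool_call(current_agent_token, name)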

Lessons Learned

1. Complexity is the Enemy

Every technology I’ve watched fail, failed because of complexity. CORBA, SOAP, EJB—they all collapsed under their own weight. Successful technologies like REST, gRPC, and Kafka focused on doing one thing well.

Implication: Be suspicious of solutions that try to solve every problem. Prefer composable, focused tools.

2. Network Calls Are Expensive

The first Fallacy of Distributed Computing (“the network is reliable”) haunts us still: the network is not reliable. It’s also not zero-latency, infinite-bandwidth, or secure. I’ve watched this lesson be relearned in every generation:

  • EJB entity beans made chatty network calls
  • Microservices make chatty REST calls
  • GraphQL makes chatty database queries

Implication: Design APIs to minimize round trips. Batch operations. Cache aggressively. Monitor network latency religiously. (See my blog on fault tolerance in microservices for details.)
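
As an illustration, the Hacker News connector shown earlier fetches story details one at a time, which costs N+1 sequential round trips; batching the item requests with asyncio.gather cuts wall-clock time to roughly the slowest single request. A sketch of the batched variant (same public API, my own variation):

import asyncio
import aiohttp

BASE_URL = "https://hacker-news.firebaseio.com/v0"

async def fetch_item(session: aiohttp.ClientSession, story_id: int) -> dict:
    async with session.get(f"{BASE_URL}/item/{story_id}.json") as resp:
        return await resp.json()

async def fetch_top_stories_batched(limit: int = 10) -> list:
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{BASE_URL}/topstories.json") as resp:
            story_ids = await resp.json()
        # Fire the item requests concurrently instead of one at a time
        return list(await asyncio.gather(
            *(fetch_item(session, sid) for sid in story_ids[:limit])
        ))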

3. Statelessness Scales

Stateless services scale horizontally. But real applications need state—session data, shopping carts, user preferences. The solution isn’t to make services stateful; it’s to externalize state:

  • Session stores (Redis, Memcached)
  • Databases (PostgreSQL, DynamoDB)
  • Event logs (Kafka)
  • Distributed caches

Implication: Keep service logic stateless. Push state to specialized systems designed for it.
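
A minimal sketch of externalizing session state with redis-py; the key naming and TTL here are arbitrary choices for illustration:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_cart(session_id: str, cart: dict) -> None:
    # Any stateless service instance can read this back later
    r.setex(f"cart:{session_id}", 3600, json.dumps(cart))

def load_cart(session_id: str) -> dict:
    raw = r.get(f"cart:{session_id}")
    return json.loads(raw) if raw else {}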

4. The Actor Model Is Underappreciated

My research with actors and the Linda memory model convinced me that the Actor model simplifies concurrent and distributed systems. Today’s serverless functions are essentially actors, and frameworks like Akka, Orleans, and Dapr embrace the model. Actors eliminate shared mutable state, which is the source of most concurrency bugs.

Implication: For event-driven systems, consider Actor-based frameworks. They map naturally to distributed problems.
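
To show how little machinery the model needs, here is a toy actor built on asyncio: its state lives inside one task and is touched only through messages on its mailbox. This is my own sketch, not any particular framework’s API.

import asyncio

class CounterActor:
    """Toy actor: private state, mutated only by messages from its mailbox."""

    def __init__(self) -> None:
        self.mailbox: asyncio.Queue = asyncio.Queue()
        self._count = 0  # never touched from outside the actor's own task

    async def run(self) -> None:
        while True:
            msg, reply = await self.mailbox.get()
            if msg == "increment":
                self._count += 1
            elif msg == "get":
                reply.set_result(self._count)
            elif msg == "stop":
                return

    async def ask(self, msg: str):
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((msg, reply))
        if msg == "get":
            return await reply
        return None

async def main() -> None:
    actor = CounterActor()
    task = asyncio.create_task(actor.run())
    await actor.ask("increment")
    await actor.ask("increment")
    print(await actor.ask("get"))  # 2
    await actor.ask("stop")
    await task

asyncio.run(main())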

5. Observability

Modern distributed systems require extensive instrumentation. You need:

  • Structured logging with correlation IDs
  • Metrics for performance and health
  • Distributed tracing to follow requests across services
  • Alarms with proper thresholds

Implication: Instrument your services from day one. Observability is infrastructure, not a nice-to-have. (See my blog posts on fault tolerance and load shedding for specific metrics.)
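
A minimal sketch of structured logging with correlation IDs using only the standard library; the field names are illustrative:

import json
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log line, always carrying the correlation ID
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
            "logger": record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(payload: dict) -> None:
    # Set once at the edge; every log line downstream carries the same ID
    correlation_id.set(payload.get("correlation_id", str(uuid.uuid4())))
    logger.info("order received")
    logger.info("order validated")

handle_request({"order_id": 42})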

6. Throttling and Load Shedding

Every production system eventually faces traffic spikes or DDoS attacks. Without throttling and load shedding, your system will collapse. Key techniques:

  • Rate limiting by client/user/IP
  • Admission control based on queue depth
  • Circuit breakers to fail fast
  • Backpressure to slow down producers

Implication: Build throttling and load shedding into your architecture early. They’re harder to retrofit. (See my comprehensive blog post on this topic.)
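
A minimal sketch of admission control based on in-flight request count; the threshold is an arbitrary illustrative number:

class AdmissionController:
    """Reject new work when too many requests are already in flight."""

    def __init__(self, max_in_flight: int = 100):  # threshold is illustrative
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def run(self, handler, *args):
        if self.in_flight >= self.max_in_flight:
            # Shed load early instead of queueing work we cannot finish in time
            raise RuntimeError("server busy, retry later")
        self.in_flight += 1
        try:
            return await handler(*args)
        finally:
            self.in_flight -= 1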

7. Idempotency

Network failures mean requests may be retried. If your operations aren’t idempotent, you’ll process payments twice, create duplicate orders, and corrupt data (see my blog post on idempotency). Make operations idempotent:

  • Use idempotency keys
  • Check if operation already succeeded
  • Design APIs to be safely retryable

Implication: Every non-read operation should be idempotent. It saves you from a world of hurt.
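
A minimal sketch of idempotency keys; the in-memory dict stands in for a durable store such as a database table with a unique constraint on the key:

# Sketch: idempotency keys for a payment-style mutation.
_processed: dict[str, dict] = {}

def charge(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in _processed:
        # Retried request: return the original result, do not charge twice
        return _processed[idempotency_key]
    result = {"status": "charged", "amount_cents": amount_cents}
    _processed[idempotency_key] = result
    return result

# A client retry with the same key is safe:
charge("order-123-attempt", 4_500)
charge("order-123-attempt", 4_500)  # no double charge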

8. External and Internal APIs Should Differ

I have learned that external APIs need good UX and developer empathy: they should be intuitive, consistent, and well-documented. Internal APIs can optimize for performance, reliability, and operational needs. Don’t expose your internal architecture to external consumers. Use API gateways to translate between external contracts and internal services.

Implication: Design external APIs for developers using them. Design internal APIs for operational excellence.

9. Standards Beat Proprietary Solutions

Novell IPX failed because it was proprietary. Sun RPC succeeded as an open standard. REST thrived because it built on HTTP. gRPC uses open standards (HTTP/2, Protocol Buffers).

Implication: Prefer open standards. If you must use proprietary tech, understand the exit strategy.

10. Developer Experience Matters

Technologies with great developer experience get adopted. Java succeeded because it was easier than C++. REST beat SOAP because it was simpler. Kubernetes won because it offered a powerful abstraction.

Implication: Invest in developer tools, documentation, and ergonomics. Friction kills momentum.

Upcoming Trends

WebAssembly: The Next Runtime

WebAssembly (Wasm) is emerging as a universal runtime. Code written in Rust, Go, C, or AssemblyScript compiles to Wasm and runs anywhere. Platforms like wasmCloud, Fermyon, and Lunatic are building Actor-based systems on Wasm. Combined with the Component Model and WASI (WebAssembly System Interface), Wasm offers near-native performance, strong sandboxing, and portability. It might replace Docker containers for some workloads. Solomon Hykes, creator of Docker, famously said:

“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019

WebAssembly isn’t ready yet. Critical gaps:

  • WASI maturity: Still evolving (Preview 2 in development)
  • Async I/O: Limited compared to native runtimes
  • Database drivers: Many don’t support WASM
  • Networking: WASI sockets still experimental
  • Ecosystem tooling: Debugging, profiling still primitive

Service Meshes

Istio, Linkerd, Dapr move cross-cutting concerns out of application code:

  • Authentication/authorization
  • Rate limiting
  • Circuit breaking
  • Retries with exponential backoff
  • Distributed tracing
  • Metrics collection

Tradeoff: Complexity shifts from application code to infrastructure. Teams need deep Kubernetes and service mesh expertise.

The Edge Is Growing

Edge computing brings computation closer to users. Platforms like Cloudflare Workers and Fastly Compute@Edge run code globally with single-digit-millisecond latency. This requires new thinking: eventual consistency, CRDTs (Conflict-free Replicated Data Types), and geo-distributed state management, as sketched below.
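
The simplest CRDT is a grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of the order or duplication of updates. A minimal sketch:

class GCounter:
    """Grow-only counter CRDT: merge is an element-wise max over replica slots."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("edge-us"), GCounter("edge-eu")
a.increment(3)
b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge without coordination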

AI Agents and Multi-Agent Systems

I’m currently building agentic AI systems using LangGraph, RAG, and MCP. These systems are inherently distributed: agents communicate asynchronously, maintain local state, and coordinate through message passing. It’s the Actor model again.

What’s Missing

Despite all this progress, we still struggle with:

  • Distributed transactions: Two-phase commit doesn’t scale; SAGA patterns are complex
  • Testing distributed systems: Mocking services, simulating failures, and reproducing production bugs remain hard. I have written a number of tools for mock testing.
  • Observability at scale: Tracing millions of requests generates too much data
  • Cost management: Cloud bills spiral as systems grow
  • Cognitive load: Modern systems require expertise in dozens of technologies

Conclusion

I’ve been writing network code for decades and have used dozens of protocols, frameworks, and paradigms. Here is what I have learned:

  • Simplicity beats complexity (SOAP died, REST thrived)
  • Network calls aren’t free (EJB entity beans, chatty microservices)
  • State is hard; externalize it (Erlang, serverless functions)
  • Observability is essential (You can’t fix what you can’t see)
  • Developer experience matters (Java beat C++, REST beat SOAP)
  • Make It Work, Then Make It Fast
  • Design for Failure from Day One (Systems built with circuit breakers, retries, timeouts, and graceful degradation from the start).

Other tips from the evolution of remote services include:

  • Design systems as message-passing actors from the start. Whether that’s Erlang processes, Akka actors, Orleans grains, or Lambda functions—embrace isolated state and message passing.
  • Invest in Observability with structured logging with correlation IDs, instrumented metrics, distributed tracing and alarms.
  • Separate External and Internal APIs. Use REST or GraphQL for external APIs (with versioning) and use gRPC or Thrift for internal communication (efficient).
  • Build Throttling and Load Shedding by rate limiting by client/user/IP at the edge and implement admission control at the service level (See my blog on Effective Load Shedding and Throttling).
  • Make Everything Idempotent as networks fail and requests get retried. Use idempotency keys for all mutations.
  • Choose Boring Technology (See Choose Boring Technology). For your core infrastructure, use proven tech (PostgreSQL, Redis, Kafka).
  • Test for Failure. Most code only handles the happy path. Production is all about unhappy paths.
  • Learn about the Fallacies of Distributed Computing and read A Note on Distributed Computing (1994).
  • Make chaos engineering part of CI/CD and use property-based testing (See my blog on property-based testing).

The technologies change: mainframes to serverless, Assembly to Go, CICS to Kubernetes. But the underlying principles remain constant. We oscillate between extremes:

  • Monoliths -> Microservices -> (now) Modular Monoliths
  • Strongly typed IDLs (CORBA) -> Untyped JSON -> Strongly typed again (gRPC)
  • Centralized -> Distributed -> Edge -> (soon) Peer-to-peer?
  • Synchronous RPC -> Asynchronous messaging -> Reactive streams

Each swing teaches us something. CORBA was too complex, but IDL-first design is valuable. REST was liberating, but binary protocols are more efficient. Microservices enable agility, but operational complexity explodes. The sweet spot is usually in the middle. Modular monoliths with clear boundaries. REST for external APIs, gRPC for internal communication. Some synchronous calls, some async messaging.

Here are a few trends that I see becoming prevalent:

  1. WebAssembly may replace containers for some workloads: Faster startup, better security with platforms like wasmCloud and Fermyon.
  2. Service meshes are becoming invisible: Currently they are too complex. Ambient mesh (no sidecars) and eBPF-based routing are gaining wider adoption.
  3. The Actor model will eat the world: Serverless functions are actors and durable functions are actor orchestration.
  4. Edge computing will force new patterns: We can’t rely on centralized state and may need CRDTs and eventual consistency.
  5. AI agents will need distributed coordination. Multi-agent systems = distributed systems and may need message passing between agents.

The best engineers don’t just learn the latest framework, they study the history, understand the trade-offs, and recognize when old ideas solve new problems. The future of distributed systems won’t be built by inventing entirely new paradigms instead it’ll be built by taking the best ideas from the past, learning from the failures, and applying them with better tools.


Check out my other blog posts:

November 4, 2025

Building a Production-Grade Enterprise AI Platform with vLLM: A Complete Guide from the Trenches

Filed under: Agentic AI — admin @ 11:48 am

TL;DR: Tested open-source LLM serving (vLLM) on GCP L4 GPUs. Achieved 93% cost savings vs OpenAI GPT-4, 100% routing accuracy, and 91% cache hit rates. Prototype proves feasibility; production requires 5-7 months additional work (security, HA, ops). All code at github.com/bhatti/vllm-tutorial.

Background

Last year, our CEO mandated “AI adoption” across the organization, and everyone had access to LLMs through an internal portal that used Vertex AI. However, there was little training and no established best practices. I saw engineers using the most expensive models for simple queries, no cost tracking, zero observability into what was being used, and no policies around data handling. People tried AI, built some demos, and got mixed results.

This mirrors what’s happening across the industry. Recent research shows 95% of AI pilots fail at large companies, and McKinsey found 42% of companies abandoned generative AI projects citing “no significant bottom line impact.” The 5% that succeed do something fundamentally different: they treat AI as infrastructure requiring proper tooling, not just API access.

This experience drove me to explore better approaches. I built prototypes using vLLM and open-source tools, tested them on GCP L4 GPUs, and documented what actually works. This blog shares those findings with real code, benchmarks, and lessons from building production-ready AI infrastructure. Every benchmark ran on actual hardware (GCP L4 GPUs), every pattern emerged from solving real problems, and all code is available at github.com/bhatti/vllm-tutorial.


Why Hosted LLM Access Isn’t Enough

Even with managed services like Vertex AI or Bedrock, enterprise AI needs additional layers that most organizations overlook:

Cost Management

  • No intelligent routing between models (GPT-4 for simple definitions that Phi-2 could handle)
  • No per-user, per-team budgets or limits
  • No cost attribution or chargeback
  • Result: Unpredictable expenses, no accountability

Observability

  • Can’t track which prompts users send
  • Can’t identify failing queries or quality degradation
  • Can’t measure actual usage patterns
  • Result: Flying blind when issues occur

Security & Governance

  • Data flows through third-party infrastructure
  • No granular access controls beyond API keys
  • Limited audit trails for compliance
  • Result: Compliance gaps, security risks

Performance Control

  • Can’t deploy custom fine-tuned models
  • No A/B testing between models
  • Limited control over routing logic
  • Result: Vendor lock-in, inflexibility

The Solution: vLLM with Production Patterns

After evaluating options, I built prototypes using vLLM—a high-performance inference engine for running open-source LLMs (Llama, Mistral, Phi) on your infrastructure. Think of vLLM as NGINX for LLMs: battle-tested, optimized runtime that makes production deployments feasible.

Why vLLM specifically?

  • PagedAttention: Revolutionary memory management enabling 22.5x higher throughput
  • Continuous batching: Automatically batches requests for maximum efficiency
  • Production-ready: Used by major companies, not experimental
  • Open source: Full control, no vendor lock-in

What I tested:

  • Intelligent model routing (complexity-based selection)
  • Budget enforcement (hard limits, not just monitoring)
  • Prefix caching (80% cost reduction)
  • Quantization (3.7x memory reduction with AWQ)
  • Complete observability (Prometheus + Grafana + Langfuse)
  • Production error handling (retries, circuit breakers, fallbacks)

System Architecture

Here’s the complete system architecture I’ve built and tested:

Production AI requires three monitoring layers:

Layer 1: Infrastructure (Prometheus + Grafana)

  • GPU utilization, memory usage
  • Request rate, error rate, latency (P50, P95, P99)
  • Integration via /metrics endpoint that vLLM exposes
  • Grafana dashboards visualize trends and trigger alerts

Layer 2: Application Metrics

  • Time to First Token (TTFT), tokens per second
  • Cost per request, model distribution
  • Budget tracking (daily, monthly limits)
  • Custom Prometheus metrics embedded in application code

Layer 3: LLM Observability (Langfuse)

  • Full prompt/response history for debugging
  • Cost attribution per user/team
  • Quality tracking over time
  • Essential for understanding what users actually do



Setting Up Your Environment: GCP L4 GPU Setup

Before we dive into the concepts, let’s get your environment ready. I’m using GCP L4 GPUs because they offer the best price/performance for this workload ($0.45/hour), but the code works on any CUDA-capable GPU.

Minimum Hardware Requirements

  • NVIDIA GPU with 16GB+ VRAM (L4, T4, A10G, A100)
  • 4 CPU cores
  • 16GB RAM
  • 100GB disk space

Step 1: Create GCP L4 Instance

# Create instance with L4 GPU
gcloud compute instances create vllm-test \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB \
  --boot-disk-type=pd-ssd \
  --maintenance-policy=TERMINATE

# SSH into instance
gcloud compute ssh vllm-test --zone=us-central1-a

Step 2: Install CUDA 11.8

# Update system
sudo apt update && sudo apt upgrade -y

# Install CUDA 11.8
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run --silent --toolkit

# Add to PATH
echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify
nvidia-smi  # Should show your L4 GPU
nvcc --version  # Should show CUDA 11.8

Troubleshooting: If nvidia-smi doesn’t work, reboot the instance: sudo reboot

Step 3: Install Python Dependencies

# Install Python 3.10
sudo apt install -y python3.10 python3.10-venv python3-pip

# Clone the repository
git clone https://github.com/bhatti/vllm-tutorial.git
cd vllm-tutorial

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

Step 4: Verify Installation

# Test vLLM installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"

# Quick functionality test
python examples/01_basic_vllm.py

Expected output:

Loading model microsoft/phi-2...
Model loaded in 8.3 seconds

Generating response...
Generated 50 tokens in 987ms
Throughput: 41.5 tokens/sec

✓ vLLM is working!

Quick Start

Before we dive deep, let’s get something running:

  1. Clone the repo:
   git clone https://github.com/bhatti/vllm-tutorial.git
   cd vllm-tutorial
  2. If you have a GPU available:
   # Follow setup instructions in README
   python examples/01_basic_vllm.py
  3. No GPU? Run the benchmarks locally:
   # See the actual results from GCP L4 testing
   cat benchmarks/results/01_throughput_results.json
  4. Explore the code in src/ and examples/.

Core Concept 1: Intelligent Model Routing

The problem: Not all queries need your most expensive model.

  • “What is EBITDA?” needs a 30-word definition → Use Phi-2 ($0.0001)
  • “Analyze Microsoft’s 10-K risk factors…” needs deep reasoning → Use Llama-3-8B ($0.0003)

Most teams send everything to their best model, which is wasteful.

The solution: Route queries to the right model based on complexity.

The Three-Tier Routing Strategy

Tier    | Model        | Use Cases              | Cost (per 1K tokens) | % of Queries
Simple  | Phi-2 (2.7B) | Definitions, facts     | $0.0001              | 60%
Medium  | Mistral-7B   | Summaries, comparisons | $0.0002              | 30%
Complex | Llama-3-8B   | Analysis, reasoning    | $0.0003              | 10%

Routing Decision Flow

Implementation: Complexity Classification

Here’s how I classify query complexity:

def classify_complexity(self, prompt: str) -> str:
    """
    Classify prompt complexity to select appropriate model

    Rules:
    - Simple: Definitions, quick facts, <50 words
    - Medium: Summaries, comparisons, 50-150 words
    - Complex: Deep analysis, multi-step reasoning, >150 words
    """
    word_count = len(prompt.split())

    # Keywords indicating complexity
    complex_keywords = [
        "analyze", "compare", "evaluate", "assess risk",
        "recommend", "predict", "forecast", "implications"
    ]

    medium_keywords = [
        "summarize", "explain", "describe", "list",
        "what are", "how does", "differences"
    ]

    has_complex = any(kw in prompt.lower() for kw in complex_keywords)
    has_medium = any(kw in prompt.lower() for kw in medium_keywords)

    # Classification logic
    if word_count > 150 or has_complex:
        return "complex"
    elif word_count > 50 or has_medium:
        return "medium"
    else:
        return "simple"

Why this works:

  • Length is a strong signal (detailed questions need detailed answers)
  • Keywords indicate intent (“analyze” needs more reasoning than “define”)
  • Conservative defaults (when in doubt, route up)

Testing Results

I tested this with 11 queries on GCP L4. Here are the actual results:

Query: "What is EBITDA?"
Classified as: simple → Routed to: Phi-2
Cost: $0.00002038
Latency: 4,843ms (first request, includes model loading)
Quality: ✓ Perfect (simple definition)

Query: "Summarize Apple's Q4 2024 earnings highlights"
Classified as: medium → Routed to: Mistral-7B
Cost: $0.00000865
Latency: 4,827ms
Quality: ✓ Good summary

Query: "Analyze Microsoft's 10-K risk factors and assess their potential impact on future earnings"
Classified as: complex → Routed to: Llama-3-8B
Cost: $0.00001382
Latency: 4,836ms
Quality: ✓ Detailed analysis

Accuracy: 100% (11/11 queries routed correctly)
Cost savings: 30% vs routing everything to the most expensive model

Complete Router

Here’s the full intelligent router (you can find this in src/intelligent_router.py):

import time

from typing import Dict, Optional
from dataclasses import dataclass
from vllm import LLM, SamplingParams

@dataclass
class ModelConfig:
    """Configuration for a model tier"""
    name: str
    complexity: str  # "simple", "medium", "complex"
    cost_per_1k_tokens: float
    max_tokens: int

class IntelligentRouter:
    """
    Production-ready intelligent router with:
    - Complexity-based routing
    - Budget enforcement
    - Cost tracking
    - Fallback handling
    """

    def __init__(self, daily_budget_usd: float = 100.0):
        self.daily_budget_usd = daily_budget_usd
        self.total_cost_today = 0.0

        # Model configurations
        self.models = {
            "phi-2": ModelConfig(
                name="microsoft/phi-2",
                complexity="simple",
                cost_per_1k_tokens=0.0001,
                max_tokens=1024,
            ),
            "mistral-7b": ModelConfig(
                name="mistralai/Mistral-7B-Instruct-v0.2",
                complexity="medium",
                cost_per_1k_tokens=0.0002,
                max_tokens=2048,
            ),
            "llama-3-8b": ModelConfig(
                name="meta-llama/Meta-Llama-3-8B",
                complexity="complex",
                cost_per_1k_tokens=0.0003,
                max_tokens=4096,
            ),
        }

        # Initialize LLM (in production, these would be separate instances)
        self.llm = LLM(
            model=self.models["phi-2"].name,
            trust_remote_code=True,
            gpu_memory_utilization=0.9,
        )

    def route_request(self, prompt: str, max_tokens: int = 200) -> Dict:
        """
        Route request to appropriate model

        Returns:
            Dict with 'response', 'model_used', 'cost', 'latency_ms'
        """
        # Step 1: Classify complexity
        complexity = self.classify_complexity(prompt)

        # Step 2: Select model
        model_id = self._select_model(complexity)
        model_config = self.models[model_id]

        # Step 3: Check budget
        estimated_cost = self._estimate_cost(model_config, prompt, max_tokens)
        if self.total_cost_today + estimated_cost > self.daily_budget_usd:
            # Budget exceeded - fallback to cheapest model
            model_id = "phi-2"
            model_config = self.models[model_id]

        # Step 4: Generate response
        sampling_params = SamplingParams(
            temperature=0.7,
            top_p=0.9,
            max_tokens=max_tokens,
        )

        start_time = time.time()
        outputs = self.llm.generate([prompt], sampling_params)
        latency_ms = (time.time() - start_time) * 1000

        # Step 5: Track cost
        tokens_generated = len(outputs[0].outputs[0].token_ids)
        actual_cost = self._calculate_cost(model_config, prompt, tokens_generated)
        self.total_cost_today += actual_cost

        return {
            "response": outputs[0].outputs[0].text,
            "model_used": model_id,
            "cost_usd": actual_cost,
            "latency_ms": latency_ms,
            "tokens_generated": tokens_generated,
        }

    def _select_model(self, complexity: str) -> str:
        """Select model based on complexity"""
        for model_id, config in self.models.items():
            if config.complexity == complexity:
                return model_id
        return "phi-2"  # Default fallback

    def _estimate_cost(self, config: ModelConfig, prompt: str, max_tokens: int) -> float:
        """Estimate cost before generation"""
        input_tokens = len(prompt) / 4  # Rough estimate
        total_tokens = input_tokens + max_tokens
        return (total_tokens / 1000) * config.cost_per_1k_tokens

    def _calculate_cost(self, config: ModelConfig, prompt: str, tokens_generated: int) -> float:
        """Calculate actual cost after generation"""
        input_tokens = len(prompt) / 4
        total_tokens = input_tokens + tokens_generated
        return (total_tokens / 1000) * config.cost_per_1k_tokens

How to use it:

# Initialize router with daily budget
router = IntelligentRouter(daily_budget_usd=100.0)

# Route a simple query
result = router.route_request("What is gross margin?")
print(f"Model used: {result['model_used']}")  # phi-2
print(f"Cost: ${result['cost_usd']:.6f}")     # $0.000020

# Route a complex query
result = router.route_request(
    "Analyze Tesla's competitive positioning in the EV market "
    "and provide investment recommendations based on recent trends"
)
print(f"Model used: {result['model_used']}")  # llama-3-8b
print(f"Cost: ${result['cost_usd']:.6f}")     # $0.000138

Core Concept 2: Budget Enforcement

The problem: Monitoring costs isn’t the same as preventing them.

I have seen hundreds of thousands of dollars spent on a company AI hackathon because developers were using expensive models needlessly.

The solution: Hard limits that reject requests before they burn your budget.

The Three Levels of Budget Control

from dataclasses import dataclass
from datetime import datetime

@dataclass
class BudgetConfig:
    """Budget configuration with multiple enforcement levels"""
    max_cost_per_request: float = 0.50        # Level 1: prevent accidents
    daily_budget_usd: float = 100.0           # Level 2: daily cap
    monthly_budget_usd: float = 3000.0        # Level 3: monthly cap
    warning_threshold_pct: float = 0.80       # Warn at 80%

class BudgetEnforcer:
    """Hard budget enforcement - prevents spending, not just monitors"""
    
    def __init__(self, config: BudgetConfig):
        self.config = config
        self.daily_spend = 0.0
        self.monthly_spend = 0.0
        # ... implementation
    
    def check_budget(self, estimated_cost: float) -> Dict:
        """Check BEFORE generating - this is the key difference"""
        
        # Level 1: Per-request limit
        if estimated_cost > self.config.max_cost_per_request:
            return {"action": "reject", "reason": "Request too expensive"}
        
        # Level 2: Daily budget
        if self.daily_spend + estimated_cost > self.config.daily_budget_usd:
            return {"action": "downgrade", "reason": "Daily limit approaching"}
        
        # Level 3: Monthly budget
        if self.monthly_spend + estimated_cost > self.config.monthly_budget_usd:
            return {"action": "downgrade", "reason": "Monthly limit approaching"}
        
        return {"action": "allow"}

Best Practices

  • Set conservative limits initially
  • Monitor budget utilization trends
  • Implement graceful degradation
  • Track who’s using what

Core Concept 3: Prefix Caching

Problem: You’re paying to process the same content repeatedly.

In enterprise AI, you typically have a structure like this:

[Fixed System Prompt - 500 tokens]
You are a financial analyst AI assistant specializing in:
- Earnings report analysis
- SEC filing interpretation
- Market sentiment analysis
...

[User Query - 50 tokens]
What is EBITDA?

[Response - 100 tokens]
EBITDA stands for...

Total tokens: 650 (500 system + 50 query + 100 response)
What you pay for: All 650 tokens, every single request

The Solution: Prefix Caching

vLLM has a feature called “prefix caching” that solves this elegantly: the KV cache for a shared prompt prefix is computed once and reused across requests.

How to Enable It

from vllm import LLM

# WITHOUT prefix caching
llm = LLM(
    model="microsoft/phi-2",
    trust_remote_code=True,
)

# WITH prefix caching (80% cost reduction!)
llm = LLM(
    model="microsoft/phi-2",
    trust_remote_code=True,
    enable_prefix_caching=True,  # <-- That's it!
)

Testing Results

I tested this on GCP L4 with our end-to-end integration test. Here are the actual numbers:

Test setup:

  • Fixed system prompt: 500 tokens
  • 11 different user queries: 15-290 tokens each
  • Model: Phi-2 (2.7B)

Results WITHOUT prefix caching:

Request 1: $0.00010188 (full cost)
Request 2: $0.00010188 (full cost)
Request 3: $0.00010188 (full cost)
...
Total: $0.00112068 (11 × $0.00010188)

Results WITH prefix caching:

Request 1: $0.00002038 (full cost - establishes cache)
Request 2: $0.00000414 (80% cheaper - uses cache!)
Request 3: $0.00000409 (80% cheaper)
Request 4: $0.00000865 (80% cheaper)
...
Total: $0.00010031

Savings: $0.00102037 (91% reduction!)
Cache hit rate: 90.9% (10/11 requests)

Here is what just happened:

  • Same 11 queries
  • Same model
  • Same responses
  • One parameter change
  • 91% cost reduction

Best use cases:

  • RAG systems (fixed context, many questions): 80% savings
  • Template generation (fixed format, variable content): 70% savings
  • Conversations (history grows, new turns added): 50% savings

When it doesn’t help:

  • Every request is unique (no repeated prefix)
  • Prefix changes frequently (cache invalidated)
  • Very short queries (overhead dominates)

Rule of thumb: If you have a fixed prefix >200 tokens reused across requests, enable prefix caching.


Core Concept 4: Quantization

The problem: The models you want don’t fit in the GPUs you can afford.

  • Llama-3-70B in full precision: Requires 140GB GPU memory
  • Your budget: Maybe a 24GB L4 GPU
  • The gap: 116GB short

The solution: Use fewer bits per number with minimal quality loss, e.g., converting FP16 into INT8.

Quantization Schemes

Method          | Memory  | Compression | Quality Loss | Works On
FP16 (baseline) | 19.3 GB | 1×          | 0%           | All GPUs
AWQ             | 5.2 GB  | 3.7×        | ~2%          | L4, A100
FP8             | ~9.7 GB | 2×          | ~1%          | H100 only

I’ve tested three quantization approaches on GCP L4:

1. FP8 (8-bit floating point)

  • Compression: 2x (FP16 → FP8)
  • Quality: ~99% of original
  • Speed: Same or faster (better memory bandwidth)
  • Limitation: Requires H100 GPU (NOT supported on L4)

2. AWQ (Activation-aware Weight Quantization)

  • Compression: 3.7x (FP16 → W4A16)
  • Quality: ~98% of original
  • Speed: Slightly slower than FP16
  • Limitation: Requires pre-quantized model

3. GPTQ (Post-training quantization)

  • Compression: 3.5x (FP16 → INT4)
  • Quality: ~97% of original
  • Speed: Similar to AWQ
  • Limitation: Longer quantization process

Benchmark Results

I ran quantization benchmarks on GCP L4 with Phi-2. Here’s what I measured (from benchmarks/04_quantization_comparison.py):

# Benchmark code (excerpt; the full file also imports time, torch, and vllm)
class QuantizationBenchmark:
    def benchmark_quantization(self, quantization: str):
        """
        Test quantization scheme

        Args:
            quantization: "none" (FP16), "fp8", "awq", or "gptq"
        """
        llm_kwargs = {
            "model": "microsoft/phi-2",
            "trust_remote_code": True,
            "gpu_memory_utilization": 0.9,
            "max_model_len": 1024,
        }

        # Add quantization if specified
        if quantization != "none":
            llm_kwargs["quantization"] = quantization

        # Load model
        start = time.time()
        llm = LLM(**llm_kwargs)
        load_time = time.time() - start

        # Measure memory
        gpu_memory = torch.cuda.memory_allocated() / 1e9  # GB

        # Benchmark generation
        prompt = "Explain quantum computing in simple terms"
        sampling_params = SamplingParams(max_tokens=100)

        start = time.time()
        outputs = llm.generate([prompt], sampling_params)
        latency_ms = (time.time() - start) * 1000

        return {
            "quantization": quantization,
            "memory_gb": gpu_memory,
            "load_time_sec": load_time,
            "latency_ms": latency_ms,
            "tokens_per_sec": 100 / (latency_ms / 1000),
        }

How to Use AWQ Quantization

The easiest approach is using pre-quantized models from HuggingFace:

from vllm import LLM

# Option 1: Use pre-quantized AWQ model
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # Pre-quantized!
    quantization="awq",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

# That's it! 3.7x smaller, ready to use

Available AWQ models (from TheBloke on HuggingFace):

  • Llama-2-7B-AWQ
  • Llama-2-13B-AWQ
  • Mistral-7B-Instruct-v0.2-AWQ
  • CodeLlama-7B-AWQ
  • Mixtral-8x7B-AWQ

Memory savings example:

# Mistral-7B in FP16
llm_fp16 = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
# Memory: ~16GB VRAM
# Fits on: A100-40GB, L4 (barely)

# Mistral-7B in AWQ
llm_awq = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq"
)
# Memory: ~4.3GB VRAM
# Fits on: T4-16GB, L4 (comfortably), even RTX 3090

# Savings: 72% memory reduction!

When to Use Quantization

✓ Use quantization when:

  • You’re memory-constrained (model doesn’t fit)
  • You want to use cheaper GPUs
  • Quality loss <2% is acceptable
  • You’re deploying at scale (cost matters)

✗ Skip quantization when:

  • You have unlimited GPU budget (rare!)
  • You need absolute maximum quality
  • Model already fits comfortably
  • You’re still prototyping (optimize later)

My recommendation: Start with AWQ for all production deployments. The cost savings alone justify it, and quality loss is negligible for most tasks.


Core Concept 5: Complete Observability

The problem: When your AI system breaks, you need to know what, when, why, and who.

The solution: Three monitoring layers.

The Three Layers of AI Observability

Layer 1: Infrastructure (What Prometheus tracks)

GPU metrics:

  • Memory usage (prevent out-of-memory)
  • Utilization (optimize capacity)
  • Temperature (hardware health)

Service metrics:

  • Request rate (traffic patterns)
  • Error rate (system health)
  • Latency percentiles (user experience)

Layer 2: Application Metrics (AI-specific)

  • Time to First Token (TTFT)
  • Inter-Token Latency (ITL)
  • Tokens per second
  • Cost per request
  • Model distribution
  • Tool: Custom metrics in Prometheus

Layer 3: LLM Observability (Content-level)

  • What prompts are users sending?
  • What responses are being generated?
  • Cost attribution per user/team
  • Quality trends over time
  • Tool: Langfuse or Arize Phoenix

Custom Application Metrics

Here’s how I export custom metrics from the vLLM application – Layer 2 (from src/observability_monitoring.py):

from prometheus_client import Counter, Histogram, Gauge, generate_latest
from typing import Dict
import time
import torch

class VLLMMetrics:
    """
    Production metrics for vLLM serving

    Tracks:
    - Request counts (total, success, failure)
    - Latency distributions (P50, P95, P99)
    - Token throughput
    - Cost tracking
    - Model distribution
    """

    def __init__(self):
        # Request counters
        self.requests_total = Counter(
            'vllm_requests_total',
            'Total number of requests',
            ['model', 'status']
        )

        # Latency histogram
        self.latency = Histogram(
            'vllm_latency_ms',
            'Request latency in milliseconds',
            ['model'],
            buckets=[10, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
        )

        # Token metrics
        self.tokens_generated = Counter(
            'vllm_tokens_generated_total',
            'Total tokens generated',
            ['model']
        )

        self.tokens_per_second = Gauge(
            'vllm_tokens_per_second',
            'Current tokens per second',
            ['model']
        )

        # Cost tracking
        self.cost_usd = Counter(
            'vllm_cost_usd_total',
            'Total cost in USD',
            ['model']
        )

        self.daily_cost = Gauge(
            'vllm_daily_cost_usd',
            'Cost today in USD'
        )

        # GPU memory
        self.gpu_memory_used = Gauge(
            'vllm_gpu_memory_used_gb',
            'GPU memory used in GB'
        )

        self.gpu_memory_total = Gauge(
            'vllm_gpu_memory_total_gb',
            'Total GPU memory in GB'
        )

        # Cache metrics
        self.cache_hit_rate = Gauge(
            'vllm_cache_hit_rate',
            'Prefix cache hit rate'
        )

    def record_request(
        self,
        model: str,
        latency_ms: float,
        tokens: int,
        cost_usd: float,
        success: bool,
        cached: bool = False
    ):
        """Record request metrics"""

        # Update counters
        status = "success" if success else "failure"
        self.requests_total.labels(model=model, status=status).inc()

        if success:
            # Latency
            self.latency.labels(model=model).observe(latency_ms)

            # Tokens
            self.tokens_generated.labels(model=model).inc(tokens)
            tokens_per_sec = tokens / (latency_ms / 1000)
            self.tokens_per_second.labels(model=model).set(tokens_per_sec)

            # Cost
            self.cost_usd.labels(model=model).inc(cost_usd)

    def update_gpu_memory(self):
        """Update GPU memory metrics"""
        if torch.cuda.is_available():
            used_gb = torch.cuda.memory_allocated() / 1e9
            total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9

            self.gpu_memory_used.set(used_gb)
            self.gpu_memory_total.set(total_gb)

    def export_metrics(self) -> str:
        """Export Prometheus metrics"""
        return generate_latest().decode('utf-8')

# Usage in FastAPI (GenerateRequest, llm, and calculate_cost are defined elsewhere in the app)
from fastapi import FastAPI, Response

app = FastAPI()
metrics = VLLMMetrics()

@app.post("/generate")
async def generate(request: GenerateRequest):
    start = time.time()

    try:
        # Generate response
        result = llm.generate(request.prompt)

        # Record success metrics
        latency_ms = (time.time() - start) * 1000
        metrics.record_request(
            model="phi-2",
            latency_ms=latency_ms,
            tokens=len(result.tokens),
            cost_usd=calculate_cost(result),
            success=True,
        )

        return result

    except Exception as e:
        # Record failure
        latency_ms = (time.time() - start) * 1000
        metrics.record_request(
            model="phi-2",
            latency_ms=latency_ms,
            tokens=0,
            cost_usd=0,
            success=False,
        )
        raise

@app.get("/metrics")
async def get_metrics():
    """Prometheus scrape endpoint"""
    metrics.update_gpu_memory()
    return Response(
        content=metrics.export_metrics(),
        media_type="text/plain"
    )

What this tracks:

  • ✓ Request rate (by model, by status)
  • ✓ Latency distribution (with percentiles)
  • ✓ Token throughput (tokens/sec)
  • ✓ Cost tracking (per model, daily total)
  • ✓ GPU memory usage
  • ✓ Cache hit rates

Integration code for Langfuse – Layer 3 (from examples/05_llm_observability.py):

import os
import time
from typing import Dict

from langfuse import Langfuse

# Initialize Langfuse
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST", "http://localhost:3001"),
)

def generate_with_observability(prompt: str, user_id: str, metadata: Dict = None):
    """Generate response with full Langfuse tracing"""

    # Create trace
    trace = langfuse.trace(
        name="financial_analysis",
        user_id=user_id,
        metadata=metadata or {},
    )

    # Start generation span
    generation = trace.generation(
        name="vllm_generate",
        model="microsoft/phi-2",
        input=prompt,
        metadata={
            "quantization": "awq",
            "max_tokens": 200,
        }
    )

    # Generate
    start = time.time()
    result = llm.generate(prompt)
    latency_ms = (time.time() - start) * 1000

    # Calculate cost
    tokens_in = len(prompt) / 4
    tokens_out = len(result.tokens)
    cost_usd = ((tokens_in + tokens_out) / 1000) * 0.0001

    # End span with metrics
    generation.end(
        output=result.text,
        usage={
            "input_tokens": int(tokens_in),
            "output_tokens": tokens_out,
            "total_tokens": int(tokens_in + tokens_out),
        },
        metadata={
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "model": "phi-2",
        }
    )

    return result

# Usage
result = generate_with_observability(
    prompt="Analyze Apple's Q4 earnings",
    user_id="analyst_001",
    metadata={
        "team": "equity_research",
        "department": "finance",
    }
)

You can see the following in the Langfuse dashboard:

  • Every prompt and response
  • Cost per request, per user, per team
  • Latency trends over time
  • Token usage patterns
  • Quality scores (if you add feedback)
  • Prompt versions (track what works)

Alerting Strategy

You can configure alerting at various severity levels, such as:

Critical (PagerDuty/Phone):

  • Service down
  • Error rate >10%
  • Daily budget exceeded by 50%
  • GPU out of memory

Warning (Slack):

  • Error rate >5%
  • P95 latency >1000ms
  • Daily budget at 80%
  • GPU memory >95%

Info (Email):

  • Daily usage summary
  • Cost reports
  • Quality metrics

Observability isn’t optional for production AI—it’s essential.


Core Concept 6: Production Error Handling

Your AI system will fail. GPUs crash, networks drop, users send garbage, budgets get exceeded.

Error Handling Pattern Flow

Five essential patterns:

Pattern 1: Retry with Exponential Backoff

Here is retry logic with exponential backoff (from examples/07_advanced_error_handling.py):

from typing import Callable
from dataclasses import dataclass
import time

@dataclass
class RetryConfig:
    """Retry configuration"""
    max_retries: int = 3
    initial_delay: float = 1.0
    max_delay: float = 60.0
    exponential_base: float = 2.0

def retry_with_backoff(config: RetryConfig = RetryConfig()):
    """
    Decorator: Retry with exponential backoff

    Example:
        @retry_with_backoff()
        def generate_text(prompt):
            return llm.generate(prompt)
    """
    def decorator(func: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            delay = config.initial_delay

            for attempt in range(config.max_retries):
                try:
                    return func(*args, **kwargs)

                except Exception as e:
                    if attempt == config.max_retries - 1:
                        raise  # Last attempt, re-raise

                    # classify_error and ErrorType come from the same example
                    # module (examples/07_advanced_error_handling.py)
                    error_type = classify_error(e)

                    # Don't retry on invalid input
                    if error_type == ErrorType.INVALID_INPUT:
                        raise

                    print(f"Attempt {attempt + 1} failed: {error_type.value}")
                    print(f"   Retrying in {delay:.1f}s...")
                    time.sleep(delay)

                    # Exponential backoff
                    delay = min(delay * config.exponential_base, config.max_delay)

            raise RuntimeError(f"Failed after {config.max_retries} retries")

        return wrapper
    return decorator

# Usage
@retry_with_backoff(RetryConfig(max_retries=3, initial_delay=1.0))
def generate_with_retry(prompt: str):
    """Generate with automatic retry on failure"""
    return llm.generate(prompt)

# This will retry up to 3 times with exponential backoff
result = generate_with_retry("Analyze earnings report")

Pattern 2: Circuit Breaker

When a service starts failing repeatedly, stop calling it:

from datetime import datetime, timedelta
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery

class CircuitBreaker:
    """
    Circuit breaker for fault tolerance

    Prevents cascading failures by stopping calls to
    failing services
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        timeout: int = 60,
        expected_exception: type = Exception
    ):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.expected_exception = expected_exception

        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func: Callable, *args, **kwargs):
        """Execute function with circuit breaker protection"""

        if self.state == CircuitState.OPEN:
            # Check if timeout elapsed
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
                self.state = CircuitState.HALF_OPEN
                print("? Circuit breaker: HALF_OPEN (testing recovery)")
            else:
                raise RuntimeError("Circuit breaker OPEN - service unavailable")

        try:
            result = func(*args, **kwargs)

            # Success - reset if recovering
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                print("? Circuit breaker: CLOSED (service recovered)")

            return result

        except self.expected_exception as e:
            self.failure_count += 1
            self.last_failure_time = datetime.now()

            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
                print(f"? Circuit breaker: OPEN (threshold {self.failure_threshold} reached)")

            raise

# Usage
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def generate_protected(prompt: str):
    """Generate with circuit breaker protection"""
    return circuit_breaker.call(llm.generate, prompt)

# If llm.generate fails 5 times, circuit breaker opens
# Requests fail fast for 60 seconds
# Then one test request (half-open)
# If successful, normal operation resumes

This prevents:

  • Thundering herd problem
  • Resource exhaustion
  • Long timeouts on every request

Pattern 3: Rate Limiting

Protect your system from overload:

import time

class RateLimiter:
    """
    Token bucket rate limiter

    Limits requests per second to prevent overload
    """

    def __init__(self, max_requests: int, time_window: float = 1.0):
        self.max_requests = max_requests
        self.time_window = time_window
        self.tokens = max_requests
        self.last_update = time.time()

    def acquire(self, tokens: int = 1) -> bool:
        """Try to acquire tokens, return True if allowed"""

        now = time.time()
        elapsed = now - self.last_update

        # Refill tokens based on elapsed time
        self.tokens = min(
            self.max_requests,
            self.tokens + (elapsed / self.time_window) * self.max_requests
        )
        self.last_update = now

        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        else:
            return False

    def wait_for_token(self, tokens: int = 1):
        """Wait until token is available"""
        while not self.acquire(tokens):
            time.sleep(0.1)

# Usage
rate_limiter = RateLimiter(max_requests=100, time_window=1.0)

@app.post("/generate")
async def generate(request: GenerateRequest):
    # Check rate limit
    if not rate_limiter.acquire():
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded (100 req/sec)"
        )

    # Process request
    result = llm.generate(request.prompt)
    return result

Why this matters:

  • Prevents DoS (accidental or malicious)
  • Protects GPU from overload
  • Ensures fair usage

Pattern 4: Fallback Strategies

When primary fails, don’t just error—degrade gracefully:

def generate_with_fallback(prompt: str) -> str:
    """
    Try multiple strategies before failing

    Strategy 1: Primary model (Llama-3-8B)
    Strategy 2: Cached response (if available)
    Strategy 3: Simpler model (Phi-2)
    Strategy 4: Template response
    """

    # Try primary model
    try:
        return llm_primary.generate(prompt)

    except Exception as e:
        print(f"??  Primary model failed: {e}")

        # Fallback 1: Check cache
        cached_response = cache.get(prompt)
        if cached_response:
            print("? Returning cached response")
            return cached_response

        # Fallback 2: Try simpler model
        try:
            print("? Falling back to Phi-2")
            return llm_simple.generate(prompt)

        except Exception as e2:
            print(f"??  Fallback model also failed: {e2}")

            # Fallback 3: Template response
            return (
                "I apologize, but I'm unable to process your request right now. "
                "Please try again in a few minutes, or contact support if the issue persists."
            )

# User never sees "Internal Server Error"
# They always get SOME response

Graceful degradation examples:

  • Can’t generate full analysis? Return summary
  • Can’t use complex model? Use simple model
  • Can’t generate? Return cached response
  • Everything failing? Return polite error message

Pattern 5: Timeout Handling

Don’t let requests hang forever:

import signal

class TimeoutError(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutError("Request timed out")

def generate_with_timeout(prompt: str, timeout_seconds: int = 30):
    """Generate with timeout"""

    # Set timeout
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)

    try:
        result = llm.generate(prompt)

        # Cancel timeout
        signal.alarm(0)
        return result

    except TimeoutError:
        print(f"? Request timed out after {timeout_seconds}s")
        return "Request timed out. Please try a shorter prompt."

# Or using asyncio
import asyncio

async def generate_with_timeout_async(prompt: str, timeout_seconds: int = 30):
    """Generate with async timeout"""

    try:
        result = await asyncio.wait_for(
            llm.generate_async(prompt),
            timeout=timeout_seconds
        )
        return result

    except asyncio.TimeoutError:
        return "Request timed out. Please try a shorter prompt."

Why timeouts matter:

  • Prevent resource leaks
  • Free up GPU for other requests
  • Give users fast feedback

Combined Example

Here’s how I combine all patterns:

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Initialize components (CircuitBreaker, RateLimiter, RetryConfig, and
# retry_with_backoff are the classes defined in the patterns above;
# ResponseCache is a simple TTL cache from the repo)
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)
rate_limiter = RateLimiter(max_requests=100, time_window=1.0)
cache = ResponseCache(ttl=3600)

@app.post("/generate")
@retry_with_backoff(max_retries=3)
async def generate(request: GenerateRequest):
    """
    Generate with full error handling:
    - Rate limiting
    - Circuit breaker
    - Retry with backoff
    - Timeout
    - Fallback strategies
    - Caching
    """

    # Rate limiting
    if not rate_limiter.acquire():
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

    # Check cache first
    cached = cache.get(request.prompt)
    if cached:
        return {"text": cached, "cached": True}

    try:
        # Circuit breaker protection
        result = circuit_breaker.call(
            generate_with_timeout,
            request.prompt,
            timeout_seconds=30
        )

        # Cache successful response
        cache.set(request.prompt, result)

        return {"text": result, "status": "success"}

    except CircuitBreakerError:
        # Circuit breaker open - return fallback
        return {
            "text": "Service temporarily unavailable. Using cached response.",
            "status": "degraded",
            "fallback": True
        }

    except TimeoutError:
        raise HTTPException(status_code=504, detail="Request timed out")

    except Exception as e:
        # Log error
        logger.error(f"Generation failed: {e}")

        # Return graceful error
        return {
            "text": "I apologize, but I'm unable to process your request.",
            "status": "error",
            "fallback": True
        }

What this provides:

  • Prevents overload (rate limiting)
  • Fast failure (circuit breaker)
  • Automatic recovery (retry)
  • Resource protection (timeout)
  • Graceful degradation (fallback)
  • Performance (caching)
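
The combined example also leans on the retry_with_backoff decorator from the earlier retry pattern. Since the endpoint is async, a minimal async-aware sketch of that interface looks roughly like this (an assumption about its shape, not the original implementation):

import asyncio
import functools
import random

def retry_with_backoff(max_retries: int = 3, base_delay: float = 0.5):
    """Retry an async function with exponential backoff and a little jitter."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    # In practice, retry only transient errors (timeouts, 5xx),
                    # not client errors like rate-limit rejections.
                    if attempt == max_retries:
                        raise
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
                    await asyncio.sleep(delay)
        return wrapper
    return decorator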

Deployment Recommendations

While my testing remained at POC level, these patterns prepare for production deployment:

Before deploying:

Load Testing

  • Test with expected peak load (10-100x normal traffic)
  • Measure P95 latency under load (<500ms target)
  • Verify error rate stays <1%
  • Confirm GPU memory stable (no leaks)
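
A load test does not need heavy tooling; a short async script is enough to sanity-check the P95 and error-rate targets above. This sketch assumes httpx is installed and a local /generate endpoint that accepts a prompt field; adjust the URL, payload, and volume for your deployment:

import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/generate"   # assumed local deployment
CONCURRENCY = 50
TOTAL_REQUESTS = 500

async def one_request(client, latencies, errors):
    start = time.perf_counter()
    try:
        resp = await client.post(URL, json={"prompt": "Summarize vLLM in one sentence."}, timeout=60)
        if resp.status_code != 200:
            errors.append(resp.status_code)
    except Exception as exc:
        errors.append(str(exc))
    latencies.append(time.perf_counter() - start)

async def main():
    latencies, errors = [], []
    async with httpx.AsyncClient(limits=httpx.Limits(max_connections=CONCURRENCY)) as client:
        for _ in range(TOTAL_REQUESTS // CONCURRENCY):
            await asyncio.gather(*[one_request(client, latencies, errors) for _ in range(CONCURRENCY)])
    p95 = statistics.quantiles(latencies, n=20)[18]   # 95th percentile
    print(f"P95 latency: {p95 * 1000:.0f} ms, error rate: {len(errors) / TOTAL_REQUESTS:.2%}")

asyncio.run(main())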

Production Deployment Checklist

Before going live, verify:

Infrastructure:

  • [ ] GPU drivers installed and working (nvidia-smi)
  • [ ] Docker and Docker Compose installed
  • [ ] Sufficient disk space (200GB+ for models)
  • [ ] Network configured (firewall rules, security groups)
  • [ ] SSL/TLS certificates (for HTTPS)

Configuration:

  • [ ] Model name set correctly in .env
  • [ ] Quantization configured (AWQ recommended)
  • [ ] GPU memory utilization set (0.9 typical)
  • [ ] Prefix caching enabled (ENABLE_PREFIX_CACHING=True)
  • [ ] Budget limits configured
  • [ ] Log level appropriate (info for prod)

Monitoring:

  • [ ] Prometheus scraping vLLM metrics
  • [ ] Grafana dashboard imported and working
  • [ ] Alerts configured in alert_rules.yml
  • [ ] Alert destinations set (PagerDuty, Slack, email)
  • [ ] Langfuse set up (if using LLM observability)

Testing:

  • [ ] Health check returns 200 OK
  • [ ] Can generate completions via API
  • [ ] Metrics endpoint returning data
  • [ ] Error handling works (try invalid input)
  • [ ] Budget limits enforced (if configured)
  • [ ] Load test passed (see next section)

Security:

  • [ ] API authentication enabled
  • [ ] Rate limiting configured
  • [ ] HTTPS enforced (no HTTP)
  • [ ] CORS policies set
  • [ ] Input validation in place
  • [ ] Secrets not in git (use env variables)

Operations:

  • [ ] Backup strategy for logs
  • [ ] Model cache backed up
  • [ ] Runbook written (how to handle incidents)
  • [ ] On-call rotation defined
  • [ ] SLAs documented
  • [ ] Disaster recovery plan
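
For the testing items above, a quick smoke-test script can cover the health, completion, and metrics checks in one pass. The endpoints and model name below assume the vLLM OpenAI-compatible server on localhost:8000; adjust them to match your gateway and .env:

import requests

BASE = "http://localhost:8000"            # assumed host/port

# 1. Health check should return 200 OK
assert requests.get(f"{BASE}/health", timeout=10).status_code == 200

# 2. A completion request should return generated text
resp = requests.post(
    f"{BASE}/v1/completions",
    json={"model": "microsoft/phi-2", "prompt": "Hello", "max_tokens": 16},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])

# 3. Metrics endpoint should return Prometheus data
metrics = requests.get(f"{BASE}/metrics", timeout=10).text
assert "vllm" in metrics

print("Smoke test passed")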

Real-World Results

Testing on GCP L4 GPUs with 11 queries produced these validated results:

End-to-End Integration Test Results

Test configuration:

  • Model: Phi-2 (2.7B parameters)
  • Quantization: None (FP16 baseline)
  • Prefix caching: Enabled
  • Budget: $10/day
  • Hardware: GCP L4 GPU

Results:

Metric                    Value
Total Requests            11
Success Rate              100% (11/11)
Total Tokens Generated    2,200
Total Cost                $0.000100
Average Latency           5,418 ms
Cache Hit Rate            90.9%
Budget Utilization        0.001%

Model distribution:

  • Phi-2: 54.5% (6 requests)
  • Llama-3-8B: 27.3% (3 requests)
  • Mistral-7B: 18.2% (2 requests)

What this proves:

  • Intelligent routing works (3 models selected correctly)
  • Budget enforcement works (under budget, no overruns)
  • Prefix caching works (91% hit rate = huge savings)
  • Multi-model support works (distributed correctly)
  • Observability works (all metrics collected)

Cost Comparison

Let me show you the exact cost calculations:

Per-request costs (from actual test):

Request 1 (uncached): $0.00002038
Requests 2-11 (cached): $0.00000414 average

Total: $0.00010031 for 11 requests
Average: $0.0000091 per request

Extrapolated monthly costs (10,000 requests/day):

Configuration                  Daily Cost    Monthly Cost    Savings
Without caching                $0.91         $27.30          Baseline
With caching (91% hit rate)    $0.18         $5.46           80%
With quantization (AWQ)        $0.09         $2.73           90%
All optimizations              $0.09         $2.73           90%

Add in infrastructure costs:

GCP L4 GPU: $0.45/hour = $328/month

Total monthly cost:
- Infrastructure: $328
- API costs: $2.73
- Total: $330.73/month for 10,000 requests/day

Compare to OpenAI:

OpenAI GPT-4:
- Input: $0.03 per 1K tokens
- Output: $0.06 per 1K tokens
- Average request: 100 tokens in + 100 tokens out = $0.009
- 10,000 requests/day = $90/day = $2,700/month

Savings: $2,369/month (88% cheaper!)

Benchmark Results Summary

Here are all the benchmark results from GCP L4:

1. Throughput Benchmark (benchmarks/01_throughput_comparison.py)

Batch Size    Tokens/Sec    Latency (ms)    Speedup
1             41.5          987             1x
4             165.8         247             4x
8             331.6         124             8x
16            663.2         62              16x
32            934.4         99              22.5x

Key insight: Batching provides massive throughput improvements (22.5x!)

2. Memory Efficiency (benchmarks/02_memory_efficiency.py)

Batch Size    Memory Used (GB)    Overhead
1             19.30               Baseline
4             19.33               +0.16%
8             19.38               +0.41%
16            19.45               +0.78%
32            19.58               +1.45%

Key insight: PagedAttention keeps memory growth near zero even with large batches

3. Cost Analysis (benchmarks/03_cost_analysis.py)

Scenario                  Cost/Month    vs GPT-4
OpenAI GPT-4              $666          Baseline
OpenAI GPT-3.5            $15           -98%
vLLM Phi-2 (FP16)         $324          -51%
vLLM + AWQ                $87           -87%
vLLM + AWQ + Caching      $65           -90%
All optimizations         $45           -93%

Key insight: Self-hosting with vLLM is 93% cheaper than OpenAI GPT-4

4. Quantization (benchmarks/04_quantization_comparison.py)

Scheme    Memory (GB)    Compression    Quality Loss
FP16      19.3           1x             0%
AWQ       5.2            3.7x           ~2%

Key insight: AWQ provides 3.7x compression with minimal quality loss
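
For reference, turning on AWQ and prefix caching through vLLM's Python API looks roughly like this; the checkpoint name and settings are placeholders to match against your own .env, not the exact configuration used in these benchmarks:

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",   # placeholder: any AWQ-quantized checkpoint
    quantization="awq",
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)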

What the benchmarks validated:

  • Intelligent routing correctly classified 100% of queries
  • Budget enforcement prevented overruns
  • Prefix caching delivered the promised 80% savings
  • Multi-model support distributed load appropriately
  • Observability captured all metrics accurately

What Surprised Me

Good surprises:

  1. Cache hit rates higher than expected – I expected 70%, got 91%
  2. Quantization quality loss minimal – Barely noticeable in real use
  3. vLLM stability – Zero crashes during testing
  4. Cost savings magnitude – 93% cheaper than GPT-4 is huge

Challenges:

  1. FP8 not supported on L4 – Had to use AWQ instead (still great)
  2. First request slow – Model loading takes 8 seconds (then fast)
  3. Large context memory usage – 2K tokens works, 4K+ needs more GPU

ROI Calculation (50,000 requests/day)

Option A: OpenAI GPT-4

Cost per request: $0.009
Daily: $450
Monthly: $13,500
Annual: $162,000

Option B: vLLM on GCP L4 (our solution)

Infrastructure: $328/month
API costs (with optimizations): $13.65/month
Monthly total: $341.65
Annual: $4,100

Savings: $157,900/year (97%)

Break-even:

Setup time: 2 days engineering ($2,000)
Maintenance: 4 hours/month ($200/month)

Year 1:
  Savings: $157,900
  Costs: $2,000 setup + $2,400 maintenance = $4,400
  Net: $153,500 saved

ROI: 3,500% in year 1

At scale (500,000 requests/day):

OpenAI GPT-4: $1,350,000/year
vLLM solution: $41,000/year

Savings: $1,309,000/year (97%)

Production Readiness Checklist

Based on testing, here’s what enterprise deployment requires:

Security & Compliance:

  • Authentication/authorization at API level
  • Data encryption (rest and transit)
  • PII detection and redaction capabilities
  • Audit logs for compliance (GDPR, HIPAA)
  • Network security (VPC, firewalls, no public exposure)

Operational Excellence:

  • Comprehensive monitoring (3 layers: infra, app, LLM)
  • Alerting strategy (critical/warning/info tiers)
  • Structured logging with aggregation
  • Backup/recovery procedures tested
  • Incident response runbook documented

Performance & Scale:

  • Load testing validates capacity
  • P95 latency meets SLAs (<500ms)
  • Success rate >99.9% under load
  • Auto-scaling strategy defined
  • Capacity planning for 2x, 5x, 10x growth

Cost Governance:

  • Hard budget limits (daily, monthly; a minimal guard is sketched after this checklist)
  • Per-user and per-team tracking
  • Cost dashboards for visibility
  • Automated alerts at 80%, 100%
  • Chargeback reports for finance

Quality Assurance:

  • Automated test suite (unit, integration, e2e)
  • Error handling verified (retries, circuit breakers)
  • Fallback strategies tested
  • Chaos engineering (simulate failures)
  • SLA monitoring automated
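
The cost-governance items above boil down to a small guard in front of every generation call. A minimal sketch, assuming an in-memory counter and a fixed cost-per-token estimate (a real deployment would persist spend and push alerts to Slack or PagerDuty):

import datetime

class BudgetGuard:
    """Track daily spend, alert at thresholds, and reject requests over the limit."""

    def __init__(self, daily_limit_usd: float, alert_thresholds=(0.8, 1.0)):
        self.daily_limit = daily_limit_usd
        self.alert_thresholds = alert_thresholds
        self._day = datetime.date.today()
        self._spent = 0.0
        self._alerted = set()

    def record(self, tokens: int, cost_per_1k_tokens: float = 0.0002) -> bool:
        """Return False if this request would exceed today's budget."""
        today = datetime.date.today()
        if today != self._day:
            # New day: reset the counter and alert state
            self._day, self._spent, self._alerted = today, 0.0, set()
        cost = tokens / 1000 * cost_per_1k_tokens
        if self._spent + cost > self.daily_limit:
            return False
        self._spent += cost
        for threshold in self.alert_thresholds:
            if self._spent >= threshold * self.daily_limit and threshold not in self._alerted:
                self._alerted.add(threshold)
                print(f"Budget alert: {threshold:.0%} of daily limit used")  # wire to Slack/PagerDuty
        return True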

Final Thoughts

After building and testing this platform, I understand why enterprise AI differs from giving developers ChatGPT access and why 95% of initiatives fail. Here is why these layers matter:

  • Cost tracking isn’t about being cheap—it’s about accountability. Finance won’t approve next year’s AI budget without ROI proof.
  • Intelligent routing prevents the death spiral: early excitement → everyone uses the expensive model → costs spiral → finance pulls the plug → initiative dies.
  • Observability builds trust. When executives ask “Is AI working?”, you need data: success rates, cost per department, quality trends. Without metrics, you get politics and cancellation.
  • Error handling and budgets are professional table stakes. Enterprises can’t have systems that randomly fail or spend unpredictably.

Here are things missing from the prototype:

  • Security: No SSO, PII detection, audit logs for compliance, encryption at rest, security review
  • High Availability: Single instance, no load balancer, no failover, no disaster recovery
  • Operations: No CI/CD, secrets management, log aggregation, incident playbooks
  • Scale: No auto-scaling, multi-region, or load testing beyond 100 concurrent
  • Governance: No approval workflows, per-user limits, content filtering, A/B testing

I have learned that vLLM works, that open models are competitive, and that the tooling is mature. This POC proves that the patterns work and the savings are real. The 5% that succeed treat AI as infrastructure requiring proper tooling. The 95% that fail treat it as magic requiring only faith.

Try it yourself: All code at github.com/bhatti/vllm-tutorial. Clone it, test it, prove it works in your environment. Then build the business case for production investment.
