In past projects, I saw most engineering teams ran load tests before major launches and rarely at any other time. The assumption is that if a code change is small, performance is probably fine. In practice, that assumption fails regularly. A runtime upgrade can change memory allocation patterns, garbage collection behavior, and connection handling in ways that only appear under load. A third-party library upgrade can introduce synchronous blocking where there was none before. A new database index can shift query planner behavior and affect read latency at scale. None of these surface in functional tests. None of them are visible in code review. They show up under load, in production, usually at the worst possible time.
Performance testing isn’t a pre-launch ceremony. It’s part of how you understand and maintain your system’s behavior as your code evolves, your dependencies change, and your traffic grows. This guide covers the full scope: the test types and what each one tells you, how to design meaningful tests, what metrics to collect, which tools to use, how to handle dependencies in your tests, and how to make this a regular part of your development process rather than a one-time event.
Why Performance Testing Gets Skipped
Often teams skip performance testing due to setup time, cost or slow feedback loop. These constraints are legitimate, but they lead to a familiar outcome where performance problems get discovered in production. Another common pattern I have observed is that many teams don’t have a clear baseline picture of how their application actually behaves. They don’t know their normal memory footprint. They don’t know which code paths are hot. They don’t know at what concurrency level their database connection pool saturates or when their cache hit rate starts degrading. Without a baseline, you can’t detect regressions, you can’t capacity plan accurately, and you can’t tell a normal traffic spike from an actual problem. The goal of performance testing is to know your system well enough to predict how it behaves and catch it when behavior changes unexpectedly.
Performance Testing in the SDLC
The most effective teams don’t treat performance testing as a separate phase instead they integrate it into their regular development process at multiple levels.
During development: I have found profiling tools like JProbe/Yourkit for Java, pprof for GO, V8 profiler for Node.js, XCode instruments for Swift/Objective-C incredibly useful to find hot code path, memory leaks or concurrency issues.
During code review: Another common pattern that I have found useful is flagging changes to caching, database queries, serialization, or hot code paths for load testing before merge.
Nightly CI/CD pipelines: Though, load testing on each commit would be excessive but they can partially run as part of nightly build so that we can fix them before they reach production.
On a regular schedule: Another option is to run full-scale load and soak tests run on a defined cadence like weekly.
Before major releases: Comprehensive tests covering all scenarios like average load, peak load, stress, spikes, soak can run against a production-representative environment.
After significant dependency upgrades: Runtime upgrades, major library version bumps, and infrastructure changes all deserve their own performance test pass.
The Testing Taxonomy
Following are different types of performance tests:
Profiling
Profiling instruments your application during execution and shows you exactly where time and memory are spent like which functions consume CPU, which allocate the most memory, where goroutines or threads block. You can run profiling locally before the code review so that you understand the bottlenecks already exist in your code. Load testing tells you how those bottlenecks behave when many users hit them simultaneously. Most runtimes include profiling support like Go’s pprof, Node.js’s built-in CPU and heap profiler, Python’s cProfile so you can also enable them in a test environment if needed.
Load Testing
Load testing applies a realistic, expected workload and verifies the system meets defined performance targets. The workload mirrors production traffic such as request distribution, concurrency level, and payload shapes. The goal isn’t to break anything. It’s to confirm the system handles its designed workload within acceptable response times and error rates. Any change that could affect throughput like a code change in a hot path, a dependency upgrade, a configuration change, a schema migration should warrant a load test.
Stress Testing
Stress testing pushes load well beyond expected levels to find where the system breaks and how it breaks. At what point does performance degrade? What component fails first? Does the system fail gracefully or catastrophically, or corrupting state? In past projects, I found a practical target in cloud environments is 10x your expected peak load. This accounts for real-world variability: viral traffic events, bot traffic, cascading retries from upstream services, and faster growth than planned. Stress tests also expose whether your failure modes are safe. When your system can’t keep up, what happens? Does it queue requests until it runs out of memory? Does it reject new connections cleanly with meaningful errors? Does retry behavior from clients amplify load turning a recoverable spike into a full outage?
Spike Testing
Spike testing applies an abrupt load increase not a gradual ramp but a sharp jump so that we can learn how the system absorbs and recovers from it. This simulates promotional emails going out, products appearing in news, scheduled batch jobs triggering thousands of concurrent operations, or a mobile app push notification causing a synchronized rush of API calls. The spike testing can identify problems like cold-start latency when new instances initialize, connection pool exhaustion when concurrency jumps faster than the pool replenishes, cache stampedes when many concurrent requests miss cache simultaneously, and auto-scaling lag when the metric-to-action delay is too long. After the spike, watch recovery. Latency should return to baseline. Resource utilization should drop. If it doesn’t, the system is carrying forward pressure that will degrade subsequent traffic.
Soak Testing
Soak testing runs a moderate, sustained load over an extended period from several hours to several days. The load level isn’t extreme; the duration is the point so that it can uncover problems that only occur after a long duration such as:
Memory leaks: Usage climbs slowly and continuously. The system that runs fine for 30 minutes may run out of heap after 8 hours. This is especially important to test after runtime or library upgrades, which can change allocator behavior.
Connection leaks: Database or HTTP connections that aren’t properly released accumulate until the pool is exhausted.
Thread accumulation: Background threads that don’t terminate properly compound over time.
Disk exhaustion: Log files that aren’t rotated, or temporary files that aren’t cleaned up, fill disk gradually.
Cache degradation: Caches misconfigured for their access patterns may perform well initially and degrade as the working set evolves.
GC pressure: Garbage collection that runs cleanly initially can become increasingly frequent and pause-heavy as heap fragmentation grows over time.
Scalability Testing
Scalability testing validates that your system scales up to absorb increasing load and scales back down when load subsides. Cloud infrastructure assumes elastic scaling so scalability testing verifies the assumption. This helps verify that: the metric driving scale-up (CPU, request rate, queue depth) actually reaches its threshold under realistic load. The scaling event actually reduces the pressure that triggered it. Scale-up happens fast enough that users don’t experience degradation during the lag. Scale-down doesn’t trigger an immediate scale-up cycle, creating instability. In practice, auto-scaling especially first scale event can take several minutes so you need to make sure that you have some extra capacity to handle increased load.
Volume Testing
Many performance characteristics change materially as data grows. Index scan times increase. Query planner behavior shifts. Cache hit rates drop as the working set outgrows cache size. Search latency that is acceptable at 50 million records may become unacceptable at 250 million. Test at your current production data volume, then at projected volumes for 1 and 3 years out. The time to address data growth challenges in architecture is before you’re already there.
Recovery Testing
Recovery testing applies an abnormal condition like a dependency failure, a network partition, a resource exhaustion event and measures how long the system takes to return to normal operation. The key questions: does the system recover at all? How long does recovery take? What’s the user-visible impact during the recovery window?
Handling Dependencies in Your Tests
One of the practical decisions in every load test is what to do about dependencies like external APIs, third-party services, internal microservices, payment processors, identity providers, email services, and so on. You have two approaches, and which one you choose depends on what your test is trying to answer.
Mock Dependencies When You’re Focused on Your Own Code
When your goal is to validate your application’s internal performance like memory footprint, CPU usage, throughput of your business logic, efficiency of your data access layer then mocking external dependencies is often the right call. However, you will need to build a well-designed mock that returns realistic response payloads with configurable latency. Mocking lets you:
Isolate your application’s performance characteristics from the noise of external variability
Simulate dependency failure modes (timeouts, errors, slow responses) in a controlled way
Run tests without consuming third-party quotas or generating costs in external systems
Reproduce specific latency profiles to understand how your code behaves under different dependency performance conditions
Include Real Dependencies When Integration Behavior Matters
When your goal is to validate end-to-end system behavior including the interaction effects between your system and its dependencies then you can use real dependencies or realistic stubs deployed under your control. The reason this matters: under load, dependencies behave differently than they do at idle. For example, higher latency in dependencies can propagate creating back-pressure in your system that a mock would never reveal. Dependencies that are slow, throttled, or unavailable under load can:
Exhaust your connection pools (connections held open waiting for a slow response)
Fill your request queues (new requests queueing behind slow in-flight requests)
Trigger retry storms (your retry logic amplifying load on an already-struggling dependency)
Surface timeout and circuit-breaker behavior that only activates under real latency conditions
If you include real third-party services in your load test, be explicit about two things: you may consume quota and generate costs, and their performance becomes part of your results. When a dependency is slow, it appears as latency in your own metrics — know what you’re measuring.
A practical middle ground: deploy internal stubs for your external dependencies. A stub is a service you control that returns realistic responses with configurable behavior. Unlike a mock in a test harness, a stub runs as a real service and participates in your actual network topology. It lets you test realistic integration behavior without the unpredictability or cost of real external services.
Watch for Automatic Retry Amplification
Another factor that can skew results from performance testing is automated retries at various layers when a request fails or times out. Under load, this multiplies traffic. If your application generates 400 write operations per second against a dependency, and that dependency starts returning errors, your client may retry each failed request two or three times and suddenly generating 800 to 1,200 operations per second against an already-struggling system. In your load tests, verify that your retry behavior is bounded and doesn’t turn a manageable degradation into a cascading failure. Exponential backoff with jitter, retry budgets, and circuit breakers all exist to prevent this.
Design Your Load Model
Before writing a test script, model the load you intend to generate. A poorly designed load model produces results that feel meaningful but don’t correspond to anything real.
Use Production Traffic Patterns as Your Starting Point
Study your actual production metrics. Identify:
Average requests per second across a normal operating period
Peak requests per second during your highest-traffic periods
Request distribution across endpoints: what percentage of traffic hits each API? Most services have a small number of high-traffic endpoints and many low-traffic ones.
Read/write ratio: most production services are read-heavy; your load model should reflect that
Payload characteristics — average request and response sizes
User session behavior: are users authenticated? Do requests carry session state? Do later requests in a workflow depend on earlier ones?
Geographic distribution: does your traffic come from one region or many?
Use Stepped Load Progression
Ramp load gradually rather than jumping to peak immediately. A stepped approach produces distinct data points at each level, making it easier to identify where behavior changes.
Hold each step long enough for metrics to stabilize and for any auto-scaling events to complete. If your auto-scaling policy triggers after 5 minutes of sustained high CPU, your steps need to run for at least 7-10 minutes. Steps that are too short produce transient data that doesn’t represent steady-state behavior.
Model Think Time
Real users don’t send requests as fast as possible. They read pages, fill forms, wait for results, and make decisions. Think time like the pause between user actions should be randomized within a realistic range based on observed production behavior. Omitting think time concentrates load artificially, inflates concurrency counts, and produces results that don’t correspond to real user behavior.
Model Transaction Workflows, Not Just Endpoints
A user doesn’t hit /api/checkout. They authenticate, browse products, add items to a cart, enter payment details, and confirm an order. Each step depends on the previous step and carries state forward. Test complete workflows. Measure the whole transaction, not just individual request latency. This reveals which step in the workflow breaks first under load, which is your actual bottleneck. For transactional workflows, count the full transaction as your unit of measurement, not individual requests. A checkout that takes 12 requests and completes in 3 seconds is different from one that requires 12 requests and only completes 60% of the time under load.
The Test Environment
Your test environment is the single largest source of invalid load test results. Get this wrong and every metric, analysis, and conclusion downstream becomes unreliable.
Match Production Infrastructure
The test environment should match production in:
Instance types, sizes, and counts
Database configuration: connection pool size, cache allocation, index configuration, replica count
Caching layers and their sizes (this is a common miss a cache sized to 10% of production will warm and evict very differently)
Auto-scaling configuration and thresholds
Load balancer and network configuration
All service configurations that affect throughput or latency
Pay particular attention to cache sizes. Under-sized caches in test environments produce unrealistically high cache miss rates, which increases database load and makes your results look worse than production will be. Over-sized caches make things look better.
Use Representative Data Volumes
Test environments with small datasets produce misleading results. A database with 1 million rows behaves differently from one with 100 million rows in ways that are significant and non-linear. Index performance, query planner behavior, partition routing, and cache hit rates all change with data volume. Populate your test environment with data that reflects realistic production scale before running meaningful performance tests.
Isolate the Test Environment Completely
I have seen a load test takes down production environment because it shared a common infrastructure. A test environment that shares any infrastructure with production like databases, message queues, caching clusters, network paths, logging infrastructure creates two simultaneous problems: invalid test results (because production traffic contaminates your measurements) and potential production incidents (because your load test contaminates production systems). Shared test environments that connect to production Messaging bus, Kafka, or database clusters have caused outages. Enforce complete isolation.
Account for Test Data Accumulation
Load tests generate real data. After many test runs, your test database accumulates records, logs grow, and storage fills. Plan your test data lifecycle from the start, e.g., how you populate data before tests, whether you clean up between runs, and how you prevent accumulated test data from affecting your test environment’s performance over time.
Document Your Environment Specification
Version-control your environment definition alongside your test scripts. When you compare results across time, you need to know that what changed was the system under test, not the test environment. An environment specification that exists only in someone’s memory cannot be reproduced reliably.
Metrics: Collect the Right Things
Load testing generates a lot of data. The teams that extract the most value don’t collect more metrics, they collect the right metrics and actually analyze them.
Latency
Track percentiles, not averages. Averages hide tail behavior that determines user experience.
P50 — what the median user experiences
P90 — your common-case ceiling; nine in ten requests complete within this
P99 — your near-worst case; one in a hundred users waits this long
P99.9 — your extreme tail; relevant for high-volume services where 0.1% is still thousands of users
The gap between P50 and P99.9 tells you about consistency. A wide gap means some users experience good performance while others experience unacceptable degradation. Systems under load often hold P50 steady while P99 climbs.
Throughput
Requests per second: raw system throughput
Successful transactions per second: throughput filtered by correctness; throughput with a 20% error rate is not good throughput
Throughput per resource unit: requests per CPU core, per GB of memory helps with capacity planning
Error distribution — which specific errors, at what load levels
Don’t aggregate errors into a single rate. A 2% error rate composed entirely of timeouts tells you something different from a 2% error rate of connection refused responses. Decompose your error data and correlate specific error types with the load levels at which they appear.
Resource Utilization
Collect these for every component like application servers, databases, caches, message queues, load balancers, and load generators:
CPU: overall and per-core; watch for single-threaded bottlenecks where overall CPU looks fine but one core is maxed
Memory: heap usage, GC frequency and pause duration, swap usage; track memory over time in soak tests to detect leaks
Disk I/O: read and write throughput, queue depth, utilization percentage; relevant for databases and any service that writes logs or temp files
Network I/O: ingress and egress bytes per second, connection counts, dropped packets
Thread and connection pool utilization: active threads, queued requests, pool exhaustion events
Application-Level Metrics
Cache hit/miss/eviction rates: degrading hit rates under load reveal cache sizing or key distribution problems
Queue depths: growing queues indicate consumers can’t keep pace with producers
Database connection pool saturation: one of the most common failure modes under load
GC pause duration and frequency: GC pressure under load causes latency spikes that don’t show up in CPU metrics directly
Retry rates: high retry rates indicate a dependency is struggling, and may be amplifying load
Circuit breaker state: how often circuit breakers open under load, and what triggers them
Dependency-Level Metrics
When you include real dependencies in your test, monitor them as carefully as your own service:
Response latency from each dependency (P50, P99)
Error rates from each dependency
Dependency-side resource utilization (if you have access)
Message bus ingress and egress (if applicable)
Partition utilization for distributed storage systems
When a dependency is slow or erroring, that signal propagates through your system as elevated latency and errors in your own metrics. You need dependency-level metrics to trace the source.
Availability
Define availability targets before testing:
Service availability: percentage of requests that succeed
Per-endpoint availability: some endpoints degrade before others; measure them independently
Dependency availability: availability of each system your service calls
Business-Level Metrics
The most important metrics are often furthest from the infrastructure:
Orders completed per minute
Successful authentication rate
Payment processing completion rate
Data write confirmation rate
Infrastructure metrics tell you what the system is doing. Business metrics tell you what users are experiencing. A system where P99 latency stays within SLA but checkout completion drops 15% under load has a problem that infrastructure metrics alone won’t reveal clearly.
Tools
Over the years, I have used various commercial and open source tools like LoadRunner, Grinder, Tsung, etc. that are no longer well maintained. Here are common tools that can be used for load testing:
For Simple Endpoint Testing
ab (Apache Bench) and Hey: Command-line tools that generate load against a single endpoint. No scripting required, fast to start.
Vegeta: Generates load at a constant request rate, independent of server response time. This distinction matters: when your server responds slowly, most tools automatically reduce request rate. Vegeta maintains the configured rate as latency climbs, which means you observe back-pressure and degradation accurately.
k6: Scripted in JavaScript, distributed as a single Go binary. k6 handles multi-step scenarios natively, supports parameterized test data, models think time, and exposes rich built-in metrics. It integrates with Prometheus, CloudWatch, and Grafana for analysis, and supports threshold-based pass/fail in CI pipelines.
Apache JMeter: Meter supports complex scenarios through a GUI, handles correlation between requests, has a broad plugin ecosystem, and has extensive enterprise adoption.
Locust: Pure Python, code-defined test scenarios (not XML), a built-in web UI for real-time monitoring, distributed mode via a controller/worker model, and trivially scriptable.
For Distributed Load Generation
AWS Distributed Load Testing: When a single machine can’t generate the volume you need, this solution orchestrates load across multiple instances, accepts JMeter scripts as the test definition, and streams results to time-series storage for analysis. Use it when your bandwidth or TPS requirements exceed what a single load generator can produce.
For Observability During Tests
You can use following monitoring stack to gather performance metrics:
Prometheus + Grafana: commonly used for infrastructure and application metrics; k6 exports directly to Prometheus
CloudWatch: native AWS monitoring; integrates with most AWS services and many load testing tools
Distributed tracing (Jaeger, Zipkin, AWS X-Ray): essential for understanding latency in distributed systems; propagate correlation IDs through every service boundary so you can trace a slow request to the specific component that caused it
Without distributed tracing, diagnosing latency in a multi-service system under load is largely guesswork.
Execution
Warm Up Before Measuring: JIT compilation, connection pool initialization, cache population, and DNS resolution all affect early request latency. Build a ramp-up period into every test. Discard metrics from the warmup phase. Measure steady-state behavior only.
Verify Your Load Generator Isn’t the Bottleneck: Before trusting any results, confirm: load generator CPU stays well below saturation (under 70%), network I/O doesn’t approach the bandwidth ceiling, and the tool achieves the TPS you configured not a lower number due to local resource constraints. If you configure 1,000 TPS but the generator only achieves 600, your results reflect the generator’s limits, not your system’s.
Notify Dependent Teams Before Testing: If your test environment shares any infrastructure with other teams, notify them before running high-volume tests. Unexpected load from your tests against a shared component (a database, a message bus, a routing layer) can cause problems for teams who have no idea a load test is running.
Run Each Scenario in Isolation First: Test each scenario independently before running combinations. An isolated test that reveals a problem gives you more diagnostic information than a combined test that reveals the same problem buried in noise from other scenarios.
Don’t Overwrite Previous Results: Each test run should write to a new, timestamped output file. Overwriting results from a previous run is a common mistake when running iterative tests in a loop. You lose the ability to compare across runs.
Pause Between Runs: Allow the system to fully drain between test iterations like connections close, queues clear, resource utilization returns to baseline. Residual load from one run contaminates the starting conditions of the next.
Common Pitfalls
Testing a single endpoint and calling it done. A service’s behavior under load isn’t determined by any single endpoint. Test complete workflows, including the paths that matter most to users.
Ignoring dependencies. When your dependencies are slow or unavailable, your service appears slow. When your service hammers a dependency with load, the dependency may degrade and create a feedback loop. Model dependency behavior explicitly and mock it when you want to isolate your own code, use real or realistic stubs when integration behavior matters.
Mismatch between test environment and production. Different hardware, different cache sizes, different connection pool limits, different network latency profiles, any of these make test results non-transferable to production. Document your environment specification. Validate that it matches production before trusting results.
Small data volumes. A test environment with 1% of production data volume produces optimistic results. Populate test data to realistic scale.
Running load tests once. Performance characteristics change with every code change, every dependency upgrade, and every growth milestone. A load test you ran six months ago tells you about a system that no longer exists.
Ignoring ramp-down. Verify that resource utilization returns to baseline after load subsides. A system that doesn’t recover cleanly carries forward pressure that degrades subsequent traffic.
Not collecting metrics from all layers. Application-level metrics without infrastructure metrics leave you guessing about root cause. Infrastructure metrics without application or business-level metrics leave you unable to quantify user impact. Collect all three.
Stopping tests when something goes wrong instead of analyzing the failure mode. When a stress test surfaces a failure, that’s the point. Note what failed, under what conditions, and how the system behaved. Stopping the test immediately loses the degradation data that tells you whether the failure mode is safe or catastrophic.
Analysis
Establish a Baseline Before Comparing Anything: Every metric needs a reference point. P99 latency of 300ms is good or bad depending entirely on what P99 looks like at baseline load. Run a baseline test with minimal concurrent users before escalating. Capture that baseline explicitly. Compare every subsequent measurement against it.
Separate Signal from Noise: A single high-latency data point is noise. A systematic increase in P99 as concurrency crosses 500 users is signal. Look for the pattern: where does behavior change? At what load level? After what duration? What resource metric correlates with the change?
Trace Latency to Its Source: When you observe elevated latency, resist looking first at application CPU. Latency accumulates in many places: network round trips between services, database query execution, lock contention, GC pause accumulation, connection pool queuing, and downstream dependency latency. Distributed tracing lets you follow a slow request through every component it touched and attribute the latency precisely. Fix the actual source, not the nearest visible symptom.
Investigate Unexpectedly Good Results: If your system performs better than expected under load, investigate before celebrating. Unexpected improvement often means your test isn’t exercising the paths you intended such as caches warming too aggressively, load not reaching the components you think it is, or test data creating unrealistic access patterns. Results you can’t explain aren’t results you can rely on.
Generate Comparative Reports: A report listing numbers has limited value. A report comparing those numbers to your baseline, to your previous test run, and to your defined thresholds has significant value. For each metric, capture: – Current result – Baseline value – SLA or target threshold – Previous test result (regression or improvement?) – Load level at which the metric was captured
Store test results in a queryable format over time.
Building a Continuous Performance Practice
The teams with the most reliable services don’t treat performance testing as a project. They treat it as a discipline with regular cadence.
Define performance goals and revisit them annually. Goals should include throughput targets, latency percentiles, error rate limits, resource utilization ceilings, and headroom targets (how much capacity should remain available at peak). As your traffic patterns change, your service evolves, and your SLAs tighten, these goals need updating.
Automate pass/fail thresholds in CI. Encode your performance targets as pipeline gates. A change that increases P99 latency by 40% under load should fail the build, the same way a change that breaks a unit test fails the build.
Run performance canaries in production. Continuously exercise production endpoints at low volume from monitoring infrastructure. Track latency, error rates, and throughput over time. Detect gradual degradation before users do.
Assign a performance owner on each service team. Performance improvements don’t happen without someone watching the metrics, reviewing throttling rules, identifying regressions, and driving improvements.
Review results across time for patterns. Look at all your load test results over the past quarter. Which metrics trend in the wrong direction? Which components appear repeatedly in bottleneck analysis? Patterns across multiple tests reveal systemic issues that any individual test misses.
Share what you learn. Performance problems and their solutions are valuable organizational knowledge. Document them. Share them across teams. The team dealing with connection pool exhaustion today is probably not the first team to hit that issue.
The Pre-Test Checklist
Before any load test:
[ ] Test objectives and pass/fail thresholds defined in writing before execution
[ ] Test environment completely isolated from production
[ ] Test environment infrastructure matches production configuration (instance types, cache sizes, connection pools, scaling settings)
[ ] Test data populated to realistic production scale
[ ] Dependent services decided: mock, stub, or real — with rationale documented
[ ] Monitoring dashboards active for all components, including load generators
[ ] Dependent team on-call contacts notified
[ ] Output file naming prevents overwrites between iterations
[ ] Previous test results available for comparison
During execution:
[ ] Baseline captured before escalating load
[ ] Load generator resource utilization verified (not the bottleneck)
[ ] Error rates monitored in real time — abnormal errors trigger a pause for investigation
[ ] Each step held long enough for metrics to stabilize
[ ] Auto-scaling events logged with timestamps
After execution:
[ ] Results compared to defined thresholds and previous runs
[ ] Anomalies investigated before conclusions are drawn
[ ] Root cause documented for any threshold violations
[ ] Action items assigned with owners and deadlines
[ ] Test results stored in versioned, queryable storage
Any code change can affect performance. A dependency upgrade, a new index, a configuration tweak, a framework version bump — all of these can change memory footprint, CPU usage, throughput, and latency in ways that don’t appear until you run real load. The only reliable way to catch these changes before they affect users is to make performance testing a routine part of how you build and ship software, not something you do once before a big launch.
Start with profiling to understand where time and memory go in your own code. Add load tests to your CI pipeline to catch regressions early. Run soak tests to find memory and connection leaks. Stress test to 10x your expected peak so you know what your ceiling looks like and how you fail when you hit it. Test with real dependency behavior when integration effects matter, and mock dependencies when you want to isolate your own code.
Collect metrics at every layer such as application, infrastructure, and business so you can connect a latency spike to its root cause and quantify its user impact. Store results over time so you can detect gradual regressions before they become incidents. The goal is to know your system well enough that production behavior matches what you measured in testing.
Over the years, I have watched distributed services evolve through phases I lived through personally such as CORBA, EJB, SOA, REST microservices, containers WebAssembly feels different. It compiles code from any language into a universal binary format, runs it in a sandboxed environment, and delivers near-native performance without containers or language-specific runtimes cluttering your production stack.
When I built PlexSpaces for serverless FaaS applications, I designed its polyglot layer on top of WebAssembly and the WASI Component Model. It allows you to write actors in Python, Rust, Go, or TypeScript, compile them to WASM, and deploy them to the same runtime. The framework handles persistence, fault tolerance, supervision, and scaling regardless of programming language. In this post, I’ll walk you through the core WebAssembly concepts, show how PlexSpaces leverages the Component Model for polyglot development, and demonstrate building, testing, and deploying applications in all four languages. I’ll also show a PlexSpaces Application Server model that lets you deploy entire application bundleslike deploying a WAR file to Tomcat, but with the fault-tolerance of Erlang/OTP built in.
WebAssembly Introduction
WebAssembly launched in 2017 as a browser technology. I ignored it for years — client-side JavaScript ecosystem drama wasn’t something I wanted to track. The server-side story changed everything.
How WebAssembly Executes Code
WebAssembly is a stack-based virtual machine that executes a compact binary instruction format. Every language that compiles to WASM follows the same pipeline:
The WASM binary format encodes typed functions, a linear memory model, and a set of imports and exports. The runtime validates the binary at load time, then executes it using either just-in-time (JIT) compilation or ahead-of-time (AOT) compilation to native machine code. Three properties make this execution model powerful for distributed systems:
Deterministic execution. Given the same inputs, a WASM module produces the same outputs. This property underpins PlexSpaces’ durable execution, which replays journaled messages through the same WASM binary and arrives at the exact same state.
Memory isolation. Each WASM instance gets its own linear memory. One module cannot read, write, or corrupt another module’s memory. No shared-memory race conditions and buffer overflows escaping the sandbox. The runtime enforces these boundaries at the hardware level.
Capability-based security. A WASM module starts with zero capabilities. It cannot access the filesystem, the network, or even a clock unless the host explicitly provides each capability through imported functions. PlexSpaces grants actors exactly the capabilities they need like messaging, key-value storage, tuple spaces.
The Component Model
Early WebAssembly only understood numbers. You passed integers and floats across the boundary, and that was it. The WebAssembly Component Model fixes this limitation by defining rich, typed interfaces that components use to communicate. You can think of it as an IDL (Interface Definition Language) for WASM but one that works across every language. The key building blocks:
WIT (WebAssembly Interface Types): A language for defining typed function signatures across components. A function defined in WIT can accept strings, records, lists, variants, and enums. WIT bridges the type systems of Rust, Python, Go, and TypeScript into a single, shared contract.
Components: Self-contained WASM modules that declare their imports (what they need from the host) and exports (what they provide). A Rust component and a Python component that implement the same WIT interface become interchangeable at the binary level.
WASI (WebAssembly System Interface): The standardized API that gives WASM modules access to system resources like file I/O, networking, clocks, and random number generation within the sandbox. WASI Preview 2 shipped in 2024 with HTTP, filesystem, and socket support. WASI 0.3, released in February 2026, added native async support for concurrent I/O.
Wasm 3.0 and WasmGC
The WebAssembly ecosystem crossed a critical threshold. Wasm 3.0 became the W3C standard in 2025, standardizing nine production features in a single release:
WasmGC: garbage collection support built into the runtime, eliminating the need for languages like Go, Python, and Java to ship their own GC inside the WASM binary. This shrinks binary sizes and improves performance for GC-dependent languages dramatically.
Exception handling: structured try/catch at the WASM level, replacing the expensive setjmp/longjmp workarounds that inflated binaries.
Tail calls: proper tail call optimization for functional programming patterns without stack overflow.
SIMD (Single Instruction, Multiple Data): vector operations for parallel numeric computation, critical for ML inference and scientific workloads.
For PlexSpaces, WasmGC means Go and Python actors run faster with smaller binaries. SIMD means computational actors like n-body simulations, matrix multiplies, genomics pipelines that process data at near-native throughput inside the sandbox.
What This Means in Practice
You compile a Python actor and a Rust actor to WASM. Both implement the same WIT interface. The runtime loads them identically, calls the same exported functions, and provides the same host capabilities like messaging, key-value storage, tuple spaces, distributed locks. The Python actor handles ML inference; the Rust actor handles high-throughput event processing. They communicate through PlexSpaces message passing without knowing or caring which language sits on the other side.
This is not “Write Once, Run Anywhere” in the old Java sense. This is “Write in Whatever Language Fits, Run Together on the Same Runtime.”
How PlexSpaces Makes It Work
PlexSpaces is a unified distributed actor framework that combines patterns from Erlang/OTP, Orleans, Temporal, and modern serverless architectures into a single abstraction. I described the five foundational pillars in my earlier post: TupleSpace coordination, Erlang/OTP supervision, durable execution, WASM runtime, and Firecracker isolation. Here I focus on the WASM layer and how it enables polyglot development.
Architecture at a Glance
The WIT Contract for Actor
Every actor regardless of source language targets the same WIT world. Here is the simplified world that most polyglot actors use:
The full-featured actor package adds dedicated WIT interfaces for workflows, channels, durability/journaling, registry/service discovery, HTTP client, and cron scheduling. PlexSpaces also defines specialized worlds that import only the capabilities each actor needs:
WIT World
Imports
Use Case
plexspaces-actor
All 13 interfaces
Full-featured actors needing every capability
simple-actor
Messaging + Logging
Lightweight stateless workers
durable-actor
Messaging + Durability
Actors with crash recovery and journaling
coordination-actor
Messaging + TupleSpace
Actors coordinating through shared tuple space
event-actor
Messaging + Channels
Event-driven actors using queues and topics
This design keeps WASM binaries small. A simple actor that only needs messaging imports two interfaces not thirteen.
Language Toolchains
Each language uses a different compiler to produce WASM, but the output targets the same runtime:
Language
Compiler
WASM Size
Performance
Best For
Rust
cargo (wasm32-wasip2)
100KB-1MB
Excellent
Production, performance-critical paths
Go
tinygo
2-5MB
Good
Balanced performance, fast iteration
TypeScript
jco componentize
500KB-2MB
Good
Web integration, rapid development
Python
componentize-py
30-40MB
Moderate
ML inference, data processing, prototyping
Now let’s build something real in each language.
Getting Started
Before diving into the language examples, set up your development environment.
Prerequisites
Rust 1.70+ (for building PlexSpaces itself)
Docker (optional — for the fastest path to a running node)
One or more WASM compilers for your target languages (see below)
Option 1: Docker Quickstart
Pull and run a PlexSpaces node in seconds:
# Pull the official image
docker pull plexobject/plexspaces:latest
# Run a single node with HTTP API on port 8001
docker run -d \
--name plexspaces-node \
-p 8000:8000 \
-p 8001:8001 \
-e PLEXSPACES_NODE_ID=node1 \
-e PLEXSPACES_DISABLE_AUTH=1 \
plexobject/plexspaces:latest
The node exposes a gRPC endpoint on port 8000 and an HTTP/REST gateway on port 8001 with interactive Swagger UI documentation.
Option 2: Build from Source
git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces
./scripts/server.sh
# Or use the Makefile step by step
make build # Build all crates
make test # Run all tests
Install Language Compilers
Install the WASM compiler for each language you plan to use:
# Rust (produces the smallest, fastest WASM)
rustup target add wasm32-wasip2
# Go (pragmatic balance of performance and dev speed)
# macOS:
brew install tinygo
# Also need wasm-tools for component creation:
cargo install wasm-tools
# TypeScript (rapid development, web ecosystem)
npm install -g @bytecodealliance/jco
# Python (ML, data processing, prototyping)
pip install componentize-py
# Optional: WASM binary optimizer (shrinks binaries further)
cargo install wasm-opt
Start the Node and Deploy Your First Actor
# Start a PlexSpaces node (from source)
cargo run --release --bin plexspaces -- start \
--node-id dev-node \
--listen-addr 0.0.0.0:8000 \
--release-config release-config.toml
# Deploy a WASM actor (from any language)
curl -X POST http://localhost:8001/api/v1/applications/deploy \
-F "application_id=my-app" \
-F "name=my-actor" \
-F "version=1.0.0" \
-F "wasm_file=@my_actor.wasm"
# Send it a message
curl -X POST http://localhost:8001/api/v1/actors/my-app/ask \
-H "Content-Type: application/json" \
-d '{"message_type": "hello", "payload": {}}'
Now let’s build real actors in each language.
Python: A Calculator Actor with the SDK
Python shines for rapid prototyping and data-heavy workloads. The PlexSpaces Python SDK uses decorators (@actor, @handler, state()) that eliminate boilerplate and let you focus on business logic.
The Actor Code
# calculator_actor.py
from plexspaces import actor, state, handler, init_handler
@actor
class Calculator:
"""Calculator actor implementing basic math operations."""
# Persistent state fields -- survive crashes via journaling
last_operation: str = state(default=None)
last_result: float = state(default=None)
history: list = state(default_factory=list)
@init_handler
def on_init(self, config: dict):
"""Initialize calculator with optional config."""
if "state" in config:
saved = config["state"]
self.last_operation = saved.get("last_operation")
self.last_result = saved.get("last_result")
self.history = saved.get("history", [])
@handler("add")
def add(self, operands: list = None) -> dict:
"""Add operands together."""
result = sum(operands or [])
self._record("add", operands, result)
return {"result": result, "operation": "add"}
@handler("subtract")
def subtract(self, operands: list = None) -> dict:
"""Subtract: first operand minus rest."""
if not operands or len(operands) < 2:
return {"error": "Subtract requires at least 2 operands"}
result = operands[0] - sum(operands[1:])
self._record("subtract", operands, result)
return {"result": result, "operation": "subtract"}
@handler("multiply")
def multiply(self, operands: list = None) -> dict:
"""Multiply all operands."""
result = 1
for op in (operands or []):
result *= op
self._record("multiply", operands, result)
return {"result": result, "operation": "multiply"}
@handler("divide")
def divide(self, operands: list = None) -> dict:
"""Divide first operand by second."""
if not operands or len(operands) < 2:
return {"error": "Divide requires 2 operands"}
if operands[1] == 0:
return {"error": "Division by zero"}
result = operands[0] / operands[1]
self._record("divide", operands, result)
return {"result": result, "operation": "divide"}
@handler("get_history")
def get_history(self) -> dict:
"""Return calculation history."""
return {"history": self.history}
@handler("call", "get_state")
def get_state_handler(self) -> dict:
"""Snapshot current state."""
return {
"last_operation": self.last_operation,
"last_result": self.last_result,
"history": self.history,
}
def _record(self, operation, operands, result):
self.last_operation = operation
self.last_result = result
self.history.append({
"operation": operation,
"operands": operands,
"result": result,
})
Notice how the @actor decorator marks the class, state() declares persistent fields that survive crashes, and each @handler("operation") routes incoming messages to the right method. The SDK handles WIT serialization, state checkpointing, and all the plumbing underneath.
Build and Deploy
# Install the Python SDK
pip install -e "sdks/python/[dev]"
# Build WASM using the SDK CLI
plexspaces-py build calculator_actor.py \
-o calculator_actor.wasm \
--wit-dir wit/plexspaces-simple-actor
# Deploy the WASM module
curl -X POST http://localhost:8001/api/v1/applications/deploy \
-F "application_id=calculator-app" \
-F "name=calculator" \
-F "version=1.0.0" \
-F "wasm_file=@calculator_actor.wasm"
The actor processes the request, updates its persistent state, and returns the result. If the node crashes and restarts, the framework replays the journal and restores the calculator’s state .
TypeScript: A Bank Account with Durable State
TypeScript brings type safety and rapid development. The PlexSpaces TypeScript SDK uses an inheritance-based pattern: extend PlexSpacesActor, implement on<Operation>() handlers, and the SDK wires everything to WIT.
The BankAccountActor manages deposits, withdrawals, and transaction history with full durability. The onReplay() handler rebuilds the balance from the transaction log, demonstrating event-sourcing patterns that the framework makes trivial.
Build and Deploy
The TypeScript build uses a three-step pipeline: compile TypeScript, bundle with esbuild, then create a WASM component with jco:
Go delivers a pragmatic balance between performance and developer productivity. The PlexSpaces Go SDK uses an interface-based pattern: implement the Actor interface, embed BaseActor for automatic state serialization, and register your actor for WASM export via plexspaces.Register().
The Actor Code
This example implements a sliding-window rate limiter, the kind you find inside API gateways like NGINX, Kong, or Envoy. Each client gets an independent window with configurable limits:
// rate_limiter.go
package main
import (
"encoding/json"
"fmt"
"github.com/plexobject/plexspaces/sdks/go/plexspaces"
)
type SlidingWindowLimiter struct {
plexspaces.BaseActor
WindowSizeMs uint64 `json:"window_size_ms"`
MaxRequests int `json:"max_requests"`
Clients map[string]*ClientWindow `json:"clients"`
TotalChecks int `json:"total_checks"`
TotalAllowed int `json:"total_allowed"`
TotalDenied int `json:"total_denied"`
}
type ClientWindow struct {
Timestamps []uint64 `json:"timestamps"`
Allowed int `json:"allowed"`
Denied int `json:"denied"`
}
var host = plexspaces.NewHost()
func NewSlidingWindowLimiter() *SlidingWindowLimiter {
a := &SlidingWindowLimiter{
WindowSizeMs: 60000,
MaxRequests: 100,
Clients: make(map[string]*ClientWindow),
}
a.SetSelf(a) // enables automatic JSON state serialization
return a
}
func (s *SlidingWindowLimiter) Init(configJSON string) string {
var config struct {
ActorID string `json:"actor_id"`
Args map[string]any `json:"args"`
}
json.Unmarshal([]byte(configJSON), &config)
if args := config.Args; args != nil {
if v, ok := args["window_size_ms"]; ok {
s.WindowSizeMs = uint64(v.(float64))
}
if v, ok := args["max_requests"]; ok {
s.MaxRequests = int(v.(float64))
}
}
host.Info(fmt.Sprintf("RateLimiter: window=%dms, max=%d req/window",
s.WindowSizeMs, s.MaxRequests))
return ""
}
func (s *SlidingWindowLimiter) Handle(from, msgType, payloadJSON string) string {
switch msgType {
case "check_rate":
return s.checkRate(payloadJSON)
case "stats":
return s.getStats()
default:
data, _ := json.Marshal(map[string]any{"error": "unknown: " + msgType})
return string(data)
}
}
func (s *SlidingWindowLimiter) checkRate(payloadJSON string) string {
var req struct { ClientID string `json:"client_id"` }
json.Unmarshal([]byte(payloadJSON), &req)
window, exists := s.Clients[req.ClientID]
if !exists {
window = &ClientWindow{Timestamps: make([]uint64, 0)}
s.Clients[req.ClientID] = window
}
now := host.NowMs()
cutoff := now - s.WindowSizeMs
// Slide the window: remove expired timestamps
var active []uint64
for _, ts := range window.Timestamps {
if ts > cutoff { active = append(active, ts) }
}
window.Timestamps = active
// Check the limit
allowed := len(window.Timestamps) < s.MaxRequests
if allowed {
window.Timestamps = append(window.Timestamps, now)
window.Allowed++; s.TotalAllowed++
} else {
window.Denied++; s.TotalDenied++
}
s.TotalChecks++
remaining := s.MaxRequests - len(window.Timestamps)
if remaining < 0 { remaining = 0 }
data, _ := json.Marshal(map[string]any{
"allowed": allowed, "remaining": remaining,
"limit": s.MaxRequests, "client_id": req.ClientID,
})
return string(data)
}
// Register the actor for WASM export -- runs during _initialize,
// before the host calls any exported functions.
func init() {
plexspaces.Register(NewSlidingWindowLimiter())
}
func main() {}
The Go SDK pattern uses plexspaces.NewHost() to access all host functions (messaging, KV, tuple space, etc.) and plexspaces.Register() in the init() function to wire the actor to the WASM export interface. The comparison to Erlang/OTP maps directly:
Erlang/OTP
PlexSpaces Go
gen_server:start_link/3
Supervisor in app-config.toml
handle_call/3
Handle(from, msgType, payload)
#state{} record
Go struct with JSON tags
gen_server:call(Pid, Msg)
host.Ask(actorID, msgType, data)
application:start/2
app-config.toml
Build and Deploy
The Go build uses a three-step TinyGo pipeline: compile to core WASM, embed WIT metadata, then create a WASM component with a WASI adapter:
Each Go example includes a build.sh script that automates this pipeline and resolves the WASI adapter automatically.
Test Rate Limiting
# Check if a client request is allowed
curl -X POST http://localhost:8001/api/v1/actors/rate-limiter/ask \
-H "Content-Type: application/json" \
-d '{"message_type": "check_rate", "payload": {"client_id": "api-client-1"}}'
# Response: {"allowed": true, "remaining": 99, "limit": 100, "client_id": "api-client-1"}
# After 100 requests within the window:
# Response: {"allowed": false, "remaining": 0, "limit": 100, "client_id": "api-client-1"}
Rust: A Calculator with Maximum Performance
Rust produces the smallest, fastest WASM binaries. When you need every microsecond like high-frequency trading, real-time event processing, computational pipelines. Rust actors deliver near-native performance with binary sizes under 1MB.
The Actor Code
This calculator uses #![no_std] to eliminate the standard library entirely, producing a tiny, self-contained WASM module:
The resulting binary? Under 200KB. Compare that to a Python actor at 30-40MB or even a TypeScript actor at 1-2MB. When you deploy hundreds of actors per node, those size differences translate directly into memory savings and faster cold starts.
Deploying Applications
One of the patterns I find most compelling that I feel the serverless world has completely neglected is the idea of deploying whole applications, not just individual functions. If you have used Tomcat or JBoss, you understand what I mean. You package your application, hand it to the server, and the server takes care of running it, managing the process lifecycle, enforcing security policies, routing requests, collecting metrics, and handling restarts. You focus on business logic; the server handles the infrastructure cross-cuts. PlexSpaces brings this same model to WASM actors, but with Erlang/OTP’s supervision philosophy underneath. I call this the PlexSpaces Application Server model.
The Application Manifest
Instead of deploying actors one by one via API calls, you define an application bundle — a single manifest that describes your entire application topology: which actors to run, how they supervise each other, what resources they need, and what security policies apply to them.
[supervisor]
strategy = "one_for_one"
max_restarts = 10
max_restart_window_seconds = 60
# ChatRoom actor (Durable Object: one per room)
[[supervisor.children]]
id = "chat-room"
type = "worker"
restart = "permanent"
shutdown_timeout_seconds = 10
[supervisor.children.args]
max_history = "100"
# RateLimiter actor (Durable Object: per-user rate limiting)
[[supervisor.children]]
id = "rate-limiter"
type = "worker"
restart = "permanent"
shutdown_timeout_seconds = 5
[supervisor.children.args]
max_tokens = "5"
refill_rate_ms = "1000"
The runtime validates every WASM module against its declared WIT world, starts the supervision tree from the root down, and begins enforcing all security and resource policies — before your first actor processes its first message. PlexSpaces takes care of most cross cutting concerns like auth token validation, rate limiting, structured logging, trace context propagation, circuit breakers so that you can focus on the business logic.
Supervision and restarts. The manifest’s supervision tree is live. If actor crashes, the supervisor restarts it according to the declared strategy. If it exceeds max_restarts within max_restart_window_seconds, the supervisor escalates to its parent. This is exactly how Erlang/OTP gen_server supervision works.
Comparing Deployment Models
Capability
Traditional Microservices
AWS Lambda
PlexSpaces App Server
Deployment unit
Container image per service
Function zip per Lambda
Single .psa bundle for entire app
Supervision
Kubernetes restarts pods
None
Erlang-style supervision tree
Auth enforcement
API gateway / middleware
Custom authorizers
Runtime-level, declarative in manifest
Observability
Manual instrumentation
CloudWatch + X-Ray
Auto-instrumented, zero actor code
Resource limits
Container CPU/mem requests
Timeout + memory settings
Per-actor WASM-level enforcement
Multi-language
Per-container runtimes
Per-function runtimes
All actors in one WASM runtime
State
External (Redis/DB)
External
Built-in durable actor state
Cold start
Seconds
100ms–10s
~50?s (WASM)
FaaS and Serverless
Here is where PlexSpaces bridges the worlds of actor systems and serverless platforms. Every actor you deploy in any language doubles as a serverless function that you invoke over plain HTTP. No client SDK required. No message queue setup. Just HTTP.
HTTP Invocation Model
PlexSpaces exposes a FaaS-style API that routes HTTP requests to actors using a simple URL pattern:
/api/v1/actors/{tenant}/{namespace}/{actor_type}
The HTTP method determines the invocation pattern:
HTTP Method
Pattern
Behavior
GET
Request-reply (ask)
Sends query params as payload, waits for response
POST
Unicast message (tell)
Sends JSON body, returns immediately
PUT
Unicast message (tell)
Same as POST, for update semantics
DELETE
Request-reply (ask)
Sends query params, waits for confirmation
FaaS in Action
This Rust example shows a FaaS-style webhook handler that receives HTTP POST payloads and stores delivery history — the kind of thing you would build on AWS Lambda or Cloudflare Workers, but here using PlexSpaces SDK annotations:
Invoke this actor over HTTP — no SDK, no message queue, just curl:
# POST a webhook delivery (fire-and-forget)
curl -X POST http://localhost:8001/api/v1/actors/acme-corp/webhooks/webhook_handler \
-H "Content-Type: application/json" \
-d '{"event": "order.completed", "order_id": "ORD-12345"}'
# GET recent deliveries (request-reply)
curl "http://localhost:8001/api/v1/actors/acme-corp/webhooks/webhook_handler?action=list"
Multi-Tenant Isolation
The URL path embeds tenant and namespace for built-in multi-tenant isolation. Tenant acme-corp cannot access tenant globex-inc‘s actors. The framework enforces this boundary at the routing layer with JWT-based authentication:
# Tenant A's rate limiter
curl -X POST http://localhost:8001/api/v1/actors/acme-corp/api/rate-limiter \
-d '{"client_id": "user-123"}'
# Tenant B's rate limiter -- completely isolated state
curl -X POST http://localhost:8001/api/v1/actors/globex-inc/api/rate-limiter \
-d '{"client_id": "user-456"}'
How PlexSpaces Compares to Traditional FaaS
The critical difference: PlexSpaces actors retain state between invocations. Traditional FaaS platforms treat functions as stateless — you manage state externally in DynamoDB, Redis, or S3. PlexSpaces actors carry durable state inside the actor, persisted via journaling and checkpointing. This eliminates the “stateless function + external state store” tax that adds latency and complexity to every serverless application.
Capability
AWS Lambda
Cloudflare Workers
PlexSpaces FaaS
Cold start
100ms-10s
~5ms
~50us (WASM)
State
External (DynamoDB)
External (KV/D1)
Built-in (durable actors)
Polyglot
Per-runtime images
JS/WASM only
Rust, Go, TS, Python on same runtime
Coordination
SQS/Step Functions
Durable Objects
TupleSpace, process groups, workflows
Supervision
None
None
Erlang-style supervision trees
Isolation
Container/Firecracker
V8 isolates
WASM sandbox + optional Firecracker
PlexSpaces includes migration examples that show how to port existing Lambda functions, Step Functions workflows, Azure Durable Functions, Cloudflare Workers, and Orleans grains (See examples).
What WebAssembly Gives You
Let me address the obvious question: “Why not just use Docker?” Containers solve many problems well. But as Solomon Hykes, Docker’s creator, said in 2019 when WASI was first announced:
“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019
Startup time. A WASM module instantiates in microseconds. A container takes seconds. When you auto-scale actors in response to load spikes, microsecond cold starts mean your users never notice.
Memory footprint. A Rust WASM actor uses ~200KB. The equivalent Docker container starts at 50MB minimum (Alpine base image alone). On a single node, you run thousands of WASM actors where you might run dozens of containers.
Security isolation. WASM sandboxing is capability-based. A module cannot access the filesystem, network, or memory outside its sandbox unless the host explicitly grants each capability through WASI. Containers share a kernel and rely on namespace isolation — a fundamentally larger attack surface.
True polyglot. With containers, each language gets its own image, runtime, dependency tree, and deployment pipeline. With WASM, all languages produce the same artifact type, run on the same runtime, and share the same deployment pipeline.
Composability. The Component Model lets you link WASM modules from different languages into a single process. No network calls. No serialization overhead. Direct function invocation across language boundaries. Try that with Docker.
PlexSpaces actually supports both: WASM sandboxing for lightweight actors and Firecracker microVMs for workloads that need full hardware-level isolation. You pick the isolation model per workload, and the framework handles the rest.
Where This Is Heading
The WASM Ecosystem Roadmap
The ecosystem moves fast. Here are the milestones that matter:
Wasm 3.0 became the W3C standard in September 2025, standardizing nine production features including WasmGC, exception handling, tail calls, and SIMD
WASI 0.3 shipped in February 2026 with native async support — actors can now handle concurrent I/O without blocking
WASI 1.0 is on track for late 2026 or early 2027, providing the stability guarantees that enterprise adopters require
Wasmtime leads the runtime ecosystem with full Component Model and WASI 0.2 support
Wasmer 6.0 achieved ~95% of native speed on benchmarks
Docker now runs WASM components alongside containers in Docker Desktop and Docker Engine
The FaaS-Actor Convergence
The most consequential trend is the convergence of serverless FaaS platforms and stateful actor systems. Today these exist as separate categories — AWS Lambda handles stateless functions, Temporal handles durable workflows, Orleans handles virtual actors, and Erlang/OTP handles fault-tolerant supervision. PlexSpaces unifies them into a single abstraction. This convergence accelerates along three axes:
HTTP-native invocation. Every PlexSpaces actor is already a serverless function, callable over HTTP with automatic routing, multi-tenant isolation, and load balancing. As the WASM ecosystem matures, the cold start advantage (microseconds vs. seconds) makes WASM actors a compelling replacement for traditional Lambda functions, especially at the edge.
Durable serverless. Traditional FaaS treats functions as stateless. PlexSpaces combines serverless invocation with durable execution — actors retain state, the framework journals every message, and crash recovery replays the journal to restore exact state. This eliminates the “Lambda + DynamoDB + Step Functions” stack that every non-trivial serverless application ends up building.
Edge-native polyglot. WASM runs everywhere like cloud servers, edge nodes, IoT devices, even browsers. PlexSpaces actors compiled to WASM deploy to any environment that runs wasmtime. A Python ML model runs at the edge. A Rust event processor runs in the cloud. A TypeScript API actor runs in the CDN. All three communicate through the same framework, sharing state through tuple spaces and coordinating through process groups.
Get Started
PlexSpaces is open source. Clone the repository and start building:
git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces
# Quick setup (installs tools, builds, tests)
./scripts/setup.sh
# Or use Docker for the fastest path
docker pull plexobject/plexspaces:latest
docker run -d -p 8000:8000 -p 8001:8001 \
-e PLEXSPACES_NODE_ID=node1 \
plexobject/plexspaces:latest
# Explore the examples
ls examples/python/apps/ # calculator, bank_account, chat_room, nbody, ...
ls examples/typescript/apps/ # bank_account, migrating_cloudflare_workers, migrating_orleans
ls examples/go/apps/ # migrating_erlang_otp, migrating_cloudflare_workers, ...
ls examples/rust/apps/ # calculator, nbody, session_manager, ...
# Build and test everything
make all
Each example includes its own app-config.toml, build.sh script, and test instructions. The examples/ directory also contains migration guides from 24+ frameworks like Erlang/OTP, Temporal, Ray, Cloudflare Workers, Orleans, Restate, Azure Durable Functions, AWS Step Functions, wasmCloud, Dapr, and more.
I spent decades wrestling with the same distributed systems problems under different names on different stacks (see my previous blog). Fault tolerance, state management, multi-language support, coordination, serverless invocation, scaling. These problems never change, only the acronyms do. WebAssembly makes the polyglot piece real. The Component Model makes it composable. The application server model makes it deployable in a way that finally lets you focus on what you actually came to write: business logic.
I previously shared my experience with distributed systems over the last three decades that included IBM mainframes, BSD sockets, Sun RPC, CORBA, Java RMI, SOAP, Erlang actors, service meshes, gRPC, serverless functions, etc. Over the years, I kept solving the same problems in different languages, on different platforms, with different tooling. Each one of these frameworks taught me something essential but they also left something on the table. PlexSpaces pulls those lessons together into a single open-source framework: a polyglot application server that handles microservices, serverless functions, durable workflows, AI workloads, and high-performance computing using one unified actor abstraction. You write actors in Python, Rust, GO or TypeScript, compile them to WebAssembly, deploy them on-premises or in the cloud, and the framework handles persistence, fault tolerance, observability, and scaling. No service mesh. No vendor lock-in. Same binary on your laptop and in production.
Why Now?
Three things converged over the last few years that made this the right moment to build PlexSpaces:
WebAssembly matured. Though WebAssembly ecosystem is still evolving but WASI has stabilized enough to run real server workloads. Java promised “Write Once, Run Anywhere” — WASM actually delivers it. Docker’s creator Solomon Hykes captured it in 2019: “If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker.” Today that future has arrived.
AI agents exploded. Every AI agent is fundamentally an actor: it maintains state (conversation history), processes messages (user queries), calls tools (side effects), and needs fault tolerance (LLM APIs fail). The actor model maps naturally to agent orchestration but existing frameworks either lack durability, lock you to one language, or require separate infrastructure.
Multi-cloud pressure intensified. I’ve watched teams at multiple companies build on AWS in production but struggle to develop locally. Bugs surface only after deployment because Lambda, DynamoDB, and SQS behave differently from their local mocks/simulators. Modern enterprises need code that runs identically on a developer’s laptop, on-premises, and in any cloud.
PlexSpaces addresses all three: polyglot via WASM, actor-native for AI workloads, and local-first by design.
The Lessons That Shaped PlexSpaces
Every era of distributed computing burned a lesson into my thinking. Here’s what stuck and how I applied each lesson to PlexSpaces.
Efficiency runs deep: When I programmed BSD sockets in C, I controlled every byte on the wire. That taught me to respect the transport layer. Applied: PlexSpaces uses gRPC and Protocol Buffers for binary communication not because JSON is bad, but because high-throughput systems deserve binary protocols with proper schemas.
Contracts prevent chaos: Sun RPC introduced me to XDR and rpcgen for defining a contract, generate the code. CORBA reinforced this with IDL. I have seen countless times where teams sprinkle Swagger annotations on code and assumes that they have APIs, which keep growing without any standards, developer experience or consistency. Applied: PlexSpaces follows a proto-first philosophy – every API lives in Protocol Buffers, every contract generates typed stubs across languages (See OpenAPI specs for grpc/http services).
Parallelism needs multiple primitives: During my PhD research, I built JavaNow – a parallel computing framework that combined Linda-style tuple spaces, MPI collective operations, and actor-based concurrency on networks of workstations. That research taught me something frameworks keep forgetting: different coordination problems need different primitives. You can’t force everything through message passing alone. Applied: PlexSpaces provides actors and tuple spaces and channels and process groups because real systems need all of them.
Developer experience decides adoption: Java RMI made remote objects feel local. JINI added service discovery. Then J2EE and EJB buried developer hearts under XML configuration. Applied: PlexSpaces SDK provides decorator-based development (Python), inheritance-based development (TypeScript), and annotation-based development (Rust) to eliminate boilerplate.
Simplicity defeats complexity every time: With SOAP, WSDL, EJB, J2EE, I watched the Java enterprise ecosystem collapse under its own weight. REST won not because it was more powerful, but because it was simpler. Applied: One actor abstraction with composable capabilities beats a zoo of specialized types.
Cross-cutting concerns belong in the platform: Spring and AOP taught me to handle observability, security, and throttling consistently. But microservices in polyglot environments broke that model. Service meshes like Istio and Dapr tried to fix it with sidecar proxies but it requires another networking hop, another layer of YAML to debug. Applied: PlexSpaces bakes these concerns directly into the runtime. No service mesh. No extra hops.
Serverless is the right idea with the wrong execution: AWS Lambda showed me the future: auto-scaling, built-in observability, zero server management. But Lambda also showed me the problem: vendor lock-in, cold starts, and the inability to run locally. Applied: PlexSpaces delivers serverless semantics that run identically on your laptop and in the cloud.
Application servers got one thing right: Despite all the complexity of J2EE, I loved one idea: the application server that hosts multiple applications. You deployed WAR files to Tomcat, and it handled routing, lifecycle, and shared services. That model survived even after EJB died. Applied: PlexSpaces revives this concept for the polyglot serverless era where you can deploy Python ML models, TypeScript webhooks, and Rust performance-critical code to the same node.
I also built formicary, a framework for durable executions with graph-based workflow processing. That experience directly shaped PlexSpaces’ workflow and durability abstractions.
What PlexSpaces Actually Does
PlexSpaces combines five foundational pillars into a unified distributed computing platform:
TupleSpace Coordination (Linda Model): Decouples producers and consumers through associative memory. Actors write tuples, read them by pattern, and never need to know who’s on the other side.
Durable Execution: Every actor operation gets journaled. When a node crashes, the framework replays the journal and restores state exactly. Side effects get cached during replay, so external calls don’t fire twice. Inspired by Restate and my earlier work on formicary.
WASM Runtime: Actors compile to WebAssembly and run in a sandboxed environment. Python, TypeScript, Rust with same deployment model, same security guarantees.
Firecracker Isolation: For workloads that need hardware-level isolation, PlexSpaces supports Firecracker microVMs alongside WASM sandboxing.
Core Abstractions: Actors, Behaviors, and Facets
One Actor to Rule Them All
PlexSpaces follows a design principle I arrived at after years of watching frameworks proliferate actor types: one powerful abstraction with composable capabilities beats multiple specialized types. Every actor in PlexSpaces maintains private state, processes messages sequentially (eliminating race conditions), operates transparently across local and remote boundaries, and recovers automatically through supervision.
Actor Lifecycle
Actors move through a well-defined lifecycle — one of the details that distinguishes PlexSpaces from simpler actor frameworks:
PlexSpaces supports Virtual actors (with VirtualActorFacet inspired by Orleans Actor Model) leverage this lifecycle automatically, which activate on first message, deactivate after idle timeout, and reactivate transparently on the next message. No manual lifecycle management.
Tell vs Ask: Two Message Patterns
PlexSpaces supports two fundamental communication patterns:
Tell (asynchronous): The sender dispatches a message and moves on. Use this for events, notifications, and one-way commands.
Ask (request-reply): The sender dispatches a request and waits for a response with a timeout. Use this for queries and operations that need confirmation.
from plexspaces import actor, handler, host
@actor
class OrderService:
@handler("place_order")
def place_order(self, order: dict) -> dict:
# Tell: fire-and-forget notification to analytics
host.tell("analytics-actor", "order_placed", order)
# Ask: request-reply to inventory service (5s timeout)
inventory = host.ask("inventory-actor", "check_stock",
{"sku": order["sku"]}, timeout_ms=5000)
if inventory["available"]:
return {"status": "confirmed", "order_id": order["id"]}
return {"status": "out_of_stock"}
Behaviors: Compile-Time Patterns
Behaviors define how an actor processes messages. You choose a behavior at compile time:
Behavior
Annotation
Pattern
Best For
Default
@actor
Message-based
General purpose
GenServer
@gen_server_actor
Request-reply
Stateful services, CRUD
GenEvent
@event_actor
Fire-and-forget
Event processing, logging
GenFSM
@fsm_actor
State machine
Order processing, approval flows
Workflow
@workflow_actor
Durable orchestration
Long-running processes
Facets: Runtime Capabilities
Facets attach dynamic capabilities to actors without changing the actor type. I wrote about the pattern of dynamic facets and runtime composition previously. This allows adding dynamic behavior through facets, combined with Erlang’s static behavior model. Think of facets as middleware that wraps your actor. They execute in priority order like security facets fire first, then logging, then metrics, then your business logic, then persistence:
Facets compose freely, e.g., add facets=["durability", "timer", "metrics"] and your actor gains persistence, scheduled execution, and Prometheus metrics with zero additional code.
Custom Facets: Extending the Framework
The facet system opens for extension. You can build domain-specific facets and register them with the framework:
use plexspaces_core::{Facet, FacetError, InterceptResult};
pub struct FraudDetectionFacet {
threshold: f64,
}
#[async_trait]
impl Facet for FraudDetectionFacet {
fn name(&self) -> &str { "fraud_detection" }
fn priority(&self) -> u32 { 200 } // Run after security, before domain logic
async fn before_method(
&mut self, method: &str, payload: &[u8]
) -> Result<InterceptResult, FacetError> {
let score = self.score_transaction(payload).await?;
if score > self.threshold {
return Err(FacetError::Custom("fraud_detected".into()));
}
Ok(InterceptResult::Continue)
}
}
Register it once, attach it to any actor by name. This extensibility distinguishes PlexSpaces from frameworks with fixed capability sets.
Hands-On: Building Actors in Three Languages
Let me show you how PlexSpaces works in practice across all three SDKs.
The SDK eliminates over 100 lines of WASM boilerplate. You declare state with state(), mark handlers with @handler, and return dictionaries. The framework handles serialization, lifecycle, and state management.
# Build Python actor to WebAssembly
plexspaces-py build counter_actor.py -o counter.wasm
# Deploy to a running node
curl -X POST http://localhost:8094/api/v1/deploy \
-F "namespace=default" \
-F "actor_type=counter" \
-F "wasm=@counter.wasm"
# Invoke via HTTP — FaaS-style (POST = tell, GET = ask)
curl -X POST "http://localhost:8080/api/v1/actors/default/default/counter" \
-H "Content-Type: application/json" \
-d '{"action":"increment","amount":5}'
# Request-reply on GET
curl "http://localhost:8080/api/v1/actors/default/default/counter" \
-H "Content-Type: application/json"
# => {"count": 5}
That’s it. No Kubernetes manifests. No Terraform. No sidecar containers. Deploy a WASM module, invoke it over HTTP. The same endpoint works as an AWS Lambda Function URL.
Durable Execution: Crash and Recover Without Losing State
Durable execution solves a problem I’ve encountered at every company I’ve worked for: what happens when a node crashes mid-operation?
PlexSpaces journals every actor operation, when messages received, side effects executed, state changes applied. When a node crashes and restarts, the framework loads the latest checkpoint and replays journal entries from that point. Side effects return cached results during replay, so external API calls don’t fire twice.
Example: A Durable Bank Account
from plexspaces import actor, state, handler
@actor(facets=["durability"])
class BankAccount:
balance: int = state(default=0)
transactions: list = state(default_factory=list)
@handler("deposit")
def deposit(self, amount: int = 0) -> dict:
self.balance += amount
self.transactions.append({
"type": "deposit", "amount": amount,
"balance_after": self.balance
})
return {"status": "ok", "balance": self.balance}
@handler("withdraw")
def withdraw(self, amount: int = 0) -> dict:
if amount > self.balance:
return {"status": "insufficient_funds", "balance": self.balance}
self.balance -= amount
self.transactions.append({
"type": "withdraw", "amount": amount,
"balance_after": self.balance
})
return {"status": "ok", "balance": self.balance}
@handler("replay")
def replay_transactions(self) -> dict:
"""Rebuild balance from transaction log to verify consistency."""
rebuilt = 0
for tx in self.transactions:
rebuilt += tx["amount"] if tx["type"] == "deposit" else -tx["amount"]
return {
"replayed": len(self.transactions),
"rebuilt_balance": rebuilt,
"current_balance": self.balance,
"consistent": rebuilt == self.balance
}
Adding facets=["durability"] activates journaling and checkpointing. If the node crashes after processing ten deposits, the framework restores all ten sono data loss, no duplicate charges. Periodic checkpoints accelerate recovery by 90%+ and the framework loads the latest snapshot and replays only recent entries.
Data-Parallel Actors: Worker Pools and Scatter-Gather
When I built JavaNow during my PhD, I implemented MPI-style scatter-gather and parallel map operations. PlexSpaces brings these patterns to production through ShardGroups adata-parallel actor pools inspired by the DPA paper. A ShardGroup partitions data across multiple actor shards and supports three core operations:
Bulk Update: Routes writes to the correct shard based on a partition key (hash, consistent hash, or range)
Parallel Map: Queries all shards simultaneously and collects results
Scatter-Gather: Broadcasts a query and aggregates responses with fault tolerance
Example: Data-Parallel Worker Pool with Scatter-Gather
This pattern comes from the PlexSpaces examples. Each worker actor in the ShardGroup holds a partition of state and processes tasks independently and the framework handles routing, fan-out, and aggregation:
The #[handler("*")] wildcard routes all messages to a single dispatch method — the worker decides what to do based on the action field. Each worker tracks its own processing statistics, so you can identify hot shards or slow workers.
The orchestration code shows all three data-parallel operations in sequence including bulk update, parallel map, and parallel reduce:
// Create a pool of 20 workers with hash-based partitioning
let pool_id = client.create_worker_pool(
"worker-pool-1", "worker", 20,
PartitionStrategy::PartitionStrategyHash,
HashMap::new(),
).await?;
// Bulk update: route 10,000 messages to the right shard by key
let mut updates = HashMap::new();
for i in 0..10_000 {
let key = format!("key-{:05}", i);
updates.insert(key.clone(), json!({ "action": "set", "key": key, "value": i }));
}
client.parallel_update(&pool_id, updates,
ConsistencyLevel::ConsistencyLevelEventual, false).await?;
// Parallel map: query every worker simultaneously
let results = client.parallel_map(&pool_id,
json!({ "action": "get_total_count" })).await?;
// => 20 responses, one per worker, each with its partition's total
// Parallel reduce: aggregate stats across all workers
let stats = client.parallel_reduce(&pool_id,
json!({ "action": "stats" }),
ShardGroupAggregationStrategy::ShardGroupAggregationConcat, 20).await?;
// => Combined stats: tasks_processed, avg_processing_time_ms per worker
parallel_update routes each key to its shard via consistent hashing: 10,000 messages fan out across 20 workers without the caller managing any routing logic. parallel_map broadcasts a query to every shard and collects results. parallel_reduce does the same but aggregates the responses using a configurable strategy (concat, sum, merge). This maps directly to distributed ML (partition model parameters across shards, push gradient updates through parallel_update, collect the full parameter set via parallel_map) or any workload that benefits from partitioned state with scatter-gather queries.
TupleSpace: Linda’s Associative Memory for Coordination
During my PhD work on JavaNow, I was blown away by the simplicity of Linda’s tuple space model for writing data flow based applications for coordination with different actors. The actors communicate through direct message passing, tuple spaces provide associative shared memory where producers write tuples, consumers read or take them with blocking or non-blocking patterns. This decouples components in three dimensions: spatial (actors don’t need references to each other), temporal (producers and consumers don’t need to run simultaneously), and pattern-based (consumers retrieve data by structure, not by address).
from plexspaces import actor, handler, host
import json
@actor
class OrderProducer:
@handler("create_order")
def create_order(self, order_id: str, items: list) -> dict:
# Write a tuple — any consumer can pick it up
host.ts_write(json.dumps(["order", order_id, "pending", items]))
return {"status": "created", "order_id": order_id}
@actor
class OrderProcessor:
@handler("process_next")
def process_next(self) -> dict:
# Take the next pending order (destructive read — removes from space)
pattern = json.dumps(["order", None, "pending", None]) # Wildcards
result = host.ts_take(pattern)
if result:
data = json.loads(result)
order_id = data[1]
# Process order, then write completion tuple
host.ts_write(json.dumps(["order", order_id, "completed", data[3]]))
return {"processed": order_id}
return {"status": "no_pending_orders"}
I use TupleSpace heavily for dataflow pipelines: each stage writes results as tuples, and downstream stages pick them up by pattern. Stages can run at different speeds, on different nodes, in different languages. The tuple space absorbs the mismatch.
Batteries Included: Everything You Need, Built In
At every company I’ve worked at, the first three months after adopting a framework go to integrating storage, messaging, and locks. PlexSpaces ships all of these as built-in services in the same codebase, no extra infrastructure, no service mesh.
PlexSpaces uses adapters pattern to plug different implementation of channels, object-registry, tuple-space based on config. For example, PlexSpaces auto-selects the best available backend for channel using a priority chain and availability (Kafka -> SQS -> NATS -> ProcessGroup -> UDP Multicast -> InMemory). Start developing with in-memory channels, deploy to production with Kafka without code changes. Actors using non-memory channels also support graceful shutdown: they stop accepting new messages but complete in-progress work.
Multi-Tenancy: Enterprise-Grade Isolation
PlexSpaces enforces two-level tenant isolation. The tenant_id comes from JWT tokens (HTTP) or mTLS certificates (gRPC). The namespace provides sub-tenant isolation for environments/applications. All queries filter by tenant automatically at the repository layer. This gives you secure multi-tenant deployments without trusting application code to enforce boundaries.
Example: Payment Processing with Built-In Services
from plexspaces import actor, handler, host
@actor(facets=["durability", "metrics"])
class PaymentProcessor:
@handler("process_refund")
def process_refund(self, tx_id: str, amount: int) -> dict:
# Distributed lock prevents duplicate refunds
lock_version = host.lock_acquire(f"refund:{tx_id}", 5000)
if not lock_version:
return {"error": "could_not_acquire_lock"}
try:
# Store refund record in built-in key-value store
host.kv_put(f"refund:{tx_id}", json.dumps({
"amount": amount, "status": "processed"
}))
return {"status": "refunded", "amount": amount}
finally:
host.lock_release(f"refund:{tx_id}", lock_version)
No Redis cluster to manage. No DynamoDB table to provision. The framework handles it.
Process Groups: Erlang pg2-Style Communication
Process groups provide distributed pub/sub and group messaging, which is one of Erlang’s most powerful patterns. Here’s a chat room that demonstrates joining, broadcasting, and member queries:
Groups support topic-based subscriptions within groups and scope automatically by tenant_id and namespace.
Polyglot Development: One Server, Many Languages
A single PlexSpaces node hosts actors written in different languages simultaneously: Python ML models, TypeScript webhook handlers, and Rust performance-critical paths sharing the same actor runtime, storage services, and observability stack:
Same WASM module deploys anywhere: no Docker images, no container registries, no “it works on my machine”:
# Build and deploy to on-premises
plexspaces-py build ml_model.py -o ml_model.wasm
curl -X POST http://on-prem:8094/api/v1/deploy \
-F "namespace=prod" -F "actor_type=ml_model" -F "wasm=@ml_model.wasm"
# Deploy to cloud — same command, same binary
curl -X POST http://cloud:8094/api/v1/deploy \
-F "namespace=prod" -F "actor_type=ml_model" -F "wasm=@ml_model.wasm"
Common Patterns
Over three decades, I’ve watched the same architectural patterns emerge at every company and every scale. PlexSpaces supports the most important ones natively.
Durable Workflows with Signals and Queries
Long-running processes with automatic recovery, external signals, and read-only queries — think order fulfillment, onboarding flows, or CI/CD pipelines:
from plexspaces import workflow_actor, state, run_handler, signal_handler, query_handler
@workflow_actor(facets=["durability"])
class OrderWorkflow:
order_id: str = state(default="")
status: str = state(default="pending")
steps_completed: list = state(default_factory=list)
@run_handler
def run(self, input_data: dict) -> dict:
"""Main execution — exclusive, one at a time."""
self.order_id = input_data.get("order_id", "")
self.status = "validating"
self.steps_completed.append("validation")
self.status = "charging"
self.steps_completed.append("payment")
self.status = "shipping"
self.steps_completed.append("shipment")
self.status = "completed"
return {"status": "completed", "order_id": self.order_id}
@signal_handler("cancel")
def on_cancel(self, data: dict) -> None:
"""External signals can alter workflow state."""
self.status = "cancelled"
@query_handler("status")
def get_status(self) -> dict:
"""Read-only queries can run concurrently with execution."""
return {"order_id": self.order_id, "status": self.status,
"steps": self.steps_completed}
Staged Event-Driven Architecture (SEDA)
Chain processing stages through channels. Each stage runs at its own pace, and channels provide natural backpressure:
Leader Election
Distributed locks elect a leader with lease-based failover. The leader holds a lock and renews it periodically. If the leader crashes, the lease expires and another candidate acquires leadership:
@actor
class LeaderElection:
candidate_id: str = state(default="")
lock_version: str = state(default="")
@handler("try_lead")
def try_lead(self, candidate_id: str = None) -> dict:
holder_id = candidate_id or self.candidate_id
result = host.lock_acquire("", "leader-election", holder_id, "leader", 30, 0)
if result and not result.startswith("ERROR"):
self.lock_version = json.loads(result).get("version", result)
return {"leader": True, "candidate_id": holder_id}
return {"leader": False}
Resource-Based Affinity
Label actors with hardware requirements (gpu: true, memory: high) and PlexSpaces schedules them on matching nodes. This maps naturally to ML training pipelines where different stages need different hardware.
Cellular Architecture
PlexSpaces organizes nodes into cells using the SWIM protocol (gossip-based node discovery). Cells provide fault isolation, geographic distribution, and low-latency routing to the nearest cell. Nodes within a cell share channels via the cluster_name configuration, enabling UDP multicast for low-latency cluster-wide messaging.
How PlexSpaces Compares
PlexSpaces doesn’t replace any single framework, it unifies patterns from many. Here’s what it borrows from each, and what limitation of each it addresses:
Framework
What PlexSpaces Borrows
Limitation PlexSpaces Addresses
Erlang/OTP
GenServer, supervision, “let it crash”
BEAM-only; no polyglot WASM
Akka
Actor model, message passing
No longer open source; JVM-only
Orleans
Virtual actors, grain lifecycle
.NET-only; no tuple spaces or HPC
Temporal
Durable workflows, replay
Requires separate server infrastructure
Restate
Durable execution, journaling
No full actor model; no HPC patterns
Ray
Distributed ML, parameter servers
Python-centric; no durable execution
AWS Lambda
Serverless invocation, auto-scaling
Vendor lock-in; no local dev parity
Azure Durable Functions
Durable orchestration
Azure-only; limited language support
Golem Cloud
WASM-based durability
No built-in storage/messaging/locks
Dapr
Sidecar service mesh, virtual actors
Extra networking hop; state management limits
Key Differentiators
No service mesh: Built-in observability, security, and throttling eliminate the extra networking hop
Local-first: Same code runs on your laptop and in production. No cloud-only surprises.
Polyglot via WASM: Write actors in Python, Rust, TypeScript. Same deployment model.
Batteries included: KV store, blob storage, locks, channels, process groups — all built in
One abstraction: Composable facets on a unified actor, not a zoo of specialized types
Application server model: Deploy multiple polyglot applications to a single node
Research-grade + production-ready: Linda tuple spaces, MPI patterns, and Erlang supervision in a single framework
Getting Started
Install and Run
# Docker (fastest)
docker run -p 8080:8080 -p 8000:8000 -p 8001:8001 plexobject/plexspaces:latest
# From source
git clone https://github.com/bhatti/PlexSpaces.git
cd PlexSpaces && make build
Explore more in the examples directory: bank accounts with durability, task queues with distributed locks, leader election, chat rooms with process groups, and more.
Lessons Learned
After decades of distributed systems, I keep returning to the same truths:
Efficiency matters. Respect the transport layer. Binary protocols with schemas outperform JSON for high-throughput systems.
Contracts prevent chaos. Define APIs before implementations. Generate code from schemas.
Simplicity defeats complexity. Every framework that collapsed like EJB, SOAP, CORBA did under the weight of accidental complexity. One powerful abstraction beats ten specialized ones.
Developer experience decides adoption. If your framework requires 100 lines of boilerplate for a counter, developers will choose the one that needs 15.
Local and production must match. Every bug I’ve seen that “only happens in production” stemmed from environmental differences.
Cross-cutting concerns belong in the platform. Scatter them across codebases and you get inconsistency. Centralize them in a service mesh and you get latency. Build them in.
Multiple coordination primitives solve multiple problems. Actors handle request-reply. Channels handle pub/sub. Tuple spaces handle coordination. Process groups handle broadcast. Real systems need all of them.
The distributed systems landscape keeps changing as WASM is maturing, AI agents are creating new coordination challenges, and enterprises are pushing back on vendor lock-in harder than ever. I believe the next generation of frameworks will converge on the patterns PlexSpaces brings together: polyglot runtimes, durable actors, built-in infrastructure, and local-first deployment. PlexSpaces distills years of lessons into a single framework. It’s the framework I wished existed at every company I’ve worked for that handles the infrastructure so I can focus on the problem.
Here are the programming languages I’ve used over the last three decades. From BASIC in the late 80s to Rust today, each one taught me something about solving problems with code.
Late 1980s – Early 1990s
I learned coding with BASIC/QUICKBASIC on Atari and later IBM XT computer in 1980s.
I learned other languages in college or on my own like C, Pascal, Prolog, Lisp, FORTRAN and PERL.
In college, I used Icon to build compilers.
My first job was mainframe work and I used COBOL and CICS for applications, JCL and REXX for scripting and SAS for data processing.
Later at a physics lab, I used C/C++, Fortran for applications and Python for scripting and glue language.
I used a number of 4GL languages like dBase, FoxPro, Paradox, Delphi. Later I used Visual Basic and PowerBuilder for building client applications.
I used SQL and PL/SQL throughout my career for relational databases.
Web Era Mid/Late 1990s
The web era introduced a number of new languages like HTML, Javascript, CSS, ColdFusion, and Java.
I used XML/XSLT/XPath/XQuery, PHP, VBScript and ActionScript.
I used RSS/SPARQL/RDF for buiding semantic web applications.
I used IDL/CORBA for building distributed systems.
Mobile/Services Era 2000s
I used Ruby for building web applications, Erlang/Elixir for building concurrent applications.
I used Groovy for writing tests and R for data analysis.
When iOS was released, I used Objective-C to build mobile applications.
In this era, functional languages gained popularity and I used Scala/Haskell/Clojure for some projects.
New Languages Era Mid 2010s
I started using Go for networking/concurrent applications.
I started using Swift for iOS applications and Kotlin for Android apps.
I initially used Flow language from Facebook but then started using TypeScript instead of JavaScript.
I used Dart for Flutter applications.
I used GraphQL for some of client friendly backend APIs.
I used Solidity for Ethereum smart contracts.
I used Lua as a glue language with Redis, HAProxy and other similar systems.
I used Rust and became my go to language for highly performant applications.
What Three Decades of Languages Taught Me
Every language is a bet on what matters most: Safety vs. speed vs. expressiveness vs. ecosystem vs. hiring.
Languages don’t die, they fade: I still see COBOL in production. I still debug Perl scripts. Legacy is measured in decades.
The fundamentals never change: Whether it’s BASIC or Rust, you’re still managing state, controlling flow, and abstracting complexity.
Polyglotism is a superpower: Each language teaches you a different way to think. Functional programming makes you better at OOP. Systems programming makes you better at scripting.
The best language is the one your team can maintain: I’ve seen beautiful Scala codebases become liabilities and ugly PHP applications become billion-dollar businesses.
What’s Next?
I’m watching Zig (Rust without the complexity?) and it’s on my list for next language to learn.
I’ve spent the last year building AI agents in enterprise environments. During this time, I’ve extensively applied emerging standards like Model Context Protocol (MCP) from Anthropic and the more recent Agent-to-Agent (A2A) Protocol for agent communication and coordination. What I’ve learned: there’s a massive gap between building a quick proof-of-concept with these protocols and deploying a production-grade system. The concerns that get overlooked in production deployments are exactly what will take you down at 3 AM:
Multi-tenant isolation with row-level security (because one leaked document = lawsuit)
JWT-based authentication across microservices (no shared sessions, fully stateless)
Real-time observability of agent actions (when agents misbehave, you need to know WHY)
Cost tracking and budgeting per user and model (because OpenAI bills compound FAST)
Graceful degradation when embeddings aren’t available (real data is messy)
Integration testing against real databases (mocks lie to you)
Disregarding security concerns can lead to incidents like the Salesloft breach where their AI chatbot inadvertently stored authentication tokens for hundreds of services, which exposed customer data across multiple platforms. More recently in October 2025, Filevine (a billion-dollar legal AI platform) exposed 100,000+ confidential legal documents through an unauthenticated API endpoint that returned full admin tokens to their Box filesystem. No authentication required, just a simple API call. I’ve personally witnessed security issues from inadequate AuthN/AuthZ controls and cost overruns exceeding hundreds of thousands of dollars, which are preventable with proper security and budget enforcement.
The good news is that MCP and A2A protocols provide the foundation to solve these problems. Most articles treat these as competing standards but they are complementary. In this guide, I’ll show you exactly how to combine MCP and A2A to build a system that handles real production concerns: multi-tenancy, authentication, cost control, and observability.
Reference Implementation
To demonstrate these concepts in action, I’ve built a reference implementation that showcases production-ready patterns.
Architecture Philosophy:
Three principles guided every decision:
Go for servers, Python for workflows – Use the right tool for each job. Go handles high-throughput protocol servers. Python handles AI workflows.
Database-level security – Multi-tenancy enforced via PostgreSQL row-level security (RLS), not application code. Impossible to bypass accidentally.
Stateless everything – Every service can scale horizontally. No sticky sessions, no shared state, no single points of failure.
All containerized, fully tested, and ready for production deployment.
But before we dive into the implementation, let’s understand the fundamental problem these protocols solve and why you need both.
Part 1: Understanding MCP and A2A
The Core Problem: Integration Chaos
Prior to MCP protocol in 2024, you had to build custom integration with LLM providers, data sources and AI frameworks. Every AI application had to reinvent authentication, data access, and orchestration, whichdoesn’t scale. MCP and A2A emerged to solve different aspects of this chaos:
The MCP Side: Standardized Tool Execution
Think of MCP as a standardized toolbox for AI models. Instead of every AI application writing custom integrations for databases, APIs, and file systems, MCP provides a JSON-RPC 2.0 protocol that models use to:
“MCP excels at synchronous, stateless tool execution. It’s perfect when you need an AI model to retrieve information, execute a function, and return results immediately.”
The server executes the tool and returns results. Simple, stateless, fast.
Why JSON-RPC 2.0? Because it’s:
Language-agnostic – Works with any language that speaks HTTP
Batch-capable – Multiple requests in one HTTP call
Error-standardized – Consistent error codes across implementations
Widely adopted – 20+ years of production battle-testing
The A2A Side: Stateful Workflow Orchestration
A2A handles what MCP doesn’t: multi-step, stateful workflows where agents collaborate. From the A2A Protocol docs:
“A2A is designed for asynchronous, stateful orchestration of complex tasks that require multiple steps, agent coordination, and long-running processes.”
A2A provides:
Task creation and management with persistent state
Real-time streaming of progress updates (Server-Sent Events)
Agent coordination across multiple services
Artifact management for intermediate results
Why Both Protocols Matter
Here’s a real scenario from my fintech work that illustrates why you need both:
Use Case: Compliance analyst needs to research a company across 10,000 documents, verify regulatory compliance, cross-reference with SEC filings, and generate an audit-ready report.
“Use MCP when you need fast, stateless tool execution. Use A2A when you need complex, stateful orchestration. Use both when building production systems.”
Part 2: Architecture
System Overview
Key Design Decisions
Protocol Servers (Go):
MCP Server – Secure document retrieval with pgvector and hybrid search. Go’s concurrency model handles 5,000+ req/sec, and its type safety catches integration bugs at compile time (not at runtime).
A2A Server – Multi-step workflow orchestration with Server-Sent Events for real-time progress tracking. Stateless design enables horizontal scaling.
AI Workflows (Python):
LangGraph Workflows – RAG, research, and hybrid pipelines. Python was the right choice here because the AI ecosystem (LangChain, embeddings, model integrations) lives in Python.
PostgreSQL with pgvector – Multi-tenant document storage with row-level security policies enforced at the database level (not application level)
Ollama – Local LLM inference for development and testing (no OpenAI API keys required)
DatabaseSecurity:
Application-level tenant filtering for database is not enough so row-level security policies are enforced:
// ? BAD: Application-level filtering (can be bypassed)
func GetDocuments(tenantID string) ([]Document, error) {
query := "SELECT * FROM documents WHERE tenant_id = ?"
// What if someone forgets the WHERE clause?
// What if there's a SQL injection?
// What if a bug skips this check?
}
-- ? GOOD: Database-level Row-Level Security (impossible to bypass)
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
Every query automatically filters by tenant so there is no way to accidentally leak data. Even if your application has a bug, the database enforces isolation.
JWT Authentication
MCP server and UI share RSA keys for token verification, which provides:
Asymmetric: MCP server only needs public key (can’t forge tokens)
Rotation: Rotate private key without redeploying services
The reference implementation (hybrid_search.go) uses PostgreSQL’s full-text search (BM25-like) combined with pgvector:
// Hybrid search query using Reciprocal Rank Fusion
query := `
WITH bm25_results AS (
SELECT
id,
ts_rank_cd(
to_tsvector('english', title || ' ' || content),
plainto_tsquery('english', $1)
) AS bm25_score,
ROW_NUMBER() OVER (ORDER BY ts_rank_cd(...) DESC) AS bm25_rank
FROM documents
WHERE to_tsvector('english', title || ' ' || content) @@ plainto_tsquery('english', $1)
),
vector_results AS (
SELECT
id,
1 - (embedding <=> $2) AS vector_score,
ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS vector_rank
FROM documents
WHERE embedding IS NOT NULL
),
combined AS (
SELECT
COALESCE(b.id, v.id) AS id,
-- Reciprocal Rank Fusion score
(
COALESCE(1.0 / (60 + b.bm25_rank), 0) * $3 +
COALESCE(1.0 / (60 + v.vector_rank), 0) * $4
) AS combined_score
FROM bm25_results b
FULL OUTER JOIN vector_results v ON b.id = v.id
)
SELECT * FROM combined
ORDER BY combined_score DESC
LIMIT $7
`
Why Reciprocal Rank Fusion (RRF)? Because:
Score normalization: BM25 scores and vector similarities aren’t comparable
Rank-based: Uses position, not raw scores
Research-backed: Used by search engines (Elasticsearch, Vespa)
Tunable: Adjust k parameter (60 in our case) for different behaviors
Part 3: The MCP Server – Secure Document Retrieval
Understanding JSON-RPC 2.0
Before we dive into implementation, let’s understand why MCP chose JSON-RPC 2.0.
Here’s the complete hybrid search tool (hybrid_search.go) implementation with detailed comments:
// mcp-server/internal/tools/hybrid_search.go
type HybridSearchTool struct {
db database.Store
}
func (t *HybridSearchTool) Execute(ctx context.Context, args map[string]interface{}) (protocol.ToolCallResult, error) {
// 1. AUTHENTICATION: Extract tenant from JWT claims
// This happens at middleware level, but we verify here
tenantID, ok := ctx.Value(auth.ContextKeyTenantID).(string)
if !ok {
return protocol.ToolCallResult{IsError: true}, fmt.Errorf("tenant ID not found in context")
}
// 2. PARAMETER PARSING: Extract and validate arguments
query, _ := args["query"].(string)
if query == "" {
return protocol.ToolCallResult{IsError: true}, fmt.Errorf("query is required")
}
limit, _ := args["limit"].(float64)
if limit <= 0 {
limit = 10 // default
}
if limit > 50 {
limit = 50 // max cap
}
bm25Weight, _ := args["bm25_weight"].(float64)
vectorWeight, _ := args["vector_weight"].(float64)
// 3. WEIGHT NORMALIZATION: Ensure weights sum to 1.0
if bm25Weight == 0 && vectorWeight == 0 {
bm25Weight = 0.5
vectorWeight = 0.5
}
// 4. EMBEDDING GENERATION: Using Ollama for query embedding
var embedding []float32
if vectorWeight > 0 {
embedding = generateEmbedding(query) // Calls Ollama API
}
// 5. DATABASE QUERY: Execute hybrid search with RLS
params := database.HybridSearchParams{
Query: query,
Embedding: embedding,
Limit: int(limit),
BM25Weight: bm25Weight,
VectorWeight: vectorWeight,
}
results, err := t.db.HybridSearch(ctx, tenantID, params)
if err != nil {
return protocol.ToolCallResult{IsError: true}, err
}
// 6. RESPONSE FORMATTING: Convert to JSON for client
jsonData, _ := json.Marshal(results)
return protocol.ToolCallResult{
Content: []protocol.ContentBlock{{Type: "text", Text: string(jsonData)}},
IsError: false,
}, nil
}
The NULL Embedding Problem
Real-world data is messy. Not every document has an embedding. Here’s what happened:
Initial Implementation (Broken):
// ? This crashes with NULL embeddings
var embedding pgvector.Vector
err = tx.QueryRow(ctx, query, docID).Scan(
&doc.ID,
&doc.TenantID,
&doc.Title,
&doc.Content,
&doc.Metadata,
&embedding, // CRASH: can't scan <nil> into pgvector.Vector
&doc.CreatedAt,
&doc.UpdatedAt,
)
Error:
can't scan into dest[5]: unsupported data type: <nil>
The Fix (Correct):
// ? Use pointer types for nullable fields
var embedding *pgvector.Vector // Pointer allows NULL
err = tx.QueryRow(ctx, query, docID).Scan(
&doc.ID,
&doc.TenantID,
&doc.Title,
&doc.Content,
&doc.Metadata,
&embedding, // Can be NULL now
&doc.CreatedAt,
&doc.UpdatedAt,
)
// Handle NULL embeddings gracefully
if embedding != nil && embedding.Slice() != nil {
doc.Embedding = embedding.Slice()
} else {
doc.Embedding = nil // Explicitly set to nil
}
return doc, nil
Hybrid search handles this elegantly—documents without embeddings get vector_score = 0 but still appear in results if they match BM25:
-- Hybrid search handles NULL embeddings gracefully
WITH bm25_results AS (
SELECT id, ts_rank(to_tsvector('english', content), query) AS bm25_score
FROM documents
WHERE to_tsvector('english', content) @@ query
),
vector_results AS (
SELECT id, 1 - (embedding <=> $1) AS vector_score
FROM documents
WHERE embedding IS NOT NULL -- ? Skip NULL embeddings
)
SELECT
d.*,
COALESCE(b.bm25_score, 0) AS bm25_score,
COALESCE(v.vector_score, 0) AS vector_score,
($2 * COALESCE(b.bm25_score, 0) + $3 * COALESCE(v.vector_score, 0)) AS combined_score
FROM documents d
LEFT JOIN bm25_results b ON d.id = b.id
LEFT JOIN vector_results v ON d.id = v.id
WHERE COALESCE(b.bm25_score, 0) > 0 OR COALESCE(v.vector_score, 0) > 0
ORDER BY combined_score DESC
LIMIT $4;
Why this matters:
? Documents without embeddings still searchable (BM25)
? New documents usable immediately (embeddings generated async)
? System degrades gracefully (not all-or-nothing)
? Zero downtime for embedding model updates
Tenant Isolation in Action
Every MCP request sets the tenant context at the database transaction level:
// mcp-server/internal/database/postgres.go
func (db *DB) SetTenantContext(ctx context.Context, tx pgx.Tx, tenantID string) error {
// Note: SET commands don't support parameter binding
// TenantID is validated as UUID by JWT validator, so this is safe
query := fmt.Sprintf("SET LOCAL app.current_tenant_id = '%s'", tenantID)
_, err := tx.Exec(ctx, query)
return err
}
Combined with RLS policies, this ensures complete tenant isolation at the database level.
Real-world security test:
// Integration test: Verify tenant isolation
func TestTenantIsolation(t *testing.T) {
// Create documents for two tenants
tenant1Doc := createDocument(t, db, "tenant-1", "Secret Data A")
tenant2Doc := createDocument(t, db, "tenant-2", "Secret Data B")
// Query as tenant-1
ctx1 := contextWithTenant(ctx, "tenant-1")
results1, _ := db.ListDocuments(ctx1, "tenant-1", ListParams{Limit: 100})
// Query as tenant-2
ctx2 := contextWithTenant(ctx, "tenant-2")
results2, _ := db.ListDocuments(ctx2, "tenant-2", ListParams{Limit: 100})
// Assertions
assert.Contains(t, results1, tenant1Doc)
assert.NotContains(t, results1, tenant2Doc) // ? Cannot see other tenant
assert.Contains(t, results2, tenant2Doc)
assert.NotContains(t, results2, tenant1Doc) // ? Cannot see other tenant
}
Part 4: The A2A Server – Workflow Orchestration
Task Lifecycle
A2A manages stateful tasks through their entire lifecycle:
Server-Sent Events for Real-Time Updates
Why SSE instead of WebSockets?
Feature
SSE
WebSocket
Unidirectional
? Yes (server?client)
? No (bidirectional)
HTTP/2 multiplexing
? Yes
? No
Automatic reconnection
? Built-in
? Manual
Firewall-friendly
? Yes (HTTP)
?? Sometimes blocked
Complexity
? Simple
? Complex
Browser support
? All modern
? All modern
SSE is perfect for agent progress updates because:
One-way communication (server pushes updates)
Simple implementation
Automatic reconnection
Works through corporate firewalls
SSE provides real-time streaming without WebSocket complexity:
// a2a-server/internal/handlers/tasks.go
func (h *TaskHandler) StreamEvents(w http.ResponseWriter, r *http.Request) {
taskID := chi.URLParam(r, "taskId")
// Set SSE headers
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")
w.Header().Set("Access-Control-Allow-Origin", "*")
flusher, ok := w.(http.Flusher)
if !ok {
http.Error(w, "Streaming not supported", http.StatusInternalServerError)
return
}
// Stream task events
for {
event := h.taskManager.GetNextEvent(taskID)
if event == nil {
break // Task complete
}
// Format as SSE event
data, _ := json.Marshal(event)
fmt.Fprintf(w, "event: task_update\n")
fmt.Fprintf(w, "data: %s\n\n", data)
flusher.Flush()
if event.Status == "completed" || event.Status == "failed" {
break
}
}
}
Client-side consumption is trivial:
# streamlit-ui/pages/3_?_A2A_Tasks.py
def stream_task_events(task_id: str):
url = f"{A2A_BASE_URL}/tasks/{task_id}/events"
with requests.get(url, stream=True) as response:
for line in response.iter_lines():
if line.startswith(b'data:'):
data = json.loads(line[5:])
st.write(f"Update: {data['message']}")
yield data
LangGraph Workflow Integration
LangGraph workflows call MCP tools through the A2A server:
# orchestration/workflows/rag_workflow.py
class RAGWorkflow:
def __init__(self, mcp_url: str):
self.mcp_client = MCPClient(mcp_url)
self.workflow = self.build_workflow()
def build_workflow(self) -> StateGraph:
workflow = StateGraph(RAGState)
# Define workflow steps
workflow.add_node("search", self.search_documents)
workflow.add_node("rank", self.rank_results)
workflow.add_node("generate", self.generate_answer)
workflow.add_node("verify", self.verify_sources)
# Define edges (workflow graph)
workflow.add_edge(START, "search")
workflow.add_edge("search", "rank")
workflow.add_edge("rank", "generate")
workflow.add_edge("generate", "verify")
workflow.add_edge("verify", END)
return workflow.compile()
def search_documents(self, state: RAGState) -> RAGState:
"""Search for relevant documents using MCP hybrid search"""
# This is where MCP and A2A integrate!
results = self.mcp_client.hybrid_search(
query=state["query"],
limit=10,
bm25_weight=0.5,
vector_weight=0.5
)
state["documents"] = results
state["progress"] = f"Found {len(results)} documents"
# Emit progress event via A2A
emit_progress_event(state["task_id"], "search_complete", state["progress"])
return state
def rank_results(self, state: RAGState) -> RAGState:
"""Rank results by combined score"""
docs = sorted(
state["documents"],
key=lambda x: x["score"],
reverse=True
)[:5]
state["ranked_docs"] = docs
state["progress"] = "Ranked top 5 documents"
emit_progress_event(state["task_id"], "ranking_complete", state["progress"])
return state
def generate_answer(self, state: RAGState) -> RAGState:
"""Generate answer using retrieved context"""
context = "\n\n".join([
f"Document: {doc['title']}\n{doc['content']}"
for doc in state["ranked_docs"]
])
prompt = f"""Based on the following documents, answer the question.
Context:
{context}
Question: {state['query']}
Answer:"""
# Call Ollama for local inference
response = ollama.generate(
model="llama3.2",
prompt=prompt
)
state["answer"] = response["response"]
state["progress"] = "Generated final answer"
emit_progress_event(state["task_id"], "generation_complete", state["progress"])
return state
def verify_sources(self, state: RAGState) -> RAGState:
"""Verify sources are accurately cited"""
# Check each cited document exists in ranked_docs
cited_docs = extract_citations(state["answer"])
verified = all(doc in state["ranked_docs"] for doc in cited_docs)
state["verified"] = verified
state["progress"] = "Verified sources" if verified else "Source verification failed"
emit_progress_event(state["task_id"], "verification_complete", state["progress"])
return state
The workflow executes as a multi-step pipeline, with each step:
Calling MCP tools for data access
Updating state
Emitting progress events via A2A
Handling errors gracefully
Part 5: Production-Grade Features
1. Authentication & Security
JWT Token Generation (Streamlit UI):
# streamlit-ui/pages/1_?_Authentication.py
def generate_jwt_token(tenant_id: str, user_id: str, ttl: int = 3600) -> str:
"""Generate RS256 JWT token with proper claims"""
now = datetime.now(timezone.utc)
payload = {
"tenant_id": tenant_id,
"user_id": user_id,
"iat": now, # Issued at
"exp": now + timedelta(seconds=ttl), # Expiration
"nbf": now, # Not before
"jti": str(uuid.uuid4()), # JWT ID (for revocation)
"iss": "mcp-demo-ui", # Issuer
"aud": "mcp-server" # Audience
}
# Sign with RSA private key
with open("/app/certs/private_key.pem", "rb") as f:
private_key = serialization.load_pem_private_key(
f.read(),
password=None
)
token = jwt.encode(payload, private_key, algorithm="RS256")
return token
OpenTelemetry excels at infrastructure observability but lacks LLM-specific context. Langfuse provides deep LLM insights but doesn’t trace service-to-service calls. Together, they provide complete visibility.
Example: End-to-End Trace
Python Workflow (OpenTelemetry + Langfuse):
from opentelemetry import trace
from langfuse.decorators import observe
class RAGWorkflow:
def __init__(self):
# OTel for distributed tracing
self.tracer = setup_otel_tracing("rag-workflow")
# Langfuse for LLM tracking
self.langfuse = Langfuse(...)
@observe(name="search_documents") # Langfuse tracks this
def _search_documents(self, state):
# OTel: Create span for MCP call
with self.tracer.start_as_current_span("mcp.hybrid_search") as span:
span.set_attribute("search.query", state["query"])
span.set_attribute("search.top_k", 5)
# HTTP request auto-instrumented, propagates trace context
result = self.mcp_client.hybrid_search(
query=state["query"],
limit=5
)
span.set_attribute("search.result_count", len(documents))
return state
MCP Client (W3C Trace Context Propagation):
from opentelemetry.propagate import inject
def _make_request(self, method: str, params: Any = None):
headers = {'Content-Type': 'application/json'}
# Inject trace context into HTTP headers
inject(headers) # Adds 'traceparent' header
response = self.session.post(
f"{self.base_url}/mcp",
json=payload,
headers=headers # Trace continues in Go server
)
# Unit tests (fast, no dependencies)
cd mcp-server
go test -v ./...
# Integration tests (requires PostgreSQL)
./scripts/run-integration-tests.sh
The integration test script:
Checks if PostgreSQL is running
Waits for database ready
Runs all integration tests
Reports coverage
Output:
? Running MCP Server Integration Tests
========================================
? PostgreSQL is ready
? Running integration tests...
=== RUN TestGetDocument_WithNullEmbedding
--- PASS: TestGetDocument_WithNullEmbedding (0.05s)
=== RUN TestGetDocument_WithEmbedding
--- PASS: TestGetDocument_WithEmbedding (0.04s)
=== RUN TestHybridSearch_HandlesNullEmbeddings
--- PASS: TestHybridSearch_HandlesNullEmbeddings (0.12s)
=== RUN TestTenantIsolation
--- PASS: TestTenantIsolation (0.08s)
=== RUN TestConcurrentRetrievals
--- PASS: TestConcurrentRetrievals (2.34s)
PASS
coverage: 95.3% of statements
ok github.com/bhatti/mcp-a2a-go/mcp-server/internal/database 3.456s
? Integration tests completed!
Part 7: Real-World Use Cases
Use Case 1: Enterprise RAG Search
Scenario: Consulting firm managing 50,000+ contract documents across multiple clients. Each client (tenant) must have complete data isolation. Legal team needs to:
Search with exact terms (case citations, contract clauses)
Find semantically similar clauses (non-obvious connections)
Track who accessed what (audit trail)
Enforce budget limits per client matter
Solution: Hybrid search combining BM25 (keywords) and vector similarity (semantics).
# Client code
results = mcp_client.hybrid_search(
query="data breach notification requirements GDPR Article 33",
limit=10,
bm25_weight=0.6, # Favor exact keyword matches for legal terms
vector_weight=0.4 # But include semantic similarity
)
for result in results:
print(f"Document: {result['title']}")
print(f"BM25 Score: {result['bm25_score']:.2f}")
print(f"Vector Score: {result['vector_score']:.2f}")
print(f"Combined: {result['score']:.2f}")
print(f"Tenant: {result['tenant_id']}")
print("---")
? Finds documents with exact terms (“GDPR”, “Article 33”)
? Surfaces semantically similar docs (“privacy breach”, “data protection”)
? Tenant isolation ensures Client A can’t see Client B’s contracts
? Audit trail via structured logging
? Cost tracking per client matter
Use Case 2: Multi-Step Research Workflows
Scenario: Investment analyst needs to research a company across multiple data sources:
Company filings (10-K, 10-Q, 8-K)
Competitor analysis
Market trends
Financial metrics
Regulatory filings
News sentiment
Traditional RAG: Query each source separately, manually synthesize results.
With A2A + MCP: Orchestrate multi-step workflow with progress tracking.
# orchestration/workflows/research_workflow.py
class ResearchWorkflow:
def build_workflow(self):
workflow = StateGraph(ResearchState)
# Define research steps
workflow.add_node("search_company", self.search_company_docs)
workflow.add_node("search_competitors", self.search_competitors)
workflow.add_node("search_financials", self.search_financial_data)
workflow.add_node("analyze_trends", self.analyze_market_trends)
workflow.add_node("verify_facts", self.verify_with_sources)
workflow.add_node("generate_report", self.generate_final_report)
# Define workflow graph
workflow.add_edge(START, "search_company")
workflow.add_edge("search_company", "search_competitors")
workflow.add_edge("search_competitors", "search_financials")
workflow.add_edge("search_financials", "analyze_trends")
workflow.add_edge("analyze_trends", "verify_facts")
workflow.add_edge("verify_facts", "generate_report")
workflow.add_edge("generate_report", END)
return workflow.compile()
def search_company_docs(self, state: ResearchState) -> ResearchState:
"""Step 1: Search company documents via MCP"""
company = state["company_name"]
# Call MCP hybrid search
results = self.mcp_client.hybrid_search(
query=f"{company} business operations revenue products",
limit=20,
bm25_weight=0.5,
vector_weight=0.5
)
state["company_docs"] = results
state["progress"] = f"Found {len(results)} company documents"
# Emit progress via A2A SSE
emit_progress("search_company_complete", state["progress"])
return state
def search_competitors(self, state: ResearchState) -> ResearchState:
"""Step 2: Identify and search competitors"""
company = state["company_name"]
# Extract competitors from company docs
competitors = self.extract_competitors(state["company_docs"])
# Search each competitor
competitor_data = {}
for competitor in competitors:
results = self.mcp_client.hybrid_search(
query=f"{competitor} market share products revenue",
limit=10
)
competitor_data[competitor] = results
state["competitors"] = competitor_data
state["progress"] = f"Analyzed {len(competitors)} competitors"
emit_progress("search_competitors_complete", state["progress"])
return state
def search_financial_data(self, state: ResearchState) -> ResearchState:
"""Step 3: Extract financial metrics"""
company = state["company_name"]
# Search for financial documents
results = self.mcp_client.hybrid_search(
query=f"{company} revenue earnings profit margin cash flow",
limit=15,
bm25_weight=0.7, # Favor exact financial terms
vector_weight=0.3
)
# Extract key metrics
metrics = self.extract_financial_metrics(results)
state["financials"] = metrics
state["progress"] = f"Extracted {len(metrics)} financial metrics"
emit_progress("search_financials_complete", state["progress"])
return state
def verify_facts(self, state: ResearchState) -> ResearchState:
"""Step 5: Verify all facts with sources"""
# Check each claim has supporting document
claims = self.extract_claims(state["report_draft"])
verified_claims = []
for claim in claims:
sources = self.find_supporting_docs(claim, state)
if sources:
verified_claims.append({
"claim": claim,
"sources": sources,
"verified": True
})
state["verified_claims"] = verified_claims
state["progress"] = f"Verified {len(verified_claims)} claims"
emit_progress("verification_complete", state["progress"])
return state
Benefits:
? Multi-step orchestration with state management
? Real-time progress via SSE (analyst sees each step)
? Intermediate results saved as artifacts
? Each step calls MCP tools for data retrieval
? Final report with verified sources
? Cost tracking across all steps
Use Case 3: Budget-Controlled AI Assistance
Scenario: SaaS company (e.g., document management platform) offers AI features to customers based on tiered subscription: Without budget control: Customer on free tier makes 10,000 queries in one day.
With budget control:
# Before each request
tier = get_user_tier(user_id)
budget = BUDGET_TIERS[tier]["monthly_budget"]
allowed, remaining = cost_tracker.check_budget(user_id, budget)
if not allowed:
raise BudgetExceededError(
f"Monthly budget of ${budget} exceeded. "
f"Upgrade to {next_tier} for higher limits."
)
# Track the request
response = llm.generate(prompt)
cost = cost_tracker.track_request(
user_id=user_id,
model="llama3.2",
input_tokens=len(prompt.split()),
output_tokens=len(response.split())
)
# Alert when approaching limit
if remaining < 5.0: # $5 remaining
send_alert(user_id, f"Budget alert: ${remaining:.2f} remaining")
Real-world budget enforcement:
# streamlit-ui/pages/4_?_Cost_Tracking.py
def enforce_budget_limits():
"""Check budget before task creation"""
user_tier = st.session_state.get("user_tier", "free")
budget = BUDGET_TIERS[user_tier]["monthly_budget"]
# Calculate current spend
spent = cost_tracker.get_total_cost(user_id)
remaining = budget - spent
# Display budget status
col1, col2, col3 = st.columns(3)
with col1:
st.metric("Budget", f"${budget:.2f}")
with col2:
st.metric("Spent", f"${spent:.2f}",
delta=f"-${spent:.2f}", delta_color="inverse")
with col3:
progress = (spent / budget) * 100
st.metric("Remaining", f"${remaining:.2f}")
st.progress(progress / 100)
# Block if exceeded
if remaining <= 0:
st.error("? Monthly budget exceeded. Upgrade to continue.")
st.button("Upgrade to Pro ($25/month)", on_click=upgrade_tier)
return False
# Warn if close
if remaining < 5.0:
st.warning(f"?? Budget alert: Only ${remaining:.2f} remaining this month")
return True
Benefits:
? Prevent cost overruns per customer
? Fair usage enforcement across tiers
? Export data for billing/accounting
? Different limits per tier
? Automatic alerts before limits
? Graceful degradation (local models for free tier)
5,000+ req/sec means 432 million requests/day per instance
<100ms search means interactive UX
52MB memory means cost-effective scaling
Load Testing Results
# Using hey (HTTP load generator)
hey -n 10000 -c 100 -m POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"hybrid_search","arguments":{"query":"machine learning","limit":10}}}' \
http://localhost:8080/mcp
Summary:
Total: 19.8421 secs
Slowest: 0.2847 secs
Fastest: 0.0089 secs
Average: 0.1974 secs
Requests/sec: 503.98
Status code distribution:
[200] 10000 responses
Latency distribution:
10% in 0.0234 secs
25% in 0.0456 secs
50% in 0.1842 secs
75% in 0.3123 secs
90% in 0.4234 secs
95% in 0.4867 secs
99% in 0.5634 secs
Scaling Strategy
Horizontal Scaling:
MCP and A2A servers are stateless—scale with container replicas
Database read replicas for read-heavy workloads (search queries)
Redis cache for frequently accessed queries (30-second TTL)
Load balancer distributes requests (sticky sessions not needed)
Vertical Scaling:
Increase PostgreSQL resources for larger datasets
Add pgvector HNSW indexes for faster vector search
Tune connection pool sizes (PgBouncer)
When to scale what:
Symptom
Solution
High MCP server CPU
Add more MCP replicas
Slow database queries
Add read replicas
High memory on MCP
Check for memory leaks, add replicas
Cache misses
Increase Redis memory, tune TTL
Slow embeddings
Deploy dedicated embedding service
Part 10: Lessons Learned & Best Practices
1. Go for Protocol Servers
Go’s performance and type safety provides a good support for AI deployment in production.
2. PostgreSQL Row-Level Security
Database-level tenant isolation is non-negotiable for enterprise. Application-level filtering is too easy to screw up. With RLS, even if your application has a bug, the database enforces isolation.
3. Integration Tests Against Real Databases
Unit tests with mocks didn’t catch the NULL embedding issues. Integration tests did. Test against production-like environments.
4. Optional Langfuse
Making Langfuse optional (try/except imports) lets developers run locally without complex setup while enabling full observability in production.
5. Comprehensive Documentation
Document your design and testing process from day one.
Use both Langfuse and OpenTelemetry. OTel traces service flow, Langfuse tracks LLM behavior. They complement, not replace each other.
OpenTelemetry for infrastructure: Trace context propagation across Python ? Go ? Database gave complete visibility into request flow. The traceparent header auto-propagation through requests/httpx made it seamless.
Langfuse for LLM calls: Token counts, costs, and prompt tracking. Essential for budget control and debugging LLM behavior.
Prometheus + Jaeger: Prometheus for metrics dashboards (query “What’s our P95 latency?”), Jaeger for debugging specific slow traces (“Why was this request slow?”).
That’s 10 layers of production concerns. Miss one, and you have a security incident waiting to happen.
Distributed Systems Lessons Apply Here
AI agents are distributed systems. The problems from microservices apply, because agents make autonomous decisions with potentially unbounded costs. From my fault tolerance article, these patterns are essential:
Without timeouts:
embedding = ollama.embed(text) # Ollama down ? hangs forever ? system freezes
Tenant A: 10,000 req/sec ? Database crashes ? ALL tenants down
With rate limiting:
if !rateLimiter.Allow(tenantID) {
return ErrRateLimitExceeded // Other tenants unaffected
}
The Bottom Line
MCP and A2A are excellent protocols. They solve real problems:
? MCP standardizes tool execution
? A2A standardizes agent coordination
But protocols are not products. Building on MCP/A2A is like building on HTTP—the protocol is solved, but you still need web servers, frameworks, security layers, and monitoring tools.
This repository shows the other 90%:
Real authentication (not “TODO: add auth”)
Real multi-tenancy (database RLS, not app filtering)
Real observability (Langfuse integration, not “we should add logging”)
Real testing (integration tests, not just mocks)
Real deployment (K8s manifests, not “works on my laptop”)
Get Started
git clone https://github.com/bhatti/mcp-a2a-go
cd mcp-a2a-go
docker compose up -d
./scripts/run-integration-tests.sh
open http://localhost:8501
Building distributed systems means confronting failure modes that are nearly impossible to reproduce in development or testing environments. How do you test for metastable failures that only emerge under specific load patterns? How do you validate that your quorum-based system actually maintains consistency during network partitions? How do you catch cross-system interaction bugs when both systems work perfectly in isolation? Integration testing, performance testing, and chaos engineering all help, but they have limitations. For the past few years, I’ve been using simulation to validate boundary conditions that are hard to test in real environments. Interactive simulators let you tweak parameters, trigger failure scenarios, and see the consequences immediately through metrics and visualizations.
In this post, I will share four simulators I’ve built to explore the failure modes and consistency challenges that are hardest to test in real systems:
Metastable Failure Simulator: Demonstrates how retry storms create self-sustaining collapse
CAP/PACELC Consistency Simulator: Shows the real tradeoffs between consistency, availability, and latency
CRDT Simulator: Explores conflict-free convergence without coordination
Cross-System Interaction (CSI) Failure Simulator: Reveals how correct systems fail through their interactions
Each simulator is built on research findings and real-world incidents. The goal isn’t just to understand these failure modes intellectually, but to develop intuition through experimentation. All simulators available at: https://github.com/bhatti/simulators.
Part 1: Metastable Failures
The Problem: When Systems Attack Themselves
Metastable failures are particularly insidious because the initial trigger can be small and transient, but the system remains degraded long after the trigger is gone. Research in the metastable failures has shown that traditional fault tolerance mechanisms don’t protect against metastability because the failure is self-sustaining through positive feedback loops in retry logic and coordination overhead. The mechanics are deceptively simple:
A transient issue (network blip, brief CPU spike) causes some requests to slow down
Slow requests start timing out
Clients retry timed-out requests, adding more load
The system is now in a stable degraded state, even though the original trigger is gone
For example, AWS Kinesis experienced a 7+ hour outage in 2020 where a transient metadata mismatch triggered retry storms across the fleet. Even after the original issue was fixed, the retry behavior kept the system degraded. The recovery required externally rate-limiting client retries.
How the Simulator Works
The metastable failure simulator models this feedback loop using discrete event simulation (SimPy). Here’s what it simulates:
Server Model:
Base latency: Time to process a request with no contention
Concurrency slope: Additional latency per concurrent request (coordination cost)
Capacity: Maximum concurrent requests before queueing
# Latency grows linearly with active requests
def current_latency(self):
return self.base_latency + (self.active_requests * self.concurrency_slope)
Client Model:
Timeout threshold: When to give up on a request
Max retries: How many times to retry
Backoff strategy: Exponential backoff with jitter (configurable)
Load Patterns:
Constant: Steady baseline load
Spike: Sudden increase for a duration, then back to baseline
Ramp: Gradual increase and decrease
Key Parameters to Experiment With:
Parameter
What It Tests
Typical Values
server_capacity
How many concurrent requests before queueing
20-100
base_latency
Processing time without contention
0.1-1.0s
concurrency_slope
Coordination overhead per request
0.001-0.05s
timeout
When clients give up
1-10s
max_retries
Retry attempts before failure
0-5
backoff_enabled
Whether to add jitter and delays
True/False
What You Can Learn:
Trigger a metastable failure: Set spike load high, timeout low, disable backoff ? watch P99 latency stay high after spike ends
See recovery with backoff: Same scenario but enable exponential backoff ? system recovers when spike ends
Understand the tipping point: Gradually increase concurrency slope ? observe when retry amplification begins
Test admission control: Set low server capacity ? see benefit of failing fast vs queueing
The simulator tracks success rate, retry count, timeout count, and latency percentiles over time, letting you see exactly when the system tips into metastability and whether it recovers. With this simulator you can validate various prevention strategies such as:
Exponential backoff with jitter spreads retries over time
Adaptive retry budgets limit total fleet-wide retries
Circuit breakers detect patterns and stop retry storms
Load shedding rejects requests before queues explode
Part 2: CAP and PACELC
The CAP theorem correctly states that during network partitions, you must choose between consistency and availability. However, as Daniel Abadi and others have pointed out, this only addresses partition scenarios. Most systems spend 99.99% of their time in normal operation, where the real tradeoff is between latency and consistency. This is where PACELC comes in:
If Partition happens: choose Availability or Consistency
Else (normal operation): choose Latency or Consistency
PACELC provides a more complete framework for understanding real-world distributed databases:
PA/EL Systems (DynamoDB, Cassandra, Riak):
Partition ? Choose Availability (serve stale data)
Normal ? Choose Latency (1-2ms reads from any replica)
Use when: Shopping carts, session stores, high write throughput needed
Normal ? Choose Consistency (5-100ms for quorum coordination)
Use when: Financial transactions, inventory, anything that can’t be wrong
PA/EC Systems (MongoDB):
Partition ? Choose Availability (with caveats – unreplicated writes go to rollback)
Normal ? Choose Consistency (strong reads/writes in baseline)
Use when: Mixed workloads with mostly consistent needs
PC/EL Systems (PNUTS):
Partition ? Choose Consistency
Normal ? Choose Latency (async replication)
Use when: Read-heavy with timeline consistency acceptable
Quorum Consensus: Strong Consistency with Coordination
When R + W > N (read quorum + write quorum > total replicas), the read and write sets must overlap in at least one node. This overlap ensures that any read sees at least one node with the latest write, providing linearizability.
Example with N=5, R=3, W=3:
Write to replicas {1, 2, 3}
Read from replicas {2, 3, 4}
Overlap at {2, 3} guarantees we see the latest value
Critical Nuances:
R + W > N alone is NOT sufficient for linearizability in practice. You need additional mechanisms: readers must perform read repair synchronously before returning results, and writers must read the latest state from a quorum before writing. “Last write wins” based on wall-clock time breaks linearizability due to clock skew. Sloppy quorums like those used in Dynamo are NOT linearizable because the nodes in the quorum can change during failures. Even R = W = N doesn’t guarantee consistency if cluster membership changes. Google Spanner uses atomic clocks and GPS to achieve strong consistency globally, with TrueTime API providing less than 1ms clock uncertainty at the 99th percentile as of 2023.
How the Simulator Works
The CAP/PACELC simulator lets you explore these tradeoffs by configuring different consistency models and observing their behavior during normal operation and network partitions.
System Model:
N replica nodes, each with local storage
Configurable schema for data (to test compatibility)
Network latency between nodes (WAN vs LAN)
Optional partition mode (splits cluster)
Consistency Levels:
Strong (R+W>N): Quorum reads and writes, linearizable
Linearizable (R=W=N): All nodes must respond, highest consistency
Weak (R=1, W=1): Single node, eventual consistency
More nodes = more fault tolerance but higher coordination cost
consistency_level
Strong/Eventual/etc
Directly controls latency vs consistency tradeoff
base_latency
Node processing time
Baseline performance
network_latency
Inter-node delay
WAN (50-150ms) vs LAN (1-10ms) dramatically affects quorum cost
partition_active
Network partition
Tests CAP behavior (A vs C during partition)
write_ratio
Read/write mix
Write-heavy shows coordination bottleneck
What You Can Learn:
Latency cost of consistency:
Run with Strong (R=3,W=3) at network_latency=5ms ? ~15ms operations
Same at network_latency=100ms ? ~300ms operations
Switch to Weak (R=1,W=1) ? single-digit milliseconds regardless
CAP during partitions:
Enable partition with Strong consistency ? operations fail (choosing C over A)
Enable partition with Eventual ? stale reads but available (choosing A over C)
Quorum size tradeoffs:
Linearizable (R=W=N) ? single node failure breaks everything
Strong (R=W=3 of N=5) ? can tolerate 2 node failures
Measure failure rate vs consistency guarantees
Geographic distribution:
Network latency 10ms (same datacenter) ? quorum cost moderate
Network latency 150ms (cross-continent) ? quorum cost severe
Observe when you should use eventual consistency for geo-distribution
The simulator tracks write/read latencies, inconsistent reads, failed operations, and success rates, giving you quantitative data on the tradeoffs.
Key Insights from Simulation
The simulator reveals that most architectural decisions are driven by normal operation latency, not partition handling. If you’re building a global system with 150ms cross-region latency, strong consistency means every operation takes 150ms+ for quorum coordination. That’s often unacceptable for user-facing features. This is why hybrid approaches are becoming standard: use strong consistency for critical invariants (financial transactions, inventory), eventual consistency for everything else (user profiles, preferences).
Part 3: CRDTs
CRDTs (Conflict-Free Replicated Data Types) provide strong eventual consistency (SEC) through mathematical guarantees, not probabilistic convergence. They work without coordination, consensus, or concurrency control. CRDTs rely on operations being commutative (order doesn’t matter), merge functions being associative and idempotent (forming a semilattice), and updates being monotonic according to a partial order.
Example: G-Counter (Grow-Only Counter)
class GCounter:
def __init__(self, replica_id):
self.counts = {} # replica_id -> count
def increment(self, amount=1):
# Each replica tracks its own increments
self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount
def value(self):
# Total is sum of all replicas
return sum(self.counts.values())
def merge(self, other):
# Take max of each replica's count
for replica_id, count in other.counts.items():
self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)
Why this works:
Each replica only increments its own counter (no conflicts)
Merge takes max (idempotent: max(a,a) = a)
Order doesn’t matter: max(max(a,b),c) = max(a,max(b,c))
Eventually all replicas see all increments ? convergence
CRDT Types
There are two main approaches: State-based CRDTs (CvRDTs) send full local state and require merge functions to be commutative, associative, and idempotent. Operation-based CRDTs (CmRDTs) transmit only update operations and require reliable delivery in causal order. Delta-state CRDTs combine the advantages by transmitting compact deltas.
Four CRDTs in the Simulator:
G-Counter: Increment only, perfect for metrics
PN-Counter: Increment and decrement (two G-Counters)
OR-Set: Add/remove elements, concurrent add wins
LWW-Map: Last-write-wins with timestamps
Production systems using CRDTs include Redis Enterprise (CRDBs), Riak, Azure Cosmos DB for distributed data types, and Automerge/Yjs for collaborative editing like Google Docs. SoundCloud uses CRDTs in their audio distribution platform.
Important Limitations
CRDTs only provide eventual consistency, NOT strong consistency or linearizability. Different replicas can see concurrent operations in different orders temporarily. Not all operations are naturally commutative, and CRDTs cannot solve problems requiring atomic coordination like preventing double-booking without additional mechanisms.
The “Shopping Cart Problem”: You can use an OR-Set for shopping cart items, but if two clients concurrently remove the same item, your naive implementation might remove both. The CRDT guarantees convergence to a consistent state, but that state might not match user expectations.
Byzantine fault tolerance is also a concern as traditional CRDTs assume all devices are trustworthy. Malicious devices can create permanent inconsistencies.
How the Simulator Works
The CRDT simulator demonstrates convergence through gossip-based replication. You can watch replicas diverge and converge as they exchange state.
Simulation Model:
Multiple replica nodes, each with independent CRDT state
Operations applied to random replicas (simulating distributed clients)
Periodic “merges” (gossip protocol) with probability merge_probability
Network delay between merges
Tracks convergence: do all replicas have identical state?
CRDT Implementations: Each CRDT type has its own semantics:
# G-Counter: Each replica has its own count, merge takes max
def merge(self, other):
for replica_id, count in other.counts.items():
self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)
# OR-Set: Elements have unique tags, add always beats remove
def add(self, element, unique_tag):
self.elements[element].add(unique_tag)
def remove(self, element, observed_tags):
self.elements[element] -= observed_tags # Only remove what was observed
# LWW-Map: Latest timestamp wins
def set(self, key, value, timestamp):
current = self.entries.get(key)
if current is None or timestamp > current[1]:
self.entries[key] = (value, timestamp, self.replica_id)
Key Parameters:
Parameter
What It Tests
Values
crdt_type
Different convergence semantics
G-Counter, PN-Counter, OR-Set, LWW-Map
n_replicas
Number of nodes
2-8
n_operations
Total updates
10-100
merge_probability
Gossip frequency
0.0-1.0
network_delay
Time for state exchange
0.0-2.0s
What You Can Learn:
Convergence speed:
Set merge_probability=0.1 ? slow convergence, replicas stay diverged
Set merge_probability=0.8 ? fast convergence
Understand gossip frequency vs consistency window tradeoff
OR-Set semantics:
Watch concurrent add/remove ? add wins
See how unique tags prevent unintended deletions
Compare with naive set implementation
LWW-Map data loss:
Two replicas set same key concurrently with different values
One value “wins” based on timestamp (or replica ID tie-break)
Data loss is possible – not suitable for all use cases
Network partition tolerance:
Low merge probability simulates partition
Replicas diverge but operations still succeed (AP in CAP)
After “partition heals” (merges resume), all converge
No coordination needed, no operations failed
The simulator visually shows replica states over time and convergence status, making abstract CRDT theory concrete.
Key Insights from Simulation
CRDTs trade immediate consistency for availability and partition tolerance. The theoretical guarantees are proven: if all replicas receive all updates (eventual delivery), they will converge to the same state (strong convergence).
But the simulator reveals the practical challenges:
Merge semantics don’t always match user intent (LWW can lose data)
Tombstones can grow indefinitely (OR-Set needs garbage collection)
Causal ordering adds complexity (need vector clocks for some CRDTs)
Not suitable for operations requiring coordination (uniqueness constraints, atomic updates)
Research from EuroSys 2023 found that 20% of catastrophic cloud incidents and 37% of failures in major open-source distributed systems are CSI failures – where both systems work correctly in isolation but fail when connected. This is the NASA Mars Climate Orbiter problem: one team used metric units, another used imperial. Both systems worked perfectly. The spacecraft burned up in Mars’s atmosphere because of their interaction.
Why CSI Failures Are Different
Not dependency failures: The downstream system is available, it just can’t process what upstream sends.
Not library bugs: Libraries are single-address-space and well-tested. CSI failures cross system boundaries where testing is expensive.
Not component failures: Each system passes its own test suite. The bug only emerges through interaction.
CSI failures manifest across three planes: Data plane (51% – schema/metadata mismatches), Management plane (32% – configuration incoherence), and Control plane (17% – API semantic violations).
For example, study of Apache Spark-Hive integration found 15 distinct discrepancies in simple write-read testing. Hive stored timestamps as long (milliseconds since epoch), Spark expected Timestamp type. Both worked in isolation, failed when integrated. Kafka and Flink encoding mismatch: Kafka set compression.type=lz4, Flink couldn’t decompress due to old LZ4 library. Configuration was silently ignored in Flink, leading to data corruption for 2 weeks before detection.
Why Testing Doesn’t Catch CSI Failures
Analysis of Spark found only 6% of integration tests actually test cross-system interaction. Most “integration tests” test multiple components of the same system. Cross-system testing is expensive and often skipped. The problem compounds with modern architectures:
Microservices: More system boundaries to test
Multi-cloud: Different clouds with different semantics
Serverless: Fine-grained composition increases interaction surface area
How the Simulator Works
The CSI failure simulator models two systems exchanging data, with configurable discrepancies in schemas, encodings, and configurations.
System Model:
Two systems (upstream ? downstream)
Each has its own schema definition (field types, encoding, nullable fields)
Each has its own configuration (timeouts, retry counts, etc.)
Data flows from System A to System B with potential conversion failures
Failure Scenarios:
Metadata Mismatch (Hive/Spark):
System A: timestamp: long
System B: timestamp: Timestamp
Failure: Type coercion fails ~30% of the time
Schema Conflict (Producer/Consumer):
System A: encoding: latin-1
System B: encoding: utf-8
Failure: Silent data corruption
Configuration Incoherence (ServiceA/ServiceB):
System A: max_retries=3, timeout=30s
System B expects: max_retries=5, timeout=60s
Failure: ~40% of requests fail due to premature timeout
API Semantic Violation (Upstream/Downstream):
Upstream assumes: synchronous, thread-safe
Downstream is: asynchronous, not thread-safe
Failure: Race conditions, out-of-order processing
Type Confusion (SystemA/SystemB):
System A: amount: float
System B: amount: decimal
Failure: Precision loss in financial calculations
Implementation Details:
class DataSchema:
def __init__(self, schema_id, fields, encoding, nullable_fields):
self.fields = fields # field_name -> type
self.encoding = encoding
def is_compatible(self, other):
# Check field types and encoding
return (self.fields == other.fields and
self.encoding == other.encoding)
class DataRecord:
def serialize(self, target_schema):
# Attempt type coercion
for field, value in self.data.items():
expected_type = target_schema.fields[field]
actual_type = self.schema.fields[field]
if expected_type != actual_type:
# 30% failure on type mismatch (simulating real world)
if random.random() < 0.3:
return None # Serialization failure
# Check encoding compatibility
if self.schema.encoding != target_schema.encoding:
if random.random() < 0.2: # 20% silent corruption
return None
Key Parameters:
Parameter
What It Tests
failure_scenario
Type of CSI failure (metadata, schema, config, API, type)
duration
Simulation length
request_rate
Load (requests per second)
The simulator doesn’t have many tunable parameters because CSI failures are about specific incompatibilities, not gradual degradation. Each scenario models a real-world pattern.
What You Can Learn:
Failure rates: CSI failures often manifest in 20-40% of requests (not 100%)
Some requests happen to have compatible data
Makes debugging harder (intermittent failures)
Failure location:
Research shows 69% of CSI fixes go in the upstream system, often in connector modules that are less than 5% of the codebase
Simulator shows which system fails (usually downstream)
Silent vs loud failures:
Type mismatches often crash (loud, easy to detect)
Encoding mismatches corrupt silently (hard to detect)
The simulator demonstrates that cross-system integration testing is essential but often skipped. Unit tests of each system won’t catch these failures.
Prevention strategies validated by simulation:
Write-Read Testing: Write with System A, read with System B, verify integrity
Schema Registry: Single source of truth for data schemas, enforced across systems
Configuration Coherence Checking: Validate that shared configs match
Contract Testing: Explicit, machine-checkable API contracts
Hybrid Consistency Models
Modern systems increasingly use mixed consistency: RedBlue Consistency (2012) marks operations as needing strong consistency (red) or eventual consistency (blue). Replicache (2024) has the server assign final total order while clients do optimistic local updates with rebase. For example: Calendar Application
# Strong consistency for room reservations (prevent double-booking)
def book_conference_room(room_id, time_slot):
with transaction(consistency='STRONG'):
if room.is_available(time_slot):
room.book(time_slot)
return True
return False
# CRDTs for collaborative editing (participant lists, notes)
def update_meeting_notes(meeting_id, notes):
# LWW-Map CRDT, eventual consistency
meeting.notes.merge(notes)
# Eventual consistency for preferences
def update_user_calendar_color(user_id, color):
# Who cares if this propagates slowly?
user_prefs[user_id] = color
Recent theoretical work on the CALM theorem proves that coordination-free consistency is achievable for certain problem classes. Research in 2025 provided mathematical definitions of when coordination is and isn’t required, separating coordination from computation.
What the Simulators Teach Us
Running all four simulators reveals the consistency spectrum:
No “best” consistency model exists:
Quorums are best when you need linearizability and can tolerate latency
CRDTs are best when you need high availability and can tolerate eventual consistency
Neither approach “bypasses” CAP – they make different tradeoffs
Real systems use hybrid models with different consistency for different operations
Practical Lessons
1. Design for Recovery, Not Just Prevention
The metastable failure simulator shows you can’t prevent all failures. Your retry logic, backoff strategy, and circuit breakers are more important than your happy path code. Validated strategies include:
Exponential backoff with jitter (spread retries over time)
Adaptive retry budgets (limit total fleet-wide retries)
Circuit breakers (detect patterns, stop storms)
Load shedding (fail fast rather than queue to death)
2. Understand the Consistency Spectrum
The CAP/PACELC simulator demonstrates that consistency is not binary. You need to understand:
What consistency level do you actually need? (Most operations don’t need linearizability)
What’s the latency cost? (Quorum reads in cross-region deployment can be 100x slower)
What happens during partitions? (Can you sacrifice availability or must you serve stale data?)
Decision framework:
Use strong consistency for: money, inventory, locks, compliance
Use eventual consistency for: feeds, catalogs, analytics, caches
Use hybrid models for: most real-world applications
3. Test Cross-System Interactions
The CSI failure simulator reveals that 86% of fixes go into connector modules that are less than 5% of your codebase. This is where bugs hide. Essential tests include:
Write-read tests (write with System A, read with System B)
Round-trip tests (serialize/deserialize across boundaries)
Version compatibility matrix (test combinations)
Schema validation (machine-checkable contracts)
4. Leverage CRDTs Where Appropriate
The CRDT simulator shows that conflict-free convergence is possible for specific problem types. But you need to:
Understand the semantic limitations (LWW can lose data)
Design merge behavior carefully (does it match user intent?)
git clone https://github.com/bhatti/simulators
cd simulators
pip install -r requirements.txt
Requirements:
Python 3.7+
streamlit (web UI)
simpy (discrete event simulation)
plotly (interactive visualizations)
numpy, pandas (data analysis)
Running Individual Simulators
# Metastable failure simulator
streamlit run metastable_simulator.py
# CAP/PACELC consistency simulator
streamlit run cap_consistency_simulator.py
# CRDT simulator
streamlit run crdt_simulator.py
# CSI failure simulator
streamlit run csi_failure_simulator.py
Running All Simulators
python run_all_simulators.py
Conclusion
Building distributed systems means confronting failure modes that are expensive or impossible to reproduce in real environments:
Metastable failures require specific load patterns and timing
Consistency tradeoffs need multi-region deployments to observe
CRDT convergence requires orchestrating concurrent operations across replicas
CSI failures need exact schema/config mismatches that don’t exist in test environments
Simulators bridge the gap between theoretical understanding and practical intuition:
Cheaper than production testing: No cloud costs, no multi-region setup, instant feedback
Safer than production experiments: Crash the simulator, not your service
More complete than unit tests: See emergent behaviors, not just component correctness
Faster iteration: Tweak parameters, re-run in seconds, build intuition through experimentation
What You Can’t Learn Without Simulation
When does retry amplification tip into metastability? (Depends on coordination slope, timeout, backoff)
How much does quorum coordination actually cost? (Depends on network latency, replica count, workload)
Do your CRDT semantics match user expectations? (Depends on merge behavior, conflict resolution)
Will your schema changes break integration? (Depends on type coercion, encoding, version skew)
The goal isn’t to prevent all failures, that’s impossible. The goal is to understand, anticipate, and recover from the failures that will inevitably occur.
References
Key research papers and resources used in this post:
I started writing network code in the early 1990s on IBM mainframes, armed with nothing but Assembly and COBOL. Today, I build distributed AI agents using gRPC, RAG pipelines, and serverless functions. Between these worlds lie decades of technological evolution and an uncomfortable realization: we keep relearning the same lessons. Over the years, I’ve seen simple ideas triumph over complex ones. The technology keeps changing, but the problems stay the same. Network latency hasn’t gotten faster relative to CPU speed. Distributed systems are still hard. Complexity still kills projects. And every new generation has to learn that abstractions leak. I’ll show you the technologies I’ve used, the mistakes I’ve made, and most importantly, what the past teaches us about building better systems in the future.
The Mainframe Era
CICS and 3270 Terminals
I started my career on IBM mainframes running CICS, which was used to build online applications accessed through 3270 “green screen” terminals. It used LU6.2 (Logical Unit 6.2) protocol, part of IBM’s Systems Network Architecture (SNA) to provide peer-to-peer communication. Here’s what a typical CICS application looked like in COBOL:
The CICS environment handled all the complexity—transaction management, terminal I/O, file access, and inter-system communication. For the user interface, I used Basic Mapping Support (BMS), which was notoriously finicky. You had to define screen layouts in a rigid format specifying exactly where each field appeared on the 24×80 character grid:
CUSTMAP DFHMSD TYPE=&SYSPARM, X
MODE=INOUT, X
LANG=COBOL, X
CTRL=FREEKB
DFHMDI SIZE=(24,80)
CUSTID DFHMDF POS=(05,20), X
LENGTH=08, X
ATTRB=(UNPROT,NUM), X
INITIAL='________'
CUSTNAME DFHMDF POS=(07,20), X
LENGTH=30, X
ATTRB=PROT
This was so painful that I wrote my own tool to convert simple text-based UI templates into BMS format. Looking back, this was my first foray into creating developer tools. Key lesson I learned from the mainframe era was that developer experience mattered. Cumbersome tools slow down development and introduce errors.
Moving to UNIX
Berkeley Sockets
After working on mainframes for a couple of years, I saw the mainframes were already in decline and I then transitioned to C and UNIX systems, which I studied previously in my college. I learned about Berkeley Sockets, which was a lot more powerful and you had complete control over the network. Here’s a simple TCP server in C using Berkeley Sockets:
As you can see, you had to track a lot of housekeeping like socket creation, binding, listening, accepting, reading, writing, and meticulous error handling at every step. Memory management was entirely manual—forget to close() a file descriptor and you’d leak resources. If you make a mistake with recv() buffer sizes and you’d overflow memory. I also experimented with Fast Sockets from UC Berkeley, which used kernel bypass techniques for lower latency and offered better performance.
Key lesson I learned was that low-level control comes at a steep cost. The cognitive load of managing these details makes it nearly impossible to focus on business logic.
Sun RPC and XDR
When working for a physics lab with a large computing facilities consists of Sun workstations, Solaris, and SPARC processors, I discovered Sun RPC (Remote Procedure Call) with XDR (External Data Representation). XDR solved a critical problem: how do you exchange data between machines with different architectures? A SPARC processor uses big-endian byte ordering, while x86 uses little-endian. XDR provided a canonical, architecture-neutral format for representing data. Here’s an XDR definition file (types.x):
/* Define a structure for customer data */
struct customer {
int customer_id;
string name<30>;
float balance;
};
/* Define the RPC program */
program CUSTOMER_PROG {
version CUSTOMER_VERS {
int ADD_CUSTOMER(customer) = 1;
customer GET_CUSTOMER(int) = 2;
} = 1;
} = 0x20000001;
You’d run rpcgen on this file:
$ rpcgen types.x
This generated the client stub, server stub, and XDR serialization code automatically. Here’s what the server implementation looked like:
This was my first introduction to Interface Definition Languages (IDL) and I found that defining the contract once and generating code automatically reduces errors. This pattern would reappear in CORBA, Protocol Buffers, and gRPC.
Parallel Computing
During my graduate and post-graduate studies in mid 1990s while working full time, I researched into the parallel and distributed computing. I worked with MPI (Message Passing Interface) and IBM’s MPL on SP1/SP2 systems. MPI provided collective operations like broadcast, scatter, gather, and reduce (predecessor to Hadoop like map/reduce). Here’s a simple MPI example that computes the sum of an array in parallel:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define ARRAY_SIZE 1000
int main(int argc, char** argv) {
int rank, size;
int data[ARRAY_SIZE];
int local_sum = 0, global_sum = 0;
int chunk_size, start, end;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
// Initialize data on root
if (rank == 0) {
for (int i = 0; i < ARRAY_SIZE; i++) {
data[i] = i + 1;
}
}
// Broadcast data to all processes
MPI_Bcast(data, ARRAY_SIZE, MPI_INT, 0, MPI_COMM_WORLD);
// Each process computes sum of its chunk
chunk_size = ARRAY_SIZE / size;
start = rank * chunk_size;
end = (rank == size - 1) ? ARRAY_SIZE : start + chunk_size;
for (int i = start; i < end; i++) {
local_sum += data[i];
}
// Reduce all local sums to global sum
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT,
MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
printf("Global sum: %d\n", global_sum);
}
MPI_Finalize();
return 0;
}
For my post-graduate project, I built JavaNOW (Java on Networks of Workstations), which was inspired by Linda’s tuple spaces and MPI’s collective operations, but implemented in pure Java for portability. The key innovation was our Actor-inspired model. Instead of heavyweight processes communicating through message passing, I used lightweight Java threads with an Entity Space (distributed associative memory) where “actors” could put and get entities asynchronously. Here’s a simple example:
public class SumTask extends ActiveEntity {
public Object execute(Object arg, JavaNOWAPI api) {
Integer myId = (Integer) arg;
EntitySpace workspace = new EntitySpace("RESULTS");
// Compute partial sum
int partialSum = 0;
for (int i = myId * 100; i < (myId + 1) * 100; i++) {
partialSum += i;
}
// Store result in EntitySpace
return new Integer(partialSum);
}
}
// Main application
public class ParallelSum extends JavaNOWApplication {
public void master() {
EntitySpace workspace = new EntitySpace("RESULTS");
// Spawn parallel tasks
for (int i = 0; i < 10; i++) {
ActiveEntity task = new SumTask(new Integer(i));
getJavaNOWAPI().eval(workspace, task, new Integer(i));
}
// Collect results
int totalSum = 0;
for (int i = 0; i < 10; i++) {
Entity result = getJavaNOWAPI().get(
workspace, new Entity(new Integer(i)));
totalSum += ((Integer)result.getEntityValue()).intValue();
}
System.out.println("Total sum: " + totalSum);
}
public void slave(int id) {
// Slave nodes wait for work
}
}
Since then, I have seen the Actor model have gained a wide adoption. For example, today’s serverless functions (AWS Lambda, Azure Functions, Google Cloud Functions) and modern frameworks like Akka, Orleans, and Dapr all embrace Actor-inspired patterns.
Novell and CGI
I also briefly worked with Novell’s IPX (Internetwork Packet Exchange) protocol, which had painful APIs. Here’s a taste of IPX socket programming (simplified):
When the web emerged in early 1990s, I built applications using CGI (Common Gateway Interface) with Perl and C. I deployed these on Apache HTTP Server, which was the first production-quality open source web server and quickly became the dominant web server of the 1990s. Apache used process-driven concurrency where it forked a new process for each request or maintained a pool of pre-forked processes. CGI was conceptually simple: the web server launched a new UNIX process for every request, passing input via stdin and receiving output via stdout. Here’s a simple Perl CGI script:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
my $cgi = CGI->new;
print $cgi->header('text/html');
print "<html><body>\n";
print "<h1>Hello from CGI!</h1>\n";
my $name = $cgi->param('name') || 'Guest';
print "<p>Welcome, $name!</p>\n";
# Simulate database query
my $user_count = 42;
print "<p>Total users: $user_count</p>\n";
print "</body></html>\n";
Later, I migrated to more performant servers: Tomcat for Java servlets, Jetty as an embedded server, and Netty for building custom high-performance network applications. These servers used asynchronous I/O and lightweight threads (or even non-blocking event loops in Netty‘s case).
Key Lesson I learned was that scalability matters. The CGI model’s inability to maintain persistent connections or share state made it unsuitable for modern web applications. The shift from process-per-request to thread pools and then to async I/O represented fundamental improvements in how we handle concurrency.
Java Adoption
When Java was released in 1995, I adopted it wholeheartedly. It saved developers from manual memory management using malloc() and free() debugging. Network programming became far more approachable:
import java.io.*;
import java.net.*;
public class SimpleServer {
public static void main(String[] args) throws IOException {
int port = 8080;
try (ServerSocket serverSocket = new ServerSocket(port)) {
System.out.println("Server listening on port " + port);
while (true) {
try (Socket clientSocket = serverSocket.accept();
BufferedReader in = new BufferedReader(
new InputStreamReader(clientSocket.getInputStream()));
PrintWriter out = new PrintWriter(
clientSocket.getOutputStream(), true)) {
String request = in.readLine();
System.out.println("Received: " + request);
out.println("Message received");
}
}
}
}
}
Java Threads
I had previously used pthreads in C, which were hard to use but Java’s threading model was far simpler:
public class ConcurrentServer {
public static void main(String[] args) throws IOException {
ServerSocket serverSocket = new ServerSocket(8080);
while (true) {
Socket clientSocket = serverSocket.accept();
// Spawn thread to handle client
new Thread(new ClientHandler(clientSocket)).start();
}
}
static class ClientHandler implements Runnable {
private Socket socket;
public ClientHandler(Socket socket) {
this.socket = socket;
}
public void run() {
try (BufferedReader in = new BufferedReader(
new InputStreamReader(socket.getInputStream()));
PrintWriter out = new PrintWriter(
socket.getOutputStream(), true)) {
String request = in.readLine();
// Process request
out.println("Response");
} catch (IOException e) {
e.printStackTrace();
} finally {
try { socket.close(); } catch (IOException e) {}
}
}
}
}
public class ThreadSafeCounter {
private int count = 0;
public synchronized void increment() {
count++;
}
public synchronized int getCount() {
return count;
}
}
This was so much easier than managing mutexes, condition variables, and semaphores in C!
Java RMI: Remote Objects Made
When Java added RMI (1997), distributed objects became practical. You could invoke methods on objects running on remote machines almost as if they were local. Define a remote interface:
import java.rmi.Remote;
import java.rmi.RemoteException;
public interface Calculator extends Remote {
int add(int a, int b) throws RemoteException;
int multiply(int a, int b) throws RemoteException;
}
Implement it:
import java.rmi.server.UnicastRemoteObject;
import java.rmi.RemoteException;
public class CalculatorImpl extends UnicastRemoteObject
implements Calculator {
public CalculatorImpl() throws RemoteException {
super();
}
public int add(int a, int b) throws RemoteException {
return a + b;
}
public int multiply(int a, int b) throws RemoteException {
return a * b;
}
}
Server:
import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
public class Server {
public static void main(String[] args) {
try {
LocateRegistry.createRegistry(1099);
Calculator calc = new CalculatorImpl();
Naming.rebind("Calculator", calc);
System.out.println("Server ready");
} catch (Exception e) {
e.printStackTrace();
}
}
}
Client:
import java.rmi.Naming;
public class Client {
public static void main(String[] args) {
try {
Calculator calc = (Calculator) Naming.lookup(
"rmi://localhost/Calculator");
int result = calc.add(5, 3);
System.out.println("5 + 3 = " + result);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I found that RMI was constrained and everything had to extend Remote, and you were stuck with Java-to-Java communication. Key lesson I learned was that abstractions that feel natural to developers get adopted.
JINI: RMI with Service Discovery
At a travel booking company in the mid 2000s, I used JINI, which Sun Microsystems pitched as “RMI on steroids.” JINI extended RMI with automatic service discovery, leasing, and distributed events. The core idea: services could join a network, advertise themselves, and be discovered by clients without hardcoded locations. Here’s a JINI service interface and registration:
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.discovery.LookupDiscovery;
import net.jini.lease.LeaseRenewalManager;
import java.rmi.Remote;
import java.rmi.RemoteException;
// Service interface
public interface BookingService extends Remote {
String searchFlights(String origin, String destination)
throws RemoteException;
boolean bookFlight(String flightId, String passenger)
throws RemoteException;
}
// Service provider
public class BookingServiceProvider implements DiscoveryListener {
public void discovered(DiscoveryEvent event) {
ServiceRegistrar[] registrars = event.getRegistrars();
for (ServiceRegistrar registrar : registrars) {
try {
BookingService service = new BookingServiceImpl();
Entry[] attributes = new Entry[] {
new Name("FlightBookingService")
};
ServiceItem item = new ServiceItem(null, service, attributes);
ServiceRegistration reg = registrar.register(
item, Lease.FOREVER);
// Auto-renew lease
leaseManager.renewUntil(reg.getLease(), Lease.FOREVER, null);
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
Client discovery and usage:
public class BookingClient implements DiscoveryListener {
public void discovered(DiscoveryEvent event) {
ServiceRegistrar[] registrars = event.getRegistrars();
for (ServiceRegistrar registrar : registrars) {
try {
ServiceTemplate template = new ServiceTemplate(
null, new Class[] { BookingService.class }, null);
ServiceItem item = registrar.lookup(template);
if (item != null) {
BookingService booking = (BookingService) item.service;
String flights = booking.searchFlights("SFO", "NYC");
booking.bookFlight("FL123", "John Smith");
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
Though, JINI provided automatic discovery, leasing and location transparency but it was too complex and only supported Java ecosystem. The ideas were sound and reappeared later in service meshes (Consul, Eureka) and Kubernetes service discovery. I learned that service discovery is essential for dynamic systems, but the implementation must be simple.
CORBA
I used CORBA (Common Object Request Broker Architecture) for many years in 1990s when building intelligent traffic Systems. CORBA promised the language-independent, platform-independent distributed objects. You could write a service in C++, invoke it from Java, and have clients in Python using the same IDL. Here’s a simple CORBA IDL definition:
This generated client stubs and server skeletons for your target language. I built a message-oriented middleware (MOM) system with CORBA that collected traffic data from road sensors and provided real-time traffic information.
C++ server implementation:
#include "TrafficService_impl.h"
#include <iostream>
#include <vector>
class TrafficServiceImpl : public POA_TrafficMonitor::TrafficService {
private:
std::vector<TrafficMonitor::SensorData> data_store;
public:
void reportData(const TrafficMonitor::SensorData& data) {
data_store.push_back(data);
std::cout << "Received data from sensor "
<< data.sensor_id << std::endl;
}
TrafficMonitor::SensorDataList* getRecentData(CORBA::Long minutes) {
TrafficMonitor::SensorDataList* result =
new TrafficMonitor::SensorDataList();
// Filter data from last N minutes
time_t cutoff = time(NULL) - (minutes * 60);
for (const auto& entry : data_store) {
if (entry.timestamp >= cutoff) {
result->length(result->length() + 1);
(*result)[result->length() - 1] = entry;
}
}
return result;
}
CORBA::Float getAverageSpeed() {
if (data_store.empty()) return 0.0;
float sum = 0.0;
for (const auto& entry : data_store) {
sum += entry.speed;
}
return sum / data_store.size();
}
};
Java client:
import org.omg.CORBA.*;
import TrafficMonitor.*;
public class TrafficClient {
public static void main(String[] args) {
try {
// Initialize ORB
ORB orb = ORB.init(args, null);
// Get reference to service
org.omg.CORBA.Object obj =
orb.string_to_object("corbaname::localhost:1050#TrafficService");
TrafficService service = TrafficServiceHelper.narrow(obj);
// Report sensor data
SensorData data = new SensorData();
data.sensor_id = 101;
data.speed = 65.5f;
data.timestamp = (int)(System.currentTimeMillis() / 1000);
service.reportData(data);
// Get average speed
float avgSpeed = service.getAverageSpeed();
System.out.println("Average speed: " + avgSpeed + " mph");
} catch (Exception e) {
e.printStackTrace();
}
}
}
However, CORBA specification was massive and different ORB (Object Request Broker) implementations like Orbix, ORBacus, and TAO couldn’t reliably interoperate despite claiming CORBA compliance. The binary protocol, IIOP, had subtle incompatibilities. CORBA did introduce valuable concepts:
Interceptors for cross-cutting concerns (authentication, logging, monitoring)
IDL-first design that forced clear interface definitions
Language-neutral protocols that actually worked (sometimes)
I learned that standards designed by committee are often over-engineer. CORBA, SOAP tried to solve every problem for everyone and ended up being optimal for no one.
SOAP and WSDL
I used SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) on a number of projects in early 2000s that emerged as the standard for web services. The pitch: XML-based, platform-neutral, and “simple.” Here’s a WSDL definition:
You can look at all that XML overhead! A simple request became hundreds of bytes of markup. As SOAP was designed by committee (IBM, Oracle, Microsoft), it tried to solve every possible enterprise problem: transactions, security, reliability, routing, orchestration. I learned that simplicity beats features and SOAP collapsed under its own weight.
Java Servlets and Filters
With Java 1.1, it added support for Servlets that provided a much better model than CGI. Instead of spawning a process per request, servlets were Java classes instantiated once and reused across requests:
You could chain filters for compression, logging, transformation, rate limiting with clean separation of concerns without touching business logic. I previously had experienced with CORBA interceptors for injecting cross-cutting business logic and the filter pattern solved similar cross-cutting concerns problem. This pattern would reappear in service meshes and API gateways.
Enterprise Java Beans
I used Enterprise Java Beans (EJB) in late 1990s and early 2000s that attempted to make distributed objects transparent. Its key idea was that use regular Java objects and let the application server handle all the distribution, persistence, transactions, and security. Here’s what an EJB 2.x entity bean looked like:
// Remote interface
public interface Customer extends EJBObject {
String getName() throws RemoteException;
void setName(String name) throws RemoteException;
double getBalance() throws RemoteException;
void setBalance(double balance) throws RemoteException;
}
// Home interface
public interface CustomerHome extends EJBHome {
Customer create(Integer id, String name) throws CreateException, RemoteException;
Customer findByPrimaryKey(Integer id) throws FinderException, RemoteException;
}
// Bean implementation
public class CustomerBean implements EntityBean {
private Integer id;
private String name;
private double balance;
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public double getBalance() { return balance; }
public void setBalance(double balance) { this.balance = balance; }
// Container callbacks
public void ejbActivate() {}
public void ejbPassivate() {}
public void ejbLoad() {}
public void ejbStore() {}
public void setEntityContext(EntityContext ctx) {}
public void unsetEntityContext() {}
public Integer ejbCreate(Integer id, String name) {
this.id = id;
this.name = name;
this.balance = 0.0;
return null;
}
public void ejbPostCreate(Integer id, String name) {}
}
The N+1 Selects Problem and Network Fallacy
The fatal flaw: EJB pretended network calls were free. I watched teams write code like this:
CustomerHome home = // ... lookup
Customer customer = home.findByPrimaryKey(customerId);
// Each getter is a remote call!
String name = customer.getName(); // Network call
double balance = customer.getBalance(); // Network call
Worse, I saw code that made remote calls in loops:
Collection customers = home.findAll();
double totalBalance = 0.0;
for (Customer customer : customers) {
// Remote call for EVERY iteration!
totalBalance += customer.getBalance();
}
This violated the first Fallacy of Distributed Computing: The network is reliable. It’s also not zero latency. What looked like simple property access actually made HTTP calls to a remote server. I had previously built distributed and parallel applications, so I understood network latency. But it blindsided most developers because EJB deliberately hid it.
I learned that you can’t hide distribution. Network calls are fundamentally different from local calls. Latency, failure modes, and semantics are different. Transparency is a lie.
REST Standard
Before REST became mainstream, I experimented with “Plain Old XML” (POX) over HTTP by just sending XML documents via HTTP POST without all the SOAP ceremony:
import requests
import xml.etree.ElementTree as ET
# Create XML request
root = ET.Element('getCustomer')
ET.SubElement(root, 'customerId').text = '12345'
xml_data = ET.tostring(root, encoding='utf-8')
# Send HTTP POST
response = requests.post(
'http://api.example.com/customer',
data=xml_data,
headers={'Content-Type': 'application/xml'}
)
# Parse response
response_tree = ET.fromstring(response.content)
name = response_tree.find('name').text
This was simpler than SOAP, but still ad-hoc. Then REST (Representational State Transfer), based on Roy Fielding’s 2000 dissertation offered a principled approach:
Use HTTP methods semantically (GET, POST, PUT, DELETE)
Resources have URLs
Stateless communication
Hypermedia as the engine of application state (HATEOAS)
Here’s a RESTful API in Python with Flask:
from flask import Flask, jsonify, request
app = Flask(__name__)
# In-memory data store
customers = {
'12345': {'id': '12345', 'name': 'John Smith', 'balance': 5000.00}
}
@app.route('/customers/<customer_id>', methods=['GET'])
def get_customer(customer_id):
customer = customers.get(customer_id)
if customer:
return jsonify(customer), 200
return jsonify({'error': 'Customer not found'}), 404
@app.route('/customers', methods=['POST'])
def create_customer():
data = request.get_json()
customer_id = data['id']
customers[customer_id] = data
return jsonify(data), 201
@app.route('/customers/<customer_id>', methods=['PUT'])
def update_customer(customer_id):
if customer_id not in customers:
return jsonify({'error': 'Customer not found'}), 404
data = request.get_json()
customers[customer_id].update(data)
return jsonify(customers[customer_id]), 200
@app.route('/customers/<customer_id>', methods=['DELETE'])
def delete_customer(customer_id):
if customer_id in customers:
del customers[customer_id]
return '', 204
return jsonify({'error': 'Customer not found'}), 404
if __name__ == '__main__':
app.run(debug=True)
In practice, most APIs called “REST” weren’t truly RESTful and didn’t implement HATEOAS or use HTTP status codes correctly. But even “REST-ish” APIs were far simpler than SOAP. Key lesson I leared was that REST succeeded because it built on HTTP, something every platform already supported. No new protocols, no complex tooling. Just URLs, HTTP verbs, and JSON.
JSON Replaces XML
With adoption of REST, I saw a decline of XML Web Services (JAX-WS) and I used JAX-RS for REST services that supported JSON payload. XML required verbose markup:
You have to encode references manually, unlike some XML schemas that support IDREF.
Erlang/OTP
I learned about actor model in college and built a framework based on actors and Linda memory model. In the mid-2000s, I encountered Erlang that used actors for building distributed systems. Erlang was designed in the 1980s at Ericsson for building telecom switches and is based on following design:
“Let it crash” philosophy
No shared memory between processes
Lightweight processes (not OS threads—Erlang processes)
If a process crashed, the supervisor automatically restarted it and the system self-healed. A key lesson I learned from actor model and Erlang was that a shared mutable state is the enemy. Message passing with isolated state is simpler, more reliable, and easier to reason about. Today, AWS Lambda, Azure Durable Functions, and frameworks like Akka all embrace the Actor model.
Distributed Erlang
Erlang made distributed computing almost trivial. Processes on different nodes communicated identically to local processes:
% On node1@host1
RemotePid = spawn('node2@host2', module, function, [args]),
RemotePid ! {message, data}.
% On node2@host2 - receives the message
receive
{message, Data} ->
io:format("Received: ~p~n", [Data])
end.
The VM handled all the complexity of node discovery, connection management, and message routing. Today’s serverless functions are actors and kubernetes pods are supervised processes.
Asynchronous Messaging
As systems grew more complex, asynchronous messaging became essential. I worked extensively with Oracle Tuxedo, IBM MQSeries, WebLogic JMS, WebSphere MQ, and later ActiveMQ, MQTT / AMQP, ZeroMQ and RabbitMQ primarily for inter-service communication and asynchronous processing. Here’s a JMS producer in Java:
import javax.jms.*;
import javax.naming.*;
public class OrderConsumer implements MessageListener {
public static void main(String[] args) throws Exception {
Context ctx = new InitialContext();
ConnectionFactory factory =
(ConnectionFactory) ctx.lookup("ConnectionFactory");
Queue queue = (Queue) ctx.lookup("OrderQueue");
Connection connection = factory.createConnection();
Session session = connection.createSession(
false, Session.AUTO_ACKNOWLEDGE);
MessageConsumer consumer = session.createConsumer(queue);
consumer.setMessageListener(new OrderConsumer());
connection.start();
System.out.println("Waiting for messages...");
Thread.sleep(Long.MAX_VALUE); // Keep running
}
public void onMessage(Message message) {
try {
TextMessage textMessage = (TextMessage) message;
System.out.println("Received order: " +
textMessage.getText());
// Process order
processOrder(textMessage.getText());
} catch (JMSException e) {
e.printStackTrace();
}
}
private void processOrder(String orderJson) {
// Business logic here
}
}
Asynchronous messaging is essential for building resilient, scalable systems. It decouples producers from consumers, provides natural backpressure, and enables event-driven architectures.
Spring Framework and Aspect-Oriented Programming
In early 2000, I used aspect oriented programming (AOP) to inject cross cutting concerns like logging, security, monitoring, etc. Here is a typical example:
@Aspect
@Component
public class LoggingAspect {
private static final Logger logger =
LoggerFactory.getLogger(LoggingAspect.class);
@Before("execution(* com.example.service.*.*(..))")
public void logBefore(JoinPoint joinPoint) {
logger.info("Executing: " +
joinPoint.getSignature().getName());
}
@AfterReturning(
pointcut = "execution(* com.example.service.*.*(..))",
returning = "result")
public void logAfterReturning(JoinPoint joinPoint, Object result) {
logger.info("Method " +
joinPoint.getSignature().getName() +
" returned: " + result);
}
@Around("@annotation(com.example.Monitored)")
public Object measureTime(ProceedingJoinPoint joinPoint)
throws Throwable {
long start = System.currentTimeMillis();
Object result = joinPoint.proceed();
long time = System.currentTimeMillis() - start;
logger.info(joinPoint.getSignature().getName() +
" took " + time + " ms");
return result;
}
}
I later adopted Spring Framework that revolutionized Java development with dependency injection and aspect-oriented programming (AOP):
// Spring configuration
@Configuration
public class AppConfig {
@Bean
public CustomerService customerService() {
return new CustomerServiceImpl(customerRepository());
}
@Bean
public CustomerRepository customerRepository() {
return new DatabaseCustomerRepository(dataSource());
}
@Bean
public DataSource dataSource() {
DriverManagerDataSource ds = new DriverManagerDataSource();
ds.setDriverClassName("com.mysql.jdbc.Driver");
ds.setUrl("jdbc:mysql://localhost/mydb");
return ds;
}
}
// Service class
@Service
public class CustomerServiceImpl implements CustomerService {
private final CustomerRepository repository;
@Autowired
public CustomerServiceImpl(CustomerRepository repository) {
this.repository = repository;
}
@Transactional
public void updateBalance(String customerId, double newBalance) {
Customer customer = repository.findById(customerId);
customer.setBalance(newBalance);
repository.save(customer);
}
}
Spring Remoting
Spring added its own remoting protocols. HTTP Invoker serialized Java objects over HTTP:
// Server configuration
@Configuration
public class ServerConfig {
@Bean
public HttpInvokerServiceExporter customerService() {
HttpInvokerServiceExporter exporter =
new HttpInvokerServiceExporter();
exporter.setService(customerServiceImpl());
exporter.setServiceInterface(CustomerService.class);
return exporter;
}
}
// Client configuration
@Configuration
public class ClientConfig {
@Bean
public HttpInvokerProxyFactoryBean customerService() {
HttpInvokerProxyFactoryBean proxy =
new HttpInvokerProxyFactoryBean();
proxy.setServiceUrl("http://localhost:8080/customer");
proxy.setServiceInterface(CustomerService.class);
return proxy;
}
}
I learned that AOP addressed cross-cutting concerns elegantly for monoliths. But in microservices, these concerns moved to the infrastructure layer like service meshes, API gateways, and sidecars.
Proprietary Protocols
When working for large companies like Amazon, I encountered Amazon Coral, which is a proprietary RPC framework influenced by CORBA. Coral used an IDL to define service interfaces and supported multiple languages:
The IDL compiler generated client and server code for Java, C++, and other languages. Coral handled serialization, versioning, and service discovery. When I later worked for AWS, I used Smithy that was successor Coral, which Amazon open-sourced. Here is a similar example of a Smithy contract:
I learned IDL-first design remains valuable. Smithy learned from CORBA, Protocol Buffers, and Thrift.
Long Polling, WebSockets, and Real-Time
In late 2000s, I built real-time applications for streaming financial charts and technical data. I used long polling where the client made a request that the server held open until data was available:
// Client-side long polling
function pollServer() {
fetch('/api/events')
.then(response => response.json())
.then(data => {
console.log('Received event:', data);
updateUI(data);
// Immediately poll again
pollServer();
})
.catch(error => {
console.error('Polling error:', error);
// Retry after delay
setTimeout(pollServer, 5000);
});
}
pollServer();
Server-side (Node.js):
const express = require('express');
const app = express();
let pendingRequests = [];
app.get('/api/events', (req, res) => {
// Hold request open
pendingRequests.push(res);
// Timeout after 30 seconds
setTimeout(() => {
const index = pendingRequests.indexOf(res);
if (index !== -1) {
pendingRequests.splice(index, 1);
res.json({ type: 'heartbeat' });
}
}, 30000);
});
// When an event occurs
function broadcastEvent(event) {
pendingRequests.forEach(res => {
res.json(event);
});
pendingRequests = [];
}
WebSockets
I also used WebSockets for real time applications that supported true bidirectional communication. However, earlier browsers didn’t fully support them so I used long polling as a fallback when websockets were not supported:
// Server (Node.js with ws library)
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
console.log('Client connected');
// Send initial data
ws.send(JSON.stringify({
type: 'INIT',
data: getInitialData()
}));
// Handle messages
ws.on('message', (message) => {
const msg = JSON.parse(message);
if (msg.type === 'SUBSCRIBE') {
subscribeToSymbol(ws, msg.symbol);
}
});
ws.on('close', () => {
console.log('Client disconnected');
unsubscribeAll(ws);
});
});
// Stream live data
function streamPriceUpdate(symbol, price) {
wss.clients.forEach((client) => {
if (client.readyState === WebSocket.OPEN) {
if (isSubscribed(client, symbol)) {
client.send(JSON.stringify({
type: 'PRICE_UPDATE',
symbol: symbol,
price: price,
timestamp: Date.now()
}));
}
}
});
}
I learned that different problems need different protocols. REST works for request-response. WebSockets excel for real-time bidirectional communication.
Vert.x and Hazelcast for High-Performance Streaming
For a production streaming chart system handling high-volume market data, I used Vert.x with Hazelcast. Vert.x is a reactive toolkit built on Netty that excels at handling thousands of concurrent connections with minimal resources. Hazelcast provided distributed caching and coordination across multiple Vert.x instances. Market data flowed into Hazelcast distributed topics, Vert.x instances subscribed to these topics and pushed updates to connected WebSocket clients. If WebSocket wasn’t supported, we fell back to long polling automatically.
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServer;
import io.vertx.core.http.ServerWebSocket;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;
import com.hazelcast.core.Message;
import com.hazelcast.core.MessageListener;
import java.util.concurrent.ConcurrentHashMap;
import java.util.Set;
public class MarketDataServer {
private final Vertx vertx;
private final HazelcastInstance hazelcast;
private final ConcurrentHashMap<String, Set<ServerWebSocket>> subscriptions;
public MarketDataServer() {
this.vertx = Vertx.vertx();
this.hazelcast = Hazelcast.newHazelcastInstance();
this.subscriptions = new ConcurrentHashMap<>();
// Subscribe to market data topic
ITopic<MarketData> topic = hazelcast.getTopic("market-data");
topic.addMessageListener(new MessageListener<MarketData>() {
public void onMessage(Message<MarketData> message) {
broadcastToSubscribers(message.getMessageObject());
}
});
}
public void start() {
HttpServer server = vertx.createHttpServer();
server.webSocketHandler(ws -> {
String path = ws.path();
if (path.startsWith("/stream/")) {
String symbol = path.substring(8);
handleWebSocketConnection(ws, symbol);
} else {
ws.reject();
}
});
// Long polling fallback
server.requestHandler(req -> {
if (req.path().startsWith("/poll/")) {
String symbol = req.path().substring(6);
handleLongPolling(req, symbol);
}
});
server.listen(8080, result -> {
if (result.succeeded()) {
System.out.println("Market data server started on port 8080");
}
});
}
private void handleWebSocketConnection(ServerWebSocket ws, String symbol) {
subscriptions.computeIfAbsent(symbol, k -> ConcurrentHashMap.newKeySet())
.add(ws);
ws.closeHandler(v -> {
Set<ServerWebSocket> sockets = subscriptions.get(symbol);
if (sockets != null) {
sockets.remove(ws);
}
});
// Send initial snapshot from Hazelcast cache
IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
MarketData data = cache.get(symbol);
if (data != null) {
ws.writeTextMessage(data.toJson());
}
}
private void handleLongPolling(HttpServerRequest req, String symbol) {
String lastEventId = req.getParam("lastEventId");
// Hold request until data available or timeout
long timerId = vertx.setTimer(30000, id -> {
req.response()
.putHeader("Content-Type", "application/json")
.end("{\"type\":\"heartbeat\"}");
});
// Register one-time listener
subscriptions.computeIfAbsent(symbol + ":poll",
k -> ConcurrentHashMap.newKeySet())
.add(new PollHandler(req, timerId));
}
private void broadcastToSubscribers(MarketData data) {
String symbol = data.getSymbol();
// WebSocket subscribers
Set<ServerWebSocket> sockets = subscriptions.get(symbol);
if (sockets != null) {
String json = data.toJson();
sockets.forEach(ws -> {
if (!ws.isClosed()) {
ws.writeTextMessage(json);
}
});
}
// Update Hazelcast cache for new subscribers
IMap<String, MarketData> cache = hazelcast.getMap("market-snapshot");
cache.put(symbol, data);
}
public static void main(String[] args) {
new MarketDataServer().start();
}
}
Publishing market data to Hazelcast from data feed:
public class MarketDataPublisher {
private final HazelcastInstance hazelcast;
public void publishUpdate(String symbol, double price, long volume) {
MarketData data = new MarketData(symbol, price, volume,
System.currentTimeMillis());
// Publish to topic - all Vert.x instances receive it
ITopic<MarketData> topic = hazelcast.getTopic("market-data");
topic.publish(data);
}
}
Hazelcast Distribution: Market data shared across multiple Vert.x instances without a central message broker
Horizontal Scaling: Adding Vert.x instances automatically joined the Hazelcast cluster
Low Latency: Sub-millisecond message propagation within the cluster
Automatic Fallback: Clients detected WebSocket support; older browsers used long polling
Facebook Thrift and Google Protocol Buffers
I experimented with Facebook Thrift and Google Protocol Buffers that provided IDL-based RPC with multiple protocols: Here is an example of Protocol Buffers:
Python server with gRPC (which uses Protocol Buffers):
import grpc
from concurrent import futures
import customer_pb2
import customer_pb2_grpc
class CustomerServicer(customer_pb2_grpc.CustomerServiceServicer):
def GetCustomer(self, request, context):
return customer_pb2.Customer(
customer_id=request.customer_id,
name="John Doe",
balance=5000.00
)
def UpdateBalance(self, request, context):
print(f"Updating balance for {request.customer_id} " +
f"to {request.new_balance}")
return customer_pb2.UpdateBalanceResponse(success=True)
def ListCustomers(self, request, context):
customers = [
customer_pb2.Customer(customer_id=1, name="Alice", balance=1000),
customer_pb2.Customer(customer_id=2, name="Bob", balance=2000),
]
return customer_pb2.CustomerList(customers=customers)
def serve():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
customer_pb2_grpc.add_CustomerServiceServicer_to_server(
CustomerServicer(), server)
server.add_insecure_port('[::]:50051')
server.start()
print("Server started on port 50051")
server.wait_for_termination()
if __name__ == '__main__':
serve()
I learned that binary protocols offer significant efficiency gains. JSON is human-readable and convenient for debugging, but in high-performance scenarios, binary protocols like Protocol Buffers reduce payload size and serialization overhead.
Serverless and Lambda: Functions as a Service
Around 2015, AWS Lambda introduced serverless computing where you wrote functions, and AWS handled all the infrastructure:
Serverless was powerful with no servers to manage, automatic scaling, pay-per-invocation pricing. It felt like the Actor model I’d worked for my research that offered small, stateless, event-driven functions.
However, I also encountered several problems with serverless:
Cold starts: First invocation could be slow (though it has improved with recent updates)
Timeouts: Functions had maximum execution time (15 minutes for Lambda)
State management: Functions were stateless; you needed external state stores
Orchestration: Coordinating multiple functions was complex
The ping-pong anti-pattern emerged where Lambda A calls Lambda B, which calls Lambda C, which calls Lambda D. This created hard-to-debug systems with unpredictable costs. AWS Step Functions and Azure Durable Functions addressed orchestration:
gRPC had one major gotcha in Kubernetes: connection persistence breaks load balancing. I documented this exhaustively in my blog post The Complete Guide to gRPC Load Balancing in Kubernetes and Istio. HTTP/2 multiplexes multiple requests over a single TCP connection. Once that connection is established to one pod, all requests go there. Kubernetes Service load balancing happens at L4 (TCP), so it doesn’t see individual gRPC calls and it only sees one connection. I used Istio’s Envoy sidecar, which operates at L7 and routes each gRPC call independently:
I learned that modern protocols solve old problems but introduce new ones. gRPC is excellent, but you must understand how it interacts with infrastructure. Production systems require deep integration between application protocol and deployment environment.
Modern Messaging and Streaming
I have been using Apache Kafka for many years that transformed how we think about data. It’s not just a message queue instead it’s a distributed commit log:
from kafka import KafkaProducer, KafkaConsumer
import json
# Producer
producer = KafkaProducer(
bootstrap_servers='localhost:9092',
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
order = {
'order_id': '12345',
'customer_id': '67890',
'amount': 99.99,
'timestamp': time.time()
}
producer.send('orders', value=order)
producer.flush()
# Consumer
consumer = KafkaConsumer(
'orders',
bootstrap_servers='localhost:9092',
auto_offset_reset='earliest',
value_deserializer=lambda m: json.loads(m.decode('utf-8')),
group_id='order-processors'
)
for message in consumer:
order = message.value
print(f"Processing order: {order['order_id']}")
# Process order
Kafka’s provided:
Durability: Messages are persisted to disk
Replayability: Consumers can reprocess historical events
Partitioning: Horizontal scalability through partitions
Consumer groups: Multiple consumers can process in parallel
Key Lesson: Event-driven architectures enable loose coupling and temporal decoupling. Systems can be rebuilt from the event log. This is Event Sourcing—a powerful pattern that Kafka makes practical at scale.
Agentic RPC: MCP and Agent-to-Agent Protocol
Over the last year, I have been building Agentic AI applications using Model Context Protocol (MCP) and more recently Agent-to-Agent (A2A) protocol. Both use JSON-RPC 2.0 underneath. After decades of RPC evolution, from Sun RPC to CORBA to gRPC, we’ve come full circle to JSON-RPC for AI agents. I recently built a daily minutes assistant that aggregates information from multiple sources into a morning briefing. After decades of RPC evolution, from Sun RPC to CORBA to gRPC, it has come full circle to JSON-RPC for AI agents.
Service Discovery
A2A immediately reminded me of Sun’s Network Information Service (NIS), originally called Yellow Pages that I used in early 1990s. NIS provided a centralized directory service for Unix systems to look up user accounts, host names, and configuration data across a network. I saw this pattern repeated throughout the decades:
CORBA Naming Service (1990s): Objects registered themselves with a hierarchical naming service, and clients discovered them by name
JINI (late 1990s): Services advertised themselves via multicast, and clients discovered them through lookup registrars (as I described earlier in the JINI section)
UDDI (2000s): Universal Description, Discovery, and Integration for web services—a registry where SOAP services could be published and discovered
Consul, Eureka, etcd (2010s): Modern service discovery for microservices
Kubernetes DNS/Service Discovery (2010s-present): Built-in service registry and DNS-based discovery
Model Context Protocol (MCP)
MCP lets AI agents discover and invoke tools provided by servers. I recently built a daily minutes assistant that aggregates information from multiple sources into a morning briefing. Here’s the MCP server that exposes tools to the AI agent:
from mcp.server import Server
import mcp.types as types
from typing import Any
import asyncio
class DailyMinutesServer:
def __init__(self):
self.server = Server("daily-minutes")
self.setup_handlers()
def setup_handlers(self):
@self.server.list_tools()
async def handle_list_tools() -> list[types.Tool]:
return [
types.Tool(
name="get_emails",
description="Fetch recent emails from inbox",
inputSchema={
"type": "object",
"properties": {
"hours": {
"type": "number",
"description": "Hours to look back"
},
"limit": {
"type": "number",
"description": "Max emails to fetch"
}
}
}
),
types.Tool(
name="get_hackernews",
description="Fetch top Hacker News stories",
inputSchema={
"type": "object",
"properties": {
"limit": {
"type": "number",
"description": "Number of stories"
}
}
}
),
types.Tool(
name="get_rss_feeds",
description="Fetch latest RSS feed items",
inputSchema={
"type": "object",
"properties": {
"feed_urls": {
"type": "array",
"items": {"type": "string"}
}
}
}
),
types.Tool(
name="get_weather",
description="Get current weather forecast",
inputSchema={
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
)
]
@self.server.call_tool()
async def handle_call_tool(
name: str,
arguments: dict[str, Any]
) -> list[types.TextContent]:
if name == "get_emails":
result = await email_connector.fetch_recent(
hours=arguments.get("hours", 24),
limit=arguments.get("limit", 10)
)
elif name == "get_hackernews":
result = await hn_connector.fetch_top_stories(
limit=arguments.get("limit", 10)
)
elif name == "get_rss_feeds":
result = await rss_connector.fetch_feeds(
feed_urls=arguments["feed_urls"]
)
elif name == "get_weather":
result = await weather_connector.get_forecast(
location=arguments["location"]
)
else:
raise ValueError(f"Unknown tool: {name}")
return [types.TextContent(
type="text",
text=json.dumps(result, indent=2)
)]
Each connector is a simple async module. Here’s the Hacker News connector:
import aiohttp
from typing import List, Dict
class HackerNewsConnector:
BASE_URL = "https://hacker-news.firebaseio.com/v0"
async def fetch_top_stories(self, limit: int = 10) -> List[Dict]:
async with aiohttp.ClientSession() as session:
# Get top story IDs
async with session.get(f"{self.BASE_URL}/topstories.json") as resp:
story_ids = await resp.json()
# Fetch details for top N stories
stories = []
for story_id in story_ids[:limit]:
async with session.get(
f"{self.BASE_URL}/item/{story_id}.json"
) as resp:
story = await resp.json()
stories.append({
"title": story.get("title"),
"url": story.get("url"),
"score": story.get("score"),
"by": story.get("by"),
"time": story.get("time")
})
return stories
RSS and weather connectors follow the same pattern—simple, focused modules that the MCP server orchestrates.
JSON-RPC Under the Hood
MCP is that it’s just JSON-RPC 2.0 over stdio or HTTP. Here’s what a tool call looks like on the wire:
Agent synthesizes everything into a concise morning briefing
The AI decides which tools to call, in what order, based on the user’s preferences. I don’t hardcode the workflow.
Agent-to-Agent Protocol (A2A)
While MCP focuses on tool calling, A2A addresses agent-to-agent discovery and communication. It’s the modern equivalent of NIS/Yellow Pages for agents. Agents register their capabilities in a directory, and other agents discover and invoke them. A2A also uses JSON-RPC 2.0, but adds a discovery layer. Here’s how an agent registers itself:
Though, I appreciate the simplicity of MCP and A2A but here’s what worries me: both protocols largely ignore decades of hard-won lessons about security. The Salesloft breach in 2024 showed exactly what happens: their AI chatbot stored authentication tokens for hundreds of services. MCP and A2A give us standard protocols for tool calling and agent coordination, which is valuable. But they create a false sense of security while ignoring fundamentals we solved decades ago:
Authentication: How do we verify an agent’s identity?
Authorization: What capabilities should this agent have access to?
Credential rotation: How do we handle token expiration and renewal?
Observability: How do we trace agent interactions for debugging and auditing?
Principle of least privilege: How do we ensure agents only access what they need?
Rate limiting: How do we prevent a misbehaving agent from overwhelming services?
The community needs to address this before A2A and MCP see widespread enterprise adoption.
Lessons Learned
1. Complexity is the Enemy
Every failed technology I’ve used failed because of complexity. CORBA, SOAP, EJB—they all collapsed under their own weight. Successful technologies like REST, gRPC, Kafka focused on doing one thing well.
Implication: Be suspicious of solutions that try to solve every problem. Prefer composable, focused tools.
2. Network Calls Are Expensive
The first Fallacy of Distributed Computing haunts us still: The network is not reliable. It’s also not zero latency, infinite bandwidth, or secure. I’ve watched this lesson be relearned in every generation:
EJB entity beans made chatty network calls
Microservices make chatty REST calls
GraphQL makes chatty database queries
Implication: Design APIs to minimize round trips. Batch operations. Cache aggressively. Monitor network latency religiously. (See my blog on fault tolerance in microservices for details.)
3. Statelessness Scales
Stateless services scale horizontally. But real applications need state—session data, shopping carts, user preferences. The solution isn’t to make services stateful instead it’s to externalize state:
Session stores (Redis, Memcached)
Databases (PostgreSQL, DynamoDB)
Event logs (Kafka)
Distributed caches
Implication: Keep service logic stateless. Push state to specialized systems designed for it.
4. The Actor Model Is Underappreciated
My research with actors and Linda memory model convinced me that the Actor model simplifies concurrent and distributed systems. Today’s serverless functions are essentially actors. Frameworks like Akka, Orleans, and Dapr embrace it. Actors eliminate shared mutable shared state, which the source of most concurrency bugs.
Implication: For event-driven systems, consider Actor-based frameworks. They map naturally to distributed problems.
5. Observability
Modern distributed systems require extensive instrumentation. You need:
Structured logging with correlation IDs
Metrics for performance and health
Distributed tracing to follow requests across services
Alarms with proper thresholds
Implication: Instrument your services from day one. Observability is infrastructure, not a nice-to-have. (See my blog posts on fault tolerance and load shedding for specific metrics.)
6. Throttling and Load Shedding
Every production system eventually faces traffic spikes or DDoS attacks. Without throttling and load shedding, your system will collapse. Key techniques:
Rate limiting by client/user/IP
Admission control based on queue depth
Circuit breakers to fail fast
Backpressure to slow down producers
Implication: Build throttling and load shedding into your architecture early. They’re harder to retrofit. (See my comprehensive blog post on this topic.)
7. Idempotency
Network failures mean requests may be retried. If your operations aren’t idempotent, you’ll process payments twice, create duplicate orders, and corrupt data (See my blog on idempotency topic). Make operations idempotent:
Use idempotency keys
Check if operation already succeeded
Design APIs to be safely retryable
Implication: Every non-read operation should be idempotent. It saves you from a world of hurt.
8. External and Internal APIs Should Differ
I have learned that external APIs need a good UX and developer empathy so that APIs are intuitive, consistent, well-documented. Internal APIs can optimize for performance, reliability, and operational needs. Don’t expose your internal architecture to external consumers. Use API gateways to translate between external contracts and internal services.
Implication: Design external APIs for developers using them. Design internal APIs for operational excellence.
9. Standards Beat Proprietary Solutions
Novell IPX failed because it was proprietary. Sun RPC succeeded as an open standard. REST thrived because it built on HTTP. gRPC uses open standards (HTTP/2, Protocol Buffers).
Implication: Prefer open standards. If you must use proprietary tech, understand the exit strategy.
10. Developer Experience Matters
Technologies with great developer experience get adopted. Java succeeded because it was easier than C++. REST beat SOAP because it was simpler. Kubernetes won because it offered a powerful abstraction.
Implication: Invest in developer tools, documentation, and ergonomics. Friction kills momentum.
Upcoming Trends
WebAssembly: The Next Runtime
WebAssembly (Wasm) is emerging as a universal runtime. Code written in Rust, Go, C, or AssemblyScript compiles to Wasm and runs anywhere. Platforms like wasmCloud, Fermyon, and Lunatic are building Actor-based systems on Wasm. Combined with the Component Model and WASI (WebAssembly System Interface), Wasm offers near-native performance, strong sandboxing, and portability. It might replace Docker containers for some workloads. Solomon Hykes, creator of Docker, famously said:
“If WASM+WASI existed in 2008, we wouldn’t have needed to create Docker. That’s how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let’s hope WASI is up to the task!” — Solomon Hykes, March 2019
WebAssembly isn’t ready yet. Critical gaps:
WASI maturity: Still evolving (Preview 2 in development)
Async I/O: Limited compared to native runtimes
Database drivers: Many don’t support WASM
Networking: WASI sockets still experimental
Ecosystem tooling: Debugging, profiling still primitive
Service Meshes
Istio, Linkerd, Dapr move cross-cutting concerns out of application code:
Authentication/authorization
Rate limiting
Circuit breaking
Retries with exponential backoff
Distributed tracing
Metrics collection
Tradeoff: Complexity shifts from application code to infrastructure. Teams need deep Kubernetes and service mesh expertise.
The Edge Is Growing
Edge computing brings computation closer to users. CDNs like Cloudflare Workers and Fastly Compute@Edge run code globally with single-digit millisecond latency. This requires new thinking like eventual consistency, CRDTs (Conflict-free Replicated Data Types), and geo-distributed state management.
AI Agents and Multi-Agent Systems
I’m currently building agentic AI systems using LangGraph, RAG, and MCP. These are inherently distributed and agents communicate asynchronously, maintain local state, and coordinate through message passing. It’s the Actor model again.
What’s Missing
Despite all this progress, we still struggle with:
Distributed transactions: Two-phase commit doesn’t scale; SAGA patterns are complex
Testing distributed systems: Mocking services, simulating failures, and reproducing production bugs remain hard. I have written a number of tools for mock testing.
Observability at scale: Tracing millions of requests generates too much data
Cost management: Cloud bills spiral as systems grow
Cognitive load: Modern systems require expertise in dozens of technologies
Conclusion
I’ve been writing network code for decades and have used dozens of protocols, frameworks, and paradigms. Here is what I have learned:
Design for Failure from Day One (Systems built with circuit breakers, retries, timeouts, and graceful degradation from the start).
Other tips from evolution of remote services include:
Design systems as message-passing actors from the start. Whether that’s Erlang processes, Akka actors, Orleans grains, or Lambda functions—embrace isolated state and message passing.
Invest in Observability with structured logging with correlation IDs, instrumented metrics, distributed tracing and alarms.
Separate External and Internal APIs. Use REST or GraphQL for external APIs (with versioning) and use gRPC or Thrift for internal communication (efficient).
Build Throttling and Load Shedding by rate limiting by client/user/IP at the edge and implement admission control at the service level (See my blog on Effective Load Shedding and Throttling).
Make Everything Idempotent as networks fail and requests get retried. Use idempotency keys for all mutations.
Choose Boring Technology (See Choose Boring Technology). For your core infrastructure, use proven tech (PostgreSQL, Redis, Kafka).
Test for Failure. Most code only handles the happy path. Production is all about unhappy paths.
Make chaos engineering part of CI/CD and use property-based testing (See my blog on property-based testing).
The technologies change like mainframes to serverless, Assembly to Go, CICS to Kubernetes. But the underlying principles remain constant. We oscillate between extremes:
Each swing teaches us something. CORBA was too complex, but IDL-first design is valuable. REST was liberating, but binary protocols are more efficient. Microservices enable agility, but operational complexity explodes. The sweet spot is usually in the middle. Modular monoliths with clear boundaries. REST for external APIs, gRPC for internal communication. Some synchronous calls, some async messaging.
Here are a few trends that I see becoming prevalent:
WebAssembly may replace containers for some workloads: Faster startup, better security with platforms like wasmCloud and Fermyon.
Service meshes are becoming invisible: Currently they are too complex. Ambient mesh (no sidecars) and eBPF-based routing are gaining wider adoption.
The Actor model will eat the world: Serverless functions are actors and durable functions are actor orchestration.
Edge computing will force new patterns: We can’t rely on centralized state and may need CRDTs and eventual consistency.
AI agents will need distributed coordination. Multi-agent systems = distributed systems and may need message passing between agents.
The best engineers don’t just learn the latest framework, they study the history, understand the trade-offs, and recognize when old ideas solve new problems. The future of distributed systems won’t be built by inventing entirely new paradigms instead it’ll be built by taking the best ideas from the past, learning from the failures, and applying them with better tools.
TL;DR: Tested open-source LLM serving (vLLM) on GCP L4 GPUs. Achieved 93% cost savings vs OpenAI GPT-4, 100% routing accuracy, and 91% cache hit rates. Prototype proves feasibility; production requires 5-7 months additional work (security, HA, ops). All code at github.com/bhatti/vllm-tutorial.
Background
Last year, our CEO mandated “AI adoption” across the organization and everyone had access to LLMs through an internal portal that used Vertex AI. However, there was a little training or best practices. I saw engineers using the most expensive models for simple queries, no cost tracking, zero observability into what was being used, and no policies around data handling. People tried AI, built some demos and got mixed results.
This mirrors what’s happening across the industry. Recent research shows 95% of AI pilots fail at large companies, and McKinsey found 42% of companies abandoned generative AI projects citing “no significant bottom line impact.” The 5% that succeed do something fundamentally different: they treat AI as infrastructure requiring proper tooling, not just API access.
This experience drove me to explore better approaches. I built prototypes using vLLM and open-source tools, tested them on GCP L4 GPUs, and documented what actually works. This blog shares those findings with real code, benchmarks, and lessons from building production-ready AI infrastructure. Every benchmark ran on actual hardware (GCP L4 GPUs), every pattern emerged from solving real problems, and all code is available at github.com/bhatti/vllm-tutorial.
Why Hosted LLM Access Isn’t Enough
Even with managed services like Vertex AI or Bedrock, enterprise AI needs additional layers that most organizations overlook:
Cost Management
No intelligent routing between models (GPT-4 for simple definitions that Phi-2 could handle)
No per-user, per-team budgets or limits
No cost attribution or chargeback
Result: Unpredictable expenses, no accountability
Observability
Can’t track which prompts users send
Can’t identify failing queries or quality degradation
Can’t measure actual usage patterns
Result: Flying blind when issues occur
Security & Governance
Data flows through third-party infrastructure
No granular access controls beyond API keys
Limited audit trails for compliance
Result: Compliance gaps, security risks
Performance Control
Can’t deploy custom fine-tuned models
No A/B testing between models
Limited control over routing logic
Result: Vendor lock-in, inflexibility
The Solution: vLLM with Production Patterns
After evaluating options, I built prototypes using vLLM—a high-performance inference engine for running open-source LLMs (Llama, Mistral, Phi) on your infrastructure. Think of vLLM as NGINX for LLMs: battle-tested, optimized runtime that makes production deployments feasible.
Production error handling (retries, circuit breakers, fallbacks)
System Architecture
Here’s the complete system architecture I’ve built and tested:
Production AI requires three monitoring layers:
Layer 1: Infrastructure (Prometheus + Grafana)
GPU utilization, memory usage
Request rate, error rate, latency (P50, P95, P99)
Integration via /metrics endpoint that vLLM exposes
Grafana dashboards visualize trends and trigger alerts
Layer 2: Application Metrics
Time to First Token (TTFT), tokens per second
Cost per request, model distribution
Budget tracking (daily, monthly limits)
Custom Prometheus metrics embedded in application code
Layer 3: LLM Observability (Langfuse)
Full prompt/response history for debugging
Cost attribution per user/team
Quality tracking over time
Essential for understanding what users actually do
Here’s what I’ve built and tested:
Setting Up Your Environment: GCP L4 GPU Setup
Before we dive into the concepts, let’s get your environment ready. I’m using GCP L4 GPUs because they offer the best price/performance for this workload ($0.45/hour), but the code works on any CUDA-capable GPU.
# Test vLLM installation
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
# Quick functionality test
python examples/01_basic_vllm.py
Expected output:
Loading model microsoft/phi-2...
Model loaded in 8.3 seconds
Generating response...
Generated 50 tokens in 987ms
Throughput: 41.5 tokens/sec
? vLLM is working!
Quick Start
Before we dive deep, let’s get something running:
Clone the repo:
git clone https://github.com/bhatti/vllm-tutorial.git
cd vllm-tutorial
If you have a GPU available:
# Follow setup instructions in README
python examples/01_basic_vllm.py
No GPU? Run the benchmarks locally:
# See the actual results from GCP L4 testing
cat benchmarks/results/01_throughput_results.json
from typing import Callable
from dataclasses import dataclass
import time
@dataclass
class RetryConfig:
"""Retry configuration"""
max_retries: int = 3
initial_delay: float = 1.0
max_delay: float = 60.0
exponential_base: float = 2.0
def retry_with_backoff(config: RetryConfig = RetryConfig()):
"""
Decorator: Retry with exponential backoff
Example:
@retry_with_backoff()
def generate_text(prompt):
return llm.generate(prompt)
"""
def decorator(func: Callable) -> Callable:
def wrapper(*args, **kwargs):
delay = config.initial_delay
for attempt in range(config.max_retries):
try:
return func(*args, **kwargs)
except Exception as e:
if attempt == config.max_retries - 1:
raise # Last attempt, re-raise
error_type = classify_error(e)
# Don't retry on invalid input
if error_type == ErrorType.INVALID_INPUT:
raise
print(f"?? Attempt {attempt + 1} failed: {error_type.value}")
print(f" Retrying in {delay:.1f}s...")
time.sleep(delay)
# Exponential backoff
delay = min(delay * config.exponential_base, config.max_delay)
raise RuntimeError(f"Failed after {config.max_retries} retries")
return wrapper
return decorator
# Usage
@retry_with_backoff(RetryConfig(max_retries=3, initial_delay=1.0))
def generate_with_retry(prompt: str):
"""Generate with automatic retry on failure"""
return llm.generate(prompt)
# This will retry up to 3 times with exponential backoff
result = generate_with_retry("Analyze earnings report")
Pattern 2: Circuit Breaker
When a service starts failing repeatedly, stop calling it:
from datetime import datetime, timedelta
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject requests
HALF_OPEN = "half_open" # Testing recovery
class CircuitBreaker:
"""
Circuit breaker for fault tolerance
Prevents cascading failures by stopping calls to
failing services
"""
def __init__(
self,
failure_threshold: int = 5,
timeout: int = 60,
expected_exception: type = Exception
):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.expected_exception = expected_exception
self.failure_count = 0
self.last_failure_time = None
self.state = CircuitState.CLOSED
def call(self, func: Callable, *args, **kwargs):
"""Execute function with circuit breaker protection"""
if self.state == CircuitState.OPEN:
# Check if timeout elapsed
if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
self.state = CircuitState.HALF_OPEN
print("? Circuit breaker: HALF_OPEN (testing recovery)")
else:
raise RuntimeError("Circuit breaker OPEN - service unavailable")
try:
result = func(*args, **kwargs)
# Success - reset if recovering
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.CLOSED
self.failure_count = 0
print("? Circuit breaker: CLOSED (service recovered)")
return result
except self.expected_exception as e:
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
print(f"? Circuit breaker: OPEN (threshold {self.failure_threshold} reached)")
raise
# Usage
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)
def generate_protected(prompt: str):
"""Generate with circuit breaker protection"""
return circuit_breaker.call(llm.generate, prompt)
# If llm.generate fails 5 times, circuit breaker opens
# Requests fail fast for 60 seconds
# Then one test request (half-open)
# If successful, normal operation resumes
This prevents:
Thundering herd problem
Resource exhaustion
Long timeouts on every request
Pattern 3: Rate Limiting
Protect your system from overload:
import time
class RateLimiter:
"""
Token bucket rate limiter
Limits requests per second to prevent overload
"""
def __init__(self, max_requests: int, time_window: float = 1.0):
self.max_requests = max_requests
self.time_window = time_window
self.tokens = max_requests
self.last_update = time.time()
def acquire(self, tokens: int = 1) -> bool:
"""Try to acquire tokens, return True if allowed"""
now = time.time()
elapsed = now - self.last_update
# Refill tokens based on elapsed time
self.tokens = min(
self.max_requests,
self.tokens + (elapsed / self.time_window) * self.max_requests
)
self.last_update = now
if self.tokens >= tokens:
self.tokens -= tokens
return True
else:
return False
def wait_for_token(self, tokens: int = 1):
"""Wait until token is available"""
while not self.acquire(tokens):
time.sleep(0.1)
# Usage
rate_limiter = RateLimiter(max_requests=100, time_window=1.0)
@app.post("/generate")
async def generate(request: GenerateRequest):
# Check rate limit
if not rate_limiter.acquire():
raise HTTPException(
status_code=429,
detail="Rate limit exceeded (100 req/sec)"
)
# Process request
result = llm.generate(request.prompt)
return result
Why this matters:
Prevents DoS (accidental or malicious)
Protects GPU from overload
Ensures fair usage
Pattern 4: Fallback Strategies
When primary fails, don’t just error—degrade gracefully:
def generate_with_fallback(prompt: str) -> str:
"""
Try multiple strategies before failing
Strategy 1: Primary model (Llama-3-8B)
Strategy 2: Cached response (if available)
Strategy 3: Simpler model (Phi-2)
Strategy 4: Template response
"""
# Try primary model
try:
return llm_primary.generate(prompt)
except Exception as e:
print(f"?? Primary model failed: {e}")
# Fallback 1: Check cache
cached_response = cache.get(prompt)
if cached_response:
print("? Returning cached response")
return cached_response
# Fallback 2: Try simpler model
try:
print("? Falling back to Phi-2")
return llm_simple.generate(prompt)
except Exception as e2:
print(f"?? Fallback model also failed: {e2}")
# Fallback 3: Template response
return (
"I apologize, but I'm unable to process your request right now. "
"Please try again in a few minutes, or contact support if the issue persists."
)
# User never sees "Internal Server Error"
# They always get SOME response
Graceful degradation examples:
Can’t generate full analysis? Return summary
Can’t use complex model? Use simple model
Can’t generate? Return cached response
Everything failing? Return polite error message
Pattern 5: Timeout Handling
Don’t let requests hang forever:
import signal
class TimeoutError(Exception):
pass
def timeout_handler(signum, frame):
raise TimeoutError("Request timed out")
def generate_with_timeout(prompt: str, timeout_seconds: int = 30):
"""Generate with timeout"""
# Set timeout
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(timeout_seconds)
try:
result = llm.generate(prompt)
# Cancel timeout
signal.alarm(0)
return result
except TimeoutError:
print(f"? Request timed out after {timeout_seconds}s")
return "Request timed out. Please try a shorter prompt."
# Or using asyncio
import asyncio
async def generate_with_timeout_async(prompt: str, timeout_seconds: int = 30):
"""Generate with async timeout"""
try:
result = await asyncio.wait_for(
llm.generate_async(prompt),
timeout=timeout_seconds
)
return result
except asyncio.TimeoutError:
return "Request timed out. Please try a shorter prompt."
Why timeouts matter:
Prevent resource leaks
Free up GPU for other requests
Give users fast feedback
Combined Example
Here’s how I combine all patterns:
from fastapi import FastAPI, HTTPException
from circuitbreaker import CircuitBreaker, CircuitBreakerError
app = FastAPI()
# Initialize components
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout=60)
rate_limiter = RateLimiter(max_requests=100, time_window=1.0)
cache = ResponseCache(ttl=3600)
@app.post("/generate")
@retry_with_backoff(max_retries=3)
async def generate(request: GenerateRequest):
"""
Generate with full error handling:
- Rate limiting
- Circuit breaker
- Retry with backoff
- Timeout
- Fallback strategies
- Caching
"""
# Rate limiting
if not rate_limiter.acquire():
raise HTTPException(status_code=429, detail="Rate limit exceeded")
# Check cache first
cached = cache.get(request.prompt)
if cached:
return {"text": cached, "cached": True}
try:
# Circuit breaker protection
result = circuit_breaker.call(
generate_with_timeout,
request.prompt,
timeout_seconds=30
)
# Cache successful response
cache.set(request.prompt, result)
return {"text": result, "status": "success"}
except CircuitBreakerError:
# Circuit breaker open - return fallback
return {
"text": "Service temporarily unavailable. Using cached response.",
"status": "degraded",
"fallback": True
}
except TimeoutError:
raise HTTPException(status_code=504, detail="Request timed out")
except Exception as e:
# Log error
logger.error(f"Generation failed: {e}")
# Return graceful error
return {
"text": "I apologize, but I'm unable to process your request.",
"status": "error",
"fallback": True
}
What this provides:
? Prevents overload (rate limiting)
? Fast failure (circuit breaker)
? Automatic recovery (retry)
? Resource protection (timeout)
? Graceful degradation (fallback)
? Performance (caching)
Deployment Recommendations
While my testing remained at POC level, these patterns prepare for production deployment:
Before deploying:
Load Testing
Test with expected peak load (10-100x normal traffic)
Measure P95 latency under load (<500ms target)
Verify error rate stays <1%
Confirm GPU memory stable (no leaks)
Production Deployment Checklist
Before going live, verify:
Infrastructure:
[ ] GPU drivers installed and working (nvidia-smi)
[ ] Alert destinations set (PagerDuty, Slack, email)
[ ] Langfuse set up (if using LLM observability)
Testing:
[ ] Health check returns 200 OK
[ ] Can generate completions via API
[ ] Metrics endpoint returning data
[ ] Error handling works (try invalid input)
[ ] Budget limits enforced (if configured)
[ ] Load test passed (see next section)
Security:
[ ] API authentication enabled
[ ] Rate limiting configured
[ ] HTTPS enforced (no HTTP)
[ ] CORS policies set
[ ] Input validation in place
[ ] Secrets not in git (use env variables)
Operations:
[ ] Backup strategy for logs
[ ] Model cache backed up
[ ] Runbook written (how to handle incidents)
[ ] On-call rotation defined
[ ] SLAs documented
[ ] Disaster recovery plan
Real-World Results
Testing on GCP L4 GPUs with 11 queries produced these validated results:
End-to-End Integration Test Results
Test configuration:
Model: Phi-2 (2.7B parameters)
Quantization: None (FP16 baseline)
Prefix caching: Enabled
Budget: $10/day
Hardware: GCP L4 GPU
Results:
Metric
Value
Total Requests
11
Success Rate
100% (11/11) ?
Total Tokens Generated
2,200
Total Cost
$0.000100
Average Latency
5,418ms
Cache Hit Rate
90.9% ?
Budget Utilization
0.001%
Model distribution:
Phi-2: 54.5% (6 requests)
Llama-3-8B: 27.3% (3 requests)
Mistral-7B: 18.2% (2 requests)
What this proves: ? Intelligent routing works (3 models selected correctly) ? Budget enforcement works (under budget, no overruns) ? Prefix caching works (91% hit rate = huge savings) ? Multi-model support works (distributed correctly) ? Observability works (all metrics collected)
Cost Comparison
Let me show you the exact cost calculations:
Per-request costs (from actual test):
Request 1 (uncached): $0.00002038
Requests 2-11 (cached): $0.00000414 average
Total: $0.00010031 for 11 requests
Average: $0.0000091 per request
Extrapolated monthly costs (10,000 requests/day):
Configuration
Daily Cost
Monthly Cost
Savings
Without caching
$0.91
$27.30
Baseline
With caching (91% hit rate)
$0.18
$5.46
80%
With quantization (AWQ)
$0.09
$2.73
90%
All optimizations
$0.09
$2.73
90%
Add in infrastructure costs:
GCP L4 GPU: $0.45/hour = $328/month
Total monthly cost:
- Infrastructure: $328
- API costs: $2.73
- Total: $330.73/month for 10,000 requests/day
Compare to OpenAI:
OpenAI GPT-4:
- Input: $0.03 per 1K tokens
- Output: $0.06 per 1K tokens
- Average request: 100 tokens in + 100 tokens out = $0.009
- 10,000 requests/day = $90/day = $2,700/month
Savings: $2,369/month (88% cheaper!)
After building and testing this platform, I understand why enterprise AI differs from giving developers ChatGPT access and why 95% of initiatives fail. Here is why these layers matter:
Cost tracking isn’t about being cheap—it’s about accountability. Finance won’t approve next year’s AI budget without ROI proof.
Intelligent routing prevents the death spiral: early excitement ? everyone uses the expensive model ? costs spiral ? finance pulls the plug ? initiative dies.
Observability builds trust. When executives ask “Is AI working?”, you need data: success rates, cost per department, quality trends. Without metrics, you get politics and cancellation.
Error handling and budgets are professional table stakes. Enterprises can’t have systems that randomly fail or spend unpredictably.
Here are things missing from the prototype:
Security: No SSO, PII detection, audit logs for compliance, encryption at rest, security review
High Availability: Single instance, no load balancer, no failover, no disaster recovery
Operations: No CI/CD, secrets management, log aggregation, incident playbooks
Scale: No auto-scaling, multi-region, or load testing beyond 100 concurrent
Governance: No approval workflows, per-user limits, content filtering, A/B testing
I have learned that vLLM works, open models are competitive, the tooling is mature. This POC proves that the patterns work and the savings are real. The 5% that succeed treat AI as infrastructure requiring proper tooling. The 95% that fail treat it as magic requiring only faith.
Try it yourself: All code at github.com/bhatti/vllm-tutorial. Clone it, test it, prove it works in your environment. Then build the business case for production investment.
Over the last year, I have been applying Agentic AI to various problems at work and to improve personal productivity. For example, every morning, I faced the same challenge: information overload.
My typical morning looked like this:
? Check emails and sort out what’s important
? Check my calendar and figure out which ones are critical
? Skim HackerNews, TechCrunch, newsletters for any important insight
? Check Slack for any critical updates
?? Look up weather ? Should I bring an umbrella or jacket?
? Already lost 45 minutes just gathering information!
I needed an AI assistant that could digest all this information while I shower, then present me with a personalized 3-minute brief highlighting what actually matters. Also, following were key constraints for this assistant:
? Complete privacy – My emails and calendar shouldn’t leave my laptop and I didn’t want to run any MCP servers in cloud that could expose my private credentials
? Zero ongoing costs – Running complex Agentic workflow on the hosted environments could easily cost me hundreds of dollars a month
? Fast iteration – Test changes instantly during development
? Flexible deployment – Start local, deploy to cloud when ready
I will walk through my journey of building Daily Minutes with Claude Code – a fully functional agentic AI system that runs on my laptop using local LLMs, saves me 30 minutes every morning.
Agentic Building Blocks
I applied following building blocks to create this system:
MCP (Model Context Protocol) – connecting to data sources discoverable by AI
RAG (Retrieval-Augmented Generation) – give AI long-term memory
ReAct Pattern – teach AI to reason before acting
RLHF (Reinforcement Learning from Human Feedback) – teach AI from my preferences
Let me walk you through how I built each piece, the problems I encountered, and how I solved them.
High-level Architecture
After several iterations, I landed on a clean 3-layer architecture:
Why this architecture worked for me:
Layer 1 (Data Sources) – I used MCP to make connectors pluggable. When I later wanted to add RSS feeds, I just registered a new tool – no changes to the AI logic.
Layer 2 (Intelligence) – This is where the magic happens. The ReAct agent reasons about what data it needs, LangGraph orchestrates fetching from multiple sources in parallel, RAG provides historical context, and RLHF learns from my feedback.
Layer 3 (UI) – I kept the UI simple and fast. It reads from a database cache, so it loads instantly – no waiting for AI to process.
How the Database Cache Works
This is a key architectural decision that made the UI lightning-fast:
# src/services/startup_service.py
async def preload_daily_data():
"""Background job that generates brief and caches in database."""
# 1. Fetch all data in parallel (LangGraph orchestration)
data = await langgraph_orchestrator.fetch_all_sources()
# 2. Generate AI brief (ReAct agent with RAG)
brief = await brief_generator.generate(
emails=data['emails'],
calendar=data['calendar'],
news=data['news'],
weather=data['weather']
)
# 3. Cache everything in SQLite
await db.set_cache('daily_brief_data', brief.to_dict(), ttl=3600) # 1 hour TTL
await db.set_cache('news_data', data['news'], ttl=3600)
await db.set_cache('emails_data', data['emails'], ttl=3600)
logger.info("? All data preloaded and cached")
# src/ui/components/daily_brief.py
def render_daily_brief_section():
"""UI just reads from cache - no AI processing!"""
# Fast read from database (milliseconds, not seconds)
if 'data' in st.session_state and st.session_state.data.get('daily_brief'):
brief_data = st.session_state.data['daily_brief']
_display_persisted_brief(brief_data) # Instant!
else:
st.info("Run `make preload` to generate your first brief.")
Why this architecture rocks:
? UI loads in <500ms (reading from SQLite cache)
? Background refresh (run make preload or schedule with cron)
? Persistent (brief survives app restarts)
? Testable (can test UI without LLM calls)
Part 1: Setting Up My Local AI Stack
First, I needed to get Ollama running locally. This took me about 30 minutes.
Installing Ollama
# On macOS (what I use)
brew install ollama
# Start the service
ollama serve
# Pull the models I chose
ollama pull qwen2.5:7b # Main LLM - fast on my M3 Mac
ollama pull nomic-embed-text # For RAG embeddings
Why I chose Qwen 2.5 (7B):
? Runs fast on my M3 MacBook Pro (no GPU needed)
? Good reasoning capabilities for summarization
? Small enough to iterate quickly (responses in 2-3 seconds)
? Free and private – data never leaves my laptop
Later, I can swap to GPT-4 or Claude with just a config change when I deploy to production.
Testing My Setup
I wanted to make sure Ollama was working before going further:
# Quick test
PYTHONPATH=. python -c "
import asyncio
from src.services.ollama_service import get_ollama_service
async def test():
ollama = get_ollama_service()
result = await ollama.generate('Explain RAG in one sentence.')
print(result)
asyncio.run(test())
"
# Output I got:
# RAG (Retrieval-Augmented Generation) enhances LLM responses by retrieving
# relevant information from a knowledge base before generating answers.
? First milestone: Local AI working!
Part 2: Building MCP Connectors
Instead of hard coding data fetching like this:
# ? My first attempt (brittle)
async def get_daily_data():
news = await fetch_hackernews()
weather = await fetch_weather()
# Later I wanted to add RSS feeds... had to modify this function
# Then I wanted Slack... modified again
# This was getting messy fast!
I decided to use MCP (Model Context Protocol) to register data sources as “tools” so that the AI can discover and call by name:
Building News Connector
I started with HackerNews since I check it every morning:
# src/connectors/hackernews.py
class HackerNewsConnector:
"""Fetches top stories from HackerNews API."""
async def execute_async(self, max_stories: int = 10):
"""The main method MCP will call."""
# 1. Fetch top story IDs
response = await self.client.get(
"https://hacker-news.firebaseio.com/v0/topstories.json"
)
story_ids = response.json()[:max_stories]
# 2. Fetch each story (I fetch these in parallel for speed)
articles = []
for story_id in story_ids:
story = await self._fetch_story(story_id)
articles.append(self._convert_to_article(story))
return articles
Key learning: Keep connectors simple. They should do ONE thing: fetch data and return it in a standard format.
Registering with MCP Server
Then I registered this connector with my MCP server:
# src/services/mcp_server.py
class MCPServer:
"""The tool registry that AI agents query."""
def _register_tools(self):
# Register HackerNews
self.tools["fetch_hackernews"] = MCPTool(
name="fetch_hackernews",
description="Fetch top tech stories from HackerNews with scores and comments",
parameters={
"max_stories": {
"type": "integer",
"description": "How many stories to fetch (1-30)",
"default": 10
}
},
executor=HackerNewsConnector()
)
This allows my AI to discover this tool and call it without me writing any special integration code!
Testing MCP Discovery
# I tested if the AI could discover my tools
PYTHONPATH=. python -c "
from src.services.mcp_server import get_mcp_server
mcp = get_mcp_server()
print('Available tools:')
for tool in mcp.list_tools():
print(f' ? {tool[\"name\"]}: {tool[\"description\"]}')
"
# Output I got:
# Available tools:
# ? fetch_hackernews: Fetch top tech stories from HackerNews...
# ? get_current_weather: Get current weather conditions...
# ? fetch_rss_feeds: Fetch articles from configured RSS feeds...
Later, when I wanted to add RSS feeds, I just created a new connector and registered it. The AI automatically discovered it – no changes needed to my ReAct agent or LangGraph workflows!
Part 3: Building RAG Pipeline
As LLM have limited context window, RAG (Retrieval-Augmented Generation) can be used to create an AI semantic memory by:
Converting text to vectors (embeddings)
Storing vectors in a database (ChromaDB)
Searching by meaning, not just keywords
Building RAG Service
I then implemented RAG service as follows:
# src/services/rag_service.py
class RAGService:
"""Semantic memory using ChromaDB."""
def __init__(self):
# Initialize ChromaDB (stores on disk)
self.client = chromadb.Client(Settings(
persist_directory="./data/chroma_data"
))
# Create collection for my articles
self.collection = self.client.get_or_create_collection(
name="daily_minutes"
)
# Ollama for creating embeddings
self.ollama = get_ollama_service()
async def add_document(self, content: str, metadata: dict):
"""Store a document with its vector embedding."""
# 1. Convert text to vector (this is the magic!)
embedding = await self.ollama.create_embeddings(content)
# 2. Store in ChromaDB with metadata
self.collection.add(
documents=[content],
embeddings=[embedding],
metadatas=[metadata],
ids=[hashlib.md5(content.encode()).hexdigest()]
)
async def search(self, query: str, max_results: int = 5):
"""Semantic search - find by meaning!"""
# 1. Convert query to vector
query_embedding = await self.ollama.create_embeddings(query)
# 2. Find similar documents (cosine similarity)
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=max_results
)
return results
I then tested it:
# I stored an article about EU AI regulations
await rag.add_document(
content="European Union announces comprehensive AI safety regulations "
"focusing on transparency, accountability, and privacy protection.",
metadata={"type": "article", "topic": "ai_safety"}
)
# Later, I searched using different words
results = await rag.search("privacy rules for artificial intelligence")
This shows that RAG isn’t just storing text – it understands meaning through vector mathematics.
What I Store in RAG
Over time, I started storing other data like emails, todos, events, etc:
# 1. News articles (for historical context)
await rag.add_article(article)
# 2. Action items from emails
await rag.add_todo(
"Complete security training by Nov 15",
source="email",
priority="high"
)
# 3. Meeting context
await rag.add_document(
"Q4 Planning Meeting - need to prepare budget estimates",
metadata={"type": "meeting", "date": "2025-02-01"}
)
# 4. User preferences (this feeds into RLHF later!)
await rag.add_document(
"User marked 'AI safety' topics as important",
metadata={"type": "preference", "category": "ai_safety"}
)
With this AI memory, it can answer questions like:
“What do I need to prepare for tomorrow’s meeting?”
“What AI safety articles did I read this week?”
“What are my pending action items?”
Part 4: Building the ReAct Agent
In my early prototyping, the implementation just executed blindly:
This wasted time fetching data I didn’t need. I wanted my AI to reason first, then act so I applied ReAct (Reasoning + Acting), which works in a loop:
THOUGHT: AI reasons about what to do next
ACTION: AI executes a tool/function
OBSERVATION: AI observes the result
Repeat until goal achieved
Implementing My ReAct Agent
Here is how it ReAct agent was built:
# src/agents/react_agent.py
class ReActAgent:
"""Agent that thinks before acting."""
async def run(self, goal: str):
"""Execute goal using ReAct loop."""
steps = []
observations = []
for step_num in range(1, self.max_steps + 1):
# 1. THOUGHT: Ask AI what to do next
thought = await self._generate_thought(goal, steps, observations)
# Check if we're done
if "FINAL ANSWER" in thought:
return self._extract_answer(thought)
# 2. ACTION: Parse what action to take
action = self._parse_action(thought)
# Example: {"action": "call_tool", "tool": "fetch_hackernews"}
# 3. EXECUTE: Run the action via MCP
observation = await self._execute_action(action)
observations.append(observation)
# Record this step for debugging
steps.append({
"thought": thought,
"action": action,
"observation": observation
})
return {"steps": steps, "answer": "Max steps reached"}
The hardest part was writing the prompts that made the AI reason properly:
async def _generate_thought(self, goal, steps, observations):
"""Generate next reasoning step."""
prompt = f"""Goal: {goal}
Previous steps:
{self._format_steps(steps)}
Available actions:
- query_rag(query): Search my semantic memory
- call_tool(name, params): Execute an MCP tool
- FINAL ANSWER: When you have everything needed
Think step-by-step. What should I do next?
Format your response as:
THOUGHT: <your reasoning>
ACTION: <action to take>
"""
return await self.ollama.generate(prompt, temperature=0.7)
I added debug logging to see the AI’s reasoning:
? Goal: Generate my daily brief
Step 1:
? THOUGHT: I need to gather news, check weather, and see user preferences
? ACTION: call_tool("fetch_hackernews", max_stories=10)
?? OBSERVATION: Fetched 10 articles about AI, privacy, and tech
Step 2:
? THOUGHT: Got news. User preferences would help prioritize.
? ACTION: query_rag("user interests and preferences")
?? OBSERVATION: User cares about AI safety, security, privacy
Step 3:
? THOUGHT: Should filter articles to user's interests
? ACTION: call_tool("get_current_weather", location="Seattle")
?? OBSERVATION: 70°F, Partly cloudy
Step 4:
? THOUGHT: I have news (filtered by user interests), weather. Ready to generate.
? ACTION: FINAL ANSWER
? Generated personalized brief highlighting AI safety articles
# src/models/email.py
class ImportanceScoringMixin:
"""Learn from user feedback."""
importance_score: float = 0.5 # AI's base score
boost_labels: Set[str] = set() # Words user marked important
filter_labels: Set[str] = set() # Words user wants to skip
def apply_rlhf_boost(self, content_text: str) -> float:
"""Adjust score based on learned preferences."""
adjusted = self.importance_score
content_lower = content_text.lower()
# Boost if content matches important keywords
for label in self.boost_labels:
if label.lower() in content_lower:
adjusted += 0.1 # Bump up priority!
# Penalize if content matches skip keywords
for label in self.filter_labels:
if label.lower() in content_lower:
adjusted -= 0.2 # Push down priority!
# Keep in valid range [0, 1]
return max(0.0, min(1.0, adjusted))
Note: Code examples are simplified for clarity. See GitHub for the full production implementation.
Adding Feedback UI
In my Streamlit dashboard, I added ?/? buttons:
# User sees an email
for email in emails:
col1, col2, col3 = st.columns([8, 1, 1])
with col1:
st.write(f"**{email.subject}**")
st.info(email.snippet)
with col2:
if st.button("?", key=f"important_{email.id}"):
# Extract what made this important
keywords = await extract_keywords(email.subject + email.body)
# Add to boost labels
user_profile.boost_labels.update(keywords)
st.success(f"? Learned: You care about {', '.join(keywords)}")
with col3:
if st.button("?", key=f"skip_{email.id}"):
# Learn to deprioritize these
keywords = await extract_keywords(email.subject)
user_profile.filter_labels.update(keywords)
st.success(f"? Will deprioritize: {', '.join(keywords)}")
Part 6: Orchestrating with LangGraph
Instead of fetching contents from all data sources sequential for the daily minutes:
Note: WorkflowState is a shared dictionary that nodes pass data through – like a clipboard for the workflow. The analyze node parses the user’s request and decides which data sources are needed.
Implementing Node Functions
Each node is just an async function:
async def _fetch_news(self, state: WorkflowState):
"""Fetch news in parallel."""
try:
articles = await self.mcp.execute_tool(
"fetch_hackernews",
{"max_stories": 10}
)
state["news_articles"] = articles
except Exception as e:
state["errors"].append(f"News fetch failed: {e}")
state["news_articles"] = []
return state
async def _search_context(self, state: WorkflowState):
"""Search RAG for relevant context."""
query = state["user_request"]
results = await self.rag.search(query, max_results=5)
# Build context string
context = "\n".join([r['content'] for r in results])
state["context"] = context
return state
Running the Workflow
# Execute the complete workflow
result = await orchestrator.run("Generate my daily brief")
# I get back:
{
"news_articles": [...], # 10 articles
"emails": [...], # 5 unread
"calendar_events": [...], # 3 events today
"context": "...", # RAG context
"summary": "...", # Generated brief
"processing_time": 5.2 # Seconds (not 11!)
}
The LLM Factory Pattern – How I Made It Cloud-Ready
Following code snippet shows how does the system seamlessly switch between local Ollama and cloud providers:
# src/services/llm_factory.py
def get_llm_service():
"""Factory pattern - works with any LLM provider."""
provider = os.getenv("LLM_PROVIDER", "ollama")
if provider == "ollama":
return OllamaService(
base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
model=os.getenv("OLLAMA_MODEL", "qwen2.5:7b")
)
elif provider == "openai":
return OpenAIService(
api_key=os.getenv("OPENAI_API_KEY"),
model=os.getenv("OPENAI_MODEL", "gpt-4-turbo")
)
elif provider == "google":
# Like in my previous Vertex AI article!
return VertexAIService(
project_id=os.getenv("GCP_PROJECT_ID"),
model="gemini-1.5-flash"
)
raise ValueError(f"Unknown provider: {provider}")
# All services implement the same interface:
class BaseLLMService:
async def generate(self, prompt: str, **kwargs) -> str:
"""Generate text from prompt."""
raise NotImplementedError
async def create_embeddings(self, text: str) -> List[float]:
"""Create vector embeddings."""
raise NotImplementedError
The ReAct agent, RAG service, and Brief Generator all use get_llm_service() – they don’t care which provider is running!
Part 7: The Challenges I Faced
Building this system wasn’t smooth. Here are the biggest challenges:
Challenge 1: LLM Generating Vague Summaries
Problem: My early briefs were terrible:
? "Today's news features a mix of technology updates and various topics."
This was useless! I needed specifics.
Solution: I rewrote my prompts with explicit rules:
# ? Better prompt with strict rules
prompt = f"""Generate a daily brief following these STRICT rules:
PRIORITY ORDER (most important first):
1. Urgent emails or action items
2. Today's calendar events
3. Market/business news
4. Tech news
TLDR FORMAT (exactly 3 bullets, be SPECIFIC):
* Bullet 1: Most urgent email/action (include WHO, WHAT, WHEN)
Example: "Client escalation from Acme Corp affecting 50K users - response needed by 2pm"
* Bullet 2: Most important calendar event today (include TIME and WHAT TO PREPARE)
Example: "2pm: Board meeting - prepare Q4 revenue slides"
* Bullet 3: Top market/business news (include NUMBERS/SPECIFICS)
Example: "Federal Reserve raises rates 0.5% to 5.25% - affects tech hiring"
AVOID THESE PHRASES (they're too vague):
? "mix of updates"
? "various topics"
? "continues to make progress"
? "interesting developments"
USE SPECIFIC DETAILS:
? Names (people, companies)
? Numbers (percentages, dollar amounts, deadlines)
? Times (when something happened or needs to happen)
Content to summarize:
{content}
Generate: TLDR (3 bullets), Summary (5-6 detailed sentences), Key Insights (5 bullets)
"""
Result: Went from vague ? specific, actionable briefs!
Solution: Split and render each bullet separately:
# ? Doesn't work
st.info(tldr)
# ? Works!
tldr_lines = [line.strip() for line in tldr.split('\n') if line.strip()]
for bullet in tldr_lines:
st.markdown(bullet)
Challenge 3: AI Prioritizing News Over Personal Tasks
Problem: My brief focused on tech news, ignored my urgent emails:
Solution: I restructured my prompt to explicitly label priority:
# src/services/brief_scheduler.py
async def _generate_daily_brief(emails, calendar, news, weather):
"""Generate prioritized daily brief with structured prompt."""
# Separate market vs tech news (market is higher priority)
market_news = [n for n in news if 'market' in n.tags]
tech_news = [n for n in news if 'market' not in n.tags]
# Sort emails by RLHF-boosted importance score
important_emails = sorted(
emails,
key=lambda e: e.apply_rlhf_boost(e.subject + e.snippet),
reverse=True
)[:5] # Top 5 only
# Build structured prompt with clear priority
prompt = f"""
**SECTION 1: IMPORTANT EMAILS (HIGHEST PRIORITY - use for TLDR bullet #1)**
{format_emails(important_emails)}
**SECTION 2: TODAY'S CALENDAR (SECOND PRIORITY - use for TLDR bullet #2)**
{format_calendar(calendar)}
**SECTION 3: MARKET NEWS (THIRD PRIORITY - use for TLDR bullet #3)**
{format_market_news(market_news)}
**SECTION 4: TECH NEWS (LOWEST PRIORITY - summarize briefly)**
{format_tech_news(tech_news)}
**SECTION 5: WEATHER**
{format_weather(weather)}
Generate a daily brief following this EXACT priority order:
1. Email action items FIRST
2. Calendar events SECOND
3. Market/business news THIRD
4. Tech news LAST (brief mention only)
TLDR must have EXACTLY 3 bullets using content from sections 1, 2, 3 (not section 4).
"""
return await llm.generate(prompt)
Result: My urgent email moved to bullet #1 where it belongs! The AI now respects the priority structure.
Challenge 4: RAG Returning Irrelevant Results
Problem: Semantic search sometimes returned weird matches:
Query: "AI safety regulations"
Result: Article about "safe AI models for healthcare" (wrong context!)
Solution: I added metadata filtering and better embeddings:
? Fast: No network latency, responses in 2-3 seconds
? Private: My emails never touch the internet
? Offline: Works on planes, cafes without WiFi
Trade-offs I accept:
?? Slower than GPT-4
?? Less capable reasoning (7B vs 175B+ parameters)
?? Manual updates (pull new Ollama models myself)
Production
# .env.production
LLM_PROVIDER=openai # Just change this line!
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4-turbo
DATABASE_URL=postgresql://... # Scalable DB
REDIS_URL=redis://prod-cluster:6379 # Distributed cache
The magic: Same code, different LLM!
# src/services/llm_factory.py
def get_llm_service():
"""Factory pattern - works with any LLM."""
provider = os.getenv("LLM_PROVIDER", "ollama")
if provider == "ollama":
return OllamaService()
elif provider == "openai":
return OpenAIService()
elif provider == "anthropic":
return ClaudeService()
elif provider == "google":
return VertexAIService() # Like in my previous article!
raise ValueError(f"Unknown provider: {provider}")
Part 11: Testing Everything
I used TDD extensively to build each feature so that it’s easy to debug if something is not working:
Unit Tests
# Test MCP tool registration
pytest tests/unit/test_mcp_server.py -v
# Test RAG semantic search
pytest tests/unit/test_rag_service.py -v
# Test ReAct reasoning
pytest tests/unit/test_react_agent.py -v
# Test RLHF scoring
pytest tests/unit/test_rlhf_scoring.py -v
# Run all unit tests
pytest tests/unit/ -v
# 516 passed in 45.23s ?
Integration Tests
Also, in some cases unit tests couldn’t fully validate so I wrote integration tests to test persistence logic with sqlite database or generating real analysis from news:
# tests/integration/test_brief_quality.py
async def test_tldr_has_three_bullets():
"""TLDR must have exactly 3 bullets."""
brief = await db.get_cache('daily_brief_data')
tldr = brief.get('tldr', '')
bullets = [line for line in tldr.split('\n') if line.strip().startswith('•')]
assert len(bullets) == 3, f"Expected 3 bullets, got {len(bullets)}"
assert "email" in bullets[0].lower() or "urgent" in bullets[0].lower()
assert "calendar" in bullets[1].lower() or "meeting" in bullets[1].lower()
async def test_no_generic_phrases():
"""Brief should not contain vague phrases."""
brief = await db.get_cache('daily_brief_data')
summary = brief.get('summary', '')
bad_phrases = ["mix of updates", "various topics", "continues to"]
for phrase in bad_phrases:
assert phrase not in summary.lower(), f"Found generic phrase: {phrase}"
Manual Testing (My Daily Workflow)
# 1. Fetch data and generate brief
make preload
# Output I see:
# ? Fetching news from HackerNews... (10 articles)
# ? Fetching weather... (70°F, Sunny)
# ? Analyzing articles with AI... (15 articles)
# ? Generating daily brief... (Done in 18.3s)
# ? Brief saved to database
# 2. Launch UI
streamlit run src/ui/streamlit_app.py
# 3. Check brief quality
# - Is TLDR specific? (not vague)
# - Are priorities correct? (email > calendar > news)
# - Are action items extracted? (from emails)
# - Did RLHF work? (boosted my preferences)
Note: You can schedule preload via cron, e.g., I run it at 6am daily so that brief is ready when I wake up.
Conclusion
Building this Daily Minutes assistant changed how I start my day by giving me a personalized 3-minute brief highlighting what truly matters. Agentic AI excels at automating complex workflows that require judgment, not just execution. The ReAct agent reasons through prioritization. RAG provides contextual memory across weeks of interactions. RLHF learns from my feedback, getting smarter about what I care about. LangGraph orchestrates parallel execution across multiple data sources. These building blocks work together to handle decisions that traditionally needed human attention.
I’m sharing this as a proof of concept, not a finished product. The code works, saves me real time, and demonstrates these techniques effectively. But I’m still iterating. The OAuth integration and error handling needs improvements. The RLHF scoring could be more sophisticated. The ReAct agent sometimes overthinks simple tasks. I’m adding these improvements gradually, testing each change against my daily routine.
The real lesson? Start small, validate with real use, then scale with confidence. I used Claude Code to build this in spare time over a couple weeks. You can do the same—clone the repo, adapt it to your workflow, and see where agentic AI saves you time.
Try It Yourself
# Clone my repo
git clone https://github.com/bhatti/daily-minutes
cd daily-minutes
# Install dependencies
pip install -r requirements.txt
# Setup Ollama
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
# Generate your first brief
make preload
# Launch dashboard
streamlit run src/ui/streamlit_app.py
I spent over a decade in FinTech building the systems traders rely on every day like high-performance APIs streaming real-time charts, technical indicator calculators processing millions of data points per second, and comprehensive analytical platforms ingesting SEC 10-Ks and 10-Qs into distributed databases. We used to parse XBRL filings, ran news/sentiment analysis on earnings calls using early NLP models to detect market anomalies.
Over the past couple of years, I’ve been building AI agents and creating automated workflows that tackle complex problems using agentic AI. I’m also revisiting challenges I hit while building trading tools for fintech companies. For example, the AI I’m working with now reasons about which analysis to run. It grasps context, retrieves information on demand, and orchestrates complex workflows autonomously. It applies Black-Scholes when needed, switches to technical analysis when appropriate, and synthesizes insights from multiple sources—no explicit rules required.
The best part is that I’m running this entire system on my laptop using Ollama and open-source models. Zero API costs during development. When I need production scale, I can switch to cloud APIs with a few lines of code. I will walk you through this journey of rebuilding financial analysis with agentic AI – from traditional algorithms to thinking machines and from rigid pipelines to adaptive workflows.
Why This Approach Changes Everything
Traditional financial systems process data. Agentic AI systems understand objectives and figure out how to achieve them. That’s the fundamental difference that took me a while to fully grasp. And unlike my old systems that required separate codebases for each type of analysis, this one uses the same underlying patterns for everything.
The Money-Saving Secret: Local Development with Ollama
Here’s something that would have saved my startup thousands: you can build and test sophisticated AI systems entirely locally using Ollama. No API keys, no usage limits, no surprise bills.
# This runs entirely on your machine - zero external API calls
from langchain_ollama import OllamaLLM as Ollama
# Local LLM for development and testing
dev_llm = Ollama(
model="llama3.2:latest", # 3.2GB model that runs on most laptops
temperature=0.7,
base_url="http://localhost:11434" # Your local Ollama instance
)
# When ready for production, switch to cloud providers
from langchain_openai import ChatOpenAI
prod_llm = ChatOpenAI(
model="gpt-4",
temperature=0.7
)
# The beautiful part? Same interface, same code
def analyze_stock(llm, ticker):
# This function works with both local and cloud LLMs
prompt = f"Analyze {ticker} stock fundamentals"
return llm.invoke(prompt)
During development, I run hundreds of experiments daily without spending a cent. Once the prompts and workflows are refined, switching to cloud APIs is literally changing one line of code.
Understanding ReAct: How AI Learns to Think Step-by-Step
ReAct (Reasoning and Acting) was the first pattern that made me realize we weren’t just building chatbots anymore. Let me show you exactly how it works with real code from my system.
The Human Thought Process We’re Mimicking
When I manually analyzed stocks, my mental process looked something like this:
“I need to check if Apple is overvalued”
“Let me get the current P/E ratio”
“Hmm, 28.5 seems high, but what’s the industry average?”
“Tech sector average is 25, so Apple is slightly premium”
“But wait, what’s their growth rate?”
“15% annual growth… that PEG ratio of 1.9 suggests fair value”
“Let me check recent news for any red flags…”
ReAct agents follow this exact pattern. Here’s the actual implementation:
class ReActAgent:
"""ReAct Agent that demonstrates reasoning traces"""
# This is the actual prompt from the project
REACT_PROMPT = """You are a financial analysis agent that uses the ReAct framework to solve problems.
You have access to the following tools:
{tools_description}
Use the following format EXACTLY:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, must be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin! Remember to ALWAYS follow the format exactly.
Question: {question}
Thought: {scratchpad}"""
def _parse_response(self, response: str) -> Tuple[str, str, str, bool]:
"""Parse LLM response to extract thought, action, and input"""
response = response.strip()
# Check for final answer
if "Final Answer:" in response:
parts = response.split("Final Answer:")
thought = parts[0].strip()
final_answer = parts[1].strip()
return thought, "final_answer", final_answer, True
# Parse using regex from actual implementation
thought_match = re.search(r"Thought:\s*(.+?)(?=Action:|$)", response, re.DOTALL)
action_match = re.search(r"Action:\s*(.+?)(?=Action Input:|$)", response, re.DOTALL)
input_match = re.search(r"Action Input:\s*(.+?)(?=Observation:|$)", response, re.DOTALL)
thought = thought_match.group(1).strip() if thought_match else "Thinking..."
action = action_match.group(1).strip() if action_match else "unknown"
action_input = input_match.group(1).strip() if input_match else ""
return thought, action, action_input, False
I can easily trace through reasoning to debug how AI reached its conclusion.
RAG: Solving the Hallucination Problem Once and For All
Early in my experiments, I had to deal with a bit of hallucinations when querying financial data with AI so I applied RAG (Retrieval-Augmented Generation) to give AI access to a searchable library of documents.
How RAG Actually Works
You can think of RAG like having a research assistant who, instead of relying on memory, always checks the source documents before answering:
class RAGEngine:
"""
This engine solved my hallucination problems by grounding
all responses in actual documents. It's like giving the AI
access to your company's document database.
"""
def __init__(self):
# Initialize embeddings - this converts text to searchable vectors
# Using Ollama's local embedding model (free!)
self.embeddings = OllamaEmbeddings(
model="nomic-embed-text:latest" # 274MB model, runs fast
)
# Text splitter - crucial for handling large documents
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512, # Small enough for context window
chunk_overlap=50, # Overlap prevents losing context at boundaries
separators=["\n\n", "\n", ". ", " "] # Smart splitting
)
# Vector store - where we keep our searchable documents
self.vector_store = FAISS.from_texts(["init"], self.embeddings)
def load_financial_documents(self, ticker: str):
"""
In production, this would load real 10-Ks, 10-Qs, earnings calls.
For now, I'm using sample documents to demonstrate the concept.
"""
# Imagine these are real SEC filings
documents = [
{
"content": f"""
{ticker} Q3 2024 Earnings Report
Revenue: $94.9 billion, up 6% year over year
iPhone revenue: $46.2 billion
Services revenue: $23.3 billion (all-time record)
Gross margin: 45.2%
Operating cash flow: $28.7 billion
CEO Tim Cook: "We're incredibly pleased with our record
September quarter results and strong momentum heading into
the holiday season."
""",
"metadata": {
"source": "10-Q Filing",
"date": "2024-10-31",
"document_type": "earnings_report",
"ticker": ticker
}
},
# ... more documents
]
# Process each document
for doc in documents:
# Split into chunks
chunks = self.text_splitter.split_text(doc["content"])
# Create document objects with metadata
for i, chunk in enumerate(chunks):
metadata = doc["metadata"].copy()
metadata["chunk_id"] = i
metadata["total_chunks"] = len(chunks)
# Add to vector store
self.vector_store.add_texts(
texts=[chunk],
metadatas=[metadata]
)
print(f"? Loaded {len(documents)} documents for {ticker}")
def answer_with_sources(self, question: str) -> Dict[str, Any]:
"""
This is where RAG shines - every answer comes with sources
"""
# Find relevant document chunks
relevant_docs = self.vector_store.similarity_search_with_score(
question,
k=5 # Top 5 most relevant chunks
)
# Build context from retrieved documents
context_parts = []
sources = []
for doc, score in relevant_docs:
# Only use highly relevant documents (score < 0.5)
if score < 0.5:
context_parts.append(doc.page_content)
sources.append({
"content": doc.page_content[:100] + "...",
"source": doc.metadata.get("source"),
"date": doc.metadata.get("date"),
"relevance_score": float(score)
})
context = "\n\n---\n\n".join(context_parts)
# Generate answer grounded in retrieved context
prompt = f"""Based on the following verified documents, answer the question.
If the answer is not in the documents, say "I don't have that information."
Documents:
{context}
Question: {question}
Answer (cite sources):"""
response = self.llm.invoke(prompt)
return {
"answer": response,
"sources": sources,
"confidence": len(sources) / 5 # Simple confidence metric
}
MCP-Style Tools: Extending AI Capabilities Beyond Text
Model Context Protocol (MCP) helped me to build a flexible tool system. Instead of hardcoding every capability, we give the AI tools it can discover and use:
class BaseTool(ABC):
"""
Every tool self-describes its capabilities.
This is like giving the AI an instruction manual for each tool.
"""
@abstractmethod
def get_schema(self) -> ToolSchema:
"""Define what this tool does and how to use it"""
pass
@abstractmethod
def execute(self, **kwargs) -> Any:
"""Actually run the tool"""
pass
class StockDataTool(BaseTool):
"""
Real example: This tool replaced my entire market data microservice
"""
def get_schema(self) -> ToolSchema:
return ToolSchema(
name="stock_data",
description="Fetch real-time stock market data including price, volume, and fundamentals",
category=ToolCategory.DATA_RETRIEVAL,
parameters=[
ToolParameter(
name="ticker",
type="string",
description="Stock symbol like AAPL or GOOGL",
required=True
),
ToolParameter(
name="metrics",
type="array",
description="Specific metrics to retrieve",
required=False,
default=["price", "volume", "pe_ratio"],
enum=["price", "volume", "pe_ratio", "market_cap",
"dividend_yield", "beta", "rsi", "moving_avg_50"]
)
],
returns="Dictionary containing requested stock metrics",
examples=[
{"ticker": "AAPL", "metrics": ["price", "pe_ratio"]},
{"ticker": "TSLA", "metrics": ["price", "volume", "rsi"]}
]
)
def execute(self, **kwargs) -> Dict[str, Any]:
"""
This connects to real market data APIs.
In my old system, this was a 500-line service.
"""
ticker = kwargs["ticker"].upper()
metrics = kwargs.get("metrics", ["price", "volume"])
# Using yfinance for real market data
import yfinance as yf
stock = yf.Ticker(ticker)
info = stock.info
result = {"ticker": ticker, "timestamp": datetime.now().isoformat()}
# Fetch requested metrics
metric_mapping = {
"price": lambda: info.get("currentPrice", stock.history(period="1d")['Close'].iloc[-1]),
"volume": lambda: info.get("volume", 0),
"pe_ratio": lambda: info.get("trailingPE", 0),
"market_cap": lambda: info.get("marketCap", 0),
"dividend_yield": lambda: info.get("dividendYield", 0) * 100,
"beta": lambda: info.get("beta", 1.0),
"rsi": lambda: self._calculate_rsi(stock),
"moving_avg_50": lambda: stock.history(period="50d")['Close'].mean()
}
for metric in metrics:
if metric in metric_mapping:
try:
result[metric] = metric_mapping[metric]()
except Exception as e:
result[metric] = f"Error: {str(e)}"
return result
class ToolParameter(BaseModel):
"""Actual parameter definition from project"""
name: str
type: str # "string", "number", "boolean", "object", "array"
description: str
required: bool = True
default: Any = None
enum: Optional[List[Any]] = None
class CalculatorTool(BaseTool):
"""Actual calculator implementation from project"""
def execute(self, **kwargs) -> float:
"""Safely evaluate mathematical expression"""
self.validate_input(**kwargs)
expression = kwargs["expression"]
precision = kwargs.get("precision", 2)
try:
# Security: Remove dangerous operations
safe_expr = expression.replace("__", "").replace("import", "")
# Define allowed functions (from actual code)
safe_dict = {
"abs": abs, "round": round, "min": min, "max": max,
"sum": sum, "pow": pow, "len": len
}
# Add math functions
import math
for name in ["sqrt", "log", "log10", "sin", "cos", "tan", "pi", "e"]:
if hasattr(math, name):
safe_dict[name] = getattr(math, name)
result = eval(safe_expr, {"__builtins__": {}}, safe_dict)
return round(result, precision)
except Exception as e:
raise ValueError(f"Calculation error: {e}")
Orchestrating Everything with LangGraph
This is where all the pieces come together. LangGraph allows coordinating multiple agents and tools in sophisticated workflows:
class FinancialAnalysisWorkflow:
"""
This workflow replaces what used to be multiple microservices,
message queues, and orchestration layers. It's beautiful.
"""
def _build_graph(self) -> StateGraph:
"""
Define how different analysis components work together
"""
workflow = StateGraph(AgentState)
# Add all our analysis nodes
workflow.add_node("collect_data", self.collect_market_data)
workflow.add_node("technical_analysis", self.run_technical_analysis)
workflow.add_node("fundamental_analysis", self.run_fundamental_analysis)
workflow.add_node("sentiment_analysis", self.analyze_sentiment)
workflow.add_node("options_analysis", self.analyze_options)
workflow.add_node("portfolio_optimization", self.optimize_portfolio)
workflow.add_node("rag_research", self.search_documents)
workflow.add_node("react_reasoning", self.reason_about_data)
workflow.add_node("generate_report", self.create_final_report)
# Entry point
workflow.set_entry_point("collect_data")
# Define the flow - some parallel, some sequential
workflow.add_edge("collect_data", "technical_analysis")
workflow.add_edge("collect_data", "fundamental_analysis")
workflow.add_edge("collect_data", "sentiment_analysis")
# These can run in parallel
workflow.add_conditional_edges(
"collect_data",
self.should_run_options, # Only if options are relevant
{
"yes": "options_analysis",
"no": "rag_research"
}
)
# Everything feeds into reasoning
workflow.add_edge(["technical_analysis", "fundamental_analysis",
"sentiment_analysis", "options_analysis"],
"react_reasoning")
# Reasoning leads to report
workflow.add_edge("react_reasoning", "generate_report")
# End
workflow.add_edge("generate_report", END)
return workflow
def analyze_stock_comprehensive(self, ticker: str, investment_amount: float = 10000):
"""
This single function replaces what used to be an entire team's
worth of manual analysis.
"""
initial_state = {
"ticker": ticker,
"investment_amount": investment_amount,
"timestamp": datetime.now(),
"messages": [],
"market_data": {},
"technical_indicators": {},
"fundamental_metrics": {},
"sentiment_scores": {},
"options_data": {},
"portfolio_recommendation": {},
"documents_retrieved": [],
"reasoning_trace": [],
"final_report": "",
"errors": []
}
# Run the workflow
try:
result = self.app.invoke(initial_state)
return self._format_comprehensive_report(result)
except Exception as e:
# Graceful degradation
return self._run_basic_analysis(ticker, investment_amount)
class WorkflowNodes:
"""Collection of workflow nodes from actual project"""
def collect_market_data(self, state: AgentState) -> AgentState:
"""Node: Collect market data using tools"""
print("? Collecting market data...")
ticker = state["ticker"]
try:
# Use actual stock data tool from project
tool = self.tool_registry.get_tool("stock_data")
market_data = tool.execute(
ticker=ticker,
metrics=["price", "volume", "market_cap", "pe_ratio", "52_week_high", "52_week_low"]
)
state["market_data"] = market_data
# Add message to history
state["messages"].append(
AIMessage(content=f"Collected market data for {ticker}")
)
except Exception as e:
state["error"] = f"Failed to collect market data: {str(e)}"
state["market_data"] = {}
return state
Here is a screenshot from the example showing workflow analysis:
Production Considerations: From Tutorial to Trading Floor
This tutorial demonstrates core concepts, but let me be clear – production deployment in financial services requires significantly more rigor. Having deployed similar systems in regulated environments, here’s what you’ll need to consider:
The Reality of Production Deployment
Production financial systems require months of parallel running and validation. In my experience, you’ll need:
class ProductionValidation:
"""
Always run new systems parallel to existing ones
"""
def validate_against_legacy(self, ticker: str):
# Run both systems
legacy_result = self.legacy_system.analyze(ticker)
agent_result = self.agent_system.analyze(ticker)
# Compare results
discrepancies = self.compare_results(legacy_result, agent_result)
# Log everything for audit
self.audit_log.record({
"ticker": ticker,
"timestamp": datetime.now(),
"legacy": legacy_result,
"agent": agent_result,
"discrepancies": discrepancies,
"approved": len(discrepancies) == 0
})
# Require human review for discrepancies
if discrepancies:
return self.escalate_to_human(discrepancies)
return agent_result
Integrating Traditional Financial Algorithms
While this tutorial uses general-purpose LLMs, production systems should combine AI with proven financial algorithms:
class HybridAnalyzer:
"""
Combine traditional algorithms with AI reasoning
"""
def analyze_options(self, ticker: str, strike: float, expiry: str):
# Use traditional Black-Scholes for pricing
traditional_price = self.black_scholes_pricer.calculate(
ticker, strike, expiry
)
# Use AI for market context
ai_context = self.agent.analyze_market_conditions(ticker)
# Combine both
if ai_context["volatility_regime"] == "high":
# AI detected unusual conditions, adjust model
adjusted_price = traditional_price * (1 + ai_context["vol_adjustment"])
confidence = "low - unusual market conditions"
else:
adjusted_price = traditional_price
confidence = "high - normal market conditions"
return {
"model_price": traditional_price,
"adjusted_price": adjusted_price,
"confidence": confidence,
"reasoning": ai_context["reasoning"]
}
Fitness Functions for Financial Accuracy
Financial data cannot tolerate hallucinations. Implement strict validation:
class FinancialFitnessValidator:
"""
Reject hallucinated or impossible financial data
"""
def validate_metrics(self, ticker: str, metrics: Dict):
validations = {
"pe_ratio": lambda x: -100 < x < 1000,
"price": lambda x: x > 0,
"market_cap": lambda x: x > 0,
"dividend_yield": lambda x: 0 <= x <= 20,
"revenue_growth": lambda x: -100 < x < 200
}
for metric, validator in validations.items():
if metric in metrics:
value = metrics[metric]
if not validator(value):
raise ValueError(f"Invalid {metric}: {value} for {ticker}")
# Cross-validation
if "pe_ratio" in metrics and "earnings" in metrics:
calculated_pe = metrics["price"] / metrics["earnings"]
if abs(calculated_pe - metrics["pe_ratio"]) > 1:
raise ValueError("P/E ratio doesn't match price/earnings")
return True
Leverage Your Existing Data
If you have years of financial data in databases, you don’t need to start over. Use RAG to make it searchable:
# Convert your SQL database to vector-searchable documents
existing_data = sql_query("SELECT * FROM financial_reports")
rag_engine.add_documents([
{"content": row.text, "metadata": {"date": row.date, "ticker": row.ticker}}
for row in existing_data
])
Human-in-the-Loop
No matter how sophisticated your agents become, financial decisions affecting real money require human oversight. Build it in from day one:
Confidence thresholds that trigger human review
Clear audit trails showing agent reasoning
Easy override mechanisms
Gradual automation based on proven accuracy
class HumanInTheLoopWorkflow:
"""
Ensure human review for critical decisions
"""
def execute_trade_recommendation(self, recommendation: Dict):
# Auto-approve only for low-risk, small trades
if (recommendation["risk_score"] < 0.3 and
recommendation["amount"] < 10000):
return self.execute(recommendation)
# Require human approval for everything else
approval_request = {
"recommendation": recommendation,
"agent_reasoning": recommendation["reasoning_trace"],
"confidence": recommendation["confidence_score"],
"risk_assessment": self.assess_risks(recommendation)
}
# Send to human reviewer
human_decision = self.request_human_review(approval_request)
if human_decision["approved"]:
return self.execute(recommendation)
else:
self.log_rejection(human_decision["reason"])
Cost Management and Budget Controls
During development, Ollama gives you free local inference. In production, costs add up quickly so you need to build proper controls for calculating cost of analysis:
GPT-4: ~$30 per million tokens
Claude-3: ~$20 per million tokens
Local Llama: Free but needs GPU infrastructure
class CostController:
"""
Prevent runway costs in production
"""
def __init__(self, daily_budget: float = 100.0):
self.daily_budget = daily_budget
self.costs_today = 0.0
self.cost_per_token = {
"gpt-4": 0.00003, # $0.03 per 1K tokens
"claude-3": 0.00002,
"llama-local": 0.0 # Free but has compute cost
}
def check_budget(self, estimated_tokens: int, model: str):
estimated_cost = estimated_tokens * self.cost_per_token.get(model, 0)
if self.costs_today + estimated_cost > self.daily_budget:
# Switch to local model or cache
return "use_local_model"
return "proceed"
def track_usage(self, tokens_used: int, model: str):
cost = tokens_used * self.cost_per_token.get(model, 0)
self.costs_today += cost
# Alert if approaching limit
if self.costs_today > self.daily_budget * 0.8:
self.send_alert(f"80% of daily budget used: ${self.costs_today:.2f}")
Caching Is Essential
Caching is crucial for both performance and cost effectiveness when running expensive analysis using LLMs.
class CachedRAGEngine(RAGEngine):
"""
Caching reduced our costs by 70% and improved response time by 5x
"""
def __init__(self):
super().__init__()
self.cache = Redis(host='localhost', port=6379, db=0)
self.cache_ttl = 3600 # 1 hour for financial data
def retrieve_with_cache(self, query: str, k: int = 5):
# Create cache key from query
cache_key = f"rag:{hashlib.md5(query.encode()).hexdigest()}"
# Check cache first
cached = self.cache.get(cache_key)
if cached:
return json.loads(cached)
# If not cached, retrieve and cache
docs = self.vector_store.similarity_search(query, k=k)
# Cache the results
self.cache.setex(
cache_key,
self.cache_ttl,
json.dumps([doc.to_dict() for doc in docs])
)
return docs
Fallback Strategies
A Cascading Fallback can help execute a task using a sequence of operations, ordered from the most preferred (highest quality/cost) to the least preferred (lowest quality/safest default).
class ResilientAgent:
"""
Production agents need multiple fallback options
"""
def analyze_with_fallbacks(self, ticker: str):
strategies = [
("primary", self.run_full_analysis),
("fallback_1", self.run_simplified_analysis),
("fallback_2", self.run_basic_analysis),
("emergency", self.return_cached_or_default)
]
for strategy_name, strategy_func in strategies:
try:
result = strategy_func(ticker)
result["strategy_used"] = strategy_name
return result
except Exception as e:
logger.warning(f"Strategy {strategy_name} failed: {e}")
continue
return {"error": "All strategies failed", "ticker": ticker}
Observability and Monitoring
Track token usage, latency, accuracy, and costs immediately. What you don’t measure, you can’t improve.
class ObservableWorkflow:
"""
You need to know what your AI is doing in production
"""
def __init__(self):
self.metrics = PrometheusMetrics()
self.tracer = JaegerTracer()
def execute_with_observability(self, state: AgentState):
with self.tracer.start_span("workflow_execution") as span:
span.set_tag("ticker", state["ticker"])
# Track token usage
tokens_start = self.llm.get_num_tokens(state)
# Execute workflow
result = self.workflow.invoke(state)
# Record metrics
tokens_used = self.llm.get_num_tokens(result) - tokens_start
self.metrics.record_tokens(tokens_used)
self.metrics.record_latency(span.duration)
# Log for debugging
logger.info(f"Workflow completed", extra={
"ticker": state["ticker"],
"tokens": tokens_used,
"duration": span.duration,
"strategy": result.get("strategy_used", "primary")
})
return result
Closing Thoughts
This tutorial demonstrates how agentic AI transforms financial analysis from rigid pipelines to adaptive, thinking systems. The combination of ReAct reasoning, RAG grounding, tool use, and workflow orchestration creates capabilities that surpass traditional approaches in flexibility and ease of development.
Start Simple, Build Incrementally:
Week 1: Basic ReAct agent to understand reasoning loops
Week 2: Add tools for external capabilities
Week 3: Implement RAG to ground responses in real data
Week 4: Orchestrate with workflows
Develop everything locally with Ollama first – it’s free and private
The point of agentic AI is automation. Here’s the pragmatic approach:
Automate in Tiers:
Tier 1 (Fully Automated): Data collection, technical calculations, report generation
Instead of permanent human-in-the-loop, use RL to train agents that learn from feedback:
class ReinforcementLearningLoop:
"""
Gradually reduce human involvement through learning
"""
def ai_based_reinforcement(self, decision, outcome):
"""AI learns from market outcomes directly"""
# Did the prediction match reality?
reward = self.calculate_reward(decision, outcome)
if decision["action"] == "buy" and outcome["price_change"] > 0.02:
reward = 1.0 # Good decision
elif decision["action"] == "hold" and abs(outcome["price_change"]) < 0.01:
reward = 0.5 # Correct to avoid volatility
else:
reward = -0.5 # Poor decision
# Update agent weights/prompts based on reward
self.agent.update_policy(decision["context"], reward)
def human_feedback_learning(self, decision, human_override=None):
"""Learn from human corrections when they occur"""
if human_override:
# Human disagreed - strong learning signal
self.agent.record_correction(
agent_decision=decision,
human_decision=human_override,
weight=10.0 # Human feedback weighted heavily
)
else:
# Human agreed (implicitly by not overriding)
self.agent.reinforce_decision(decision, weight=1.0)
def adaptive_automation_threshold(self):
"""Dynamically adjust when human review is needed"""
recent_accuracy = self.get_recent_accuracy(days=30)
if recent_accuracy > 0.95:
self.confidence_threshold *= 0.9 # Require less human review
elif recent_accuracy < 0.85:
self.confidence_threshold *= 1.1 # Require more human review
return self.confidence_threshold
This approach reduces human involvement over time: use that feedback to train, gradually automate decisions where the agent consistently agrees with humans, and only escalate novel situations or low-confidence decisions.